Dataset Statistics

Statistics of the Skills

The following table summarizes the number of skills, tasks, and instances across each modality and type.

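| Modality | Type | #Skill | #Task | #Instance |
|----------|------|--------|-------|-----------|
| Image    | Comp | 40     | 271*  | 124,880*  |
| Image    | Gen  | 15     | 45    | 26,610    |
| Image    | Sum  | 55     | 316   | 151,490   |
| Video    | Comp | 20     | 126   | 44,442    |
| Video    | Gen  | 6      | 46    | 16,430    |
| Video    | Sum  | 26     | 170   | 60,872    |
| Audio    | Comp | 9      | 24    | 11,247    |
| Audio    | Gen  | 11     | 20    | 9,516     |
| Audio    | Sum  | 20     | 44    | 20,763    |
| 3D       | Comp | 13     | 30    | 23,705    |
| 3D       | Gen  | 9      | 22    | 10,614    |
| 3D       | Sum  | 22     | 52    | 34,319    |
| Language | -    | 22     | 118   | 58,432    |
| TOTAL    | -    | 145    | 702   | 325,876   |

* Value not listed in the source table. It is derived as Image Sum minus Image Gen, which also agrees with the TOTAL row, on the assumption that the unlisted cell is Image Comp.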
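The arithmetic behind the table can be checked with a short script. The following is a minimal sketch, not part of the dataset tooling: the names `TABLE` and `reconcile` are hypothetical, the numbers are transcribed from the statistics above, and the two Image cells that are not listed are assumed to be the Comp entries and are marked `None` so they can be derived from the stated sums.

```python
# Counts transcribed from the statistics table; each pair is (Comp, Gen).
# None marks a cell that is not listed (ASSUMED to be Image Comp).
TABLE = {
    "#Skill": {
        "cells": {"Image": (40, 15), "Video": (20, 6),
                  "Audio": (9, 11), "3D": (13, 9)},
        "sums": {"Image": 55, "Video": 26, "Audio": 20, "3D": 22},
        "language": 22,
        "total": 145,
    },
    "#Task": {
        "cells": {"Image": (None, 45), "Video": (126, 46),
                  "Audio": (24, 20), "3D": (30, 22)},
        "sums": {"Image": 316, "Video": 170, "Audio": 44, "3D": 52},
        "language": 118,
        "total": 702,
    },
    "#Instance": {
        "cells": {"Image": (None, 26610), "Video": (44442, 16430),
                  "Audio": (11247, 9516), "3D": (23705, 10614)},
        "sums": {"Image": 151490, "Video": 60872, "Audio": 20763, "3D": 34319},
        "language": 58432,
        "total": 325876,
    },
}


def reconcile(row):
    """Fill unlisted Comp cells from the stated sum and cross-check the rest.

    Returns (filled_cells, mismatches, total_ok), where mismatches lists
    (modality, comp_plus_gen, stated_sum) for every disagreement.
    """
    filled = {}
    mismatches = []
    for modality, (comp, gen) in row["cells"].items():
        stated = row["sums"][modality]
        if comp is None:
            # Derive the unlisted cell as Sum minus Gen.
            comp = stated - gen
        elif comp + gen != stated:
            mismatches.append((modality, comp + gen, stated))
        filled[modality] = (comp, gen)
    # The grand total is built from the individual cells, not the stated sums.
    grand = sum(c + g for c, g in filled.values()) + row["language"]
    return filled, mismatches, grand == row["total"]


for name, row in TABLE.items():
    filled, mismatches, total_ok = reconcile(row)
    print(name, filled["Image"], mismatches, total_ok)
```

Under these assumptions the derived Image Comp values are 271 tasks and 124,880 instances, and all three TOTAL entries match the cell-level sums. The one disagreement the check reports is the Video #Task row, where the listed cells add up to 172 while the stated Sum is 170; the script cannot tell which of the two figures is the transcription error.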

Image

This section provides a detailed breakdown of the skills covered under the Image modality. It includes a wide range of vision tasks such as image classification, object detection, image captioning, and image question answering.

Video

This section provides a detailed breakdown of the skills covered under the Video modality. It covers temporal vision tasks like action recognition, video captioning, event localization, and video question answering.

Audio

This section provides a detailed breakdown of the skills covered under the Audio modality. It includes tasks such as audio classification, speech recognition, sound event detection, and audio captioning.

3D

This section provides a detailed breakdown of the skills covered under the 3D modality. It spans spatial understanding tasks such as 3D object detection, 3D captioning, and point cloud segmentation.

Language

This section provides a detailed breakdown of the skills covered under the Language modality. It includes traditional NLP tasks such as text classification, question answering, and named entity recognition.