Dataset Statistics
Statistics of the Skills
The following table summarizes the number of skills, tasks, and instances for each modality, split by type: comprehension (Comp) and generation (Gen). The Sum rows give the per-modality totals.
| | | Image Comp | Image Gen | Video Comp | Video Gen | Audio Comp | Audio Gen | 3D Comp | 3D Gen | Language | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #Skill | Single | 40 | 15 | 20 | 6 | 9 | 11 | 13 | 9 | 22 | 145 |
| | Sum | 55 | | 26 | | 20 | | 22 | | | |
| #Task | Single | 271 | 45 | 126 | 46 | 24 | 20 | 30 | 22 | 118 | 702 |
| | Sum | 316 | | 170 | | 44 | | 52 | | | |
| #Instance | Single | 124,880 | 26,610 | 44,442 | 16,430 | 11,247 | 9,516 | 23,705 | 10,614 | 58,432 | 325,876 |
| | Sum | 151,490 | | 60,872 | | 20,763 | | 34,319 | | | |
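As a quick illustration of how the Sum and TOTAL entries relate to the Single entries, the sketch below recomputes them for the #Skill row. The counts are copied from the table; the dictionary layout and the helper name `modality_sums` are assumptions made for this example, not part of the dataset tooling.

```python
# Skill counts copied from the "#Skill / Single" row of the table.
# The nested-dict layout is an illustrative choice, not a dataset format.
skills = {
    "Image": {"Comp": 40, "Gen": 15},
    "Video": {"Comp": 20, "Gen": 6},
    "Audio": {"Comp": 9, "Gen": 11},
    "3D": {"Comp": 13, "Gen": 9},
    "Language": {"All": 22},  # Language is not split into Comp/Gen in the table
}


def modality_sums(counts):
    """Add up the per-type counts of each modality (hypothetical helper)."""
    return {modality: sum(parts.values()) for modality, parts in counts.items()}


sums = modality_sums(skills)
total = sum(sums.values())

print(sums)   # per-modality sums, comparable to the "Sum" row
print(total)  # grand total, comparable to the TOTAL column
```

The same check can be repeated for the #Task and #Instance rows by swapping in their counts.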
Image
This section provides a detailed breakdown of the skills covered under the Image modality.
It includes a wide range of vision tasks such as image classification, object detection, image captioning, and image question answering.
Video
This section provides a detailed breakdown of the skills covered under the Video modality.
It covers temporal vision tasks like action recognition, video captioning, event localization, and video question answering.
Audio
This section provides a detailed breakdown of the skills covered under the Audio modality.
It includes tasks such as audio classification, speech recognition, sound event detection, and audio captioning.
3D
This section provides a detailed breakdown of the skills covered under the 3D modality.
It spans spatial understanding tasks such as 3D object detection, 3D captioning, and point cloud segmentation.
Language
This section provides a detailed breakdown of the skills covered under the Language modality.
It includes traditional NLP tasks such as text classification, question answering, and named entity recognition.