Dataset Statistics
Statistics of the Skills
The following table summarizes the number of skills, tasks, and instances across each modality and type.
| | | Image Comp | Image Gen | Video Comp | Video Gen | Audio Comp | Audio Gen | 3D Comp | 3D Gen | Language | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #Skill | Single | 40 | 15 | 20 | 6 | 9 | 11 | 13 | 9 | 22 | 145 |
| | Sum (Comp + Gen) | 55 | | 26 | | 20 | | 22 | | | |
| #Task | Single | | 45 | 126 | 46 | 24 | 20 | 30 | 22 | 118 | 702 |
| | Sum (Comp + Gen) | 316 | | 170 | | 44 | | 52 | | | |
| #Instance | Single | | 26,610 | 44,442 | 16,430 | 11,247 | 9,516 | 23,705 | 10,614 | 58,432 | 325,876 |
| | Sum (Comp + Gen) | 151,490 | | 60,872 | | 20,763 | | 34,319 | | | |
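
The #Task and #Instance rows carry one fewer Single entry than the #Skill row. As a rough sketch of how the table could be cross-checked, the Python snippet below assumes that the absent entry is the Image Comp cell and that each Sum is Comp plus Gen. Both are assumptions made for illustration, not statements from the dataset authors, and the names `single`, `sums`, `language`, `total`, and `audit` are hypothetical.

```python
# Counts transcribed from the table above as (Comp, Gen) pairs.
# None marks an entry that is absent from the table; treating the absent
# entry as Image Comp is an assumption.
single = {
    "#Skill": {"Image": (40, 15), "Video": (20, 6),
               "Audio": (9, 11), "3D": (13, 9)},
    "#Task": {"Image": (None, 45), "Video": (126, 46),
              "Audio": (24, 20), "3D": (30, 22)},
    "#Instance": {"Image": (None, 26610), "Video": (44442, 16430),
                  "Audio": (11247, 9516), "3D": (23705, 10614)},
}
sums = {
    "#Skill": {"Image": 55, "Video": 26, "Audio": 20, "3D": 22},
    "#Task": {"Image": 316, "Video": 170, "Audio": 44, "3D": 52},
    "#Instance": {"Image": 151490, "Video": 60872,
                  "Audio": 20763, "3D": 34319},
}
language = {"#Skill": 22, "#Task": 118, "#Instance": 58432}
total = {"#Skill": 145, "#Task": 702, "#Instance": 325876}


def audit(metric):
    """Cross-check one metric's Single row against its Sum row and TOTAL."""
    implied = {}      # Comp values implied by Sum - Gen where Comp is absent
    mismatches = []   # modalities where Comp + Gen differs from the listed Sum
    running = language[metric]  # Language has no Comp/Gen split
    for modality, (comp, gen) in single[metric].items():
        listed_sum = sums[metric][modality]
        if comp is None:
            # Assumption: Sum = Comp + Gen, so the absent value is Sum - Gen.
            comp = listed_sum - gen
            implied[modality] = comp
        elif comp + gen != listed_sum:
            mismatches.append(modality)
        running += comp + gen
    return {
        "implied": implied,
        "mismatches": mismatches,
        "total_matches": running == total[metric],
    }


if __name__ == "__main__":
    for metric in single:
        print(metric, audit(metric))
```

Under these assumptions, the absent Image entry would be 271 for #Task and 124,880 for #Instance, and the Single values then add up to the listed TOTAL in all three rows. The check also flags the Video #Task cell: 126 + 46 is 172, while the listed Sum is 170.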
Image
This section provides a detailed breakdown of the skills covered under the Image modality.
It includes a wide range of vision tasks such as image classification, object detection, image captioning, and image question answering.
Video
This section provides a detailed breakdown of the skills covered under the Video modality.
It covers temporal vision tasks such as action recognition, video captioning, event localization, and video question answering.
Audio
This section provides a detailed breakdown of the skills covered under the Audio modality.
It includes tasks such as audio classification, speech recognition, sound event detection, and audio captioning.
3D
This section provides a detailed breakdown of the skills covered under the 3D modality.
It spans spatial understanding tasks such as 3D object detection, 3D captioning, and point cloud segmentation.
Language
This section provides a detailed breakdown of the skills covered under the Language modality.
It includes traditional NLP tasks such as text classification, question answering, and named entity recognition.