Dataset Statistics 

Statistics of the Skills 

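The following table summarizes the number of skills, tasks, and instances for each modality, split by paradigm: comprehension (Comp) and generation (Gen). "Single" rows give the count for each column; "Sum" rows add the Comp and Gen counts within a modality.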

                   Image              Video              Audio              3D                 Language  TOTAL
                   Comp     Gen       Comp     Gen       Comp     Gen       Comp     Gen
#Skill     Single  40       15        20       6         9        11        13       9         22        145
#Skill     Sum     55                 26                 20                 22
#Task      Single  271      45        126      46        24       20        30       22        118       702
#Task      Sum     316                172                44                 52
#Instance  Single  124,880  26,610    44,442   16,430    11,247   9,516     23,705   10,614    58,432    325,876
#Instance  Sum     151,490            60,872             20,763             34,319
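
The Sum rows and the TOTAL column follow from the Single rows by simple addition. As a minimal sketch, the check below recomputes them for the #Skill row, using the per-column counts exactly as listed in the table. The names `skills` and `modality_sums` are illustrative and not part of any released tooling; Language is treated as a single column because the table gives it no Comp/Gen split.

```python
# Per-column skill counts copied from the "#Skill / Single" row of the table.
# Language has no Comp/Gen split in the table, so it is stored as one column.
skills = {
    "Image": {"Comp": 40, "Gen": 15},
    "Video": {"Comp": 20, "Gen": 6},
    "Audio": {"Comp": 9, "Gen": 11},
    "3D": {"Comp": 13, "Gen": 9},
    "Language": {"Language": 22},
}


def modality_sums(counts):
    """Add up the columns of each modality, as in the table's "Sum" rows."""
    return {modality: sum(cols.values()) for modality, cols in counts.items()}


sums = modality_sums(skills)   # per-modality totals
total = sum(sums.values())     # grand total across all modalities

print(sums)   # {'Image': 55, 'Video': 26, 'Audio': 20, '3D': 22, 'Language': 22}
print(total)  # 145
```

The same check can be applied to the #Task and #Instance rows by swapping in their per-column counts.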

Image

This section provides a detailed breakdown of the skills covered under the Image modality. It includes a wide range of vision tasks such as image classification, object detection, image captioning, and image question answering.

Video 

This section provides a detailed breakdown of the skills covered under the Video modality. It covers temporal vision tasks like action recognition, video captioning, event localization, and video question answering.

Audio 

This section provides a detailed breakdown of the skills covered under the Audio modality. It includes tasks such as audio classification, speech recognition, sound event detection, and audio captioning.

3D 

This section provides a detailed breakdown of the skills covered under the 3D modality. It spans spatial understanding tasks such as 3D object detection, 3D captioning, and point cloud segmentation.

Language 

This section provides a detailed breakdown of the skills covered under the Language modality. It includes traditional NLP tasks such as text classification, question answering, and named entity recognition.

Statistics of the Language Skills
