Statistics of the Video Comprehension Skills 

This section presents detailed statistics for skills categorized under the Video modality. The video-related tasks are grouped based on their functional objectives and evaluated using standard multimodal benchmarks.

V-C-1 Video Question Answering (Video QA)

This cluster focuses on understanding and interpreting visual content from videos in order to answer domain-specific questions.

Agriculture Video Question Answering

This task focuses on answering agriculture-related questions grounded in video observations.

Abbreviation:: Agri-Vid QA
Domain:: Culture
Capability:: Content Recognition, Affective Analysis
Data Source:: WildQA
Number:: 75
SoTA Specialist:: Qwen-2-VL-72B
Metrics:: BLEU-1

Geography Video Question Answering

This task focuses on answering geography-related questions grounded in video observations.

Abbreviation:: Geo-Vid QA
Domain:: General
Capability:: Content Recognition, Affective Analysis
Data Source:: WildQA
Number:: 81
SoTA Specialist:: Qwen-2-VL-72B
Metrics:: BLEU-1
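
Both tasks in this cluster report BLEU-1, which measures unigram overlap between a generated answer and a reference answer. The snippet below is a minimal sketch of that metric for a single candidate/reference pair, assuming lowercased whitespace tokenization and the standard brevity penalty. The function name `bleu1` and the tokenization choice are illustrative assumptions; the exact scoring script used for these tasks is not specified here and may differ (for example in tokenization, smoothing, or corpus-level aggregation).

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1 for one candidate and one reference.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace. Real evaluation scripts may tokenize differently.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens:
        return 0.0

    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(ref_tokens)

    # Clipped unigram matches: each candidate token is credited at most
    # as many times as it occurs in the reference.
    matches = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = matches / len(cand_tokens)

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref_tokens) / len(cand_tokens))

    return brevity_penalty * precision


if __name__ == "__main__":
    # Hypothetical answer pair, for illustration only.
    print(bleu1("the field is irrigated by canals",
                "the field is irrigated with canals"))
```

A task-level score would typically average this value over all question-answer pairs, although corpus-level BLEU aggregates counts before dividing, so the two conventions can give slightly different numbers.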

V-C-2 Video Object Recognition (Vid-Obj Recog)

This cluster evaluates the ability of models to identify and classify objects in dynamic scenes.

Pets Video Recognition

This task evaluates the ability of models to identify and classify pets appearing in videos.

Abbreviation:: Pets Video Recog
Domain:: General
Capability:: Content Recognition
Data Source:: MMBench-Video
Number:: 87
SoTA Specialist:: Aria
Metrics:: Acc

Science Video Recognition

This task focuses on recognizing scientific objects or phenomena in educational and knowledge-rich videos.

Abbreviation:: Sci-Vid Recog
Domain:: Knowledge
Capability:: Content Recognition
Data Source:: MMBench-Video
Number:: 100
SoTA Specialist:: Aria
Metrics:: Acc
