Statistics of the Video Comprehension Skills
This section presents detailed statistics for skills categorized under the Video modality.
The video-related tasks are grouped based on their functional objectives and evaluated using standard multimodal benchmarks.
V-C-1 Video Question Answering (Video QA)
This cluster focuses on understanding and interpreting visual content from videos in order to answer domain-specific questions.
- Agriculture Video Question Answering
This task focuses on answering agriculture-related questions grounded in video observations.
- Abbreviation: Agri-Vid QA
- Domain: Culture
- Capability: Content Recognition, Affective Analysis
- Data Source: WildQA
- Number: 75
- SoTA Specialist: Qwen-2-VL-72B
- Metrics: BLEU-1
- Geography Video Question Answering
This task focuses on answering geography-related questions grounded in video observations.
- Abbreviation: Geo-Vid QA
- Domain: General
- Capability: Content Recognition, Affective Analysis
- Data Source: WildQA
- Number: 81
- SoTA Specialist: Qwen-2-VL-72B
- Metrics: BLEU-1
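Both Video QA tasks above report BLEU-1. As an illustration of what that metric measures, the following is a minimal sketch of sentence-level BLEU-1 (clipped unigram precision multiplied by a brevity penalty). It assumes lowercased whitespace tokenization and no smoothing; the function name `bleu1` is hypothetical, and the benchmark's actual scoring script may tokenize or aggregate differently.

```python
import math
from collections import Counter


def bleu1(candidate: str, references: list[str]) -> float:
    """Sentence-level BLEU-1 (illustrative sketch, not the benchmark's scorer).

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # For each token, the maximum count observed in any single reference.
    max_ref_counts = Counter()
    for ref in refs:
        for tok, n in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], n)

    # Clipped unigram precision: a candidate token is credited at most as
    # many times as it appears in the most generous reference.
    cand_counts = Counter(cand)
    clipped = sum(min(n, max_ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty against the reference length closest to the candidate
    # length (ties broken toward the shorter reference).
    ref_len = min((len(r) for r in refs),
                  key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision
```

For example, `bleu1("the the the", ["the cat"])` credits only one of the three repeated tokens, giving a precision of 1/3 with no brevity penalty.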
V-C-2 Video Object Recognition (Vid-Obj Recog)
This cluster evaluates the ability of models to identify and classify objects in dynamic scenes.
- Pets Video Recognition
This task evaluates the ability of models to identify and classify pets in dynamic video scenes.
- Abbreviation: Pets Video Recog
- Domain: General
- Capability: Content Recognition
- Data Source: MMBench-Video
- Number: 87
- SoTA Specialist: Aria
- Metrics: Acc
- Science Video Recognition
This task focuses on recognizing scientific objects or phenomena in educational and knowledge-rich videos.
- Abbreviation: Sci-Vid Recog
- Domain: Knowledge
- Capability: Content Recognition
- Data Source: MMBench-Video
- Number: 100
- SoTA Specialist: Aria
- Metrics: Acc
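Both recognition tasks above report accuracy (Acc). A minimal sketch of that metric follows, assuming each prediction is compared with its gold label by exact match after lowercasing and whitespace normalization. The matching rule and the names `normalize` and `accuracy` are assumptions made for illustration; the section does not specify how answers are judged.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that match their label after normalization."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(normalize(p) == normalize(g)
                  for p, g in zip(predictions, labels))
    return correct / len(labels)
```

Under this rule, `accuracy(["Dog", "cat"], ["dog", "bird"])` counts one of two answers as correct.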