Statistics of the Video Comprehension Skills

This section presents detailed statistics for skills categorized under the Video modality. The video-related tasks are grouped based on their functional objectives and evaluated using standard multimodal benchmarks.

V-C-1 Video Question Answering (Video QA)

This cluster focuses on understanding and interpreting visual content from videos in order to answer domain-specific questions.

Agriculture Video Question Answering

This task focuses on answering agriculture-related questions grounded in video observations.

Abbreviation: Agri-Vid QA
Domain: Culture
Capability: Content Recognition, Affective Analysis
Data Source: WildQA
Number: 75
SoTA Specialist: Qwen-2-VL-72B
Metrics: BLEU-1

Geography Video Question Answering

This task focuses on answering geography-related questions grounded in video observations.

Abbreviation: Geo-Vid QA
Domain: General
Capability: Content Recognition, Affective Analysis
Data Source: WildQA
Number: 81
SoTA Specialist: Qwen-2-VL-72B
Metrics: BLEU-1
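
Both Video QA tasks above report BLEU-1. As a rough illustration of what that metric measures, the following is a minimal sketch of sentence-level BLEU-1 (clipped unigram precision multiplied by a brevity penalty) against a single reference answer. The function name `bleu1`, the lowercase whitespace tokenization, and the single-reference setup are assumptions made for this example; the benchmark's actual scoring script may tokenize, smooth, or aggregate differently.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1 against one reference (illustrative sketch)."""
    # Assumption: lowercase + whitespace tokenization; real pipelines may differ.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens:
        return 0.0

    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(ref_tokens)

    # Clipped unigram matches: a token is credited at most as many times
    # as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(cand_tokens)

    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(cand_tokens), len(ref_tokens)
    brevity_penalty = 1.0 if c >= r else math.exp(1.0 - r / c)

    return brevity_penalty * precision
```

Under these assumptions, a candidate identical to the reference scores 1.0, while a candidate that repeats one reference word is held down by the clipping step rather than rewarded for the repetition.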

V-C-2 Video Object Recognition (Vid-Obj Recog)

This cluster evaluates the ability of models to identify and classify objects in dynamic scenes.

Pets Video Recognition

This task evaluates the ability of models to identify and classify pets appearing in dynamic video scenes.

Abbreviation: Pets Video Recog
Domain: General
Capability: Content Recognition
Data Source: MMBench-Video
Number: 87
SoTA Specialist: Aria
Metrics: Acc

Science Video Recognition

This task focuses on recognizing scientific objects or phenomena in educational and knowledge-rich videos.

Abbreviation: Sci-Vid Recog
Domain: Knowledge
Capability: Content Recognition
Data Source: MMBench-Video
Number: 100
SoTA Specialist: Aria
Metrics: Acc
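
The recognition tasks above report accuracy (Acc). A minimal sketch of that metric is given below, assuming each prediction is judged correct by a case-insensitive exact match with the ground-truth answer. The function name `accuracy` and the matching rule are assumptions for illustration only; the benchmark may decide correctness differently (for example, with a model-based judge).

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions matching the ground truth (illustrative sketch)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0

    # Assumption: a prediction is correct if it matches the answer after
    # trimming whitespace and lowercasing.
    correct = sum(
        pred.strip().lower() == ans.strip().lower()
        for pred, ans in zip(predictions, answers)
    )
    return correct / len(answers)
```

With this definition, the score for a task is the number of correctly answered samples divided by that task's sample count listed under Number.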