Leaderboard under Scope-D: Skill-specific Hero 🛠️
In practice, most multimodal generalists are evaluated on a smaller subset of data and tasks; this is also the common setting for most existing benchmarks assessing MLLM capabilities. A model often excels at a specific type of task (i.e., a particular skill), and a more focused benchmark setting allows such skill-specialized models to stand out; we refer to these as skill-specific generalists. For example, GPT-4V, GPT-4o, and Gemini perform exceptionally well on VQA and captioning tasks, LISA and NExT-Chat are highly proficient at detection, and 3D-VisTA specializes in 3D QA. To support this, we introduce Scope-D: a collection of fine-grained, skill-specific (i.e., task-cluster-specific) leaderboards tailored to skill-specific partial generalists.
Building upon the 8 groups defined in Scope-C, Scope-D further divides each group into multiple skill clusters (i.e., task clusters) based on task similarity. Each skill cluster contains tasks with closely related evaluation characteristics, ensuring consistent capability assessment. Because Scope-D avoids cross-paradigm scenarios, models evaluated under it can achieve at most General-Level 3. Since each Scope-D leaderboard targets only a small number of tasks, it carries the lowest evaluation cost and participation barrier of all scopes; accordingly, Scope-D also includes the largest number of leaderboards. This type of leaderboard helps identify model strengths and areas of specialization, and encourages progressive development toward broader leaderboard participation.
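
To make the two-level structure concrete, below is a minimal Python sketch of how the Scope-D hierarchy nests inside the Scope-C groups. The layout (a nested mapping from group to skill cluster to tasks) and all cluster/task names other than Classification & Recognition under Image Comprehension are illustrative assumptions, not the official taxonomy; see the [All-Tasks] page for the actual task lists.

```python
# Sketch of the Scope-D hierarchy, assuming a nested-dict layout:
# Scope-C group -> skill cluster (one Scope-D leaderboard each) -> tasks.
# Cluster/task names marked below are hypothetical placeholders.
from typing import Dict, List

ScopeD = Dict[str, Dict[str, List[str]]]

scope_d: ScopeD = {
    "Image Comprehension": {
        # Named on this page; task list here is a hypothetical excerpt.
        "Classification & Recognition": ["image classification", "object recognition"],
        # ... further skill clusters of this group
    },
    "Image Generation": {
        # ... skill clusters of this group (hypothetical)
    },
    # ... remaining Scope-C groups: Video/3D/Audio x Comprehension/Generation
}

def leaderboards(hierarchy: ScopeD) -> List[str]:
    """Enumerate one Scope-D leaderboard per skill cluster."""
    return [
        f"{group} / {cluster}"
        for group, clusters in hierarchy.items()
        for cluster in clusters
    ]

print(leaderboards(scope_d))
# ['Image Comprehension / Classification & Recognition']
```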

Choose a sub-board:
Image Comprehension
Image Generation
Video Comprehension
Video Generation
3D Comprehension
3D Generation
Audio Comprehension
Audio Generation
📌 Go to the [All-Tasks] page to find all tasks included in each skill cluster, e.g., Classification & Recognition under Image Comprehension.