Path to Multimodal Generalist

Leaderboard under Scope C: Comprehension/Generation Hero 💪

Scope-C leaderboards are categorized by comprehension vs. generation tasks within each modality. Many MLLMs are not equally capable of comprehension and generation, even within the same modality. GPT-4o, for example, excels at image understanding but remains relatively weak at image generation. Scope-C therefore splits the tasks within each modality into a comprehension group and a generation group.

Scope-C consists of 2 paradigms × 4 modalities = 8 leaderboards in total. Because each leaderboard covers fewer tasks and task types, Scope-C is moderately easier and less costly than Scope-A and Scope-B. Because it avoids cross-paradigm scenarios, models under Scope-C can achieve at most General-Level 3. Nevertheless, Scope-C is a valuable setting for evaluating a model's generalization ability across task types. It also highlights models that are particularly strong in a specific modality-paradigm pair, such as GPT-4o and Qwen-VL, which are SoTA in image understanding.
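As a minimal sketch of the 2 × 4 = 8 breakdown, the snippet below enumerates every modality-paradigm pair. The modality and paradigm names come from the sub-board list on this page; the constant and function names are hypothetical and are not part of any official toolkit.

```python
from itertools import product

# Names taken from the Scope-C sub-board list on this page.
MODALITIES = ["Image", "Video", "3D", "Audio"]
PARADIGMS = ["Comprehension", "Generation"]


def scope_c_leaderboards():
    """Return one leaderboard name per (modality, paradigm) pair.

    Hypothetical helper for illustration only.
    """
    return [f"{modality} {paradigm}" for modality, paradigm in product(MODALITIES, PARADIGMS)]


boards = scope_c_leaderboards()
for name in boards:
    print(name)
```

Iterating modalities in the outer position reproduces the order in which the sub-boards are listed, from Image Comprehension through Audio Generation.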

Submit evaluation results of your model on the leaderboard: Submit

Download the corresponding evaluation subset for the Image Comprehension group (ID: #S-C-IC).

Choose a sub-board:

Image Comprehension

Image Generation

Video Comprehension

Video Generation

3D Comprehension

3D Generation

Audio Comprehension

Audio Generation

📌 Go to the [All-Tasks] page to find all tasks included in the Image Comprehension group.

Comprehension in Image Hero at Level-2 Ranking in

Comprehension in Image Hero at Level-3 Ranking in