Leaderboard under Scope-D: Skill-specific Hero 🛠️
In practice, most multimodal generalists are evaluated on a smaller subset of data and tasks; this is also the common setting for most existing benchmarks assessing MLLM capabilities. A model often excels at a specific type of task (i.e., a particular skill), and a more focused benchmark setting allows such skill-specialized models to stand out; we refer to these as skill-specific generalists. For example, GPT-4V, GPT-4o, and Gemini perform exceptionally well on VQA and captioning tasks, LISA and NExT-Chat are highly proficient at detection, and 3D-VisTA specializes in 3D QA. To support this, we introduce Scope-D: a collection of fine-grained, skill-specific (i.e., task-cluster-specific) leaderboards tailored to skill-specific partial generalists.
Building upon the 8 groups defined in Scope-C, Scope-D further divides each group into multiple skill clusters (i.e., task clusters) based on task similarity. Each skill cluster contains tasks with closely related evaluation characteristics, ensuring consistent capability assessment. Because Scope-D avoids cross-paradigm scenarios, models evaluated under it can achieve at most General-Level 3. Since each Scope-D leaderboard targets only a small number of tasks, it carries the lowest evaluation cost and participation barrier of all scopes; accordingly, Scope-D also includes the largest number of leaderboards. This type of leaderboard helps identify model strengths and areas of specialization, and encourages progressive development toward broader leaderboard participation.
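
To make the two-level structure concrete, below is a minimal Python sketch of how the Scope-D hierarchy nests inside the Scope-C groups. The layout (a nested mapping from group to skill cluster to tasks) and all cluster/task names other than Classification & Recognition under Image Comprehension are illustrative assumptions, not the official taxonomy; see the [All-Tasks] page for the actual task lists.

```python
# Sketch of the Scope-D hierarchy, assuming a nested-dict layout:
# Scope-C group -> skill cluster (one Scope-D leaderboard each) -> tasks.
# Cluster/task names marked below are hypothetical placeholders.
from typing import Dict, List

ScopeD = Dict[str, Dict[str, List[str]]]

scope_d: ScopeD = {
    "Image Comprehension": {
        # Named on this page; task list here is a hypothetical excerpt.
        "Classification & Recognition": ["image classification", "object recognition"],
        # ... further skill clusters of this group
    },
    "Image Generation": {
        # ... skill clusters of this group (hypothetical)
    },
    # ... remaining Scope-C groups: Video/3D/Audio x Comprehension/Generation
}

def leaderboards(hierarchy: ScopeD) -> List[str]:
    """Enumerate one Scope-D leaderboard per skill cluster."""
    return [
        f"{group} / {cluster}"
        for group, clusters in hierarchy.items()
        for cluster in clusters
    ]

print(leaderboards(scope_d))
# ['Image Comprehension / Classification & Recognition']
```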

Choose a sub-board:
Image Comprehension
Image Generation
Video Comprehension
Video Generation
3D Comprehension
3D Generation
Audio Comprehension
Audio Generation
📌 Go to the [All-Tasks] page to find all tasks included in each skill cluster, e.g., Classification & Recognition under Image Comprehension.