Model Diagnostics

On this page, we present a comprehensive diagnostic analysis of the multimodal generalist models included in our General-Bench 🧠 leaderboard. Built on an exceptionally large-scale, multi-dimensional 📊 evaluation benchmark, General-Bench enables broad and in-depth assessment across diverse modalities, tasks, and paradigms 🔄.

While leaderboard rankings 🏅 offer a high-level view of overall performance, they often mask the nuanced strengths and weaknesses each model exhibits across different dimensions. To bridge this gap, our Model Diagnostics module aims to unpack these subtleties, identifying where each model excels 💪 and where it struggles ⚠️ across modalities, capabilities, and task types.

We believe such fine-grained diagnostics are essential for guiding the development of stronger and more robust multimodal models 🚀, and that this effort plays a critical role in advancing the field toward truly universal multimodal generalists and, ultimately, Artificial General Intelligence (AGI) 🤖.

Stay Tuned!
This module is currently under development and will be available soon!

Experimental Analyses and Findings

✨ Based on the General-Level framework, we conduct a comprehensive experimental analysis on the General-Bench dataset with 172 specialist models and 102 generalist models, examining two key perspectives (Capability Breakdown and Synergy Analysis) across three tiers: Tasks, Modalities, and Paradigms (Comprehension vs. Generation).
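To make this analysis pipeline concrete, below is a minimal Python sketch of a capability breakdown along the three tiers. The per-(model, task) record schema, field names, and scores are illustrative assumptions, not the actual General-Bench data format.

```python
from collections import defaultdict

# Hypothetical per-(model, task) result records; field names and values
# are illustrative assumptions, not the actual General-Bench format.
results = [
    {"model": "ModelA", "task": "image-captioning", "modality": "image",
     "paradigm": "comprehension", "score": 71.2},
    {"model": "ModelA", "task": "text-to-image", "modality": "image",
     "paradigm": "generation", "score": 40.5},
]

def capability_breakdown(results):
    """Summarize each model along the three tiers:
    task coverage, mean score per modality, mean score per paradigm."""
    tasks = defaultdict(set)
    scores = defaultdict(lambda: defaultdict(list))
    for r in results:
        tasks[r["model"]].add(r["task"])
        scores[r["model"]][("modality", r["modality"])].append(r["score"])
        scores[r["model"]][("paradigm", r["paradigm"])].append(r["score"])
    return {
        model: {
            "num_tasks_supported": len(tasks[model]),
            # Mean score per (modality, ...) and (paradigm, ...) group.
            **{key: sum(v) / len(v) for key, v in groups.items()},
        }
        for model, groups in scores.items()
    }

print(capability_breakdown(results))
```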

🧪 Capability Breakdown - Task Support

👉 Take-away: Current MLLMs generally exhibit limited task support, with a strong bias toward simpler comprehension tasks and significant challenges in covering diverse and complex generation skills across modalities.

📢 Synergy Analysis - Synergy Across Skills

👉 Take-away: Synergy effects in MLLMs are uneven across skills, with stronger synergy observed in generation tasks and among closely related skills, particularly in models with higher Level-3 scores.
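As a rough illustration of how such a synergy signal might be quantified, the sketch below compares a generalist's score on each skill against the best specialist on that skill, treating a positive margin as evidence of cross-skill transfer. This is an assumed proxy for exposition only, not the official General-Level scoring formula, and all numbers are hypothetical.

```python
def synergy_margin(generalist_scores, specialist_best):
    """Per-skill margin of a generalist over the best specialist.
    A positive margin is read here as a (loose) synergy signal: joint
    training on related skills lifted the generalist above the
    single-skill expert. Assumed proxy only, not the official
    General-Level scoring formula."""
    return {
        skill: score - specialist_best[skill]
        for skill, score in generalist_scores.items()
        if skill in specialist_best  # only compare jointly supported skills
    }

# Hypothetical scores, for illustration only.
generalist = {"image-captioning": 74.0, "video-qa": 58.5, "text-to-image": 43.0}
specialists = {"image-captioning": 72.1, "video-qa": 61.0, "text-to-image": 39.8}

for skill, margin in synergy_margin(generalist, specialists).items():
    print(f"{skill}: {margin:+.1f} ({'synergy' if margin > 0 else 'no synergy'})")
```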

🧪 Capability Breakdown - Modality Support

👉 Take-away: Most MLLMs support only a single non-language modality; just a few, such as NExT-GPT-V1.5 and Unified-IO-2, demonstrate truly broad, all-modality capabilities.

📢 Synergy Analysis - Synergy Across Modalities

👉 Take-away: Synergy is strongest between the image and video modalities, while language shows only one-way synergy toward other modalities; no modality significantly enhances language tasks, highlighting a key limitation of current MLLMs.

🧪 Capability Breakdown - Comprehension vs. Generation

👉 Take-away: Most MLLMs are stronger at comprehension than generation, owing to the greater complexity and training cost of generation; only a few models, such as Vitron-V1, demonstrate balanced capabilities across both paradigms.

📢 Synergy Analysis - Synergy Across Comprehension and Generation

👉 Take-away: Only a few MLLMs exhibit synergy between comprehension and generation, with Mini-Gemini showing the strongest effect—mainly within the image modality.