FAQ
Below we list the questions most frequently asked by users:
1. What is the main goal of this benchmark?
The benchmark is designed to systematically evaluate the generalization, cross-modal reasoning, and fine-grained task-specific capabilities of multimodal large language models (MLLMs). It supports evaluation across diverse tasks, modalities, and difficulty levels, making it suitable for models ranging from skill-specific specialists to general-purpose AGI-level systems.
2. What is the difference between the Closed Set and Open Set?
Closed Set: Only input data is released. Users submit predictions for centralized evaluation. Used for leaderboard ranking.
Open Set: Full data (inputs + ground truths) is released. Suitable for research, analysis, or self-evaluation, but not ranked.
3. How do I choose the right evaluation scope (Scope A–D) for my model?
Choose based on your model’s capabilities:
Scope-A: For generalist models covering all modalities and tasks.
Scope-B: For modality-specific models (e.g., image-only, audio-only).
Scope-C: For models that specialize in either comprehension or generation tasks.
Scope-D: For models targeting specific task clusters (e.g., VQA, ASR).
4. Can I evaluate my model on multiple scopes simultaneously?
Yes. You may submit to multiple scopes if your model supports them. Each scope has an independent leaderboard and evaluation protocol.
5. What types of tasks are included in the benchmark?
Tasks are categorized across several modalities (text, image, audio, video, 3D, etc.) and include both comprehension (e.g., classification, QA) and generation (e.g., captioning, synthesis) tasks. Examples:
Visual Question Answering (VQA)
Image Captioning
Speech Recognition
Video Summarization
3D Object Reconstruction
6. Is participation in the leaderboard mandatory?
No. You can use the Open Set data for internal use, paper experiments, or debugging without submitting to the leaderboard.
7. How is the final score calculated?
For each scope, scores are aggregated based on task-specific metrics. For Scope-A, a General-Level Score is calculated by averaging normalized task scores across all included modalities and tasks. Detailed scoring formulas are available in the Evaluation Documentation.
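For intuition only, the following Python sketch shows one way such an aggregation could look: each task score is normalized to [0, 1] by an assumed metric maximum, and the normalized scores are averaged. The task names, normalization bounds, and the plain mean are illustrative assumptions; the official formulas in the Evaluation Documentation take precedence.

```python
# Illustrative sketch of score aggregation (not the official formula).
# Each entry maps a task to (raw_score, assumed_metric_max); the tasks,
# the bounds, and the simple-mean aggregation are assumptions made for
# demonstration only.
task_scores = {
    "vqa_accuracy": (72.5, 100.0),
    "captioning_cider": (1.10, 2.0),   # assumed normalization bound
    "asr_1_minus_wer": (0.88, 1.0),
}

normalized = [score / max_value for score, max_value in task_scores.values()]
general_level_score = sum(normalized) / len(normalized)
print(f"General-Level Score (illustrative): {general_level_score:.3f}")
```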
8. Are there resource or compute requirements for participation?
There are no strict requirements, but:
Scope-A may require high compute due to full-modality coverage.
Scope-C and D are more lightweight and suitable for smaller models or limited resources.
9. Can I use the dataset for commercial or non-academic purposes?
Please check the data usage license associated with each dataset. Most are for research only, but licensing terms may vary by source.
10. How do I submit predictions to the leaderboard?
You must:
Choose a target scope (A–D) and the corresponding Closed Set test data.
Format predictions according to the JSON specification (an illustrative sketch follows this answer).
Submit your output through the official evaluation platform or contact the organizing team (if offline evaluation is temporarily enabled).
Results will be posted on the leaderboard if formatting and evaluation pass.
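As an illustration only, the snippet below sketches how a prediction file might be assembled and written in Python. The field names (task_id, sample_id, prediction) and the file name predictions.json are hypothetical placeholders; the authoritative fields and file structure are defined by the official JSON specification.

```python
import json

# Hypothetical prediction entries; the real field names and file layout
# are defined by the official JSON specification, not by this sketch.
predictions = [
    {"task_id": "vqa", "sample_id": "0001", "prediction": "a red bicycle"},
    {"task_id": "vqa", "sample_id": "0002", "prediction": "two people"},
]

# Write a single JSON file for the submission, then upload it through
# the official evaluation platform.
with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```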