Dataset Category

General-Bench, a companion massive multimodal benchmark dataset, encompasses a broader spectrum of skills, modalities, formats, and capabilities. Specifically, it comprises 145 multimodal skills and 702 tasks spanning 5 major modalities, covering both comprehension and generation across 29 domains. The tasks come in a wide variety of fully free-form formats and are evaluated with a variety of raw metrics. In total, the benchmark contains more than 325,876 samples, and this number will continue to grow.

To accommodate varying computational budgets and model development stages, General-Bench provides a dual-set evaluation protocol by dividing the test data for each task into two distinct subsets: the closed set and the open set.

General-Bench-Openset: This subset is designated specifically for leaderboard submissions. Only the input data is released publicly, while the corresponding ground-truth annotations are kept hidden. Participants are required to submit their model predictions, which will be centrally evaluated and ranked on the leaderboard.

General-Bench-Closeset: This subset includes both the input data and the corresponding ground-truth outputs. This allows researchers and developers to freely explore the dataset, conduct custom analyses, or use the data for publications and internal benchmarking without participating in the official leaderboard.
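To make the dual-set protocol concrete, here is a minimal Python sketch of how one task's test data could be divided into a leaderboard subset (inputs released, ground truth withheld for central scoring) and a fully annotated subset. It is illustrative only. The `Sample` record, the `split_task` and `score_submission` helpers, the 50/50 split ratio, the fixed seed, and the exact-match scoring are all hypothetical assumptions. They do not describe General-Bench's actual split procedure, file format, or evaluation metrics.

```python
import random
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout; General-Bench's real schema may differ.
    sample_id: str
    input: str
    output: Optional[str]  # ground truth; None when withheld from participants


def split_task(
    samples: List[Sample], leaderboard_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[Sample], Dict[str, Optional[str]], List[Sample]]:
    """Split one task's test samples into two disjoint subsets.

    The ratio and seed are assumptions for illustration. Returns:
      - leaderboard_release: inputs only, ground truth stripped
      - hidden_labels: sample_id -> ground truth, kept by the organizers
      - annotated_release: inputs together with ground truth
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    cut = int(len(shuffled) * leaderboard_fraction)
    leaderboard, annotated = shuffled[:cut], shuffled[cut:]
    hidden_labels = {s.sample_id: s.output for s in leaderboard}
    # Strip the ground truth before publishing the leaderboard subset.
    leaderboard_release = [replace(s, output=None) for s in leaderboard]
    return leaderboard_release, hidden_labels, annotated


def score_submission(
    predictions: Dict[str, str], hidden_labels: Dict[str, Optional[str]]
) -> float:
    """Centrally score a submission with exact-match accuracy (a placeholder metric)."""
    if not hidden_labels:
        return 0.0
    correct = sum(predictions.get(sid) == gt for sid, gt in hidden_labels.items())
    return correct / len(hidden_labels)


if __name__ == "__main__":
    task = [Sample(f"s{i}", f"question {i}", f"answer {i}") for i in range(10)]
    release, labels, annotated = split_task(task)
    print(len(release), len(annotated))  # 5 5
```

Keeping the withheld labels in a separate mapping, rather than in the released records, mirrors the idea that only the organizers can score leaderboard predictions; a real benchmark would substitute each task's own raw metric for the exact-match placeholder.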