Statistics of the Video Generation Skills

This section presents detailed statistics for skills categorized under the Video modality. The video generation tasks are grouped by functional objective, such as text-to-video synthesis, conditional video generation, and video enhancement. Each task entry lists its abbreviation, domain, target capability, data source, number of samples, state-of-the-art (SoTA) specialist model, and evaluation metric.

V-G-1 Text-to-Video Generation (Txt2Vid Gen)

This cluster involves generating videos directly from textual prompts, requiring strong visual imagination and alignment with semantic intent.

Subject-Driven Text to Video Generation

This task requires generating videos that preserve subject identity (e.g., a specific person or object) across frames.

Abbreviation:

Subj-Txt2Vid Gen

Domain:

General

Capability:

Creativity and Innovation

Data Source:

VBench

Number:

500

SoTA Specialist:

Zeroscope

Metrics:

DINOScore
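The source names DINOScore as the metric but does not spell out how it is computed. As a minimal sketch, assuming the score averages the cosine similarity of each frame's DINO feature against both the first frame and the preceding frame, subject consistency could be estimated as follows. The function names and the use of precomputed feature vectors are hypothetical; no real model is loaded here.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def subject_consistency(frame_feats: Sequence[Vector]) -> float:
    """Hypothetical DINOScore-style metric over precomputed per-frame features.

    Assumption: every frame after the first is compared with the first frame
    and with its predecessor; the two similarities are averaged per frame,
    then averaged over the video.
    """
    if len(frame_feats) < 2:
        raise ValueError("need at least two frames")
    first = frame_feats[0]
    per_frame = [
        0.5 * (cosine(first, cur) + cosine(prev, cur))
        for prev, cur in zip(frame_feats, frame_feats[1:])
    ]
    return sum(per_frame) / len(per_frame)
```

Under this assumed formulation, a video whose subject features never change scores 1.0, and the score drops as the subject drifts away from its first-frame appearance.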

Background-Consistency Text to Video Generation

This task focuses on producing videos with consistent and stable background elements, even across scene transitions.

Abbreviation:

BG-ConsisTxt2Vid Gen

Domain:

General

Capability:

Creativity and Innovation

Data Source:

VBench

Number:

500

SoTA Specialist:

LaVie

Metrics:

CLIPScore

V-G-2 Conditional Video Generation (Condt Vid Gen)

This cluster emphasizes controllability in generation by incorporating style, scene, or attribute-level constraints in addition to textual prompts.

Style-Specific Text to Video Generation (Style-Txt2Vid Gen)

This task involves generating videos aligned with a specified visual style (e.g., cartoon, cinematic, or sketch).

Abbreviation:

Style-Txt2Vid Gen

Domain:

General

Capability:

Creativity and Innovation

Data Source:

Manual

Number:

500

SoTA Specialist:

VideoCrafter2

Metrics:

CLIPScore
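The source does not state how CLIPScore is aggregated over a video. One plausible reading, sketched below as an assumption, is to compare the prompt's CLIP text embedding with each frame's CLIP image embedding, clamp negative similarities to zero, and average over frames. The embeddings are taken as precomputed inputs, and `clip_score_video` is a hypothetical helper name.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def clip_score_video(text_emb: Vector, frame_embs: Sequence[Vector]) -> float:
    """Hypothetical video-level CLIPScore from precomputed embeddings.

    Assumption: the score is the mean over frames of the text-frame cosine
    similarity, with negative similarities clamped to zero.
    """
    if not frame_embs:
        raise ValueError("need at least one frame embedding")
    per_frame = [max(cosine(text_emb, frame), 0.0) for frame in frame_embs]
    return sum(per_frame) / len(per_frame)
```

For a style prompt such as "a cartoon of a dog running", frames that look like the requested style should sit closer to the text embedding and so raise the average.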

V-G-3 Video Action Generation (Act Txt2Vid Gen)

This cluster focuses on generating dynamic actions (e.g., jumping, dancing, running) as described by textual instructions.

Action-Specific Text to Video Generation

This task requires synthesizing temporally coherent actions aligned with the described behavior.

Abbreviation:

Action-Txt2Vid Gen

Domain:

Sports

Capability:

Creativity and Innovation

Data Source:

Kinetics-700

Number:

500

SoTA Specialist:

LaVie

Metrics:

Suc-Rate
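Suc-Rate is not defined in the source. A minimal sketch, assuming a generated video counts as a success when an action recognizer's predicted label matches the action named in the prompt, is shown below. The recognizer itself is out of scope here: its predictions are passed in as plain strings, and `success_rate` is a hypothetical helper name.

```python
from typing import Sequence


def success_rate(target_actions: Sequence[str], predicted_actions: Sequence[str]) -> float:
    """Fraction of videos whose predicted action equals the prompted action.

    Assumption: labels are compared case-insensitively after trimming
    surrounding whitespace; any other mismatch counts as a failure.
    """
    if len(target_actions) != len(predicted_actions):
        raise ValueError("targets and predictions must have the same length")
    if not target_actions:
        raise ValueError("need at least one video")
    hits = sum(
        target.strip().lower() == predicted.strip().lower()
        for target, predicted in zip(target_actions, predicted_actions)
    )
    return hits / len(target_actions)
```

For example, if the prompts ask for jumping, dancing, running, and running, and the recognizer returns jumping, running, running, and running, three of the four videos match and the rate is 0.75.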

V-G-4 Image-to-Video Generation (Img2Vid Gen)

This cluster involves generating temporally extended video sequences based on static image inputs.

(Task details to be added.)

V-G-5 Video Enhancement (Vid Enhance)

This cluster focuses on improving the visual quality of existing video content, including denoising, super-resolution, and temporal smoothing.

(Task details to be added.)

V-G-6 Video Editing (Vid Edit)

This cluster covers tasks such as video object removal, temporal replacement, background editing, and region-based video manipulation.

(Task details to be added.)