Statistics of the Audio Generation Skills

This section presents detailed statistics for the generation skills under the Audio modality. The tasks cover sound synthesis, transformation, and editing, and require models to generate or manipulate acoustic content from text, image, video, or audio inputs. Tasks are evaluated using perceptual, semantic, and alignment-based metrics.

A-G-1 Audio Edit (Audio Edit)

This cluster focuses on editing or transforming audio signals for creative or corrective purposes.

AudioEdit

This task involves editing audio based on natural language instructions or other conditions.

Abbreviation: AudioEdit
Domain: Art
Capability: Creativity and Innovation
Data Source: Song Describer Dataset
Number: 16
SoTA Specialist: ControlNet for Diffusion Transformer
Metrics: CLAP
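The CLAP metric listed for this task is commonly reported as the cosine similarity between a CLAP audio embedding and the embedding of the reference text. The following is a minimal sketch of that final scoring step only, assuming the embeddings have already been extracted by some CLAP model. The function names and the example vectors are illustrative placeholders, not the benchmark's evaluation code or real CLAP outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clap_style_score(audio_embeddings, text_embeddings):
    """Mean cosine similarity over paired (audio, text) embeddings."""
    if len(audio_embeddings) != len(text_embeddings) or not audio_embeddings:
        raise ValueError("need a non-empty list of paired embeddings")
    scores = [cosine_similarity(a, t)
              for a, t in zip(audio_embeddings, text_embeddings)]
    return sum(scores) / len(scores)


# Hypothetical 3-dimensional embeddings, for illustration only.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(clap_style_score(audio, text))
```

Higher values indicate that the generated audio sits closer to the text description in the shared embedding space.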

A-G-2 Dialogue Speech Generation (Dialog Gen)

This cluster focuses on generating natural daily dialogue speech through text or prompt-based generation.

Daily Talk Generation

This task involves generating speech for daily conversational scenarios.

Abbreviation: DailyTalk Gen
Domain: General
Capability: Creativity and Innovation
Data Source: DailyTalk
Number: 500
SoTA Specialist: FastSpeech2
Metrics: MOS
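MOS (mean opinion score) is conventionally the average of listener ratings on a 1 to 5 scale. The sketch below aggregates such ratings and adds a normal-approximation 95% confidence interval. The function name, the interval choice, and the ratings are assumptions for illustration. The source does not say how its MOS values were collected or aggregated.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are expected on a 1-5 scale")
    mean = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors around the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical listener ratings for one synthesized utterance.
mos, ci = mean_opinion_score([4, 5, 3, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```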

A-G-3 Emotional Speech Generation (EmoSpeech Gen)

This cluster focuses on synthesizing or transferring emotional states in speech for affective communication.

Emotional Speech Synthesis

This task focuses on synthesizing speech with specific emotional tones.

Abbreviation: EmoSpeech Syn
Domain: General
Capability: Affective Analysis
Data Source: Emotional Speech Data (Zhou et al., 2021)
Number: 500
SoTA Specialist: DeepEST
Metrics: MCD
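MCD (mel-cepstral distortion) measures the distance between the mel-cepstral coefficients of reference and synthesized speech, with lower values being better. A common per-frame definition is

$$\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\left(c_d - \hat{c}_d\right)^2},$$

averaged over frames. The sketch below implements this form under stated assumptions: the two sequences are already time-aligned (for example by dynamic time warping), and the 0th (energy) coefficient is excluded. The function names and inputs are illustrative, and the source does not specify which MCD variant it uses.

```python
import math


def mcd_frame(ref, syn):
    """MCD in dB for one pair of aligned mel-cepstral frames (0th coefficient skipped)."""
    if len(ref) != len(syn):
        raise ValueError("frames must have the same number of coefficients")
    squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared)


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean per-frame MCD over two aligned sequences of equal length."""
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need non-empty, equally long, aligned sequences")
    total = sum(mcd_frame(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)


# Hypothetical two-frame example with three coefficients per frame.
reference = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]]
synthesized = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
print(mel_cepstral_distortion(reference, synthesized))
```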

Emotion Style Transfer

This task transfers emotional characteristics from one speech sample to another.

Abbreviation: EmoStyleTransfer
Domain: General
Capability: Affective Analysis
Data Source: EmotionalSpeech
Number: 500
SoTA Specialist: DeepEST
Metrics: MOS

A-G-4 Text-To-Speech Synthesis (TTS)

This cluster focuses on converting written text into natural-sounding speech across languages and modalities.

Text-To-Speech

This task converts text to speech for general TTS systems.

Abbreviation: TTS
Domain: Linguistics
Capability: Commonsense Knowledge, Creativity and Innovation
Data Source: LibriSpeech test-clean
Number: 500
SoTA Specialist: USLM
Metrics: WER
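For speech generation tasks, WER (word error rate) is typically obtained by transcribing the synthesized audio with a speech recognizer and comparing the transcript with the input text. The sketch below covers only the comparison step: word-level edit distance divided by the number of reference words. The function name is illustrative, and text normalization (casing, punctuation) is deliberately left out. The source does not describe its exact WER pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical recognizer transcript compared with the input text.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.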

Multimodal TTS

This task synthesizes speech using multimodal inputs.

Abbreviation: MTTS
Domain: General
Capability: Creativity and Innovation
Data Source: MEAD-TTS
Number: 500
SoTA Specialist: MM-TTS
Metrics: MOS

A-G-5 Text-to-Audio Synthesis (Txt2Aud)

This cluster focuses on generating general-purpose audio from single or multiple text inputs.

Single Caption To Audio Generation

This task generates audio from a single caption describing a sound or scene.

Abbreviation: 1CapToAudio
Domain: General
Capability: Creativity and Innovation
Data Source: AudioCaps
Number: 500
SoTA Specialist: AudioLDM2
Metrics: CLAP

Two Captions To Audio Generation

This task uses two captions to guide the synthesis of a single coherent audio clip.

Abbreviation: 2CapsToAudio
Domain: General
Capability: Creativity and Innovation
Data Source: AudioCaps
Number: 500
SoTA Specialist: AudioLDM2
Metrics: CLAP

A-G-6 Image-to-Audio Synthesis (Img2Aud Gen)

This cluster focuses on synthesizing audio from visual imagery, including speech or ambient sounds.

Image-to-Speech

This task converts image content into speech-based descriptions.

Abbreviation: ImageToSpeech
Domain: General
Capability: Content Recognition
Data Source: Flickr8k Audio
Number: 500
SoTA Specialist: Im2Sp
Metrics: CIDEr

A-G-7 Video-to-Audio Synthesis (V2A)

This cluster focuses on generating realistic or descriptive audio directly from video inputs.

Video-to-Audio

This task involves synthesizing audio (e.g., speech or ambient sounds) based on video scenes.

Abbreviation: Video2Audio
Domain: General
Capability: Commonsense Knowledge
Data Source: AVSync15
Number: 500
SoTA Specialist: Diff-Foley
Metrics: FAD

A-G-8 Speech Style Transfer (Style Trans)

This cluster focuses on converting one voice into another while preserving linguistic content.

Voice Conversion

This task re-renders a source recording in the voice of a target speaker while keeping the spoken content unchanged.

Abbreviation: VoiceConversion
Domain: General
Capability: Content Recognition
Data Source: VCTK Corpus
Number: 500
SoTA Specialist: USLM
Metrics: WER

A-G-9 Speech Translation (Speech Trans)

This cluster focuses on translating spoken content between different languages using audio inputs only.

Chinese-to-English Speech Translation

This task translates spoken Chinese into English.

Abbreviation: SpeechTrans (Zh-En)
Domain: Multidomain
Capability: Content Recognition
Data Source: GigaST
Number: 500
SoTA Specialist: USLM
Metrics: WER

English-to-Chinese Speech Translation

This task translates spoken English into Chinese.

Abbreviation: SpeechTrans (En-Zh)
Domain: General
Capability: Content Recognition
Data Source: CoVoST v1
Number: 500
SoTA Specialist: USLM
Metrics: WER