Statistics of the Audio Generation Skills
This section presents detailed statistics for skills categorized under the Audio modality.
The audio generation tasks are primarily focused on sound synthesis, transformation, and editing, requiring models to generate or manipulate acoustic content under various conditions.
Tasks are evaluated using perceptual, semantic, and alignment-based metrics.
A-G-1 Audio Edit (Audio Edit)
This cluster focuses on editing or transforming audio signals for creative or corrective purposes.
- AudioEdit
This task involves editing audio based on natural language instructions or other conditions.
- Abbreviation: AudioEdit
- Domain: Art
- Capability: Creativity and Innovation
- Data Source: Song Describer Dataset
- Number: 16
- SoTA Specialist: ControlNet for Diffusion Transformer
- Metrics: CLAP
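The CLAP metric used here (and again in the text-to-audio cluster) scores how well generated audio matches its conditioning text: both are embedded with a contrastive language-audio pretraining model and compared by cosine similarity, so higher is better. The following is a minimal sketch that assumes the embeddings have already been extracted with a pretrained CLAP-style encoder (not shown); the dummy vectors are placeholders.

```python
import numpy as np

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between a CLAP audio embedding and text embedding.

    Both inputs are 1-D vectors assumed to come from the same pretrained
    contrastive language-audio model; the score lies in [-1, 1].
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(np.dot(a, t))

# Hypothetical 512-dim embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
audio_emb, text_emb = rng.normal(size=512), rng.normal(size=512)
print(f"CLAP-style similarity: {clap_score(audio_emb, text_emb):.3f}")
```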
A-G-2 Dialogue Speech Generation (Dialog Gen)
This cluster focuses on generating natural, everyday conversational speech from text or prompts.
- Daily Talk Generation
This task involves generating speech for daily conversational scenarios.
- Abbreviation: DailyTalk Gen
- Domain: General
- Capability: Creativity and Innovation
- Data Source: DailyTalk
- Number: 500
- SoTA Specialist: FastSpeech2
- Metrics: MOS
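MOS (Mean Opinion Score), used here and in several clusters below, is a subjective metric: listeners rate each synthesized sample on a 1-5 naturalness scale and the ratings are averaged. A minimal aggregation sketch with a normal-approximation 95% confidence interval follows; the rating values are invented for illustration.

```python
import math
import statistics

def mean_opinion_score(ratings: list[float]) -> tuple[float, float]:
    """Return the MOS and a 95% confidence half-width (normal approximation)."""
    mos = statistics.mean(ratings)
    half_width = (1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
                  if len(ratings) > 1 else float("nan"))
    return mos, half_width

# Hypothetical listener ratings (1-5 scale) for one synthesized utterance.
mos, ci = mean_opinion_score([4, 5, 4, 3, 4, 5, 4, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```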
A-G-3 Emotional Speech Generation (EmoSpeech Gen)
This cluster focuses on synthesizing or transferring emotional states in speech for affective communication.
- Emotional Speech Synthesis
This task focuses on synthesizing speech with specific emotional tones; its MCD metric is sketched at the end of this cluster.
- Abbreviation: EmoSpeech Syn
- Domain: General
- Capability: Affective Analysis
- Data Source: Emotional Speech Data (Zhou et al., 2021)
- Number: 500
- SoTA Specialist: DeepEST
- Metrics: MCD
- Emotion Style Transfer
This task transfers emotional characteristics from one speech sample to another.
- Abbreviation: EmoStyleTransfer
- Domain: General
- Capability: Affective Analysis
- Data Source: EmotionalSpeech
- Number: 500
- SoTA Specialist: DeepEST
- Metrics: MOS
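MCD (mel-cepstral distortion), the metric of the Emotional Speech Synthesis task above, measures the spectral distance between synthesized and reference speech: for each frame, (10 / ln 10) * sqrt(2 * sum of squared differences of the mel-cepstral coefficients), averaged over time-aligned frames, with lower values indicating closer spectra. The sketch below assumes the coefficient sequences were already extracted and time-aligned (e.g., via DTW), which is not shown.

```python
import numpy as np

MCD_CONST = (10.0 / np.log(10.0)) * np.sqrt(2.0)

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Frame-averaged MCD in dB.

    ref_mcep, syn_mcep: arrays of shape (frames, dims) with time-aligned
    mel-cepstral coefficients (the energy-like 0th coefficient is assumed
    to have been dropped already). Lower values mean closer spectra.
    """
    diff = ref_mcep - syn_mcep
    per_frame = MCD_CONST * np.sqrt(np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Hypothetical aligned coefficients: 200 frames x 24 mel-cepstral dimensions.
rng = np.random.default_rng(1)
ref = rng.normal(size=(200, 24))
syn = ref + 0.1 * rng.normal(size=(200, 24))
print(f"MCD = {mel_cepstral_distortion(ref, syn):.2f} dB")
```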
A-G-4 Text-To-Speech Synthesis (TTS)
This cluster focuses on converting written text into natural-sounding speech across languages and modalities.
- Text-To-Speech
This task converts written text into natural speech, as in general-purpose TTS systems; its WER metric is sketched at the end of this cluster.
- Abbreviation: TTS
- Domain: Linguistics
- Capability: Commonsense Knowledge, Creativity and Innovation
- Data Source: LibriSpeech test-clean
- Number: 500
- SoTA Specialist: USLM
- Metrics: WER
- Multimodal TTS
This task synthesizes speech using multimodal inputs.
- Abbreviation: MTTS
- Domain: General
- Capability: Creativity and Innovation
- Data Source: MEAD-TTS
- Number: 500
- SoTA Specialist: MM-TTS
- Metrics: MOS
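WER (word error rate), the metric of the Text-To-Speech task above and of the voice-conversion and speech-translation tasks later in this section, is typically obtained by transcribing the generated speech with an ASR system and computing the word-level Levenshtein distance to the reference text, normalized by the reference length. A minimal sketch of the edit-distance step follows; the ASR transcription is assumed to have been produced already.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution or match
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical ASR transcript of synthesized speech vs. the input text.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```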
A-G-5 Text-to-Audio Synthesis (Txt2Aud)
This cluster focuses on generating general-purpose audio from single or multiple text inputs.
- Single Caption To Audio Generation
This task generates audio from a single caption describing a sound or scene.
- Abbreviation: 1CapToAudio
- Domain: General
- Capability: Creativity and Innovation
- Data Source: AudioCap
- Number: 500
- SoTA Specialist: AudioLDM2
- Metrics: CLAP
- Two Captions To Audio Generation
This task uses two captions to guide the synthesis of a single coherent audio clip.
- Abbreviation: 2CapsToAudio
- Domain: General
- Capability: Creativity and Innovation
- Data Source: AudioCap
- Number: 500
- SoTA Specialist: AudioLDM2
- Metrics: CLAP
A-G-6 Image-to-Audio Synthesis (Img2Aud Gen)
This cluster focuses on synthesizing audio from visual imagery, including speech or ambient sounds.
- Image-to-Speech
This task converts image content into speech-based descriptions.
- Abbreviation: ImageToSpeech
- Domain: General
- Capability: Content Recognition
- Data Source: Flickr8k Audio
- Number: 500
- SoTA Specialist: Im2Sp
- Metrics: CIDEr
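CIDEr, the metric of the Image-to-Speech task above, is a caption-consensus measure: the generated spoken description is presumably transcribed to text first and then compared to the reference captions via TF-IDF-weighted n-gram cosine similarity. Below is a heavily simplified, unigram-only sketch of that core computation; the full metric averages 1- to 4-gram scores and adds a length penalty (CIDEr-D), and all names and example strings here are illustrative.

```python
import math
from collections import Counter

def cider_unigram(candidate: str, references: list[str],
                  corpus: list[list[str]]) -> float:
    """Unigram-only, TF-IDF-weighted cosine similarity in the style of CIDEr.

    `corpus` is the full list of reference sets, used only to estimate
    document frequencies; the score is averaged over this image's references
    and scaled by 10 as in the original metric.
    """
    num_images = len(corpus)
    doc_freq = Counter()
    for refs in corpus:
        doc_freq.update({w for r in refs for w in r.split()})

    def tfidf(text: str) -> dict:
        counts = Counter(text.split())
        total = sum(counts.values())
        return {w: (c / total) * math.log(num_images / max(doc_freq[w], 1))
                for w, c in counts.items()}

    def cosine(a: dict, b: dict) -> float:
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    cand = tfidf(candidate)
    return 10.0 * sum(cosine(cand, tfidf(r)) for r in references) / len(references)

# Hypothetical transcripts of generated speech vs. reference captions.
corpus = [["a dog runs on the grass", "a brown dog running outside"],
          ["a man rides a bicycle down the road"]]
print(f"{cider_unigram('a dog running on grass', corpus[0], corpus):.2f}")
```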
A-G-7 Video-to-Audio Synthesis (V2A)
This cluster focuses on generating realistic or descriptive audio directly from video inputs.
- Video-to-Audio
This task involves synthesizing audio (e.g., speech or ambient sounds) based on video scenes.
- Abbreviation: Video2Audio
- Domain: General
- Capability: Commonsense Knowledge
- Data Source: AVSync15
- Number: 500
- SoTA Specialist: Diff-Foley
- Metrics: FAD
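FAD (Fréchet Audio Distance), the metric of the Video-to-Audio task above, compares distributions rather than individual clips: a Gaussian is fit to the embeddings of the generated set and another to the embeddings of a reference set (classically from a VGGish-style audio encoder), and the Fréchet distance ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2)) is reported, lower being better. The sketch below assumes the embeddings are already extracted; the random arrays are placeholders.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two sets of audio embeddings.

    emb_ref, emb_gen: arrays of shape (num_clips, dim) from a pretrained
    audio encoder (assumed precomputed). Lower values mean the generated
    distribution is closer to the reference distribution.
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root of the product
    if np.iscomplexobj(covmean):            # drop tiny imaginary round-off
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Hypothetical 64-dim embeddings for 256 reference and 256 generated clips.
rng = np.random.default_rng(2)
ref = rng.normal(size=(256, 64))
gen = rng.normal(loc=0.1, size=(256, 64))
print(f"FAD = {frechet_audio_distance(ref, gen):.3f}")
```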
A-G-8 Speech Style Transfer (Style Trans)
This cluster focuses on converting one voice into another while preserving linguistic content.
- Voice Conversion
This task converts a recording in one speaker's voice into another speaker's voice while preserving the spoken content.
- Abbreviation: VoiceConversion
- Domain: General
- Capability: Content Recognition
- Data Source: VCTK Corpus
- Number: 500
- SoTA Specialist: USLM
- Metrics: WER
A-G-9 Speech Translation (Speech Trans)
This cluster focuses on translating spoken content between different languages using audio inputs only.
- Chinese-to-English Speech Translation
This task translates spoken Chinese into English.
- Abbreviation: SpeechTrans (Zh-En)
- Domain: Multidomain
- Capability: Content Recognition
- Data Source: GigaST
- Number: 500
- SoTA Specialist: USLM
- Metrics: WER
- English-to-Chinese Speech Translation
This task translates spoken English into Chinese.
- Abbreviation: SpeechTrans (En-Zh)
- Domain: General
- Capability: Content Recognition
- Data Source: CoVoST v1
- Number: 500
- SoTA Specialist: USLM
- Metrics: WER