Statistics of the Audio Comprehension Skills
This section presents detailed statistics for skills categorized under the Audio modality.
The audio comprehension tasks span various domains, including speech understanding, acoustic perception, and auditory reasoning.
These tasks challenge models to align auditory signals with linguistic, semantic, or emotional interpretations.
A-C-1 Speech Accent Understanding (Acnt Analy)
This cluster focuses on recognizing speaker- and voice-related attributes in speech, including accent, speaker sex, speaker identity, and vocal sound type.
- Accent Classification
This task requires classifying a speaker's accent from a speech recording.
- Abbreviation:
Accent Recog
- Domain:
Linguistics
- Capability:
Reasoning Ability
- Data Source:
Speech Accent Archive
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc
- Accent Sex Classification
This task requires classifying the sex of the speaker in accented speech recordings.
- Abbreviation:
Accent-Sex Recog
- Domain:
Linguistics
- Capability:
Reasoning Ability, Commonsense Knowledge
- Data Source:
Speech Accent Archive
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc
- Speaker Identification
This task focuses on identifying individuals from their voice samples.
- Abbreviation:
Speaker ID
- Domain:
Linguistics
- Capability:
Reasoning Ability
- Data Source:
VoxCeleb
- Number:
279
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
- Vocal Sound Classification
This task involves classifying types of vocal sounds such as humming or whispering.
- Abbreviation:
Vocal-Sound Recog
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
Vocal Sound
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc
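Most classification tasks in this section report Acc. The following is a minimal sketch of that metric, assuming it is plain exact-match accuracy over predicted and gold labels; the function name `accuracy` and the example labels are hypothetical and not taken from the benchmark code.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical accent labels, for illustration only.
preds = ["english", "mandarin", "spanish", "english"]
golds = ["english", "mandarin", "arabic", "english"]
print(accuracy(preds, golds))  # 0.75
```

Free-form model outputs would need to be mapped to a label before this comparison; that normalization step is not specified here.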
A-C-2 Speech Content Understanding (Ctnt Analy)
This cluster focuses on understanding the semantic content and communicative intent in spoken language, including intent recognition, command interpretation, and event extraction.
- Intent Classification
This task focuses on identifying the speaker’s intent from spoken commands.
- Abbreviation:
Intent Recog
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
Fluent Speech Commands
- Number:
500
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
- Speech Command
This task involves recognizing specific spoken commands from short utterances.
- Abbreviation:
Speech Cmd
- Domain:
General
- Capability:
Content Recognition
- Data Source:
speech-commands
- Number:
500
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
- Speech Event Extraction
This task focuses on extracting structured events from speech, combining content understanding and reasoning.
- Abbreviation:
SpeechEE
- Domain:
General
- Capability:
Content Recognition, Reasoning Ability
- Data Source:
CASIE
- Number:
500
- SoTA Specialist:
E2E (T5)
- Metrics:
F1
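Speech Event Extraction is scored with F1 rather than Acc because each utterance can contain several events. A rough sketch of one common way to compute it follows, assuming micro-averaged F1 over sets of extracted items; the exact matching criterion is not stated in this section, and the `(event_type, trigger)` tuples and the name `micro_f1` are hypothetical.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-utterance collections of extracted items."""
    tp = fp = fn = 0
    for pred_items, gold_items in zip(predicted, gold):
        pred_set, gold_set = set(pred_items), set(gold_items)
        tp += len(pred_set & gold_set)   # items found in both
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (event_type, trigger) pairs, for illustration only.
predicted = [[("attack", "breached")], [("patch", "released"), ("attack", "stole")]]
gold = [[("attack", "breached")], [("patch", "fixed")]]
print(micro_f1(predicted, gold))
```

Counting true positives across all utterances before dividing keeps utterances with many events from being under-weighted, which is the usual reason to prefer micro-averaging for extraction tasks.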
A-C-3 Speech Emotion Understanding (SpeechEmo Analy)
This cluster focuses on analyzing emotional expressions in speech through affective audio analysis.
- Speech Emotion Recognition
This task identifies the emotional state of the speaker from their voice.
- Abbreviation:
Emotion Recog
- Domain:
General
- Capability:
Affective Analysis
- Data Source:
IEMOCAP
- Number:
500
- SoTA Specialist:
WavLM Large
- Metrics:
Acc
A-C-4 Music Understanding (Music Analy)
This cluster focuses on understanding music content, instruments, genres, and pitch through audio recognition.
- Music Genre Classification
This task classifies the genre of a music piece based on audio signals.
- Abbreviation:
Music-Genre Recog
- Domain:
Art
- Capability:
Content Recognition
- Data Source:
Music Genre (Mus)
- Number:
500
- SoTA Specialist:
Musicset-Sup
- Metrics:
Acc
- Music Instrument Classification
This task classifies musical instruments present in an audio clip using commonsense and acoustic cues.
- Abbreviation:
Instrument Recog
- Domain:
Art
- Capability:
Content Recognition, Commonsense Knowledge
- Data Source:
NS. Instruments
- Number:
500
- SoTA Specialist:
HuBERT-base
- Metrics:
Acc
- Music Instrument Source Analysis
This task analyzes the source of instruments in music to understand their acoustic patterns.
- Abbreviation:
Instrument-Source Anal
- Domain:
Art
- Capability:
Content Recognition
- Data Source:
NS. Instruments Source
- Number:
500
- SoTA Specialist:
Musicset-ULarge
- Metrics:
Acc
- Music Pitch Analysis
This task focuses on analyzing pitch patterns in musical content.
- Abbreviation:
Pitch Anal
- Domain:
Art
- Capability:
Content Recognition
- Data Source:
NS. Pitch
- Number:
500
- SoTA Specialist:
Musicset-ULarge
- Metrics:
Acc
A-C-5 Audio Technique Understanding (Aud-Tech Analy)
This cluster focuses on recognizing technical attributes of music performance, including pitch, vocal technique, and performer identity.
- Note Qualities Analysis
This task analyzes note quality aspects such as timbre and clarity in musical performances.
- Abbreviation:
Note-Quality Anal
- Domain:
Art
- Capability:
Content Recognition
- Data Source:
NS. quality
- Number:
500
- SoTA Specialist:
HuBERT-base
- Metrics:
Acc
- Singer Identification
This task identifies individual singers from vocal recordings.
- Abbreviation:
Singer ID
- Domain:
Art
- Capability:
Reasoning Ability
- Data Source:
VocalSet
- Number:
500
- SoTA Specialist:
HuBERT-base
- Metrics:
Acc
- Vocal Technique Detection
This task detects vocal techniques used in singing, such as vibrato or falsetto.
- Abbreviation:
Vocal-Tech Detect
- Domain:
Art
- Capability:
Reasoning Ability
- Data Source:
VocalSet
- Number:
500
- SoTA Specialist:
HuBERT-base
- Metrics:
Acc
A-C-6 Audio Content Understanding (Aud Analy)
This cluster focuses on reasoning about semantic content in longer audio recordings, such as story narration or spontaneous events.
- Long Audio Captioning
This task generates captions for long-form audio content using reasoning over extended audio sequences.
- Abbreviation:
Long AudioCaps
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
Clotho Caption
- Number:
500
- SoTA Specialist:
PTAAC
- Metrics:
BLEU-1
- Wild Audio Captioning
This task involves generating captions for in-the-wild audio scenes with complex auditory events.
- Abbreviation:
Wild AudioCaps
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
AudioCaps
- Number:
500
- SoTA Specialist:
PTAAC
- Metrics:
BLEU-1
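The two captioning tasks are scored with BLEU-1, that is, unigram BLEU. Below is a minimal sketch of the metric, assuming a single reference caption, lowercased whitespace tokenization, and the standard brevity penalty; the benchmark's actual tokenization and number of references are not given in this section, and the function name `bleu1` is hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    # Clip each candidate word count by its count in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / len(cand)
    # Penalize candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * precision


# Hypothetical captions, for illustration only.
print(bleu1("a dog barks in the distance", "a dog is barking in the distance"))
```

Clipping prevents a caption from scoring well by repeating a single word that appears in the reference, and the brevity penalty discourages very short captions that would otherwise have high precision.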
A-C-7 General Audio Question Answering (Aud QA)
This cluster focuses on answering open-ended or fact-based questions from audio recordings, requiring reasoning and background knowledge.
- Open Audio Question Answering
This task involves answering diverse open-ended questions based on general audio input.
- Abbreviation:
OpenAQA
- Domain:
General
- Capability:
Reasoning Ability, Commonsense Knowledge
- Data Source:
OpenAQA (LTU)
- Number:
500
- SoTA Specialist:
LTU
- Metrics:
GPT-Score
- Audio Question Answering
This task focuses on answering fact-based or inference questions grounded in audio input.
- Abbreviation:
AudioQA
- Domain:
General
- Capability:
Reasoning Ability, Commonsense Knowledge
- Data Source:
ClothoAQA
- Number:
500
- SoTA Specialist:
MWAFM
- Metrics:
Acc
A-C-8 Animal Sound Analysis (Animal-Sound Det)
This cluster focuses on recognizing and analyzing various animal sounds for biological and commonsense understanding.
- Bird Sound Detection
This task focuses on detecting and recognizing bird sounds in audio recordings.
- Abbreviation:
Bird-Sound Detect
- Domain:
Biology, Animal
- Capability:
Content Recognition, Commonsense Knowledge
- Data Source:
Birdsong
- Number:
500
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
- Animal Sound Detection
This task focuses on detecting a broader range of animal sounds from audio data.
- Abbreviation:
AnimalSoundDetect
- Domain:
Biology, Animal
- Capability:
Content Recognition, Commonsense Knowledge
- Data Source:
Animal-Sound Classification
- Number:
500
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
A-C-9 Environment Sound Understanding (Envir-Sound Det)
This cluster focuses on understanding environmental sounds through recognition and reasoning over diverse audio contexts.
- Acoustic Scene Recognition
This task involves recognizing various acoustic scenes from environmental audio recordings.
- Abbreviation:
Acoustic-Scene Recog
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
TUT
- Number:
468
- SoTA Specialist:
CLAP
- Metrics:
Acc
- Environment Sound Recognition
This task involves recognizing general environmental sounds, such as thunder or footsteps, using commonsense understanding.
- Abbreviation:
Env-Sound Recog
- Domain:
General
- Capability:
Reasoning Ability, Commonsense Knowledge
- Data Source:
ESC50
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc
- Sound Event Recognition
This task focuses on identifying specific sound events occurring in audio scenes.
- Abbreviation:
Sound-Event Recog
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
ESC50
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc