Statistics of the Audio Comprehension Skills
This section presents detailed statistics for skills categorized under the Audio modality.
The audio comprehension tasks span various domains, including speech understanding, acoustic perception, and auditory reasoning.
These tasks challenge models to align auditory signals with linguistic, semantic, or emotional interpretations.
A-C-1 Speech Accent Understanding (Acnt Analy)
This cluster focuses on recognizing speaker-related attributes of speech, including accent, speaker identity, and vocal sound type.
- Accent Classification
This task requires classifying the accent of a speaker from a speech recording.
- Abbreviation:
Accent Recog
- Domain:
Linguistics
- Capability:
Reasoning Ability
- Data Source:
Speech Accent Archive
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc
- Accent Sex Classification
This task requires classifying the sex of the speaker from accented speech recordings.
- Abbreviation:
Accent-Sex Recog
- Domain:
Linguistics
- Capability:
Reasoning Ability, Commonsense Knowledge
- Data Source:
Speech Accent Archive
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc
- Speaker Identification
This task focuses on identifying individuals from their voice samples.
- Abbreviation:
Speaker ID
- Domain:
Linguistics
- Capability:
Reasoning Ability
- Data Source:
VoxCeleb
- Number:
279
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
- Vocal Sound Classification
This task involves classifying types of vocal sounds such as humming or whispering.
- Abbreviation:
Vocal-Sound Recog
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
Vocal Sound
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc
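Most tasks in this section report Acc. As a minimal sketch, assuming Acc denotes the fraction of model answers that exactly match the reference label, it could be computed as follows. The function name `accuracy` and the lowercase/whitespace normalization are illustrative assumptions; the answer-matching protocol actually used for the benchmark is not specified here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to the reference label.

    Assumption: labels are compared after trimming whitespace and
    lowercasing; the benchmark's own normalization may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

For example, with four hypothetical accent labels of which three are predicted correctly, `accuracy` returns 0.75.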
A-C-2 Speech Content Understanding (Ctnt Analy)
This cluster focuses on understanding the semantic content and communicative intent in spoken language, including intent recognition, command interpretation, and event extraction.
- Intent Classification
This task focuses on identifying the speaker’s intent from spoken commands.
- Abbreviation:
Intent Recog
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
Fluent Speech Commands
- Number:
500
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
- Speech Command
This task involves recognizing specific spoken commands from short utterances.
- Abbreviation:
Speech Cmd
- Domain:
General
- Capability:
Content Recognition
- Data Source:
speech-commands
- Number:
500
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
- Speech Event Extraction
This task focuses on extracting structured events from speech, combining content understanding and reasoning.
- Abbreviation:
SpeechEE
- Domain:
General
- Capability:
Content Recognition, Reasoning Ability
- Data Source:
CASIE
- Number:
500
- SoTA Specialist:
E2E (T5)
- Metrics:
F1
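Speech Event Extraction is scored with F1 rather than Acc because each utterance can yield several structured events. A hedged sketch, assuming a micro-averaged F1 over exact matches between predicted and gold event tuples, is shown below. The tuple layout (for example, event type plus trigger) and the helper name `micro_f1` are hypothetical; the matching criteria used in the benchmark may differ.

```python
from typing import Hashable, Sequence, Set


def micro_f1(
    predicted: Sequence[Set[Hashable]], gold: Sequence[Set[Hashable]]
) -> float:
    """Micro-averaged F1 over per-utterance sets of extracted items.

    Assumption: an item counts as correct only on an exact match
    with a gold item from the same utterance.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have equal length")
    tp = fp = fn = 0
    for pred_items, gold_items in zip(predicted, gold):
        tp += len(pred_items & gold_items)  # correctly extracted items
        fp += len(pred_items - gold_items)  # spurious items
        fn += len(gold_items - pred_items)  # missed items
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Micro-averaging pools counts across utterances, so utterances with many events weigh more than utterances with few; a macro-averaged variant would instead average per-utterance scores.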
A-C-3 Speech Emotion Understanding (SpeechEmo Analy)
This cluster focuses on analyzing emotional expressions in speech through affective audio analysis.
- Speech Emotion Recognition
This task identifies the emotional state of the speaker from their voice.
- Abbreviation:
Emotion Recog
- Domain:
General
- Capability:
Affective Analysis
- Data Source:
IEMOCAP
- Number:
500
- SoTA Specialist:
WavLM Large
- Metrics:
Acc
A-C-4 Music Understanding (Music Analy)
This cluster focuses on understanding music content, instruments, genres, and pitch through audio recognition.
- Music Genre Classification
This task classifies the genre of a music piece based on audio signals.
- Abbreviation:
Music-Genre Recog
- Domain:
Art
- Capability:
Content Recognition
- Data Source:
Music Genre (Mus)
- Number:
500
- SoTA Specialist:
Musicset-Sup
- Metrics:
Acc
- Music Instrument Classification
This task classifies musical instruments present in an audio clip using commonsense and acoustic cues.
- Abbreviation:
Instrument Recog
- Domain:
Art
- Capability:
Content Recognition, Commonsense Knowledge
- Data Source:
NS. Instruments
- Number:
500
- SoTA Specialist:
HuBERT-base
- Metrics:
Acc
- Music Instrument Source Analysis
This task analyzes the source of instruments in music to understand their acoustic patterns.
- Abbreviation:
Instrument-Source Anal
- Domain:
Art
- Capability:
Content Recognition
- Data Source:
NS. Instruments Source
- Number:
500
- SoTA Specialist:
Musicset-ULarge
- Metrics:
Acc
- Music Pitch Analysis
This task focuses on analyzing pitch patterns in musical content.
- Abbreviation:
Pitch Anal
- Domain:
Art
- Capability:
Content Recognition
- Data Source:
NS. Pitch
- Number:
500
- SoTA Specialist:
Musicset-ULarge
- Metrics:
Acc
A-C-5 Audio Technique Understanding (Aud-Tech Analy)
This cluster focuses on recognizing technical attributes of music performance, including note quality, vocal technique, and singer identity.
- Note Qualities Analysis
This task analyzes note quality aspects such as timbre and clarity in musical performances.
- Abbreviation:
Note-Quality Anal
- Domain:
Art
- Capability:
Content Recognition
- Data Source:
NS. quality
- Number:
500
- SoTA Specialist:
HuBERT-base
- Metrics:
Acc
- Singer Identification
This task identifies individual singers from vocal recordings.
- Abbreviation:
Singer ID
- Domain:
Art
- Capability:
Reasoning Ability
- Data Source:
VocalSet
- Number:
500
- SoTA Specialist:
HuBERT-base
- Metrics:
Acc
- Vocal Technique Detection
This task detects vocal techniques used in singing, such as vibrato or falsetto.
- Abbreviation:
Vocal-Tech Detect
- Domain:
Art
- Capability:
Reasoning Ability
- Data Source:
VocalSet
- Number:
500
- SoTA Specialist:
HuBERT-base
- Metrics:
Acc
A-C-6 Audio Content Understanding (Aud Analy)
This cluster focuses on reasoning about semantic content in longer audio recordings, such as story narration or spontaneous events.
- Long Audio Captioning
This task generates captions for long-form audio content using reasoning over extended audio sequences.
- Abbreviation:
Long AudioCaps
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
Clotho Caption
- Number:
500
- SoTA Specialist:
PTAAC
- Metrics:
BLEU-1
- Wild Audio Captioning
This task involves generating captions for in-the-wild audio scenes with complex auditory events.
- Abbreviation:
Wild AudioCaps
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
AudioCaps
- Number:
500
- SoTA Specialist:
PTAAC
- Metrics:
BLEU-1
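Both captioning tasks in this cluster are scored by unigram overlap between the generated caption and the reference captions. The sketch below assumes the standard sentence-level formulation (clipped unigram precision multiplied by a brevity penalty) with lowercase whitespace tokenization. The function name `bleu1` and the tokenization are illustrative assumptions; the benchmark may use a corpus-level variant or a different tokenizer.

```python
import math
from collections import Counter
from typing import List


def bleu1(candidate: str, references: List[str]) -> float:
    """Sentence-level unigram BLEU with brevity penalty.

    Assumption: tokens are produced by lowercasing and splitting on
    whitespace.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # For each token, the largest count observed in any single reference.
    max_ref_counts: Counter = Counter()
    for ref in refs:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)

    # Clip candidate counts so repeated tokens are not over-rewarded.
    clipped = sum(
        min(count, max_ref_counts[token])
        for token, count in Counter(cand).items()
    )
    precision = clipped / len(cand)

    # Reference length closest to the candidate length (ties: shorter).
    ref_len = min(
        (len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n)
    )
    brevity_penalty = (
        1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    )
    return brevity_penalty * precision
```

The clipping step keeps a caption that repeats one correct word from scoring highly, and the brevity penalty discourages very short captions that would otherwise reach perfect precision.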
A-C-7 General Audio Question Answering (Aud QA)
This cluster focuses on answering open-ended or fact-based questions from audio recordings, requiring reasoning and background knowledge.
- Open Audio Question Answering
This task involves answering diverse open-ended questions based on general audio input.
- Abbreviation:
OpenAQA
- Domain:
General
- Capability:
Reasoning Ability, Commonsense Knowledge
- Data Source:
OpenAQA (LTU)
- Number:
500
- SoTA Specialist:
LTU
- Metrics:
GPT-Score
- Audio Question Answering
This task focuses on answering fact-based or inference questions grounded in audio input.
- Abbreviation:
AudioQA
- Domain:
General
- Capability:
Reasoning Ability, Commonsense Knowledge
- Data Source:
ClothoAQA
- Number:
500
- SoTA Specialist:
MWAFM
- Metrics:
Acc
A-C-8 Animal Sound Analysis (Animal-Sound Det)
This cluster focuses on recognizing and analyzing various animal sounds for biological and commonsense understanding.
- Bird Sound Detection
This task focuses on detecting and recognizing bird sounds in audio recordings.
- Abbreviation:
Bird-Sound Detect
- Domain:
Biology, Animal
- Capability:
Content Recognition, Commonsense Knowledge
- Data Source:
Birdsong
- Number:
500
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
- Animal Sound Detection
This task focuses on detecting a broader range of animal sounds from audio data.
- Abbreviation:
AnimalSoundDetect
- Domain:
Biology, Animal
- Capability:
Content Recognition, Commonsense Knowledge
- Data Source:
Animal-Sound Classification
- Number:
500
- SoTA Specialist:
HuBERT Large
- Metrics:
Acc
A-C-9 Environment Sound Understanding (Envir-Sound Det)
This cluster focuses on understanding environmental sounds through recognition and reasoning over diverse audio contexts.
- Acoustic Scene Recognition
This task involves recognizing various acoustic scenes from environmental audio recordings.
- Abbreviation:
Acoustic-Scene Recog
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
TUT
- Number:
468
- SoTA Specialist:
CLAP
- Metrics:
Acc
- Environment Sound Recognition
This task involves recognizing general environmental sounds, such as thunder or footsteps, using commonsense understanding.
- Abbreviation:
Env-Sound Recog
- Domain:
General
- Capability:
Reasoning Ability, Commonsense Knowledge
- Data Source:
ESC50
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc
- Sound Event Recognition
This task focuses on identifying specific sound events occurring in audio scenes.
- Abbreviation:
Sound-Event Recog
- Domain:
General
- Capability:
Reasoning Ability
- Data Source:
ESC50
- Number:
500
- SoTA Specialist:
CLAP
- Metrics:
Acc