Statistics of the Audio Comprehension Skills

This section presents detailed statistics for skills categorized under the Audio modality. The audio comprehension tasks span various domains, including speech understanding, acoustic perception, and auditory reasoning. These tasks challenge models to align auditory signals with linguistic, semantic, or emotional interpretations.

A-C-1 Speech Accent Understanding (Acnt Analy)

This cluster focuses on analyzing speaker-related characteristics of speech, such as accent, speaker identity, and vocal sound type.

Accent Classification

This task involves classifying the accent of a speaker from a speech recording.

Abbreviation:

Accent Recog

Domain:

Linguistics

Capability:

Reasoning Ability

Data Source:

Speech Accent Archive

Number:

500

SoTA Specialist:

CLAP

Metrics:

Acc
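
Most tasks in this section report Acc (accuracy). As a minimal sketch of how such a score could be computed, assuming each model output has already been mapped to a single label string: the function name `accuracy` and the case/whitespace normalization are illustrative assumptions, not the benchmark's actual scoring code.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the gold label.

    Assumption: matching is case- and whitespace-insensitive; the
    benchmark's real matching rule may differ.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0

    def normalize(text):
        return text.strip().lower()

    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Hypothetical accent labels, for illustration only.
    preds = ["English", "spanish ", "arabic"]
    gold = ["english", "spanish", "french"]
    print(accuracy(preds, gold))
```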

Accent Sex Classification

This task involves classifying the sex of the speaker from accented speech recordings.

Abbreviation:

Accent-Sex Recog

Domain:

Linguistics

Capability:

Reasoning Ability, Commonsense Knowledge

Data Source:

Speech Accent Archive

Number:

500

SoTA Specialist:

CLAP

Metrics:

Acc

Speaker Identification

This task focuses on identifying individuals from their voice samples.

Abbreviation:

Speaker ID

Domain:

Linguistics

Capability:

Reasoning Ability

Data Source:

VoxCeleb

Number:

279

SoTA Specialist:

HuBERT Large

Metrics:

Acc

Vocal Sound Classification

This task involves classifying types of vocal sounds such as humming or whispering.

Abbreviation:

Vocal-Sound Recog

Domain:

General

Capability:

Reasoning Ability

Data Source:

Vocal Sound

Number:

500

SoTA Specialist:

CLAP

Metrics:

Acc

A-C-2 Speech Content Understanding (Ctnt Analy)

This cluster focuses on understanding the semantic content and communicative intent in spoken language, including intent recognition, command interpretation, and event extraction.

Intent Classification

This task focuses on identifying the speaker’s intent from spoken commands.

Abbreviation:

Intent Recog

Domain:

General

Capability:

Reasoning Ability

Data Source:

Fluent Speech Commands

Number:

500

SoTA Specialist:

HuBERT Large

Metrics:

Acc

Speech Command

This task involves recognizing specific spoken commands from short utterances.

Abbreviation:

Speech Cmd

Domain:

General

Capability:

Content Recognition

Data Source:

speech-commands

Number:

500

SoTA Specialist:

HuBERT Large

Metrics:

Acc

Speech Event Extraction

This task focuses on extracting structured events from speech, combining content understanding and reasoning.

Abbreviation:

SpeechEE

Domain:

General

Capability:

Content Recognition, Reasoning Ability

Data Source:

CASIE

Number:

500

SoTA Specialist:

E2E (T5)

Metrics:

F1
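
The F1 metric for event extraction can be illustrated with a short sketch. It assumes that each utterance yields a set of extracted items (here hypothetical `(text, role)` tuples) and that the score is micro-averaged over exact matches; the source does not state which F1 variant is used, so this is one plausible reading rather than the benchmark's definition.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-utterance sets of extracted items.

    Assumption: an item counts as correct only on an exact match
    with a gold item from the same utterance.
    """
    tp = fp = fn = 0
    for pred_items, gold_items in zip(predicted, gold):
        pred_set, gold_set = set(pred_items), set(gold_items)
        tp += len(pred_set & gold_set)   # correctly extracted items
        fp += len(pred_set - gold_set)   # spurious extractions
        fn += len(gold_set - pred_set)   # missed gold items
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical extractions for two utterances.
    predicted = [[("attack", "trigger")],
                 [("breach", "trigger"), ("server", "argument")]]
    gold = [[("attack", "trigger")],
            [("breach", "trigger")]]
    print(micro_f1(predicted, gold))
```

In this example there are two true positives, one false positive, and no false negatives, so precision is 2/3 and recall is 1.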

A-C-3 Speech Emotion Understanding (SpeechEmo Analy)

This cluster focuses on analyzing emotional expressions in speech through affective audio analysis.

Speech Emotion Recognition

This task identifies the emotional state of the speaker from their voice.

Abbreviation:

Emotion Recog

Domain:

General

Capability:

Affective Analysis

Data Source:

IEMOCAP

Number:

500

SoTA Specialist:

WavLM Large

Metrics:

Acc

A-C-4 Music Understanding (Music Analy)

This cluster focuses on understanding music content, instruments, genres, and pitch through audio recognition.

Music Genre Classification

This task classifies the genre of a music piece based on audio signals.

Abbreviation:

Music-Genre Recog

Domain:

Art

Capability:

Content Recognition

Data Source:

Music Genre (Mus)

Number:

500

SoTA Specialist:

Musicset-Sup

Metrics:

Acc

Music Instrument Classification

This task classifies musical instruments present in an audio clip using commonsense and acoustic cues.

Abbreviation:

Instrument Recog

Domain:

Art

Capability:

Content Recognition, Commonsense Knowledge

Data Source:

NS. Instruments

Number:

500

SoTA Specialist:

HuBERT-base

Metrics:

Acc

Music Instrument Source Analysis

This task analyzes the source of instruments in music to understand their acoustic patterns.

Abbreviation:

Instrument-Source Anal

Domain:

Art

Capability:

Content Recognition

Data Source:

NS. Instruments Source

Number:

500

SoTA Specialist:

Musicset-ULarge

Metrics:

Acc

Music Pitch Analysis

This task focuses on analyzing pitch patterns in musical content.

Abbreviation:

Pitch Anal

Domain:

Art

Capability:

Content Recognition

Data Source:

NS. Pitch

Number:

500

SoTA Specialist:

Musicset-ULarge

Metrics:

Acc

A-C-5 Audio Technique Understanding (Aud-Tech Analy)

This cluster focuses on recognizing technical attributes of music performance, including note quality, vocal technique, and performer identity.

Note Qualities Analysis

This task analyzes note quality aspects such as timbre and clarity in musical performances.

Abbreviation:

Note-Quality Anal

Domain:

Art

Capability:

Content Recognition

Data Source:

NS. quality

Number:

500

SoTA Specialist:

HuBERT-base

Metrics:

Acc

Singer Identification

This task identifies individual singers from vocal recordings.

Abbreviation:

Singer ID

Domain:

Art

Capability:

Reasoning Ability

Data Source:

VocalSet

Number:

500

SoTA Specialist:

HuBERT-base

Metrics:

Acc

Vocal Technique Detection

This task detects vocal techniques used in singing, such as vibrato or falsetto.

Abbreviation:

Vocal-Tech Detect

Domain:

Art

Capability:

Reasoning Ability

Data Source:

VocalSet

Number:

500

SoTA Specialist:

HuBERT-base

Metrics:

Acc

A-C-6 Audio Content Understanding (Aud Analy)

This cluster focuses on reasoning about semantic content in longer audio recordings, such as story narration or spontaneous events.

Long Audio Captioning

This task generates captions for long-form audio content using reasoning over extended audio sequences.

Abbreviation:

Long AudioCaps

Domain:

General

Capability:

Reasoning Ability

Data Source:

Clotho Caption

Number:

500

SoTA Specialist:

PTAAC

Metrics:

BLEU-1
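
The captioning metric is presumably BLEU-1, i.e. unigram BLEU. A minimal sketch follows, assuming whitespace tokenization, lowercasing, and the standard brevity penalty; the function name `bleu1` is hypothetical, and the benchmark may rely on a library implementation with different tokenization or smoothing.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Unigram BLEU: clipped unigram precision times a brevity penalty.

    Assumptions: whitespace tokenization and lowercasing; no smoothing.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # For each token, the maximum count seen in any single reference.
    max_ref_counts = Counter()
    for ref in refs:
        for tok, count in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], count)

    # Clip candidate counts so repeated tokens are not over-rewarded.
    cand_counts = Counter(cand)
    clipped = sum(min(c, max_ref_counts[tok]) for tok, c in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


if __name__ == "__main__":
    # Hypothetical caption and reference, for illustration only.
    print(bleu1("a dog barks", ["a dog barks loudly"]))
```

The clipping step is what keeps a degenerate caption that repeats one common word from scoring well, and the brevity penalty discourages captions much shorter than the references.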

Wild Audio Captioning

This task involves generating captions for in-the-wild audio scenes with complex auditory events.

Abbreviation:

Wild AudioCaps

Domain:

General

Capability:

Reasoning Ability

Data Source:

AudioCaps

Number:

500

SoTA Specialist:

PTAAC

Metrics:

BLEU-1

A-C-7 General Audio Question Answering (Aud QA)

This cluster focuses on answering open-ended or fact-based questions from audio recordings, requiring reasoning and background knowledge.

Open Audio Question Answering

This task involves answering diverse open-ended questions based on general audio input.

Abbreviation:

OpenAQA

Domain:

General

Capability:

Reasoning Ability, Commonsense Knowledge

Data Source:

OpenAQA (LTU)

Number:

500

SoTA Specialist:

LTU

Metrics:

GPT-Score

Audio Question Answering

This task focuses on answering fact-based or inference questions grounded in audio input.

Abbreviation:

AudioQA

Domain:

General

Capability:

Reasoning Ability, Commonsense Knowledge

Data Source:

ClothoAQA

Number:

500

SoTA Specialist:

MWAFM

Metrics:

Acc

A-C-8 Animal Sound Analysis (Animal-Sound Det)

This cluster focuses on recognizing and analyzing various animal sounds for biological and commonsense understanding.

Bird Sound Detection

This task focuses on detecting and recognizing bird sounds in audio recordings.

Abbreviation:

Bird-Sound Detect

Domain:

Biology, Animal

Capability:

Content Recognition, Commonsense Knowledge

Data Source:

Birdsong

Number:

500

SoTA Specialist:

HuBERT Large

Metrics:

Acc

Animal Sound Detection

This task focuses on detecting a broader range of animal sounds from audio data.

Abbreviation:

AnimalSoundDetect

Domain:

Biology, Animal

Capability:

Content Recognition, Commonsense Knowledge

Data Source:

Animal-Sound Classification

Number:

500

SoTA Specialist:

HuBERT Large

Metrics:

Acc

A-C-9 Environment Sound Understanding (Envir-Sound Det)

This cluster focuses on understanding environmental sounds through recognition and reasoning over diverse audio contexts.

Acoustic Scene Recognition

This task involves recognizing various acoustic scenes from environmental audio recordings.

Abbreviation:

Acoustic-Scene Recog

Domain:

General

Capability:

Reasoning Ability

Data Source:

TUT

Number:

468

SoTA Specialist:

CLAP

Metrics:

Acc

Environment Sound Recognition

This task involves recognizing general environmental sounds, such as thunder or footsteps, using commonsense understanding.

Abbreviation:

Env-Sound Recog

Domain:

General

Capability:

Reasoning Ability, Commonsense Knowledge

Data Source:

ESC50

Number:

500

SoTA Specialist:

CLAP

Metrics:

Acc

Sound Event Recognition

This task focuses on identifying specific sound events occurring in audio scenes.

Abbreviation:

Sound-Event Recog

Domain:

General

Capability:

Reasoning Ability

Data Source:

ESC50

Number:

500

SoTA Specialist:

CLAP

Metrics:

Acc