UrbanSound8K Segment-Based Classification

This repository contains three different approaches for segment-based audio classification
on the UrbanSound8K dataset:

  1. GOAL 1 – YAMNet Embeddings → LightGBM (Baseline)
  2. GOAL 2 – ESResNeXt-fbsp Fine-tuning
  3. GOAL 3 – AudioCLIP Fine-tuning

Accuracy and macro-F1 scores are reported at both segment level and clip level for each goal.

Setup

Requirements

  • Python 3.9+
  • GPU recommended for all goals (e.g. NVIDIA A100/V100); required for Goals 2 & 3
  • ~8GB RAM minimum

Environment

# Extract from zip
unzip urbansound_segment_classification.zip
cd urbansound_segment_classification
pip install -r requirements.txt

Dataset

mkdir data
# Place UrbanSound8K dataset here (fold1..fold10 + UrbanSound8K.csv)

Download: the UrbanSound8K dataset is available on Kaggle.

data/
├── fold1/
├── ...
├── fold10/
└── UrbanSound8K.csv
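All three goals use the same fold-based split (train: folds 1-8, val: fold 9, test: fold 10). A minimal sketch of that split against the standard UrbanSound8K.csv columns (`slice_file_name`, `fold`, `classID`); the helper name is hypothetical, not the repo's actual API:

```python
import csv
import io

def split_by_fold(rows):
    """Split UrbanSound8K metadata rows into train/val/test by fold number."""
    train = [r for r in rows if 1 <= int(r["fold"]) <= 8]
    val = [r for r in rows if int(r["fold"]) == 9]
    test = [r for r in rows if int(r["fold"]) == 10]
    return train, val, test

# Toy stand-in for UrbanSound8K.csv with its standard column names
meta = io.StringIO("slice_file_name,fold,classID\na.wav,1,0\nb.wav,9,2\nc.wav,10,5\n")
train, val, test = split_by_fold(list(csv.DictReader(meta)))
```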

GOAL 1 — YAMNet + LightGBM (Baseline)

YAMNet is used as a fixed feature extractor: each segment produces a 1024-D embedding,
which is then classified with LightGBM. This is feature extraction plus tabular learning rather than end-to-end fine-tuning.

Process

  • Audio split into 0.96s windows with 50% overlap
  • 1024-D embeddings extracted using YAMNet
  • Segment predictions with LightGBM
  • Clip scores from averaged segment probabilities
  • Train: Folds 1-8, Val: Fold 9, Test: Fold 10
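The windowing step above can be sketched as follows (the function name is hypothetical; the real pipeline also resamples to 16 kHz mono, as YAMNet expects):

```python
import numpy as np

def segment_audio(waveform, sr, win_sec=0.96, overlap=0.5):
    """Split a mono waveform into fixed-length windows with overlap.

    Clips shorter than one window yield a single truncated segment."""
    win = int(round(win_sec * sr))
    hop = int(round(win * (1 - overlap)))
    segments = [
        waveform[start:start + win]
        for start in range(0, max(len(waveform) - win, 0) + 1, hop)
    ]
    return np.stack(segments)

# 4 seconds of audio at 16 kHz -> 0.96 s windows with 50% overlap
x = np.zeros(4 * 16000)
segs = segment_audio(x, 16000)
```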

Run

python -m urbansound_segment_task.goals.goal1_yamnet_lgbm.run \
  --data_dir ./data \
  --out ./urbansound_segment_task/goals/goal1_yamnet_lgbm/results \
  --win_sec 0.96 --overlap 0.5 --seed 42
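The clip-level metrics reported below come from averaging the per-segment class probabilities within each clip; a minimal sketch of that aggregation (the function name is hypothetical):

```python
import numpy as np

def clip_probabilities(segment_probs, clip_ids):
    """Average per-segment class probabilities within each clip."""
    clip_ids = np.asarray(clip_ids)
    return {
        cid: segment_probs[clip_ids == cid].mean(axis=0)
        for cid in np.unique(clip_ids)
    }

# Two segments of clip "c1" and one segment of clip "c2"
probs = np.array([[0.8, 0.2], [0.6, 0.4], [0.0, 1.0]])
clip_scores = clip_probabilities(probs, ["c1", "c1", "c2"])
```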

Results

  • Segment-level: accuracy = 0.744, macro-F1 = 0.758
  • Clip-level: accuracy = 0.802, macro-F1 = 0.819
  • Throughput: ~220 segments/s (GPU)
  • Training time: ~15 minutes

GOAL 2 — ESResNeXt-fbsp Fine-tuning

ESResNeXt-fbsp pretrained on AudioSet is fine-tuned on UrbanSound8K using full backbone training.
This is a true end-to-end deep learning approach.

Setup

git clone --depth 1 https://github.com/AndreyGuzhov/ESResNeXt-fbsp.git ./external/ESResNeXt_fbsp
bash urbansound_segment_task/scripts/download_checkpoint_goal2.sh

Run

python -m urbansound_segment_task.goals.goal2_esresnext.run_finetune \
  --data_dir ./data \
  --out ./urbansound_segment_task/goals/goal2_esresnext/results_finetune_v2 \
  --sr 44100 --win_sec 0.96 --overlap 0.5 \
  --batch_size 128 --warmup_epochs 4 --finetune_epochs 32 \
  --lr_head 1e-3 --lr_backbone 5e-5 --weight_decay 1e-4 \
  --label_smoothing 0.05 \
  --pretrained_ckpt ./urbansound_segment_task/goals/goal2_esresnext/checkpoints/ESResNeXtFBSP_AudioSet.pt
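The --label_smoothing 0.05 flag softens the one-hot training targets, which regularizes the fine-tuned backbone. A minimal NumPy sketch of what the smoothed targets look like (the helper name is hypothetical, not part of the repo):

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.05):
    """Smoothed one-hot targets: (1 - eps) on the true class,
    plus eps spread uniformly over all classes."""
    onehot = np.eye(num_classes)[labels]
    return (1.0 - eps) * onehot + eps / num_classes

# True class 3 of 10 gets 0.95 + 0.005 = 0.955; every other class gets 0.005
targets = smooth_labels(np.array([3]), num_classes=10, eps=0.05)
```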

Results

  • Segment-level: accuracy = 0.768, macro-F1 = 0.740
  • Clip-level: accuracy = 0.841, macro-F1 = 0.836
  • Training time: ~44 minutes (NVIDIA A100)

GOAL 3 — AudioCLIP Fine-tuning

AudioCLIP (a multi-modal model pretrained jointly on audio, images, and text) is adapted to UrbanSound8K using only its audio branch.

Setup

git clone --depth 1 https://github.com/AndreyGuzhov/AudioCLIP.git ./external/AudioCLIP

Run

python -m urbansound_segment_task.goals.goal3_audioclip.run \
  --data_dir ./data \
  --out ./urbansound_segment_task/goals/goal3_audioclip/results_ft_q \
  --mode finetune --epochs 40 --warmup_epochs 6 \
  --bs 64 --accum_steps 2 --lr_backbone 2e-5 --lr_head 3e-4 \
  --label_smoothing 0.10 --early_stop 8 --tta_shifts 2 --seed 42
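The --tta_shifts 2 flag averages predictions over time-shifted copies of each input at test time. A hedged sketch of that idea; the function name, shift size, and predict_fn interface are assumptions, not the repo's actual API:

```python
import numpy as np

def tta_predict(predict_fn, waveform, n_shifts=2, shift=1):
    """Average class predictions over the original waveform and
    n_shifts circularly time-shifted copies (test-time augmentation)."""
    preds = [predict_fn(waveform)]
    for k in range(1, n_shifts + 1):
        preds.append(predict_fn(np.roll(waveform, k * shift)))
    return np.mean(preds, axis=0)

# Dummy predictor that just reads the first sample, to show the averaging
predict_fn = lambda w: np.array([w[0]])
result = tta_predict(predict_fn, np.arange(8, dtype=float), n_shifts=2, shift=1)
```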

Results

  • Segment-level: accuracy = 0.769, macro-F1 = 0.755
  • Clip-level: accuracy = 0.798, macro-F1 = 0.798
  • Throughput: ~464 segments/s (fastest)
  • Training time: ~60–90 minutes

Comprehensive Comparison

Metric           | Goal 1 (YAMNet+LGB) | Goal 2 (ESResNeXt) | Goal 3 (AudioCLIP)
-----------------|---------------------|--------------------|-------------------
Segment Accuracy | 74.4%               | 76.8%              | 76.9%
Segment macro-F1 | 75.8%               | 74.0%              | 75.5%
Clip Accuracy    | 80.2%               | 84.1%              | 79.8%
Clip macro-F1    | 81.9%               | 83.6%              | 79.8%
Inference Speed  | ~220 seg/s          | not reported       | ~464 seg/s
Training Time    | ~15 min             | ~44 min            | ~60–90 min
Trainable Params | ~134K               | ~25M               | ~134M
GPU Required     | No*                 | Yes                | Yes

*Goal 1 can run without GPU, but much slower.

Key Insights

  • Best Accuracy: Goal 2 (ESResNeXt) with 84.1% clip accuracy
  • Best Speed/Accuracy Balance: Goal 1 (YAMNet+LGB), which trains fastest (~15 min), has the fewest trainable parameters, and stays within 4 points of the best clip accuracy
  • Fastest Inference: Goal 3 (AudioCLIP) at ~464 segments/s
  • Practical Choice: Goal 1 for most real-world use cases, since it is the only approach that does not strictly require a GPU