This repository contains three different approaches for segment-based audio classification
on the UrbanSound8K dataset:
- GOAL 1 – YAMNet Embeddings → LightGBM (Baseline)
- GOAL 2 – ESResNeXt-fbsp Fine-tuning
- GOAL 3 – AudioCLIP Fine-tuning
Accuracy and macro-F1 scores are reported at both segment level and clip level for each goal.
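Macro-F1 averages the per-class F1 scores with equal weight, which is why it can diverge from accuracy when classes are imbalanced. A minimal stdlib-only sketch of the metric (the function name is illustrative, not from this repo; the repo may use scikit-learn instead):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but wrong
            fn[t] += 1  # true class t was missed
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)
```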
## Setup

### Requirements

- Python 3.9+
- GPU recommended (NVIDIA A100/V100 ideal; required for Goals 2 & 3)
- ~8 GB RAM minimum
### Environment

```bash
# Extract from zip
unzip urbansound_segment_classification.zip
cd urbansound_segment_classification
pip install -r requirements.txt
```
### Dataset

```bash
mkdir data
# Place UrbanSound8K dataset here (fold1..fold10 + UrbanSound8K.csv)
```

Download: Kaggle link

```
data/
├── fold1/
├── ...
├── fold10/
└── UrbanSound8K.csv
```
## GOAL 1 — YAMNet + LightGBM (Baseline)

YAMNet is used as a frozen feature extractor: each segment produces a 1024-D embedding, which is then classified with LightGBM. This is not transfer learning in the fine-tuning sense, but feature extraction followed by tabular learning.
### Process

- Audio split into 0.96 s windows with 50% overlap
- 1024-D embeddings extracted using YAMNet
- Segment predictions with LightGBM
- Clip scores from averaged segment probabilities
- Train: folds 1–8, Val: fold 9, Test: fold 10
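The windowing and clip-aggregation steps above can be sketched in plain Python (hypothetical helper names, not the repo's actual code):

```python
def window_starts(duration_sec, win_sec=0.96, overlap=0.5):
    """Start times of fixed-length windows with the given fractional overlap."""
    hop = win_sec * (1.0 - overlap)  # 0.48 s hop for 50% overlap
    starts, t = [], 0.0
    while t + win_sec <= duration_sec + 1e-9:  # keep only full windows
        starts.append(round(t, 6))
        t += hop
    return starts

def clip_probs(segment_probs):
    """Average per-segment class-probability vectors into one clip-level vector."""
    n = len(segment_probs)
    return [sum(p[c] for p in segment_probs) / n
            for c in range(len(segment_probs[0]))]
```

For a standard 4 s UrbanSound8K clip this yields 7 overlapping segments; the clip label is the argmax of the averaged probabilities.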
### Run

```bash
python -m urbansound_segment_task.goals.goal1_yamnet_lgbm.run \
  --data_dir ./data \
  --out ./urbansound_segment_task/goals/goal1_yamnet_lgbm/results \
  --win_sec 0.96 --overlap 0.5 --seed 42
```
### Results
- Segment-level: accuracy = 0.744, macro-F1 = 0.758
- Clip-level: accuracy = 0.802, macro-F1 = 0.819
- Throughput: ~220 segments/s (GPU)
- Training time: ~15 minutes
## GOAL 2 — ESResNeXt-fbsp Fine-tuning

ESResNeXt-fbsp, pretrained on AudioSet, is fine-tuned on UrbanSound8K with the full backbone unfrozen, making this a true end-to-end deep learning approach.
### Setup

```bash
git clone --depth 1 https://github.com/AndreyGuzhov/ESResNeXt-fbsp.git ./external/ESResNeXt_fbsp
bash urbansound_segment_task/scripts/download_checkpoint_goal2.sh
```
### Run

```bash
python -m urbansound_segment_task.goals.goal2_esresnext.run_finetune \
  --data_dir ./data \
  --out ./urbansound_segment_task/goals/goal2_esresnext/results_finetune_v2 \
  --sr 44100 --win_sec 0.96 --overlap 0.5 \
  --batch_size 128 --warmup_epochs 4 --finetune_epochs 32 \
  --lr_head 1e-3 --lr_backbone 5e-5 --weight_decay 1e-4 \
  --label_smoothing 0.05 \
  --pretrained_ckpt ./urbansound_segment_task/goals/goal2_esresnext/checkpoints/ESResNeXtFBSP_AudioSet.pt
```
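The run above uses separate learning rates for the head (`1e-3`) and backbone (`5e-5`) plus a warmup phase. One plausible per-group schedule is linear warmup followed by cosine decay; this is an assumption for illustration, not read from the script:

```python
import math

def lr_at_epoch(epoch, base_lr, warmup_epochs=4, total_epochs=36):
    """Linear warmup to base_lr, then cosine decay toward zero.

    Applied independently to each parameter group, e.g.
    base_lr=1e-3 for the classifier head and 5e-5 for the backbone.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Keeping the backbone rate ~20x lower than the head's protects the AudioSet-pretrained features while the randomly initialized head catches up.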
### Results
- Segment-level: accuracy = 0.768, macro-F1 = 0.740
- Clip-level: accuracy = 0.841, macro-F1 = 0.836
- Training time: ~44 minutes (NVIDIA A100)
## GOAL 3 — AudioCLIP Fine-tuning

AudioCLIP (a multi-modal model pretrained jointly on audio, image, and text) is adapted to UrbanSound8K using only its audio branch.
### Setup

```bash
git clone --depth 1 https://github.com/AndreyGuzhov/AudioCLIP.git ./external/AudioCLIP
```
### Run

```bash
python -m urbansound_segment_task.goals.goal3_audioclip.run \
  --data_dir ./data \
  --out ./urbansound_segment_task/goals/goal3_audioclip/results_ft_q \
  --mode finetune --epochs 40 --warmup_epochs 6 \
  --bs 64 --accum_steps 2 --lr_backbone 2e-5 --lr_head 3e-4 \
  --label_smoothing 0.10 --early_stop 8 --tta_shifts 2 --seed 42
```
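The `--tta_shifts` flag points to test-time augmentation over time-shifted copies of each segment. A plain-Python sketch of that averaging (`tta_predict`, the shift size, and `predict_fn` are illustrative stand-ins, not the repo's API):

```python
def tta_predict(waveform, predict_fn, n_shifts=2, shift=400):
    """Average class probabilities over time-shifted copies of the input.

    predict_fn maps a waveform (sequence of samples) to a probability vector.
    """
    preds = []
    for k in range(-n_shifts, n_shifts + 1):
        # Positive k drops samples from the front, negative k from the back;
        # k == 0 is the unshifted input.
        shifted = waveform[k * shift:] if k >= 0 else waveform[:k * shift]
        preds.append(predict_fn(shifted))
    return [sum(p[c] for p in preds) / len(preds)
            for c in range(len(preds[0]))]
```

Averaging over small shifts smooths out sensitivity to exact window alignment at a modest inference-time cost.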
### Results
- Segment-level: accuracy = 0.769, macro-F1 = 0.755
- Clip-level: accuracy = 0.798, macro-F1 = 0.798
- Throughput: ~464 segments/s (fastest)
- Training time: ~60–90 minutes
## Comprehensive Comparison
| Metric | Goal 1 (YAMNet+LGB) | Goal 2 (ESResNeXt) | Goal 3 (AudioCLIP) |
|---|---|---|---|
| Segment Accuracy | 74.4% | 76.8% | 76.9% |
| Segment macro-F1 | 75.8% | 74.0% | 75.5% |
| Clip Accuracy | 80.2% | 84.1% | 79.8% |
| Clip macro-F1 | 81.9% | 83.6% | 79.8% |
| Inference Speed | ~220 seg/s | – | ~464 seg/s |
| Training Time | ~15 min | ~44 min | ~60–90 min |
| Trainable Params | ~134K | ~25M | ~134M |
| GPU Required | No* | Yes | Yes |
*Goal 1 can run without a GPU, but much more slowly.
## Key Insights
- Best Accuracy: Goal 2 (ESResNeXt) with 84.1% clip accuracy
- Best Speed/Accuracy Balance: Goal 1 (YAMNet+LGB)
- Fastest Inference: Goal 3 (AudioCLIP)
- Practical Choice: Goal 1 for most real-world use cases