{"id":194,"date":"2025-10-03T21:31:55","date_gmt":"2025-10-03T18:31:55","guid":{"rendered":"https:\/\/www.mkensari.com\/?page_id=194"},"modified":"2025-10-03T21:31:55","modified_gmt":"2025-10-03T18:31:55","slug":"urbansound8k-sound-classification","status":"publish","type":"page","link":"https:\/\/www.mkensari.com\/index.php\/urbansound8k-sound-classification\/","title":{"rendered":"UrbanSound8K Segment-Based Classification"},"content":{"rendered":"<section class=\"research-paper\">\n<p>\n    This repository contains three different approaches for segment-based audio classification<br \/>\n    on the <b>UrbanSound8K<\/b> dataset:\n  <\/p>\n<ol>\n<li><b>GOAL 1<\/b> \u2013 YAMNet Embeddings \u2192 LightGBM (Baseline)<\/li>\n<li><b>GOAL 2<\/b> \u2013 ESResNeXt-fbsp Fine-tuning<\/li>\n<li><b>GOAL 3<\/b> \u2013 AudioCLIP Fine-tuning<\/li>\n<\/ol>\n<p>\n    Accuracy and macro-F1 scores are reported at both segment level and clip level for each goal.\n  <\/p>\n<h2>Setup<\/h2>\n<h3>Requirements<\/h3>\n<ul>\n<li>Python 3.9+<\/li>\n<li>GPU recommended (NVIDIA A100\/V100 ideal, required for Goals 2 &#038; 3)<\/li>\n<li>~8GB RAM minimum<\/li>\n<\/ul>\n<h3>Environment<\/h3>\n<pre><code># Extract from zip\nunzip urbansound_segment_classification.zip\ncd urbansound_segment_classification\npip install -r requirements.txt\n<\/code><\/pre>\n<h3>Dataset<\/h3>\n<pre><code>mkdir data\n# Place UrbanSound8K dataset here (fold1..fold10 + UrbanSound8K.csv)\n<\/code><\/pre>\n<p>\n    Download: <a href=\"https:\/\/www.kaggle.com\/datasets\/chrisfilo\/urbansound8k\" target=\"_blank\">Kaggle link<\/a>\n  <\/p>\n<pre><code>data\/\n\u251c\u2500\u2500 fold1\/\n\u251c\u2500\u2500 ...\n\u251c\u2500\u2500 fold10\/\n\u2514\u2500\u2500 UrbanSound8K.csv\n<\/code><\/pre>\n<hr\/>\n<h2>GOAL 1 \u2014 YAMNet + LightGBM (Baseline)<\/h2>\n<p>\n    YAMNet is used as a <b>fixed feature extractor<\/b>. Each segment produces a 1024-D embedding,<br \/>\n    which is classified with LightGBM. 
This is not transfer learning, but feature extraction + tabular learning.\n  <\/p>\n<h3>Process<\/h3>\n<ul>\n<li>Audio split into <b>0.96s windows<\/b> with 50% overlap<\/li>\n<li>1024-D embeddings extracted using YAMNet<\/li>\n<li>Segment predictions with LightGBM<\/li>\n<li>Clip scores from averaged segment probabilities<\/li>\n<li>Train: Folds 1-8, Val: Fold 9, Test: Fold 10<\/li>\n<\/ul>\n<h3>Run<\/h3>\n<pre><code>python -m urbansound_segment_task.goals.goal1_yamnet_lgbm.run \\\n  --data_dir .\/data \\\n  --out .\/urbansound_segment_task\/goals\/goal1_yamnet_lgbm\/results \\\n  --win_sec 0.96 --overlap 0.5 --seed 42\n<\/code><\/pre>\n<h3>Results<\/h3>\n<ul>\n<li>Segment-level: accuracy = 0.744, macro-F1 = 0.758<\/li>\n<li>Clip-level: accuracy = 0.802, macro-F1 = 0.819<\/li>\n<li>Throughput: ~220 segments\/s (GPU)<\/li>\n<li>Training time: ~15 minutes<\/li>\n<\/ul>\n<hr\/>\n<h2>GOAL 2 \u2014 ESResNeXt-fbsp Fine-tuning<\/h2>\n<p>\n    ESResNeXt-fbsp pretrained on AudioSet is fine-tuned on UrbanSound8K using full backbone training.<br \/>\n    This is a true end-to-end deep learning approach.\n  <\/p>\n<h3>Setup<\/h3>\n<pre><code>git clone --depth 1 https:\/\/github.com\/AndreyGuzhov\/ESResNeXt-fbsp.git .\/external\/ESResNeXt_fbsp\nbash urbansound_segment_task\/scripts\/download_checkpoint_goal2.sh\n<\/code><\/pre>\n<h3>Run<\/h3>\n<pre><code>python -m urbansound_segment_task.goals.goal2_esresnext.run_finetune \\\n  --data_dir .\/data \\\n  --out .\/urbansound_segment_task\/goals\/goal2_esresnext\/results_finetune_v2 \\\n  --sr 44100 --win_sec 0.96 --overlap 0.5 \\\n  --batch_size 128 --warmup_epochs 4 --finetune_epochs 32 \\\n  --lr_head 1e-3 --lr_backbone 5e-5 --weight_decay 1e-4 \\\n  --label_smoothing 0.05 \\\n  --pretrained_ckpt .\/urbansound_segment_task\/goals\/goal2_esresnext\/checkpoints\/ESResNeXtFBSP_AudioSet.pt\n<\/code><\/pre>\n<h3>Results<\/h3>\n<ul>\n<li>Segment-level: accuracy = 0.768, macro-F1 = 0.740<\/li>\n<li>Clip-level: accuracy = 
<b>0.841<\/b>, macro-F1 = <b>0.836<\/b><\/li>\n<li>Training time: ~44 minutes (NVIDIA A100)<\/li>\n<\/ul>\n<hr\/>\n<h2>GOAL 3 \u2014 AudioCLIP Fine-tuning<\/h2>\n<p>\n    AudioCLIP (multi-modal pretrained model) is adapted to UrbanSound8K using only the audio branch.\n  <\/p>\n<h3>Setup<\/h3>\n<pre><code>git clone --depth 1 https:\/\/github.com\/AndreyGuzhov\/AudioCLIP.git .\/external\/AudioCLIP\n<\/code><\/pre>\n<h3>Run<\/h3>\n<pre><code>python -m urbansound_segment_task.goals.goal3_audioclip.run \\\n  --data_dir .\/data \\\n  --out .\/urbansound_segment_task\/goals\/goal3_audioclip\/results_ft_q \\\n  --mode finetune --epochs 40 --warmup_epochs 6 \\\n  --bs 64 --accum_steps 2 --lr_backbone 2e-5 --lr_head 3e-4 \\\n  --label_smoothing 0.10 --early_stop 8 --tta_shifts 2 --seed 42\n<\/code><\/pre>\n<h3>Results<\/h3>\n<ul>\n<li>Segment-level: accuracy = 0.769, macro-F1 = 0.755<\/li>\n<li>Clip-level: accuracy = 0.798, macro-F1 = 0.798<\/li>\n<li>Throughput: <b>~464 segments\/s<\/b> (fastest)<\/li>\n<li>Training time: ~60\u201390 minutes<\/li>\n<\/ul>\n<hr\/>\n<h2>Comprehensive Comparison<\/h2>\n<table border=\"1\" cellpadding=\"6\" cellspacing=\"0\">\n<tr>\n<th>Metric<\/th>\n<th>Goal 1 (YAMNet+LGB)<\/th>\n<th>Goal 2 (ESResNeXt)<\/th>\n<th>Goal 3 (AudioCLIP)<\/th>\n<\/tr>\n<tr>\n<td>Segment Accuracy<\/td>\n<td>74.4%<\/td>\n<td>76.8%<\/td>\n<td><b>76.9%<\/b><\/td>\n<\/tr>\n<tr>\n<td>Segment macro-F1<\/td>\n<td><b>75.8%<\/b><\/td>\n<td>74.0%<\/td>\n<td>75.5%<\/td>\n<\/tr>\n<tr>\n<td>Clip Accuracy<\/td>\n<td>80.2%<\/td>\n<td><b>84.1%<\/b><\/td>\n<td>79.8%<\/td>\n<\/tr>\n<tr>\n<td>Clip macro-F1<\/td>\n<td>81.9%<\/td>\n<td><b>83.6%<\/b><\/td>\n<td>79.8%<\/td>\n<\/tr>\n<tr>\n<td>Inference Speed<\/td>\n<td>~220 seg\/s<\/td>\n<td>&#8211;<\/td>\n<td><b>~464 seg\/s<\/b><\/td>\n<\/tr>\n<tr>\n<td>Training Time<\/td>\n<td>~15 min<\/td>\n<td>~44 min<\/td>\n<td>~60\u201390 min<\/td>\n<\/tr>\n<tr>\n<td>Trainable Params<\/td>\n<td>~134K<\/td>\n<td>~25M<\/td>\n<td>~134M<\/td>\n<\/tr>\n<tr>\n<td>GPU Required<\/td>\n<td>No*<\/td>\n<td>Yes<\/td>\n<td>Yes<\/td>\n<\/tr>\n<\/table>\n<p>\n    *Goal 1 can run without a GPU, but much more slowly.\n  <\/p>\n<h2>Key Insights<\/h2>\n<ul>\n<li><b>Best Accuracy:<\/b> Goal 2 (ESResNeXt) with 84.1% clip accuracy and 83.6% clip macro-F1<\/li>\n<li><b>Best Speed\/Accuracy Balance:<\/b> Goal 1 (YAMNet+LGB): 80.2% clip accuracy from ~15 minutes of training and only ~134K trainable parameters<\/li>\n<li><b>Fastest Inference:<\/b> Goal 3 (AudioCLIP) at ~464 segments\/s<\/li>\n<li><b>Practical Choice:<\/b> Goal 1 for most real-world use cases, since it needs no GPU and trains in minutes<\/li>\n<\/ul>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>This repository contains three different approaches for segment-based audio classification on the UrbanSound8K dataset: GOAL 1 \u2013 YAMNet Embeddings \u2192 LightGBM (Baseline) GOAL 2 \u2013 ESResNeXt-fbsp Fine-tuning GOAL 3 \u2013 AudioCLIP Fine-tuning Accuracy and macro-F1 scores are reported at both segment level and clip level for each goal. Setup Requirements Python 3.9+ GPU recommended (NVIDIA 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-194","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.mkensari.com\/index.php\/wp-json\/wp\/v2\/pages\/194","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mkensari.com\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.mkensari.com\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.mkensari.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mkensari.com\/index.php\/wp-json\/wp\/v2\/comments?post=194"}],"version-history":[{"count":2,"href":"https:\/\/www.mkensari.com\/index.php\/wp-json\/wp\/v2\/pages\/194\/revisions"}],"predecessor-version":[{"id":196,"href":"https:\/\/www.mkensari.com\/index.php\/wp-json\/wp\/v2\/pages\/194\/revisions\/196"}],"wp:attachment":[{"href":"https:\/\/www.mkensari.com\/index.php\/wp-json\/wp\/v2\/media?parent=194"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}