Real-Time Target Sound Extraction

Bibliographic Details
Title: Real-Time Target Sound Extraction
Authors: Veluri, Bandhav; Chan, Justin; Itani, Malek; Chen, Tuochao; Yoshioka, Takuya; Gollakota, Shyamnath
Publication Year: 2022
Collection: Computer Science
Subject Terms: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
More Details: We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show improvements of 2.2-3.3 dB in SI-SNRi over prior models for this task, with a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
Comment: ICASSP 2023 camera-ready
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2211.02250
Accession Number: edsarx.2211.02250
Database: arXiv
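
Illustrative note: the abstract credits dilated causal convolutions with covering large receptive fields efficiently. The sketch below is not the Waveformer code; it is a minimal pure-Python illustration of that general idea. The kernel size (3), the doubling dilation schedule (1, 2, 4, 8), and all function names are assumptions chosen for the example and are not taken from the paper.

```python
from typing import List, Sequence


def causal_dilated_conv1d(x: Sequence[float], kernel: Sequence[float],
                          dilation: int) -> List[float]:
    """Compute y[t] = sum_j kernel[j] * x[t - j * dilation].

    Samples before the start of the signal are treated as zero, so the
    output at time t never depends on inputs later than t (causality).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def run_stack(x: Sequence[float], kernel: Sequence[float],
              dilations: Sequence[int]) -> List[float]:
    """Apply one causal dilated convolution per entry in `dilations`."""
    out = list(x)
    for d in dilations:
        out = causal_dilated_conv1d(out, kernel, d)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    # Assumed (hypothetical) configuration: 4 layers, kernel size 3.
    dilations = [1, 2, 4, 8]
    kernel = [1.0, 1.0, 1.0]

    # Feed a unit impulse at t = 5 and see which outputs it reaches.
    impulse = [0.0] * 40
    impulse[5] = 1.0
    out = run_stack(impulse, kernel, dilations)
    touched = [t for t, v in enumerate(out) if v != 0.0]

    print("receptive field:", receptive_field(len(kernel), dilations))
    print("impulse at t=5 affects outputs", touched[0], "to", touched[-1])
```

With a doubling schedule the receptive field grows exponentially with depth while the per-layer cost stays fixed, which is the efficiency argument the abstract makes in general terms. The impulse never affects outputs before t = 5, which is the property a streaming model needs.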