Pure Audio, No Filters: WaveNet Magic

Raphael Khalid
10 min read · Jun 1, 2024


This article explores an implementation of WaveNets, which were first developed by Google in 2016 (Oord et al.). WaveNets are deep neural networks that generate high-quality raw audio waveforms and have been used in a variety of applications, including speech synthesis and music generation. The aim of this implementation is to determine whether raw audio waveforms can be used as direct inputs and outputs for a model, without conversion to mel-frequency cepstral coefficients (MFCCs).
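The building block that lets WaveNet consume raw samples directly is the dilated causal convolution: each output sample depends only on the current and past inputs, and stacking layers with growing dilation widens the receptive field exponentially. The following is a minimal numpy sketch of a single such layer (a hypothetical illustration, not the full gated architecture from Oord et al.):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...
    Left-pads with zeros so the output has the same length as the input."""
    k = len(w)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.arange(8, dtype=float)
w = np.array([1.0, 1.0])                     # kernel of size 2
y1 = causal_dilated_conv(x, w, dilation=1)   # y1[t] = x[t] + x[t-1]
y2 = causal_dilated_conv(x, w, dilation=2)   # y2[t] = x[t] + x[t-2]
```

With kernel size 2, doubling the dilation at each layer (1, 2, 4, 8, …) is what gives WaveNet a receptive field of thousands of samples while keeping the layer count modest.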

Recap

The previous article explored audio generation by generating MFCCs, a low-dimensional representation of audio data. Though model training and prediction were rapid, the main constraint was converting the MFCCs back into audio waveforms, which led to distorted generated audio, given the significant information loss when compressing to MFCCs.

Data

This project trains the WaveNet on 121 beatbox samples obtained from the data augmentation done in the previous article.

Data Processing

These 121 samples were generated from an initial set of 11 files by shifting pitch and stretching/squashing time within audibly reasonable ranges: to represent people who might beatbox fast or slow, and who have deep or high voices.
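In practice, pitch shifting and time stretching are typically done with a library such as librosa (`librosa.effects.pitch_shift` / `librosa.effects.time_stretch`). As a dependency-free illustration of the idea, the sketch below uses simple resampling, which changes speed and pitch together (like a tape played faster or slower); the rates and the fake waveform are placeholder assumptions:

```python
import numpy as np

def speed_change(y, rate):
    """Tape-speed style augmentation: rate > 1 squashes the clip (faster,
    higher-pitched), rate < 1 stretches it (slower, deeper). A crude
    stand-in for independent pitch-shift / time-stretch controls."""
    n_out = int(round(len(y) / rate))
    old_positions = np.linspace(0, len(y) - 1, num=n_out)
    return np.interp(old_positions, np.arange(len(y)), y)

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)  # 1 s of placeholder audio at 16 kHz
augmented = [speed_change(y, r) for r in (0.8, 0.9, 1.1, 1.25)]
```

Each original clip yields several variants this way, which is how a small set of recordings can be expanded roughly tenfold.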

The shortest audio length was found and used to cut off the other audio files at that point, in order to ensure that all audio files were…
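Trimming every clip to the shortest length can be sketched as follows (a minimal illustration with placeholder clip lengths, assuming the waveforms are already loaded as 1-D arrays):

```python
import numpy as np

def trim_to_shortest(waveforms):
    """Cut every clip to the length of the shortest one so the whole
    dataset can be stacked into a single fixed-length batch."""
    min_len = min(len(w) for w in waveforms)
    return np.stack([w[:min_len] for w in waveforms])

clips = [np.zeros(12000), np.zeros(9000), np.zeros(15000)]
batch = trim_to_shortest(clips)
# batch.shape -> (3, 9000)
```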


Written by Raphael Khalid

Bachelors in CS & Political Science @ Minerva University | Teacher | Machine Learning & Urban Slum Researcher