Pure Audio, No Filters: WaveNet Magic
This article explores an implementation of WaveNets, a type of deep neural network first developed by Google in 2016 (Oord et al.) that can generate high-quality audio waveforms and has been used in applications including speech synthesis and music generation. The aim of this implementation is to determine whether raw audio waveforms can be used as direct inputs and outputs for a model, without conversion to mel-frequency cepstral coefficients (MFCCs).
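The core building block that lets a WaveNet model raw waveforms is the dilated causal convolution: each output sample depends only on current and past samples, spaced increasingly far apart as dilation grows. As a rough illustration of the idea (not this project's actual implementation, which would stack many such layers with gated activations and residual connections), a minimal NumPy sketch:

```python
import numpy as np

def dilated_causal_conv(x, weights, dilation):
    """Toy 1-D dilated causal convolution.

    Each output sample is a weighted sum of the current input and
    past inputs spaced `dilation` steps apart, so no future samples
    leak into the output. Hypothetical sketch for illustration only.
    """
    k = len(weights)
    pad = dilation * (k - 1)
    # Left-pad with zeros so the output is causal and the same length.
    x_padded = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            out[t] += weights[i] * x_padded[t + pad - i * dilation]
    return out

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Kernel [0.5, 0.5] with dilation 2 averages x[t] and x[t-2].
print(dilated_causal_conv(signal, [0.5, 0.5], dilation=2))
# → [0.5 1.  2.  3.  4. ]
```

Stacking such layers with dilations 1, 2, 4, 8, … is what gives WaveNets a receptive field large enough to model long-range structure in raw audio.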
Recap
The previous article explored audio generation by predicting MFCCs, a low-dimensional representation of audio data. Although model training and prediction were fast, the main constraint was converting the MFCCs back into audio waveforms, which led to distorted generated audio because of the significant information loss incurred when compressing to MFCCs.
Data
This project will train the WaveNet on 121 beatbox samples obtained via the data augmentation performed in the previous article.
Data Processing
These 121 samples were generated from an initial set of 11 files by shifting the pitch and stretching/compressing the time within audibly reasonable ranges, to represent people who might beatbox fast or slow, and who have deep or high voices.
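The time-stretch half of this augmentation can be sketched with plain NumPy resampling. This is only an illustrative, hypothetical version of the idea: unlike the phase-vocoder time stretch a library such as librosa provides, naive resampling also shifts the pitch, so a real pipeline would treat the two augmentations separately.

```python
import numpy as np

def stretch(audio, rate):
    """Naive time-stretch by linear-interpolation resampling.

    rate > 1 shortens the clip (a "fast" beatboxer),
    rate < 1 lengthens it (a "slow" beatboxer).
    Illustrative only: this also changes pitch, unlike a
    phase-vocoder stretch.
    """
    n_out = int(round(len(audio) / rate))
    old_idx = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(audio)), audio)

tone = np.sin(np.linspace(0, 2 * np.pi, 100))
fast = stretch(tone, rate=2.0)   # half the samples
slow = stretch(tone, rate=0.5)   # double the samples
print(len(fast), len(slow))  # → 50 200
```

Applying a grid of such rates (and pitch shifts) to each of the 11 source files is how a small recording set can be expanded into 121 training samples.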
The shortest audio length was found and used to truncate the other audio files at that point, in order to ensure that all audio files were the same length.
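The truncation step described above amounts to finding the minimum clip length and slicing every waveform to it, so all training examples share a fixed shape. A minimal sketch, using hypothetical random arrays in place of the loaded beatbox clips:

```python
import numpy as np

# Hypothetical stand-ins for loaded waveforms of different lengths.
clips = [np.random.randn(n) for n in (4410, 5000, 4800)]

# Truncate every clip to the shortest length so the training
# batch has a uniform shape.
min_len = min(len(c) for c in clips)
trimmed = [c[:min_len] for c in clips]

assert all(len(c) == min_len for c in trimmed)
print(min_len)  # → 4410
```

Truncating (rather than zero-padding to the longest clip) keeps the model from spending capacity on trailing silence, at the cost of discarding the tail of longer samples.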