Rachel Manzelli, Vijay Thakkar, Ali Siahkamari, and Brian Kulis
Department of Electrical and Computer Engineering
Boston University
Boston, MA


Existing automated music generation methods largely focus on generating music at the note level, producing symbolic outputs such as MIDI. These methods, such as those based on LSTM models, capture medium-scale effects in music, can produce melodies, and generate quickly. However, symbolic techniques cannot capture several desirable aspects of music. The emotion and feel of a piece, for instance, are not simple to control or modify with symbolic approaches. Adding computer-generated vocals is not possible in a fully symbolic setting, nor is creating new instruments or sound effects.

An alternative is to train on and produce raw audio directly by adapting speech synthesis models, resulting in a richer palette of potential musical outputs, albeit at a higher computational cost. WaveNet, a model developed at DeepMind and primarily targeted at speech synthesis, has been applied directly to music; the model is trained to predict the next sample of 8-bit audio (typically sampled at 16 kHz) given the previous samples. Initially, this was shown to produce rich piano music when trained directly on raw piano samples. Follow-up work has reduced generation times, used WaveNet-type architectures to generate new sounds (e.g., combining multiple instruments), and generated synthetic vocals for music. This approach, while very new, shows considerable potential for automated music generation. However, while WaveNet produces more realistic and interesting sounds, the model does not handle medium- or long-range dependencies such as melody.
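To make the prediction target concrete: the original WaveNet paper quantizes each sample into 256 classes using mu-law companding, which is one way to arrive at 8-bit audio. The sketch below assumes that scheme. The function names are illustrative, and the exact preprocessing used for the models on this page may differ.

```python
import math

MU = 255  # 256 quantization levels, i.e. 8 bits per sample


def mu_law_encode(x, mu=MU):
    """Map a waveform sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    # Logarithmic compression: fine resolution near zero, coarse near +/-1.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest class.
    return int((compressed + 1) / 2 * mu + 0.5)


def mu_law_decode(q, mu=MU):
    """Map an integer class in [0, mu] back to an approximate sample in [-1, 1]."""
    compressed = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


if __name__ == "__main__":
    for sample in (-1.0, -0.1, 0.0, 0.1, 1.0):
        q = mu_law_encode(sample)
        print(sample, q, round(mu_law_decode(q), 4))
```

Under this scheme, predicting the next sample becomes a 256-way classification problem, which is why the model can be trained with a softmax output.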

We believe that combining these two approaches will yield superior, increasingly realistic music generation and open the door to a host of new music generation tools. We explore the combination of LSTM components with WaveNet-style raw audio generation to yield models that generate realistic-sounding music with long-term melodic structure.

For details on our model architecture, we refer the reader to our corresponding paper, referenced at the bottom of the page.

Empirical Results

Unconditioned model

We first trained unconditioned WaveNet models on piano music from the MusicNet database and on Chopin's Nocturnes from YouTube, and generated unique music from those models. Examples of their raw audio output are included below. This model was trained for ~300,000 iterations.

Unconditioned Piano I

Raw Audio

Unconditioned Piano II

Raw Audio

The generations locally make sense in some areas, and some techniques and moods are learned. However, the music has no global structure.
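A rough calculation suggests why global structure is out of reach for an unconditioned model of this kind. WaveNet stacks dilated causal convolutions, so each output sample sees only a fixed window of past samples. The sketch below assumes a kernel size of 2 and a hypothetical schedule of dilations 1, 2, ..., 512 repeated three times; these values are illustrative and are not the configuration of the models above.

```python
SAMPLE_RATE = 16_000  # Hz, the sampling rate quoted earlier on this page


def receptive_field(dilations, kernel_size=2):
    """Number of past-and-present samples visible to one output sample."""
    return sum(d * (kernel_size - 1) for d in dilations) + 1


# Hypothetical schedule (an assumption): 1, 2, 4, ..., 512, repeated 3 times.
dilations = [2 ** i for i in range(10)] * 3

samples = receptive_field(dilations)
seconds = samples / SAMPLE_RATE

if __name__ == "__main__":
    print(samples, round(seconds, 3))  # 3070 samples, about 0.192 s
```

With a window of well under a second, such a model can shape timbre and short phrases, but it has no direct view of a melody that unfolds over many seconds.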

Editing Existing Audio

We evaluate our conditioned model for the purpose of editing existing raw audio; i.e., taking music that already exists, slightly editing its corresponding symbolic representation, and regenerating the audio from the conditioned model. This allows us to evaluate the expressive nature of the model and how the raw audio changes when the MIDI file is changed. The samples below demonstrate this. This conditioned raw audio model was trained for ~100,000 iterations on cello music.

Before Editing

MIDI       Raw Audio

After Editing

MIDI       Raw Audio
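The editing workflow above can be sketched in a few lines. This is a minimal illustration under stated assumptions: notes are hypothetical `(pitch, start_sec, end_sec)` tuples, the melody is monophonic, and the conditioning signal is one MIDI pitch value per audio sample (0 for silence). The actual conditioning format is described in the paper and may differ.

```python
SAMPLE_RATE = 16_000  # Hz, assumed to match the audio sampling rate


def to_conditioning(notes, duration, sample_rate=SAMPLE_RATE):
    """Render notes into one pitch value per audio sample (0 means silence)."""
    series = [0] * int(duration * sample_rate)
    for pitch, start, end in notes:
        lo = max(0, int(start * sample_rate))
        hi = min(int(end * sample_rate), len(series))
        for i in range(lo, hi):
            series[i] = pitch
    return series


def edit_note(notes, index, new_pitch):
    """Return a copy of notes with the pitch of one note replaced."""
    edited = list(notes)
    _, start, end = edited[index]
    edited[index] = (new_pitch, start, end)
    return edited


# Hypothetical two-note phrase: C3 then D3, half a second each.
original = [(48, 0.0, 0.5), (50, 0.5, 1.0)]
edited = edit_note(original, 1, 52)  # change the second note to E3

before = to_conditioning(original, duration=1.0)
after = to_conditioning(edited, duration=1.0)
changed = sum(a != b for a, b in zip(before, after))

if __name__ == "__main__":
    print(changed, "of", len(before), "conditioning samples changed")
```

Only the samples under the edited note change, so regenerating from the edited series shows how the model responds to that one modification.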

Popular melodies

We trained a model with our architecture on cello songs and their corresponding MIDI sequences obtained from the MusicNet database. This 31-layer network was trained for ~40,000 iterations. We fed 3 different time series to this model and obtained the following structured results. For each sample, the conditioning sequence is given first as a MIDI file, followed by the raw audio output.

C Major Scale

MIDI       Raw Audio

Happy Birthday

MIDI       Raw Audio

Though a bit noisy due to training artifacts and the shallowness of the network, the raw audio output closely follows the conditioning melody. We hope to train these networks further to obtain clearer results.
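For reference, a conditioning melody such as the C major scale is simply a sequence of MIDI note numbers. The sketch below builds a major scale from its whole-step and half-step pattern and converts note numbers to frequencies, assuming 12-tone equal temperament with A4 = 440 Hz. The default root of 60 (middle C) is an assumption; the register used for the sample above is not stated here.

```python
MAJOR_SCALE_STEPS = [2, 2, 1, 2, 2, 2, 1]  # semitone steps between scale degrees


def major_scale(root=60):
    """MIDI note numbers for one octave of a major scale, root to root."""
    notes = [root]
    for step in MAJOR_SCALE_STEPS:
        notes.append(notes[-1] + step)
    return notes


def midi_to_hz(note):
    """Frequency of a MIDI note number in equal temperament (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


if __name__ == "__main__":
    for note in major_scale(60):
        print(note, round(midi_to_hz(note), 2))
```

Comparing these target frequencies with the pitch of the generated audio is one simple way to check how closely the output follows the conditioning melody.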

Unique LSTM-generated Melodies

We then showcase the results of the end-to-end model, namely unique melodies from the biaxial LSTM fed to the conditioned raw audio model.

Cello Melody

MIDI       Raw Audio

Piano Melody

MIDI       Raw Audio

The cello melody was generated by an LSTM trained on the MusicNet Solo Cello dataset. Although the melody quickly becomes complex, the cello model is able to grasp the signal and replicate it (despite some slight squeaking, probably due to training artifacts). We note that although the MIDI fades to silence after about 6 seconds, the cello model fills that silence with its own note for the remaining 4 seconds of audio.

The piano melody was generated by an LSTM trained on the MusicNet Solo Piano dataset. Since it quickly becomes complex, with many notes playing at once, the cello model has difficulty following the conditioning signal in raw audio form. The model was trained on cello music and is thus not well suited to playing complex chords. However, it shows much potential, and we are currently working to improve both the quality of the LSTM generations and the ability of the raw audio model to generalize to and express any melody.


Rachel Manzelli, Vijay Thakkar, Ali Siahkamari, Brian Kulis. "Combining Deep Generative Raw Audio Models for Structured Automatic Music." Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2018.