This repository contains code, notebooks and a GUI application for exploring and experimenting with various deep learning architectures to generate piano music. The experiments include recurrent architectures (GRU, LSTM), encoder-decoder models, transformer-based fine-tuning (DistilGPT-2), and GAN-based approaches.
Here is a sample generated by the DistilGPT-2 model:
- Project Overview
- Features
- Installation
- Usage
- Data Representation
- Model Architectures
- Results
- Future Work
This project investigates several families of deep learning models:
- Recurrent Neural Networks (GRU, LSTM) in many-to-one, many-to-many and encoder-decoder setups using piano-roll data representation
- Transformer-based fine-tuning of DistilGPT-2 with musical tokenization (REMI)
- Generative Adversarial Networks (GAN), combining an LSTM-based generator and discriminator

Each approach is implemented in a separate Jupyter notebook that walks through the experiments step by step.
- Preprocessing routines for piano-roll data representation
- Implementation of multiple RNN-based architectures
- Encoder-Decoder model with a bidirectional LSTM encoder and teacher forcing in the decoder
- Fine-tuning DistilGPT-2 on musical tokens
- GAN experiments following the C-RNN-GAN design
- Objective and subjective evaluation scripts for pitch diversity, rhythmic consistency, note density and audio quality
- Interactive GUI to generate and listen to piano sequences
To run the application, follow these steps:
- Clone the repository and navigate to the "App/" directory:
  git clone https://github.com/GecataGoranov/Piano_Generation_with_GUI.git
  cd Piano_Generation_with_GUI/App
- Install the necessary packages:
  - On Debian/Ubuntu systems:
    sudo apt install git-lfs ffmpeg fluidsynth
  - On RedHat/Fedora systems:
    sudo dnf install git-lfs ffmpeg fluidsynth
  - On Arch systems:
    sudo pacman -S git-lfs ffmpeg fluidsynth
  - On Windows:
    - Using Chocolatey (recommended):
      choco install git-lfs ffmpeg fluidsynth -y
    - Using Winget:
      winget install --id=FFmpeg.FFmpeg
      winget install --id=Fluidsynth.Fluidsynth
      winget install --id=Github.GitLFS
  - On macOS:
    - Using Homebrew (recommended):
      brew install git-lfs ffmpeg fluidsynth
    - Using MacPorts:
      sudo port install git-lfs ffmpeg fluidsynth
- Download the large files:
  git lfs install
  git lfs pull
- Create and activate a Python environment:
  python3 -m venv .venv
  source .venv/bin/activate
- Install the Python dependencies:
  pip install -r requirements.txt
After completing the installation, launch the application by running the following command from the "App/" directory:
streamlit run main.py
This project uses the MAESTRO dataset. For the various experiments, different data representation forms were used:
- Piano-roll representation - the MIDI files were partitioned into timesteps, with each timestep containing a 128-dimensional vector covering every possible pitch
- Word tokenization - the MIDI files were converted into string tokens using the miditok library with the REMI tokenizer
- Quadruplet representation - each note was encoded as 4 continuous values: tone length, frequency, intensity, and time elapsed since the previous tone
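For illustration, the sketch below shows one way the piano-roll and quadruplet representations could be extracted from a MIDI file with pretty_midi; the file path, sampling rate, and exact preprocessing are assumptions rather than the notebooks' actual code, and the REMI tokenization itself is handled by the miditok library as described above.

```python
import numpy as np
import pretty_midi

# Load a MIDI file (placeholder path).
midi = pretty_midi.PrettyMIDI("example.mid")

# Piano-roll representation: one binary 128-dimensional pitch vector per timestep.
# fs (timesteps per second) is an assumed value.
roll = midi.get_piano_roll(fs=16)              # shape: (128, n_timesteps)
piano_roll = (roll.T > 0).astype(np.float32)   # shape: (n_timesteps, 128)

# Quadruplet representation: tone length, pitch (standing in for frequency),
# intensity (velocity), and time since the previous tone's onset.
notes = sorted(
    (note for inst in midi.instruments for note in inst.notes),
    key=lambda note: note.start,
)
quadruplets, prev_start = [], 0.0
for note in notes:
    quadruplets.append(
        (note.end - note.start, note.pitch, note.velocity, note.start - prev_start)
    )
    prev_start = note.start
quadruplets = np.array(quadruplets, dtype=np.float32)  # shape: (n_notes, 4)
```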
- Many-to-One RNN (GRU/LSTM) (see the sketch after this list)
  - Input: Sequence of length seq_len with 128-dimensional piano-roll vectors
  - Architecture: 2 layers of GRU/LSTM with a hidden size of 256
  - Output: Single 128-dimensional vector predicting the next timestep
- Many-to-Many RNN (GRU/LSTM)
  - Input/Output: Sequences of length seq_len
  - Architecture: Same as many-to-one, but returns outputs at every timestep
  - Note: Did not converge; experiments discontinued
- Encoder-Decoder
  - Encoder: 2 layers of bidirectional LSTM with a hidden size of 1024 and an input dimension of 128
  - Decoder: LSTM + linear projection to 128 dimensions, trained with teacher forcing
  - Workflow: Encode the full sequence, then decode iteratively to generate music
- DistilGPT-2 Fine-Tuning
  - Model: DistilGPT-2 from HuggingFace Transformers
  - Tokenizer: REMI musical tokens
  - Outcome: Best performance on objective and subjective metrics
- GAN (C-RNN-GAN design)
  - Generator: 100-dimensional input -> 2 LSTM layers with a hidden size of 350 -> linear layer to the 4-dimensional quadruplet representation
  - Discriminator: 2 bidirectional LSTM layers with a hidden size of 350 -> linear output
  - Status: Training script implemented, but unresolved errors prevented training
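As a rough illustration of the many-to-one setup, here is a minimal PyTorch sketch assuming a 128-dimensional piano-roll input, 2 GRU layers with a hidden size of 256, and a linear head producing the next 128-dimensional frame; the window length, batch size, and loss are assumptions, and the notebooks' actual code may differ.

```python
import torch
import torch.nn as nn

class ManyToOneRNN(nn.Module):
    """Sketch of the many-to-one recurrent model: 2 GRU layers, hidden size 256."""

    def __init__(self, n_pitches: int = 128, hidden_size: int = 256, num_layers: int = 2):
        super().__init__()
        self.rnn = nn.GRU(n_pitches, hidden_size, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, n_pitches)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, 128) window of piano-roll vectors
        out, _ = self.rnn(x)
        # Only the last timestep's hidden state is used to predict the next frame.
        return self.head(out[:, -1, :])  # (batch, 128) per-pitch logits

model = ManyToOneRNN()
window = torch.zeros(8, 64, 128)        # assumed batch of 8 windows with seq_len = 64
next_frame_logits = model(window)       # could be trained with nn.BCEWithLogitsLoss (assumption)
```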
The models were evaluated against the test set on:
- Pitch diversity
- Rhythmic consistency
- Note density
The DistilGPT-2 model achieved the highest overall scores. The many-to-one GRU and LSTM models performed competitively on rhythmic consistency and note density, respectively.
Listening tests revealed that the DistilGPT-2 model produced the most coherent samples, while the RNN-based models exhibited random pauses and noise bursts. Users can evaluate the models for themselves with the GUI application, which bundles all of them.
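For reference, here is one plausible way to compute the objective metrics listed above from a binarized piano roll; the repository's evaluation scripts may define them differently, and the entropy-based rhythmic measure in particular is an assumption.

```python
import numpy as np

def pitch_diversity(piano_roll: np.ndarray) -> int:
    """Number of distinct pitches used; piano_roll has shape (n_timesteps, 128)."""
    return int((piano_roll.sum(axis=0) > 0).sum())

def note_density(piano_roll: np.ndarray) -> float:
    """Average number of simultaneously active pitches per timestep."""
    return float(piano_roll.sum(axis=1).mean())

def rhythmic_consistency(piano_roll: np.ndarray) -> float:
    """Entropy of inter-onset intervals; lower values indicate a steadier rhythm."""
    onsets = np.where(piano_roll.sum(axis=1) > 0)[0]
    if len(onsets) < 2:
        return 0.0
    intervals = np.diff(onsets)
    _, counts = np.unique(intervals, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```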
Here are some ideas for future work:
- Integrate attention mechanisms into the encoder-decoder framework
- Resolve the GAN training issues and extend those experiments
- Explore larger transformer variants
- Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset." In International Conference on Learning Representations, 2019.
- Olof Mogren. "C-RNN-GAN: Continuous recurrent neural networks with adversarial training." 2016.
- Yu-Siang Huang and Yi-Hsuan Yang. "Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions." 2020.
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." 2019.