AFTER is a diffusion-based generative model that creates new audio by blending two sources: one audio stream to set the style or timbre, and another input (either audio or MIDI) to shape the structure over time.
This repository provides a real-time implementation of the method described in the paper *Combining audio control and style transfer using latent diffusion* (read it here) by Nils Demerlé, P. Esling, G. Doras, and D. Genova. Some transfer examples can be found on the project webpage. This real-time version integrates with MaxMSP and Ableton Live through nn_tilde, an external that embeds PyTorch models into MaxMSP.
You can find pretrained models and Max patches for real-time inference in the last section of this page.
```bash
git clone https://github.com/acids-ircam/AFTER.git
cd AFTER/
pip install -e .
```
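Since training and export rely on PyTorch and, by default, a CUDA GPU, a quick sanity check after installation can save time. The snippet below is plain PyTorch and not part of the AFTER command-line interface:

```python
# Post-install check: confirm that PyTorch imports and report whether a CUDA
# device is visible (the --gpu flags in the AFTER commands expect one).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```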
If you want to use the model in MaxMSP or PureData for real-time generation, please refer to the nn_tilde external documentation and follow the installation steps.
Training AFTER involves three separate steps: autoencoder training, diffusion model training, and model export.
If you already have a streamable audio codec, such as a pretrained RAVE model, you can skip directly to the next section. We also provide four audio codecs pretrained on different datasets here.
Before training the autoencoder, you need to preprocess your audio files into an LMDB database:
```bash
after prepare_dataset --input_path /audio/folder --output_path /dataset/path --save_waveform True --waveform_augmentation none
```
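To confirm that preprocessing produced a non-empty database, you can inspect it with the `lmdb` Python package. This sketch only counts records and makes no assumption about how AFTER serializes each entry:

```python
# Rough sanity check on the prepared dataset: open the LMDB database
# read-only and report how many records it contains.
import lmdb

env = lmdb.open("/dataset/path", readonly=True, lock=False)
with env.begin() as txn:
    print("entries:", txn.stat()["entries"])
env.close()
```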
Then, you can start the autoencoder training:
```bash
after train_autoencoder --name AE_model_name --db_path /dataset/path --config baseAE --gpu 0
```
where `db_path` refers to the prepared dataset location. TensorBoard logs and checkpoints are saved by default to `./autoencoder_runs/`.
After training, the model has to be exported to a TorchScript file:

```bash
after export_autoencoder --model_path autoencoder_runs/AE_model_name --step 1000000
```

This saves two `.ts` files in the run folder, one for streaming and one for offline inference (`export_stream.ts` and `export.ts`, respectively).
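As a sketch of how the offline export can be used outside MaxMSP, the snippet below round-trips an audio file through the autoencoder with plain `torch.jit`. It assumes the export exposes `encode` and `decode` methods operating on `(batch, channels, samples)` tensors, as is common for nn_tilde exports; adjust the paths and method names if your export differs:

```python
# Offline round-trip through the exported autoencoder (assumed encode/decode API).
import torch
import torchaudio

model = torch.jit.load("autoencoder_runs/AE_model_name/export.ts").eval()

waveform, sr = torchaudio.load("input.wav")      # (channels, samples)
x = waveform.mean(0, keepdim=True).unsqueeze(0)  # mono, shape (1, 1, samples)

with torch.no_grad():
    z = model.encode(x)  # latent sequence the diffusion model is trained on
    y = model.decode(z)  # reconstructed audio

torchaudio.save("reconstruction.wav", y.squeeze(0), sr)
```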
First, you need to prepare your dataset. Since the diffusion model works in the latent space of the autoencoder, we pre-compute the latent embeddings to speed up training:
```bash
after prepare_dataset --input_path /audio/folder --output_path /dataset/path --emb_model_path AE_model_run_path/export.ts
```
- `num_signal` sets the duration of the audio chunks used for training, in number of samples; it must be a power of 2 (default: 524288, roughly 11.9 seconds at 44.1 kHz; see the quick calculation after this list).
- `sample_rate` sets the resampling rate (default: 44100).
- `gpu` sets the device used to compute the embeddings; use -1 for CPU (default: 0).
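The relationship between `num_signal` and chunk duration is simple arithmetic, with no AFTER code involved:

```python
# Chunk duration implied by num_signal at a given sample rate.
num_signal = 524288   # default chunk length: 2**19 samples
sample_rate = 44100   # default resampling rate in Hz

print(f"{num_signal / sample_rate:.1f} s per chunk")  # ~11.9 s
```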
To train a MIDI-to-audio AFTER model, you need to either use the `--basic_pitch_midi` flag to transcribe MIDI from the audio files, or define your own file parsing function in `./after/dataset/parsers.py`.
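For orientation, a custom parser might look like the hypothetical sketch below. The function name and signature are illustrative assumptions, not the interface AFTER prescribes; mirror one of the existing functions in `./after/dataset/parsers.py` for the exact expected form:

```python
# Hypothetical parser sketch: pair each audio file with a MIDI file sharing its
# stem. Illustrative only; copy the structure of an existing parser in
# after/dataset/parsers.py rather than this signature.
from pathlib import Path

def my_dataset_parser(audio_root: str):
    pairs = []
    for audio_path in Path(audio_root).rglob("*.wav"):
        midi_path = audio_path.with_suffix(".mid")
        if midi_path.exists():
            pairs.append({"audio": str(audio_path), "midi": str(midi_path)})
    return pairs
```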
If you plan more advanced use of the models, please refer to the command-line help for the full list of arguments.
Then, start a training run with:
```bash
after train --name diff_model_name --db_path /dataset/path --emb_model_path AE_model_run_path/export.ts --config CONFIG_NAME
```
Different configurations are available in `diffusion/configs` and can be combined:
| Category | Config | Description |
|---|---|---|
| Model | `base` | Default audio-to-audio timbre and structure separation model. |
| | `midi` | Uses MIDI as input for the structure encoder. |
| Additional | `tiny` | Reduces the model's capacity for faster inference. Useful for testing and low-resource environments. |
| | `cycle` | Experimental: adds a cycle consistency phase during training, which can improve timbre and structure disentanglement. |
The TensorBoard logs and checkpoints are saved to `/diffusion/runs/model_name`, and you can experiment with your trained model using the notebooks `notebooks/audio_to_audio_demo.ipynb` and `notebooks/midi_to_audio_demo.ipynb`.
Once training is complete, you can export the model to an nn_tilde TorchScript file for inference in MaxMSP and PureData.
For an audio-to-audio model:

```bash
after export --model_path diff_model_name --emb_model_path AE_model_run_path/export_stream.ts --step 800000
```
For a MIDI-to-audio model:

```bash
after export_midi --model_path diff_model_name --emb_model_path AE_model_run_path/export_stream.ts --npoly 4 --step 800000
```
where `npoly` sets the number of voices for polyphony. Make sure to use the streaming version of the exported autoencoder (denoted by `_stream.ts`).
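Before loading an export into MaxMSP, it can help to check that the `.ts` file loads cleanly in Python and to glance at the attributes and methods it exposes. The file name below is a placeholder for your own export:

```python
# Load an exported nn_tilde model and list its public attributes and methods.
import torch

model = torch.jit.load("export_model.ts")  # placeholder: path to your exported .ts file
print([name for name in dir(model) if not name.startswith("_")])
```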
You can experiment with inference in MaxMSP using the patches in `./patchs` and the pretrained models available here.
AFTER has been applied in several projects:
- The Call by Holly Herndon and Mat Dryhurst, an interactive sound installation with singing voice transfer, at Serpentine Gallery in London until February 2, 2025.
- A live performance by French electronic artist Canblaster for Forum Studio Session at IRCAM. The full concert is available on YouTube.
- Nature Manifesto, an immersive sound installation by Björk and Robin Meier, at Centre Pompidou in Paris from November 20 to December 9, 2024.
We look forward to seeing new projects and creative uses of AFTER.