I am attempting to train the autoencoder architecture of Stable Audio 2 (stable_audio_2_0_vae.json) on my own dataset. I have 2 questions:
1. To my understanding, stable_audio_2_0_vae.json is the config file for the autoencoder that was used for Stable Audio Open (SAO). Am I correct?
2. For SAO autoencoder training, were the samples in the dataset 5 seconds long or 1.5 seconds long? The paper states in Section 3.1, "We gathered 5 sec chunks of diverse, high fidelity audio", but Section 4.1 says, "Each batch is made of ≈1.5 sec chunks".
Thanks!
I can answer the second question: the 5 seconds refers to the length the source audio was segmented into when the dataset was gathered, while the training crop length is ≈1.5 s. During training, a ≈1.5 s window is randomly extracted from each 5 s chunk, which randomizes the start and end positions the model sees. @lucystrauss
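To make that concrete, here is a minimal sketch of that kind of random-window cropping. This is not the actual stable-audio-tools dataloader code; the 65536-sample window (≈1.5 s at 44.1 kHz) and the helper name random_crop are assumptions for illustration, so check your dataset/model config for the exact window size.

```python
import torch
import torch.nn.functional as F

def random_crop(waveform: torch.Tensor, crop_samples: int) -> torch.Tensor:
    """Randomly crop a fixed-length training window from a longer audio chunk.

    waveform: tensor of shape [channels, samples], e.g. a 5 s chunk.
    crop_samples: training window length in samples, e.g. ~1.5 s worth.
    """
    total = waveform.shape[-1]
    if total <= crop_samples:
        # Right-pad clips shorter than the window so batch shapes still match.
        return F.pad(waveform, (0, crop_samples - total))
    # Pick a random start so the window's position inside the chunk varies.
    start = torch.randint(0, total - crop_samples + 1, (1,)).item()
    return waveform[..., start:start + crop_samples]

# Example: crop a ~1.5 s window (65536 samples at 44.1 kHz) out of a 5 s stereo chunk.
chunk = torch.randn(2, 5 * 44100)   # stand-in for a loaded 5 s audio file
window = random_crop(chunk, 65536)
print(window.shape)                 # torch.Size([2, 65536])
```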