I am attempting to train the autoencoder architecture of Stable Audio 2 (stable_audio_2_0_vae.json) on my own dataset. I have 2 questions:
1. To my understanding, stable_audio_2_0_vae.json is the config file for the autoencoder that was used for Stable Audio Open (SAO). Am I correct?
2. For SAO autoencoder training, were the samples in the dataset 5 seconds long or 1.5 seconds long? The paper states in Section 3.1, "We gathered 5 sec chunks of diverse, high fidelity audio", but Section 4.1 says, "Each batch is made of ≈1.5 sec chunks".
Thanks!
I can answer the second question: the 5 seconds refers to the length the source audio was segmented into when the dataset was gathered, while the training crop length is ≈1.5 s. During training, a ≈1.5 s window is randomly extracted from each 5 s chunk, which randomizes the start and end positions the model sees. @lucystrauss
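To make that concrete, here is a minimal sketch of that kind of random-window cropping. This is not the actual stable-audio-tools dataloader code; the 65536-sample window (≈1.5 s at 44.1 kHz) and the helper name random_crop are assumptions for illustration, so check your dataset/model config for the exact window size.

```python
import torch
import torch.nn.functional as F

def random_crop(waveform: torch.Tensor, crop_samples: int) -> torch.Tensor:
    """Randomly crop a fixed-length training window from a longer audio chunk.

    waveform: tensor of shape [channels, samples], e.g. a 5 s chunk.
    crop_samples: training window length in samples, e.g. ~1.5 s worth.
    """
    total = waveform.shape[-1]
    if total <= crop_samples:
        # Right-pad clips shorter than the window so batch shapes still match.
        return F.pad(waveform, (0, crop_samples - total))
    # Pick a random start so the window's position inside the chunk varies.
    start = torch.randint(0, total - crop_samples + 1, (1,)).item()
    return waveform[..., start:start + crop_samples]

# Example: crop a ~1.5 s window (65536 samples at 44.1 kHz) out of a 5 s stereo chunk.
chunk = torch.randn(2, 5 * 44100)   # stand-in for a loaded 5 s audio file
window = random_crop(chunk, 65536)
print(window.shape)                 # torch.Size([2, 65536])
```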