ACE-Step/ace-step-v1.5-1d-vae-stable-audio-format
library_name: stable-audio-tools
license: mit
pipeline_tag: text-to-audio
tags:
- audio
- music
- vae
- autoencoder
- ace-step
- stable-audio-tools
<h1 align="center">ACE-Step v1.5 1D VAE</h1> <h1 align="center">Stable Audio Tools Format</h1> <p align="center"> <a href="https://github.com/ACE-Step/ACE-Step-1.5">GitHub</a> | <a href="https://ace-step.github.io/ace-step-v1.5.github.io/">Project</a> | <a href="https://huggingface.co/collections/ACE-Step/ace-step-15">Hugging Face</a> | <a href="https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5">Space Demo</a> | <a href="https://discord.gg/PeWDxrkdj7">Discord</a> | <a href="https://arxiv.org/abs/2602.00744">Tech Report</a> </p>
Model Details
This is the 1D Variational Autoencoder (VAE) used in ACE-Step v1.5 for music generation. The weights are provided in a stable-audio-tools-compatible format, so they are easy to load, fine-tune, and integrate into your own training pipelines.
| Parameter | Value |
|-----------|-------|
| Architecture | Oobleck Autoencoder (VAE) |
| Audio Channels | 2 (Stereo) |
| Sampling Rate | 48,000 Hz |
| Latent Dim | 64 |
| Encoder Latent Dim | 128 |
| Downsampling Ratio | 1,920 |
| Encoder/Decoder Channels | 128 |
| Channel Multipliers | [1, 2, 4, 8, 16] |
| Strides | [2, 4, 4, 6, 10] |
| Activation | Snake |
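The numbers in the table are related to each other, and a short sketch makes the relationship explicit. The downsampling ratio is the product of the strides. The per-stage channel widths below assume the usual convention of multiplying the base channel count by each multiplier, and reading the encoder latent dimension as twice the latent dimension (a mean and a scale per latent channel) is likewise an assumption about how the VAE bottleneck is wired, not something this card states.

```python
from math import prod

# Values copied from the parameter table.
strides = [2, 4, 4, 6, 10]
channel_multipliers = [1, 2, 4, 8, 16]
base_channels = 128
latent_dim = 64
encoder_latent_dim = 128

# Each stage downsamples the time axis by its stride, so the total
# downsampling ratio is the product of all strides.
downsampling_ratio = prod(strides)

# Assumption: stage width = base_channels * multiplier.
stage_channels = [base_channels * m for m in channel_multipliers]

print(downsampling_ratio)                    # 1920
print(stage_channels)                        # [128, 256, 512, 1024, 2048]
print(encoder_latent_dim == 2 * latent_dim)  # True
```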
🏗️ Architecture
The VAE is a core component of the ACE-Step v1.5 pipeline: it compresses raw 48 kHz stereo audio into a compact 64-dimensional latent representation at a 1,920x downsampling ratio. The diffusion transformer (DiT) operates in this latent space to generate music.
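To get a feel for how compact the latent space is, the following sketch works out the latent frame rate and a rough compression factor from the figures above. The function names are hypothetical, and the factor counts values (not bits), so it is only a back-of-the-envelope estimate.

```python
SAMPLE_RATE = 48_000        # Hz
DOWNSAMPLING_RATIO = 1_920  # audio samples per latent frame
AUDIO_CHANNELS = 2
LATENT_DIM = 64


def latent_frame_rate() -> float:
    """Latent frames produced per second of audio."""
    return SAMPLE_RATE / DOWNSAMPLING_RATIO


def compression_factor() -> float:
    """Input sample values per latent value.

    One latent frame covers AUDIO_CHANNELS * DOWNSAMPLING_RATIO sample
    values and stores LATENT_DIM numbers.
    """
    return AUDIO_CHANNELS * DOWNSAMPLING_RATIO / LATENT_DIM


def latent_frames(duration_s: float) -> int:
    """Whole latent frames for a clip (floor; an encoder may pad instead)."""
    return int(duration_s * SAMPLE_RATE) // DOWNSAMPLING_RATIO


print(latent_frame_rate())   # 25.0
print(compression_factor())  # 60.0
print(latent_frames(30))     # 750
```

Under these numbers, a 30-second clip corresponds to a latent tensor of roughly `[1, 64, 750]`.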
Quick Start
Installation
pip install stable-audio-tools torchaudio
Load and Use
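import torch
from stable_audio_vae import StableAudioVAE  # wrapper defined in stable_audio_vae.py
# Load the model from the config and checkpoint shipped with this repo
vae = StableAudioVAE(
    config_path="config.json",
    checkpoint_path="checkpoint.ckpt",
)
vae = vae.cuda().eval()
# Load the input audio and move it to the GPU
wav = vae.load_wav("input.wav")
wav = wav.cuda()
# Inference only, so gradient tracking is disabled
with torch.no_grad():
    # Encode audio
    latent = vae.encode(wav)
    print(f"Latent shape: {latent.shape}")  # [batch, 64, num_samples / 1920]
    # Decode back to audio
    output = vae.decode(latent)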
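Because each latent frame covers 1,920 samples, input lengths that are not a multiple of 1,920 need some handling. Whether the `StableAudioVAE` wrapper pads internally is not stated here, so the helper below is only a sketch of how you might compute the padding yourself; `pad_to_multiple` is a hypothetical name, not part of the repository.

```python
def pad_to_multiple(num_samples: int, ratio: int = 1920) -> tuple[int, int]:
    """Return (padded_length, pad_amount) so padded_length % ratio == 0."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    pad = (-num_samples) % ratio  # 0 when already aligned
    return num_samples + pad, pad


padded, pad = pad_to_multiple(100_000)
print(padded, pad)     # 101760 1760
print(padded // 1920)  # 53
```

If you pad the waveform this way before encoding, cropping the decoded output back to the original `num_samples` restores the input length.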
Command Line
python stable_audio_vae.py -i input.wav -o output.wav
# For long audio, use chunked processing
python stable_audio_vae.py -i input.wav -o output.wav --chunked
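How `--chunked` splits the audio is not described in this card. As an illustration of the general idea only, the sketch below computes overlapping sample windows whose start positions fall on latent-frame boundaries. The function name and the default chunk and overlap sizes are hypothetical and are not taken from `stable_audio_vae.py`.

```python
def chunk_windows(
    num_samples: int,
    chunk_frames: int = 256,   # hypothetical chunk size, in latent frames
    overlap_frames: int = 16,  # hypothetical overlap, in latent frames
    ratio: int = 1920,         # audio samples per latent frame
) -> list[tuple[int, int]]:
    """Return (start, end) sample ranges covering the whole signal.

    Every start is a multiple of `ratio`, and consecutive windows share
    `overlap_frames * ratio` samples so the seams can be cross-faded.
    """
    if not 0 <= overlap_frames < chunk_frames:
        raise ValueError("need 0 <= overlap_frames < chunk_frames")
    if num_samples <= 0:
        return []

    chunk = chunk_frames * ratio
    hop = (chunk_frames - overlap_frames) * ratio

    windows = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        windows.append((start, end))
        if end == num_samples:
            break
        start += hop
    return windows


print(chunk_windows(1_000_000))
# [(0, 491520), (460800, 952320), (921600, 1000000)]
```

The last window is usually shorter than the others and would need padding before encoding.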
Fine-Tuning
This checkpoint is compatible with stable-audio-tools training pipelines. The config.json file includes the full training configuration (optimizer, loss, and discriminator settings), which you can use as a starting point for fine-tuning.
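Before fine-tuning, it can help to check what the config actually contains. The sketch below uses only the standard library. The key names `training` and `sample_rate` are assumptions based on the usual stable-audio-tools config layout and may differ in this file, and `summarize_config` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_config(path: str) -> dict:
    """Load a JSON config and report a few top-level facts about it."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        "top_level_keys": sorted(config),
        # Assumed key names; adjust them if the file uses different ones.
        "has_training_section": "training" in config,
        "sample_rate": config.get("sample_rate"),
    }


# Usage (run from the repository root):
#   print(summarize_config("config.json"))
```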
File Structure
.
├── config.json # Model architecture and training config
├── checkpoint.ckpt # Model weights (PyTorch checkpoint)
├── stable_audio_vae.py # Inference script with StableAudioVAE wrapper
└── README.md
🦁 Related Models
| Model | Description | Hugging Face |
|-------|-------------|--------------|
| acestep-v15-base | DiT base model (CFG, 50 steps) | Link |
| acestep-v15-sft | DiT SFT model (CFG, 50 steps) | Link |
| acestep-v15-turbo | DiT turbo model (8 steps) | Link |
| acestep-v15-xl-base | XL DiT base (4B, CFG, 50 steps) | Link |
| acestep-v15-xl-sft | XL DiT SFT (4B, CFG, 50 steps) | Link |
| acestep-v15-xl-turbo | XL DiT turbo (4B, 8 steps) | Link |
🙏 Acknowledgements
This project is co-led by ACE Studio and StepFun.
📖 Citation
If you find this project useful for your research, please consider citing:
@misc{gong2026acestep,
title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
year={2026},
note={GitHub repository}
}
Author: ACE-Step
Tags: stable-audio-tools, autoencoder, audio, music, vae, ace-step, text-to-audio, arxiv:2602.00744, license:mit, region:us



