Dia TTS Server: OpenAI-Compatible API with Web UI, Large Text Handling & Built-in Voices

Self-host the powerful Nari Labs Dia TTS model with this enhanced FastAPI server! It features an intuitive Web UI, flexible API endpoints (including the OpenAI-compatible /v1/audio/speech), realistic dialogue generation ([S1]/[S2] tags), improved voice cloning, large text processing via intelligent chunking, and consistent, reproducible voices using 43 built-in, ready-to-use voices and generation seeds.

Now with improved speed and reduced VRAM usage: the server defaults to efficient BF16 SafeTensors weights for lower memory use and faster inference, still supports the original .pth weights, and runs accelerated on NVIDIA GPUs (CUDA) with CPU fallback.


(Screenshots: Dia TTS Server Web UI in dark mode and light mode)

🗣️ Overview: Enhanced Dia TTS Access

The original Dia 1.6B TTS model by Nari Labs provides incredible capabilities for generating realistic dialogue, complete with speaker turns and non-verbal sounds like (laughs) or (sighs). This project builds upon that foundation by providing a robust FastAPI server that makes Dia significantly easier to use and integrate.

We solve the complexity of setting up and running the model by offering:

  • An OpenAI-compatible API endpoint, allowing you to use Dia TTS with tools expecting OpenAI's API structure.
  • A modern Web UI for easy experimentation, preset loading, reference audio management, and generation parameter tuning. The interface design draws inspiration from Lex-au's Orpheus-FastAPI project, adapting its intuitive layout and user experience for Dia TTS.
  • Large Text Handling: Intelligently splits long text inputs into manageable chunks based on sentence structure and speaker tags, processes them sequentially, and seamlessly concatenates the audio.
  • Predefined Voices: Select from 43 curated, ready-to-use synthetic voices for consistent and reliable output without cloning setup.
  • Improved Voice Cloning: Enhanced pipeline with automatic audio processing and transcript handling (local .txt file or experimental Whisper fallback).
  • Consistent Generation: Achieve consistent voice output across multiple generations or text chunks by using the "Predefined Voices" or "Voice Cloning" modes, optionally combined with a fixed integer Seed.
  • Support for both original .pth weights and modern, secure SafeTensors, defaulting to a BF16 SafeTensors version which uses roughly half the VRAM and offers improved speed.
  • Automatic GPU (CUDA) acceleration detection with fallback to CPU.
  • Configuration primarily via config.yaml, with .env used for initial setup/reset.
  • Docker support for easy containerized deployment.

This server is your gateway to leveraging Dia's advanced TTS capabilities seamlessly, now with enhanced stability, voice consistency, and large text support.

✨ What's New (v1.4.0 vs v1.0.0)

This version introduces significant improvements and new features:

🚀 New Features:

  • Large Text Processing (Chunking):
    • Automatically handles long text inputs by intelligently splitting them into smaller chunks based on sentence boundaries and speaker tags ([S1]/[S2]).
    • Processes each chunk individually and seamlessly concatenates the resulting audio, overcoming previous generation limits.
    • Configurable via UI toggle ("Split text into chunks") and chunk size slider. (A simplified splitting sketch appears after this feature list.)
  • Predefined Voices:
    • Added support for using 43 curated, ready-to-use synthetic voices stored in the ./voices directory.
    • Selectable via UI dropdown ("Predefined Voices" mode). Server automatically uses required transcripts.
    • Provides reliable voice output without manual cloning setup and avoids potential licensing issues.
  • Enhanced Voice Cloning:
    • Improved backend pipeline for robustness.
    • Automatic reference audio processing: mono conversion, resampling to 44.1kHz, truncation (~20s).
    • Automatic transcript handling: prioritizes a local .txt file (recommended for accuracy), falling back to experimental Whisper generation if the .txt is missing. The backend handles transcript prepending automatically.
    • Robust reference file finding handles case-insensitivity and extensions.
  • Whisper Integration: Added openai-whisper for automatic transcript generation as an experimental fallback during cloning. Configurable model (WHISPER_MODEL_NAME in config.yaml).
  • API Enhancements:
    • /tts endpoint now supports transcript (for explicit clone transcript), split_text, chunk_size, and seed.
    • /v1/audio/speech endpoint now supports seed.
  • Generation Seed: Added seed parameter to UI and API for influencing generation results. Using a fixed integer seed in combination with Predefined Voices or Voice Cloning helps maintain consistency across chunks or separate generations. Use -1 for random variation.
  • Terminal Progress: Generation of long text (using chunking) now displays a tqdm progress bar in the server's terminal window.
  • UI Configuration Management: Added UI section to view/edit config.yaml settings and save generation defaults.
  • Configuration System: Migrated to config.yaml for primary runtime configuration, managed via config.py. .env is now used mainly for initial seeding or resetting defaults.
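
To build intuition for the Large Text Processing feature above, here is a simplified, hypothetical sketch of sentence- and speaker-tag-aware splitting. It is for illustration only and is not the server's actual implementation, which handles chunk sizing and speaker context more carefully.

# chunking_sketch.py -- illustrative only, not the server's real splitter
import re

def chunk_text(text: str, chunk_size: int = 120) -> list[str]:
    # Keep each [S1]/[S2] segment together, then pack whole sentences
    # into chunks of roughly `chunk_size` characters.
    segments = [s.strip() for s in re.split(r"(\[S[12]\][^\[]*)", text) if s.strip()]
    chunks, current = [], ""
    for segment in segments:
        for sentence in re.split(r"(?<=[.!?])\s+", segment):
            if current and len(current) + len(sentence) + 1 > chunk_size:
                chunks.append(current.strip())
                current = ""
            current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks

print(chunk_text("[S1] Hello there. How are you today? [S2] I'm great, thanks. (laughs)", chunk_size=60))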

🔧 Fixes & Enhancements:

  • VRAM Usage Fixed & Optimized: Resolved memory leaks during inference and significantly reduced VRAM usage (from approximately 14 GB+ down to ~7 GB) through code optimizations and the BF16 default.
  • Performance: Significant speed improvements reported (approaching 95% real-time on tested hardware: AMD Ryzen 9 9950X3D + NVIDIA RTX 3090).
  • Audio Post-Processing: Automatically applies silence trimming (leading/trailing), internal silence reduction, and unvoiced segment removal (using Parselmouth) to improve audio quality and remove artifacts.
  • UI State Persistence: Web UI now saves/restores text input, voice mode selection, file selections, and generation parameters (seed, chunking, sliders) in config.yaml.
  • UI Improvements: Better loading indicators (shows chunk processing), refined chunking controls, seed input field, theme toggle, dynamic preset loading from ui/presets.yaml, warning modals for chunking/generation quality.
  • Cloning Workflow: Backend now handles transcript prepending automatically. UI workflow simplified (user selects file, enters target text).
  • Dependency Management: Added tqdm, PyYAML, openai-whisper, parselmouth to requirements.txt.
  • Code Refactoring: Aligned internal engine code with refactored dia library structure. Updated config.py to use YamlConfigManager.

✅ Features

  • Core Dia Capabilities (via Nari Labs Dia):
    • 🗣️ Generate multi-speaker dialogue using [S1] / [S2] tags.
    • 😂 Include non-verbal sounds like (laughs), (sighs), (clears throat).
    • 🎭 Perform voice cloning using reference audio prompts.
  • Enhanced Server & API:
    • ⚡ Built with the high-performance FastAPI framework.
    • 🤖 OpenAI-Compatible API Endpoint (/v1/audio/speech) for easy integration (now includes seed).
    • ⚙️ Custom API Endpoint (/tts) exposing all Dia generation parameters (now includes seed, split_text, chunk_size, transcript).
    • 📄 Interactive API documentation via Swagger UI (/docs).
    • 🩺 Health check endpoint (/health).
  • Advanced Generation Features:
    • 📚 Large Text Handling: Intelligently splits long inputs into chunks based on sentences and speaker tags, generates audio for each, and concatenates the results seamlessly. Configurable via split_text and chunk_size.
    • 🎤 Predefined Voices: Select from 43 curated, ready-to-use synthetic voices in the ./voices directory for consistent output without cloning setup.
    • ✨ Improved Voice Cloning: Robust pipeline with automatic audio processing and transcript handling (local .txt or Whisper fallback). Backend handles transcript prepending.
    • 🌱 Consistent Generation: Use Predefined Voices or Voice Cloning modes, optionally with a fixed integer Seed, for consistent voice output across chunks or multiple requests.
    • 🔇 Audio Post-Processing: Automatic steps to trim silence, fix internal pauses, and remove long unvoiced segments/artifacts.
  • Intuitive Web User Interface:
    • 🖱️ Modern, easy-to-use interface inspired by Lex-au's Orpheus-FastAPI project.
    • 💡 Presets: Load example text and settings dynamically from ui/presets.yaml. Customize by editing the file.
    • 🎤 Reference Audio Upload: Easily upload .wav/.mp3 files for voice cloning.
    • 🗣️ Voice Mode Selection: Choose between Predefined Voices, Voice Cloning, or Random/Dialogue modes.
    • 🎛️ Parameter Control: Adjust generation settings (CFG Scale, Temperature, Speed, Seed, etc.) via sliders and inputs.
    • 💾 Configuration Management: View and save server settings (config.yaml) and default generation parameters directly in the UI.
    • 💾 Session Persistence: Remembers your last used settings via config.yaml.
    • ✂️ Chunking Controls: Enable/disable text splitting and adjust approximate chunk size.
    • ⚠️ Warning Modals: Optional warnings for chunking voice consistency and general generation quality.
    • 🌓 Light/Dark Mode: Toggle between themes with preference saved locally.
    • 🔊 Audio Player: Integrated waveform player (WaveSurfer.js) for generated audio with download option.
    • ⏳ Loading Indicator: Shows status, including chunk processing information.
  • Flexible & Efficient Model Handling:
    • ☁️ Downloads models automatically from Hugging Face Hub.
    • 🔒 Supports loading secure .safetensors weights (default).
    • 💾 Supports loading original .pth weights.
    • 🚀 Defaults to BF16 SafeTensors for reduced memory footprint (~half size) and potentially faster inference. (Credit: ttj/dia-1.6b-safetensors)
    • 🔄 Easily switch between model formats/versions via config.yaml.
  • Performance & Configuration:
    • 💻 GPU Acceleration: Automatically uses NVIDIA CUDA if available, falls back to CPU. Optimized VRAM usage (~7GB typical).
    • 📊 Terminal Progress: Displays tqdm progress bar when processing text chunks.
    • ⚙️ Primary configuration via config.yaml, initial seeding via .env.
    • 📦 Uses standard Python virtual environments.
  • Docker Support:
    • 🐳 Containerized deployment via Docker and Docker Compose.
    • 🔌 NVIDIA GPU acceleration with Container Toolkit integration.
    • 💾 Persistent volumes for models, reference audio, predefined voices, outputs, and config.
    • 🚀 One-command setup and deployment (docker compose up -d).

🔩 System Prerequisites

  • Operating System: Windows 10/11 (64-bit) or Linux (Debian/Ubuntu recommended).
  • Python: Version 3.10 or later (Download).
  • Git: For cloning the repository (Download).
  • Internet: For downloading dependencies and models.
  • (Optional but HIGHLY Recommended for Performance):
    • NVIDIA GPU: CUDA-compatible (Maxwell architecture or newer). Check NVIDIA CUDA GPUs. Optimized VRAM usage (~7GB typical), but more helps.
    • NVIDIA Drivers: Latest version for your GPU/OS (Download).
    • CUDA Toolkit: Compatible version (e.g., 11.8, 12.1) matching the PyTorch build you install.
  • (Linux Only):
    • libsndfile1: Audio library needed by soundfile. Install via package manager (e.g., sudo apt install libsndfile1).
    • ffmpeg: Required by openai-whisper. Install via package manager (e.g., sudo apt install ffmpeg).

💻 Installation and Setup

Follow these steps carefully to get the server running.

1. Clone the Repository

git clone https://github.com/devnen/dia-tts-server.git
cd dia-tts-server

2. Set up Python Virtual Environment

Using a virtual environment is crucial!

  • Windows (PowerShell):

    # In the dia-tts-server directory
    python -m venv venv
    .\venv\Scripts\activate
    # Your prompt should now start with (venv)
  • Linux (Bash - Debian/Ubuntu Example):

    # Ensure prerequisites are installed
    sudo apt update && sudo apt install python3 python3-venv python3-pip libsndfile1 ffmpeg -y
    
    # In the dia-tts-server directory
    python3 -m venv venv
    source venv/bin/activate
    # Your prompt should now start with (venv)

3. Install Dependencies

Make sure your virtual environment is activated ((venv) prefix visible).

# Upgrade pip (recommended)
pip install --upgrade pip

# Install project requirements (includes tqdm, yaml, parselmouth etc.)
pip install -r requirements.txt

⭐ Note: This installation includes large libraries like PyTorch. The download and installation process may take some time depending on your internet speed and system performance.

⭐ Important: This installs the CPU-only version of PyTorch by default. If you have an NVIDIA GPU, proceed to Step 4 before running the server for GPU acceleration.

4. NVIDIA Driver and CUDA Setup (for GPU Acceleration)

Skip this step if you only have a CPU.

  • Step 4a: Check/Install NVIDIA Drivers

    • Run nvidia-smi in your terminal/command prompt.
    • If it works, note the CUDA Version listed (e.g., 12.1, 11.8). This is the maximum your driver supports.
    • If it fails, download and install the latest drivers from NVIDIA Driver Downloads and reboot. Verify with nvidia-smi again.
  • Step 4b: Install PyTorch with CUDA Support

    • Go to the Official PyTorch Website.
    • Use the configuration tool: Select Stable, Windows/Linux, Pip, Python, and the CUDA version that is equal to or lower than the one shown by nvidia-smi (e.g., if nvidia-smi shows 12.4, choose CUDA 12.1).
    • Copy the generated command (it will include --index-url https://download.pytorch.org/whl/cuXXX).
    • In your activated (venv):
      # Uninstall the CPU version first!
      pip uninstall torch torchvision torchaudio -y
      
      # Paste and run the command copied from the PyTorch website
      # Example (replace with your actual command):
      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  • Step 4c: Verify PyTorch CUDA Installation

    • In your activated (venv), run python and execute the following single line:
      import torch; print(f"PyTorch version: {torch.__version__}"); print(f"CUDA available: {torch.cuda.is_available()}"); print(f"Device name: {torch.cuda.get_device_name(0)}") if torch.cuda.is_available() else None; exit()
    • If CUDA available: shows True, the setup was successful. If False, double-check driver installation and the PyTorch install command.

⚙️ Configuration

The server now primarily uses config.yaml for runtime configuration.

  • config.yaml: Located in the project root. This file stores all server settings, model paths, generation defaults, and UI state. It is created automatically on the first run if it doesn't exist. This is the main file to edit for persistent configuration changes.
  • .env File: Used only for the initial creation of config.yaml if it's missing, or when using the "Reset All Settings" button in the UI. Values in .env override hardcoded defaults during this initial seeding/reset process. It is not read during normal server operation once config.yaml exists.
  • UI Configuration: The "Server Configuration" and "Generation Parameters" sections in the Web UI allow direct editing and saving of values into config.yaml.

Key Configuration Areas (in config.yaml or UI):

  • server: host, port
  • model: repo_id, config_filename, weights_filename, whisper_model_name
  • paths: model_cache, reference_audio, output, voices (for predefined)
  • generation_defaults: Default values for sliders/seed in the UI (speed_factor, cfg_scale, temperature, top_p, cfg_filter_top_k, seed, split_text, chunk_size).
  • ui_state: Stores the last used text, voice mode, file selections, etc., for UI persistence.

⭐ Remember: Changes made to server, model, or paths sections in config.yaml (or via the UI) require a server restart to take effect. Changes to generation_defaults or ui_state are applied dynamically or on the next page load.
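
For orientation, a config.yaml organized around the key areas above might look roughly like the sketch below. The section and field names follow the list above, but the values shown are placeholders only; the file the server generates on first run is the authoritative reference.

# Illustrative config.yaml sketch -- values are examples, not the shipped defaults
server:
  host: 0.0.0.0
  port: 8003
model:
  repo_id: ttj/dia-1.6b-safetensors           # example repository
  config_filename: config.json                # example filename
  weights_filename: dia-v0_1_bf16.safetensors
  whisper_model_name: small                   # example Whisper model
paths:
  model_cache: ./model_cache
  reference_audio: ./reference_audio
  output: ./outputs
  voices: ./voices
generation_defaults:
  speed_factor: 1.0
  cfg_scale: 3.0
  temperature: 1.3
  top_p: 0.95
  cfg_filter_top_k: 35
  seed: 42
  split_text: true
  chunk_size: 120
ui_state: {}                                  # last used text, voice mode, file selections, etc.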

▶️ Running the Server

Note on Model Downloads: The first time you run the server (or after changing model settings in config.yaml), it will download the required Dia and Whisper model files (~3-7GB depending on selection). Monitor the terminal logs for progress. The server starts fully after downloads complete.

  1. Activate the virtual environment (if not activated):
    • Linux/macOS: source venv/bin/activate
    • Windows: .\venv\Scripts\activate
  2. Run the server:
    python server.py
  3. Access the UI: The server should automatically attempt to open the Web UI in your default browser after startup. If it doesn't for any reason, manually navigate to http://localhost:PORT (e.g., http://localhost:8003).
  4. Access API Docs: Open http://localhost:PORT/docs.
  5. Stop the server: Press CTRL+C in the terminal.
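
After step 2, you can optionally confirm the server is responding by querying the /health endpoint (adjust the port if you changed it in config.yaml). This quick check uses only the Python standard library:

# Quick sanity check against a running server (default port 8003)
import urllib.request

with urllib.request.urlopen("http://localhost:8003/health") as response:
    print(response.status, response.read().decode())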


🐳 Docker Installation

Run Dia TTS Server easily using Docker. The recommended method uses Docker Compose with pre-built images from GitHub Container Registry (GHCR).

Prerequisites

  • Docker Engine and Docker Compose (v2) installed on the host.
  • For GPU acceleration: a recent NVIDIA driver and the NVIDIA Container Toolkit.

Option 1: Using Docker Compose (Recommended)

This method uses docker-compose.yml to manage the container, volumes, and configuration easily. It leverages pre-built images hosted on GHCR.

  1. Clone the repository: (You only need the docker-compose.yml and env.example.txt files from it)

    git clone https://github.com/devnen/dia-tts-server.git
    cd dia-tts-server
  2. (Optional) Initial Configuration via .env:

    • If this is your very first time running the container and you want to override the default settings before config.yaml is created inside the container, copy the example environment file:
      cp env.example.txt .env
    • Edit the .env file with your desired initial settings (e.g., PORT, model filenames).
    • Note: This .env file is only used to seed the initial config.yaml on the very first container start if /app/config.yaml doesn't already exist inside the container's volume (which it won't initially). Subsequent configuration changes should be made via the UI or by editing config.yaml directly (see Configuration Note below).
  3. Review docker-compose.yml:

    • The repository includes a docker-compose.yml file configured to use the pre-built image and recommended settings. Ensure it looks similar to this:

      # docker-compose.yml
      version: '3.8'
      
      services:
        dia-tts-server:
          # Use the pre-built image from GitHub Container Registry
          image: ghcr.io/devnen/dia-tts-server:latest
          # Alternatively, to build locally (e.g., for development):
          # build:
          #   context: .
          #   dockerfile: Dockerfile
          ports:
            # Map host port (default 8003) to container port 8003
            # You can change the host port via .env (e.g., PORT=8004)
            - "${PORT:-8003}:8003"
          volumes:
            # Mount local directories into the container for persistent data
            - ./model_cache:/app/model_cache
            - ./reference_audio:/app/reference_audio
            - ./outputs:/app/outputs
            - ./voices:/app/voices
            # DO NOT mount config.yaml - let the app create it inside
      
          # --- GPU Access ---
          # Modern method (Recommended for newer Docker/NVIDIA setups)
          devices:
            - nvidia.com/gpu=all
          device_cgroup_rules:
            - "c 195:* rmw" # Needed for some NVIDIA container toolkit versions
            - "c 236:* rmw" # Needed for some NVIDIA container toolkit versions
      
          # Legacy method (Alternative for older Docker/NVIDIA setups)
          # If the 'devices' block above doesn't work, comment it out and uncomment
          # the 'deploy' block below. Do not use both simultaneously.
          # deploy:
          #   resources:
          #     reservations:
          #       devices:
          #         - driver: nvidia
          #           count: 1 # Or specify specific GPUs e.g., "device=0,1"
          #           capabilities: [gpu]
          # --- End GPU Access ---
      
          restart: unless-stopped
          env_file:
            # Load environment variables from .env file for initial config seeding
            - .env
          environment:
            # Enable faster Hugging Face downloads inside the container
            - HF_HUB_ENABLE_HF_TRANSFER=1
            # Pass GPU capabilities (may be needed for legacy method if uncommented)
            - NVIDIA_VISIBLE_DEVICES=all
            - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      
      # Optional: Define named volumes if you prefer them over host mounts
      # volumes:
      #   model_cache:
      #   reference_audio:
      #   outputs:
      #   voices:
  4. Start the container:

    docker compose up -d
    • This command will:
      • Pull the latest ghcr.io/devnen/dia-tts-server:latest image.
      • Create the local directories (model_cache, reference_audio, outputs, voices) if they don't exist.
      • Start the container in detached mode (-d).
    • The first time you run this, it will download the TTS models into ./model_cache, which may take some time depending on your internet speed.
  5. Access the UI: Open your web browser to http://localhost:8003 (or the host port you configured in .env).

  6. View logs:

    docker compose logs -f
  7. Stop the container:

    docker compose down

Option 2: Using docker run (Alternative)

This method runs the container directly without Docker Compose, requiring manual specification of ports, volumes, and GPU flags.

# Ensure local directories exist first:
# mkdir -p model_cache reference_audio outputs voices

docker run -d \
  --name dia-tts-server \
  -p 8003:8003 \
  -v ./model_cache:/app/model_cache \
  -v ./reference_audio:/app/reference_audio \
  -v ./outputs:/app/outputs \
  -v ./voices:/app/voices \
  --env HF_HUB_ENABLE_HF_TRANSFER=1 \
  --gpus all \
  ghcr.io/devnen/dia-tts-server:latest
  • Replace 8003:8003 with <your_host_port>:8003 if needed.
  • --gpus all enables GPU access; consult NVIDIA Container Toolkit documentation for alternatives if needed.
  • Initial configuration relies on model defaults unless you pass environment variables using multiple -e VAR=VALUE flags (more complex than using .env with Compose).

Configuration Note

  • The server uses config.yaml inside the container (/app/config.yaml) for its settings.
  • On the very first start, if /app/config.yaml doesn't exist, the server creates it using defaults from the code, potentially overridden by variables in the .env file (if using Docker Compose and .env exists).
  • After the first start, changes should be made by:
    • Using the Web UI's settings page (if available).
    • Editing the config.yaml file inside the container (e.g., docker compose exec dia-tts-server nano /app/config.yaml). Changes require a container restart (docker compose restart dia-tts-server) to take effect for server/model/path settings. UI state changes are saved live.

Performance Optimizations

  • Faster Model Downloads: hf-transfer is enabled by default in the provided docker-compose.yml and image, significantly speeding up initial model downloads from Hugging Face.
  • GPU Acceleration: The docker-compose.yml and docker run examples include flags (devices or --gpus) to enable NVIDIA GPU acceleration if available. The Docker image uses a CUDA runtime base for efficiency.

Docker Volumes

Persistent data is stored on your host machine via volume mounts:

  • ./model_cache:/app/model_cache (Downloaded TTS and Whisper models)
  • ./reference_audio:/app/reference_audio (Your uploaded reference audio files for cloning)
  • ./outputs:/app/outputs (Generated audio files)
  • ./voices:/app/voices (Predefined voice audio files)

Available Images

  • GitHub Container Registry: ghcr.io/devnen/dia-tts-server:latest (Automatically built from the main branch)

💡 Usage

Web UI (http://localhost:PORT)

The most intuitive way to use the server:

  • Text Input: Enter your script. Use [S1]/[S2] for dialogue and non-verbals like (laughs). Content is saved automatically. (See the example script after this list.)
  • Generate Button & Chunking: Click "Generate Speech". Below the text box:
    • Split text into chunks: Toggle checkbox (enabled by default). Enables splitting for long text (> ~2x chunk size).
    • Chunk Size: Adjust the slider (visible when splitting is possible) for approximate chunk character length (default 120).
  • Voice Mode: Choose:
    • Predefined Voices: Select a curated, ready-to-use synthetic voice from the ./voices directory.
    • Voice Cloning: Select an uploaded reference file from ./reference_audio. Requires a corresponding .txt transcript (recommended) or relies on experimental Whisper fallback. Backend handles transcript automatically.
    • Random Single / Dialogue: Uses [S1]/[S2] tags or generates a random voice if no tags. Use a fixed Seed for consistency.
  • Presets: Click buttons (loaded from ui/presets.yaml) to populate text and parameters. Customize by editing the YAML file.
  • Reference Audio (Clone Mode): Select an existing .wav/.mp3 or click "Import" to upload new files to ./reference_audio.
  • Generation Parameters: Adjust sliders/inputs for Speed, CFG, Temperature, Top P, Top K, and Seed. Settings are saved automatically. Click "Save Generation Parameters" to update the defaults in config.yaml. Use -1 seed for random, integer for specific results.
  • Server Configuration: View/edit config.yaml settings (requires server restart for some changes).
  • Loading Overlay: Appears during generation, showing chunk progress if applicable.
  • Audio Player: Appears on success with waveform, playback controls, download link, and generation info.
  • Theme Toggle: Switch between light/dark modes.
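
As an example of the text input format described above, a short dialogue script using speaker tags and non-verbals might look like this:

[S1] Hey, did you try the new model yet? (laughs)
[S2] I did! The dialogue mode sounds surprisingly natural.
[S1] Great. Let's generate a longer sample next. (clears throat)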

API Endpoints (/docs for details)

  • /v1/audio/speech (POST): OpenAI-compatible.
    • input: Text.
    • voice: 'S1', 'S2', 'dialogue', 'predefined_voice_filename.wav', or 'reference_filename.wav'.
    • response_format: 'opus' or 'wav'.
    • speed: Playback speed factor (0.5-2.0).
    • seed: (Optional) Integer seed, -1 for random.
  • /tts (POST): Custom endpoint with full control.
    • text: Target text.
    • voice_mode: 'dialogue', 'single_s1', 'single_s2', 'clone', 'predefined'.
    • clone_reference_filename: Filename in ./reference_audio (for clone) or ./voices (for predefined).
    • transcript: (Optional, Clone Mode Only) Explicit transcript text to override file/Whisper lookup.
    • output_format: 'opus' or 'wav'.
    • max_tokens: (Optional) Max tokens per chunk.
    • cfg_scale, temperature, top_p, cfg_filter_top_k: Generation parameters.
    • speed_factor: Playback speed factor (0.5-2.0).
    • seed: (Optional) Integer seed, -1 for random.
    • split_text: (Optional) Boolean, enable/disable chunking (default: True).
    • chunk_size: (Optional) Integer, target chunk size (default: 120).
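
As a concrete example, the hypothetical Python snippet below calls the OpenAI-compatible endpoint with the parameters listed above and saves the result as a WAV file. It assumes the server runs on the default port 8003 and that the requests package is installed; adapt the host, port, voice value, and HTTP client to your setup.

# Hypothetical client for the OpenAI-compatible endpoint (assumes `pip install requests`)
import requests

payload = {
    "input": "[S1] Hello from the Dia TTS Server! [S2] It works over plain HTTP.",
    "voice": "dialogue",   # or a predefined/reference filename such as "some_voice.wav"
    "response_format": "wav",
    "speed": 1.0,
    "seed": 42,            # fixed integer for reproducible output; -1 for random
}

response = requests.post("http://localhost:8003/v1/audio/speech", json=payload, timeout=300)
response.raise_for_status()

with open("speech.wav", "wb") as f:
    f.write(response.content)
print(f"Saved {len(response.content)} bytes to speech.wav")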

🔍 Troubleshooting

  • CUDA Not Available / Slow: Check NVIDIA drivers (nvidia-smi), ensure correct CUDA-enabled PyTorch is installed (Installation Step 4).
  • VRAM Out of Memory (OOM):
    • Ensure you are using the BF16 model (dia-v0_1_bf16.safetensors in config.yaml) if VRAM is limited (~7GB needed).
    • Close other GPU-intensive applications. VRAM optimizations and leak fixes have significantly reduced requirements.
    • If processing very long text even with chunking, try reducing chunk_size (e.g., 100).
  • CUDA Out of Memory (OOM) During Startup: This can happen due to temporary overhead. The server loads weights to CPU first to mitigate this. If it persists, check VRAM usage (nvidia-smi), ensure BF16 model is used, or try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True environment variable before starting.
  • Import Errors (dac, tqdm, yaml, whisper, parselmouth): Activate venv, run pip install -r requirements.txt. Ensure descript-audio-codec installed correctly.
  • libsndfile / ffmpeg Error (Linux): Run sudo apt install libsndfile1 ffmpeg.
  • Model Download Fails (Dia or Whisper): Check internet, config.yaml settings (model.repo_id, model.weights_filename, model.whisper_model_name), Hugging Face status, cache path permissions (paths.model_cache).
  • Voice Cloning Fails / Poor Quality:
    • Ensure an accurate .txt transcript exists alongside the reference audio in ./reference_audio. Format: [S1] text... or [S1] text... [S2] text.... This is the most reliable method. (An example transcript appears after this troubleshooting list.)
    • Whisper fallback is experimental and may be inaccurate.
    • Use clean, clear reference audio (5-20s).
    • Check server logs for specific errors during _prepare_cloning_inputs.
  • Permission Errors (Saving Files/Config): Check write permissions for paths.output, paths.reference_audio, paths.voices, paths.model_cache (for Whisper transcript saves), and config.yaml.
  • UI Issues / Settings Not Saving: Clear browser cache/local storage. Check developer console (F12) for JS errors. Ensure config.yaml is writable by the server process.
  • Inconsistent Voice with Chunking: Use "Predefined Voices" or "Voice Cloning" mode. If using "Random/Dialogue" mode with splitting, use a fixed integer seed (not -1) for consistency across chunks. The UI provides a warning otherwise.
  • Port Conflict (Address already in use / Errno 98): Another process is using the port (default 8003). Stop the other process or change the server.port in config.yaml (requires restart).
    • Explanation: This usually happens if a previous server instance didn't shut down cleanly or another application is bound to the same port.
    • Linux: Find/kill process: sudo lsof -i:PORT | grep LISTEN | awk '{print $2}' | xargs kill -9 (Replace PORT, e.g., 8003).
    • Windows: Find/kill process: for /f "tokens=5" %i in ('netstat -ano ^| findstr :PORT') do taskkill /F /PID %i (Replace PORT, e.g., 8003). Use with caution.
  • Generation Cancel Button: This is a "UI Cancel" - it stops the frontend from waiting but doesn't instantly halt ongoing backend model inference. Clicking Generate again cancels the previous UI wait.
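
For reference, a cloning transcript following the format mentioned under "Voice Cloning Fails / Poor Quality" might look like the hypothetical example below, assuming the transcript shares the reference file's base name (e.g., my_voice.wav alongside my_voice.txt):

[S1] This is a short, clean sample of my voice recorded for cloning. I am speaking clearly and at a natural pace.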

Selecting GPUs on Multi-GPU Systems

Set the CUDA_VISIBLE_DEVICES environment variable before running python server.py to specify which GPU(s) PyTorch should see. The server uses the first visible one (cuda:0).

  • Example (Use only physical GPU 1):

    • Linux/macOS: CUDA_VISIBLE_DEVICES="1" python server.py
    • Windows CMD: set CUDA_VISIBLE_DEVICES=1 && python server.py
    • Windows PowerShell: $env:CUDA_VISIBLE_DEVICES="1"; python server.py
  • Example (Use physical GPUs 6 and 7 - server uses GPU 6):

    • Linux/macOS: CUDA_VISIBLE_DEVICES="6,7" python server.py
    • Windows CMD: set CUDA_VISIBLE_DEVICES=6,7 && python server.py
    • Windows PowerShell: $env:CUDA_VISIBLE_DEVICES="6,7"; python server.py

Note: CUDA_VISIBLE_DEVICES selects GPUs; it does not fix OOM errors if the chosen GPU lacks sufficient memory.

🀝 Contributing

Contributions are welcome! Please feel free to open an issue to report bugs or suggest features, or submit a Pull Request for improvements.

📜 License

This project is licensed under the MIT License.

You can find it here: https://opensource.org/licenses/MIT

🙏 Acknowledgements