Deploying a Chinese-Optimized VITS2 Model on Windows
This guide details the setup process for a Chinese-optimized VITS2 model on a Windows 11 system with an NVIDIA 2080 Ti GPU. The process involves environment configuration, model acquisition, dataset preparation, and training execution.
Environment Setup
A Python 3.9 environment is recommended to avoid dependency conflicts. After setting up the environment, install PyTorch from the official website, selecting the appropriate command for your CUDA version.
Note on PyTorch Installation: If you encounter a TBB-related error during PyTorch installation, deactivate your Conda environment and run conda uninstall TBB before retrying.
Install FFmpeg and the remaining project dependencies. Note that FFmpeg is a system binary, not a Python package; download it from the official FFmpeg site (or install it via a package manager) and ensure it is on your PATH. Then install the Python requirements:
pip install -r requirements.txt
Ensure all packages install without errors.
Model Acquisition
Clone the target repository. Within the project directory, create a new folder under Data/ (e.g., MySpeaker). Inside this folder, create the following subdirectories: configs, models, raw, and wavs. Copy the config.json file from the project's root config folder into your new configs directory.
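The folder layout above can be created in one step. A minimal sketch, assuming the project root is the working directory and the speaker folder is named MySpeaker (substitute your own name):

```python
import os

# Hypothetical speaker name -- replace with your own folder name.
SPEAKER = "MySpeaker"

# Create Data/<speaker>/{configs,models,raw,wavs}
for sub in ("configs", "models", "raw", "wavs"):
    os.makedirs(os.path.join("Data", SPEAKER, sub), exist_ok=True)
```

Remember to copy config.json into the new configs directory afterwards, as described above.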
Download the required pre-trained models. Since Hugging Face may be inaccessible, use a mirror by replacing huggingface.co with hf-mirror.com in the download URLs. Replace MySpeaker with your folder name.
# Download BERT model
wget -P bert/Erlangshen-MegatronBert-1.3B-Chinese/ https://hf-mirror.com/IDEA-CCNL/Erlangshen-MegatronBert-1.3B/resolve/main/pytorch_model.bin
# Download WavLM model
wget -P slm/wavlm-base-plus/ https://hf-mirror.com/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
# Download emotion models
wget -P emotional/clap-htsat-fused/ https://hf-mirror.com/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://hf-mirror.com/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
# Download pre-trained VITS2 checkpoints
wget -P Data/MySpeaker/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/D_0.pth
wget -P Data/MySpeaker/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/G_0.pth
wget -P Data/MySpeaker/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/WD_0.pth
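After downloading, it helps to confirm that every checkpoint landed where the scripts expect it before starting a long run. A small sketch, using the paths assumed by this guide (adjust MySpeaker to your folder name):

```python
import os

# Paths assumed by this guide; adjust MySpeaker to your folder name.
required = [
    "bert/Erlangshen-MegatronBert-1.3B-Chinese/pytorch_model.bin",
    "slm/wavlm-base-plus/pytorch_model.bin",
    "emotional/clap-htsat-fused/pytorch_model.bin",
    "emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/pytorch_model.bin",
    "Data/MySpeaker/models/D_0.pth",
    "Data/MySpeaker/models/G_0.pth",
    "Data/MySpeaker/models/WD_0.pth",
]

def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]

for path in missing_files(required):
    print(f"Missing: {path}")
```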
Configuration
Modify two key configuration files: config.yml in the project root and config.json in your Data/MySpeaker/configs/ folder.
In config.yml:
- Set dataset_path to your data folder (e.g., "./Data/MySpeaker").
- Adjust train_ms.num_workers based on your CPU cores (e.g., 4).
- Verify the in_dir and out_dir paths.
- Set the webui.model path to your target generator checkpoint (e.g., "models/G_10050.pth").
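Taken together, the relevant config.yml entries might look like the following. The values are illustrative, and the exact nesting of in_dir/out_dir under a resample section is an assumption; match the structure of the stock config.yml shipped with the project:

```yaml
dataset_path: "./Data/MySpeaker"
resample:
  in_dir: "raw"
  out_dir: "wavs"
train_ms:
  num_workers: 4
webui:
  model: "models/G_10050.pth"
```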
In Data/MySpeaker/configs/config.json:
- Update the "training_files" and "validation_files" paths.
- Set "epochs" (e.g., 1000).
- Configure "batch_size" based on your GPU memory (e.g., 10 for ~14 GB usage).
- Define the speaker in "spk2id" (e.g., { "MySpeaker": 0 }).
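A matching excerpt of config.json might look like this. The values are illustrative, and any field names beyond those listed above should be taken from the stock config.json rather than this sketch:

```json
{
  "train": {
    "epochs": 1000,
    "batch_size": 10
  },
  "data": {
    "training_files": "Data/MySpeaker/train.list",
    "validation_files": "Data/MySpeaker/val.list",
    "spk2id": {
      "MySpeaker": 0
    }
  }
}
```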
Dataset Preparation
Place your raw .wav audio files into the Data/MySpeaker/raw/ directory.
1. Audio Slicing
The provided audio_slicer.py may need modification to process all files in a directory. The following script slices all .wav files in the raw folder.
import os
import librosa
import soundfile
import yaml
from slicer2 import Slicer

with open('config.yml', mode="r", encoding="utf-8") as f:
    configyml = yaml.load(f, Loader=yaml.FullLoader)

# Use only the folder name, so "./Data/MySpeaker" becomes "MySpeaker"
speaker_name = os.path.basename(configyml["dataset_path"])
raw_path = os.path.join(configyml["dataset_path"], "raw")

file_list = os.listdir(raw_path)
slice_index = 0
for audio_file in file_list:
    print(f"Processing: {audio_file}")
    audio_data, sample_rate = librosa.load(os.path.join(raw_path, audio_file), sr=None, mono=False)
    slicer = Slicer(
        sr=sample_rate,
        threshold=-40,      # silence threshold in dB
        min_length=2000,    # minimum clip length in ms
        min_interval=300,   # minimum silence interval in ms
        hop_size=10,
        max_sil_kept=500    # maximum silence kept around clips in ms
    )
    audio_segments = slicer.slice(audio_data)
    for segment in audio_segments:
        # librosa yields (channels, samples); soundfile expects (samples, channels)
        if len(segment.shape) > 1:
            segment = segment.T
        output_name = os.path.join(raw_path, f'{speaker_name}_{slice_index}.wav')
        soundfile.write(output_name, segment, sample_rate)
        slice_index += 1
    # Remove the original long audio file
    os.remove(os.path.join(raw_path, audio_file))
Run the script:
python audio_slicer.py
2. Automatic Transcription
Install Whisper for automatic speech recognition:
pip install git+https://github.com/openai/whisper.git
pip install zhconv
Use the following script (short_audio_transcribe.py) to generate transcriptions and convert Traditional Chinese to Simplified.
import os
import argparse
import torch
import whisper
import zhconv
import yaml
from config import config

with open('config.yml', mode="r", encoding="utf-8") as f:
    configyml = yaml.load(f, Loader=yaml.FullLoader)

# Use only the folder name, e.g. "MySpeaker"
speaker_id = os.path.basename(configyml["dataset_path"])

parser = argparse.ArgumentParser()
parser.add_argument("--languages", default="CJ")
parser.add_argument("--whisper_size", default="medium")
args = parser.parse_args()

lang_to_token = {'zh': "ZH|", 'ja': "JP|", 'en': "EN|"}  # default; narrowed below
if args.languages == "CJE":
    lang_to_token = {'zh': "ZH|", 'ja': "JP|", 'en': "EN|"}
elif args.languages == "CJ":
    lang_to_token = {'zh': "ZH|", 'ja': "JP|"}
elif args.languages == "C":
    lang_to_token = {'zh': "ZH|"}

assert torch.cuda.is_available(), "GPU required for Whisper."
model = whisper.load_model(args.whisper_size)

def transcribe_one(audio_path):
    # Load the clip, detect its language, then decode with beam search
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, lang_probs = model.detect_language(mel)
    detected_lang = max(lang_probs, key=lang_probs.get)
    options = whisper.DecodingOptions(beam_size=5)
    result = whisper.decode(model, mel, options)
    return detected_lang, result.text

input_dir = config.resample_config.in_dir
annotation_lines = []
print(f"Speaker: {speaker_id}")
for wav_file in list(os.walk(input_dir))[0][2]:
    try:
        detected_lang, transcribed_text = transcribe_one(os.path.join(input_dir, wav_file))
        if detected_lang not in lang_to_token:
            print(f"Language {detected_lang} not supported. Skipping.")
            continue
        if detected_lang == "zh":
            # Convert Traditional Chinese output to Simplified
            transcribed_text = zhconv.convert(transcribed_text, 'zh-hans')
        line_entry = f"./Data/{speaker_id}/wavs/{wav_file}|{speaker_id}|{lang_to_token[detected_lang]}{transcribed_text}\n"
        annotation_lines.append(line_entry)
    except Exception as error:
        print(f"Error processing {wav_file}: {error}")
        continue

if len(annotation_lines) == 0:
    print("Warning: No valid audio segments found for transcription.")

with open(config.preprocess_text_config.transcription_path, 'w', encoding='utf-8') as f:
    for line in annotation_lines:
        f.write(line)
Execute the transcription:
python short_audio_transcribe.py
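Each line the script writes follows the pattern path|speaker|language-tag|text. A quick sketch of parsing one such line (the transcription text here is a made-up example):

```python
# One annotation line in the format emitted by short_audio_transcribe.py
# (the transcription text is a made-up example).
line = "./Data/MySpeaker/wavs/MySpeaker_0.wav|MySpeaker|ZH|你好，世界。"

# maxsplit=3 keeps any "|" characters inside the transcription intact
wav_path, speaker, lang, text = line.split("|", 3)
print(wav_path)  # ./Data/MySpeaker/wavs/MySpeaker_0.wav
print(speaker)   # MySpeaker
print(lang)      # ZH
```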
3. Audio Resampling
Resample audio to the target 44.1 kHz rate:
python resample.py --sr 44100 --in_dir ./Data/MySpeaker/raw/ --out_dir ./Data/MySpeaker/wavs/
4. Text Preprocessing
Generate the final training and validation lists:
python preprocess_text.py --transcription-path ./Data/MySpeaker/esd.list --train-path ./Data/MySpeaker/train.list --val-path ./Data/MySpeaker/val.list --config-path ./Data/MySpeaker/configs/config.json
Potential Error: If you encounter Resource punkt not found, download the necessary NLTK data:
import nltk
nltk.download('punkt')
5. Feature Extraction
Generate BERT embeddings for the text:
python bert_gen.py --config-path ./Data/MySpeaker/configs/config.json
Generate CLAP audio features:
python clap_gen.py --config-path ./Data/MySpeaker/configs/config.json
Model Training
Begin training with the configured batch size. Monitor GPU memory usage.
python train_ms.py
Troubleshooting: If you see an error like 'HParams' object has no attribute 'MySpeaker', double-check that the "spk2id" field in your config.json exactly matches your speaker folder name.
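A quick way to catch this mismatch before a long training run is to compare the config against the folder name. A minimal sketch; check_spk2id is a hypothetical helper, not part of the project, and it assumes spk2id lives under the "data" section of config.json:

```python
import json

def check_spk2id(config_path, speaker_name):
    """Return True if speaker_name appears in the config's spk2id map
    (assumes spk2id lives under the "data" section)."""
    with open(config_path, encoding="utf-8") as f:
        cfg = json.load(f)
    return speaker_name in cfg["data"]["spk2id"]

# Usage (paths follow the layout from this guide):
# check_spk2id("Data/MySpeaker/configs/config.json", "MySpeaker")
```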
Inference
Before running the web interface, update the model path in config.yml under webui.model to point to your trained checkpoint (e.g., "models/G_10050.pth").
python webui.py