Deploying a Chinese-Optimized VITS2 Model on Windows
This guide details the setup process for a Chinese-optimized VITS2 model on a Windows 11 system with an NVIDIA 2080 Ti GPU. The process involves environment configuration, model acquisition, dataset preparation, and training execution.
Environment Setup
A Python 3.9 environment is recommended to avoid dependency conflicts. After setting up the environment, install PyTorch from the official website, selecting the appropriate command for your CUDA version.
Note on PyTorch Installation: If you encounter a TBB-related error during PyTorch installation, deactivate your Conda environment and run conda uninstall TBB before retrying.
Install FFmpeg and the remaining project dependencies. Note that FFmpeg is a system binary, not a Python package; download it from the official FFmpeg site (or install it via a package manager) and ensure it is on your PATH. Then install the Python requirements:
pip install -r requirements.txt
Ensure all packages install without errors.
Model Acquisition
Clone the target repository. Within the project directory, create a new folder under Data/ (e.g., MySpeaker). Inside this folder, create the following subdirectories: configs, models, raw, and wavs. Copy the config.json file from the project's root config folder into your new configs directory.
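The folder layout above can be created in one step. A minimal sketch, assuming the project root is the working directory and the speaker folder is named MySpeaker (substitute your own name):

```python
import os

# Hypothetical speaker name -- replace with your own folder name.
SPEAKER = "MySpeaker"

# Create Data/<speaker>/{configs,models,raw,wavs}
for sub in ("configs", "models", "raw", "wavs"):
    os.makedirs(os.path.join("Data", SPEAKER, sub), exist_ok=True)
```

Remember to copy config.json into the new configs directory afterwards, as described above.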
Download the required pre-trained models. Since Hugging Face may be inaccessible, use a mirror by replacing huggingface.co with hf-mirror.com in the download URLs. Replace MySpeaker with your folder name.
# Download BERT model
wget -P bert/Erlangshen-MegatronBert-1.3B-Chinese/ https://hf-mirror.com/IDEA-CCNL/Erlangshen-MegatronBert-1.3B/resolve/main/pytorch_model.bin
# Download WavLM model
wget -P slm/wavlm-base-plus/ https://hf-mirror.com/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
# Download emotion models
wget -P emotional/clap-htsat-fused/ https://hf-mirror.com/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://hf-mirror.com/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
# Download pre-trained VITS2 checkpoints
wget -P Data/MySpeaker/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/D_0.pth
wget -P Data/MySpeaker/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/G_0.pth
wget -P Data/MySpeaker/models/ https://hf-mirror.com/v3ucn/Bert-vits2-Extra-Pretrained_models/resolve/main/WD_0.pth
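After downloading, it helps to confirm that every checkpoint landed where the scripts expect it before starting a long run. A small sketch, using the paths assumed by this guide (adjust MySpeaker to your folder name):

```python
import os

# Paths assumed by this guide; adjust MySpeaker to your folder name.
required = [
    "bert/Erlangshen-MegatronBert-1.3B-Chinese/pytorch_model.bin",
    "slm/wavlm-base-plus/pytorch_model.bin",
    "emotional/clap-htsat-fused/pytorch_model.bin",
    "emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/pytorch_model.bin",
    "Data/MySpeaker/models/D_0.pth",
    "Data/MySpeaker/models/G_0.pth",
    "Data/MySpeaker/models/WD_0.pth",
]

def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]

for path in missing_files(required):
    print(f"Missing: {path}")
```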
Configuration
Modify two key configuration files: config.yml in the project root and config.json in your Data/MySpeaker/configs/ folder.
In config.yml:
- Set dataset_path to your data folder (e.g., "./Data/MySpeaker").
- Adjust train_ms.num_workers based on your CPU cores (e.g., 4).
- Verify the in_dir and out_dir paths.
- Set the webui.model path to your target generator checkpoint (e.g., "models/G_10050.pth").
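Taken together, the relevant config.yml entries might look like the following. The values are illustrative, and the exact nesting of in_dir/out_dir under a resample section is an assumption; match the structure of the stock config.yml shipped with the project:

```yaml
dataset_path: "./Data/MySpeaker"
resample:
  in_dir: "raw"
  out_dir: "wavs"
train_ms:
  num_workers: 4
webui:
  model: "models/G_10050.pth"
```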
In Data/MySpeaker/configs/config.json:
- Update the "training_files" and "validation_files" paths.
- Set "epochs" (e.g., 1000).
- Configure "batch_size" based on your GPU memory (e.g., 10 for ~14 GB usage).
- Define the speaker in "spk2id" (e.g., { "MySpeaker": 0 }).
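A matching excerpt of config.json might look like this. The values are illustrative, and any field names beyond those listed above should be taken from the stock config.json rather than this sketch:

```json
{
  "train": {
    "epochs": 1000,
    "batch_size": 10
  },
  "data": {
    "training_files": "Data/MySpeaker/train.list",
    "validation_files": "Data/MySpeaker/val.list",
    "spk2id": {
      "MySpeaker": 0
    }
  }
}
```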
Dataset Preparation
Place your raw .wav audio files into the Data/MySpeaker/raw/ directory.
1. Audio Slicing
The provided audio_slicer.py may need modification to process all files in a directory. The following script slices all .wav files in the raw folder.
import os
import librosa
import soundfile
import yaml
from slicer2 import Slicer

with open('config.yml', mode="r", encoding="utf-8") as f:
    configyml = yaml.load(f, Loader=yaml.FullLoader)

# Use only the folder name, so "./Data/MySpeaker" becomes "MySpeaker"
speaker_name = os.path.basename(configyml["dataset_path"])
raw_path = os.path.join(configyml["dataset_path"], "raw")

file_list = os.listdir(raw_path)
slice_index = 0
for audio_file in file_list:
    print(f"Processing: {audio_file}")
    audio_data, sample_rate = librosa.load(os.path.join(raw_path, audio_file), sr=None, mono=False)
    slicer = Slicer(
        sr=sample_rate,
        threshold=-40,      # silence threshold in dB
        min_length=2000,    # minimum clip length in ms
        min_interval=300,   # minimum silence interval in ms
        hop_size=10,
        max_sil_kept=500    # maximum silence kept around clips in ms
    )
    audio_segments = slicer.slice(audio_data)
    for segment in audio_segments:
        # librosa yields (channels, samples); soundfile expects (samples, channels)
        if len(segment.shape) > 1:
            segment = segment.T
        output_name = os.path.join(raw_path, f'{speaker_name}_{slice_index}.wav')
        soundfile.write(output_name, segment, sample_rate)
        slice_index += 1
    # Remove the original long audio file
    os.remove(os.path.join(raw_path, audio_file))
Run the script:
python audio_slicer.py
2. Automatic Transcription
Install Whisper for automatic speech recognition:
pip install git+https://github.com/openai/whisper.git
pip install zhconv
Use the following script (short_audio_transcribe.py) to generate transcriptions and convert Traditional Chinese to Simplified.
import os
import argparse
import torch
import whisper
import zhconv
import yaml
from config import config

with open('config.yml', mode="r", encoding="utf-8") as f:
    configyml = yaml.load(f, Loader=yaml.FullLoader)

# Use only the folder name, e.g. "MySpeaker"
speaker_id = os.path.basename(configyml["dataset_path"])

parser = argparse.ArgumentParser()
parser.add_argument("--languages", default="CJ")
parser.add_argument("--whisper_size", default="medium")
args = parser.parse_args()

lang_to_token = {'zh': "ZH|", 'ja': "JP|", 'en': "EN|"}  # default; narrowed below
if args.languages == "CJE":
    lang_to_token = {'zh': "ZH|", 'ja': "JP|", 'en': "EN|"}
elif args.languages == "CJ":
    lang_to_token = {'zh': "ZH|", 'ja': "JP|"}
elif args.languages == "C":
    lang_to_token = {'zh': "ZH|"}

assert torch.cuda.is_available(), "GPU required for Whisper."
model = whisper.load_model(args.whisper_size)

def transcribe_one(audio_path):
    # Load the clip, detect its language, then decode with beam search
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, lang_probs = model.detect_language(mel)
    detected_lang = max(lang_probs, key=lang_probs.get)
    options = whisper.DecodingOptions(beam_size=5)
    result = whisper.decode(model, mel, options)
    return detected_lang, result.text

input_dir = config.resample_config.in_dir
annotation_lines = []
print(f"Speaker: {speaker_id}")
for wav_file in list(os.walk(input_dir))[0][2]:
    try:
        detected_lang, transcribed_text = transcribe_one(os.path.join(input_dir, wav_file))
        if detected_lang not in lang_to_token:
            print(f"Language {detected_lang} not supported. Skipping.")
            continue
        if detected_lang == "zh":
            # Convert Traditional Chinese output to Simplified
            transcribed_text = zhconv.convert(transcribed_text, 'zh-hans')
        line_entry = f"./Data/{speaker_id}/wavs/{wav_file}|{speaker_id}|{lang_to_token[detected_lang]}{transcribed_text}\n"
        annotation_lines.append(line_entry)
    except Exception as error:
        print(f"Error processing {wav_file}: {error}")
        continue

if len(annotation_lines) == 0:
    print("Warning: No valid audio segments found for transcription.")

with open(config.preprocess_text_config.transcription_path, 'w', encoding='utf-8') as f:
    for line in annotation_lines:
        f.write(line)
Execute the transcription:
python short_audio_transcribe.py
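Each line the script writes follows the pattern path|speaker|language-tag|text. A quick sketch of parsing one such line (the transcription text here is a made-up example):

```python
# One annotation line in the format emitted by short_audio_transcribe.py
# (the transcription text is a made-up example).
line = "./Data/MySpeaker/wavs/MySpeaker_0.wav|MySpeaker|ZH|你好，世界。"

# maxsplit=3 keeps any "|" characters inside the transcription intact
wav_path, speaker, lang, text = line.split("|", 3)
print(wav_path)  # ./Data/MySpeaker/wavs/MySpeaker_0.wav
print(speaker)   # MySpeaker
print(lang)      # ZH
```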
3. Audio Resampling
Resample audio to the target 44.1 kHz rate:
python resample.py --sr 44100 --in_dir ./Data/MySpeaker/raw/ --out_dir ./Data/MySpeaker/wavs/
4. Text Preprocessing
Generate the final training and validation lists:
python preprocess_text.py --transcription-path ./Data/MySpeaker/esd.list --train-path ./Data/MySpeaker/train.list --val-path ./Data/MySpeaker/val.list --config-path ./Data/MySpeaker/configs/config.json
Potential Error: If you encounter Resource punkt not found, download the necessary NLTK data:
import nltk
nltk.download('punkt')
5. Feature Extraction
Generate BERT embeddings for the text:
python bert_gen.py --config-path ./Data/MySpeaker/configs/config.json
Generate CLAP audio features:
python clap_gen.py --config-path ./Data/MySpeaker/configs/config.json
Model Training
Begin training with the configured batch size. Monitor GPU memory usage.
python train_ms.py
Troubleshooting: If you see an error like 'HParams' object has no attribute 'MySpeaker', double-check that the "spk2id" field in your config.json exactly matches your speaker folder name.
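A quick way to catch this mismatch before a long training run is to compare the config against the folder name. A minimal sketch; check_spk2id is a hypothetical helper, not part of the project, and it assumes spk2id lives under the "data" section of config.json:

```python
import json

def check_spk2id(config_path, speaker_name):
    """Return True if speaker_name appears in the config's spk2id map
    (assumes spk2id lives under the "data" section)."""
    with open(config_path, encoding="utf-8") as f:
        cfg = json.load(f)
    return speaker_name in cfg["data"]["spk2id"]

# Usage (paths follow the layout from this guide):
# check_spk2id("Data/MySpeaker/configs/config.json", "MySpeaker")
```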
Inference
Before running the web interface, update the model path in config.yml under webui.model to point to your trained checkpoint (e.g., "models/G_10050.pth").
python webui.py