Setting Up and Using GPT2-Chinese in Anaconda Environment
Environment Preparation
1. Installing PyTorch
Choose the appropriate version based on your needs. The CPU version is simpler to set up.
# Install PyTorch with conda (replace with your preferred channel if needed)
conda install pytorch torchvision cpuonly -c pytorch
# Alternatively, use pip with a specified source for faster downloads
pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
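To confirm the install succeeded, a quick sanity check (the version string will vary with the build you chose):

```python
# Verify that PyTorch imports and basic tensor math works on CPU
import torch

print(torch.__version__)     # e.g. "1.5.1+cpu" for the pip build above
x = torch.ones(2, 3)
print((x + x).sum().item())  # sum of a 2x3 tensor of twos -> 12.0
```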
2. Configuring Anaconda Sources
To speed up package downloads, configure Anaconda to use mirrors:
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
The configuration file is located at ~/.condarc and can be edited manually.
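After running the commands above, ~/.condarc should look roughly like this (channel order may differ depending on the order you added them):

```yaml
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
show_channel_urls: true
```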
3. Installing TensorFlow (Optional, for Custom Vocab)
If you plan to create a custom vocabulary, install TensorFlow (the vocabulary script depends on it) along with a compatible version of the transformers library:
pip install tensorflow
pip install transformers==2.1.1
Verify the installation:
python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
If you encounter errors like DLL load failed: The specified module could not be found (找不到指定模块), install the Microsoft Visual C++ 2019 redistributable (x64 for 64-bit systems, x86 for 32-bit).
4. Managing Conda Environments
Use these commands to handle environments:
# List existing environments
conda env list
conda info -e
# Create a new environment
conda create -n my_env python=3.8
# Activate/deactivate environments
conda activate my_env
conda deactivate
# Remove an environment
conda env remove -n my_env
# Install packages
conda install numpy=1.21.0
conda remove numpy
# List installed packages
conda list
# Export/import environments
pip freeze > requirements.txt
pip install -r requirements.txt
conda env export --file environment.yml
conda env create -f environment.yml
conda create -n new_env --clone old_env
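After activating an environment, you can confirm which interpreter is actually in use; sys.prefix points at the root of the active environment:

```python
# Print the root directory and version of the currently active Python
import sys

print(sys.prefix)        # path of the active conda env (or base install)
print(sys.version_info)  # should match the python= version requested at creation
```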
5. GitBash Environment Activation
In GitBash, you may face issues activating environments. Use these commands:
# Activate environment
source activate my_env
# Deactivate environment
source deactivate
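If `source activate` still fails, a common workaround is to initialize conda's shell hook in ~/.bashrc so that `conda activate` works directly. The install path below is an assumption (a default Anaconda location); adjust it to wherever conda.exe lives on your system:

```bash
# ~/.bashrc — let Git Bash find conda, then enable `conda activate`/`conda deactivate`
eval "$(/c/Anaconda3/Scripts/conda.exe shell.bash hook)"
```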
Training and Generation
1. Handling Encoding Issues
Make sure your training data (e.g., train.json) is saved as UTF-8 and read with an explicit encoding; on Windows, the default codepage can garble Chinese text. For small datasets, generating a custom vocabulary also reduces garbled output.
2. Custom Vocabulary Generation
The default vocabulary (cache/vocab_small.txt) may cause issues with small datasets. Generate a custom vocabulary:
# Navigate to cache directory
cd cache/
# Generate vocabulary
bash make_vocab.sh
# Adjust vocab_size in config/model_config_small.json based on the word count in vocab_user.txt
3. Training and Generating Text
Run training and generation with the custom vocabulary:
# Training
python train.py --raw --min_length 4 --tokenizer_path cache/vocab_user.txt
# Generating text
python generate.py --length=50 --nsamples=4 --prefix="你好" --fast_pattern --tokenizer_path cache/vocab_user.txt
Note: Training on CPU, even with small datasets, can be time-consuming. Consider using Google Colab with a GPU runtime for faster execution.