Fading Coder

Setting Up and Using GPT2-Chinese in an Anaconda Environment

Tech May 13 4

Environment Preparation

1. Installing PyTorch

Choose the CPU or GPU build based on your hardware. The CPU build is simpler to set up because it does not depend on a CUDA installation.

# Install the CPU-only build of PyTorch from the official pytorch channel
conda install pytorch torchvision cpuonly -c pytorch

# Alternatively, install pinned CPU wheels with pip; -f tells pip where to find the PyTorch wheels
pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
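
To confirm which build ended up in the active environment, a short check along the following lines may help. This is a minimal sketch: the helper name describe_torch is hypothetical, and it relies only on torch.__version__ and torch.cuda.is_available().

```python
import importlib.util


def describe_torch():
    """Return basic facts about the installed PyTorch build, or None if it is absent."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment

    import torch

    return {
        # Version string of the installed build, e.g. "1.5.1+cpu" for the pinned CPU wheel
        "version": torch.__version__,
        # False on a CPU-only build or when no usable GPU is present
        "cuda_available": torch.cuda.is_available(),
    }
```

Calling print(describe_torch()) inside the environment shows the version string and whether CUDA is usable; a result of None means the install did not land in this environment.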

2. Configuring Anaconda Sources

To speed up package downloads, configure Anaconda to use mirrors:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes

The configuration file is located at ~/.condarc and can be edited manually.

3. Installing TensorFlow (Optional for Custom Vocab)

If you plan to create a custom vocabulary, install TensorFlow in addition to the pinned transformers version:

pip install transformers==2.1.1
pip install tensorflow

Verify the installation:

python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

If you encounter an error such as "DLL load failed: The specified module could not be found" (找不到指定模块), install the Microsoft Visual C++ 2019 Redistributable (x64 for 64-bit systems, x86 for 32-bit systems).

4. Managing Conda Environments

Use these commands to handle environments:

# List existing environments
conda env list
conda info -e

# Create a new environment
conda create -n my_env python=3.8

# Activate/deactivate environments
conda activate my_env
conda deactivate

# Remove an environment
conda env remove -n my_env

# Install or remove packages
conda install numpy=1.21.0
conda remove numpy

# List installed packages
conda list

# Export/import environments
pip freeze > requirements.txt
pip install -r requirements.txt
conda env export --file environment.yml
conda env create -f environment.yml
conda create -n new_env --clone old_env

5. Git Bash Environment Activation

In Git Bash, conda activate may fail to switch environments. Use these commands instead:

# Activate environment
source activate my_env

# Deactivate environment
source deactivate

Training and Generation

1. Handling Encoding Issues

Save your training data (e.g., train.json) as UTF-8 and read it with an explicit UTF-8 encoding, so that it is not decoded with the system default code page. For small datasets, generating a custom vocabulary also reduces garbled output.
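
One way to rule out encoding problems is to load and re-save the file with the encoding stated explicitly. The sketch below assumes train.json holds a JSON list of article strings; the helper names load_articles and save_articles are hypothetical and are not part of GPT2-Chinese.

```python
import json
from pathlib import Path


def load_articles(path):
    """Read a JSON list of strings, decoding the file explicitly as UTF-8."""
    # "utf-8-sig" also accepts files that start with a byte order mark (BOM),
    # which some Windows editors add when saving as UTF-8.
    text = Path(path).read_text(encoding="utf-8-sig")
    articles = json.loads(text)
    if not isinstance(articles, list):
        raise ValueError("expected a JSON list of article strings")
    return articles


def save_articles(articles, path):
    """Write the list back as UTF-8 without a BOM, keeping Chinese text unescaped."""
    Path(path).write_text(
        json.dumps(articles, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

Running save_articles(load_articles("train.json"), "train.json") normalizes the file in place. If load_articles raises a UnicodeDecodeError, the file was saved in a different encoding and needs to be converted first.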

2. Custom Vocabulary Generation

The default vocabulary (cache/vocab_small.txt) may cause issues with small datasets. Generate a custom vocabulary:

# Navigate to cache directory
cd cache/

# Generate vocabulary
bash make_vocab.sh

# Set vocab_size in config/model_config_small.json to the number of entries (lines) in vocab_user.txt
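
The vocab_size adjustment can also be scripted rather than edited by hand. The following is a rough sketch that assumes the vocabulary file holds one token per line and that the model config is a plain JSON object with a "vocab_size" key; both helper names are hypothetical.

```python
import json
from pathlib import Path


def count_vocab_entries(vocab_path):
    """Count the lines in the vocabulary file (assumed: one token per line)."""
    with open(vocab_path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def set_vocab_size(config_path, vocab_size):
    """Rewrite the "vocab_size" field of a JSON model config in place."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["vocab_size"] = vocab_size  # all other fields are left unchanged
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

From the repository root, set_vocab_size("config/model_config_small.json", count_vocab_entries("cache/vocab_user.txt")) applies the change. Keeping the two numbers in sync matters because the embedding table is sized from vocab_size, so a token index beyond it would fail at training time.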

3. Training and Generating Text

Run training and generation with the custom vocabulary:

# Training
python train.py --raw --min_length 4 --tokenizer_path cache/vocab_user.txt

# Generating text
python generate.py --length=50 --nsamples=4 --prefix="你好" --fast_pattern --tokenizer_path cache/vocab_user.txt

Note: Training on CPU, even with small datasets, can be time-consuming. Consider using Colab for faster execution.

Tags: pytorch
