Setting Up and Using GPT2-Chinese in Anaconda Environment
Environment Preparation
1. Installing PyTorch
Choose the appropriate version based on your needs. The CPU version is simpler to set up.
# Install PyTorch with conda (replace with your preferred channel if needed)
conda install pytorch torchvision cpuonly -c pytorch
# Alternatively, use pip with a specified source for faster downloads
pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
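To confirm the install succeeded, a quick sanity check (the version string will vary with the build you chose):

```python
# Verify that PyTorch imports and basic tensor math works on CPU
import torch

print(torch.__version__)     # e.g. "1.5.1+cpu" for the pip build above
x = torch.ones(2, 3)
print((x + x).sum().item())  # sum of a 2x3 tensor of twos -> 12.0
```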
2. Configuring Anaconda Sources
To speed up package downloads, configure Anaconda to use mirrors:
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
The configuration file is located at ~/.condarc and can be edited manually.
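After running the commands above, ~/.condarc should look roughly like this (channel order may differ depending on the order you added them):

```yaml
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
show_channel_urls: true
```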
3. Installing TensorFlow (Optional, for Custom Vocab)
If you plan to create a custom vocabulary, install TensorFlow (the vocabulary script depends on it) along with a compatible version of the transformers library:
pip install tensorflow
pip install transformers==2.1.1
Verify the installation:
python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
If you encounter errors like DLL load failed: The specified module could not be found (找不到指定模块), install the Microsoft Visual C++ 2019 redistributable (x64 for 64-bit systems, x86 for 32-bit).
4. Managing Conda Environments
Use these commands to handle environments:
# List existing environments
conda env list
conda info -e
# Create a new environment
conda create -n my_env python=3.8
# Activate/deactivate environments
conda activate my_env
conda deactivate
# Remove an environment
conda env remove -n my_env
# Install packages
conda install numpy=1.21.0
conda remove numpy
# List installed packages
conda list
# Export/import environments
pip freeze > requirements.txt
pip install -r requirements.txt
conda env export --file environment.yml
conda env create -f environment.yml
conda create -n new_env --clone old_env
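After activating an environment, you can confirm which interpreter is actually in use; sys.prefix points at the root of the active environment:

```python
# Print the root directory and version of the currently active Python
import sys

print(sys.prefix)        # path of the active conda env (or base install)
print(sys.version_info)  # should match the python= version requested at creation
```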
5. GitBash Environment Activation
In GitBash, you may face issues activating environments. Use these commands:
# Activate environment
source activate my_env
# Deactivate environment
source deactivate
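If `source activate` still fails, a common workaround is to initialize conda's shell hook in ~/.bashrc so that `conda activate` works directly. The install path below is an assumption (a default Anaconda location); adjust it to wherever conda.exe lives on your system:

```bash
# ~/.bashrc — let Git Bash find conda, then enable `conda activate`/`conda deactivate`
eval "$(/c/Anaconda3/Scripts/conda.exe shell.bash hook)"
```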
Training and Generation
1. Handling Encoding Issues
Make sure your training data (e.g., train.json) is saved as UTF-8 and read with an explicit encoding; on Windows, the default codepage can garble Chinese text. For small datasets, generating a custom vocabulary also reduces garbled output.
2. Custom Vocabulary Generation
The default vocabulary (cache/vocab_small.txt) may cause issues with small datasets. Generate a custom vocabulary:
# Navigate to cache directory
cd cache/
# Generate vocabulary
bash make_vocab.sh
# Adjust vocab_size in config/model_config_small.json based on the word count in vocab_user.txt
3. Training and Generating Text
Run training and generation with the custom vocabulary:
# Training
python train.py --raw --min_length 4 --tokenizer_path cache/vocab_user.txt
# Generating text
python generate.py --length=50 --nsamples=4 --prefix="你好" --fast_pattern --tokenizer_path cache/vocab_user.txt
Note: Training on CPU, even with small datasets, can be time-consuming. Consider using Google Colab with a GPU runtime for faster execution.