Local Deployment Guide for ChatGLM3-6B Bilingual Language Model
ChatGLM3-6B is a powerful open-source bilingual (Chinese-English) dialogue model based on the General Language Model (GLM) architecture. Developed by Zhipu AI and Tsinghua University, it features 6.2 billion parameters and offers lower deployment requirements compared to larger models. Running this model locally ensures data privacy and full control over the inference environment.
Environment Configuration
Python Environment Setup
ChatGLM3 requires a recent Python release; this guide uses Python 3.10, which the project recommends. A package manager like Anaconda or Miniconda is recommended to manage environments and avoid conflicts with system-wide libraries.
# Download the Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Execute the installation
bash Miniconda3-latest-Linux-x86_64.sh
# Initialize the shell environment
source ~/.bashrc
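To confirm the installation succeeded and conda is on your PATH:
# Should print a version string, e.g. 'conda 24.x.x'
conda --version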
Installing Git LFS
Because the pre-trained model weights are large (exceeding 10GB), Git Large File Storage (LFS) is required to clone the repository successfully.
# For RHEL/CentOS systems
sudo yum install git git-lfs -y
# Initialize Git LFS
git lfs install
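The yum command above targets RHEL/CentOS; on Debian or Ubuntu systems the equivalent is:
# For Debian/Ubuntu systems
sudo apt-get install git git-lfs -y
git lfs install
A successful initialization prints "Git LFS initialized."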
Repository and Model Installation
Clone the Codebase
First, clone the official implementation from GitHub and set up a dedicated virtual environment.
git clone https://github.com/THUDM/ChatGLM3.git
cd ChatGLM3
# Create a Python 3.10 environment named 'glm-env'
conda create -n glm-env python=3.10 -y
conda activate glm-env
# Install required dependencies
pip install -r requirements.txt
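If you plan to run on a GPU, it is worth confirming that the installed PyTorch build can see your CUDA device before downloading the weights:
# Prints True if a CUDA-capable GPU is visible to PyTorch
python -c "import torch; print(torch.cuda.is_available())"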
Retrieve Pre-trained Weights
You can download the weights from Hugging Face or use the ModelScope mirror for faster speeds in specific regions.
# Option A: Hugging Face
git clone https://huggingface.co/THUDM/chatglm3-6b
# Option B: ModelScope (Alternative)
git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git
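Because the weight shards are fetched through Git LFS, verify that they downloaded fully rather than as small pointer stubs (the exact shard filenames may vary between mirrors):
cd chatglm3-6b
# Weight shards should each be several GB; tiny ~130-byte files are unresolved LFS pointers
ls -lh
# Fetch the real LFS objects if anything looks truncated
git lfs pull
cd ..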
Model Initialization and Usage
Before running the model, locate the script you intend to use (cli_demo.py, web_demo.py, or openai_api.py). In each one, you must update the model path and device configuration. The standard loading logic looks like this:
from transformers import AutoTokenizer, AutoModel
# Update 'local_path' to your downloaded weights folder
local_path = "./chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
# Use .cuda() for GPU or .float() for CPU inference
chat_model = AutoModel.from_pretrained(local_path, trust_remote_code=True).half().cuda()
chat_model = chat_model.eval()
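Once the weights are loaded, a quick sanity check is the chat method exposed by the model's remote code, reusing the chat_model and tokenizer objects from above:
# Single-turn test; 'history' carries multi-turn conversation state
response, history = chat_model.chat(tokenizer, "Hello! Please introduce yourself.", history=[])
print(response)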
Console-Based Interaction (CLI)
To interact with the model via a terminal, modify the cli_demo.py script to point to your local model path and run:
python cli_demo.py
Users can type queries directly at the prompt. Entering clear resets the conversation history, and entering stop exits the session.
Web Interface
ChatGLM3 includes Gradio- and Streamlit-based web interfaces for a more user-friendly experience. Update the path in web_demo.py and execute:
python web_demo.py
The script will provide a local URL (e.g., http://127.0.0.1:8501) that can be opened in any browser.
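By default the demo binds to the local machine only. If you need to reach it from elsewhere on your network, the Gradio-based script can be told to listen on all interfaces; a minimal sketch, assuming the script builds a Gradio object named demo (check web_demo.py for the actual variable name and port):
# Bind to all interfaces instead of 127.0.0.1
demo.launch(server_name="0.0.0.0", server_port=7860)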
Deploying as an OpenAI-Compatible API
For integration with third-party tools like ChatGPT-Next-Web, use the provided OpenAI format API server. Modify the model path in openai_api.py and start the service:
python openai_api.py
Once the server is running (defaulting to port 8000), you can configure your frontend applications to point to your server IP. In ChatGPT-Next-Web, set the API endpoint to http://<your-ip>:8000 and select chatglm3 as the model name.
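You can also exercise the endpoint directly. The sketch below assumes the server implements the standard OpenAI chat-completions route, which is what the demo script mimics:
import requests

# Assumes openai_api.py is running locally on its default port 8000
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "chatglm3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])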
Hardware Acceleration Notes
- GPU Deployment: Requires NVIDIA drivers and CUDA. Ensure you use .cuda() in the loading script. The 6B model typically requires ~13GB of VRAM in FP16 mode.
- CPU Deployment: If VRAM is insufficient, use .float() or .quantize(bits=4).float(). Note that CPU inference is significantly slower. Loading variants for both modes are sketched below.
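For reference, here is how the loading line changes with the target hardware. This is a sketch based on the remote-code API the model ships with (quantize() is provided by the ChatGLM model code); the memory figures are approximate:
from transformers import AutoModel

local_path = "./chatglm3-6b"
# Pick exactly one of the loading styles below.
# FP16 on GPU (~13 GB VRAM):
model = AutoModel.from_pretrained(local_path, trust_remote_code=True).half().cuda()
# INT4-quantized on GPU (on the order of 6 GB VRAM):
# model = AutoModel.from_pretrained(local_path, trust_remote_code=True).quantize(4).cuda()
# FP32 on CPU (slow; needs substantial system RAM):
# model = AutoModel.from_pretrained(local_path, trust_remote_code=True).float()
model = model.eval()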