Fading Coder

Local Deployment Guide for ChatGLM3-6B Bilingual Language Model

ChatGLM3-6B is a powerful open-source bilingual (Chinese-English) dialogue model based on the General Language Model (GLM) architecture. Developed by Zhipu AI and Tsinghua University, it features 6.2 billion parameters and offers lower deployment requirements compared to larger models. Running this model locally ensures data privacy and full control over the inference environment.

Environment Configuration

Python Environment Setup

ChatGLM3 requires a recent Python interpreter (3.10 is used in this guide). Using a package manager such as Anaconda or Miniconda is recommended to manage isolated environments and avoid conflicts with system-wide libraries.

# Download the Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Execute the installation
bash Miniconda3-latest-Linux-x86_64.sh

# Initialize the shell environment
source ~/.bashrc

Installing Git LFS

Because the pre-trained model weights are large (exceeding 10GB), Git Large File Storage (LFS) is required to clone the repository successfully.

# For RHEL/CentOS systems
sudo yum install git git-lfs -y

# Initialize Git LFS
git lfs install

Repository and Model Installation

Clone the Codebase

First, clone the official implementation from GitHub and set up a dedicated virtual environment.

git clone https://github.com/THUDM/ChatGLM3.git
cd ChatGLM3

# Create a Python 3.10 environment named 'glm-env'
conda create -n glm-env python=3.10 -y
conda activate glm-env

# Install required dependencies
pip install -r requirements.txt

Retrieve Pre-trained Weights

You can download the weights from Hugging Face, or use the ModelScope mirror for faster downloads in some regions.

# Option A: Hugging Face
git clone https://huggingface.co/THUDM/chatglm3-6b

# Option B: ModelScope (Alternative)
git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git

Model Initialization and Usage

Before running the model, locate the script you intend to use (cli_demo.py, web_demo.py, or openai_api.py) and update its model path and device configuration. The standard loading logic looks like this:

from transformers import AutoTokenizer, AutoModel

# Update 'local_path' to your downloaded weights folder
local_path = "./chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)

# Use .cuda() for GPU or .float() for CPU inference
chat_model = AutoModel.from_pretrained(local_path, trust_remote_code=True).half().cuda()
chat_model = chat_model.eval()

Console-Based Interaction (CLI)

To interact with the model via a terminal, modify the cli_demo.py script to point to your local model path and run:

python cli_demo.py

Users can type queries directly. Commands like clear reset the history, and stop exits the session.
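The command handling above can be sketched as a small loop. This is a hypothetical simplification of what cli_demo.py does (the function name repl is invented for illustration); it assumes chat_model and tokenizer were loaded as shown earlier, and that the model exposes the chat(tokenizer, query, history=...) method used by the ChatGLM family, which returns the reply together with the updated history.

```python
def repl(chat_model, tokenizer):
    """Minimal cli_demo-style loop: 'clear' resets history, 'stop' exits."""
    history = []
    while True:
        query = input("User: ").strip()
        if query.lower() == "stop":       # exit the session
            break
        if query.lower() == "clear":      # reset conversation history
            history = []
            print("History cleared.")
            continue
        # chat() returns the reply plus the updated multi-turn history
        response, history = chat_model.chat(tokenizer, query, history=history)
        print(f"ChatGLM: {response}")
```

Because the history list is threaded back into every call, the model keeps multi-turn context until the user types clear.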

Web Interface

ChatGLM3 includes a Gradio or Streamlit-based web interface for a more user-friendly experience. Update the path in web_demo.py and execute:

python web_demo.py

The script will print a local URL (e.g., http://127.0.0.1:8501) that can be opened in any browser.

Deploying as an OpenAI-Compatible API

For integration with third-party tools like ChatGPT-Next-Web, use the provided OpenAI format API server. Modify the model path in openai_api.py and start the service:

python openai_api.py

Once the server is running (defaulting to port 8000), you can configure your frontend applications to point to your server IP. In ChatGPT-Next-Web, set the API endpoint to http://<your-ip>:8000 and select chatglm3 as the model name.
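A client can talk to this server with nothing but the standard library. The sketch below is illustrative, not part of the repository: the endpoint path, port, and model name are assumptions that follow OpenAI API conventions — verify them against openai_api.py on your deployment.

```python
import json
import urllib.request

# Assumed defaults; confirm the route and port in openai_api.py.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"

def build_payload(prompt: str, temperature: float = 0.8) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": "chatglm3-6b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat_completion(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```

With the server up, print(chat_completion("Hello")) should return a model reply; any OpenAI-compatible client library can be pointed at the same URL.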

Hardware Acceleration Notes

  • GPU Deployment: Requires NVIDIA drivers and CUDA. Ensure you use .cuda() in the loading script. The 6B model typically requires ~13GB of VRAM in FP16 mode.
  • CPU Deployment: If VRAM is insufficient, use .float() or .quantize(bits=4).float(). Note that CPU inference is significantly slower.
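The trade-offs above can be condensed into a small decision helper. This is a hypothetical sketch (pick_load_strategy is an invented name), and the VRAM thresholds are the approximate figures cited for ChatGLM-family models: roughly 13 GB for FP16 and 6 GB for 4-bit quantization.

```python
def pick_load_strategy(cuda_available: bool, vram_gb: float) -> str:
    """Return the method chain to append after AutoModel.from_pretrained(...).

    Thresholds are approximate: ~13 GB for FP16 on GPU, ~6 GB for
    4-bit quantization on GPU; anything less falls back to CPU FP32.
    """
    if cuda_available and vram_gb >= 13:
        return ".half().cuda()"        # FP16 on GPU, full quality
    if cuda_available and vram_gb >= 6:
        return ".quantize(4).cuda()"   # 4-bit quantized GPU inference
    return ".float()"                  # CPU inference (significantly slower)
```

For example, a machine with an 8 GB GPU would load the model with .quantize(4).cuda(), trading some output quality for a much smaller memory footprint.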