High-Performance LLM Deployment and Quantization Guide with LMDeploy
Environment Configuration
To begin working with LMDeploy, a compatible development environment must be established. This process involves setting up a Conda environment and installing the necessary dependencies.
Creating the Conda Environment
Initialize a new Conda environment named llm-inference with Python 3.10. This isolates the dependencies required for the deployment tools.
conda create -n llm-inference -y python=3.10
Once the environment is ready, activate it:
conda activate llm-inference
Installing LMDeploy
Install the LMDeploy package. Pinning the version keeps the CLI flags and APIs consistent with this guide; version 0.3.0 is used throughout.
pip install lmdeploy[all]==0.3.0
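To confirm the installation succeeded, print the package version (this assumes lmdeploy exposes the conventional __version__ attribute, as most pip packages do):
python -c "import lmdeploy; print(lmdeploy.__version__)"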
Model Inference Comparison
Understanding the difference between standard Hugging Face inference and the optimized TurboMind engine is crucial for performance tuning.
Native Hugging Face Inference
Models hosted on platforms like Hugging Face or OpenXLab are typically stored in HF format. To test baseline performance, use the transformers library.
Create a file named native_hf_test.py:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Define model path
model_path = "/root/models/internlm2-chat-1_8b"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
llm = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True
).cuda()
llm.eval()
# First interaction
query = "hello"
print(f"[User]: {query}")
output, history = llm.chat(tokenizer, query, history=[])
print(f"[Assistant]: {output}")
# Second interaction with context
follow_up = "please provide three suggestions about time management"
print(f"[User]: {follow_up}")
output, history = llm.chat(tokenizer, follow_up, history=history)
print(f"[Assistant]: {output}")
Execute the script to observe the inference speed:
python native_hf_test.py
Optimized Inference with LMDeploy
LMDeploy utilizes the TurboMind engine, which supports continuous batching and optimized KV cache management. It automatically converts HF models to TurboMind format upon first run.
To interact with the model via CLI:
lmdeploy chat /root/models/internlm2-chat-1_8b
Latency should be noticeably lower than with the native transformers approach. Type exit to terminate the session.
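The HF-to-TurboMind conversion can also be run ahead of time with the convert subcommand, so the first chat session does not pay the conversion cost. The model-name argument and --dst-path flag below are a hedged sketch of the 0.3.0 CLI; consult lmdeploy convert --help for the exact names your version accepts:
lmdeploy convert internlm2 /root/models/internlm2-chat-1_8b \
    --dst-path /root/models/internlm2-chat-1_8b-turbomind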
Quantization Strategies
Quantization reduces model size and memory bandwidth requirements, which is vital for memory-bound scenarios typical in LLM decoding.
Key Concepts
- Compute-bound: Performance limited by arithmetic throughput. Optimized via faster hardware or more efficient kernels.
- Memory-bound: Performance limited by memory bandwidth, i.e., how fast data moves between VRAM and compute units. Optimized via quantization (reducing data size) and KV cache management.
LMDeploy supports KV8 quantization (INT8 for the KV cache) and W4A16 quantization (INT4 weights with FP16 activations).
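A back-of-the-envelope calculation shows why the KV cache dominates memory at long sequence lengths, and what halving its precision saves. The model dimensions below are hypothetical, chosen only for illustration, not the exact internlm2-chat-1_8b configuration:
# Rough KV-cache footprint: K and V tensors per layer, each storing
# num_kv_heads * head_dim values for every token in the sequence.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical dimensions for illustration only.
layers, kv_heads, head_dim, seq_len = 24, 8, 128, 4096
for label, nbytes in [("FP16", 2), ("KV8 (INT8)", 1)]:
    size = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, nbytes)
    print(f"{label}: {size / 2**20:.0f} MiB for one {seq_len}-token sequence")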
Managing KV Cache
The --cache-max-entry-count flag controls the proportion of free GPU memory allocated to KV cache. The default is 0.8.
To reduce memory footprint:
lmdeploy chat /root/models/internlm2-chat-1_8b --cache-max-entry-count 0.5
Setting this value very low (e.g., 0.01) minimizes VRAM usage, but a smaller cache fits fewer tokens, which can reduce throughput for long contexts or concurrent requests.
W4A16 Weight Quantization
Using the AWQ algorithm, weights can be quantized to 4-bit. This requires the einops library.
pip install einops==0.7.0
Run the quantization process:
lmdeploy lite auto_awq \
/root/models/internlm2-chat-1_8b \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 1024 \
--w-bits 4 \
--w-group-size 128 \
--work-dir /root/models/internlm2-chat-1_8b-4bit
Once completed, launch the quantized model:
lmdeploy chat /root/models/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.01
This configuration significantly lowers VRAM consumption.
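The quantized model can also be loaded from Python by passing the model format through TurbomindEngineConfig. A minimal sketch mirroring the CLI flags above:
from lmdeploy import pipeline, TurbomindEngineConfig

# model_format='awq' tells the engine the weights are AWQ INT4;
# cache_max_entry_count mirrors the CLI flag used above.
engine_config = TurbomindEngineConfig(model_format='awq', cache_max_entry_count=0.01)
engine = pipeline('/root/models/internlm2-chat-1_8b-4bit', backend_config=engine_config)
print(engine(['hello'])[0].text)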
API Serving Architecture
For production use, models are often exposed via API servers rather than local CLI.
Starting the API Server
Launch the server component using the following command. This exposes the model on port 23333.
lmdeploy serve api_server \
/root/models/internlm2-chat-1_8b \
--model-format hf \
--quant-policy 0 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1
Ensure the terminal remains open. Remote access may require SSH port forwarding:
ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.server.com -p <PORT>
Connecting Clients
CLI Client: Open a new terminal and connect to the running server:
lmdeploy serve api_client http://localhost:23333
Web Client (Gradio): For a graphical interface, launch the Gradio server:
lmdeploy serve gradio http://localhost:23333 \
--server-name 0.0.0.0 \
--server-port 6006
Forward port 6006 via SSH and access http://127.0.0.1:6006 in a browser.
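HTTP Client: The server also exposes an OpenAI-compatible REST API, so any HTTP client works. A minimal sketch using requests; the served model name is discovered via the /v1/models endpoint rather than assumed:
import requests

# Ask the server which model names it registered.
models = requests.get("http://localhost:23333/v1/models").json()
model_name = models["data"][0]["id"]

# Standard OpenAI-style chat completion request.
payload = {
    "model": model_name,
    "messages": [{"role": "user", "content": "hello"}],
}
resp = requests.post("http://localhost:23333/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])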
Python SDK Integration
Developers can integrate LMDeploy directly into Python applications using the pipeline API.
Basic Pipeline Usage
Create a file named lmdeploy_integration.py:
from lmdeploy import pipeline
# Initialize inference engine
engine = pipeline('/root/models/internlm2-chat-1_8b')
# Batch inference
prompts = ['Hi, pls intro yourself', 'Shanghai is']
outputs = engine(prompts)
for out in outputs:
    print(out.text)
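Sampling behavior can be tuned by passing a GenerationConfig to the pipeline call. A minimal sketch; the parameter values below are illustrative:
from lmdeploy import pipeline, GenerationConfig

engine = pipeline('/root/models/internlm2-chat-1_8b')
# Illustrative sampling parameters; GenerationConfig also caps output length.
gen_config = GenerationConfig(top_p=0.8, top_k=40, temperature=0.8, max_new_tokens=256)
outputs = engine(['Hi, pls intro yourself'], gen_config=gen_config)
print(outputs[0].text)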
Configuring Backend Parameters
Advanced configurations, such as KV cache limits, can be passed via TurbomindEngineConfig.
Create configurable_pipeline.py:
from lmdeploy import pipeline, TurbomindEngineConfig
# Allocate 20% of free GPU memory (after weights are loaded) to the KV cache
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
# Initialize with config
engine = pipeline('/root/models/internlm2-chat-1_8b', backend_config=engine_config)
# Run inference
results = engine(['Hi, pls intro yourself', 'Shanghai is'])
print(results)
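For interactive applications, the pipeline can also stream tokens as they are generated. A minimal sketch assuming the stream_infer method, which yields incremental response chunks:
from lmdeploy import pipeline, TurbomindEngineConfig

engine = pipeline('/root/models/internlm2-chat-1_8b',
                  backend_config=TurbomindEngineConfig(cache_max_entry_count=0.2))
# stream_infer yields partial outputs as tokens arrive,
# instead of waiting for the full completion.
for chunk in engine.stream_infer(['Hi, pls intro yourself']):
    print(chunk.text, end='', flush=True)
print()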
Advanced Features
Multimodal Support (VLM)
LMDeploy supports Vision-Language Models like LLaVA. Ensure sufficient GPU resources (e.g., at least 30% of an A100) before proceeding.
Install dependencies:
pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874
Create vlm_inference.py:
from lmdeploy import pipeline
from lmdeploy.vl import load_image
# Load VLM
vlm_engine = pipeline('/share/models/liuhaotian/llava-v1.6-vicuna-7b')
# Load image
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
# Infer
result = vlm_engine(('describe this image', img))
print(result.text)
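Batched VLM inference follows the same pattern as text: pass a list of (prompt, image) tuples. A minimal sketch reusing the same tiger image for both queries:
from lmdeploy import pipeline
from lmdeploy.vl import load_image

vlm_engine = pipeline('/share/models/liuhaotian/llava-v1.6-vicuna-7b')
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
# Each (prompt, image) tuple is one request; the list runs as a batch.
results = vlm_engine([
    ('describe this image', img),
    ('what species is the animal?', img),
])
for res in results:
    print(res.text)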
Performance Benchmarking
To quantify performance gains, compare token generation speeds between transformers and LMDeploy.
Transformers Benchmark (hf_latency_test.py):
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path = "/root/models/internlm2-chat-1_8b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
model.eval()
# Warmup
for _ in range(5):
    model.chat(tokenizer, "hello", history=[])
# Measure
prompt = "Please introduce yourself."
iterations = 10
total_chars = 0
start_ts = time.time()
for _ in range(iterations):
    resp, _ = model.chat(tokenizer, prompt, history=[])
    total_chars += len(resp)
duration = time.time() - start_ts
print(f"Speed: {total_chars / duration:.3f} chars/s")
LMDeploy Benchmark (lmdeploy_latency_test.py):
import time
from lmdeploy import pipeline
engine = pipeline('/root/models/internlm2-chat-1_8b')
# Warmup
for _ in range(5):
    engine(["hello"])
# Measure
prompt = "Please introduce yourself."
iterations = 10
total_chars = 0
start_ts = time.time()
for _ in range(iterations):
    resp = engine([prompt])
    total_chars += len(resp[0].text)
duration = time.time() - start_ts
print(f"Speed: {total_chars / duration:.3f} chars/s")
LMDeploy typically demonstrates significantly higher throughput due to optimized kernels and memory management.