Fading Coder



High-Performance LLM Deployment and Quantization Guide with LMDeploy

Tech · May 10

Environment Configuration

To begin working with LMDeploy, a compatible development environment must be established. This process involves setting up a Conda environment and installing the necessary dependencies.

Creating the Conda Environment

Initialize a new Conda environment named llm-inference with Python 3.10. This isolates the dependencies required for the deployment tools.

conda create -n llm-inference -y python=3.10

Once the environment is ready, activate it:

conda activate llm-inference

Installing LMDeploy

Install the LMDeploy package. Ensure version compatibility if specific features are required. For this guide, version 0.3.0 is used.

pip install lmdeploy[all]==0.3.0

Model Inference Comparison

Understanding the difference between standard Hugging Face inference and the optimized TurboMind engine is crucial for performance tuning.

Native Hugging Face Inference

Models hosted on platforms like Hugging Face or OpenXLab are typically stored in HF format. To test baseline performance, use the transformers library.

Create a file named native_hf_test.py:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Define model path
model_path = "/root/models/internlm2-chat-1_8b"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
llm = AutoModelForCausalLM.from_pretrained(
    model_path, 
    torch_dtype=torch.float16, 
    trust_remote_code=True
).cuda()
llm.eval()

# First interaction
query = "hello"
print(f"[User]: {query}")
output, history = llm.chat(tokenizer, query, history=[])
print(f"[Assistant]: {output}")

# Second interaction with context
follow_up = "please provide three suggestions about time management"
print(f"[User]: {follow_up}")
output, history = llm.chat(tokenizer, follow_up, history=history)
print(f"[Assistant]: {output}")

Execute the script to observe the inference speed:

python native_hf_test.py

Optimized Inference with LMDeploy

LMDeploy utilizes the TurboMind engine, which supports continuous batching and optimized KV cache management. It automatically converts HF models to TurboMind format upon first run.

To interact with the model via CLI:

lmdeploy chat /root/models/internlm2-chat-1_8b

Users will notice a significant reduction in latency compared to the native transformers approach. Type exit to terminate the session.

Quantization Strategies

Quantization reduces model size and memory bandwidth requirements, which is vital for memory-bound scenarios typical in LLM decoding.

Key Concepts

  • Compute-bound: Performance limited by calculation speed. Optimized via faster hardware.
  • Memory-bound: Performance limited by data transfer speed. Optimized via quantization (reducing data size) and KV cache management.
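
The memory-bound nature of batch-1 decoding can be illustrated with rough arithmetic: generating one token requires streaming every weight from VRAM once, while performing only about two FLOPs per weight. The hardware figures below (~2 TB/s bandwidth, ~300 TFLOPS FP16, roughly A100-class) and the 1.8B parameter count are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope check: is batch-1 LLM decoding compute- or memory-bound?
# All hardware figures are illustrative assumptions for an A100-class GPU.
params = 1.8e9            # model parameters (internlm2-chat-1_8b scale)
bytes_per_param = 2       # FP16 weights
mem_bandwidth = 2.0e12    # ~2 TB/s HBM bandwidth
peak_flops = 3.0e14       # ~300 TFLOPS FP16

# Per generated token: read all weights once, ~2 FLOPs per weight (multiply-add).
bytes_moved = params * bytes_per_param
flops_needed = 2 * params

t_memory = bytes_moved / mem_bandwidth    # time to stream the weights
t_compute = flops_needed / peak_flops     # time to do the math

print(f"memory time:  {t_memory * 1e3:.3f} ms/token")
print(f"compute time: {t_compute * 1e3:.3f} ms/token")
print(f"memory-bound by ~{t_memory / t_compute:.0f}x")
```

Since moving the weights takes two orders of magnitude longer than computing with them, shrinking the data (quantization) helps far more than faster arithmetic.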

LMDeploy supports KV8 quantization (INT8 for the KV cache) and W4A16 quantization (INT4 for weights, FP16 for activations).

Managing KV Cache

The --cache-max-entry-count flag controls the proportion of free GPU memory allocated to KV cache. The default is 0.8.

To reduce memory footprint:

lmdeploy chat /root/models/internlm2-chat-1_8b --cache-max-entry-count 0.5

Setting this value lower (e.g., 0.01) minimizes VRAM usage but may impact generation speed due to swapping.
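
To see why the KV cache fraction matters, its per-token size can be estimated from the model architecture. The layer and head counts below are illustrative assumptions for a small grouped-query-attention model, not values read from the internlm2 checkpoint.

```python
# Rough KV-cache size per token for a small GQA transformer.
# Architecture numbers are assumptions for illustration only.
num_layers = 24
num_kv_heads = 8       # grouped-query attention: fewer KV heads than query heads
head_dim = 128
bytes_per_elem = 2     # FP16 cache (KV8 quantization would halve this)

# Both K and V are cached for every layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

# How many tokens fit under different --cache-max-entry-count settings,
# assuming ~20 GiB of free VRAM:
free_vram = 20 * 1024**3
for fraction in (0.8, 0.5, 0.01):
    tokens = free_vram * fraction / kv_bytes_per_token
    print(f"fraction {fraction}: room for ~{tokens:,.0f} cached tokens")
```

The linear relationship explains the trade-off: a smaller fraction frees VRAM immediately but caps how much context the engine can keep resident.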

W4A16 Weight Quantization

Using the AWQ algorithm, weights can be quantized to 4-bit. This requires the einops library.

pip install einops==0.7.0

Run the quantization process:

lmdeploy lite auto_awq \
  /root/models/internlm2-chat-1_8b \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/models/internlm2-chat-1_8b-4bit

Once completed, launch the quantized model:

lmdeploy chat /root/models/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.01

This configuration significantly lowers VRAM consumption.
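
The weight-memory savings can be estimated with simple arithmetic: 4-bit weights plus one FP16 scale and zero-point per group of 128. The parameter count and overhead model below are illustrative assumptions, not measured checkpoint sizes.

```python
# Estimated weight memory before and after W4A16 (AWQ) quantization.
# Parameter count and overhead model are illustrative assumptions.
params = 1.8e9
fp16_bytes = params * 2                      # 16-bit weights

group_size = 128
int4_bytes = params * 0.5                    # 4 bits = 0.5 bytes per weight
scale_bytes = (params / group_size) * 2 * 2  # FP16 scale + zero-point per group

awq_bytes = int4_bytes + scale_bytes
print(f"FP16 weights:  {fp16_bytes / 1024**3:.2f} GiB")
print(f"W4A16 weights: {awq_bytes / 1024**3:.2f} GiB")
print(f"~{fp16_bytes / awq_bytes:.1f}x smaller")
```

Combined with a small KV cache fraction, this is why the quantized model fits comfortably where the FP16 model would not.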

API Serving Architecture

For production use, models are often exposed via API servers rather than local CLI.

Starting the API Server

Launch the server component using the following command. This exposes the model on port 23333.

lmdeploy serve api_server \
    /root/models/internlm2-chat-1_8b \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

Ensure the terminal remains open. Remote access may require SSH port forwarding:

ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.server.com -p <PORT>

Connecting Clients

CLI Client: Open a new terminal and connect to the running server:

lmdeploy serve api_client http://localhost:23333

Web Client (Gradio): For a graphical interface, launch the Gradio server:

lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006

Forward port 6006 via SSH and access http://127.0.0.1:6006 in a browser.
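
The api_server also exposes OpenAI-style endpoints such as /v1/chat/completions, so any HTTP client can talk to it. The sketch below builds such a request with only the standard library; the model name "internlm2" is an assumption to verify against the server's /v1/models listing.

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:23333",
                       model: str = "internlm2") -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for the LMDeploy server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("hello")
print(req.full_url)

# With the api_server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Because the interface is OpenAI-compatible, existing OpenAI client libraries can generally be pointed at the same base URL.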

Python SDK Integration

Developers can integrate LMDeploy directly into Python applications using the pipeline API.

Basic Pipeline Usage

Create a file named lmdeploy_integration.py:

from lmdeploy import pipeline

# Initialize inference engine
engine = pipeline('/root/models/internlm2-chat-1_8b')

# Batch inference
prompts = ['Hi, pls intro yourself', 'Shanghai is']
outputs = engine(prompts)

for out in outputs:
    print(out.text)

Configuring Backend Parameters

Advanced configurations, such as KV cache limits, can be passed via TurbomindEngineConfig.

Create configurable_pipeline.py:

from lmdeploy import pipeline, TurbomindEngineConfig

# Configure engine to use 20% of VRAM for KV cache
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

# Initialize with config
engine = pipeline('/root/models/internlm2-chat-1_8b', backend_config=engine_config)

# Run inference
results = engine(['Hi, pls intro yourself', 'Shanghai is'])
print(results)

Advanced Features

Multimodal Support (VLM)

LMDeploy supports vision-language models (VLMs) such as LLaVA. Ensure sufficient GPU resources (e.g., at least 30% of an A100) before proceeding.

Install dependencies:

pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874

Create vlm_inference.py:

from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Load VLM
vlm_engine = pipeline('/share/models/liuhaotian/llava-v1.6-vicuna-7b')

# Load image
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')

# Infer
result = vlm_engine(('describe this image', img))
print(result.text)

Performance Benchmarking

To quantify performance gains, compare token generation speeds between transformers and LMDeploy.

Transformers Benchmark (hf_latency_test.py):

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/root/models/internlm2-chat-1_8b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
model.eval()

# Warmup
for _ in range(5):
    model.chat(tokenizer, "hello", history=[])

# Measure
prompt = "Please introduce yourself."
iterations = 10
total_chars = 0
start_ts = time.time()

for _ in range(iterations):
    resp, _ = model.chat(tokenizer, prompt, history=[])
    total_chars += len(resp)

duration = time.time() - start_ts
print(f"Speed: {total_chars / duration:.3f} chars/s")

LMDeploy Benchmark (lmdeploy_latency_test.py):

import time
from lmdeploy import pipeline

engine = pipeline('/root/models/internlm2-chat-1_8b')

# Warmup
for _ in range(5):
    engine(["hello"])

# Measure
prompt = "Please introduce yourself."
iterations = 10
total_chars = 0
start_ts = time.time()

for _ in range(iterations):
    resp = engine([prompt])
    total_chars += len(resp[0].text)

duration = time.time() - start_ts
print(f"Speed: {total_chars / duration:.3f} chars/s")

LMDeploy typically demonstrates significantly higher throughput due to optimized kernels and memory management.
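
Note that both scripts measure characters per second, not tokens. A rough conversion to tokens/s divides by an average chars-per-token ratio; the 3.5 figure and the sample measurements below are ballpark assumptions for English text, so measure your own tokenizer for accurate numbers.

```python
# Convert a measured chars/s figure into a rough tokens/s estimate.
# The 3.5 chars-per-token average is a ballpark assumption for English text.
def chars_to_tokens_per_s(chars_per_s: float, chars_per_token: float = 3.5) -> float:
    return chars_per_s / chars_per_token

# Hypothetical readings from the two benchmark scripts above:
hf_speed = chars_to_tokens_per_s(60.0)
lmdeploy_speed = chars_to_tokens_per_s(450.0)
print(f"transformers: ~{hf_speed:.1f} tok/s, lmdeploy: ~{lmdeploy_speed:.1f} tok/s")
print(f"speedup: ~{lmdeploy_speed / hf_speed:.1f}x")
```

The chars-per-token divisor cancels out in the speedup ratio, so the relative comparison holds regardless of the tokenizer.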
