Optimizing GPT OSS Private Deployment with vLLM for High-Performance Inference
Introduction
OpenAI recently released two open-source models: GPT OSS 120B and GPT OSS 20B. Because official vLLM support for them requires a non-trivial installation, this guide demonstrates a production deployment using GPUStack with a custom vLLM backend. Performance comparisons against Ollama using EvalScope reveal significant throughput advantages for vLLM in concurrent scenarios.
GPUStack Installation
Begin by installing GPUStack following the official documentation. The containerized deployment approach is recommended for NVIDIA GPU servers. Ensure appropriate NVIDIA drivers, Docker, and the NVIDIA Container Toolkit are installed before launching the GPUStack service.
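Before launching the container, a quick sanity check confirms that the driver and the NVIDIA Container Toolkit are visible to Docker. This is a minimal sketch; the CUDA image tag is only an example and should match your installed driver:
# Driver visible on the host
nvidia-smi
# Container Toolkit wired into Docker (image tag is illustrative)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi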
Tested on NVIDIA H20 GPU:
docker run -d --name gpustack-service \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-e HF_ENDPOINT="https://hf-mirror.com" \
-v gpustack-storage:/var/lib/gpustack \
gpustack/gpustack \
--port 9090
Verify service status:
docker logs -f gpustack-service
Retrieve initial admin password:
docker exec -it gpustack-service cat /var/lib/gpustack/initial_admin_password
Access the GPUStack console at http://YOUR_HOST_IP:9090 using admin credentials.
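If the console does not come up, a quick check from the host (YOUR_HOST_IP is a placeholder, as above) confirms the port is answering:
curl -sI http://YOUR_HOST_IP:9090 | head -n 1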
Custom vLLM Installation
GPUStack 0.7.0 ships with vLLM 0.9.2, but the GPT OSS models require the vLLM 0.10.1+gptoss pre-release build. Install it as an additional backend version so the stable default remains untouched.
Enter the container and install Miniconda to host an isolated Python environment:
docker exec -it gpustack-service bash
mkdir -p /var/lib/gpustack/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /var/lib/gpustack/miniconda3/miniconda.sh
bash /var/lib/gpustack/miniconda3/miniconda.sh -b -u -p /var/lib/gpustack/miniconda3
rm /var/lib/gpustack/miniconda3/miniconda.sh
source /var/lib/gpustack/miniconda3/bin/activate
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
Create Python 3.12 environment:
conda create -n vllm_gptoss python=3.12 -y
conda activate vllm_gptoss
python -V
Install vLLM 0.10.1+gptoss:
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128
ln -sf /var/lib/gpustack/miniconda3/envs/vllm_gptoss/bin/vllm /var/lib/gpustack/bin/vllm_0.10.1+gptoss
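A quick check (a minimal sketch) confirms the symlink GPUStack is expected to discover and the vLLM version installed in the new environment:
# Symlink that exposes the extra backend version to GPUStack
ls -l /var/lib/gpustack/bin/ | grep vllm
# Confirm the environment imports the expected vLLM build
/var/lib/gpustack/miniconda3/envs/vllm_gptoss/bin/python -c "import vllm; print(vllm.__version__)"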
Model Deployment
Download the GPT OSS models through the GPUStack UI from Hugging Face or ModelScope. For networks inside mainland China, ModelScope is recommended.
Deploy models with advanced configuration (the standalone command after this list sketches the equivalent vLLM launch):
- Model type: LLM
- Backend version: 0.10.1+gptoss
- Backend parameters: --max-model-len=32768
- Environment variables: VLLM_ATTENTION_BACKEND=FLASH_ATTN, VLLM_USE_FLASHINFER_SAMPLER=0
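For reference, the settings above roughly correspond to the following standalone vLLM launch. This is a sketch only; the model path is a placeholder, and GPUStack applies these parameters itself when it starts the backend:
# Launch gpt-oss-20b directly with the custom backend (model path is a placeholder)
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
VLLM_USE_FLASHINFER_SAMPLER=0 \
vllm serve /path/to/gpt-oss-20b \
  --max-model-len 32768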
Performance Benchmarking
The EvalScope framework is used to compare throughput between the Ollama and vLLM backends.
Install EvalScope:
conda create -n benchmark python=3.10 -y
conda activate benchmark
pip install -U 'evalscope[perf]' plotly gradio wandb
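Before the full runs, a quick smoke test against the vLLM endpoint confirms it answers; the URL, API key, and model name are placeholders matching the benchmark commands below:
# Single chat completion against the GPUStack-served model
curl -s https://gpustack.xxx.xx/v1/chat/completions \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'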
Benchmark commands for GPT OSS 20B (single GPU):
10 requests, 1 concurrent:
# vLLM
evalscope perf \
--url "https://gpustack.xxx.xx/v1/chat/completions" \
--api-key "your_api_key" \
--model gpt-oss-20b \
--number 10 \
--parallel 1 \
--api openai \
--dataset openqa \
--stream
# Ollama
evalscope perf \
--url "http://192.168.0.1:11434/v1/chat/completions" \
--model gpt-oss:20b \
--number 10 \
--parallel 1 \
--api openai \
--dataset openqa \
--stream
100 requests, 10 concurrent:
# vLLM
evalscope perf \
--url "https://gpustack.xxx.xx/v1/chat/completions" \
--api-key "your_api_key" \
--model gpt-oss-20b \
--number 100 \
--parallel 10 \
--api openai \
--dataset openqa \
--stream
# Ollama
evalscope perf \
--url "http://192.168.0.1:11434/v1/chat/completions" \
--model gpt-oss:20b \
--number 100 \
--parallel 10 \
--api openai \
--dataset openqa \
--stream
Performance results summary:
vLLM demonstrates superior resource utilization and scaling efficiency. Ollama's multi-instance architecture consumes substantially more GPU memory under load, whereas vLLM sustains higher concurrency with a lower memory footprint, providing better ROI for enterprise deployments.