Optimizing GPT OSS Private Deployment with vLLM for High-Performance Inference
Introduction
OpenAI recently released two open-source models: GPT OSS 120B and GPT OSS 20B. Because official vLLM support for them requires a non-trivial installation, this guide demonstrates a production deployment using GPUStack with a custom vLLM backend. Performance comparisons against Ollama using EvalScope reveal significant throughput advantages for vLLM in concurrent scenarios.
GPUStack Installation
Begin by installing GPUStack following the official documentation. The containerized deployment approach is recommended for NVIDIA GPU servers. Ensure appropriate NVIDIA drivers, Docker, and the NVIDIA Container Toolkit are installed before launching the GPUStack service.
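Before launching the container, a quick sanity check confirms that the driver and the NVIDIA Container Toolkit are visible to Docker. This is a minimal sketch; the CUDA image tag is only an example and should match your installed driver:
# Driver visible on the host
nvidia-smi
# Container Toolkit wired into Docker (image tag is illustrative)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi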
Tested on NVIDIA H20 GPU:
docker run -d --name gpustack-service \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-e HF_ENDPOINT="https://hf-mirror.com" \
-v gpustack-storage:/var/lib/gpustack \
gpustack/gpustack \
--port 9090
Verify service status:
docker logs -f gpustack-service
Retrieve initial admin password:
docker exec -it gpustack-service cat /var/lib/gpustack/initial_admin_password
Access the GPUStack console at http://YOUR_HOST_IP:9090 using admin credentials.
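If the console does not come up, a quick check from the host (YOUR_HOST_IP is a placeholder, as above) confirms the port is answering:
curl -sI http://YOUR_HOST_IP:9090 | head -n 1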
Custom vLLM Installation
GPUStack 0.7.0 ships with vLLM 0.9.2, but the GPT OSS models require the vLLM 0.10.1+gptoss pre-release build. Install it as an additional backend version so the stable default remains untouched.
Enter the container and install Miniconda to host an isolated Python environment:
docker exec -it gpustack-service bash
mkdir -p /var/lib/gpustack/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /var/lib/gpustack/miniconda3/miniconda.sh
bash /var/lib/gpustack/miniconda3/miniconda.sh -b -u -p /var/lib/gpustack/miniconda3
rm /var/lib/gpustack/miniconda3/miniconda.sh
source /var/lib/gpustack/miniconda3/bin/activate
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
Create Python 3.12 environment:
conda create -n vllm_gptoss python=3.12 -y
conda activate vllm_gptoss
python -V
Install vLLM 0.10.1+gptoss:
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128
ln -sf /var/lib/gpustack/miniconda3/envs/vllm_gptoss/bin/vllm /var/lib/gpustack/bin/vllm_0.10.1+gptoss
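A quick check (a minimal sketch) confirms the symlink GPUStack is expected to discover and the vLLM version installed in the new environment:
# Symlink that exposes the extra backend version to GPUStack
ls -l /var/lib/gpustack/bin/ | grep vllm
# Confirm the environment imports the expected vLLM build
/var/lib/gpustack/miniconda3/envs/vllm_gptoss/bin/python -c "import vllm; print(vllm.__version__)"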
Model Deployment
Download the GPT OSS models through the GPUStack UI from Hugging Face or ModelScope. For networks inside mainland China, ModelScope is recommended.
Deploy models with advanced configuration (the standalone command after this list sketches the equivalent vLLM launch):
- Model type: LLM
- Backend version: 0.10.1+gptoss
- Backend parameters: --max-model-len=32768
- Environment variables: VLLM_ATTENTION_BACKEND=FLASH_ATTN, VLLM_USE_FLASHINFER_SAMPLER=0
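For reference, the settings above roughly correspond to the following standalone vLLM launch. This is a sketch only; the model path is a placeholder, and GPUStack applies these parameters itself when it starts the backend:
# Launch gpt-oss-20b directly with the custom backend (model path is a placeholder)
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
VLLM_USE_FLASHINFER_SAMPLER=0 \
vllm serve /path/to/gpt-oss-20b \
  --max-model-len 32768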
Performance Benchmarking
The EvalScope framework is used to compare throughput between the Ollama and vLLM backends.
Install EvalScope:
conda create -n benchmark python=3.10 -y
conda activate benchmark
pip install -U 'evalscope[perf]' plotly gradio wandb
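Before the full runs, a quick smoke test against the vLLM endpoint confirms it answers; the URL, API key, and model name are placeholders matching the benchmark commands below:
# Single chat completion against the GPUStack-served model
curl -s https://gpustack.xxx.xx/v1/chat/completions \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'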
Benchmark commands for GPT OSS 20B (single GPU):
10 requests, 1 concurrent:
# vLLM
evalscope perf \
--url "https://gpustack.xxx.xx/v1/chat/completions" \
--api-key "your_api_key" \
--model gpt-oss-20b \
--number 10 \
--parallel 1 \
--api openai \
--dataset openqa \
--stream
# Ollama
evalscope perf \
--url "http://192.168.0.1:11434/v1/chat/completions" \
--model gpt-oss:20b \
--number 10 \
--parallel 1 \
--api openai \
--dataset openqa \
--stream
100 requests, 10 concurrent:
# vLLM
evalscope perf \
--url "https://gpustack.xxx.xx/v1/chat/completions" \
--api-key "your_api_key" \
--model gpt-oss-20b \
--number 100 \
--parallel 10 \
--api openai \
--dataset openqa \
--stream
# Ollama
evalscope perf \
--url "http://192.168.0.1:11434/v1/chat/completions" \
--model gpt-oss:20b \
--number 100 \
--parallel 10 \
--api openai \
--dataset openqa \
--stream
Performance results summary:
vLLM demonstrates superior resource utilization and scaling efficiency. Ollama's multi-instance architecture consumes substantially more GPU memory under load, whereas vLLM sustains higher concurrency with a lower memory footprint, providing better ROI for enterprise deployments.