Fading Coder

One Final Commit for the Last Sprint


Optimizing GPT OSS Private Deployment with vLLM for High-Performance Inference

Tech · May 11

Introduction

OpenAI recently released two open-source models: GPT OSS 120B and GPT OSS 20B. While official vLLM inference requires complex installation steps, this guide demonstrates production deployment using GPUStack with a custom vLLM installation. Performance comparisons with Ollama using EvalScope reveal significant throughput advantages for vLLM in concurrent scenarios.

GPUStack Installation

Begin by installing GPUStack following the official documentation. The containerized deployment approach is recommended for NVIDIA GPU servers. Ensure appropriate NVIDIA drivers, Docker, and the NVIDIA Container Toolkit are installed before launching the GPUStack service.
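Before launching, a quick sanity check that the prerequisite tools are on the PATH can save a failed container start. A minimal sketch (the binary names are the standard ones; adjust for your distribution):

```shell
# Print whether each required tool is available on this host.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING"
  fi
}

for cmd in nvidia-smi docker nvidia-ctk; do
  check_tool "$cmd"
done
```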

Tested on NVIDIA H20 GPU:

docker run -d --name gpustack-service \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -e HF_ENDPOINT="https://hf-mirror.com" \
    -v gpustack-storage:/var/lib/gpustack \
    gpustack/gpustack \
    --port 9090

Verify service status:

docker logs -f gpustack-service

Retrieve initial admin password:

docker exec -it gpustack-service cat /var/lib/gpustack/initial_admin_password

Access the GPUStack console at http://YOUR_HOST_IP:9090 using admin credentials.

Custom vLLM Installation

GPUStack 0.7.0 ships with vLLM 0.9.2, but the GPT OSS models require the vLLM 0.10.1+gptoss development build. The steps below install that build alongside the stable default, leaving the bundled version intact.

Enter the container and create a Python virtual environment:

docker exec -it gpustack-service bash
mkdir -p /var/lib/gpustack/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /var/lib/gpustack/miniconda3/miniconda.sh
bash /var/lib/gpustack/miniconda3/miniconda.sh -b -u -p /var/lib/gpustack/miniconda3
rm /var/lib/gpustack/miniconda3/miniconda.sh
source /var/lib/gpustack/miniconda3/bin/activate
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

Create Python 3.12 environment:

conda create -n vllm_gptoss python=3.12 -y
conda activate vllm_gptoss
python -V

Install vLLM 0.10.1+gptoss:

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128
ln -sf /var/lib/gpustack/miniconda3/envs/vllm_gptoss/bin/vllm /var/lib/gpustack/bin/vllm_0.10.1+gptoss
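To confirm which vLLM build the new environment resolves to, a small standard-library check can be run inside the vllm_gptoss environment (the printed version string is whatever pip actually installed):

```python
# Report the installed vllm version, if any, using importlib.metadata.
from importlib.metadata import PackageNotFoundError, version

def vllm_version() -> str:
    """Return the installed vllm version, or a notice if it is absent."""
    try:
        return "vllm " + version("vllm")
    except PackageNotFoundError:
        return "vllm is not installed in this environment"

print(vllm_version())
```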

Model Deployment

Download the GPT OSS models through the GPUStack UI from Hugging Face or ModelScope. For networks in mainland China, ModelScope is recommended.

Deploy the models with the following advanced configuration:

  • Model type: LLM
  • Backend version: 0.10.1+gptoss
  • Backend parameters: --max-model-len=32768
  • Environment variables: VLLM_ATTENTION_BACKEND=FLASH_ATTN, VLLM_USE_FLASHINFER_SAMPLER=0
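For reference, these GPUStack settings correspond roughly to the following standalone launch of the pinned backend. This is a configuration sketch, not a command from the GPUStack docs; the model identifier is illustrative:

```shell
# Launch the gptoss build directly with the same parameters GPUStack passes.
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
VLLM_USE_FLASHINFER_SAMPLER=0 \
/var/lib/gpustack/bin/vllm_0.10.1+gptoss serve openai/gpt-oss-20b \
    --max-model-len 32768
```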

Performance Benchmarking

We use the EvalScope framework to compare throughput between the Ollama and vLLM backends.

Install EvalScope:

conda create -n benchmark python=3.10 -y
conda activate benchmark
pip install -U 'evalscope[perf]' plotly gradio wandb

Benchmark commands for GPT OSS 20B (single GPU):

10 requests, 1 concurrent:

# vLLM
evalscope perf \
    --url "https://gpustack.xxx.xx/v1/chat/completions" \
    --api-key "your_api_key" \
    --model gpt-oss-20b \
    --number 10 \
    --parallel 1 \
    --api openai \
    --dataset openqa \
    --stream

# Ollama
evalscope perf \
    --url "http://192.168.0.1:11434/v1/chat/completions" \
    --model gpt-oss:20b \
    --number 10 \
    --parallel 1 \
    --api openai \
    --dataset openqa \
    --stream

100 requests, 10 concurrent:

# vLLM
evalscope perf \
    --url "https://gpustack.xxx.xx/v1/chat/completions" \
    --api-key "your_api_key" \
    --model gpt-oss-20b \
    --number 100 \
    --parallel 10 \
    --api openai \
    --dataset openqa \
    --stream

# Ollama
evalscope perf \
    --url "http://192.168.0.1:11434/v1/chat/completions" \
    --model gpt-oss:20b \
    --number 100 \
    --parallel 10 \
    --api openai \
    --dataset openqa \
    --stream

Performance results summary:

vLLM demonstrates superior resource utilization and scaling efficiency. Ollama's multi-instance approach to concurrency multiplies GPU memory consumption, while vLLM sustains higher concurrency from a single instance with a lower memory footprint, yielding better ROI for enterprise deployments.
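The throughput comparison reduces to simple arithmetic over the EvalScope output: total generated tokens divided by wall-clock time. The numbers below are placeholders to illustrate the calculation, not measured values from this article:

```python
# Compare aggregate throughput from EvalScope-style benchmark results.

def aggregate_throughput(total_output_tokens: int, wall_time_s: float) -> float:
    """Output tokens generated per second across all concurrent requests."""
    return total_output_tokens / wall_time_s

# Hypothetical run: 100 requests, 10 concurrent, ~400 output tokens each.
vllm_tps = aggregate_throughput(100 * 400, 120.0)    # continuous batching
ollama_tps = aggregate_throughput(100 * 400, 480.0)  # requests largely serialized

speedup = vllm_tps / ollama_tps
print(f"vLLM: {vllm_tps:.1f} tok/s, Ollama: {ollama_tps:.1f} tok/s, "
      f"speedup: {speedup:.1f}x")
```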

