Deploying AgentCPM with DeepResearch Capabilities on AMD AI Max 395


The AMD AI Max+ 395 (codename Strix Halo) is a powerful APU featuring the RDNA 3.5 architecture with an integrated Radeon 8060S graphics processor. With 40 compute units and 256GB/s of memory bandwidth, it delivers performance comparable to a mobile RTX 4060. Paired with a device like the GTR9 Pro mini PC, which offers 128GB of LPDDR5X memory at 8000MT/s and a 2TB NVMe SSD, its unified memory architecture allows up to 96GB of system memory to be allocated as video memory through the BIOS.

This guide covers the complete setup process for running AgentCPM-Explore and AgentCPM-Report models locally on the AI Max 395, enabling web search and DeepResearch capabilities without cloud dependencies.

System Preparation

Operating System Setup

For the best compatibility with the AI software stack, Ubuntu 24.04 is recommended over Windows. The unified memory architecture also works particularly well under Linux thanks to more mature ROCm driver support.

Kernel Upgrade

The default Ubuntu 6.12 kernel may cause intermittent issues with LLM inference workloads on the AI Max 395 (gfx1151). Upgrade to kernel 6.14:

sudo apt update && sudo apt install linux-image-6.14.0-1017-oem

After installation, reboot into the new kernel, then confirm the running version and bring the remaining packages up to date:

uname -r
sudo apt upgrade -y
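
You can also confirm that the amdgpu driver loaded cleanly on the new kernel. A quick informal check (not part of the official procedure):

# Check that the amdgpu module is loaded and review its most recent kernel messages
lsmod | grep amdgpu
sudo dmesg | grep -i "amdgpu" | tail -n 20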

Installing ROCm

Download and install the AMD GPU installer:

sudo apt update
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/noble/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb

Install ROCm with the following command:

amdgpu-install -y --usecase=rocm --no-dkms

Configure user permissions:

groups
sudo usermod -a -G render,video $LOGNAME
sudo reboot

Verify the installation:

groups
rocminfo
rocm-smi
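
rocminfo prints a large amount of output; to simply confirm that the GPU target is visible to ROCm, a short filter is enough:

# The Radeon 8060S in the AI Max 395 should be reported as gfx1151
rocminfo | grep -i "gfx"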

Memory Configuration

BIOS Allocation

For supported devices, access BIOS using F2 or Delete during boot. Navigate to:

  • Advanced → AMD CBS → iGPU Memory Configuration
  • Set mode to Custom
  • Allocate up to 96GB for iGPU Memory Size
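
After saving the BIOS change and booting back into Linux, the allocation can be double-checked from the OS. Assuming ROCm is installed as described above, the reported VRAM size should roughly match what was set in the BIOS:

rocm-smi --showmeminfo vram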

Extended Memory via GTT Parameters

For memory beyond 96GB, modify kernel parameters via GRUB:

sudo nano /etc/default/grub

Update the GRUB_CMDLINE_LINUX_DEFAULT line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ttm.pages_limit=27648000 ttm.page_pool_size=27648000 amdttm.pages_limit=27648000 amdttm.page_pool_size=27648000 apparmor=0 amd_iommu=off"
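
These limits are expressed in pages, so 27648000 pages works out to roughly 105 GiB of addressable GTT memory. A quick sanity calculation, assuming the standard 4 KiB x86-64 page size:

echo $((27648000 * 4096 / 1024 / 1024 / 1024))   # ~105 (GiB)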

Apply changes:

sudo update-grub
sudo grub-install
sudo reboot

Verify with:

cat /proc/cmdline
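
If the parameters took effect, the enlarged GTT pool should also be visible to ROCm; as an optional extra check:

rocm-smi --showmeminfo gtt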

Shared Memory Configuration

ROCm draws from a shared system memory pool. Install AMD's debugging tools to inspect and adjust it:

sudo apt install pipx
pipx ensurepath
pipx install amd-debug-tools

Check and adjust shared memory:

amd-ttm
amd-ttm --set 16
sudo reboot

Setting Up GPUStack

GPUStack is an open-source GPU cluster manager designed for efficient AI model deployment. It supports multiple inference backends including vLLM, SGLang, and MindIE.

Installing AMD Container Toolkit

sudo apt update
sudo usermod -a -G render,video $LOGNAME
sudo apt install vim wget nano gpg
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null

# Ubuntu 24.04
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amd-container-toolkit/apt/ noble main" | sudo tee /etc/apt/sources.list.d/amd-container-toolkit.list

sudo apt update
sudo apt install amd-container-toolkit
sudo amd-ctk runtime configure
sudo systemctl restart docker
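
Before moving on, it is worth confirming that the AMD runtime was actually registered with Docker (the exact output format varies by Docker and toolkit version):

sudo docker info | grep -i -A3 "runtimes"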

Deploying GPUStack Server

sudo docker run -d --name gpustack \
    --restart unless-stopped \
    -p 80:80 \
    --volume gpustack-data:/var/lib/gpustack \
    gpustack/gpustack

Retrieve the initial admin password:

sudo docker exec gpustack cat /var/lib/gpustack/initial_admin_password

Access the GPUStack UI at http://your_host_ip and complete the setup wizard:

  1. Login with admin credentials
  2. Create a new cluster (select Docker for self-hosted)
  3. Add a node, select AMD as the GPU vendor
  4. Run the environment check command:
amd-smi static >/dev/null 2>&1 && echo "AMD driver OK" || (echo "AMD driver issue"; exit 1) && sudo docker info 2>/dev/null | grep -q "amd" && echo "AMD Container Toolkit OK" || (echo "AMD Container Toolkit not configured"; exit 1)
  5. Enter the node's LAN IP address and complete registration

Once configured, the AI Max 395 GPU appears in the GPU dashboard.

Deploying AgentCPM-Explore

Model Deployment

In GPUStack UI:

  1. Navigate to Models → Deploy Model
  2. Select ModelScope as the source
  3. Search for AgentCPM and select AgentCPM-Explore
  4. Configure backend: vLLM version 0.13.0
  5. Add advanced parameters:
--gpu-memory-utilization=0.6
--max-model-len=262144

The first parameter allocates 60% of available memory (~60GB), while the second sets the maximum context length to 256K tokens.

Wait for model download and service startup. Monitor progress via the logs viewer. The service is ready when the status shows "Running" and API endpoint information appears.

API Key Creation

Generate an API key through the GPUStack interface for authentication when accessing the model endpoint.
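
With the key in hand, the deployment can be smoke-tested directly against GPUStack's OpenAI-compatible API. The sketch below assumes the /v1-openai path used by recent GPUStack releases and a deployment named AgentCPM-Explore; substitute the endpoint URL, model name, and key shown in your own instance:

# Minimal chat-completions request against the local GPUStack endpoint
curl http://your_host_ip/v1-openai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_GPUSTACK_API_KEY" \
  -d '{"model": "AgentCPM-Explore", "messages": [{"role": "user", "content": "Hello"}]}'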

Using AgentCPM-Explore for Web Search

Configure Cherry Studio or any compatible client:

  1. Add GPUStack as the model provider
  2. Enter the endpoint URL, model name, and API key
  3. Test the connection

With a search plugin configured, AgentCPM-Explore can perform web search queries and synthesize responses. Performance metrics show approximately 867ms for first-token prefill and 18 tokens/second generation speed—adequate for single-user local queries.

Deploying AgentCPM-Report

AgentCPM-Report is an 8B parameter model specialized for DeepResearch tasks. Two model formats are available:

Safetensors Format (vLLM)

Deploy using the same procedure as AgentCPM-Explore, selecting vLLM as the backend.

GGUF Format (llama.cpp)

For unified memory architectures, the GGUF format enables efficient CPU+GPU combined inference.

Add a custom llama.cpp backend:

  1. Navigate to Inference Backends → Add Backend
  2. Use YAML configuration:
backend_name: llama-cpp-custom
version_configs:
  v1:
    image_name: ghcr.io/ggml-org/llama.cpp:server-vulkan
    custom_framework: rocm
default_version: v1
default_backend_param: []
default_run_command: '-m {{model_path}} --host 0.0.0.0 --port {{port}}'
default_entrypoint: ''
is_built_in: false
framework_index_map:
  rocm:
    - v1

Deploy the GGUF format model using this custom backend. Configure environment variables:

LLAMA_ARG_THREADS=16
LLAMA_ARG_CTX_SIZE=65536
LLAMA_ARG_N_PREDICT=512
LLAMA_ARG_TEMP=0.1
LLAMA_ARG_MLOCK=1

The context size of 65536 (64K) matches AgentCPM-Report's maximum supported length. Adjust based on actual requirements.
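
For reference, the LLAMA_ARG_* variables map onto llama-server command-line flags, so an equivalent direct invocation would look roughly like the sketch below (the GGUF filename is a placeholder for whichever quantization you downloaded):

# Threads, context size, prediction limit, temperature, and mlock mirror the env vars above
llama-server -m ./AgentCPM-Report-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  --threads 16 --ctx-size 65536 --n-predict 512 --temp 0.1 --mlock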

Once the service shows "Running" status, integrate it with DeepResearch frameworks like Dify or MiroThinker for comprehensive research task execution.
