Deploying AgentCPM with DeepResearch Capabilities on AMD AI Max 395
The AMD AI Max+ 395 (codename Strix Halo) is a powerful APU featuring the RDNA 3.5 architecture with an integrated Radeon 8060S GPU. With 40 compute units and 256GB/s of memory bandwidth, it delivers performance comparable to a mobile RTX 4060. Paired with a device like the GTR9 Pro mini PC, which carries 128GB of LPDDR5X memory at 8000MT/s and a 2TB NVMe SSD, its unified memory architecture allows up to 96GB of system memory to be allocated as video memory through BIOS configuration.
This guide covers the complete setup process for running AgentCPM-Explore and AgentCPM-Report models locally on the AI Max 395, enabling web search and DeepResearch capabilities without cloud dependencies.
System Preparation
Operating System Setup
For optimal AI component compatibility, Ubuntu 24.04 is recommended over Windows. The unified memory architecture works particularly well with Linux due to better ROCm driver support.
Kernel Upgrade
The default Ubuntu 6.12 kernel may cause intermittent issues with LLM inference workloads on the AI Max 395 (gfx1151). Upgrade to kernel 6.14:
sudo apt update && sudo apt install linux-image-6.14.0-1017-oem
After installation, reboot into the new kernel and verify the running version:
uname -r
Then bring the remaining packages up to date:
sudo apt upgrade -y
Installing ROCm
Download and install the AMD GPU installer:
sudo apt update
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/noble/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
Install ROCm with the following command:
amdgpu-install -y --usecase=rocm --no-dkms
Configure user permissions:
groups
sudo usermod -a -G render,video $LOGNAME
sudo reboot
Verify the installation:
groups
rocminfo
rocm-smi
Memory Configuration
BIOS Allocation
On supported devices, press F2 or Delete during boot to enter the BIOS, then navigate to:
- Advanced → AMD CBS → iGPU Memory Configuration
- Set mode to Custom
- Allocate up to 96GB for iGPU Memory Size
Extended Memory via GTT Parameters
For memory beyond 96GB, modify kernel parameters via GRUB:
sudo nano /etc/default/grub
Update the GRUB_CMDLINE_LINUX_DEFAULT line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ttm.pages_limit=27648000 ttm.page_pool_size=27648000 amdttm.pages_limit=27648000 amdttm.page_pool_size=27648000 apparmor=0 amd_iommu=off"
Apply changes:
sudo update-grub
sudo grub-install
sudo reboot
Verify with:
cat /proc/cmdline
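The pages_limit values count 4 KiB pages, so it is worth sanity-checking what byte budget a given number corresponds to. A quick calculation for the 27648000 used above:

```shell
# ttm.pages_limit / amdttm.pages_limit are counts of 4 KiB pages.
PAGES=27648000
echo "$((PAGES * 4 / 1024)) MiB"        # pages x 4 KiB, in MiB
echo "$((PAGES * 4 / 1024 / 1024)) GiB" # same budget in whole GiB
```

This works out to 108000 MiB (about 105 GiB) addressable through GTT, comfortably above the 96GB BIOS ceiling.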
Shared Memory Configuration
ROCm draws from a shared system memory pool. Install AMD's debugging tools to inspect and tune it:
sudo apt install pipx
pipx ensurepath
pipx install amd-debug-tools
Check and adjust shared memory:
amd-ttm
amd-ttm --set 16
sudo reboot
Setting Up GPUStack
GPUStack is an open-source GPU cluster manager designed for efficient AI model deployment. It supports multiple inference backends including vLLM, SGLang, and MindIE.
Installing AMD Container Toolkit
sudo apt update
sudo usermod -a -G render,video $LOGNAME
sudo apt install vim wget nano gpg
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
# Ubuntu 24.04
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amd-container-toolkit/apt/ noble main" | sudo tee /etc/apt/sources.list.d/amd-container-toolkit.list
sudo apt update
sudo apt install amd-container-toolkit
sudo amd-ctk runtime configure
sudo systemctl restart docker
Deploying GPUStack Server
sudo docker run -d --name gpustack \
--restart unless-stopped \
-p 80:80 \
--volume gpustack-data:/var/lib/gpustack \
gpustack/gpustack
Retrieve the initial admin password:
sudo docker exec gpustack cat /var/lib/gpustack/initial_admin_password
Access the GPUStack UI at http://your_host_ip and complete the setup wizard:
- Login with admin credentials
- Create a new cluster (select Docker for self-hosted)
- Add a node, select AMD as the GPU vendor
- Run the environment check command:
amd-smi static >/dev/null 2>&1 && echo "AMD driver OK" || (echo "AMD driver issue"; exit 1) && \
sudo docker info 2>/dev/null | grep -q "amd" && echo "AMD Container Toolkit OK" || (echo "AMD Container Toolkit not configured"; exit 1)
- Enter the node's LAN IP address and complete registration
Once configured, the AI Max 395 GPU appears in the GPU dashboard.
Deploying AgentCPM-Explore
Model Deployment
In GPUStack UI:
- Navigate to Models → Deploy Model
- Select ModelScope as the source
- Search for AgentCPM and select AgentCPM-Explore
- Configure backend: vLLM version 0.13.0
- Add advanced parameters:
--gpu-memory-utilization=0.6
--max-model-len=262144
The first parameter caps vLLM's GPU memory usage at 60% of the available pool (roughly 57GB of a 96GB allocation), while the second sets the maximum context length to 256K tokens.
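As a quick sketch of the resulting budget (assuming the full 96GB iGPU allocation from the BIOS step):

```shell
# --gpu-memory-utilization=0.6 applied to a 96 GiB pool:
awk 'BEGIN { printf "%.1f GiB for vLLM weights + KV cache\n", 0.6 * 96 }'
```

The remaining ~40% stays available to the OS and other processes, which matters on a unified memory system where CPU and GPU share the same pool.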
Wait for model download and service startup. Monitor progress via the logs viewer. The service is ready when the status shows "Running" and API endpoint information appears.
API Key Creation
Generate an API key through the GPUStack interface for authentication when accessing the model endpoint.
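With the key in hand, the endpoint can be exercised directly. The base path and model name below are assumptions for illustration; substitute the exact values shown on the model's detail page in the GPUStack UI:

```shell
# Placeholder values -- replace with the endpoint and key from the GPUStack UI.
BASE_URL="http://your_host_ip/v1"   # assumption: OpenAI-compatible base path
API_KEY="YOUR_API_KEY"
curl -s "$BASE_URL/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "AgentCPM-Explore", "messages": [{"role": "user", "content": "Hello"}]}'
```

A JSON response with a `choices` array confirms the deployment is reachable and the key is valid.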
Using AgentCPM-Explore for Web Search
Configure Cherry Studio or any compatible client:
- Add GPUStack as the model provider
- Enter the endpoint URL, model name, and API key
- Test the connection
With a search plugin configured, AgentCPM-Explore can perform web search queries and synthesize responses. Performance metrics show approximately 867ms for first-token prefill and 18 tokens/second generation speed—adequate for single-user local queries.
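Those metrics translate into predictable end-to-end latency. For example, a 1000-token answer at the figures above:

```shell
# 0.867 s prefill + 1000 tokens generated at 18 tokens/s
awk 'BEGIN { printf "%.1f s\n", 0.867 + 1000 / 18 }'
```

Roughly a minute per long answer, which is workable for a single user but highlights why this setup targets local, low-concurrency use.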
Deploying AgentCPM-Report
AgentCPM-Report is an 8B parameter model specialized for DeepResearch tasks. Two model formats are available:
Safetensors Format (vLLM)
Deploy using the same procedure as AgentCPM-Explore, selecting vLLM as the backend.
GGUF Format (llama.cpp)
For unified memory architectures, the GGUF format enables efficient CPU+GPU combined inference.
Add a custom llama.cpp backend:
- Navigate to Inference Backends → Add Backend
- Use YAML configuration:
backend_name: llama-cpp-custom
version_configs:
  v1:
    image_name: ghcr.io/ggml-org/llama.cpp:server-vulkan
    custom_framework: rocm
default_version: v1
default_backend_param: []
default_run_command: '-m {{model_path}} --host 0.0.0.0 --port {{port}}'
default_entrypoint: ''
is_built_in: false
framework_index_map:
  rocm:
    - v1
Deploy the GGUF format model using this custom backend. Configure environment variables:
LLAMA_ARG_THREADS=16
LLAMA_ARG_CTX_SIZE=65536
LLAMA_ARG_N_PREDICT=512
LLAMA_ARG_TEMP=0.1
LLAMA_ARG_MLOCK=1
The context size of 65536 (64K) matches AgentCPM-Report's maximum supported length. Adjust based on actual requirements.
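On a unified memory machine it can also pay to offload all layers to the GPU. llama.cpp's server reads `LLAMA_ARG_*` environment variables as equivalents of its CLI flags, so a fuller sketch might look like the following (`LLAMA_ARG_N_GPU_LAYERS` is a suggested addition, not part of the deployment above; verify it against your llama.cpp build):

```shell
# Same variables as in the deployment, plus full GPU layer offload.
export LLAMA_ARG_THREADS=16
export LLAMA_ARG_CTX_SIZE=65536
export LLAMA_ARG_N_PREDICT=512
export LLAMA_ARG_TEMP=0.1
export LLAMA_ARG_MLOCK=1
export LLAMA_ARG_N_GPU_LAYERS=999   # assumption: offload every layer to the iGPU
```

Setting the layer count higher than the model's actual layer count simply offloads everything, a common convention with llama.cpp.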
Once the service shows "Running" status, integrate it with DeepResearch frameworks like Dify or MiroThinker for comprehensive research task execution.