AMD 9060 XT | Agentic LLM Dashboard

Command nohup llama-server -m /home/cfvasqeuz/models/gemma-4-31B-it-Q3_K_M.gguf --port 8080 -c 128000 -ngl 99 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 -np 2 -cb > llama.log 2>&1 &

-ngl 99 Full GPU
Keeps the 31B model entirely on VRAM for maximum token speed.

-c 128000 Ultra High Context
Allows for massive document analysis but consumes significant VRAM.

-cache-type-k/v q8_0 Cache Compression
Reduces the memory footprint of the context window by 50%.

-flash-attn on Efficiency
Prevents performance collapse as the 64k window fills up.

🤖 Agentic Setup: Parallel Processing

To run multiple agents concurrently (e.g., a Planner and a Coder), you must introduce Parallel Slots. This allows the GPU to handle multiple request streams simultaneously.

Multi-Agent Balanced Setup llama-server -m /home/cfvasqeuz/models/gemma-4-31B-it-Q3_K_M.gguf --port 8080 -c 8192 -np 4 -cb 512 --no-mmap -ngl 99 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0

-np 4 Parallel Slots
Allows 4 concurrent agent connections. The GPU batches these together for high throughput.

-cb 512 Batch Size
Sets how many tokens are processed in one "gulp" across all active parallel slots.

-c 8192 Reduced Context
Crucial: We lower the context to make room for the 4 parallel slots in VRAM.

⚠️ THE VRAM TRADE-OFF: VRAM = Model Weights + (Slots × Context Size).
You cannot have 4 agents each with 64k context on a 16GB card. You must trade context length for parallelism.

📊 Scenario Comparison

📈 VRAM Budget Breakdown

🛠️ Stability & Troubleshooting

Setup	-np (Slots)	-c (Context)	Best For...	VRAM Risk
🧠 Deep Thinker	1	64,000	Complex RAG, long documents, single agent.	High
🤖 Agentic Team	4	8,192	Multi-agent workflows, fast switching.	Moderate
⚡ High Throughput	8+	2,048	Simple, fast-acting agents, high concurrency.	Low/Safe

💥 If you crash (OOM):

1. Reduce -c (Context).
2. Reduce -np (Slots).
3. Remove -no-mmap.

🚀 To increase speed:

1. Use a smaller model (12B-14B).
2. Increase -cb (Batch size) to 1024.

🔍 Command Deep Dive

Background & Persistence: nohup ... & ensures the server stays alive after disconnect, while > llama.log 2>&1 captures all errors for debugging OOM crashes.

Memory Optimization: -c 128000 provides a massive window for codebase analysis. --cache-type-k/v q8_0 quantizes the KV cache to 8-bit, reducing the memory footprint by ~50%.

Agentic Flow: -np 2 allows two concurrent request streams (e.g., a Planner and a Coder) to process tokens simultaneously on the GPU.

Performance: --flash-attn on prevents the performance collapse typically seen as the context window fills up.

🖥️ AMD 9060 XT — Agentic LLM Dashboard

🧠 Current Command: Deep Memory Setup

🤖 Agentic Setup: Parallel Processing

📊 Scenario Comparison

📈 VRAM Budget Breakdown

🛠️ Stability & Troubleshooting

🔍 Command Deep Dive