← Back to Home

🧠 Current Command: Deep Memory Setup

Command nohup llama-server -m /home/cfvasqeuz/models/gemma-4-31B-it-Q3_K_M.gguf --port 8080 -c 128000 -ngl 99 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 -np 2 -cb > llama.log 2>&1 &
-ngl 99 Full GPU
Keeps the 31B model entirely on VRAM for maximum token speed.
-c 128000 Ultra High Context
Allows for massive document analysis but consumes significant VRAM.
-cache-type-k/v q8_0 Cache Compression
Reduces the memory footprint of the context window by 50%.
-flash-attn on Efficiency
Prevents performance collapse as the 64k window fills up.

🤖 Agentic Setup: Parallel Processing

To run multiple agents concurrently (e.g., a Planner and a Coder), you must introduce Parallel Slots. This allows the GPU to handle multiple request streams simultaneously.

Multi-Agent Balanced Setup llama-server -m /home/cfvasqeuz/models/gemma-4-31B-it-Q3_K_M.gguf --port 8080 -c 8192 -np 4 -cb 512 --no-mmap -ngl 99 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
-np 4 Parallel Slots
Allows 4 concurrent agent connections. The GPU batches these together for high throughput.
-cb 512 Batch Size
Sets how many tokens are processed in one "gulp" across all active parallel slots.
-c 8192 Reduced Context
Crucial: We lower the context to make room for the 4 parallel slots in VRAM.
⚠️ THE VRAM TRADE-OFF: VRAM = Model Weights + (Slots × Context Size).
You cannot have 4 agents each with 64k context on a 16GB card. You must trade context length for parallelism.

📊 Scenario Comparison

Setup -np (Slots) -c (Context) Best For... VRAM Risk
🧠 Deep Thinker 1 64,000 Complex RAG, long documents, single agent. High
🤖 Agentic Team 4 8,192 Multi-agent workflows, fast switching. Moderate
⚡ High Throughput 8+ 2,048 Simple, fast-acting agents, high concurrency. Low/Safe

📈 VRAM Budget Breakdown

Current Estimate (Deep Thinker): ~15.5 GB / 16 GB

📦
Model Weights
~12 GB (Fixed)
💾
KV Cache
Dynamic
🖥️
OS Overhead
~0.5 - 1 GB

🛠️ Stability & Troubleshooting

💥 If you crash (OOM):

1. Reduce -c (Context).
2. Reduce -np (Slots).
3. Remove -no-mmap.
🚀 To increase speed:

1. Use a smaller model (12B-14B).
2. Increase -cb (Batch size) to 1024.

🔍 Command Deep Dive

Background & Persistence: nohup ... & ensures the server stays alive after disconnect, while > llama.log 2>&1 captures all errors for debugging OOM crashes.

Memory Optimization: -c 128000 provides a massive window for codebase analysis. --cache-type-k/v q8_0 quantizes the KV cache to 8-bit, reducing the memory footprint by ~50%.

Agentic Flow: -np 2 allows two concurrent request streams (e.g., a Planner and a Coder) to process tokens simultaneously on the GPU.

Performance: --flash-attn on prevents the performance collapse typically seen as the context window fills up.