← Back to Home
🧠 Current Command: Deep Memory Setup
Command
nohup llama-server -m /home/cfvasqeuz/models/gemma-4-31B-it-Q3_K_M.gguf --port 8080 -c 128000 -ngl 99 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 -np 2 -cb > llama.log 2>&1 &
-ngl 99 Full GPU
Keeps the 31B model entirely on VRAM for maximum token speed.
-c 128000 Ultra High Context
Allows for massive document analysis but consumes significant VRAM.
-cache-type-k/v q8_0 Cache Compression
Reduces the memory footprint of the context window by 50%.
-flash-attn on Efficiency
Prevents performance collapse as the 64k window fills up.
🤖 Agentic Setup: Parallel Processing
To run multiple agents concurrently (e.g., a Planner and a Coder), you must introduce Parallel Slots. This allows the GPU to handle multiple request streams simultaneously.
Multi-Agent Balanced Setup
llama-server -m /home/cfvasqeuz/models/gemma-4-31B-it-Q3_K_M.gguf --port 8080 -c 8192 -np 4 -cb 512 --no-mmap -ngl 99 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
-np 4 Parallel Slots
Allows 4 concurrent agent connections. The GPU batches these together for high throughput.
-cb 512 Batch Size
Sets how many tokens are processed in one "gulp" across all active parallel slots.
-c 8192 Reduced Context
Crucial: We lower the context to make room for the 4 parallel slots in VRAM.
⚠️ THE VRAM TRADE-OFF: VRAM = Model Weights + (Slots × Context Size).
You cannot have 4 agents each with 64k context on a 16GB card. You must trade context length for parallelism.
📊 Scenario Comparison
| Setup |
-np (Slots) |
-c (Context) |
Best For... |
VRAM Risk |
| 🧠 Deep Thinker |
1 |
64,000 |
Complex RAG, long documents, single agent. |
High |
| 🤖 Agentic Team |
4 |
8,192 |
Multi-agent workflows, fast switching. |
Moderate |
| ⚡ High Throughput |
8+ |
2,048 |
Simple, fast-acting agents, high concurrency. |
Low/Safe |
📈 VRAM Budget Breakdown
Current Estimate (Deep Thinker): ~15.5 GB / 16 GB
📦
Model Weights
~12 GB (Fixed)
🛠️ Stability & Troubleshooting
💥 If you crash (OOM):
1. Reduce -c (Context).
2. Reduce -np (Slots).
3. Remove -no-mmap.
🚀 To increase speed:
1. Use a smaller model (12B-14B).
2. Increase -cb (Batch size) to 1024.
🔍 Command Deep Dive
Background & Persistence: nohup ... & ensures the server stays alive after disconnect, while > llama.log 2>&1 captures all errors for debugging OOM crashes.
Memory Optimization: -c 128000 provides a massive window for codebase analysis. --cache-type-k/v q8_0 quantizes the KV cache to 8-bit, reducing the memory footprint by ~50%.
Agentic Flow: -np 2 allows two concurrent request streams (e.g., a Planner and a Coder) to process tokens simultaneously on the GPU.
Performance: --flash-attn on prevents the performance collapse typically seen as the context window fills up.