A lot happened on the way to this post. What started as a routine probe run turned into a two-day debugging session that touched every node in the cluster, uncovered a fleet-wide regression in llama.cpp, and surfaced enough capability data to finally answer the question I'd been deferring: which models actually deserve to be here?
Four models are gone. The rest have a written justification for their continued existence.
Before I could run probes against anything, I had to stop everything from crashing.
A recent llama.cpp update introduced automatic n_parallel detection: when --parallel is not specified, the server now defaults to n_parallel=4 with kv_unified=true. This is a reasonable default for shared inference servers. It is a catastrophe for small-VRAM nodes.
n_parallel=4 means four KV cache slots are allocated simultaneously. On a model like Hermes 8B Q8_0 running at 32K context with q4_0 KV, that multiplies the context buffer from ~1 GB to ~4 GB. Add the compute pp buffers — which also scale with parallel slots — and you get:
allocating 258.50 MiB on device 0: cudaMalloc failed: out of memory
failed to allocate compute pp buffers
Every model on ness-server2 (Tesla P4, 8 GB) was crash-looping. The RTX 4060 Mobile on ness-legion1 was thermal-cycling. Models that had loaded fine for months were suddenly OOMing at startup.
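The jump from ~1 GB to ~4 GB can be checked with back-of-the-envelope arithmetic. The sketch below uses the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and the q4_0 block layout (32 values in 18 bytes). It assumes, as described above, that each parallel slot gets its own full 32K allocation. The function name is illustrative, and real allocations include padding and other buffers this ignores.

```python
# Rough KV-cache size estimate for Hermes 3 (Llama 3.1 8B) with q4_0 KV.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
Q4_0_BYTES_PER_ELEMENT = 18 / 32  # 16 bytes of 4-bit data + 2-byte fp16 scale per 32 values
GIB = 1024 ** 3


def kv_cache_bytes(n_ctx: int, n_slots: int = 1) -> int:
    """Bytes of K and V cache for n_slots sequences of n_ctx tokens each."""
    elements_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # factor 2 = K and V
    return int(elements_per_token * Q4_0_BYTES_PER_ELEMENT * n_ctx * n_slots)


for slots in (1, 4):
    size = kv_cache_bytes(32 * 1024, slots)
    print(f"n_parallel={slots}: {size / GIB:.3f} GiB")
```

Under these assumptions one slot comes to 1.125 GiB and four slots to 4.5 GiB, before the compute buffers are counted, which lines up with the figures above.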
The fix was mechanical but widespread: add --parallel 1 to every deployment that didn't already have it. That's 14 models on ness-linux3, 7 on ness-server2, and the remaining legion models. A boring fix, but it took the better part of a day to audit and apply across the fleet.
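The audit itself is simple enough to script. This is a minimal sketch, assuming each deployment's llama-server arguments are available as a list of strings; the `pin_parallel` helper and the sample argument list are hypothetical, not the tooling actually used here. It only checks for the two spellings llama-server documents, `--parallel` and `-np`.

```python
def pin_parallel(args: list[str], slots: int = 1) -> list[str]:
    """Return a copy of args with an explicit parallel-slot flag.

    If --parallel or -np is already present the list is returned unchanged,
    so deployments that deliberately use more slots are left alone.
    """
    if any(a in ("--parallel", "-np") for a in args):
        return list(args)
    return [*args, "--parallel", str(slots)]


# Hypothetical deployment arguments, for illustration only.
before = ["-m", "model.gguf", "-c", "32768", "-ngl", "99"]
after = pin_parallel(before)
print(after)
```

Because the function is idempotent, it can be run across every deployment without first working out which ones were already fixed.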
There was a secondary issue on ness-server2: a zombie pod from a probe run had been holding ~5 GB of P4 VRAM for 97 minutes while loch-nessh's VRAM registry showed loaded_models: []. The scheduler thought the node was free and kept trying to load new models into the 1.9 GB that remained. That's a known phantom-capacity edge case — the registry entry had been cleared on pod crash, but the pod itself was still running. Scaling it down manually cleared the jam.
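One way to catch this class of problem is to compare what the registry believes against what the device reports. The following is a sketch only: `NodeView`, `phantom_usage_mib`, and the idle baseline are hypothetical names and values, not part of loch-nessh, and the measured figure would have to come from something like nvidia-smi.

```python
from dataclasses import dataclass, field


@dataclass
class NodeView:
    used_mib: int                      # VRAM in use as measured on the device
    registry_models: list[str] = field(default_factory=list)  # registry's view


def phantom_usage_mib(node: NodeView, idle_baseline_mib: int = 300) -> int:
    """VRAM in use that an empty registry cannot account for.

    Returns 0 when the registry lists any loaded model, since the usage
    is then at least plausibly explained.
    """
    if node.registry_models:
        return 0
    return max(0, node.used_mib - idle_baseline_mib)


# Roughly the situation described above: ~5 GB held, registry empty.
stuck = NodeView(used_mib=5120, registry_models=[])
print(phantom_usage_mib(stuck))
```

A nonzero result with an empty registry would be a reasonable signal to refuse new loads on that node until the orphaned pod is found.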
With --parallel 1 across the fleet and the zombie cleared, everything loaded.
The capability scores come from ipsa-probe, a Rust probe harness that runs a battery of structured tests against each model via loch-nessh. The dimensions most relevant to this audit:
- open_close_fidelity: does the model produce well-formed tagged output reliably?
- format, result_integration, parallel_calls, multi_step: can it call tools, integrate results in a follow-up turn, and coordinate multiple calls?
- json_mode: does it follow JSON schemas without prompting tricks?
- prompted_improvement: does explicit CoT improve answer quality?
- emission: does it emit native think blocks, prompted think blocks, or nothing?

Scores are stored as TOML profiles and re-run whenever the probe suite changes. The suite grew significantly for this pass: five new tool-calling probes were added (NativeToolCallProbe, XmlToolCallProbe, ToolCallArgumentValidityProbe, ToolResultIntegrationProbe, ParallelToolCallProbe), which is why so many profiles needed refreshing.
Mistral-Large-Instruct-2407: the 2407 vintage, released mid-2024, is eighteen months old. It was the fleet's general-purpose heavyweight at a time when there wasn't anything better. There is now. devstral-123b and qwen3-5-122b each fit the same 90 GB VRAM slot, are more capable models, and post strong probe scores (both score result=1.0, multi_step=true, parallel_calls=true). Mistral-large matched those scores too, but age matters: its knowledge cutoff, its instruction-following norms, and its tool-calling reliability are all behind the current alternatives. With the GGUF files already deleted, there was no reason to keep the registration.
Qwen2.5-Coder-7B-Instruct-Q4_K_M on the Tesla P4. Probe scores: result_integration=0.0, multi_step=false, parallel_calls=false. Compare to qwen3-5-mx, which runs in 7 GB on the same node and scores result_integration=1.0, multi_step=true, parallel_calls=true. There's no use case where you'd route to qwen-coder-mx over qwen3-5-mx — the latter is a newer architecture, scores better on every tool-calling metric, and fits in the same VRAM budget. Retired.
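The "no use case" argument is a dominance check: one model is at least as good on every metric and strictly better on at least one. A minimal sketch, using the probe values quoted above; the `dominates` helper and the dict layout are illustrative, not how the audit was actually run.

```python
TOOL_METRICS = ("result_integration", "multi_step", "parallel_calls")


def dominates(a: dict, b: dict, metrics=TOOL_METRICS) -> bool:
    """True if a is >= b on every metric and > b on at least one.

    Booleans are compared as 0.0 / 1.0 so they mix with float scores.
    """
    at_least_as_good = all(float(a[m]) >= float(b[m]) for m in metrics)
    strictly_better = any(float(a[m]) > float(b[m]) for m in metrics)
    return at_least_as_good and strictly_better


qwen3_5_mx = {"result_integration": 1.0, "multi_step": True, "parallel_calls": True}
qwen_coder_mx = {"result_integration": 0.0, "multi_step": False, "parallel_calls": False}

print(dominates(qwen3_5_mx, qwen_coder_mx))  # True
print(dominates(qwen_coder_mx, qwen3_5_mx))  # False
```

Two models with identical scores do not dominate each other under this definition, which is why a tie on probes has to be broken by something else, such as model age.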
Hermes-3-Llama-3.1-8B.Q8_0 configured with -ngl 0: all inference on the i7-4770K, a 2013 Haswell desktop CPU. This was always an experiment. The Hermes 8B Q8_0 doesn't fit comfortably in the P4's 8 GB with KV cache overhead, so it was relegated to CPU. The problem: CPU inference on a 13-year-old desktop processor is so slow it's barely useful for anything interactive. hermes-beast runs the same model on Vulkan via the Ryzen AI-Max's 96 GB of unified memory and is dramatically faster. There's no scenario where hermes-mx would be preferred. Retired.
Devstral-Small-2-24B-Instruct-2512-Q4_K_M with -ngl 16 on the P4. This was always a stretch: 24B parameters, only 16 of 40 layers on GPU, the rest spilling into system RAM. -ngl 20 (6.6 GB) OOMed during setup; -ngl 16 (~5.3 GB) worked but left the model heavily CPU-bound. The profile from this probe run came back entirely broken — cot=0.0, json=None, thinking=None — likely a casualty of the n_parallel=4 chaos and the resulting zombie pod situation. Even if re-probed cleanly, the practical value is limited: you're running a 24B coding model on an Intel i7-4770K, competing for RAM with the OS, waiting several minutes for a response. The GGUF is gone. Retired.
devstral-small / devstral-small-zed (54 GB, 256K ctx)
The fleet's daily driver for coding. Full tool calling with result integration and parallel calls. 256K context for whole-repo work. devstral-small-zed is a dedicated endpoint for the Zed editor integration — same model, separate service so the editor config doesn't fight with the API. Earns its spot by being fast, capable, and context-rich.
qwen3-coder (44 GB, 262K ctx) and qwen3-coder-next (90 GB, 262K ctx)
Two members of the Qwen3 coding family. qwen3-coder is a 30B MoE that fits comfortably alongside other models (44 GB leaves headroom). qwen3-coder-next is the larger variant — 90 GB means it runs solo but scores better on tool-calling probes (result=1.0, parallel=true vs result=0.0 for qwen3-coder). Route qwen3-coder for concurrent workloads; route qwen3-coder-next when you need maximum coding accuracy and can afford the exclusivity.
devstral-123b (90 GB, 131K ctx)
The fleet's heavy code model. Mistral's 123B architecture, code-specialized, full tool capabilities. Solo-run only by VRAM reservation. For the hardest coding problems — full codebase reasoning, complex refactors, architectural decisions — this is the ceiling.
gemma4-31b (68 GB, 262K ctx)
Google's 31B Gemma 4 at full Q8_0. Excellent probe scores across the board: result=1.0, parallel=true, multi_step=true, 262K context. Best-in-fleet for long-context general reasoning that isn't specifically about code. Its 68 GB footprint allows limited co-loading on the 96 GB node.
gemma4-e4b (30 GB, 131K ctx)
The surprise of this audit. A tiny E4B (4-billion-effective-parameters) model that consistently punches above its weight: full tool calling, result integration, parallel calls, and even TaggedPrompted thinking. At 30 GB it co-loads easily. Use it for anything where speed matters and the task isn't extraordinarily complex — it will often be sufficient and is considerably faster than any of the large models.
hermes-beast (32 GB, 131K ctx)
Hermes 3 Llama 3.1 8B at Q8_0. Fast, reliable instruction-following, 131K context. It doesn't do parallel tool calls and fails result integration, but for single-turn tasks — summarization, classification, extraction — it's responsive and consistent. Good prompt compatibility with the Llama 3 instruction format.
phi4-mini-instruct-beast (7 GB, 32K ctx)
The 3.8B mini. It sits in a lower capability tier on tool-calling probes, lacking both result integration and parallel calls, which is a known limitation of the Phi architecture at this size. It's justified by speed and VRAM efficiency: it loads fast and leaves most of the node free for heavier concurrent models.
phi4-mini-reasoning-beast (7 GB, 32K ctx)
Phi-4-mini's reasoning variant with native thinking via --jinja. The current probe profile shows cot=0.2, which is wrong — almost certainly a thinking-budget truncation artifact during the probe run. This gets a re-probe. If the corrected scores look healthy, it's a strong small-model option for tasks requiring deliberate step-by-step reasoning.
deepseek-r1-32b (78 GB, 131K ctx)
The 32B R1 distill with TaggedNative thinking, when it doesn't loop. The infinite thinking loop issue is real and reproducible: some prompt patterns send it into a <think> spiral that exceeds any reasonable token budget. Probe is in progress. Not retiring until there's clear evidence the model is fundamentally broken rather than just sensitive to prompt construction. Alongside the 70B distill, it is one of the few fleet models with genuinely native chain-of-thought reasoning, which is worth preserving.
deepseek-r1-70b (73 GB, 131K ctx)
The 70B R1 distill. Current profile is unreliable — saved from a looping run, incorrectly shows thinking=None. Re-probe pending. Same reasoning as the 32B: not giving up on the DeepSeek R1 architecture yet.
qwen3-5-beast (14 GB, 131K ctx), qwen3-6-beast (68 GB, 262K ctx), qwen3-5-122b (90 GB, 131K ctx)
The Qwen3 general-reasoning tier, from fastest to most capable. qwen3-5 at 14 GB is the cheapest general model on the node — fast and solid for anything that doesn't require specialized code or extended reasoning. qwen3-6 (27B) is the mid-tier: strong reasoning, 262K context, thinking mode enabled. qwen3-5-122b (122B MoE) is the fleet's best general-purpose reasoner: result=1.0, multi_step=true, parallel=true, and enough parameter count to handle problems the smaller models fail on. Currently it runs solo by VRAM requirement.
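The "runs solo" and "co-loads easily" labels fall out of simple addition against the node's capacity. The sketch below uses the footprints listed in this section and assumes that all of these models share the 96 GB node and that the full 96 GB is schedulable; the helper names are hypothetical.

```python
NODE_VRAM_GB = 96  # assumption: the whole 96 GB is available to models

FOOTPRINT_GB = {
    "devstral-small": 54, "qwen3-coder": 44, "qwen3-coder-next": 90,
    "devstral-123b": 90, "gemma4-31b": 68, "gemma4-e4b": 30,
    "hermes-beast": 32, "phi4-mini-instruct-beast": 7,
    "phi4-mini-reasoning-beast": 7, "deepseek-r1-32b": 78,
    "deepseek-r1-70b": 73, "qwen3-5-beast": 14, "qwen3-6-beast": 68,
    "qwen3-5-122b": 90,
}


def fits_together(models: list[str], capacity: int = NODE_VRAM_GB) -> bool:
    """True if the combined footprint of the models fits on the node."""
    return sum(FOOTPRINT_GB[m] for m in models) <= capacity


def runs_solo(model: str, capacity: int = NODE_VRAM_GB) -> bool:
    """True if no other listed model can be loaded next to this one."""
    return all(
        not fits_together([model, other], capacity)
        for other in FOOTPRINT_GB
        if other != model
    )


print([m for m in FOOTPRINT_GB if runs_solo(m)])
print(fits_together(["devstral-small", "gemma4-e4b"]))
```

With these numbers only the three 90 GB models come out as solo, because even the 7 GB phi4-mini variants push the total to 97 GB.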
The P4 is a constrained resource. Its 8 GB is best used by models that fully fit on-chip; partial offloads onto the i7-4770K are slow enough to hurt practical utility. After this audit, four models remain:
qwen3-5-mx (7 GB, 16K ctx)
Best capability profile on server2: result=1.0, parallel=true, multi_step=true. 16K context is the P4's practical ceiling without Flash Attention. First choice for anything routed to server2 that needs real tool-calling capability.
gemma4-e4b-mx (8 GB, 16K ctx)
Same excellent probe scores as qwen3-5-mx; the Gemma 4 E4B architecture running at full Q6_K. Slightly heavier VRAM footprint means it's exclusive, but it's a capable alternative and is the server2 equivalent of the beast's gemma4-e4b.
phi4-mini-instruct-mx (5 GB, 32K ctx) and phi4-mini-reasoning-mx (5 GB, 32K ctx)
The lightweight pair. 5 GB each, leaving 3 GB headroom on the P4. phi4-mini-instruct-mx for fast general tasks; phi4-mini-reasoning-mx for small-model thinking (TaggedNative, confirmed on the mx variant). These are the server2 equivalents for edge/burst workloads when the main node is saturated.
The DeepSeek R1 models stay in the fleet, but their profiles are unreliable. The looping issue is a known failure mode — not a bug in the model, but in the interaction between long-context thinking chains and prompt patterns that don't bound the reasoning space. The fix is likely prompt-side (constraining the thinking budget in the request, not at the server level), but that needs systematic investigation.
phi4-mini-reasoning-beast has a broken cot=0.2 probe profile that almost certainly reflects a truncated thinking chain during measurement, not actual CoT failure. Re-probe in progress.
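Both the looping and the truncation cases leave the same fingerprint in raw output: a `<think>` tag that never closes. A probe or client could detect that before scoring. This is a sketch under the assumption that the completion arrives as plain text with the tags intact (some server configurations move reasoning into a separate field instead); `split_thinking` is a hypothetical helper, not part of ipsa-probe.

```python
def split_thinking(text: str) -> tuple[str, str, bool]:
    """Split a completion into (thinking, answer, closed).

    closed is False when a <think> block was opened but never terminated,
    which is what a loop or an exhausted token budget looks like.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    if start == -1:
        return "", text.strip(), True  # no think block at all
    body_start = start + len(open_tag)
    end = text.find(close_tag, body_start)
    if end == -1:
        return text[body_start:].strip(), "", False  # unterminated block
    answer = text[end + len(close_tag):].strip()
    return text[body_start:end].strip(), answer, True


print(split_thinking("<think>2+2</think>4"))
print(split_thinking("<think>hmm, wait"))
```

Treating an unclosed block as "measurement invalid" rather than "CoT score of zero" would keep a truncated run from being saved as a real profile.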
qwen3-5-beast scored xml=0.6, which is unexpectedly low for a Qwen3 model — the mx variant scores 1.0 on the same probe. Targeted re-probe queued.
The legion node (ness-legion1, RTX 4060 Mobile) remains unstable under sustained load. CPU temperatures spike to 95°C under full turbo, which causes hard shutdowns that look like node crashes from the cluster's perspective. Running in balanced power mode reduces peak power draw. Three legion models still need fresh profiles: phi4-mini-reasoning-legion, qwen3-5-legion, and the newly deployed gemma4-e4b-legion (which crashed mid-run during the last attempt). Those wait for a stable thermal window.
The fleet is smaller and cleaner than it was two days ago. The retirements weren't close calls.