I run a small private AI cluster — three nodes, a VRAM-aware broker, and whatever open models I can wrangle into GGUF format. After deploying a new wave of models (Phi-4-mini variants, Qwen3 series, Devstral Small Q8), I wanted a systematic answer to the obvious question: what can each of these things actually do?
So I built ipsa-probe: a Rust probe harness that fires a structured battery of tests at each model through the loch-nessh broker, and saves the results as versioned TOML profiles. These aren't vibes — they're reproducible signal.
This post is the first aggregated comparison. Fifteen profiles in, here's what the data says.
The probe suite is organized around five protocol layers, loosely derived from how agentic harnesses actually communicate with models:
| Layer | What it tests |
|---|---|
| I. Structural Delimiters | XML tag fidelity, custom tags, streaming safety |
| II. Cognitive Scaffolding | Thinking block emission, CoT quality, reflection responsiveness |
| III. Extrinsic Integration | Tool calling format, JSON mode, constrained generation |
| IV. Behavioural Overlays | Instruction adherence, system prompt authority, persona stability |
| V. Operational Parameters | Temperature sensitivity, sampling param support |
| + Response Shape | Whether the model can emit full docs, unified diffs, or custom patches |
Each probe fires a prompt, parses the output, and records a float score (0.0–1.0) or an enum against a known rubric. The harness strips thinking blocks before scoring so reasoning tokens don't inflate the answers.
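The stripping step described above could look roughly like the sketch below. This is not the actual ipsa-probe code: the function name `strip_think_blocks` is hypothetical, and dropping everything after an unclosed `<think>` is an assumed policy.

```rust
/// Remove every `<think>…</think>` span from `raw` so that only the
/// user-visible answer is handed to the scorer.
/// Assumption: an unclosed `<think>` discards the rest of the reply.
fn strip_think_blocks(raw: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    let mut out = String::with_capacity(raw.len());
    let mut rest = raw;
    while let Some(start) = rest.find(OPEN) {
        // Keep the text in front of the block.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            // Skip the block body and its closing tag, then keep scanning.
            Some(end) => rest = &after_open[end + CLOSE.len()..],
            // No closing tag: treat the remainder as reasoning and drop it.
            None => {
                rest = "";
                break;
            }
        }
    }
    out.push_str(rest);
    out.trim().to_string()
}
```

Scoring the stripped string rather than the raw reply means a correct answer that only appears inside the scratchpad does not count.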
Models tested as of this post. All run locally via llama.cpp unless noted.
| Model | Node | VRAM | Quant |
|---|---|---|---|
| devstral-small-beast | ness-linux3 | 18 GB | Q8_0 |
| deepseek-r1-32b | ness-linux3 | 26 GB | Q4_K_M |
| deepseek-r1-70b | ness-linux3 | 52 GB | Q4_K_M |
| deepseek-r1-15b-legion | ness-legion1 | 12 GB | Q4_K_M |
| phi4-beast | ness-linux3 | 10 GB | Q8_0 |
| phi4-mini-instruct-legion | ness-legion1 | 5 GB | Q4_K_M |
| phi4-mini-reasoning-legion | ness-legion1 | 5 GB | Q4_K_M |
| phi4-mini-instruct-mx | mx-legacy | 5 GB | Q4_K_M |
| phi4-mini-reasoning-mx | mx-legacy | 5 GB | Q4_K_M |
| qwen3-coder | ness-linux3 | 34 GB | Q4_K_M |
| qwen-coder-legion | ness-legion1 | 18 GB | Q4_K_M |
| qwen3-5-legion | ness-legion1 | 8 GB | Q4_K_M |
| qwen-mx | mx-legacy (CPU+P4) | 8 GB | Q4_K_M |
| gemma4-31b | ness-linux3 | 62 GB | Q4_K_M |
| hermes-beast | ness-linux3 | 9 GB | Q4_K_M |
Pending: phi4-mini-instruct-beast, phi4-mini-reasoning-beast, qwen3-5-beast, qwen3-6-27b, qwen3.5-122b, devstral-2-small-mx. Profiles will be added once probes complete.
This is the most immediately useful dimension for agentic work. A model that emits thinking blocks lets you strip the scratchpad from user-visible output while still benefitting from the reasoning chain.
| Model | Thinking emission | Extractable | Temperature det. |
|---|---|---|---|
| devstral-small-beast | TaggedPrompted | yes | yes |
| phi4-beast | TaggedPrompted | yes | yes |
| hermes-beast | TaggedPrompted | yes | no |
| deepseek-r1-32b | TaggedNative | yes | yes |
| deepseek-r1-70b | TaggedNative | yes | no |
| deepseek-r1-15b-legion | TaggedNative | yes | yes |
| phi4-mini-instruct-legion | None | — | yes |
| phi4-mini-reasoning-legion | None | — | no |
| phi4-mini-instruct-mx | None | — | yes |
| phi4-mini-reasoning-mx | None | — | no |
| qwen3-coder | None | — | no |
| qwen3-5-legion | None | — | yes |
| qwen-coder-legion | None | — | yes |
| qwen-mx | None | — | yes |
| gemma4-31b | None | — | yes |
TaggedNative means the model emits <think>…</think> blocks without any prompting; it's baked into the model's default behaviour. The three DeepSeek R1 variants are the only models in the fleet that do this, and each one reliably surfaces its reasoning chain.
TaggedPrompted (Devstral, Phi-4, Hermes) means the model will emit thinking blocks when the system prompt explicitly asks for them, but won't by default. These are the best "agentic reasoning" candidates if you're willing to prime the system prompt.
The Phi-4-mini reasoning variants (phi4-mini-reasoning-*) are an interesting case: they're Microsoft's reasoning-tuned Phi-4-mini, but probe results show emission = "None". The thinking appears to be compiled into the response structure rather than exposed as a separable token stream — the harness's <think> stripping finds nothing to strip. Whether this means the reasoning is less accessible or just differently encoded is worth more investigation.
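The three emission labels can be read as the result of two probe runs per model, one without and one with a system prompt that asks for thinking tags. The sketch below assumes exactly that two-run design; the enum mirrors the labels in the table, while `has_think_block` and `classify_emission` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ThinkingEmission {
    None,
    TaggedPrompted,
    TaggedNative,
}

/// True when `text` contains `<think>` followed later by `</think>`.
fn has_think_block(text: &str) -> bool {
    match text.find("<think>") {
        Some(open) => text[open..].contains("</think>"),
        None => false,
    }
}

/// `unprimed` is the reply with a neutral system prompt,
/// `primed` the reply when the system prompt explicitly asks for thinking tags.
fn classify_emission(unprimed: &str, primed: &str) -> ThinkingEmission {
    if has_think_block(unprimed) {
        // Tags appear without being asked for.
        ThinkingEmission::TaggedNative
    } else if has_think_block(primed) {
        // Tags appear only on request.
        ThinkingEmission::TaggedPrompted
    } else {
        ThinkingEmission::None
    }
}
```

Under this reading, a reasoning-tuned model that never writes the literal tags lands in `None` even if it reasons inside the answer text, which matches what the Phi-4-mini reasoning rows show.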
System prompt authority is critical for agentic use. A model where user turns can silently override the system prompt is a liability in multi-agent workflows.
| Model | System priority | Conflict resolution |
|---|---|---|
| devstral-small-beast | Respected | FollowsSystem |
| hermes-beast | Respected | FollowsSystem |
| gemma4-31b | Respected | FollowsSystem |
| qwen3-coder | Weak | BlendsBoth |
| qwen3-5-legion | Respected | FollowsSystem |
| qwen-coder-legion | Respected | FollowsSystem |
| qwen-mx | Respected | FollowsSystem |
| deepseek-r1-32b | Overrideable | Unpredictable |
| deepseek-r1-70b | Overrideable | Unpredictable |
| phi4-beast | Respected | FollowsSystem |
| phi4-mini-instruct-legion | Overrideable | Unpredictable |
| phi4-mini-reasoning-legion | Weak | BlendsBoth |
| phi4-mini-instruct-mx | Overrideable | Unpredictable |
| phi4-mini-reasoning-mx | Weak | BlendsBoth |
Devstral, Hermes, Gemma4, phi4-beast, and the legion and mx Qwen variants are the safe picks here. They treat the system prompt as authoritative and resolve conflicts by deferring to it. The one exception in the Qwen family is qwen3-coder, which scores Weak.
The DeepSeek R1 32B and 70B variants score Overrideable on system priority and Unpredictable on conflict resolution: they often blend system and user intent in ways that aren't deterministic between runs. This makes them harder to use as reliable agents, despite their strong reasoning capability. One plausible explanation is that R1's training emphasized reasoning performance over instruction-following stability.
The qwen3-coder (beast, 34B) scores Weak on system priority, which is surprising for a model this size. It was tuned for coding assistance, and it seems to prioritize user-turn context heavily. The legion and mx Qwen variants — including qwen3-5-legion (9B) — all score Respected. The 9B model is more obedient than the 34B model. This is likely a fine-tune difference: qwen3-coder is a coding-specialist tune whereas qwen3-5-legion runs the base instruct tune, which has stronger instruction-following alignment.
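One way to arrive at conflict-resolution labels like the ones in the table is to give the system and user turns contradictory, easily detectable instructions and repeat the run several times. The sketch below assumes that design; the marker strings, the `Outcome` type, and both function names are hypothetical, and the mapping from run outcomes to labels is my guess at the rubric rather than the harness's actual rule.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Outcome {
    System,
    User,
    Both,
    Neither,
}

#[derive(Debug, PartialEq, Eq)]
enum ConflictResolution {
    FollowsSystem,
    BlendsBoth,
    Unpredictable,
}

/// Decide which instruction a single reply honoured by looking for the
/// marker each side asked for (for example, a mandatory sign-off string).
fn observe(reply: &str, system_marker: &str, user_marker: &str) -> Outcome {
    match (reply.contains(system_marker), reply.contains(user_marker)) {
        (true, false) => Outcome::System,
        (false, true) => Outcome::User,
        (true, true) => Outcome::Both,
        (false, false) => Outcome::Neither,
    }
}

/// Assumed rubric: identical behaviour on every run gets a stable label,
/// anything that varies between runs is Unpredictable. The source shows no
/// label for consistent user-following, so that case also falls through.
fn classify_conflict(runs: &[Outcome]) -> ConflictResolution {
    let all = |want: Outcome| !runs.is_empty() && runs.iter().all(|o| *o == want);
    if all(Outcome::System) {
        ConflictResolution::FollowsSystem
    } else if all(Outcome::Both) {
        ConflictResolution::BlendsBoth
    } else {
        ConflictResolution::Unpredictable
    }
}
```

Repeating the run matters here: a single sample cannot distinguish a model that always defers to the system prompt from one that happened to do so once.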
Can the model avoid doing something it's told not to do?
| Model | Score |
|---|---|
| gemma4-31b | 1.00 |
| devstral-small-beast | 0.75 |
| deepseek-r1-70b | 0.75 |
| qwen3-5-legion | 0.75 |
| phi4-mini-instruct-mx | 0.75 |
| phi4-mini-reasoning-mx | 0.75 |
| phi4-mini-reasoning-legion | 0.50 |
| deepseek-r1-32b | 0.50 |
| phi4-beast | 0.00 |
| qwen-coder-legion | 0.00 |
| qwen-mx | 0.00 |
Gemma4-31b is the only model scoring a clean 1.0 on negative instructions. At the other end, phi4-beast, qwen-coder-legion, and qwen-mx score 0.0: these models will attempt tasks they've been told not to attempt, especially when user-turn context contradicts the prohibition.
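The scores in this table are all multiples of 0.25, which is what a pass fraction over four trials would produce. A minimal sketch under that assumption follows; the `Trial` type, the substring-based check, and the function name are hypothetical stand-ins for whatever rubric the harness actually applies.

```rust
/// One negative-instruction trial: the model's reply and the content it
/// was explicitly told not to produce.
struct Trial<'a> {
    reply: &'a str,
    forbidden: &'a str,
}

/// Fraction of trials in which the forbidden content is absent.
/// Assumption: a case-insensitive substring match is a good enough detector.
fn negative_instruction_score(trials: &[Trial<'_>]) -> f64 {
    if trials.is_empty() {
        return 0.0;
    }
    let passed = trials
        .iter()
        .filter(|t| !t.reply.to_lowercase().contains(&t.forbidden.to_lowercase()))
        .count();
    passed as f64 / trials.len() as f64
}
```

A substring check is crude (it cannot tell a quoted refusal from a violation), so a real rubric would likely need per-trial detectors.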
Most agentic harnesses use XML-style tags to segment thought from action from output. A harness that can't trust the model to open and close tags correctly is flying blind.
| Model | Open/close fidelity | Custom tags | Attributes |
|---|---|---|---|
| devstral-small-beast | 1.0 | yes | yes |
| phi4-beast | 1.0 | yes | yes |
| hermes-beast | 1.0 | yes | yes |
| gemma4-31b | 1.0 | yes | yes |
| deepseek-r1-32b | 1.0 | yes | yes |
| deepseek-r1-70b | 1.0 | yes | yes |
| qwen3-coder | 1.0 | yes | yes |
| qwen3-5-legion | 1.0 | yes | yes |
| qwen-coder-legion | 1.0 | yes | yes |
| qwen-mx | 1.0 | yes | yes |
| phi4-mini-instruct-legion | 1.0 | no | yes |
| phi4-mini-instruct-mx | 1.0 | no | yes |
| phi4-mini-reasoning-legion | 0.60 | yes | no |
| phi4-mini-reasoning-mx | 0.60 | yes | no |
The Phi-4-mini instruct variants are notable: they close tags reliably but don't generate custom (non-standard) tags on command. The harness can still use them with a fixed vocabulary of known tags, but can't ask these models to invent new tag names at runtime.
The Phi-4-mini reasoning variants score 0.60 on fidelity — they occasionally fail to close a tag or emit malformed nesting. For a harness that uses streaming tag detection, this means a fallback parser is required.
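Open/close fidelity can be checked with a plain stack: push on every opening tag, pop and compare on every closing tag. The sketch below assumes the score is the fraction of replies that pass this check; `tags_balanced` and `fidelity` are hypothetical names, and the scanner deliberately ignores edge cases such as `>` inside attribute values.

```rust
/// True when every opening tag in `text` is closed in the right order.
fn tags_balanced(text: &str) -> bool {
    let mut stack: Vec<&str> = Vec::new();
    let mut rest = text;
    while let Some(lt) = rest.find('<') {
        let after = &rest[lt + 1..];
        // A '<' not followed by a letter or '/' is ordinary text ("a < b").
        let starts_tag = after.starts_with('/')
            || after.chars().next().map_or(false, |c| c.is_ascii_alphabetic());
        if !starts_tag {
            rest = after;
            continue;
        }
        // A tag that never reaches '>' is malformed.
        let Some(gt) = after.find('>') else { return false };
        let inner = after[..gt].trim();
        rest = &after[gt + 1..];
        if inner.ends_with('/') {
            continue; // self-closing tag, nothing to track
        }
        if let Some(name) = inner.strip_prefix('/') {
            // Closing tag must match the most recent opening tag.
            if stack.pop() != Some(name.trim()) {
                return false;
            }
        } else {
            // Opening tag: the name ends at the first whitespace (attributes follow).
            match inner.split_whitespace().next() {
                Some(name) => stack.push(name),
                None => return false,
            }
        }
    }
    stack.is_empty()
}

/// Fraction of replies whose tags are balanced.
fn fidelity(replies: &[&str]) -> f64 {
    if replies.is_empty() {
        return 0.0;
    }
    let ok = replies.iter().filter(|r| tags_balanced(**r)).count();
    ok as f64 / replies.len() as f64
}
```

The same check can serve as the trigger for a fallback parser: when `tags_balanced` returns false on a completed reply, the harness can switch from streaming tag detection to a more lenient pass.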
For agentic code and document editing, we want a model that can emit a targeted diff rather than a full document rewrite. This is the document_patch field.
| Model | Patch format |
|---|---|
| devstral-small-beast | UnifiedDiff |
| deepseek-r1-32b | UnifiedDiff |
| deepseek-r1-70b | UnifiedDiff |
| gemma4-31b | UnifiedDiff |
| qwen3-coder | UnifiedDiff |
| qwen-coder-legion | UnifiedDiff |
| qwen3-5-legion | UnifiedDiff |
| qwen-mx | UnifiedDiff |
| hermes-beast | UnifiedDiff |
| phi4-beast | UnifiedDiff |
| phi4-mini-instruct-legion | CustomEdit |
| phi4-mini-instruct-mx | CustomEdit |
| phi4-mini-reasoning-mx | CustomEdit |
| phi4-mini-reasoning-legion | None |
UnifiedDiff means the model produces standard --- a/file / +++ b/file unified diffs when asked. These can be applied with patch(1) directly.
CustomEdit is a model-specific format: the Phi-4-mini instruct variants and phi4-mini-reasoning-mx produce a search-replace block syntax (similar to the Aider "SEARCH/REPLACE" convention) rather than standard diffs. These need a custom parser in the harness but are still actionable.
phi4-mini-reasoning-legion scoring None here is the clearest signal that this model is not suited for agentic editing tasks. It will rewrite documents in full or refuse to produce a structured patch at all.
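Telling the three patch formats apart can be done with a few line-prefix checks. The sketch below is one assumed way to do it: the unified-diff markers are the standard `---`, `+++`, and `@@` lines, while the SEARCH/REPLACE markers follow the Aider convention and may not match what the Phi-4-mini models actually emit. The function name is hypothetical.

```rust
#[derive(Debug, PartialEq, Eq)]
enum PatchFormat {
    UnifiedDiff,
    CustomEdit,
    None,
}

/// Classify a reply by the structural markers it contains.
fn detect_patch_format(reply: &str) -> PatchFormat {
    let (mut old_hdr, mut new_hdr, mut hunk) = (false, false, false);
    let (mut search, mut divider, mut replace) = (false, false, false);
    for line in reply.lines() {
        let l = line.trim_end();
        if l.starts_with("--- ") {
            old_hdr = true;
        } else if l.starts_with("+++ ") {
            new_hdr = true;
        } else if l.starts_with("@@ ") {
            hunk = true;
        } else if l.starts_with("<<<<<<< SEARCH") {
            search = true;
        } else if l == "=======" {
            divider = true;
        } else if l.starts_with(">>>>>>> REPLACE") {
            replace = true;
        }
    }
    if old_hdr && new_hdr && hunk {
        // Both file headers plus at least one hunk header.
        PatchFormat::UnifiedDiff
    } else if search && divider && replace {
        // A complete search/divider/replace triple.
        PatchFormat::CustomEdit
    } else {
        // Full rewrite, prose, or refusal.
        PatchFormat::None
    }
}
```

Requiring all markers of a format keeps a stray `---` horizontal rule or a lone `=======` from being misread as a patch.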
Honest answer: no model in the current fleet scored anything other than format = "None" on native tool calling. This is partly a probe methodology issue — the probe tests through loch-nessh's chat completions endpoint, which doesn't yet pass tool schemas downstream to the model. Native function-calling support for Devstral and Qwen3 exists in their GGUF metadata, but the integration path from loch-nessh → llama.cpp → model isn't wired up yet.
All models scored 1.0 on schema_adherence when JSON mode was engaged via soft-prompting (i.e., "respond only in JSON, using this schema"). This is useful for structured extraction tasks even without native tool calling.
Aggregated across the dimensions that matter most for agentic use:
| Model | Thinking | Sys. authority | Neg. instr. | Patch format | Best for |
|---|---|---|---|---|---|
| devstral-small-beast | TaggedPrompted | Respected | 0.75 | UnifiedDiff | Agentic coding, top pick |
| gemma4-31b | — | Respected | 1.00 | UnifiedDiff | Strict instruction following |
| deepseek-r1-70b | TaggedNative | Overrideable | 0.75 | UnifiedDiff | Long reasoning chains |
| deepseek-r1-32b | TaggedNative | Overrideable | 0.50 | UnifiedDiff | Faster reasoning |
| phi4-beast | TaggedPrompted | Respected | 0.00 | UnifiedDiff | Balanced; weak on negation |
| hermes-beast | TaggedPrompted | Respected | 0.00 | UnifiedDiff | Fast local chat |
| qwen3-coder | — | Weak | 0.50 | UnifiedDiff | Coding; needs strong user turns |
| qwen3-5-legion | — | Respected | 0.75 | UnifiedDiff | Legion: better compliance than the bigger model |
| qwen-coder-legion | — | Respected | 0.00 | UnifiedDiff | Legion burst load |
| phi4-mini-instruct-legion | — | Overrideable | 0.50 | CustomEdit | Legion lightweight tasks |
| phi4-mini-reasoning-legion | — | Weak | 0.50 | None | Avoid for agent editing |
| qwen-mx | — | Respected | 0.00 | UnifiedDiff | CPU-bound broker fallback |
For agentic coding and document editing: devstral-small-beast is the clear winner. It scores Respected on system authority, FollowsSystem on conflict resolution, emits thinking on demand, and produces standard unified diffs. The Q8_0 quantization at 18 GB leaves plenty of VRAM headroom on the 96 GB beast node.
When you need guaranteed instruction compliance: gemma4-31b is the only model scoring 1.0 on negative instructions. It's the model I'd use when the harness issues a hard constraint that must not be violated.
For long reasoning chains: The DeepSeek R1 variants (70B, 32B) are the only models emitting thinking blocks natively. The cost is Unpredictable conflict resolution — they should be used in harness configurations where the user turn is trusted, not in workflows where the system prompt is the single source of truth.
Size does not predict compliance. qwen3-5-legion (9B) scores better on system authority and negative instructions than qwen3-coder (34B). The likely cause is the tune rather than the size: the 9B model runs the base instruct tune, while the 34B model is a coding-specialist fine-tune that appears to have traded some instruction-following rigidity for code generation performance. If you're routing through legion and need reliable agent behaviour, qwen3-5-legion is the pick over qwen-coder-legion unless raw code generation quality is the primary requirement.
The Phi-4-mini reasoning variants are not what they advertise (at least not to this probe). They're trained as reasoning models, but the reasoning is not externally extractable via the <think> tag protocol that works on DeepSeek and Devstral. Their system authority is Weak, their tag fidelity is 0.60, and neither produces a standard unified diff (the mx variant emits CustomEdit, the legion variant no structured patch at all). They might still be useful for direct Q&A tasks where thinking is implicit, but they're not ready for agentic pipelines as currently deployed.
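The recommendations above amount to filtering profiles on a handful of fields. The sketch below shows that filter over a few rows copied from the summary table; the `Profile` struct, its field names, and `agentic_editors` are hypothetical and not the real ipsa-probe profile schema.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SystemPriority {
    Respected,
    Weak,
    Overrideable,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum PatchFormat {
    UnifiedDiff,
    CustomEdit,
    None,
}

/// Hypothetical in-memory view of one model profile.
struct Profile {
    name: &'static str,
    system_priority: SystemPriority,
    negative_instruction: f64,
    patch: PatchFormat,
}

/// Models suitable for agentic editing: authoritative system prompt,
/// standard diffs, and a negative-instruction score of at least `min_negative`.
fn agentic_editors(profiles: &[Profile], min_negative: f64) -> Vec<&'static str> {
    profiles
        .iter()
        .filter(|p| p.system_priority == SystemPriority::Respected)
        .filter(|p| p.patch == PatchFormat::UnifiedDiff)
        .filter(|p| p.negative_instruction >= min_negative)
        .map(|p| p.name)
        .collect()
}

/// A subset of the summary table, values copied from the post.
fn sample_fleet() -> Vec<Profile> {
    use PatchFormat::{CustomEdit, UnifiedDiff};
    use SystemPriority::{Overrideable, Respected, Weak};
    let row = |name, system_priority, negative_instruction, patch| Profile {
        name,
        system_priority,
        negative_instruction,
        patch,
    };
    vec![
        row("devstral-small-beast", Respected, 0.75, UnifiedDiff),
        row("gemma4-31b", Respected, 1.00, UnifiedDiff),
        row("deepseek-r1-70b", Overrideable, 0.75, UnifiedDiff),
        row("phi4-beast", Respected, 0.00, UnifiedDiff),
        row("qwen3-coder", Weak, 0.50, UnifiedDiff),
        row("qwen3-5-legion", Respected, 0.75, UnifiedDiff),
        row("phi4-mini-instruct-legion", Overrideable, 0.50, CustomEdit),
        row("phi4-mini-reasoning-legion", Weak, 0.50, PatchFormat::None),
    ]
}
```

With a threshold of 0.75 this returns devstral-small-beast, gemma4-31b, and qwen3-5-legion; raising it to 1.0 leaves only gemma4-31b.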
The following models were added to the fleet after this probe run and profiles are not yet available:
- phi4-mini-instruct-beast, phi4-mini-reasoning-beast: beast-node variants of the legion/mx minis
- qwen3-5-beast: beast-node 9B; interesting to see if it matches legion's instruction-following scores
- qwen3-6-27b: 128K context Qwen3 27B, first run pending
- qwen3.5-122B-A10B: the big one; needs dedicated 77 GB VRAM window
- devstral-2-small-mx: needs isolated P4 run (8 GB exclusive)

This post will be updated, or a follow-up will cover the new entries once those probes complete.