title: "The Cortex Fleet: Eight Open Models Under the ipsa-probe Microscope"
slug: "cortex-fleet-model-comparison-ipsa-probe"
published: 2026-04-26
tags: [local-ai, llm, benchmark, ipsa-probe, open-source-models, devstral, deepseek, phi4, qwen3, gemma4]
summary: "Every open model running on the home cluster put through five protocol layers: thinking, instruction adherence, structured output, response shape, and system prompt strength. The numbers behind which model goes where."
menu_title: "Fleet Model Comparison"
draft: false

The Cortex Fleet: Eight Open Models Under the ipsa-probe Microscope

I run a small private AI cluster — three nodes, a VRAM-aware broker, and whatever open models I can wrangle into GGUF format. After deploying a new wave of models (Phi-4-mini variants, Qwen3 series, Devstral Small Q8), I wanted a systematic answer to the obvious question: what can each of these things actually do?

So I built ipsa-probe: a Rust probe harness that fires a structured battery of tests at each model through the loch-nessh broker, and saves the results as versioned TOML profiles. These aren't vibes — they're reproducible signal.

This post is the first aggregated comparison. Fifteen profiles in, here's what the data says.

What ipsa-probe measures

The probe suite is organized around five protocol layers, loosely derived from how agentic harnesses actually communicate with models:

Layer	What it tests
I. Structural Delimiters	XML tag fidelity, custom tags, streaming safety
II. Cognitive Scaffolding	Thinking block emission, CoT quality, reflection responsiveness
III. Extrinsic Integration	Tool calling format, JSON mode, constrained generation
IV. Behavioural Overlays	Instruction adherence, system prompt authority, persona stability
V. Operational Parameters	Temperature sensitivity, sampling param support
+ Response Shape	Whether the model can emit full docs, unified diffs, or custom patches

Each probe fires a prompt, parses the output, and records a float score (0.0–1.0) or an enum against a known rubric. The harness strips thinking blocks before scoring so reasoning tokens don't inflate the answers.
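The strip-then-score step is simple enough to sketch. This is not the harness code (which is Rust), just a rough Python illustration; the `score_contains` rubric is a made-up stand-in for whatever each real probe checks.

```python
import re

# Matches <think>...</think> including newlines; non-greedy so that several
# separate blocks in one response are each removed on their own.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(response: str) -> str:
    """Remove thinking blocks so only the user-visible answer is scored."""
    return THINK_BLOCK.sub("", response).strip()


def score_contains(response: str, required: list[str]) -> float:
    """Toy rubric: fraction of required strings present in the visible answer."""
    visible = strip_thinking(response)
    if not required:
        return 0.0
    hits = sum(1 for item in required if item in visible)
    return hits / len(required)


raw = "<think>The user wants ALPHA and BETA.</think>Here is ALPHA."
print(score_contains(raw, ["ALPHA", "BETA"]))  # 0.5
```

Without the stripping step the example above would score 1.0, because both required strings appear in the scratchpad even though only one reached the answer.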

The fleet

Models tested as of this post. All run locally via llama.cpp unless noted.

Model	Node	VRAM	Quant
`devstral-small-beast`	ness-linux3	18 GB	Q8_0
`deepseek-r1-32b`	ness-linux3	26 GB	Q4_K_M
`deepseek-r1-70b`	ness-linux3	52 GB	Q4_K_M
`deepseek-r1-15b-legion`	ness-legion1	12 GB	Q4_K_M
`phi4-beast`	ness-linux3	10 GB	Q8_0
`phi4-mini-instruct-legion`	ness-legion1	5 GB	Q4_K_M
`phi4-mini-reasoning-legion`	ness-legion1	5 GB	Q4_K_M
`phi4-mini-instruct-mx`	mx-legacy	5 GB	Q4_K_M
`phi4-mini-reasoning-mx`	mx-legacy	5 GB	Q4_K_M
`qwen3-coder`	ness-linux3	34 GB	Q4_K_M
`qwen-coder-legion`	ness-legion1	18 GB	Q4_K_M
`qwen3-5-legion`	ness-legion1	8 GB	Q4_K_M
`qwen-mx`	mx-legacy (CPU+P4)	8 GB	Q4_K_M
`gemma4-31b`	ness-linux3	62 GB	Q4_K_M
`hermes-beast`	ness-linux3	9 GB	Q4_K_M

Pending: phi4-mini-instruct-beast, phi4-mini-reasoning-beast, qwen3-5-beast, qwen3-6-27b, qwen3.5-122b, devstral-2-small-mx. Profiles will be added once probes complete.

Layer II — Cognitive Scaffolding: who thinks out loud?

This is the most immediately useful dimension for agentic work. A model that emits thinking blocks lets you strip the scratchpad from user-visible output while still benefitting from the reasoning chain.

Model	Thinking emission	Extractable	Temperature det.
`devstral-small-beast`	TaggedPrompted	yes	yes
`phi4-beast`	TaggedPrompted	yes	yes
`hermes-beast`	TaggedPrompted	yes	no
`deepseek-r1-32b`	TaggedNative	yes	yes
`deepseek-r1-70b`	TaggedNative	yes	no
`deepseek-r1-15b-legion`	TaggedNative	yes	yes
`phi4-mini-instruct-legion`	None	—	yes
`phi4-mini-reasoning-legion`	None	—	no
`phi4-mini-instruct-mx`	None	—	yes
`phi4-mini-reasoning-mx`	None	—	no
`qwen3-coder`	None	—	no
`qwen3-5-legion`	None	—	yes
`qwen-coder-legion`	None	—	yes
`qwen-mx`	None	—	yes
`gemma4-31b`	None	—	yes

TaggedNative means the model emits <think>…</think> blocks without any prompting — it's baked into the model's default behaviour. DeepSeek R1 variants are the only models in the fleet that do this. Every DeepSeek variant reliably surfaces its reasoning chain.

TaggedPrompted (Devstral, Phi-4, Hermes) means the model will emit thinking blocks when the system prompt explicitly asks for them, but won't by default. These are the best "agentic reasoning" candidates if you're willing to prime the system prompt.

The Phi-4-mini reasoning variants (phi4-mini-reasoning-*) are an interesting case: they're Microsoft's reasoning-tuned Phi-4-mini, but probe results show emission = "None". The thinking appears to be compiled into the response structure rather than exposed as a separable token stream — the harness's <think> stripping finds nothing to strip. Whether this means the reasoning is less accessible or just differently encoded is worth more investigation.
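The three emission categories can be told apart with two responses to the same task: one with a neutral system prompt and one with a system prompt that asks for `<think>` blocks. The sketch below assumes that two-shot setup; how ipsa-probe actually decides the label is not described here, so treat the function names and the decision order as illustrative.

```python
import re
from enum import Enum

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


class Emission(Enum):
    TAGGED_NATIVE = "TaggedNative"
    TAGGED_PROMPTED = "TaggedPrompted"
    NONE = "None"


def has_thinking(response: str) -> bool:
    """True if the response contains at least one non-empty think block."""
    return any(body.strip() for body in THINK_BLOCK.findall(response))


def classify_emission(unprimed: str, primed: str) -> Emission:
    """Classify from two responses to the same task.

    `unprimed` was produced with a neutral system prompt, `primed` with a
    system prompt that explicitly asks for <think> blocks.
    """
    if has_thinking(unprimed):
        return Emission.TAGGED_NATIVE
    if has_thinking(primed):
        return Emission.TAGGED_PROMPTED
    return Emission.NONE
```

Under this rubric a model that reasons inline without tags, as the Phi-4-mini reasoning variants seem to, lands in `NONE` no matter how much reasoning the answer contains.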

Layer IV — Behavioural Overlays: who follows orders?

System prompt authority is critical for agentic use. A model where user turns can silently override the system prompt is a liability in multi-agent workflows.

System prompt authority

Model	System priority	Conflict resolution
`devstral-small-beast`	Respected	FollowsSystem
`hermes-beast`	Respected	FollowsSystem
`gemma4-31b`	Respected	FollowsSystem
`qwen3-coder`	Weak	BlendsBoth
`qwen3-5-legion`	Respected	FollowsSystem
`qwen-coder-legion`	Respected	FollowsSystem
`qwen-mx`	Respected	FollowsSystem
`deepseek-r1-32b`	Overrideable	Unpredictable
`deepseek-r1-70b`	Overrideable	Unpredictable
`phi4-beast`	Respected	FollowsSystem
`phi4-mini-instruct-legion`	Overrideable	Unpredictable
`phi4-mini-reasoning-legion`	Weak	BlendsBoth
`phi4-mini-instruct-mx`	Overrideable	Unpredictable
`phi4-mini-reasoning-mx`	Weak	BlendsBoth

Devstral, Hermes, Gemma4, phi4-beast, and the legion and mx Qwen variants are the safe picks here. They treat the system prompt as authoritative and resolve conflicts by deferring to it.

The DeepSeek R1 variants score Unpredictable on conflict resolution — they often blend system and user intent in ways that aren't deterministic between runs. This makes them harder to use as reliable agents, despite their strong reasoning capability. It's consistent with R1's training objective, which emphasized reasoning performance over instruction-following stability.

The qwen3-coder (beast, 34B) scores Weak on system priority, which is surprising for a model this size. It was tuned for coding assistance, and it seems to prioritize user-turn context heavily. The legion and mx Qwen variants — including qwen3-5-legion (9B) — all score Respected. The 9B model is more obedient than the 34B model. This is likely a fine-tune difference: qwen3-coder is a coding-specialist tune whereas qwen3-5-legion runs the base instruct tune, which has stronger instruction-following alignment.
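One way to arrive at labels like FollowsSystem, BlendsBoth and Unpredictable is to plant a conflict and repeat it. The sketch below assumes a probe where the system prompt demands one marker token and the user turn demands another; the marker strings, the `FollowsUser` label, and the rule that any variation between runs counts as Unpredictable are my assumptions for illustration, not the harness's documented rubric.

```python
from collections import Counter

SYSTEM_MARKER = "SYS-OK"   # hypothetical token demanded by the system prompt
USER_MARKER = "USR-OK"     # hypothetical token demanded by the user turn


def run_outcome(response: str) -> str:
    """Label one response by which conflicting instruction it obeyed."""
    obeyed_system = SYSTEM_MARKER in response
    obeyed_user = USER_MARKER in response
    if obeyed_system and obeyed_user:
        return "both"
    if obeyed_system:
        return "system"
    if obeyed_user:
        return "user"
    return "neither"


def classify_conflict(responses: list[str]) -> str:
    """Collapse repeated runs of the same conflict into one label."""
    if not responses:
        raise ValueError("need at least one response")
    outcomes = Counter(run_outcome(r) for r in responses)
    if len(outcomes) > 1:
        return "Unpredictable"   # behaviour changed between runs
    (only,) = outcomes           # exactly one distinct outcome
    return {
        "system": "FollowsSystem",
        "both": "BlendsBoth",
        "user": "FollowsUser",   # label not used in this post
        "neither": "Unpredictable",
    }[only]
```

The repeat count matters: a single run cannot distinguish a model that always defers to the system prompt from one that happened to do so once.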

Negative instruction adherence

Can the model avoid doing something it's told not to do?

Model	Score
`gemma4-31b`	1.00
`devstral-small-beast`	0.75
`deepseek-r1-70b`	0.75
`phi4-mini-instruct-mx`	0.75
`phi4-mini-reasoning-mx`	0.75
`qwen3-5-legion`	0.75
`phi4-mini-reasoning-legion`	0.50
`deepseek-r1-32b`	0.50
`phi4-beast`	0.00
`qwen-coder-legion`	0.00
`qwen-mx`	0.00

Gemma4-31b is the only model scoring a clean 1.0 on negative instructions. qwen-coder-legion, qwen-mx, and phi4-beast score 0.0 here: these models will attempt tasks they've been told not to attempt, especially when user-turn context contradicts the prohibition.
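Every score in that table lands on a quarter step, which would be consistent with a handful of pass/fail cases averaged together. That is a guess about the probe, not something the profiles state. Under that assumption, a scorer could look like this, with `NegativeCase` and the example prompts invented for the sketch:

```python
from dataclasses import dataclass


@dataclass
class NegativeCase:
    forbidden: list[str]   # strings that must NOT appear in the answer
    response: str          # model output with thinking already stripped


def negative_adherence(cases: list[NegativeCase]) -> float:
    """Fraction of cases in which no forbidden string leaked through."""
    if not cases:
        raise ValueError("need at least one case")
    passed = 0
    for case in cases:
        text = case.response.lower()
        if not any(term.lower() in text for term in case.forbidden):
            passed += 1
    return passed / len(cases)


cases = [
    NegativeCase(["sorry"], "Here is the summary you asked for."),
    NegativeCase(["however"], "The plan is sound and can ship."),
    NegativeCase(["as an ai"], "As an AI, I think the answer is 4."),
    NegativeCase(["python"], "Use Rust for this one."),
]
print(negative_adherence(cases))  # 0.75
```

Substring matching is crude (it cannot tell a quoted mention from a violation), so a real rubric would likely need per-case checks.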

Layer I — Structural Delimiters: XML tag handling

Most agentic harnesses use XML-style tags to segment thought from action from output. A harness that can't trust the model to open and close tags correctly is flying blind.

Model	Open/close fidelity	Custom tags	Attributes
`devstral-small-beast`	1.0	yes	yes
`phi4-beast`	1.0	yes	yes
`hermes-beast`	1.0	yes	yes
`gemma4-31b`	1.0	yes	yes
`deepseek-r1-32b`	1.0	yes	yes
`deepseek-r1-70b`	1.0	yes	yes
`qwen3-coder`	1.0	yes	yes
`qwen3-5-legion`	1.0	yes	yes
`qwen-coder-legion`	1.0	yes	yes
`qwen-mx`	1.0	yes	yes
`phi4-mini-instruct-legion`	1.0	no	yes
`phi4-mini-instruct-mx`	1.0	no	yes
`phi4-mini-reasoning-legion`	0.60	yes	no
`phi4-mini-reasoning-mx`	0.60	yes	no

The Phi-4-mini instruct variants are notable: they close tags reliably but don't generate custom (non-standard) tags on command. The harness can still use them with a fixed vocabulary of known tags, but can't ask these models to invent new tag names at runtime.

The Phi-4-mini reasoning variants score 0.60 on fidelity — they occasionally fail to close a tag or emit malformed nesting. For a harness that uses streaming tag detection, this means a fallback parser is required.
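Here is roughly what "strict check plus fallback parser" could mean in practice. This is a sketch of the idea, not the harness's parser: `tags_balanced` and `extract_lenient` are hypothetical helpers, and the fallback rule (treat an unclosed tag as running to the end of the text) is one policy among several.

```python
import re
from typing import Optional

# Opening, closing, or self-closing tag with optional attributes.
TAG = re.compile(r"<(/?)([A-Za-z_][\w-]*)[^<>]*?(/?)>")


def tags_balanced(text: str) -> bool:
    """Strict check: every tag closes in the reverse order it was opened."""
    stack = []
    for match in TAG.finditer(text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue                      # <tag/> opens and closes at once
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False                  # stray or mismatched closer
    return not stack                      # leftover openers mean unclosed tags


def extract_lenient(text: str, tag: str) -> Optional[str]:
    """Fallback: body of the first <tag>, running to end of text if unclosed."""
    opener = re.search(rf"<{re.escape(tag)}(?:\s[^<>]*)?>", text)
    if opener is None:
        return None
    rest = text[opener.end():]
    closer = rest.find(f"</{tag}>")
    return rest if closer == -1 else rest[:closer]
```

A harness could try the strict path first and only fall back to `extract_lenient` when `tags_balanced` fails, logging the event so the fidelity score stays honest.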

Response Shape: can the model patch a document?

For agentic code and document editing, we want a model that can emit a targeted diff rather than a full document rewrite. This is the document_patch field.

Model	Patch format
`devstral-small-beast`	UnifiedDiff
`deepseek-r1-32b`	UnifiedDiff
`deepseek-r1-70b`	UnifiedDiff
`gemma4-31b`	UnifiedDiff
`qwen3-coder`	UnifiedDiff
`qwen-coder-legion`	UnifiedDiff
`qwen3-5-legion`	UnifiedDiff
`qwen-mx`	UnifiedDiff
`hermes-beast`	UnifiedDiff
`phi4-beast`	UnifiedDiff
`phi4-mini-instruct-legion`	CustomEdit
`phi4-mini-instruct-mx`	CustomEdit
`phi4-mini-reasoning-mx`	CustomEdit
`phi4-mini-reasoning-legion`	None

UnifiedDiff means the model produces standard --- a/file / +++ b/file unified diffs when asked. These can be applied with patch(1) directly.

CustomEdit is a model-specific format: the Phi-4-mini instruct variants, along with phi4-mini-reasoning-mx, produce a search-replace block syntax (similar to the Aider "SEARCH/REPLACE" convention) rather than standard diffs. These need a custom parser in the harness but are still actionable.
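A parser for that kind of block can be quite small. The sketch below assumes Aider-style markers (`<<<<<<< SEARCH`, `=======`, `>>>>>>> REPLACE`); the exact syntax the Phi-4-mini models emit isn't reproduced in this post, so the pattern would need adjusting to the real output.

```python
import re

# Assumed Aider-style markers. Blocks with an empty replacement section are
# not handled by this pattern.
EDIT_BLOCK = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)


def parse_edits(response: str) -> list[tuple[str, str]]:
    """Return (search, replace) pairs in the order they appear."""
    return EDIT_BLOCK.findall(response)


def apply_edits(document: str, edits: list[tuple[str, str]]) -> str:
    """Apply each edit once; refuse ambiguous or missing search text."""
    for search, replace in edits:
        count = document.count(search)
        if count != 1:
            raise ValueError(f"search text matched {count} times: {search!r}")
        document = document.replace(search, replace, 1)
    return document
```

Refusing an edit whose search text matches zero or several times is a deliberate choice: silently patching the wrong occurrence is worse than asking the model to retry.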

phi4-mini-reasoning-legion scoring None here is the clearest signal that this model is not suited for agentic editing tasks. It will rewrite documents in full or refuse to produce a structured patch at all.
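For completeness, a rough classifier that sorts a response into the three patch-format labels used in the table. The detection heuristics (a `---`/`+++`/`@@` header run for unified diffs, Aider-style markers for CustomEdit) are assumptions on my part, not the harness's rules.

```python
import re

# "--- old", "+++ new", then a hunk header, on three consecutive lines.
UNIFIED_HEADER = re.compile(r"^--- \S.*\n\+\+\+ \S.*\n@@ ", re.MULTILINE)

# Assumed Aider-style search/replace markers, each on its own line.
SEARCH_REPLACE = re.compile(
    r"^<<<<<<< SEARCH$.*?^>>>>>>> REPLACE$",
    re.MULTILINE | re.DOTALL,
)


def classify_patch(response: str) -> str:
    """Return "UnifiedDiff", "CustomEdit" or "None" for a model response."""
    if UNIFIED_HEADER.search(response):
        return "UnifiedDiff"
    if SEARCH_REPLACE.search(response):
        return "CustomEdit"
    return "None"
```

A full-document rewrite falls through to `"None"` here, which matches how the table treats a model that never emits a targeted patch.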

Layer III — Extrinsic Integration: tool calling

Honest answer: no model in the current fleet scored anything other than format = "None" on native tool calling. This is partly a probe methodology issue — the probe tests through loch-nessh's chat completions endpoint, which doesn't yet pass tool schemas downstream to the model. Native function-calling support for Devstral and Qwen3 exists in their GGUF metadata, but the integration path from loch-nessh → llama.cpp → model isn't wired up yet.

All models scored 1.0 on schema_adherence when JSON mode was engaged via soft-prompting (i.e., "respond only in JSON, using this schema"). This is useful for structured extraction tasks even without native tool calling.
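A soft-prompted JSON check can be scored without any tool-calling plumbing. The sketch below is one plausible reading of a `schema_adherence` score, as the fraction of expected keys present with the expected type; the real probe's definition may differ, and the flat key-to-type "schema" is a simplification I chose for brevity.

```python
import json


def schema_adherence(response: str, schema: dict[str, type]) -> float:
    """Fraction of expected keys present with the expected Python type.

    Returns 0.0 if the response is not a single JSON object. Note that
    Python treats bool as a subclass of int, so True passes an int check.
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict) or not schema:
        return 0.0
    ok = sum(
        1
        for key, expected in schema.items()
        if key in data and isinstance(data[key], expected)
    )
    return ok / len(schema)


expected = {"model": str, "vram_gb": int}
print(schema_adherence('{"model": "phi4-beast", "vram_gb": 10}', expected))  # 1.0
```

Because `json.loads` rejects any surrounding prose, a model that wraps its JSON in a sentence scores 0.0 under this sketch, which is the strict behaviour an extraction pipeline usually wants.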

The summary table

Aggregated across the dimensions that matter most for agentic use:

Model	Thinking	Sys. authority	Neg. instr.	Patch format	Best for
`devstral-small-beast`	TaggedPrompted	Respected	0.75	UnifiedDiff	Agentic coding, top pick
`gemma4-31b`	—	Respected	1.00	UnifiedDiff	Strict instruction following
`deepseek-r1-70b`	TaggedNative	Overrideable	0.75	UnifiedDiff	Long reasoning chains
`deepseek-r1-32b`	TaggedNative	Overrideable	0.50	UnifiedDiff	Faster reasoning
`phi4-beast`	TaggedPrompted	Respected	0.00	UnifiedDiff	Balanced; weak on negation
`hermes-beast`	TaggedPrompted	Respected	0.00	UnifiedDiff	Fast local chat
`qwen3-coder`	—	Weak	0.50	UnifiedDiff	Coding; needs strong user turns
`qwen3-5-legion`	—	Respected	0.75	UnifiedDiff	Legion: better compliance than the bigger model
`qwen-coder-legion`	—	Respected	0.00	UnifiedDiff	Legion burst load
`phi4-mini-instruct-legion`	—	Overrideable	0.50	CustomEdit	Legion lightweight tasks
`phi4-mini-reasoning-legion`	—	Weak	0.50	None	Avoid for agent editing
`qwen-mx`	—	Respected	0.00	UnifiedDiff	CPU-bound broker fallback

Takeaways

For agentic coding and document editing: devstral-small-beast is the clear winner. It scores Respected on system authority, FollowsSystem on conflict resolution, emits thinking on demand, and produces standard unified diffs. The Q8_0 quantization at 18 GB leaves plenty of VRAM headroom on the 96 GB beast node.

When you need the strictest instruction compliance: gemma4-31b is the only model scoring 1.0 on negative instructions. It's the model I'd use when the harness issues a hard constraint that must not be violated.

For long reasoning chains: The DeepSeek R1 variants (70B, 32B, and deepseek-r1-15b-legion) are the only models emitting thinking blocks natively. The cost, on the 70B and 32B builds listed in the authority table, is Unpredictable conflict resolution, so they should be used in harness configurations where the user turn is trusted, not in workflows where the system prompt is the single source of truth.

Size does not predict compliance. qwen3-5-legion (9B) scores better on system authority and negative instructions than qwen3-coder (34B). The 9B model runs the base instruct tune; the 34B model is a coding-specialist fine-tune that appears to have traded some instruction-following rigidity for code generation performance. If you're routing through legion and need reliable agent behaviour, qwen3-5-legion is the pick over qwen-coder-legion unless raw code generation quality is the primary requirement.

The Phi-4-mini reasoning variants are not what they advertise (at least not to this probe). They're trained as reasoning models, but the reasoning is not externally extractable via the <think> tag protocol that works on DeepSeek and Devstral. Their system authority is Weak and neither produces standard unified diffs: the mx build falls back to CustomEdit blocks and the legion build emits no structured patch at all. They might still be useful for direct Q&A tasks where thinking is implicit, but they're not ready for agentic pipelines as currently deployed.
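These takeaways amount to a filter over the profiles. As a rough sketch of how a broker could use them, the snippet below copies a subset of rows from the summary table and selects models by constraint. The `Profile` type and `candidates` function are hypothetical; loch-nessh does not necessarily route this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    model: str
    thinking: str        # "TaggedNative", "TaggedPrompted" or "None"
    sys_authority: str   # "Respected", "Weak" or "Overrideable"
    neg_instr: float
    patch: str           # "UnifiedDiff", "CustomEdit" or "None"


# Values copied from the summary table in this post (subset of rows).
FLEET = [
    Profile("devstral-small-beast", "TaggedPrompted", "Respected", 0.75, "UnifiedDiff"),
    Profile("gemma4-31b", "None", "Respected", 1.00, "UnifiedDiff"),
    Profile("deepseek-r1-70b", "TaggedNative", "Overrideable", 0.75, "UnifiedDiff"),
    Profile("qwen3-coder", "None", "Weak", 0.50, "UnifiedDiff"),
    Profile("qwen3-5-legion", "None", "Respected", 0.75, "UnifiedDiff"),
    Profile("phi4-mini-reasoning-legion", "None", "Weak", 0.50, "None"),
]


def candidates(fleet, *, need_thinking=False, need_authority=False,
               min_neg_instr=0.0, need_unified_diff=False):
    """Return model names that satisfy every requested constraint."""
    result = []
    for p in fleet:
        if need_thinking and p.thinking == "None":
            continue
        if need_authority and p.sys_authority != "Respected":
            continue
        if p.neg_instr < min_neg_instr:
            continue
        if need_unified_diff and p.patch != "UnifiedDiff":
            continue
        result.append(p.model)
    return result


print(candidates(FLEET, need_thinking=True, need_authority=True,
                 need_unified_diff=True))
```

Asking for thinking, system authority and unified diffs together leaves only devstral-small-beast from this subset, which is the same conclusion the first takeaway reaches by hand.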

What's still pending

The following models were added to the fleet after this probe run and profiles are not yet available:

phi4-mini-instruct-beast, phi4-mini-reasoning-beast — beast-node variants of the legion/mx minis
qwen3-5-beast — beast-node 9B; interesting to see if it matches legion's instruction-following scores
qwen3-6-27b — 128K context Qwen3 27B, first run pending
qwen3.5-122B-A10B — the big one; needs dedicated 77 GB VRAM window
devstral-2-small-mx — needs isolated P4 run (8 GB exclusive)

This post will be updated, or a follow-up will cover the new entries once those probes complete.