
title: "The Cortex Fleet: Eight Open Models Under the ipsa-probe Microscope"
slug: "cortex-fleet-model-comparison-ipsa-probe"
published: 2026-04-26
tags: [local-ai, llm, benchmark, ipsa-probe, open-source-models, devstral, deepseek, phi4, qwen3, gemma4]
summary: "Every open model running on the home cluster put through five protocol layers — thinking, instruction adherence, structured output, response shape, and system prompt strength. The numbers behind which model goes where."
menu_title: "Fleet Model Comparison"
draft: false

The Cortex Fleet: Eight Open Models Under the ipsa-probe Microscope

I run a small private AI cluster — three nodes, a VRAM-aware broker, and whatever open models I can wrangle into GGUF format. After deploying a new wave of models (Phi-4-mini variants, Qwen3 series, Devstral Small Q8), I wanted a systematic answer to the obvious question: what can each of these things actually do?

So I built ipsa-probe: a Rust probe harness that fires a structured battery of tests at each model through the loch-nessh broker, and saves the results as versioned TOML profiles. These aren't vibes — they're reproducible signal.

This post is the first aggregated comparison. Sixteen profiles in, here's what the data says.


What ipsa-probe measures

The probe suite is organized around five protocol layers, loosely derived from how agentic harnesses actually communicate with models:

| Layer | What it tests |
| --- | --- |
| I. Structural Delimiters | XML tag fidelity, custom tags, streaming safety |
| II. Cognitive Scaffolding | Thinking block emission, CoT quality, reflection responsiveness |
| III. Extrinsic Integration | Tool calling format, JSON mode, constrained generation |
| IV. Behavioural Overlays | Instruction adherence, system prompt authority, persona stability |
| V. Operational Parameters | Temperature sensitivity, sampling param support |
| + Response Shape | Whether the model can emit full docs, unified diffs, or custom patches |

Each probe fires a prompt, parses the output, and records a float score (0.0–1.0) or an enum against a known rubric. The harness strips thinking blocks before scoring so reasoning tokens don't inflate the answers.
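To make the stripping step concrete, here is a minimal sketch of removing `<think>` spans before scoring. It is not ipsa-probe's actual code: the function name `strip_think_blocks` and the choice to discard everything after an unclosed `<think>` are assumptions made for illustration.

```rust
/// Remove every `<think>...</think>` span from a model response.
/// Assumption: an unclosed `<think>` means the model was cut off
/// mid-reasoning, so everything after it is discarded.
fn strip_think_blocks(response: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";

    let mut out = String::with_capacity(response.len());
    let mut rest = response;

    while let Some(start) = rest.find(OPEN) {
        // Keep the visible text that precedes the thinking block.
        out.push_str(&rest[..start]);
        let body_start = start + OPEN.len();
        match rest[body_start..].find(CLOSE) {
            // Skip past the closing tag and keep scanning.
            Some(end) => rest = &rest[body_start + end + CLOSE.len()..],
            // No closing tag: drop the remainder.
            None => {
                rest = "";
                break;
            }
        }
    }

    out.push_str(rest);
    out.trim().to_string()
}
```

Scoring then runs on the returned string, so a long scratchpad cannot accidentally satisfy (or violate) a rubric check that is meant for the final answer.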


The fleet

Models tested as of this post. All run locally via llama.cpp unless noted.

| Model | Node | VRAM | Quant |
| --- | --- | --- | --- |
| devstral-small-beast | ness-linux3 | 18 GB | Q8_0 |
| deepseek-r1-32b | ness-linux3 | 26 GB | Q4_K_M |
| deepseek-r1-70b | ness-linux3 | 52 GB | Q4_K_M |
| deepseek-r1-15b-legion | ness-legion1 | 12 GB | Q4_K_M |
| phi4-beast | ness-linux3 | 10 GB | Q8_0 |
| phi4-mini-instruct-legion | ness-legion1 | 5 GB | Q4_K_M |
| phi4-mini-reasoning-legion | ness-legion1 | 5 GB | Q4_K_M |
| phi4-mini-instruct-mx | mx-legacy | 5 GB | Q4_K_M |
| phi4-mini-reasoning-mx | mx-legacy | 5 GB | Q4_K_M |
| qwen3-coder | ness-linux3 | 34 GB | Q4_K_M |
| qwen-coder-legion | ness-legion1 | 18 GB | Q4_K_M |
| qwen3-5-legion | ness-legion1 | 8 GB | Q4_K_M |
| qwen-mx | mx-legacy (CPU+P4) | 8 GB | Q4_K_M |
| gemma4-31b | ness-linux3 | 62 GB | Q4_K_M |
| hermes-beast | ness-linux3 | 9 GB | Q4_K_M |

Pending: phi4-mini-instruct-beast, phi4-mini-reasoning-beast, qwen3-5-beast, qwen3-6-27b, qwen3.5-122b, devstral-2-small-mx. Profiles will be added once probes complete.


Layer II — Cognitive Scaffolding: who thinks out loud?

This is the most immediately useful dimension for agentic work. A model that emits thinking blocks lets you strip the scratchpad from user-visible output while still benefitting from the reasoning chain.

| Model | Thinking emission | Extractable | Temperature det. |
| --- | --- | --- | --- |
| devstral-small-beast | TaggedPrompted | yes | yes |
| phi4-beast | TaggedPrompted | yes | yes |
| hermes-beast | TaggedPrompted | yes | no |
| deepseek-r1-32b | TaggedNative | yes | yes |
| deepseek-r1-70b | TaggedNative | yes | no |
| deepseek-r1-15b-legion | TaggedNative | yes | yes |
| phi4-mini-instruct-legion | None | | yes |
| phi4-mini-reasoning-legion | None | | no |
| phi4-mini-instruct-mx | None | | yes |
| phi4-mini-reasoning-mx | None | | no |
| qwen3-coder | None | | no |
| qwen3-5-legion | None | | yes |
| qwen-coder-legion | None | | yes |
| qwen-mx | None | | yes |
| gemma4-31b | None | | yes |

TaggedNative means the model emits <think>…</think> blocks without any prompting — it's baked into the model's default behaviour. DeepSeek R1 variants are the only models in the fleet that do this. Every DeepSeek variant reliably surfaces its reasoning chain.

TaggedPrompted (Devstral, Phi-4, Hermes) means the model will emit thinking blocks when the system prompt explicitly asks for them, but won't by default. These are the best "agentic reasoning" candidates if you're willing to prime the system prompt.
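The three emission categories can be told apart by sending the same task twice, once with a plain system prompt and once with a system prompt that asks for `<think>` blocks. The sketch below shows that decision under stated assumptions: the enum mirrors the labels in the table, but `classify_emission` and `has_think_block` are hypothetical names, and the real probe may use a stricter check than a simple substring search.

```rust
/// Labels as they appear in the table above.
#[derive(Debug, PartialEq)]
enum ThinkingEmission {
    TaggedNative,
    TaggedPrompted,
    None,
}

/// True when the response contains an opening `<think>` tag
/// followed somewhere later by a closing `</think>` tag.
fn has_think_block(response: &str) -> bool {
    match (response.find("<think>"), response.find("</think>")) {
        (Some(open), Some(close)) => open < close,
        _ => false,
    }
}

/// `unprimed`: response to a system prompt that never mentions thinking.
/// `primed`: response to a system prompt that explicitly asks for it.
fn classify_emission(unprimed: &str, primed: &str) -> ThinkingEmission {
    if has_think_block(unprimed) {
        ThinkingEmission::TaggedNative
    } else if has_think_block(primed) {
        ThinkingEmission::TaggedPrompted
    } else {
        ThinkingEmission::None
    }
}
```

Checking the unprimed response first matters: a model that thinks natively will also think when primed, so the order of the two tests is what separates TaggedNative from TaggedPrompted.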

The Phi-4-mini reasoning variants (phi4-mini-reasoning-*) are an interesting case: they're Microsoft's reasoning-tuned Phi-4-mini, but probe results show emission = "None". The thinking appears to be compiled into the response structure rather than exposed as a separable token stream — the harness's <think> stripping finds nothing to strip. Whether this means the reasoning is less accessible or just differently encoded is worth more investigation.


Layer IV — Behavioural Overlays: who follows orders?

System prompt authority is critical for agentic use. A model where user turns can silently override the system prompt is a liability in multi-agent workflows.

System prompt authority

| Model | System priority | Conflict resolution |
| --- | --- | --- |
| devstral-small-beast | Respected | FollowsSystem |
| hermes-beast | Respected | FollowsSystem |
| gemma4-31b | Respected | FollowsSystem |
| qwen3-coder | Weak | BlendsBoth |
| qwen3-5-legion | Respected | FollowsSystem |
| qwen-coder-legion | Respected | FollowsSystem |
| qwen-mx | Respected | FollowsSystem |
| deepseek-r1-32b | Overrideable | Unpredictable |
| deepseek-r1-70b | Overrideable | Unpredictable |
| phi4-beast | Respected | FollowsSystem |
| phi4-mini-instruct-legion | Overrideable | Unpredictable |
| phi4-mini-reasoning-legion | Weak | BlendsBoth |
| phi4-mini-instruct-mx | Overrideable | Unpredictable |
| phi4-mini-reasoning-mx | Weak | BlendsBoth |

Devstral, Hermes, Gemma4, phi4-beast, and the legion and mx Qwen variants are the safe picks here. They treat the system prompt as authoritative and resolve conflicts by deferring to it.

The DeepSeek R1 variants score Unpredictable on conflict resolution — they often blend system and user intent in ways that aren't deterministic between runs. This makes them harder to use as reliable agents, despite their strong reasoning capability. It's consistent with R1's training objective, which emphasized reasoning performance over instruction-following stability.

The qwen3-coder (beast, 34B) scores Weak on system priority, which is surprising for a model this size. It was tuned for coding assistance, and it seems to prioritize user-turn context heavily. The legion and mx Qwen variants — including qwen3-5-legion (9B) — all score Respected. The 9B model is more obedient than the 34B model. This is likely a fine-tune difference: qwen3-coder is a coding-specialist tune whereas qwen3-5-legion runs the base instruct tune, which has stronger instruction-following alignment.
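One way to arrive at a conflict-resolution label is to repeat the same system-versus-user conflict several times and look at whether the outcomes agree. The sketch below is an illustration only, not the rubric ipsa-probe uses: `RunOutcome`, `classify_conflict`, and the `FollowsUser` label are hypothetical, and the rule "any disagreement between runs means Unpredictable" is an assumption.

```rust
/// Hypothetical per-run judgement of a single conflict prompt.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RunOutcome {
    FollowedSystem,
    FollowedUser,
    Blended,
}

/// FollowsSystem, BlendsBoth and Unpredictable appear in the table above;
/// FollowsUser is added here only so the match is exhaustive.
#[derive(Debug, PartialEq)]
enum ConflictResolution {
    FollowsSystem,
    FollowsUser,
    BlendsBoth,
    Unpredictable,
}

fn classify_conflict(runs: &[RunOutcome]) -> ConflictResolution {
    // No data is treated as unpredictable rather than as a pass.
    let Some(&first) = runs.first() else {
        return ConflictResolution::Unpredictable;
    };
    // Assumption: any disagreement between runs counts as unpredictable.
    if runs.iter().any(|&r| r != first) {
        return ConflictResolution::Unpredictable;
    }
    match first {
        RunOutcome::FollowedSystem => ConflictResolution::FollowsSystem,
        RunOutcome::FollowedUser => ConflictResolution::FollowsUser,
        RunOutcome::Blended => ConflictResolution::BlendsBoth,
    }
}
```

Under this rule a model that blends consistently lands in BlendsBoth, while one that follows the system prompt on some runs and the user on others lands in Unpredictable, which matches the distinction the text draws between qwen3-coder and the DeepSeek R1 variants.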

Negative instruction adherence

Can the model avoid doing something it's told not to do?

| Model | Score |
| --- | --- |
| gemma4-31b | 1.00 |
| devstral-small-beast | 0.75 |
| deepseek-r1-70b | 0.75 |
| phi4-mini-instruct-mx | 0.75 |
| phi4-mini-reasoning-mx | 0.75 |
| phi4-mini-reasoning-legion | 0.50 |
| deepseek-r1-32b | 0.50 |
| phi4-beast | 0.00 |
| qwen3-5-legion | 0.75 |
| qwen-coder-legion | 0.00 |
| qwen-mx | 0.00 |

gemma4-31b is the only model scoring a clean 1.0 on negative instructions. qwen-coder-legion, qwen-mx, and phi4-beast score 0.0 here: these models will attempt tasks they've been told not to attempt, especially when user-turn context contradicts the prohibition.
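A score like 0.75 is easiest to read as "three of four prohibitions held". The following sketch computes such a fraction under that reading. It is a simplification and not the probe's actual rubric: the `NegativeCase` struct, the `negative_adherence` function, and the case-insensitive substring check are all assumptions for illustration.

```rust
/// One hypothetical trial: the model was told never to produce
/// `forbidden`, and `response` is what it actually returned.
struct NegativeCase<'a> {
    forbidden: &'a str,
    response: &'a str,
}

/// Fraction of trials in which the forbidden phrase does not appear.
/// Returns 0.0 for an empty slice so that "no data" never reads as a pass.
fn negative_adherence(cases: &[NegativeCase<'_>]) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    let passed = cases
        .iter()
        .filter(|c| {
            // Case-insensitive substring check; a real rubric may be stricter.
            !c.response
                .to_lowercase()
                .contains(&c.forbidden.to_lowercase())
        })
        .count();
    passed as f64 / cases.len() as f64
}
```

A substring check only catches literal violations. A prohibition such as "do not attempt the task" needs a semantic judgement that this sketch does not try to model.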


Layer I — Structural Delimiters: XML tag handling

Most agentic harnesses use XML-style tags to segment thought from action from output. A harness that can't trust the model to open and close tags correctly is flying blind.

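| Model | Open/close fidelity | Custom tags | Attributes |
| --- | --- | --- | --- |
| devstral-small-beast | 1.0 | yes | yes |
| phi4-beast | 1.0 | yes | yes |
| hermes-beast | 1.0 | yes | yes |
| gemma4-31b | 1.0 | yes | yes |
| deepseek-r1-32b | 1.0 | yes | yes |
| deepseek-r1-70b | 1.0 | yes | yes |
| qwen3-coder | 1.0 | yes | yes |
| qwen3-5-legion | 1.0 | yes | yes |
| qwen-coder-legion | 1.0 | yes | yes |
| qwen-mx | 1.0 | yes | yes |
| phi4-mini-instruct-legion | 1.0 | no | yes |
| phi4-mini-instruct-mx | 1.0 | no | yes |
| phi4-mini-reasoning-legion | 0.60 | yes | no |
| phi4-mini-reasoning-mx | 0.60 | yes | no |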

The Phi-4-mini instruct variants are notable: they close tags reliably but don't generate custom (non-standard) tags on command. The harness can still use them with a fixed vocabulary of known tags, but can't ask these models to invent new tag names at runtime.

The Phi-4-mini reasoning variants score 0.60 on fidelity — they occasionally fail to close a tag or emit malformed nesting. For a harness that uses streaming tag detection, this means a fallback parser is required.
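A harness can decide when to fall back by first checking whether the tags in a response are balanced. The sketch below does that with a stack. It is an illustration rather than the probe's parser: `tags_balanced` is a hypothetical name, and the code assumes that `<` and `>` appear only as tag delimiters (no comparison operators in the text, no `>` inside attribute values).

```rust
/// Returns true when every opening tag has a matching closing tag
/// in properly nested order. Self-closing tags like `<br/>` are ignored.
fn tags_balanced(text: &str) -> bool {
    let mut stack: Vec<&str> = Vec::new();
    let mut rest = text;

    while let Some(lt) = rest.find('<') {
        let after = &rest[lt + 1..];
        // A `<` with no matching `>` is treated as a malformed tag.
        let Some(gt) = after.find('>') else {
            return false;
        };
        let inner = &after[..gt];
        rest = &after[gt + 1..];

        if let Some(name) = inner.strip_prefix('/') {
            // Closing tag: it must match the most recently opened tag.
            if stack.pop() != Some(name.trim()) {
                return false;
            }
        } else if !inner.ends_with('/') {
            // Opening tag: the name is everything before the first space,
            // so attributes are skipped.
            let name = inner.split_whitespace().next().unwrap_or("");
            stack.push(name);
        }
    }

    stack.is_empty()
}
```

When this returns false, the harness can switch to a more lenient strategy, for example extracting only the first well-formed block or re-prompting the model.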


Response Shape: can the model patch a document?

For agentic code and document editing, we want a model that can emit a targeted diff rather than a full document rewrite. This is the document_patch field.

| Model | Patch format |
| --- | --- |
| devstral-small-beast | UnifiedDiff |
| deepseek-r1-32b | UnifiedDiff |
| deepseek-r1-70b | UnifiedDiff |
| gemma4-31b | UnifiedDiff |
| qwen3-coder | UnifiedDiff |
| qwen-coder-legion | UnifiedDiff |
| qwen3-5-legion | UnifiedDiff |
| qwen-mx | UnifiedDiff |
| hermes-beast | UnifiedDiff |
| phi4-beast | UnifiedDiff |
| phi4-mini-instruct-legion | CustomEdit |
| phi4-mini-instruct-mx | CustomEdit |
| phi4-mini-reasoning-mx | CustomEdit |
| phi4-mini-reasoning-legion | None |

UnifiedDiff means the model produces standard --- a/file / +++ b/file unified diffs when asked. These can be applied with patch(1) directly.

CustomEdit is a model-specific format: the Phi-4-mini instruct variants (and phi4-mini-reasoning-mx) produce a search-replace block syntax (similar to the Aider "SEARCH/REPLACE" convention) rather than standard diffs. These need a custom parser in the harness but are still actionable.
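As a rough idea of what such a custom parser involves, here is a sketch that parses one search-replace block and applies it to a document. It assumes the Aider-style markers `<<<<<<< SEARCH`, `=======`, and `>>>>>>> REPLACE`; the exact markers the Phi-4-mini models emit are not recorded in this post, and the names `SearchReplace`, `parse_search_replace`, and `apply_edit` are hypothetical.

```rust
struct SearchReplace {
    search: String,
    replace: String,
}

#[derive(Clone, Copy, PartialEq)]
enum State {
    Before,
    InSearch,
    InReplace,
    Done,
}

/// Parse the first complete block; returns None if any marker is missing.
fn parse_search_replace(block: &str) -> Option<SearchReplace> {
    let mut search: Vec<&str> = Vec::new();
    let mut replace: Vec<&str> = Vec::new();
    let mut state = State::Before;

    for line in block.lines() {
        match (state, line.trim_end()) {
            (State::Before, "<<<<<<< SEARCH") => state = State::InSearch,
            (State::InSearch, "=======") => state = State::InReplace,
            (State::InReplace, ">>>>>>> REPLACE") => {
                state = State::Done;
                break;
            }
            (State::InSearch, _) => search.push(line),
            (State::InReplace, _) => replace.push(line),
            _ => {} // text before the block is ignored
        }
    }

    if state != State::Done {
        return None;
    }
    Some(SearchReplace {
        search: search.join("\n"),
        replace: replace.join("\n"),
    })
}

/// Apply the edit only when the search text occurs exactly once,
/// so an ambiguous or stale edit is rejected instead of guessed at.
fn apply_edit(doc: &str, edit: &SearchReplace) -> Option<String> {
    if edit.search.is_empty() || doc.matches(edit.search.as_str()).count() != 1 {
        return None;
    }
    Some(doc.replacen(edit.search.as_str(), &edit.replace, 1))
}
```

Requiring exactly one match is a deliberate choice: small models tend to quote search text loosely, and refusing an edit is cheaper than applying it to the wrong place.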

phi4-mini-reasoning-legion scoring None here is the clearest signal that this model is not suited for agentic editing tasks. It will rewrite documents in full or refuse to produce a structured patch at all.


Layer III — Extrinsic Integration: tool calling

Honest answer: no model in the current fleet scored anything other than format = "None" on native tool calling. This is partly a probe methodology issue — the probe tests through loch-nessh's chat completions endpoint, which doesn't yet pass tool schemas downstream to the model. Native function-calling support for Devstral and Qwen3 exists in their GGUF metadata, but the integration path from loch-nessh → llama.cpp → model isn't wired up yet.

All models scored 1.0 on schema_adherence when JSON mode was engaged via soft-prompting (i.e., "respond only in JSON, using this schema"). This is useful for structured extraction tasks even without native tool calling.


The summary table

Aggregated across the dimensions that matter most for agentic use:

| Model | Thinking | Sys. authority | Neg. instr. | Patch format | Best for |
| --- | --- | --- | --- | --- | --- |
| devstral-small-beast | TaggedPrompted | Respected | 0.75 | UnifiedDiff | Agentic coding, top pick |
| gemma4-31b | None | Respected | 1.00 | UnifiedDiff | Strict instruction following |
| deepseek-r1-70b | TaggedNative | Overrideable | 0.75 | UnifiedDiff | Long reasoning chains |
| deepseek-r1-32b | TaggedNative | Overrideable | 0.50 | UnifiedDiff | Faster reasoning |
| phi4-beast | TaggedPrompted | Respected | 0.00 | UnifiedDiff | Balanced; weak on negation |
| hermes-beast | TaggedPrompted | Respected | 0.00 | UnifiedDiff | Fast local chat |
| qwen3-coder | None | Weak | 0.50 | UnifiedDiff | Coding; needs strong user turns |
| qwen3-5-legion | None | Respected | 0.75 | UnifiedDiff | Legion: better compliance than the bigger model |
| qwen-coder-legion | None | Respected | 0.00 | UnifiedDiff | Legion burst load |
| phi4-mini-instruct-legion | None | Overrideable | 0.50 | CustomEdit | Legion lightweight tasks |
| phi4-mini-reasoning-legion | None | Weak | 0.50 | None | Avoid for agent editing |
| qwen-mx | None | Respected | 0.00 | UnifiedDiff | CPU-bound broker fallback |

Takeaways

For agentic coding and document editing: devstral-small-beast is the clear winner. It scores Respected on system authority, FollowsSystem on conflict resolution, emits thinking on demand, and produces standard unified diffs. The Q8_0 quantization at 18 GB leaves plenty of VRAM headroom on the 96 GB beast node.

When you need guaranteed instruction compliance: gemma4-31b is the only model scoring 1.0 on negative instructions. It's the model I'd use when the harness issues a hard constraint that must not be violated.

For long reasoning chains: The DeepSeek R1 variants (70B, 32B, and the 15B on legion) are the only models emitting thinking blocks natively. The cost, at least for the 70B and 32B profiles, is Overrideable system priority and Unpredictable conflict resolution. They should be used in harness configurations where the user turn is trusted, not in workflows where the system prompt is the single source of truth.

Size does not predict compliance. qwen3-5-legion (9B) scores better on system authority and negative instructions than qwen3-coder (34B). The 9B model runs the base instruct tune; the 34B model is a coding-specialist fine-tune that sacrificed some instruction-following rigidity for code generation performance. If you're routing through legion and need reliable agent behaviour, qwen3-5-legion is the pick over qwen-coder-legion unless raw code generation quality is the primary requirement.
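The routing logic these takeaways imply can be sketched as a filter over the profile data. The rows below are copied from the summary table; the `Profile` struct, the `FLEET` constant, and the `candidates` function are hypothetical names, and the real broker may weigh other factors (VRAM, node load) that are not modelled here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SystemPriority {
    Respected,
    Weak,
    Overrideable,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum PatchFormat {
    UnifiedDiff,
    CustomEdit,
    None,
}

struct Profile {
    name: &'static str,
    system_priority: SystemPriority,
    negative_adherence: f64,
    patch: PatchFormat,
}

// A subset of the summary table above.
const FLEET: &[Profile] = &[
    Profile { name: "devstral-small-beast", system_priority: SystemPriority::Respected, negative_adherence: 0.75, patch: PatchFormat::UnifiedDiff },
    Profile { name: "gemma4-31b", system_priority: SystemPriority::Respected, negative_adherence: 1.00, patch: PatchFormat::UnifiedDiff },
    Profile { name: "deepseek-r1-70b", system_priority: SystemPriority::Overrideable, negative_adherence: 0.75, patch: PatchFormat::UnifiedDiff },
    Profile { name: "qwen3-coder", system_priority: SystemPriority::Weak, negative_adherence: 0.50, patch: PatchFormat::UnifiedDiff },
    Profile { name: "qwen3-5-legion", system_priority: SystemPriority::Respected, negative_adherence: 0.75, patch: PatchFormat::UnifiedDiff },
    Profile { name: "phi4-mini-instruct-legion", system_priority: SystemPriority::Overrideable, negative_adherence: 0.50, patch: PatchFormat::CustomEdit },
    Profile { name: "phi4-mini-reasoning-legion", system_priority: SystemPriority::Weak, negative_adherence: 0.50, patch: PatchFormat::None },
];

/// Models that respect the system prompt, meet a minimum
/// negative-instruction score, and optionally emit unified diffs.
fn candidates(min_negative: f64, need_unified_diff: bool) -> Vec<&'static str> {
    FLEET
        .iter()
        .filter(|p| p.system_priority == SystemPriority::Respected)
        .filter(|p| p.negative_adherence >= min_negative)
        .filter(|p| !need_unified_diff || p.patch == PatchFormat::UnifiedDiff)
        .map(|p| p.name)
        .collect()
}
```

With a threshold of 0.75 and unified diffs required, the filter keeps devstral-small-beast, gemma4-31b, and qwen3-5-legion. Raising the threshold to 1.0 leaves only gemma4-31b, which is the "guaranteed compliance" case described above.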

The Phi-4-mini reasoning variants are not what they advertise (at least not to this probe). They're trained as reasoning models, but the reasoning is not externally extractable via the <think> tag protocol that works on DeepSeek and Devstral. Their system authority is Weak, and neither produces unified diffs: the mx variant manages CustomEdit, while the legion variant produces no structured patch at all. They might still be useful for direct Q&A tasks where thinking is implicit, but they're not ready for agentic pipelines as currently deployed.


What's still pending

The models listed as pending under "The fleet" were added to the fleet after this probe run, and their profiles are not yet available.

This post will be updated, or a follow-up will cover the new entries once those probes complete.