Part 1 covered the first five protocol layers: XML fidelity, thinking block emission, structured output, instruction adherence, and sampling parameters. That pass established protocol fitness — does this model behave correctly when asked to follow a format?
This post is about something harder: cognitive fitness — what can the model actually reason through?
I added four new probe dimensions to the ipsa-probe harness and ran the full fleet again, this time across 27 registered models on three cluster nodes: a 96 GB AMD ROCm box (ness-linux3), an 8 GB CPU/Nvidia broker (mx-legacy), and a small Nvidia CUDA burst node (ness-legion1). Not all probes finished: a few models crashed, several haven't run yet, and one 122B model took down the pod it was running on. But there's enough signal to draw conclusions.
The core test. The model is given a natural language description of a domain and asked to produce a structured hierarchical decomposition — entities, relationships, and sub-components at increasing levels of complexity.
Three levels of task:
The overall_score is a weighted average across levels attempted. max_level_75pct is the highest level where the model achieved ≥75% on all subscores.
```toml
# Example profile excerpt: domain modeling section
[domain_modeling]
overall_score = 0.9188973903656006  # qwen3-5-beast
max_level_75pct = 3

[[domain_modeling.level_scores]]
level = 1
coverage = 0.833       # did it find all the key entities?
relationships = 1.0    # did it correctly link them?
decomposition = 1.0    # did it break compound entities further?
economy = 1.0          # did it avoid redundancy and bloat?
weighted_score = 0.958
elapsed_ms = 287729
```
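The two summary fields can be sketched in a few lines of Python. This is a minimal sketch, not the harness's code: the equal subscore weights are an assumption (they happen to reproduce the 0.958 in the excerpt, but the post does not state the real weights), and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Assumption: equal weights. They reproduce the 0.958 in the excerpt,
# but the harness's real weights are not given in this post.
WEIGHTS = {"coverage": 0.25, "relationships": 0.25, "decomposition": 0.25, "economy": 0.25}
SUBSCORES = tuple(WEIGHTS)


def weighted_score(level: Dict[str, float]) -> float:
    """Weighted average of the four subscores for one level."""
    return sum(WEIGHTS[name] * level[name] for name in SUBSCORES)


def max_level_75pct(levels: List[Dict[str, float]], threshold: float = 0.75) -> Optional[int]:
    """Highest level number whose subscores are all >= threshold, or None."""
    passing = [
        int(lvl["level"])
        for lvl in levels
        if all(lvl[name] >= threshold for name in SUBSCORES)
    ]
    return max(passing) if passing else None


level1 = {"level": 1, "coverage": 0.833, "relationships": 1.0, "decomposition": 1.0, "economy": 1.0}
print(round(weighted_score(level1), 3))
print(max_level_75pct([level1]))
```

Returning `None` when no level clears the bar keeps "never reached 75%" distinct from "reached it at level 0".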
A model that can decompose a domain is useful. A model that can maintain and evolve that decomposition under successive updates is what agents actually need.
Three phases:
The regression score measures whether the final model preserved the original structure. High mutation scores combined with a low regression score mean the model keeps rewriting from scratch rather than patching in place.
```toml
[domain_mutation]
phase1_score = 1.0        # devstral-small-beast: perfect add
phase2_score = 1.0        # perfect refactor
phase3_score = 0.9285714  # near-perfect extension
churn_rate = 0.0          # no spurious changes between phases
regression_score = 1.0    # original structure fully preserved
high_cohesion = false
elapsed_ms = 64567
```
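To make the "patching in place vs. rewriting from scratch" distinction concrete, here is one way the two stability metrics could be computed over sets of entity and relationship names. The definitions below are assumptions for illustration; the post does not spell out how the harness computes `regression_score` or `churn_rate`, and the example domain is made up.

```python
from typing import Set


def regression_score(original: Set[str], final: Set[str]) -> float:
    """Fraction of the original elements still present in the final model."""
    if not original:
        return 1.0
    return len(original & final) / len(original)


def churn_rate(before: Set[str], after: Set[str], expected_changes: Set[str]) -> float:
    """Share of changed elements that the phase's instructions did not ask for."""
    changed = before ^ after                 # elements added or removed
    spurious = changed - expected_changes
    return len(spurious) / len(changed) if changed else 0.0


original = {"Order", "Customer", "Order->Customer"}
patched = original | {"Invoice", "Invoice->Order"}       # extends in place
rewritten = {"Purchase", "Buyer", "Purchase->Buyer"}     # starts over

print(regression_score(original, patched))    # 1.0
print(regression_score(original, rewritten))  # 0.0
print(churn_rate(original, patched, {"Invoice", "Invoice->Order"}))
```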
This probe answers: "how much thinking budget does this model need to reliably solve reasoning tasks?" It sweeps four budget levels (0, 512, 2048, 8192 tokens) with 3 runs each, and records the pass rate and stddev.
The min_budget_75pct field is the minimum budget where the average score crosses 0.75. Models with passes_without_thinking = true have a thinking toggle but pass even without it.
This matters a lot for production: over-budgeting a model wastes VRAM and latency; under-budgeting a model that needs reasoning tokens causes silent degradation.
```toml
[budget_matrix]
min_budget_75pct = 512  # devstral-123b: sweet spot at 512t

[[budget_matrix.entries]]
budget_tokens = 0
avg_score = 0.50        # fails half the time without thinking

[[budget_matrix.entries]]
budget_tokens = 512
avg_score = 0.833       # jumps to 83% at just 512 tokens

[[budget_matrix.entries]]
budget_tokens = 2048
avg_score = 0.50        # dips back! over-thinking hurts this model
```
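The derivation of `min_budget_75pct` from the sweep entries can be sketched as follows. The helper names are hypothetical, treating "crosses 0.75" as `>=` is an assumption, and `summarize` only illustrates how per-budget runs might be reduced to an average and a stddev.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple


def summarize(runs: List[float]) -> Tuple[float, float]:
    """Average score and population stddev over the repeated runs at one budget."""
    return mean(runs), pstdev(runs)


def min_budget_75pct(entries: List[Dict[str, float]], threshold: float = 0.75) -> Optional[int]:
    """Smallest budget whose average score reaches the threshold, or None."""
    passing = [int(e["budget_tokens"]) for e in entries if e["avg_score"] >= threshold]
    return min(passing) if passing else None


# Values from the devstral-123b excerpt (its 8192 entry is not shown in the post).
entries = [
    {"budget_tokens": 0, "avg_score": 0.50},
    {"budget_tokens": 512, "avg_score": 0.833},
    {"budget_tokens": 2048, "avg_score": 0.50},
]
print(min_budget_75pct(entries))
```

Note that taking the minimum passing budget says nothing about larger budgets: as the 2048 entry shows, a model can pass at 512 and fall below the threshold again with more tokens.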
This probe tests the model's ability to produce a valid Prolog-style HTN (Hierarchical Task Network) program from a natural language goal description — the same output format used by the atomus agent in ipsa-agent.
The score is a reasoning_level from 0 to 5:
```toml
[dsl_compilation]
reasoning_level = 5          # full capability
reasoning_quotient = 1.0     # 100% of tasks solved correctly
cycle_safe_recursion = true  # no infinite loops in generated programs
guarded_transitions = true   # all state transitions have preconditions
```
Models at level ≤2 should not be used as atomus backends without human review of every output.
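That review rule is easy to encode as a gate in front of an agent backend. The sketch below is an assumption about how one might do it: `DslProfile` and `needs_human_review` are hypothetical names, and forcing review when either safety flag is false goes slightly beyond what the rule itself says.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DslProfile:
    reasoning_level: int          # 0..5, as in [dsl_compilation]
    cycle_safe_recursion: bool
    guarded_transitions: bool


def needs_human_review(p: DslProfile) -> bool:
    """True when every generated program should be reviewed before execution."""
    if p.reasoning_level <= 2:
        return True
    # Assumption: missing safety properties also force review, even at higher levels.
    return not (p.cycle_safe_recursion and p.guarded_transitions)


print(needs_human_review(DslProfile(5, True, True)))
print(needs_human_review(DslProfile(2, True, True)))
```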
```
┌────────────────────────────────────────────────────────────────────┐
│  ipsa-probe pipeline                                               │
│                                                                    │
│  probe-model binary                                                │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐            │
│  │ Protocol     │   │ Cognitive    │   │ Domain       │            │
│  │ Probes       │──▶│ Probes       │──▶│ Modeling     │──▶ TOML    │
│  │ (Part 1)     │   │ (thinking,   │   │ & Mutation   │    profile │
│  └──────────────┘   │ CoT, budget) │   │ Probes       │            │
│                     └──────────────┘   └──────────────┘            │
│  All traffic via loch-nessh (VRAM-aware broker)                    │
└────────────────────────────────────────────────────────────────────┘
```
All requests flow through loch-nessh, which handles VRAM accounting, GPU locking, and claim lifecycle across the three nodes. The probe binary submits each task as a non-streaming claim, waits for the response, and scores it against the rubric.
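The submit, wait, score loop can be sketched without any broker at all. Everything here is hypothetical: `run_probe`, the task shape, and the injected `submit` and `score` callables stand in for the real claim API and rubric, which the post does not show.

```python
import time
from typing import Callable, Dict, List


def run_probe(
    tasks: List[Dict[str, str]],
    submit: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[Dict[str, float]]:
    """Submit each task, block until the full response arrives, then score it."""
    results = []
    for task in tasks:
        start = time.perf_counter()
        response = submit(task["prompt"])          # non-streaming: one blocking call
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.append({
            "score": score(response, task["expected"]),
            "elapsed_ms": elapsed_ms,
        })
    return results


# Stand-ins so the sketch runs without a broker.
def fake_submit(prompt: str) -> str:
    return prompt.upper()


def exact_match(response: str, expected: str) -> float:
    return 1.0 if response == expected else 0.0


print(run_probe([{"prompt": "ping", "expected": "PING"}], fake_submit, exact_match))
```

Passing the transport and the rubric in as functions keeps the loop testable and leaves VRAM accounting and GPU locking where the post says they live, in the broker.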
Nodes:
| Node | GPU | Total VRAM | Role |
|---|---|---|---|
| ness-linux3 | AMD Ryzen AI-Max 395+ (ROCm) | 96 GB | Primary inference |
| mx-legacy | Nvidia Tesla P4 | 8 GB | CPU broker, small models |
| ness-legion1 | Nvidia CUDA | 8 GB | Burst inference, small models |
The big iron. Runs everything from 7B smalls to 123B behemoths.
| Model | VRAM | Domain ↓ | P1/P2/P3 | Reg | Budget | DSL | Thinking | TTFT | TPS |
|---|---|---|---|---|---|---|---|---|---|
| qwen3-5-beast | 10 GB | 0.92 | 0.75/0.75/0.75 | 0.50 | 8192t | 4 | None | 7040ms | 9.2 |
| qwen3-coder | 44 GB | 0.91 | 1.0/1.0/0.93 | 1.0 | — | 2 | None | 832ms | 14.1 |
| devstral-small-beast | 44 GB | 0.86 | 1.0/1.0/0.93 | 1.0 | — | 2 | Prompted | 643ms | 2.0 |
| deepseek-r1-32b | 78 GB | 0.86 | 0.5/0.5/0.5 | 1.0 | 8192t | 5 | Native | 640ms | 9.0 |
| devstral-123b | 90 GB | 0.84 | 1.0/0.92/0.86 | 1.0 | 512t | 5 | Prompted | 1323ms | 0.7 |
| qwen3-5-122b | 90 GB | 0.83 | 0.5/0.5/0.68 | 0.50 | 2048t | 5 | None | 12148ms | — |
| gemma4-31b | 68 GB | 0.80 | 0.75/0.67/0.61 | 0.50 | 2048t | 5 | None | 12460ms | 1.6 |
| phi4-beast | 16 GB | 0.78 | 0.75/0.67/0.68 | 0.50 | 0t | 5 | Prompted | 552ms | 14.1 |
| mistral-large | 90 GB | 0.77 | 0.75/0.67/0.61 | 0.50 | 2048t | 5 | Prompted | 1337ms | 0.8 |
| phi4-mini-reasoning-beast | 7 GB | 0.73 | 0.0/0.0/0.54 | 0.50 | 2048t | 2 | Native | 492ms | 46.4 |
| phi4-mini-instruct-beast | 7 GB | 0.60 | 0.25/0.25/0.25 | 0.50 | — | 2 | Prompted | 477ms | 16.5 |
| hermes-beast | 30 GB | 0.41 | 0.5/0.5/0.5 | 1.0 | — | 2 | Prompted | 488ms | 9.2 |
| deepseek-r1-70b | 73 GB | — | — | — | — | — | — | — | — |
| qwen3-6-beast | 44 GB | — | — | — | — | — | — | — | — |
— = probe not yet complete. qwen3-5-122b crashed mid-probe (pod OOM) on its first run; the figures shown for it are from the re-run.
Small GGUF models running on CPU or a P4. Throughput is measured but constrained.
| Model | VRAM | Domain | P1/P2/P3 | Budget | DSL | TTFT | TPS |
|---|---|---|---|---|---|---|---|
| qwen3-5-mx | 7 GB | 0.85* | —/—/— | 8192t | 1 | 15467ms | — |
| phi4-mini-instruct-mx | 5 GB | 0.41* | —/—/— | — | 2 | 466ms | 9.8 |
| devstral-2-small-mx | 8 GB | — | — | — | — | — | — |
| deepseek-r1-15b-mx | 7 GB | — | — | — | — | — | — |
| phi4-mini-reasoning-mx | 5 GB | — | — | — | — | — | — |
| hermes-mx | — | — | — | — | — | — | — |
| phi4-mx | — | — | — | — | — | — | — |
| qwen-mx | — | — | — | — | — | — | — |
* = partial probe: domain_modeling done, domain_mutation did not finish.
| Model | VRAM | Domain | P1/P2/P3 | Budget | DSL | TTFT | TPS |
|---|---|---|---|---|---|---|---|
| phi4-mini-instruct-legion | 5 GB | 0.47* | —/—/— | — | 2 | 465ms | 19.0 |
| deepseek-r1-15b-legion | 7 GB | — | — | — | — | — | — |
| qwen-coder-legion | 6 GB | — | — | — | — | — | — |
| phi4-mini-reasoning-legion | 5 GB | — | — | — | — | — | — |
| qwen3-5-legion | 7 GB | — | — | — | — | — | — |
qwen3-coder (44 GB, 262K context) is the biggest surprise in the full fleet. Domain score 0.907, second only to qwen3-5-beast. Mutation scores: 1.0/1.0/0.929, regression 1.0, matching devstral-small-beast exactly. TTFT 832ms, 14.1 TPS. No thinking emission, passes_without_thinking = true. It also has no min_budget_75pct: the budget matrix shows 0.25 at every level, which is unusual. Giving this model more thinking tokens does not improve its reasoning quality at all.
The combination of near-perfect domain modeling AND perfect mutation stability AND reasonable speed is unique in this fleet. devstral-small-beast has the same mutation score but weaker domain (0.860 vs 0.907). The only weakness: DSL level 2 (same as devstral-small-beast) — not suitable for complex atomus HTN generation. system_priority = "Weak" and negative_instruction = 0.5 mean system prompt authority is loose.
Best use case: iterative domain modeling and agent refinement loops where throughput > DSL capability.
phi4-mini-reasoning-beast (7 GB) posts 46.4 TPS, nearly 3× the next fastest model on ness-linux3 (phi4-mini-instruct-beast at 16.5) and more than 3× phi4-beast (14.1). Domain score 0.730, which for a 7B reasoning model is respectable. Native thinking emission (TaggedNative). Mutation phases 1 and 2 are 0.0 (it cannot add to or refactor an existing model without losing the original); phase 3 improves to 0.54. It needs a minimum budget of 2048 tokens.
This model's role is clear: anything that needs volume at minimum latency. Stream-classification, fast summarization, high-frequency signal processing where losing some quality is acceptable. Don't use it for anything requiring multi-phase consistency.
phi4-mini-instruct-beast (7 GB) is among the weakest complete models: domain 0.596 (only hermes-beast, at 0.41, is lower), mutation uniformly 0.25, DSL level 2. Faster than most (16.5 TPS, 477ms TTFT) but there are better options at similar speed (phi4-mini-reasoning, hermes, even qwen3-5-beast). Reserve it for the absolute simplest formatting-only tasks.
qwen3-5-beast has the highest domain modeling score at 0.919 — above devstral-small and devstral-123b. It's also only 10 GB VRAM. The catch: it's not a thinking model (emission = None), yet it somehow delivers top-tier decomposition quality. The price you pay is 8192 thinking budget tokens minimum (the budget matrix probe, not the thinking sweep) and a slow TTFT of 7 seconds. For async agentic tasks where latency doesn't matter, this is a compelling choice.
What Qwen's budget matrix reveals is subtle: at 0/512/2048 token budgets, average score = 0.0. At 8192, it jumps to 1.0 with zero stddev. There's a hard cliff: this model is binary, not gradual. Give it the full budget or don't bother.
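The cliff, gradual, and dip shapes can be told apart mechanically from the (budget, average score) pairs. The classifier below is a rough sketch with hypothetical labels and a hypothetical rule set, not something the harness is known to do; the qwen3-5-beast curve is the one described in this post, and the devstral-123b curve is the partial one from the earlier budget matrix excerpt.

```python
from typing import List, Tuple


def classify_budget_curve(curve: List[Tuple[int, float]], threshold: float = 0.75) -> str:
    """Label a (budget_tokens, avg_score) curve as cliff, gradual, non-monotonic, or flat."""
    scores = [s for _, s in sorted(curve)]
    if any(later < earlier for earlier, later in zip(scores, scores[1:])):
        return "non-monotonic"       # more budget made things worse somewhere
    if scores[0] == scores[-1]:
        return "flat"                # budget makes no difference
    jumps = [b - a for a, b in zip(scores, scores[1:]) if b > a]
    if len(jumps) == 1 and scores[0] < threshold <= scores[-1]:
        return "cliff"               # a single step carries it across the threshold
    return "gradual"


qwen3_5_beast = [(0, 0.0), (512, 0.0), (2048, 0.0), (8192, 1.0)]
devstral_123b = [(0, 0.50), (512, 0.833), (2048, 0.50)]   # partial: entries shown in the excerpt

print(classify_budget_curve(qwen3_5_beast))
print(classify_budget_curve(devstral_123b))
```

A "cliff" label suggests setting the budget at or above the jump and never in between; a "non-monotonic" label suggests capping it.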
Despite qwen3-5-beast winning on domain score, devstral-small-beast wins on what matters more for long-running agents: mutation stability. Phases 1 and 2 are perfect (1.0/1.0), regression is perfect (1.0), and churn rate is 0.0. When you give this model a domain model and ask it to evolve it over multiple turns, it does so without losing history.
The high_cohesion = false flag is interesting — it means the model doesn't enforce semantic grouping constraints. For most agent tasks, this doesn't matter.
The DSL level of 2 (vs. 5 for most larger models) is a real limitation if you're using it as an atomus backend for complex decomposition tasks. Use it for mutation work; use devstral-123b or deepseek-r1-32b for initial decomposition.
qwen3-5-122b (90 GB, ness-linux3) finished its probe after an earlier crash. Domain modeling score: 0.827 — solid, but below the 10 GB qwen3-5-beast (0.919) that runs on the same node. The 122B model's decomposition score at Level 1 is only 0.33 (it over-collapsed the hierarchy), while its L3 score of 0.897 is excellent. It gets harder problems right and easier ones wrong — a sign of a model that thinks in complex abstractions by default.
Mutation scores are mediocre (0.5/0.5/0.68, regression=0.5) — same weakness as deepseek-r1-32b. It's not a good model for iterative refinement loops. Where it distinguishes itself: negative_instruction = 1.0 and conflict_resolution = "FollowsSystem" — the best instruction compliance in the fleet alongside devstral-123b and gemma4-31b. It also has DSL level 5 and min_budget_75pct = 2048 (hard cliff, same as gemma4).
The real problem: TTFT of 12.1s and tokens_per_sec = 0.0. At 90 GB with no quantisation headroom on a 96 GB node, every token is slow. Use this for async, high-stakes single-shot tasks where answer quality trumps latency. Don't put it in a fast loop.
devstral-123b is the only model that scores well on both domain modeling (0.841) and mutation (phase1=1.0, regression=1.0), while also having DSL level 5. Its budget sweet spot is just 512 tokens, making it the most reasoning-efficient heavy model in the fleet. The dip in its budget matrix at 2048 tokens (0.50) vs. 512 (0.83) is a genuine quirk worth remembering: don't over-think it.
At 90 GB VRAM and 0.7 TPS, it's not fast. But for structured planning, complex tool orchestration, and domain modeling tasks where quality matters more than latency, this is currently the recommended choice on ness-linux3.
deepseek-r1-32b combines a domain score of 0.856 with DSL level 5 (full Prolog capability) and native thinking emission (toggle = AlwaysOn). It's also fast at 640ms TTFT and 9.0 TPS. It needs 8192 thinking tokens to reach full potential, but at this speed that's manageable.
The problem: mutation phases are all 0.5. This model rewrites the world. It's great at initial decompositions but poor at evolving an existing model in place. Use it for the first pass; don't use it for iterative refinement loops.
Also notable: conflict_resolution = "Unpredictable" — this model sometimes ignores the system prompt when it conflicts with its training priors. In practice this means it needs clean, forceful system prompts without ambiguity.
gemma4-31b has no thinking emission at all (toggle = AlwaysOff) and yet scores DSL level 5, domain score 0.798, and reports passes_without_thinking = true on the cognitive sweep. Its budget matrix shows a hard requirement though: below 2048 budget tokens, the score drops to 0.0. It reasons exclusively through in-context CoT, and needs room to do it.
The 12.5s TTFT is painful. This is caused by the model architecture running over the HTTP path rather than the direct llama.cpp path — a deployment detail, not a model limitation. Worth investigating if you want this one in a fast loop.
phi4-beast is the only complete model with min_budget_75pct = 0: it alone reasons reliably at zero thinking budget. DSL level 5, domain score 0.781, 552ms TTFT, 14.1 TPS. For tasks that need fast, light, structured reasoning without thinking overhead, this is the choice.
The weakness: negative_instruction = 0.25 — it often does what you told it not to do. For agent tasks where constraint compliance matters (e.g., "do not modify X"), phi4-beast needs explicit positive restatements, not negative instructions.
qwen3-5-mx (7 GB on the CPU broker) shows a domain modeling score of 0.849 — better than mistral-large, gemma4-31b, and phi4-beast on the main inference node. This is incomplete (mutation probe didn't finish), but the domain modeling quality suggests the quantized Qwen3.5 retains more reasoning capability than expected at low VRAM.
The TTFT of 15.5s is expected for CPU inference, and tokens_per_sec = 0.0 suggests the TPS measurement hit a timeout. Still, for async workloads where latency is irrelevant, small Qwen models on the broker are more capable than their VRAM budget suggests.
Here's how I use these profiles when selecting models for agent roles:
| Agent role | Key capability | Best candidates |
|---|---|---|
| Structured planner (initial HTN/DSL generation) | DSL level 5 + domain modeling | devstral-123b, deepseek-r1-32b |
| Model evolver (iterative agent reasoning) | Mutation phase 1–3 + regression | devstral-small-beast, devstral-123b |
| Fast responder (low-latency tool calls) | Low TTFT + passes_wo_thinking | phi4-beast, hermes-beast |
| Heavy reasoner (complex single-shot tasks) | High domain score + thinking budget | qwen3-5-beast (async), devstral-123b |
| Instruction follower (strict protocol compliance) | negative_instruction ≥ 0.75 | devstral-123b, gemma4-31b |
| Budget-aware tasks | min_budget_75pct low | phi4-beast (0t), devstral-123b (512t) |
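One row of that table, expressed as a filter over profile data, might look like the sketch below. The rows are copied from the ness-linux3 results; the 0.84 domain cutoff and the function name are assumptions chosen to reproduce the "structured planner" candidates, not thresholds taken from the harness.

```python
from typing import Dict, List

# A few rows copied from the ness-linux3 results table.
FLEET: Dict[str, Dict[str, float]] = {
    "qwen3-5-beast":        {"domain": 0.92, "dsl": 4, "regression": 0.50, "ttft_ms": 7040},
    "devstral-small-beast": {"domain": 0.86, "dsl": 2, "regression": 1.0,  "ttft_ms": 643},
    "deepseek-r1-32b":      {"domain": 0.86, "dsl": 5, "regression": 1.0,  "ttft_ms": 640},
    "devstral-123b":        {"domain": 0.84, "dsl": 5, "regression": 1.0,  "ttft_ms": 1323},
    "gemma4-31b":           {"domain": 0.80, "dsl": 5, "regression": 0.50, "ttft_ms": 12460},
    "phi4-beast":           {"domain": 0.78, "dsl": 5, "regression": 0.50, "ttft_ms": 552},
}


def structured_planners(fleet: Dict[str, Dict[str, float]], min_domain: float = 0.84) -> List[str]:
    """Models with DSL level 5 and a domain score at or above min_domain, best first."""
    picks = [name for name, p in fleet.items() if p["dsl"] == 5 and p["domain"] >= min_domain]
    return sorted(picks, key=lambda name: fleet[name]["domain"], reverse=True)


print(structured_planners(FLEET))
```

The other roles follow the same pattern with different predicates, for example low `ttft_ms` for the fast responder.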
Each model produces a TOML profile at projects/ipsa-agent/profiles/<name>.toml. The key sections:
```toml
# How the model handles structured markup
[xml_tags]
open_close_fidelity = 1.0  # Does it close all tags?
custom_tags = true         # Does it emit user-defined tag names?

# Protocol-level format support
[prompt_format]
system_priority = "Respected"          # Overrideable / Weak / Respected / Strict
conflict_resolution = "FollowsSystem"  # What wins when instructions conflict?

# Thinking block behavior
[thinking]
emission = "TaggedPrompted"  # None / TaggedNative / TaggedPrompted
toggle = "PromptControlled"  # AlwaysOff / AlwaysOn / PromptControlled

# The core cognitive tests
[domain_modeling]
overall_score = 0.841

[domain_mutation]
phase1_score = 1.0
regression_score = 1.0

# How much thinking budget does it need?
[budget_matrix]
min_budget_75pct = 512

# Can it write valid structured DSL programs?
[dsl_compilation]
reasoning_level = 5
reasoning_quotient = 1.0

# What cognitive class is it?
[taxonomy]
cognitive_class = "System2"  # System1 = pattern matching; System2 = deliberate reasoning
```
- devstral-2-small, deepseek-r1-15b, hermes, phi4, qwen all pending. The qwen3-5-mx partial result is promising enough to prioritize.
- deepseek-r1-15b-legion, qwen-coder-legion, phi4-mini-reasoning all pending.
- deepseek-r1-70b: got a stub only; a 73 GB model on a 96 GB node should work fine; scheduling next.
- ton/caveman context compression transparently (model-side feature detection)
- report-probe-summary.sh: new script that aggregates all profiles and calls devstral-small for AI analysis; generates a combined document with per-host tables and AI commentary.

The profiles are the ground truth for model selection in the Ipsa-Agent framework. Every agent configuration that references a model should look up the relevant capability (system_priority, min_budget_75pct, dsl_compilation.reasoning_level) and use it to set request parameters correctly, rather than using guessed defaults.
All probes run against models serving through loch-nessh at http://ness-linux3:32100. Profile source: projects/ipsa-agent/profiles/. Probe harness: crates/ipsa-probe in ai-workbench.