title: "Cognitive Depth: Mapping What 27 Local LLMs Can Actually Do"
slug: "cognitive-depth-capability-mapping-ipsa-probe"
published: 2026-05-12
updated: 2026-05-12
tags: [local-ai, llm, ipsa-probe, capability-mapping, domain-modeling, agent-orchestration, devstral, deepseek, qwen3, gemma4, phi4]
summary: "Running ipsa-probe's cognitive battery (domain decomposition, mutation stability, thinking budget, and DSL compilation) against 27 models across a three-node home cluster. The data behind which models earn which agent roles."
menu_title: "Cognitive Depth Probes"
series: "cortex-fleet"
series_part: 2
draft: false

Cognitive Depth: Mapping What 27 Local LLMs Can Actually Do

Part 1 covered the first five protocol layers: XML fidelity, thinking block emission, structured output, instruction adherence, and sampling parameters. That pass established protocol fitness — does this model behave correctly when asked to follow a format?

This post is about something harder: cognitive fitness — what can the model actually reason through?

I added four new probe dimensions to the ipsa-probe harness and ran the full fleet again. This time across 27 registered models on three cluster nodes: a 96 GB AMD ROCm box (ness-linux3), an 8 GB CPU/Nvidia broker (mx-legacy), and a small Nvidia CUDA burst node (ness-legion1). Not all probes finished — a few models crashed, several haven't run yet, and one 122B model took down the pod it was running on. But there's enough signal to draw conclusions.

What the cognitive probes measure

Domain Modeling

The core test. The model is given a natural language description of a domain and asked to produce a structured hierarchical decomposition — entities, relationships, and sub-components at increasing levels of complexity.

Three levels of task:

Level 1: a straightforward domain (e.g., a library system). Like every level, it is scored on four subscores (coverage, relationships, decomposition, economy), which are combined into a single weighted_score
Level 2: a medium-complexity domain with cross-cutting concerns. Tests whether the model can handle competing abstractions
Level 3: a rich domain with recursive structure and emergent relationships

The overall_score is a weighted average across levels attempted. max_level_75pct is the highest level where the model achieved ≥75% on all subscores.

# Example profile excerpt — domain modeling section
[domain_modeling]
overall_score = 0.9188973903656006   # qwen3-5-beast
max_level_75pct = 3

[[domain_modeling.level_scores]]
level = 1
coverage = 0.833      # did it find all the key entities?
relationships = 1.0   # did it correctly link them?
decomposition = 1.0   # did it break compound entities further?
economy = 1.0         # did it avoid redundancy and bloat?
weighted_score = 0.958
elapsed_ms = 287729
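
As a rough sketch of how the per-level numbers could fit together, the Python below assumes an equal-weight mean of the four subscores. That weighting is my assumption, not something the harness confirms, although it does reproduce the 0.958 in the excerpt above. The helper names level_score and max_level_75pct are hypothetical, and returning 0 when no level qualifies is also an assumption.

```python
from statistics import mean

SUBSCORES = ("coverage", "relationships", "decomposition", "economy")

def level_score(level: dict) -> float:
    """Equal-weight mean of the four subscores (assumed weighting)."""
    return mean(level[name] for name in SUBSCORES)

def max_level_75pct(levels: list[dict], threshold: float = 0.75) -> int:
    """Highest level whose subscores all reach the threshold, or 0 if none do."""
    passing = [lv["level"] for lv in levels
               if all(lv[name] >= threshold for name in SUBSCORES)]
    return max(passing, default=0)

# Level 1 values from the qwen3-5-beast excerpt.
level1 = {"level": 1, "coverage": 0.833, "relationships": 1.0,
          "decomposition": 1.0, "economy": 1.0}

print(round(level_score(level1), 3))  # 0.958
print(max_level_75pct([level1]))      # 1
```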

Domain Mutation

A model that can decompose a domain is useful. A model that can maintain and evolve that decomposition under successive updates is what agents actually need.

Three phases:

Phase 1 — given the existing model, add a new component (tests: does it integrate cleanly?)
Phase 2 — refactor the model (tests: does it maintain all original relationships?)
Phase 3 — extend the model with a cross-cutting concern (tests: can it reason about impact?)

The regression score measures whether the final model preserved the original structure (1.0 = fully preserved). High phase scores combined with a low regression score mean the model keeps rewriting from scratch rather than patching in place.

[domain_mutation]
phase1_score = 1.0             # devstral-small-beast: perfect add
phase2_score = 1.0             # perfect refactor
phase3_score = 0.9285714       # near-perfect extension
churn_rate = 0.0               # no spurious changes between phases
regression_score = 1.0         # original structure fully preserved
high_cohesion = false
elapsed_ms = 64567
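
One way to turn that reading rule into a label is sketched below. The 0.75 cutoff, the function name mutation_style, and the three label strings are all assumptions for illustration; the harness itself only emits the raw scores.

```python
def mutation_style(m: dict, good: float = 0.75) -> str:
    """Label a [domain_mutation] section using an assumed 0.75 cutoff."""
    phases = (m["phase1_score"], m["phase2_score"], m["phase3_score"])
    edits_well = min(phases) >= good          # every phase scored well
    preserves = m["regression_score"] >= good  # original structure kept
    if edits_well and preserves:
        return "patches in place"
    if edits_well:
        return "rewrites from scratch"
    return "needs review"

# Values from the devstral-small-beast excerpt.
devstral_small = {"phase1_score": 1.0, "phase2_score": 1.0,
                  "phase3_score": 0.9285714, "regression_score": 1.0}

print(mutation_style(devstral_small))  # patches in place
```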

Thinking Budget Matrix

This probe answers: "how much thinking budget does this model need to reliably solve reasoning tasks?" It sweeps four budget levels (0, 512, 2048, 8192 tokens) with 3 runs each, and records the pass rate and stddev.

The min_budget_75pct field is the minimum budget where the average score crosses 0.75. Models with passes_without_thinking = true have a thinking toggle but pass even without it.

This matters a lot for production: over-budgeting a model wastes VRAM and latency; under-budgeting a model that needs reasoning tokens causes silent degradation.

[budget_matrix]
min_budget_75pct = 512           # devstral-123b: sweet spot at 512t

[[budget_matrix.entries]]
budget_tokens = 0
avg_score = 0.50    # fails half the time without thinking
[[budget_matrix.entries]]
budget_tokens = 512
avg_score = 0.833   # jumps to 83% at just 512 tokens
[[budget_matrix.entries]]
budget_tokens = 2048
avg_score = 0.50    # dips back! over-thinking hurts this model
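
The min_budget_75pct lookup can be sketched in a few lines of Python. This assumes "crosses 0.75" means an average score at or above 0.75, and it returns None when no budget qualifies; both choices are mine. The second helper, best_budget, is a hypothetical addition that makes the over-thinking dip visible.

```python
from typing import Optional

def min_budget_75pct(entries: list[dict], threshold: float = 0.75) -> Optional[int]:
    """Smallest budget whose average score reaches the threshold, else None."""
    passing = [e["budget_tokens"] for e in entries if e["avg_score"] >= threshold]
    return min(passing, default=None)

def best_budget(entries: list[dict]) -> int:
    """Budget with the highest average score; ties go to the smaller budget."""
    best = max(entries, key=lambda e: (e["avg_score"], -e["budget_tokens"]))
    return best["budget_tokens"]

# Entries from the devstral-123b excerpt.
devstral_123b = [
    {"budget_tokens": 0, "avg_score": 0.50},
    {"budget_tokens": 512, "avg_score": 0.833},
    {"budget_tokens": 2048, "avg_score": 0.50},
]

print(min_budget_75pct(devstral_123b))  # 512
print(best_budget(devstral_123b))       # 512
```

Because the curve is not guaranteed to be monotonic, the minimum passing budget and the best budget are worth checking separately.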

DSL Compilation

This probe tests the model's ability to produce a valid Prolog-style HTN (Hierarchical Task Network) program from a natural language goal description — the same output format used by the atomus agent in ipsa-agent.

The score is a reasoning_level from 0 to 5:

0 — cannot produce valid DSL at all
1 — produces a syntactically valid file but misses most semantics
2 — correct basic structure; misses cycle guards and recursion safety
3 — correct structure; partial guard coverage
4 — correct structure + guards; minor recursion issues
5 — fully valid DSL: cycle-safe recursion, complete guard coverage, correct goal/method links

[dsl_compilation]
reasoning_level = 5          # full capability
reasoning_quotient = 1.0     # 100% of tasks solved correctly
cycle_safe_recursion = true  # no infinite loops in generated programs
guarded_transitions = true   # all state transitions have preconditions

Models at level ≤2 should not be used as atomus backends without human review of every output.
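
A minimal guard for that rule might look like the sketch below. The function name requires_human_review is hypothetical, and nothing here is part of atomus or ipsa-agent; it only encodes the level ≤2 cutoff stated above.

```python
def requires_human_review(reasoning_level: int) -> bool:
    """True when a model's DSL output should be reviewed before atomus uses it."""
    if not 0 <= reasoning_level <= 5:
        raise ValueError(f"reasoning_level must be 0..5, got {reasoning_level}")
    return reasoning_level <= 2

print(requires_human_review(2))  # True
print(requires_human_review(5))  # False
```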

The fleet and test setup

┌─────────────────────────────────────────────────────────────────────┐
│                     ipsa-probe pipeline                             │
│                                                                     │
│  probe-model binary                                                 │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐            │
│  │ Protocol     │   │ Cognitive    │   │ Domain       │            │
│  │ Probes       │──▶│ Probes       │──▶│ Modeling     │──▶ TOML    │
│  │ (Part 1)     │   │ (thinking,   │   │ & Mutation   │   profile  │
│  └──────────────┘   │ CoT, budget) │   │ Probes       │            │
│                     └──────────────┘   └──────────────┘            │
│  All traffic via loch-nessh (VRAM-aware broker)                     │
└─────────────────────────────────────────────────────────────────────┘

All requests flow through loch-nessh, which handles VRAM accounting, GPU locking, and claim lifecycle across the three nodes. The probe binary submits each task as a non-streaming claim, waits for the response, and scores it against the rubric.

Nodes:

Node	GPU	Total VRAM	Role
ness-linux3	AMD Ryzen AI-Max 395+ (ROCm)	96 GB	Primary inference
mx-legacy	Nvidia Tesla P4	8 GB	CPU broker, small models
ness-legion1	Nvidia CUDA	8 GB	Burst inference, small models

Results

ness-linux3 — Primary Inference (96 GB)

The big iron. Runs everything from 7B smalls to 123B behemoths.

Model	VRAM	Domain ↓	P1/P2/P3	Reg	Budget	DSL	Thinking	TTFT	TPS
qwen3-5-beast	10 GB	0.92	0.75/0.75/0.75	0.50	8192t	4	None	7040ms	9.2
qwen3-coder	44 GB	0.91	1.0/1.0/0.93	1.0	—	2	None	832ms	14.1
devstral-small-beast	44 GB	0.86	1.0/1.0/0.93	1.0	—	2	Prompted	643ms	2.0
deepseek-r1-32b	78 GB	0.86	0.5/0.5/0.5	1.0	8192t	5	Native	640ms	9.0
devstral-123b	90 GB	0.84	1.0/0.92/0.86	1.0	512t	5	Prompted	1323ms	0.7
qwen3-5-122b	90 GB	0.83	0.5/0.5/0.68	0.50	2048t	5	None	12148ms	—
gemma4-31b	68 GB	0.80	0.75/0.67/0.61	0.50	2048t	5	None	12460ms	1.6
phi4-beast	16 GB	0.78	0.75/0.67/0.68	0.50	0t	5	Prompted	552ms	14.1
mistral-large	90 GB	0.77	0.75/0.67/0.61	0.50	2048t	5	Prompted	1337ms	0.8
phi4-mini-reasoning-beast	7 GB	0.73	0.0/0.0/0.54	0.50	2048t	2	Native	492ms	46.4
phi4-mini-instruct-beast	7 GB	0.60	0.25/0.25/0.25	0.50	—	2	Prompted	477ms	16.5
hermes-beast	30 GB	0.41	0.5/0.5/0.5	1.0	—	2	Prompted	488ms	9.2
deepseek-r1-70b	73 GB	—	—	—	—	—	—	—	—
qwen3-6-beast	44 GB	—	—	—	—	—	—	—	—

— = probe not yet complete. qwen3-5-122b crashed mid-probe (pod OOM) on its first run; the re-run has since completed and its row above reflects it, except for TPS, which was not captured.

mx-legacy — CPU Broker (8 GB)

Small GGUF models running on CPU or the Tesla P4. Throughput is measured but constrained.

Model	VRAM	Domain	P1/P2/P3	Budget	DSL	TTFT	TPS
qwen3-5-mx	7 GB	0.85*	—/—/—	8192t	1	15467ms	—
phi4-mini-instruct-mx	5 GB	0.41*	—/—/—	—	2	466ms	9.8
devstral-2-small-mx	8 GB	—	—	—	—	—	—
deepseek-r1-15b-mx	7 GB	—	—	—	—	—	—
phi4-mini-reasoning-mx	5 GB	—	—	—	—	—	—
hermes-mx	—	—	—	—	—	—	—
phi4-mx	—	—	—	—	—	—	—
qwen-mx	—	—	—	—	—	—	—

* = partial probe: domain_modeling done, domain_mutation did not finish.

ness-legion1 — Burst Node (8 GB)

Model	VRAM	Domain	P1/P2/P3	Budget	DSL	TTFT	TPS
phi4-mini-instruct-legion	5 GB	0.47*	—/—/—	—	2	465ms	19.0
deepseek-r1-15b-legion	7 GB	—	—	—	—	—	—
qwen-coder-legion	6 GB	—	—	—	—	—	—
phi4-mini-reasoning-legion	5 GB	—	—	—	—	—	—
qwen3-5-legion	7 GB	—	—	—	—	—	—

Analysis

qwen3-coder: the hidden gem — top-tier quality at speed

qwen3-coder (44 GB, 262K context) is the biggest surprise in the full fleet. Its domain score of 0.907 is second only to qwen3-5-beast. Its mutation scores (1.0/1.0/0.929, regression 1.0) match devstral-small-beast exactly. TTFT is 832ms at 14.1 TPS. It emits no thinking blocks and reports passes_without_thinking = true. It also has no min_budget_75pct: the budget matrix shows 0.25 at every level, so the score never crosses the 0.75 threshold. That flat curve is unusual, because extra thinking tokens do not improve this model's reasoning quality at all.

The combination of near-perfect domain modeling, perfect mutation stability, and reasonable speed is unique in this fleet. devstral-small-beast has the same mutation scores but a weaker domain score (0.860 vs 0.907). The main weakness is DSL level 2 (same as devstral-small-beast), which makes it unsuitable for complex atomus HTN generation. In addition, system_priority = "Weak" and negative_instruction = 0.5 mean system prompt authority is loose.

Best use case: iterative domain modeling and agent refinement loops where throughput matters more than DSL capability.

phi4-mini-reasoning: the speed champion

phi4-mini-reasoning-beast (7 GB) posts 46.4 TPS, nearly 3× faster than phi4-mini-instruct-beast (16.5), the next fastest model on ness-linux3. Its domain score of 0.730 is respectable for a model this small. It emits thinking natively (TaggedNative). Mutation phases 1 and 2 score 0.0 (it cannot add to or refactor an existing model without losing the original), and phase 3 improves to 0.54. It needs a minimum budget of 2048 tokens.

This model's role is clear: anything that needs volume at minimum latency. Stream-classification, fast summarization, high-frequency signal processing where losing some quality is acceptable. Don't use it for anything requiring multi-phase consistency.

phi4-mini-instruct: the weakest complete profile

phi4-mini-instruct-beast (7 GB) has the weakest complete profile overall: domain 0.596, mutation uniformly 0.25, DSL level 2. Only hermes-beast scores lower on domain modeling (0.41). It is faster than most (16.5 TPS, 477ms TTFT), but there are better options in the same size class (phi4-mini-reasoning, hermes, even qwen3-5-beast). Reserve it for the absolute simplest formatting-only tasks.

The top of the hierarchy is not who I expected

qwen3-5-beast has the highest domain modeling score at 0.919, above devstral-small and devstral-123b, and it needs only 10 GB of VRAM. It is not a thinking model (emission = None), yet it delivers top-tier decomposition quality. The catch is a minimum of 8192 budget tokens (measured by the budget matrix probe, not the thinking sweep) and a slow TTFT of 7 seconds. For async agentic tasks where latency doesn't matter, this is a compelling choice.

Qwen's budget matrix shows a hard cliff: at budgets of 0, 512, and 2048 tokens the average score is 0.0, and at 8192 it jumps to 1.0 with zero stddev. This model is binary, not gradual. Give it the full budget or don't bother.
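
To tell a cliff like this apart from other budget curves, a small classifier is enough. The sketch below is illustrative only: the function name curve_shape and the four labels are my own, and "cliff" is defined narrowly as a curve that only ever takes the values 0.0 and 1.0 and never decreases.

```python
def curve_shape(scores_by_budget: dict[int, float]) -> str:
    """Classify a budget matrix as flat, non-monotonic, cliff, or gradual."""
    ordered = [scores_by_budget[b] for b in sorted(scores_by_budget)]
    distinct = set(ordered)
    if len(distinct) == 1:
        return "flat"
    if any(later < earlier for earlier, later in zip(ordered, ordered[1:])):
        return "non-monotonic"
    if distinct == {0.0, 1.0}:
        return "cliff"
    return "gradual"

# qwen3-5-beast: nothing until the full budget, then a perfect score.
print(curve_shape({0: 0.0, 512: 0.0, 2048: 0.0, 8192: 1.0}))  # cliff
# devstral-123b: peaks at 512 tokens and dips again at 2048.
print(curve_shape({0: 0.50, 512: 0.833, 2048: 0.50}))          # non-monotonic
```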

Devstral-small is still the mutation champion

Although qwen3-5-beast wins on domain score, devstral-small-beast wins on what matters more for long-running agents: mutation stability. Phases 1 and 2 are perfect (1.0/1.0), regression is perfect (1.0), and churn rate is 0.0. When you hand this model an existing domain model and ask it to evolve it over multiple turns, it does so without losing history.

The high_cohesion = false flag is interesting — it means the model doesn't enforce semantic grouping constraints. For most agent tasks, this doesn't matter.

The DSL level of 2 (vs. 5 for most larger models) is a real limitation if you're using it as an atomus backend for complex decomposition tasks. Use it for mutation work; use devstral-123b or deepseek-r1-32b for initial decomposition.

Qwen3-5-122B: size doesn't buy what you expect

qwen3-5-122b (90 GB, ness-linux3) finished its probe after an earlier crash. Domain modeling score: 0.827 — solid, but below the 10 GB qwen3-5-beast (0.919) that runs on the same node. The 122B model's decomposition score at Level 1 is only 0.33 (it over-collapsed the hierarchy), while its L3 score of 0.897 is excellent. It gets harder problems right and easier ones wrong — a sign of a model that thinks in complex abstractions by default.

Mutation scores are mediocre (0.5/0.5/0.68, regression = 0.5), with the same weak phase 1 and 2 scores as deepseek-r1-32b. It's not a good model for iterative refinement loops. Where it distinguishes itself is instruction compliance: negative_instruction = 1.0 and conflict_resolution = "FollowsSystem" put it alongside devstral-123b and gemma4-31b as the best in the fleet. It also has DSL level 5 and min_budget_75pct = 2048 (a hard cliff, same as gemma4).

The real problem is speed: TTFT is 12.1s and the profile records tokens_per_sec = 0.0, which most likely means the throughput measurement did not complete rather than a true zero. At 90 GB on a 96 GB node there is almost no headroom, and every token is slow. Use this for async, high-stakes single-shot tasks where answer quality trumps latency. Don't put it in a fast loop.

Devstral-123b: the balance point for heavy agentic work

devstral-123b is the only model that scores well on both domain modeling (0.841) and mutation (phase1 = 1.0, regression = 1.0) while also reaching DSL level 5. Its budget sweet spot is just 512 tokens, which makes it the most reasoning-efficient heavy model in the fleet. The budget-matrix dip at 2048 tokens (0.5, versus 0.83 at 512) is a genuine quirk worth remembering: don't over-think it.

At 90 GB VRAM and 0.7 TPS, it's not fast. But for structured planning, complex tool orchestration, and domain modeling tasks where quality matters more than latency, this is currently the recommended choice on ness-linux3.

DeepSeek R1-32B: best raw reasoning, worst mutation stability

deepseek-r1-32b combines a domain score of 0.856 with DSL level 5 (full Prolog capability) and native thinking emission (toggle = AlwaysOn). It's also fast at 640ms TTFT and 9.0 TPS. It needs 8192 thinking tokens to reach full potential, but at this speed that's manageable.

The problem: mutation phases are all 0.5. This model rewrites the world. It's great at initial decompositions but poor at evolving an existing model in place. Use it for the first pass; don't use it for iterative refinement loops.

Also notable: conflict_resolution = "Unpredictable" — this model sometimes ignores the system prompt when it conflicts with its training priors. In practice this means it needs clean, forceful system prompts without ambiguity.

Gemma4-31b: the quiet overperformer

gemma4-31b has no thinking emission at all (toggle = AlwaysOff) and yet scores DSL level 5 and a domain score of 0.798, and it reports passes_without_thinking = true on the cognitive sweep. Its budget matrix shows a hard requirement though: below 2048 budget tokens, the score drops to 0.0. It reasons exclusively through in-context CoT, and needs room to do it.

The 12.5s TTFT is painful. It comes from this model being served over the HTTP path rather than the direct llama.cpp path, which is a deployment detail, not a model limitation. Worth investigating if you want this one in a fast loop.

Phi4-beast: the speed/reasoning outlier

phi4-beast is the only complete model with min_budget_75pct = 0: it reasons reliably at zero thinking budget. It pairs DSL level 5 and a domain score of 0.781 with 552ms TTFT and 14.1 TPS. For tasks that need fast, light, structured reasoning without thinking overhead, this is the choice.

The weakness: negative_instruction = 0.25 — it often does what you told it not to do. For agent tasks where constraint compliance matters (e.g., "do not modify X"), phi4-beast needs explicit positive restatements, not negative instructions.

The mx-legacy surprise: Qwen3-5 punches above its weight

qwen3-5-mx (7 GB on the CPU broker) shows a domain modeling score of 0.849 — better than mistral-large, gemma4-31b, and phi4-beast on the main inference node. This is incomplete (mutation probe didn't finish), but the domain modeling quality suggests the quantized Qwen3.5 retains more reasoning capability than expected at low VRAM.

The TTFT of 15.5s is expected for CPU inference, and tokens_per_sec = 0.0 suggests the TPS measurement hit a timeout. Still, for async workloads where latency is irrelevant, small Qwen models on the broker are more capable than their VRAM budget suggests.

What capability scores mean in practice

Here's how I use these profiles when selecting models for agent roles:

Agent role	Key capability	Best candidates
Structured planner (initial HTN/DSL generation)	DSL level 5 + domain modeling	devstral-123b, deepseek-r1-32b
Model evolver (iterative agent reasoning)	Mutation phase 1–3 + regression	devstral-small-beast, devstral-123b
Fast responder (low-latency tool calls)	Low TTFT + passes_wo_thinking	phi4-beast, hermes-beast
Heavy reasoner (complex single-shot tasks)	High domain score + thinking budget	qwen3-5-beast (async), devstral-123b
Instruction follower (strict protocol compliance)	negative_instruction ≥ 0.75	devstral-123b, gemma4-31b
Budget-aware tasks	min_budget_75pct low	phi4-beast (0t), devstral-123b (512t)
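
The selection logic behind this table can be approximated as a filter plus a ranking. The sketch below hard-codes five rows from the ness-linux3 results; the pick helper, the Row type, and the particular filter and rank lambdas are illustrative assumptions rather than how ipsa-agent actually chooses models.

```python
from typing import Callable, Optional, TypedDict

class Row(TypedDict):
    name: str
    domain: float
    dsl: int
    min_budget: Optional[int]  # None = never reached 0.75 in the budget matrix

ROWS: list[Row] = [
    {"name": "qwen3-coder", "domain": 0.91, "dsl": 2, "min_budget": None},
    {"name": "devstral-small-beast", "domain": 0.86, "dsl": 2, "min_budget": None},
    {"name": "deepseek-r1-32b", "domain": 0.86, "dsl": 5, "min_budget": 8192},
    {"name": "devstral-123b", "domain": 0.84, "dsl": 5, "min_budget": 512},
    {"name": "phi4-beast", "domain": 0.78, "dsl": 5, "min_budget": 0},
]

def pick(rows: list[Row], keep: Callable[[Row], bool],
         rank: Callable[[Row], float], top: int = 2) -> list[str]:
    """Filter rows, sort ascending by rank, and return the top names."""
    return [r["name"] for r in sorted(filter(keep, rows), key=rank)[:top]]

# Structured planner: DSL level 5 required, best domain score first.
planners = pick(ROWS, lambda r: r["dsl"] == 5, lambda r: -r["domain"])
# Budget-aware tasks: lowest measured min_budget_75pct first.
frugal = pick(ROWS, lambda r: r["min_budget"] is not None,
              lambda r: r["min_budget"])

print(planners)  # ['deepseek-r1-32b', 'devstral-123b']
print(frugal)    # ['phi4-beast', 'devstral-123b']
```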

Profile format reference

Each model produces a TOML profile at projects/ipsa-agent/profiles/<name>.toml. The key sections:

# How the model handles structured markup
[xml_tags]
open_close_fidelity = 1.0    # Does it close all tags?
custom_tags = true           # Does it emit user-defined tag names?

# Protocol-level format support
[prompt_format]
system_priority = "Respected"  # Overrideable / Weak / Respected / Strict
conflict_resolution = "FollowsSystem"  # What wins when instructions conflict?

# Thinking block behavior
[thinking]
emission = "TaggedPrompted"  # None / TaggedNative / TaggedPrompted
toggle = "PromptControlled"  # AlwaysOff / AlwaysOn / PromptControlled

# The core cognitive tests
[domain_modeling]
overall_score = 0.841

[domain_mutation]
phase1_score = 1.0
regression_score = 1.0

# How much thinking budget does it need?
[budget_matrix]
min_budget_75pct = 512

# Can it write valid structured DSL programs?
[dsl_compilation]
reasoning_level = 5
reasoning_quotient = 1.0

# What cognitive class is it?
[taxonomy]
cognitive_class = "System2"   # System1 = pattern matching; System2 = deliberate reasoning

What's next

qwen3-5-122b: ✓ probe complete (2026-05-12 16:22). Results above.
Complete mx-legacy probes: devstral-2-small, deepseek-r1-15b, phi4-mini-reasoning, hermes, phi4, and qwen are all pending. The qwen3-5-mx partial result is promising enough to prioritize.
Complete ness-legion1 probes: deepseek-r1-15b-legion, qwen-coder-legion, phi4-mini-reasoning-legion, and qwen3-5-legion are all pending.
deepseek-r1-70b: got a stub only; a 73 GB model on a 96 GB node should work fine; scheduling next.
Syntactic compression libraries: planning to test which models support ton/caveman context compression transparently (model-side feature detection).
report-probe-summary.sh: new script that aggregates all profiles and calls devstral-small for AI analysis; generates a combined document with per-host tables and AI commentary.

The profiles are the ground truth for model selection in the Ipsa-Agent framework. Every agent configuration that references a model should look up the relevant capability — system_priority, min_budget_75pct, dsl_compilation.reasoning_level — and use it to set request parameters correctly, rather than using guessed defaults.
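
As a sketch of what that lookup could look like, the function below derives request settings from a parsed profile dict. The output keys (thinking_budget_tokens, atomus_needs_review, restate_system_rules) are hypothetical names, not real loch-nessh or ipsa-agent parameters, and treating "Overrideable" and "Weak" as a cue to restate system rules is my own assumption.

```python
def request_params(profile: dict) -> dict:
    """Derive request settings from measured capabilities, not guessed defaults."""
    budget = profile.get("budget_matrix", {}).get("min_budget_75pct")
    level = profile.get("dsl_compilation", {}).get("reasoning_level", 0)
    priority = profile.get("prompt_format", {}).get("system_priority", "Weak")
    return {
        # None means no budget was measured; the caller must decide explicitly.
        "thinking_budget_tokens": budget,
        # DSL level 2 or lower needs human review before atomus use.
        "atomus_needs_review": level <= 2,
        # Loose system prompt authority: repeat key rules (assumed mitigation).
        "restate_system_rules": priority in ("Overrideable", "Weak"),
    }

# Values taken from the profile format reference above.
reference = {
    "prompt_format": {"system_priority": "Respected"},
    "budget_matrix": {"min_budget_75pct": 512},
    "dsl_compilation": {"reasoning_level": 5},
}

print(request_params(reference))
```

Returning None for an unmeasured budget, instead of substituting a default, keeps the "no guessed defaults" rule visible at the call site.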

All probes run against models served through loch-nessh at http://ness-linux3:32100. Profile source: projects/ipsa-agent/profiles/. Probe harness: crates/ipsa-probe in ai-workbench.