
title: "Cognitive Depth: Mapping What 27 Local LLMs Can Actually Do"
slug: "cognitive-depth-capability-mapping-ipsa-probe"
published: 2026-05-12
updated: 2026-05-12
tags: [local-ai, llm, ipsa-probe, capability-mapping, domain-modeling, agent-orchestration, devstral, deepseek, qwen3, gemma4, phi4]
summary: "Running ipsa-probe's cognitive battery (domain decomposition, mutation stability, thinking budget, and DSL compilation) against 27 models across a three-node home cluster. The data behind which models earn which agent roles."
menu_title: "Cognitive Depth Probes"
series: "cortex-fleet"
series_part: 2
draft: false

Cognitive Depth: Mapping What 27 Local LLMs Can Actually Do

Part 1 covered the first five protocol layers: XML fidelity, thinking block emission, structured output, instruction adherence, and sampling parameters. That pass established protocol fitness — does this model behave correctly when asked to follow a format?

This post is about something harder: cognitive fitness — what can the model actually reason through?

I added four new probe dimensions to the ipsa-probe harness and ran the full fleet again, this time across 27 registered models on three cluster nodes: a 96 GB AMD ROCm box (ness-linux3), an 8 GB CPU/Nvidia broker (mx-legacy), and a small Nvidia CUDA burst node (ness-legion1). Not all probes finished: a few models crashed, several haven't run yet, and one 122B model took down the pod it was running on. But there's enough signal to draw conclusions.


What the cognitive probes measure

Domain Modeling

The core test. The model is given a natural language description of a domain and asked to produce a structured hierarchical decomposition — entities, relationships, and sub-components at increasing levels of complexity.

Three levels of task:

The overall_score is a weighted average across levels attempted. max_level_75pct is the highest level where the model achieved ≥75% on all subscores.

# Example profile excerpt — domain modeling section
[domain_modeling]
overall_score = 0.9188973903656006   # qwen3-5-beast
max_level_75pct = 3

[[domain_modeling.level_scores]]
level = 1
coverage = 0.833      # did it find all the key entities?
relationships = 1.0   # did it correctly link them?
decomposition = 1.0   # did it break compound entities further?
economy = 1.0         # did it avoid redundancy and bloat?
weighted_score = 0.958
elapsed_ms = 287729
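
The two summary fields can be derived from the per-level subscores in a few lines. This is a minimal sketch, not the harness's code: the function names are hypothetical, and the equal weighting of the four subscores is an assumption (it does reproduce the 0.958 in the excerpt above, but the real weights are not stated here). The level 2 row is made up for illustration.

```python
from statistics import mean

SUBSCORES = ("coverage", "relationships", "decomposition", "economy")

def weighted_score(level: dict) -> float:
    """Equal-weight mean of the four subscores (assumed weighting)."""
    return mean(level[name] for name in SUBSCORES)

def max_level_75pct(levels: list, threshold: float = 0.75) -> int:
    """Highest level whose subscores are all >= threshold; 0 if none qualifies."""
    passing = [lv["level"] for lv in levels
               if all(lv[name] >= threshold for name in SUBSCORES)]
    return max(passing, default=0)

levels = [
    # Level 1 values come from the excerpt above.
    {"level": 1, "coverage": 0.833, "relationships": 1.0,
     "decomposition": 1.0, "economy": 1.0},
    # Level 2 values are invented: coverage falls below the 0.75 bar.
    {"level": 2, "coverage": 0.70, "relationships": 0.9,
     "decomposition": 0.8, "economy": 1.0},
]

print(round(weighted_score(levels[0]), 3))  # 0.958
print(max_level_75pct(levels))              # 1
```

Note that a level counts only when every subscore clears the bar, so one weak subscore (coverage at level 2 here) is enough to cap max_level_75pct.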

Domain Mutation

A model that can decompose a domain is useful. A model that can maintain and evolve that decomposition under successive updates is what agents actually need.

Three phases: an add (phase 1), a refactor (phase 2), and an extension (phase 3) of the existing model.

The regression score measures whether the final model preserved the original structure. High phase scores combined with a low regression score mean the model keeps rewriting from scratch rather than patching in place.

[domain_mutation]
phase1_score = 1.0             # devstral-small-beast: perfect add
phase2_score = 1.0             # perfect refactor
phase3_score = 0.9285714       # near-perfect extension
churn_rate = 0.0               # no spurious changes between phases
regression_score = 1.0         # original structure fully preserved
high_cohesion = false
elapsed_ms = 64567
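
To make the phase-versus-regression reading concrete, here is a small sketch that labels a [domain_mutation] section. The function name, the labels, and the 0.75 cut-off are all assumptions for illustration; the harness does not necessarily classify profiles this way.

```python
def mutation_style(m: dict, good: float = 0.75) -> str:
    """Label a [domain_mutation] section. The 0.75 cut-off is an assumed threshold."""
    phases = [m["phase1_score"], m["phase2_score"], m["phase3_score"]]
    phases_ok = min(phases) >= good
    preserved = m["regression_score"] >= good
    if phases_ok and preserved:
        return "patches in place"
    if phases_ok:
        # Every update succeeded, but the original structure did not survive.
        return "rewrites from scratch"
    return "unreliable under mutation"

# Values from the devstral-small-beast excerpt above.
devstral_small = {"phase1_score": 1.0, "phase2_score": 1.0,
                  "phase3_score": 0.9285714, "regression_score": 1.0}
# Invented profile: good phase scores, poor regression score.
rewriter = {"phase1_score": 0.9, "phase2_score": 0.9,
            "phase3_score": 0.8, "regression_score": 0.3}

print(mutation_style(devstral_small))  # patches in place
print(mutation_style(rewriter))        # rewrites from scratch
```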

Thinking Budget Matrix

This probe answers: "how much thinking budget does this model need to reliably solve reasoning tasks?" It sweeps four budget levels (0, 512, 2048, 8192 tokens) with 3 runs each, and records the pass rate and stddev.

The min_budget_75pct field is the minimum budget where the average score crosses 0.75. Models with passes_without_thinking = true have a thinking toggle but pass even without it.

This matters a lot for production: over-budgeting a model costs VRAM and adds latency; under-budgeting a model that needs reasoning tokens causes silent degradation.

[budget_matrix]
min_budget_75pct = 512           # devstral-123b: sweet spot at 512t

[[budget_matrix.entries]]
budget_tokens = 0
avg_score = 0.50    # fails half the time without thinking
[[budget_matrix.entries]]
budget_tokens = 512
avg_score = 0.833   # jumps to 83% at just 512 tokens
[[budget_matrix.entries]]
budget_tokens = 2048
avg_score = 0.50    # dips back! over-thinking hurts this model
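
A sketch of how the matrix entries and min_budget_75pct could be computed from raw runs. The per-run scores below are invented so that the averages match the excerpt (0.50, 0.833, 0.50); the function names are hypothetical, and using the population standard deviation is an assumption, since the profile does not say which variant the harness records.

```python
from statistics import mean, pstdev

def summarize(runs_by_budget: dict) -> list:
    """One entry per budget level: average score and population stddev."""
    return [
        {"budget_tokens": budget, "avg_score": mean(scores), "stddev": pstdev(scores)}
        for budget, scores in sorted(runs_by_budget.items())
    ]

def min_budget_75pct(entries: list, threshold: float = 0.75):
    """Smallest budget whose average score reaches the threshold, else None."""
    for entry in sorted(entries, key=lambda e: e["budget_tokens"]):
        if entry["avg_score"] >= threshold:
            return entry["budget_tokens"]
    return None

# Three runs per budget level; the individual scores are illustrative only.
runs = {
    0: [0.5, 0.5, 0.5],
    512: [1.0, 1.0, 0.5],
    2048: [0.5, 0.5, 0.5],
}

entries = summarize(runs)
print(round(entries[1]["avg_score"], 3))  # 0.833
print(min_budget_75pct(entries))          # 512
```

Returning the smallest passing budget rather than the best-scoring one is what lets a non-monotonic model like this one still report 512 even though 2048 falls back below the bar.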

DSL Compilation

This probe tests the model's ability to produce a valid Prolog-style HTN (Hierarchical Task Network) program from a natural language goal description — the same output format used by the atomus agent in ipsa-agent.

The score is a reasoning_level from 0 to 5:

[dsl_compilation]
reasoning_level = 5          # full capability
reasoning_quotient = 1.0     # 100% of tasks solved correctly
cycle_safe_recursion = true  # no infinite loops in generated programs
guarded_transitions = true   # all state transitions have preconditions

Models at level ≤2 should not be used as atomus backends without human review of every output.


The fleet and test setup

┌─────────────────────────────────────────────────────────────────────┐
│                     ipsa-probe pipeline                             │
│                                                                     │
│  probe-model binary                                                 │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐            │
│  │ Protocol     │   │ Cognitive    │   │ Domain       │            │
│  │ Probes       │──▶│ Probes       │──▶│ Modeling     │──▶ TOML    │
│  │ (Part 1)     │   │ (thinking,   │   │ & Mutation   │   profile  │
│  └──────────────┘   │ CoT, budget) │   │ Probes       │            │
│                     └──────────────┘   └──────────────┘            │
│  All traffic via loch-nessh (VRAM-aware broker)                     │
└─────────────────────────────────────────────────────────────────────┘

All requests flow through loch-nessh, which handles VRAM accounting, GPU locking, and claim lifecycle across the three nodes. The probe binary submits each task as a non-streaming claim, waits for the response, and scores it against the rubric.

Nodes:

| Node | GPU | Total VRAM | Role |
|---|---|---|---|
| ness-linux3 | AMD Ryzen AI-Max 395+ (ROCm) | 96 GB | Primary inference |
| mx-legacy | Nvidia Tesla P4 | 8 GB | CPU broker, small models |
| ness-legion1 | Nvidia CUDA | 8 GB | Burst inference, small models |

Results

ness-linux3 — Primary Inference (96 GB)

The big iron. Runs everything from 7B smalls to 123B behemoths.

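| Model | VRAM | Domain ↓ | P1/P2/P3 | Reg | Budget | DSL | Thinking | TTFT | TPS |
|---|---|---|---|---|---|---|---|---|---|
| qwen3-5-beast | 10 GB | 0.92 | 0.75/0.75/0.75 | 0.50 | 8192t | 4 | None | 7040ms | 9.2 |
| qwen3-coder | 44 GB | 0.91 | 1.0/1.0/0.93 | 1.0 | | 2 | None | 832ms | 14.1 |
| devstral-small-beast | 44 GB | 0.86 | 1.0/1.0/0.93 | 1.0 | | 2 | Prompted | 643ms | 2.0 |
| deepseek-r1-32b | 78 GB | 0.86 | 0.5/0.5/0.5 | 1.0 | 8192t | 5 | Native | 640ms | 9.0 |
| devstral-123b | 90 GB | 0.84 | 1.0/0.92/0.86 | 1.0 | 512t | 5 | Prompted | 1323ms | 0.7 |
| qwen3-5-122b | 90 GB | 0.83 | 0.5/0.5/0.68 | 0.50 | 2048t | 5 | None | 12148ms | |
| gemma4-31b | 68 GB | 0.80 | 0.75/0.67/0.61 | 0.50 | 2048t | 5 | None | 12460ms | 1.6 |
| phi4-beast | 16 GB | 0.78 | 0.75/0.67/0.68 | 0.50 | 0t | 5 | Prompted | 552ms | 14.1 |
| mistral-large | 90 GB | 0.77 | 0.75/0.67/0.61 | 0.50 | 2048t | 5 | Prompted | 1337ms | 0.8 |
| phi4-mini-reasoning-beast | 7 GB | 0.73 | 0.0/0.0/0.54 | 0.50 | 2048t | 2 | Native | 492ms | 46.4 |
| phi4-mini-instruct-beast | 7 GB | 0.60 | 0.25/0.25/0.25 | 0.50 | | 2 | Prompted | 477ms | 16.5 |
| hermes-beast | 30 GB | 0.41 | 0.5/0.5/0.5 | 1.0 | | 2 | Prompted | 488ms | 9.2 |
| deepseek-r1-70b | 73 GB | | | | | | | | |
| qwen3-6-beast | 44 GB | | | | | | | | |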

Blank cells = probe not yet complete, or no value in the profile. qwen3-5-122b crashed mid-probe (pod OOM); its row comes from the completed re-run.

mx-legacy — CPU Broker (8 GB)

Small GGUF models running on CPU or a Tesla P4. Throughput is measured but constrained.

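| Model | VRAM | Domain | P1/P2/P3 | Budget | DSL | TTFT | TPS |
|---|---|---|---|---|---|---|---|
| qwen3-5-mx | 7 GB | 0.85* | -/-/- | 8192t | 1 | 15467ms | |
| phi4-mini-instruct-mx | 5 GB | 0.41* | -/-/- | | 2 | 466ms | 9.8 |
| devstral-2-small-mx | 8 GB | | | | | | |
| deepseek-r1-15b-mx | 7 GB | | | | | | |
| phi4-mini-reasoning-mx | 5 GB | | | | | | |
| hermes-mx | | | | | | | |
| phi4-mx | | | | | | | |
| qwen-mx | | | | | | | |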

* = partial probe: domain_modeling done, domain_mutation did not finish.

ness-legion1 — Burst Node (8 GB)

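| Model | VRAM | Domain | P1/P2/P3 | Budget | DSL | TTFT | TPS |
|---|---|---|---|---|---|---|---|
| phi4-mini-instruct-legion | 5 GB | 0.47* | -/-/- | | 2 | 465ms | 19.0 |
| deepseek-r1-15b-legion | 7 GB | | | | | | |
| qwen-coder-legion | 6 GB | | | | | | |
| phi4-mini-reasoning-legion | 5 GB | | | | | | |
| qwen3-5-legion | 7 GB | | | | | | |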

Analysis

qwen3-coder: the hidden gem — top-tier quality at speed

qwen3-coder (44 GB, 262K context) is the biggest surprise in the full fleet. Domain score 0.907, second only to qwen3-5-beast. Mutation scores: 1.0/1.0/0.929, regression 1.0, matching devstral-small-beast exactly. TTFT 832ms, 14.1 TPS. No thinking emission, passes_without_thinking = true. It also has no min_budget_75pct: the budget matrix shows 0.25 at every level, which is unusual. Extra thinking tokens do not improve this model's reasoning score at all.

The combination of near-perfect domain modeling, perfect mutation stability, and reasonable speed is unique in this fleet. devstral-small-beast has the same mutation scores but a weaker domain score (0.860 vs 0.907). The main weakness is DSL level 2 (same as devstral-small-beast), which makes it unsuitable for complex atomus HTN generation. In addition, system_priority = "Weak" and negative_instruction = 0.5 mean system prompt authority is loose.

Best use case: iterative domain modeling and agent refinement loops where throughput matters more than DSL capability.

phi4-mini-reasoning: the speed champion

phi4-mini-reasoning-beast (7 GB) posts 46.4 TPS, nearly 3× the throughput of phi4-mini-instruct-beast (16.5), the next fastest on this node. Domain score 0.730, which for a 7B reasoning model is respectable. Native thinking emission (TaggedNative). Mutation phases 1 and 2 are 0.0 (it cannot add to or refactor an existing model without losing the original); phase 3 improves to 0.54. Needs 2048 minimum budget tokens.

This model's role is clear: anything that needs volume at minimum latency. Stream-classification, fast summarization, high-frequency signal processing where losing some quality is acceptable. Don't use it for anything requiring multi-phase consistency.

phi4-mini-instruct: the weakest complete profile

phi4-mini-instruct-beast (7 GB) is the weakest complete model: domain 0.596, mutation uniformly 0.25, DSL level 2. Faster than most (16.5 TPS, 477ms TTFT) but there are better options at similar speed (phi4-mini-reasoning, hermes, even qwen3-5-beast). Reserve it for the absolute simplest formatting-only tasks.

The top of the hierarchy is not who I expected

qwen3-5-beast has the highest domain modeling score at 0.919, above devstral-small and devstral-123b. It's also only 10 GB VRAM. The surprise: it's not a thinking model (emission = None), yet it delivers top-tier decomposition quality. The price you pay is a minimum budget of 8192 tokens (from the budget matrix probe, not the thinking sweep) and a slow TTFT of 7 seconds. For async agentic tasks where latency doesn't matter, this is a compelling choice.

Qwen's budget matrix shows a hard cliff: at 0/512/2048 token budgets the average score is 0.0; at 8192 it jumps to 1.0 with zero stddev. This model is binary, not gradual. Give it the full budget or don't bother.
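
A cliff like this one can be told apart from a gradual or non-monotonic curve mechanically. The sketch below is one way to do it; the function name and the four labels are hypothetical, and the 0.75 threshold simply mirrors min_budget_75pct. The devstral-123b pairs are the three entries shown in its budget matrix excerpt earlier.

```python
def curve_shape(entries: list, threshold: float = 0.75) -> str:
    """Classify (budget_tokens, avg_score) pairs by how the score reaches the bar."""
    entries = sorted(entries)
    scores = [score for _, score in entries]
    passing = [i for i, score in enumerate(scores) if score >= threshold]
    if not passing:
        return "never passes"
    first = passing[0]
    if any(score < threshold for score in scores[first:]):
        # Passed once, then fell back below the bar at a larger budget.
        return "non-monotonic"
    if first == 0:
        return "passes at zero budget"
    if all(score == 0.0 for score in scores[:first]):
        return "cliff"
    return "gradual"

qwen3_5_beast = [(0, 0.0), (512, 0.0), (2048, 0.0), (8192, 1.0)]
devstral_123b = [(0, 0.50), (512, 0.833), (2048, 0.50)]

print(curve_shape(qwen3_5_beast))  # cliff
print(curve_shape(devstral_123b))  # non-monotonic
```

The practical difference: for a cliff, any budget below the passing level is wasted; for a non-monotonic curve, any budget above the sweet spot is.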

Devstral-small is still the mutation champion

Despite qwen3-5-beast winning on domain score, devstral-small-beast wins on what matters more for long-running agents: mutation stability. Phases 1 and 2 are perfect (1.0/1.0), regression is perfect (1.0), and churn rate is 0.0. When you hand this model a domain model and ask it to evolve it over multiple turns, it does so without losing history.

The high_cohesion = false flag is interesting — it means the model doesn't enforce semantic grouping constraints. For most agent tasks, this doesn't matter.

The DSL level of 2 (vs. 5 for most larger models) is a real limitation if you're using it as an atomus backend for complex decomposition tasks. Use it for mutation work; use devstral-123b or deepseek-r1-32b for initial decomposition.

Qwen3-5-122B: size doesn't buy what you expect

qwen3-5-122b (90 GB, ness-linux3) finished its probe after an earlier crash. Domain modeling score: 0.827 — solid, but below the 10 GB qwen3-5-beast (0.919) that runs on the same node. The 122B model's decomposition score at Level 1 is only 0.33 (it over-collapsed the hierarchy), while its L3 score of 0.897 is excellent. It gets harder problems right and easier ones wrong — a sign of a model that thinks in complex abstractions by default.

Mutation scores are mediocre (0.5/0.5/0.68, regression=0.5), with the same weak mutation phases as deepseek-r1-32b. It's not a good model for iterative refinement loops. Where it distinguishes itself: negative_instruction = 1.0 and conflict_resolution = "FollowsSystem", the best instruction compliance in the fleet alongside devstral-123b and gemma4-31b. It also has DSL level 5 and min_budget_75pct = 2048 (hard cliff, same as gemma4).

The real problem: TTFT of 12.1s and tokens_per_sec = 0.0. At 90 GB with no quantisation headroom on a 96 GB node, every token is slow. Use this for async, high-stakes single-shot tasks where answer quality trumps latency. Don't put it in a fast loop.

Devstral-123b: the balance point for heavy agentic work

devstral-123b is the only model that scores well on both domain modeling (0.841) and mutation (phase1=1.0, regression=1.0), while also having DSL level 5. Its budget sweet spot is just 512 tokens, which makes it the most reasoning-efficient heavy model in the fleet. The budget-matrix dip at 2048 tokens (0.50) versus 512 (0.83) is a genuine quirk worth remembering: don't over-think it.

At 90 GB VRAM and 0.7 TPS, it's not fast. But for structured planning, complex tool orchestration, and domain modeling tasks where quality matters more than latency, this is currently the recommended choice on ness-linux3.

DeepSeek R1-32B: best raw reasoning, weak mutation stability

deepseek-r1-32b combines a domain score of 0.856 with DSL level 5 (full Prolog capability) and native thinking emission (toggle = AlwaysOn). It's also fast at 640ms TTFT and 9.0 TPS. It needs 8192 thinking tokens to reach full potential, but at this speed that's manageable.

The problem: mutation phases are all 0.5. This model rewrites the world. It's great at initial decompositions but poor at evolving an existing model in place. Use it for the first pass; don't use it for iterative refinement loops.

Also notable: conflict_resolution = "Unpredictable" — this model sometimes ignores the system prompt when it conflicts with its training priors. In practice this means it needs clean, forceful system prompts without ambiguity.

Gemma4-31b: the quiet overperformer

gemma4-31b has no thinking emission at all (toggle = AlwaysOff) and yet scores DSL level 5, domain score 0.798, and reports passes_without_thinking = true on the cognitive sweep. Its budget matrix shows a hard requirement though: below 2048 budget tokens, the budget-matrix score drops to 0.0. It reasons exclusively through in-context CoT, and needs room to do it.

The 12.5s TTFT is painful. It is caused by this model architecture being served over the HTTP path rather than the direct llama.cpp path, which is a deployment detail, not a model limitation. Worth investigating if you want this one in a fast loop.

Phi4-beast: the speed/reasoning outlier

phi4-beast is the only complete model with min_budget_75pct = 0: it reasons reliably at zero thinking budget. DSL level 5, domain score 0.781, 552ms TTFT, 14.1 TPS. For tasks that need fast, light, structured reasoning without thinking overhead, this is the choice.

The weakness: negative_instruction = 0.25 — it often does what you told it not to do. For agent tasks where constraint compliance matters (e.g., "do not modify X"), phi4-beast needs explicit positive restatements, not negative instructions.
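
One way to act on this is to pick the constraint phrasing from the profile's negative_instruction score. A minimal sketch: the function name and the exact wording are hypothetical, and the 0.75 cut-off is an assumption borrowed from the role table's "negative_instruction ≥ 0.75" criterion.

```python
def constraint_line(negative_instruction: float, forbidden: str, allowed: str,
                    cutoff: float = 0.75) -> str:
    """Phrase a constraint for the system prompt.

    Models at or above the cutoff get the plain prohibition; weaker models get
    a positive restatement that leads with the permitted action.
    """
    if negative_instruction >= cutoff:
        return f"Do not {forbidden}."
    return f"Only {allowed}. Leave everything else unchanged."

# phi4-beast scores 0.25, so it gets the positive restatement.
print(constraint_line(0.25, "modify the billing module",
                      "edit files under src/reports"))
# A model scoring 1.0 can take the prohibition as written.
print(constraint_line(1.0, "modify the billing module",
                      "edit files under src/reports"))
```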

The mx-legacy surprise: Qwen3-5 punches above its weight

qwen3-5-mx (7 GB on the CPU broker) shows a domain modeling score of 0.849 — better than mistral-large, gemma4-31b, and phi4-beast on the main inference node. This is incomplete (mutation probe didn't finish), but the domain modeling quality suggests the quantized Qwen3.5 retains more reasoning capability than expected at low VRAM.

The TTFT of 15.5s is expected for CPU inference, and tokens_per_sec = 0.0 suggests the TPS measurement hit a timeout. Still, for async workloads where latency is irrelevant, small Qwen models on the broker are more capable than their VRAM budget suggests.


What capability scores mean in practice

Here's how I use these profiles when selecting models for agent roles:

| Agent role | Key capability | Best candidates |
|---|---|---|
| Structured planner (initial HTN/DSL generation) | DSL level 5 + domain modeling | devstral-123b, deepseek-r1-32b |
| Model evolver (iterative agent reasoning) | Mutation phase 1–3 + regression | devstral-small-beast, devstral-123b |
| Fast responder (low-latency tool calls) | Low TTFT + passes_without_thinking | phi4-beast, hermes-beast |
| Heavy reasoner (complex single-shot tasks) | High domain score + thinking budget | qwen3-5-beast (async), devstral-123b |
| Instruction follower (strict protocol compliance) | negative_instruction ≥ 0.75 | devstral-123b, gemma4-31b |
| Budget-aware tasks | min_budget_75pct low | phi4-beast (0t), devstral-123b (512t) |
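
The role-to-model mapping above can be written as predicates over the profile data. This is a sketch of the idea, not how ipsa-agent selects models: the Profile class, the role names, and every threshold are assumptions, and only five ness-linux3 models are included to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Profile:
    name: str
    domain: float
    phases: tuple               # (phase1, phase2, phase3) mutation scores
    regression: float
    min_budget: Optional[int]   # None = no min_budget_75pct in the profile
    dsl_level: int
    ttft_ms: int

# Values copied from the ness-linux3 results.
FLEET = [
    Profile("qwen3-5-beast", 0.919, (0.75, 0.75, 0.75), 0.50, 8192, 4, 7040),
    Profile("devstral-small-beast", 0.860, (1.0, 1.0, 0.93), 1.0, None, 2, 643),
    Profile("deepseek-r1-32b", 0.856, (0.5, 0.5, 0.5), 1.0, 8192, 5, 640),
    Profile("devstral-123b", 0.841, (1.0, 0.92, 0.86), 1.0, 512, 5, 1323),
    Profile("phi4-beast", 0.781, (0.75, 0.67, 0.68), 0.50, 0, 5, 552),
]

# Role predicates; every threshold here is an illustrative assumption.
ROLES: dict = {
    "structured_planner": lambda p: p.dsl_level == 5 and p.domain >= 0.84,
    "model_evolver": lambda p: min(p.phases) >= 0.85 and p.regression >= 0.9,
    "fast_responder": lambda p: p.ttft_ms < 1000 and p.min_budget == 0,
}

def candidates(role: str) -> list:
    """Names of models that satisfy the role, best domain score first."""
    check: Callable[[Profile], bool] = ROLES[role]
    fits = [p for p in FLEET if check(p)]
    return [p.name for p in sorted(fits, key=lambda p: p.domain, reverse=True)]

print(candidates("structured_planner"))  # ['deepseek-r1-32b', 'devstral-123b']
print(candidates("model_evolver"))       # ['devstral-small-beast', 'devstral-123b']
print(candidates("fast_responder"))      # ['phi4-beast']
```

Keeping the criteria as data makes it easy to re-run selection whenever a profile is re-probed, instead of hand-editing a list of favourites.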

Profile format reference

Each model produces a TOML profile at projects/ipsa-agent/profiles/<name>.toml. The key sections:

# How the model handles structured markup
[xml_tags]
open_close_fidelity = 1.0    # Does it close all tags?
custom_tags = true           # Does it emit user-defined tag names?

# Protocol-level format support
[prompt_format]
system_priority = "Respected"  # Overrideable / Weak / Respected / Strict
conflict_resolution = "FollowsSystem"  # What wins when instructions conflict?

# Thinking block behavior
[thinking]
emission = "TaggedPrompted"  # None / TaggedNative / TaggedPrompted
toggle = "PromptControlled"  # AlwaysOff / AlwaysOn / PromptControlled

# The core cognitive tests
[domain_modeling]
overall_score = 0.841

[domain_mutation]
phase1_score = 1.0
regression_score = 1.0

# How much thinking budget does it need?
[budget_matrix]
min_budget_75pct = 512

# Can it write valid structured DSL programs?
[dsl_compilation]
reasoning_level = 5
reasoning_quotient = 1.0

# What cognitive class is it?
[taxonomy]
cognitive_class = "System2"   # System1 = pattern matching; System2 = deliberate reasoning

What's next

The profiles are the ground truth for model selection in the ipsa-agent framework. Every agent configuration that references a model should look up the relevant capability (system_priority, min_budget_75pct, dsl_compilation.reasoning_level) and use it to set request parameters correctly, rather than relying on guessed defaults.
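
As a sketch of what that lookup might produce, the function below turns a parsed profile into per-request settings. The output keys (thinking_budget_tokens, require_human_review, repeat_constraints_in_user_turn) are hypothetical names, not ipsa-agent parameters; the review rule follows the "level ≤2" guidance from the DSL section, and repeating constraints for weak system_priority is my own assumption.

```python
def request_params(profile: dict) -> dict:
    """Derive request settings from a parsed profile (all output keys hypothetical)."""
    budget = profile.get("budget_matrix", {}).get("min_budget_75pct")
    dsl = profile.get("dsl_compilation", {}).get("reasoning_level", 0)
    priority = profile.get("prompt_format", {}).get("system_priority", "Weak")
    return {
        # None means the profile has no minimum budget on record.
        "thinking_budget_tokens": budget,
        # DSL level <= 2: every generated program gets a human look.
        "require_human_review": dsl <= 2,
        # Loose system-prompt authority: restate constraints in the user turn.
        "repeat_constraints_in_user_turn": priority in ("Overrideable", "Weak"),
    }

# Field values taken from the reference profile above.
devstral_123b = {
    "prompt_format": {"system_priority": "Respected"},
    "budget_matrix": {"min_budget_75pct": 512},
    "dsl_compilation": {"reasoning_level": 5},
}

print(request_params(devstral_123b))
# A profile with missing sections falls back to the cautious settings.
print(request_params({}))
```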


All probes run against models serving through loch-nessh at http://ness-linux3:32100. Profile source: projects/ipsa-agent/profiles/. Probe harness: crates/ipsa-probe in ai-workbench.