title: "Scheduling GPU Workloads at Home: A Capacity Model That Scales"
slug: "homelab-gpu-scheduling-capacity-model"
tags: [local-ai, gpu-scheduling, capacity-planning, loch-nessh, kubernetes, cortex, architecture]
summary: "A rigorous capacity model for home GPU clusters, one that works on a single node today and extends naturally when new hardware arrives. Built around three state vectors, six diagnostic metrics, and the actual implementation in loch-nessh."
menu_title: "GPU Scheduling Whitepaper"
draft: false

Scheduling GPU Workloads at Home: A Capacity Model That Scales

Most homelab GPU schedulers start from the wrong abstraction. They model the GPU as a binary resource — in use or not — and handle "capacity" by watching whether the pod crashes. This works until it doesn't: until you have three models loaded simultaneously, a video generation job queued behind a 72B inference run, and a phantom pod holding VRAM the scheduler thinks is free.

This document describes a different approach. One that treats GPU memory the same way a cloud scheduler treats CPU millicores — as a first-class resource with tracked allocation, observed usage, and computable headroom. It's what I built into loch-nessh for the Cortex cluster. And unlike most homelab one-offs, the math doesn't change when you add nodes.

The model: three vectors, six metrics

The foundation is dimension-agnostic. A "resource" is any quantity that can be measured, reserved, and consumed. For GPU scheduling, the primary dimension is VRAM — but the framework extends identically to RAM, CPU slots, and network bandwidth.

For any resource dimension, track three state vectors:

Vector	What it is	Where it lives in Cortex
Capacity	Absolute physical maximum	Configmap `total_vram_gb` per node
Allocation	Sum of reserved amounts	Valkey `registry:node:{node}:vram` (incremented on pod scale-up)
Usage	Actual real-time consumption	Not yet instrumented; approximated by Allocation

From these three vectors, six diagnostic metrics fall out of pure arithmetic:

Metric	Formula	What it tells you
Over-Subscription	Allocation / Capacity	>1.0 = you've promised more than you have
Utilization	Usage / Allocation	>1.0 = starvation/throttling; <0.3 = expensive waste
Saturation	Usage / Capacity	Physical stress; alert at >0.85
Headroom	Capacity − Usage	Raw buffer before failure
Potential	Allocation − Usage	Idle reserved resources
Risk	Potential / Headroom	>1.0 = burst allocations could exceed physical capacity
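
As a quick illustration, the six metrics can be written as plain properties over the three vectors. This is a minimal sketch, not loch-nessh code: the class name ResourceState is hypothetical, the handling of the zero-headroom and zero-allocation edge cases is my own assumption, and the usage figure of 60 GB in the example is made up to show a case where Usage differs from Allocation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceState:
    """The three state vectors for one resource dimension on one node (GB)."""
    capacity: float
    allocation: float
    usage: float

    @property
    def over_subscription(self) -> float:
        return self.allocation / self.capacity

    @property
    def utilization(self) -> float:
        # Undefined when nothing is reserved; report 0.0 for an empty node.
        return self.usage / self.allocation if self.allocation else 0.0

    @property
    def saturation(self) -> float:
        return self.usage / self.capacity

    @property
    def headroom(self) -> float:
        return self.capacity - self.usage

    @property
    def potential(self) -> float:
        return self.allocation - self.usage

    @property
    def risk(self) -> float:
        # Potential / Headroom. With no headroom left, any idle reservation
        # is treated as unbounded risk (an assumption of this sketch).
        if self.headroom <= 0:
            return float("inf") if self.potential > 0 else 0.0
        return self.potential / self.headroom


# Hypothetical reading: 96 GB node, 86 GB reserved, 60 GB actually in use.
state = ResourceState(capacity=96, allocation=86, usage=60)
```

Here state.headroom is 36 GB and state.potential is 26 GB, so a little over a quarter of the node is reserved but idle.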

This is the same vocabulary cloud capacity planning uses. The math is identical whether you're scheduling GPU pods on a home cluster or EC2 instances in us-east-1. The difference is scale, not structure.

How loch-nessh implements this today

The Cortex cluster has three scheduling-relevant nodes:

Node	GPU	VRAM	Role
ness-linux3	AMD Radeon 8060S (gfx1151)	96 GB	Primary inference — all LLMs and media models
ness-server2	Tesla P4	8 GB	Secondary — small models, CPU-capable fallback
ness-legion1	RTX 4060 Mobile	8 GB	Burst inference — CUDA models

The VRAM ledger

loch-nessh maintains a VRAM ledger in Valkey. When a model pod scales up, its vram_gb (from the configmap) is added to registry:node:{node}:vram. When the pod terminates cleanly, it's subtracted. The broker checks available VRAM before every scale-up:

if (current_allocation + model.vram_gb as i64) > node.total_vram_gb as i64 {
    return Err(BrokerError::InsufficientVram);
}

This is the Allocation vector. The Capacity vector is the total_vram_gb in the configmap. Usage is currently approximated as equal to Allocation — a conservative assumption that treats every loaded model as consuming its full budget.
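
The same check-then-reserve logic can be sketched in a few lines of Python, with a dictionary standing in for the Valkey counters. The class and method names (VramLedger, try_reserve, release) are hypothetical, and the floor at zero on release is my assumption. Against a real store, the check and the increment would also need to happen as one atomic step, which this in-memory version sidesteps.

```python
class VramLedger:
    """In-memory stand-in for the per-node Valkey counters (whole GB)."""

    def __init__(self, capacity_gb: dict[str, int]):
        self.capacity_gb = dict(capacity_gb)             # Capacity vector
        self.allocated_gb = {n: 0 for n in capacity_gb}  # Allocation vector

    def try_reserve(self, node: str, vram_gb: int) -> bool:
        """Reserve VRAM for a scale-up; refuse if it would exceed Capacity."""
        if self.allocated_gb[node] + vram_gb > self.capacity_gb[node]:
            return False
        self.allocated_gb[node] += vram_gb
        return True

    def release(self, node: str, vram_gb: int) -> None:
        """Return VRAM after a clean termination; never drop below zero."""
        self.allocated_gb[node] = max(0, self.allocated_gb[node] - vram_gb)


ledger = VramLedger({"ness-linux3": 96})
# 54 GB and 32 GB fit; a further 38 GB would push the node past 96 GB.
accepted = [ledger.try_reserve("ness-linux3", gb) for gb in (54, 32, 38)]
```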

Phantom Capacity: the edge case that matters

The most dangerous failure mode is the phantom: a pod that has crashed or been evicted but whose Valkey allocation entry was cleaned up before the GPU actually released the memory. If loch-nessh's VRAM registry shows 0 GB allocated but a zombie pod is still holding 30 GB, the next scale-up will appear to succeed — then fail at runtime when the kernel can't allocate.

loch-nessh handles this with the Phantom Capacity rule: VRAM is freed in the ledger only after the terminating pod (the one carrying a deletionTimestamp) has actually disappeared from the API, not when the scale-down command is issued. This deliberately biases the ledger toward over-estimating Allocation rather than under-estimating it. A false "insufficient VRAM" is a retryable error. A false "sufficient VRAM" only surfaces at runtime, as a failed allocation after the scheduler has already committed.
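
The ordering is the whole point of the rule, so here is a minimal sketch of it. Everything is injected as a callable so the example stays self-contained: scale_to_zero, pod_exists, and release_vram are hypothetical hooks, and the timeout and polling interval are assumed values, not loch-nessh settings.

```python
import time
from typing import Callable


def evict(
    scale_to_zero: Callable[[], None],
    pod_exists: Callable[[], bool],
    release_vram: Callable[[], None],
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Free ledger VRAM only once the pod is confirmed gone.

    Returns True if the VRAM was released, False if the pod was still
    present at the timeout (the allocation is deliberately kept).
    """
    scale_to_zero()
    waited = 0.0
    while pod_exists():
        if waited >= timeout_s:
            # Keep over-estimating Allocation rather than free too early.
            return False
        sleep(poll_s)
        waited += poll_s
    release_vram()
    return True
```

If the pod never goes away, the ledger keeps the reservation and the node looks fuller than it is, which is the safe direction to be wrong in.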

The GPU lock

For single-GPU nodes, there's a second constraint below the VRAM level: the execution pipeline itself. Two models can be loaded simultaneously (their weights fit), but only one can be actively computing at a time.

loch-nessh implements this with a Valkey lock: a SET with the NX option on the key lock:gpu:{node}:0, with a 60-second TTL that the holder refreshes on a 20-second heartbeat. The claim that wins the lock runs; others queue behind it. This is the binary layer the naive scheduler treats as the only layer. We treat it as the bottom layer, sitting below the continuous VRAM accounting.
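
To make the lock semantics concrete, here is a sketch that models them in memory, so it runs without a Valkey server. FakeStore and its methods are hypothetical stand-ins for "set if absent with expiry", "refresh if still owner", and "delete if still owner". Storing a per-claim owner token as the lock value is my assumption; the text above only specifies the key, the TTL, and the heartbeat interval.

```python
import time
import uuid


class FakeStore:
    """Tiny in-memory model of the key-value operations the lock needs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > self._clock():
            return entry
        self._data.pop(key, None)  # expired or missing
        return None

    def set_nx_ex(self, key, value, ttl_s):
        """Like SET key value NX EX ttl: succeeds only if the key is absent."""
        if self._live(key):
            return False
        self._data[key] = (value, self._clock() + ttl_s)
        return True

    def refresh_if_owner(self, key, value, ttl_s):
        """Heartbeat: extend the TTL only if we still hold the lock."""
        entry = self._live(key)
        if not entry or entry[0] != value:
            return False
        self._data[key] = (value, self._clock() + ttl_s)
        return True

    def delete_if_owner(self, key, value):
        """Release: remove the key only if it is still ours."""
        entry = self._live(key)
        if entry and entry[0] == value:
            del self._data[key]
            return True
        return False


LOCK_TTL_S = 60  # TTL from the text; the holder refreshes every 20 seconds


def acquire_gpu(store, node, gpu=0):
    """Return an owner token if the lock was won, else None."""
    token = uuid.uuid4().hex
    key = f"lock:gpu:{node}:{gpu}"
    return token if store.set_nx_ex(key, token, LOCK_TTL_S) else None
```

The TTL is what keeps a crashed holder from wedging the GPU forever: if three heartbeats in a row are missed, the key expires and the next claim in the queue can win it.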

Worked example: ness-linux3 at capacity

Current model fleet on ness-linux3 (with representative VRAM budgets):

devstral-small    54 GB
qwen3-6-beast     68 GB
hermes-beast      32 GB
flux2-dev         60 GB
wan22-video       38 GB
nomic-embed        1 GB

These cannot all be resident at once: the sum (253 GB) far exceeds the 96 GB pool. What loch-nessh actually manages is a sliding window of loaded models, evicted by idle TTL:

Example active state: devstral-small + hermes-beast co-loaded

Capacity:   96 GB
Allocation: 54 + 32 = 86 GB
Usage:      ≈86 GB (approximated)
Headroom:   10 GB
Potential:  0 GB (all reserved VRAM is in use)
Risk:       0.0 / 10 = 0.0 (no burst risk; at natural ceiling)

Add wan22-video (38 GB) — refused:

86 + 38 = 124 GB > 96 GB
→ BrokerError::InsufficientVram
→ wan22-video enqueued, waits for eviction

After hermes-beast TTL expires and pod terminates:

Allocation: 54 + 38 = 92 GB → fits
Risk:       0 GB potential / 4 GB headroom = 0.0
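
The refuse, enqueue, and admit-after-eviction sequence above can be replayed in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the first-in-first-out order for draining the queue is my choice for the example, not a documented loch-nessh policy.

```python
from collections import deque

CAPACITY_GB = 96
BUDGET_GB = {"devstral-small": 54, "hermes-beast": 32, "wan22-video": 38}


def allocation(loaded):
    """Sum of configmap budgets for the currently loaded models."""
    return sum(BUDGET_GB[m] for m in loaded)


def request(model, loaded, queue):
    """Load the model if it fits under Capacity, otherwise enqueue it."""
    if allocation(loaded) + BUDGET_GB[model] <= CAPACITY_GB:
        loaded.add(model)
        return "loaded"
    queue.append(model)
    return "queued"


def on_evicted(model, loaded, queue):
    """After an idle-TTL eviction completes, admit queued models in order."""
    loaded.discard(model)
    while queue and allocation(loaded) + BUDGET_GB[queue[0]] <= CAPACITY_GB:
        loaded.add(queue.popleft())


loaded, queue = {"devstral-small", "hermes-beast"}, deque()
first = request("wan22-video", loaded, queue)  # 86 + 38 > 96, so it queues
on_evicted("hermes-beast", loaded, queue)      # 54 + 38 = 92, so it loads
```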

The system self-regulates. The TTL is the eviction pressure valve. When the cluster is idle, models drain. When it's busy, loch-nessh queues requests behind the running workload and loads the next model as headroom opens.

Diagnostic alerts you can derive from the model

With Valkey as the state store and a simple scraper, these metrics become real-time dashboards:

Saturation alert (>0.85)

(current_vram_allocated / total_vram_gb) > 0.85

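Until Usage is instrumented, the numerator here is Allocation, so this is Over-Subscription standing in for Saturation; the two coincide while Usage ≈ Allocation.

Action: lower model TTLs so eviction happens faster under load.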

Risk alert (>1.0)

(vram_potential / vram_headroom) > 1.0

This condition means: if every loaded model suddenly ran at its maximum allocation simultaneously, the node would OOM. In practice this can't happen (the execution lock prevents it), but it's a signal that the Allocation estimates are optimistic relative to physical headroom.
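
It helps to see what the Risk threshold reduces to algebraically. Writing A for Allocation, U for Usage, and C for Capacity (shorthand introduced here), and assuming there is still some headroom (U < C):

```latex
\begin{aligned}
\text{Risk} &= \frac{\text{Potential}}{\text{Headroom}}
             = \frac{A - U}{C - U}, \qquad U < C \\[4pt]
\text{Risk} > 1
  &\iff A - U > C - U \\
  &\iff A > C \\
  &\iff \text{Over-Subscription} = \frac{A}{C} > 1
\end{aligned}
```

So the Risk alert and an Over-Subscription alert at 1.0 fire under the same condition. As long as the broker's admission check keeps Allocation at or below Capacity, Risk stays at or below 1.0, and a firing alert would point at a ledger that has drifted from the configmap rather than at normal load.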

Zombie detection

ledger_says_free AND any(pod.phase == Running AND pod.model in loaded_models)

This is the phantom capacity check. loch-nessh doesn't currently run a periodic reconciliation loop — that's a known gap. A background coroutine that compares Valkey state to k8s pod state and re-adds any orphaned VRAM allocations would close it.
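
A reconciliation pass of that kind could look roughly like the sketch below. It is a pure function over plain dictionaries so it runs standalone; the function name, the argument shapes, and the choice to compare per-node totals rather than individual entries are all assumptions, since loch-nessh does not implement this yet.

```python
def reconcile(ledger_gb, running_pods, budget_gb):
    """Return corrected per-node allocations and the orphans that were found.

    ledger_gb:    {node: GB currently recorded as allocated}
    running_pods: iterable of (node, model) for pods the cluster still reports
    budget_gb:    {model: vram_gb from the configmap}
    """
    observed = {node: 0 for node in ledger_gb}
    for node, model in running_pods:
        observed[node] = observed.get(node, 0) + budget_gb[model]

    corrected, orphans = {}, {}
    for node, seen in observed.items():
        recorded = ledger_gb.get(node, 0)
        # Only ever raise the ledger: a pod that still exists must be counted.
        corrected[node] = max(recorded, seen)
        if seen > recorded:
            orphans[node] = seen - recorded
    return corrected, orphans


# The ledger thinks only devstral-small is loaded, but hermes-beast still runs.
corrected, orphans = reconcile(
    ledger_gb={"ness-linux3": 54, "ness-server2": 0},
    running_pods=[("ness-linux3", "devstral-small"),
                  ("ness-linux3", "hermes-beast")],
    budget_gb={"devstral-small": 54, "hermes-beast": 32},
)
```

The pass only ever adds allocation back. Lowering the ledger stays the job of the eviction path, so the two cannot race each other into under-counting.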

Scaling to a multi-node fleet

The framework extends to N nodes without structural changes. Each node gets its own capacity vector. The broker becomes a fleet-level scheduler.

What changes:

The routing decision gains a second dimension. Instead of "does this model fit on the only node?", it becomes "which node is the best fit for this model right now?"

A minimal routing heuristic:

QUEUE = None  # sentinel: nothing fits, caller should enqueue and wait for eviction

def best_node(model, nodes):
    # A node that already holds the model needs no new VRAM, so check for
    # warm weights before applying the headroom filter.
    warm = [n for n in nodes if model in n.loaded_models]
    if warm:
        return min(warm, key=lambda n: n.saturation)
    # cold load: only nodes with enough headroom for the full budget qualify
    candidates = [n for n in nodes if n.headroom >= model.vram_gb]
    if not candidates:
        return QUEUE
    # otherwise, pick the node with the most headroom
    return max(candidates, key=lambda n: n.headroom)

What stays the same:

The three vectors (Capacity, Allocation, Usage) per node
The six derived metrics
The Valkey ledger structure — it's already keyed by node: registry:node:{node}:vram
The GPU lock per node
The phantom capacity rule

Concrete: adding a fourth node

Say a second ness-linux3-class machine arrives — call it ness-linux4, same 96 GB unified VRAM. The only changes needed:

Add node entry to the configmap: total_vram_gb: 96
Add model entries with node: ness-linux4 for models you want running there
loch-nessh initialises the Valkey registry key on startup

The broker loop runs unchanged. It doesn't care how many nodes exist — it iterates the registry and finds the first (or best) fit.

The limits of Allocation as a proxy for Usage

The current implementation treats Allocation ≈ Usage. This is safe but imprecise.

The actual VRAM a model consumes depends on:

Weight footprint (known; stored in the configmap)
KV cache size (varies with context length and slot count)
Activation buffers (varies with batch size and sequence length)

A devstral-small at 256K context with q8_0 KV and a single parallel slot uses ~46 GB. The same model at 32K context uses ~26 GB. The configmap vram_gb: 54 covers the worst case. The scheduler is always safe, but it sometimes refuses requests that would physically fit.

Closing this gap requires per-claim VRAM telemetry: reading actual usage from the GPU after each inference run and feeding it back into a rolling estimate. On AMD with ROCm, that's hipMemGetInfo(), which reports free and total device memory. On Vulkan, it requires VK_EXT_memory_budget (supported on RADV). Neither is implemented yet. Once one of them is, the Utilization metric becomes real rather than assumed.
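
One possible shape for that rolling estimate is a windowed maximum with a safety margin, capped at the configmap budget. This is only a sketch: the class name, the window size, the 10% margin, and the sample readings are all assumptions, and the readings would come from whatever telemetry source ends up being wired in.

```python
from collections import deque


class RollingVramEstimate:
    """Rolling-maximum estimate of a model's real VRAM use, in GB."""

    def __init__(self, budget_gb: float, window: int = 20, margin: float = 1.10):
        self.budget_gb = budget_gb           # configmap worst-case budget
        self.margin = margin                 # safety factor over the observed peak
        self.samples = deque(maxlen=window)  # most recent per-claim readings

    def observe(self, used_gb: float) -> None:
        self.samples.append(used_gb)

    def estimate_gb(self) -> float:
        # Until the window is full, stay on the conservative configmap budget.
        if len(self.samples) < self.samples.maxlen:
            return self.budget_gb
        return min(self.budget_gb, max(self.samples) * self.margin)

    def utilization(self) -> float:
        """Usage / Allocation for this model, from the latest reading."""
        return self.samples[-1] / self.budget_gb if self.samples else 1.0


est = RollingVramEstimate(budget_gb=54, window=3)
for reading in (26.0, 27.5, 26.4):  # hypothetical per-claim readings
    est.observe(reading)
```

Using the window maximum rather than a mean keeps the estimate biased toward safety, in the same spirit as the Phantom Capacity rule, and a single large reading pushes it straight back up to the configmap budget.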

The model in one page

For each node N in the fleet:

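  Capacity[N]          = configmap.total_vram_gb[N]
  Allocation[N]        = Σ vram_gb for all loaded models on N    (Valkey)
  Usage[N]             = Σ actual VRAM consumed on N              (GPU telemetry; currently ≈ Allocation)

  Over-Subscription[N] = Allocation[N] / Capacity[N]
  Utilization[N]       = Usage[N] / Allocation[N]
  Saturation[N]        = Usage[N] / Capacity[N]
  Headroom[N]          = Capacity[N] - Usage[N]
  Potential[N]         = Allocation[N] - Usage[N]
  Risk[N]              = Potential[N] / Headroom[N]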

On every model load request:
  1. Check Allocation[N] + model.vram_gb ≤ Capacity[N]  → refuse and enqueue if false
  2. Scale pod up; add model.vram_gb to Allocation[N]
  3. Acquire GPU lock[N]                                → queue if held
  4. Execute claim
  5. Release GPU lock[N]

On model eviction:
  1. Scale pod to 0
  2. Wait for the terminating pod to be gone            → Phantom Capacity rule
  3. Subtract model.vram_gb from Allocation[N]

Alerts:
  Saturation[N]  > 0.85  → reduce TTL on N
  Risk[N]        > 1.0   → re-evaluate Allocation estimates
  Utilization[N] < 0.3 or > 1.0 (when telemetry is live) → trigger right-sizing review

This is the whole thing. It fits in a page. It runs today on a cluster of three nodes. And when node four arrives, it runs there too.