I'd had the Wan2.2 model loaded on ness-linux3 for a few days. The deployment manifest was in git, the weights were on disk, loch-nessh would load the pod on demand. Everything looked right. But every video generation request came back the same way: a tiny WebM file, correct dimensions, VP8 codec — and exactly one frame.
This is the story of finding out why, and what it took to get past it.
/tmp/fox-10s.webm — 65 KB, 512×512, 16 fps, 1 frame. Duration: 0.001 seconds.
```sh
$ ffprobe -v quiet -select_streams v:0 -count_frames \
    -show_entries stream=nb_read_frames,r_frame_rate,width,height \
    -of default=noprint_wrappers=1 /tmp/fox-10s.webm
width=512
height=512
r_frame_rate=16/1
nb_read_frames=1
```
The filename said 10 seconds. The file said one frame. Something was wrong before the model even started generating.
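The same check is easy to script so that every generated clip is verified automatically. The following is a minimal Python sketch, not part of loch-nessh: the helper names (`parse_ffprobe_kv`, `probe_video`, `frame_count_ok`) are made up for illustration, and it assumes `ffprobe` is on the PATH and is invoked with the same arguments as above.

```python
import subprocess

def parse_ffprobe_kv(text):
    """Parse ffprobe's `-of default=noprint_wrappers=1` output (key=value lines)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip blank or malformed lines
            fields[key] = value
    return fields

def probe_video(path):
    """Run the ffprobe command shown above and return the parsed fields."""
    cmd = [
        "ffprobe", "-v", "quiet", "-select_streams", "v:0", "-count_frames",
        "-show_entries", "stream=nb_read_frames,r_frame_rate,width,height",
        "-of", "default=noprint_wrappers=1", path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_ffprobe_kv(out)

def frame_count_ok(fields, expected_frames):
    """True only when the decoded frame count equals the requested count."""
    try:
        return int(fields.get("nb_read_frames", "")) == expected_frames
    except ValueError:  # ffprobe can report a non-numeric value such as N/A
        return False
```

Calling `frame_count_ok(probe_video("/tmp/fox-10s.webm"), 160)` on the file above would have returned `False` immediately, without anyone needing to open the video.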
loch-nessh translates OpenAI-format image/video requests into the sdcpp vid_gen HTTP API. When I looked at what it was actually sending, I found this in executor.rs:
```rust
let vid_body = serde_json::json!({
    "prompt": req.get("prompt")...,
    "sample_params": {
        "width": width,
        "height": height,
        "video_frames": req.get("video_frames")...unwrap_or(33),
        "fps": ...,
        "sample_steps": ...,
        "cfg_scale": ...,
    },
    "output_format": "webm",
});
```
The parameters were wrapped inside a sample_params object. I tested a job directly against the sdcpp pod to find out what it actually expected:
```sh
# nested — always 1 frame
curl .../sdcpp/v1/vid_gen -d '{"prompt":"test","sample_params":{"video_frames":5,...}}'

# flat — correct frame count
curl .../sdcpp/v1/vid_gen -d '{"prompt":"test","video_frames":5,"width":256,"height":256,...}'
```
Five frames with flat params, one frame with nested params. The sdcpp server accepted the nested body without error — it just silently fell back to its default of one frame when it didn't find video_frames at the top level. The fix was straightforward: unwrap the object, hoist everything to the root.
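The real fix lives in the Rust code in executor.rs, which is not reproduced here. As an illustration of the transformation only, here is a small Python sketch of "hoist everything to the root". The function name `flatten_vid_body` is hypothetical, and letting existing top-level fields win on a key conflict is an assumption, not something the sdcpp server specifies.

```python
import json

def flatten_vid_body(body, wrapper_key="sample_params"):
    """Return a copy of `body` with the wrapper object's fields hoisted to the root."""
    # Copy everything except the wrapper itself; the input dict is left untouched.
    flat = {k: v for k, v in body.items() if k != wrapper_key}
    nested = body.get(wrapper_key)
    if isinstance(nested, dict):
        for key, value in nested.items():
            # Assumption: a field already present at the top level takes precedence.
            flat.setdefault(key, value)
    return flat

nested_body = {
    "prompt": "test",
    "sample_params": {"video_frames": 5, "width": 256, "height": 256},
    "output_format": "webm",
}
print(json.dumps(flatten_vid_body(nested_body), indent=2))
```

The printed body has `video_frames`, `width`, and `height` next to `prompt`, which is the shape the flat curl test used.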
I fixed it, rebuilt, pushed, redeployed. Tested a small case. Worked: 5 frames, 256×256, confirmed with ffprobe. Great.
Before going further, here's the fox that started this whole investigation. It was generated with FLUX.2-dev earlier in the session, a still image that became the conceptual target for the video.
[Image: fox still generated with FLUX.2-dev]
With the API shape fixed, I tried to generate an actual 10-second video: 160 frames at 16 fps at 832×480, the native Wan2.2 resolution. The pod received the request and logged:
```text
[WARN] stable-diffusion.cpp — align video frames from 160 to 157 for Wan 2.x
[INFO] generate_video 832x480x157
ggml_vulkan: Device memory allocation of size 31539456000 failed.
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] Wan2.x-T2V-14B: failed to allocate the compute buffer
```
31.5 GB requested. Failed immediately. I checked the Vulkan device properties inside the pod:
```text
maxMemoryAllocationSize = 0xfffffffc # 4 GB
```
There it is. The RADV driver on gfx1151 (the iGPU in the Ryzen AI Max+ 395) caps every single Vulkan memory allocation at just under 4 GiB. This isn't a total VRAM limit, since the hardware has 96 GB of unified memory pinned to the GPU. It's a per-allocation limit that the driver reports through the Vulkan API as maxMemoryAllocationSize.
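To make the gap concrete, the arithmetic can be checked in a few lines of Python. This sketch uses only the two numbers already shown: the hex limit reported by the driver and the allocation size from the ggml_vulkan log line. The helper name `allocation_limit_gib` is made up for illustration.

```python
def allocation_limit_gib(hex_value):
    """Convert a hex maxMemoryAllocationSize (in bytes) to GiB."""
    return int(hex_value, 16) / 2**30

limit_bytes = int("0xfffffffc", 16)   # value reported by the driver
requested_bytes = 31539456000         # size from the failed ggml_vulkan allocation

print(limit_bytes)                    # 4294967292
print(2**32 - limit_bytes)            # 4, so the limit is 4 bytes short of 4 GiB
print(requested_bytes > limit_bytes)  # True
print(f"{requested_bytes / limit_bytes:.2f}x over the per-allocation limit")
```

No amount of free unified memory helps here: the request fails because one buffer is larger than the driver allows any single allocation to be.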
ggml's video attention mechanism allocates a single contiguous buffer for the spatial-temporal attention computation. The size of that buffer scales roughly as:
```text
buffer_size ∝ seq²
seq = (width/8) × (height/8) × ((frames-1)/4 + 1)
```
For 832×480 at 33 frames: seq = 104 × 60 × 9 = 56,160 → buffer ~31 GB. For 157 frames: seq = 104 × 60 × 40 = 249,600 → buffer ~587 GB. Even the default 33 frames at the native resolution exceeds the ceiling by more than 7×.
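The formula is simple enough to turn into a quick calculator for comparing candidate settings before submitting a 17-minute job. The Python sketch below rests on two assumptions: frame counts are rounded down to the nearest 4n + 1 (inferred from the "160 to 157" log line, not from the stable-diffusion.cpp source), and the divisions in the formula are integer divisions. It reports only the relative seq² scaling, since the exact bytes-per-element constant is not something this post establishes.

```python
def align_frames(frames):
    """Round down to the nearest 4n + 1 (assumed rule, matches 160 -> 157)."""
    if frames < 1:
        raise ValueError("frames must be >= 1")
    return (frames - 1) // 4 * 4 + 1

def seq_len(width, height, frames):
    """seq = (width/8) * (height/8) * ((frames-1)/4 + 1), with integer division."""
    aligned = align_frames(frames)
    return (width // 8) * (height // 8) * ((aligned - 1) // 4 + 1)

def relative_buffer(width, height, frames, ref=(832, 480, 33)):
    """Ratio of seq^2 against a reference config (default: 832x480 at 33 frames)."""
    return (seq_len(width, height, frames) / seq_len(*ref)) ** 2

configs = [(832, 480, 33), (832, 480, 160), (256, 256, 161), (224, 128, 161), (256, 256, 5)]
for w, h, f in configs:
    print(f"{w}x{h} @ {f} frames: aligned={align_frames(f)} "
          f"seq={seq_len(w, h, f)} relative seq^2={relative_buffer(w, h, f):.4f}")
```

For 224×128 at 161 frames the ratio comes out near 0.11 of the 832×480 at 33 frames case, which is in line with the drop from roughly 31 GB to 3.3 GB. Treat the ratios as rough guidance only; the buffer sizes reported by the pod are the authoritative numbers.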
I dropped resolution and recalculated. The relationship held across every test:
| Resolution | Frames | Buffer needed | Result |
|---|---|---|---|
| 832×480 | 33 | 31 GB | OOM |
| 832×480 | 157 | 587 GB | OOM |
| 256×256 | 161 | 18.9 GB | OOM |
| 224×128 | 161 | 3.3 GB | ✓ |
| 256×256 | 5 | ~44 MB | ✓ |
224×128 at 161 frames fits. Barely — 3.3 GB against a 4 GB ceiling. But it fits.
At some point during the debugging, a document appeared with the heading "Force Vulkan Memory Splitting" and this suggestion:
```sh
GGML_VK_FORCE_CONCURRENT=1 ./sd-cli ...
```
The claim was that this environment variable would split ggml's Vulkan compute allocations across multiple physical buffers, bypassing the 4 GB ceiling. It cited GitHub issues, Reddit threads, StackOverflow — the full bibliography.
I checked:
```sh
strings /app/sd-server | grep -i "FORCE_CONCURRENT\|VK_FORCE\|GGML_VK"
# (only GGML_VULKAN_DEVICE appears)
```
Not in the binary. The string doesn't exist anywhere in the compiled sd-server. The document was AI-generated, and that specific flag was fabricated. The other suggestion in the document — --vae-tiling — is real and confirmed in the sd-server --help output. But the memory splitting flag is fiction.
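Checking a binary for a suggested flag before trusting it is worth turning into a habit. Where `strings` isn't installed in a container, a rough Python equivalent is short. This is a sketch, not a replacement for `strings`: it only looks for runs of printable ASCII, and the function names are made up for illustration.

```python
import re

def printable_strings(data, min_len=4):
    """Yield runs of printable ASCII of at least `min_len` bytes, like `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")

def find_flags(data, pattern):
    """Return the printable strings in `data` matching `pattern`, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted({s for s in printable_strings(data) if rx.search(s)})

def find_flags_in_file(path, pattern):
    """Same as find_flags, but reads the bytes from a file such as a binary."""
    with open(path, "rb") as fh:
        return find_flags(fh.read(), pattern)

# Tiny in-memory stand-in for a binary; a real check would pass the file path.
blob = b"\x00\x7fELF\x01GGML_VULKAN_DEVICE\x00\x02--vae-tiling\x00"
print(find_flags(blob, r"FORCE_CONCURRENT|VK_FORCE"))  # []
print(find_flags(blob, r"vae-tiling"))                 # ['--vae-tiling']
```

An empty list for the suggested flag is the same result as the grep above: if the name never appears in the binary, the program cannot be reading it.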
The actual fix for the 4 GB limit is a ROCm/HIP rebuild of stable-diffusion.cpp (cmake -DSD_HIP=ON). ROCm treats the GPU memory as a unified contiguous pool and doesn't impose Vulkan's per-allocation ceiling. That's the next step. For now: 224×128.
```sh
curl -X POST http://ness-linux3:32100/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "wan22-video",
    "prompt": "a fox running through a forest, cinematic",
    "size": "224x128",
    "video_frames": 161,
    "fps": 16
  }'
```
Submitted. The pod loaded (38 GB VRAM reserved). HighNoise pass: 520 seconds. LowNoise pass: 520 seconds. VAE decode: 11 seconds. Total: 17 minutes and 32 seconds for a 10-second video at 224×128.
```sh
$ ffprobe -v quiet -select_streams v:0 -count_frames \
    -show_entries stream=nb_read_frames,r_frame_rate,width,height \
    -of default=noprint_wrappers=1 /tmp/fox10s_224x128.webm
width=224
height=128
r_frame_rate=16/1
nb_read_frames=161
```
161 frames. 10.06 seconds. The fox runs.
The resolution ceiling is a solvable problem. ness-linux3 has /dev/kfd mounted in all pods — ROCm is already accessible. A Dockerfile.sd-hip with cmake -DSD_HIP=ON and HSA_OVERRIDE_GFX_VERSION=11.0.0 for gfx1151 compatibility should lift the constraint entirely. No BIOS changes, no GRUB changes — ROCm and Vulkan run side by side, the LLM pods don't need to change.
There's also the question of I2V (image-to-video) continuation — generating a sequence of connected clips rather than one low-resolution shot. That requires the Wan2.2-I2V-A14B weights, which aren't downloaded yet. But the plumbing is there.
For now: the fox runs. It runs at 224×128, it takes 17 minutes, and it took most of a day to get there. But it runs.
- sample_params wrapper removed; vid_gen params are now flat top-level fields (commit 0651e30)
- maxMemoryAllocationSize = 4 GB on RADV gfx1151; buffer scales as seq²; practical limit for 10s@16fps is 224×128
- GGML_VK_FORCE_CONCURRENT is not a real flag; do not use it