This is a working proof-of-concept built in 48 hours on AMD MI300X (192 GB VRAM) for the lablab.ai AMD Developer Hackathon. The story below is the whole pitch: vLLM's Automatic Prefix Caching on ROCm makes multi-agent legislative analysis cheap, because 10 specialist agents share one prefill of the bill text.
One MI300X. Four Qwen models plus Wan 2.2 hot at the same time, sharing 192 GB — the entire pipeline (analysis + slide gen + animation + narration + critic) runs on a single GPU with zero model swapping mid-pipeline.
spine Qwen3-30B-A3B-Instruct-2507-FP8 vLLM port 8001 ~35 GB max_model_len=262,144
vision Qwen3-VL-8B-Thinking-FP8 vLLM port 8002 ~9 GB dual-call critic
imagegen Qwen-Image-2512 FP8 + Lightning ComfyUI port 8188 ~22 GB 1280x720 in ~5 s
videogen Wan 2.2 i2v 14B FP8 + LightX2V ComfyUI port 8188 ~30 GB 832x480 81-frame clips
tts Qwen3-TTS-12Hz-1.7B-CustomVoice ComfyUI port 8188 ~5 GB Alex/Jordan voices
combined ~150 / 192 GB VRAM (78.1%, with 40+ GB headroom for KV cache)
backend vllm/vllm-openai-rocm:v0.17.1 on ROCm 7.2.3 + ComfyUI on PyTorch 2.10.0
The spine holds a 250K-token bill chunk in its context. Every specialist agent
hits the same chunk. With --enable-prefix-caching, the chunk is prefilled
once; every additional agent against that chunk pays only its short
instruction overhead. On a 32-80 GB GPU you cannot even fit this stack — the spine alone needs more than a 4090. On smaller GPUs, agents have to spill, re-prefill, or model-swap — so on
consumer hardware this whole pattern doesn't work. 192 GB is the unlock.
| Act | Bill | Pages | Tokens | Chunks |
|---|---|---|---|---|
| I | HR 1 — OBBB Act 2025 (enacted) | 330 | 230,925 | 1 |
| II | HR 5376 — Build Back Better 2021 | 2,468 | 927,292 | 6 |
| III | HR 2670 — FY24 NDAA | 1,236 | 541,309 | 3 |
Smart chunker is title-boundary aware: never splits a TITLE / Subtitle mid-section, falls back to Subtitle-level cuts when a single Title overflows the 200K cl100k-token budget (calibrated for Qwen tokenizer's ~16% inflation over cl100k_base).
First Day-1 measurement on AMD MI300X. Same 99,727-token prefix from HR 1, two different follow-up questions:
{
"model": "spine",
"endpoint": "http://localhost:8001/v1",
"prefix_tokens": 99727,
"request_a_cold": {
"ttft_ms": 51046.1,
"question": "Summarize the first major Title of the bill in two sentences."
},
"request_b_warm": {
"ttft_ms": 542.2,
"question": "List three named entities mentioned in the bill text above."
},
"speedup_ttft": 94.15,
"passes_3x_floor": true,
"passes_5x_target": true
}
That number alone is the elevator pitch. Whatever else this hackathon ships, that ratio is real and reproducible. Source: eval/apc-benchmark-hr1.json.
Cold call: Plain-English Summarizer on BBB-2021 chunk 1 (TITLE I — AGRICULTURE, 232K Qwen tokens, ~$100 B in forest-restoration funding). 5 minutes.
Warm rerun, identical chunk text, same agent: 21.4 seconds. 14.8× wall-clock speedup, no code changes — just APC reusing the KV cache from the first call.
| Run | State | elapsed | prompt_tok | completion_tok |
|---|---|---|---|---|
| Summarizer | cold prefill | 316.4 s | 232,853 | 369 |
| Summarizer (rerun) | APC warm | 21.4 s | 232,853 | 369 |
| USC Cross-Reference | APC partial | ~95 s | 232,994 | ~6,000 |
| Pork Finder | APC partial | 92.4 s | 232,994 | 2,216 |
| Conflict Spotter | APC partial | 270.6 s | 233,019 | 25 |
Sustained Prefix cache hit rate: 68.5% on the spine across all four agents on
the same chunk. This is the core hackathon thesis — 14 agents amortized across one
prefill instead of paying for 14 prefills.
The USC Cross-Reference agent identifies citations the bill makes (e.g. "Section 4(a) of the Food and Nutrition Act of 2008"), then a local LMDB lookup attaches the actual current US Code text to each citation. This stops the LLM from hallucinating about what cited statutes say — the LLM identifies where, the LMDB grounds what.
Five real recovered citations from BBB-2021 chunk 1:
The USC Cross-Reference agent extracts every citation in the bill text, then enriches each
one against a local USC LMDB built from Title 1-54 of the U.S. Code (60K+ sections, ~10 us
per hit). Sub-paragraph notations like 7 U.S.C. 3103(2) are canonicalized down
to their parent section. Multi-chunk bills get their citations concatenated with a chunk
provenance tag so the merged report can attribute every reference to the chunk it came from.
Qwen3-VL-8B-Thinking-FP8 (port 8002) renders a single page of BBB-2021 (page 2222 — a tax bracket schedule for joint filers) and emits structured JSON in 11.4 seconds:
{
"type": "tax_bracket",
"applies_to": "MARRIED INDIVIDUALS FILING JOINT RETURNS AND SURVIVING SPOUSES",
"rows": [{
"lower_bound_usd": 400000,
"upper_bound_usd": 450000,
"tax_rate_pct": 35,
"notes": "plus 35% of the excess over $400,000"
}]
}
All values verified exact against the source bill. The model can see the page, read the structure, and emit JSON. No OCR, no tabula post-processing.
On Day 7 the Qwen+Wan pipeline gained a fifth output mode: a talking-head podcast where two AI hosts (Alex and Jordan) deliver the bill summary on camera. Audio is Qwen3-TTS (Ryan + Ono_anna voices), faces and lipsync come from InfiniteTalk-on-Wan22 (832×480 / 25 fps, ~10 s pair clips), and a brand overlay composites DeadAir Broadcasting cards on top.
The hybrid compositor alternates slide pairs (Wan-animated stills with Qwen-Image visuals) and talking-head pairs (lipsynced avatars), cutting on dialog boundaries. The result is a slide → talking-head → slide → talking-head hybrid the brain can actually follow at podcast pacing, with a lipsync intro/outro setting tone.
BBB hybrid master final-bbb-cloud-hybrid-podcast.mp4
17.2 MB / 3:08 / 17 clips (14 body + 3 brand)
5 slide pairs + 4 talking-head pairs, alternating
-c copy concat (no re-encode) - all clips share
h264 High yuv420p 832x480 @25fps + AAC LC mono 24kHz
Lipsync intro/outro overlays add 14 seconds of DeadAir branding without breaking the audio-aligned cut chain. The same compositor wires all three output modes — a single source script can render slides-only, all-talking-head, or the alternating hybrid — so the right format for the bill (data-heavy => lean slides; vibe-heavy => lean avatars) is a one-line argument change.
Every claim on this page links to a real file in the repo:
docker ps + rocm-smi showing all 3 Qwen containers upRoadmap through Day 7: 10 more specialist agents (Citation Validator, Effective Date Tracker, Stakeholder Tracer, Constitutional/Preemption Analyst, Final Synthesizer); orchestrator that runs them in parallel against one chunk; full-bill end-to-end on BBB-2021 in <8 minutes target time at FP8.