LLM inference (weight streaming)
Streams model weights on demand from HBF, removing the need for the full model to reside in GPU memory.
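A minimal sketch of the idea, in plain Python. Everything here is illustrative: `hbf_read`, `StreamedWeights`, and the layer budget are hypothetical names standing in for an HBF fetch path, not a real API.

```python
from collections import OrderedDict

def hbf_read(layer_id):
    # Placeholder for a near-HBM-latency read of one layer's
    # weights from the HBF tier (hypothetical).
    return f"weights[{layer_id}]"

class StreamedWeights:
    """Keep only a small working set of layers resident on the GPU;
    fetch the rest from HBF on demand, evicting least-recently-used."""

    def __init__(self, budget_layers):
        self.budget = budget_layers      # GPU-resident working-set size
        self.resident = OrderedDict()    # layer_id -> weights, LRU order

    def get(self, layer_id):
        if layer_id in self.resident:
            self.resident.move_to_end(layer_id)    # mark recently used
        else:
            if len(self.resident) >= self.budget:
                self.resident.popitem(last=False)  # evict LRU layer
            self.resident[layer_id] = hbf_read(layer_id)
        return self.resident[layer_id]

# A forward pass touches layers in order; only `budget_layers`
# layers ever occupy GPU memory at once.
store = StreamedWeights(budget_layers=4)
for layer in range(12):
    _ = store.get(layer)
print(len(store.resident))  # 4
```

The point of the sketch: GPU memory use is bounded by the working set, not by model size, which is what lets a node hold more models than its HBM would otherwise allow.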
AI scaling has hit a wall — not compute, but memory. We’re building HBF, a new memory layer between HBM and SSD that delivers near-HBM latency with multi-terabyte capacity. This unlocks ~3x more models per GPU and significantly lowers cost per inference.
GPU memory usage ↓ 30–60% · Models per node ↑ 3–5x · Context length ↑ 10x+
Offloads the KV cache to HBF with deterministic access latency, enabling long-context inference without exhausting GPU memory.
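A toy sketch of position-indexed KV offload, assuming a two-tier store. `OffloadedKVCache`, `gpu_window`, and the `hbf_tier` dict are hypothetical stand-ins for the HBF device, not a real interface.

```python
class OffloadedKVCache:
    """Keep the most recent token positions on the GPU; spill older
    key/value pairs to HBF. Lookups are deterministic because every
    position maps to a known tier, with no searching."""

    def __init__(self, gpu_window):
        self.gpu_window = gpu_window  # recent positions kept on GPU
        self.gpu = {}                 # position -> (key, value)
        self.hbf_tier = {}            # spilled positions (hypothetical HBF)

    def append(self, pos, kv):
        self.gpu[pos] = kv
        # Spill everything that fell outside the recency window.
        for old in [p for p in self.gpu if p <= pos - self.gpu_window]:
            self.hbf_tier[old] = self.gpu.pop(old)

    def get(self, pos):
        # Tier membership is known from the position alone, so access
        # latency is predictable for both hot and cold entries.
        return self.gpu.get(pos) or self.hbf_tier[pos]

cache = OffloadedKVCache(gpu_window=8)
for pos in range(100):
    cache.append(pos, (f"k{pos}", f"v{pos}"))
print(len(cache.gpu), len(cache.hbf_tier))  # 8 92
```

GPU memory stays flat as context grows: the cache holds a fixed window on-device while the full history accumulates in the capacity tier.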
Dynamically loads experts from HBF at inference time, allowing sparse models to scale beyond GPU memory limits.
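The expert-loading pattern can be sketched as follows. The router, `fetch_expert`, and the expert count are all hypothetical: the router here is a toy hash, not a trained gating network, and `fetch_expert` stands in for an HBF read.

```python
def fetch_expert(idx):
    # Placeholder for streaming one expert's weights from HBF.
    return f"expert[{idx}]"

def route(token, n_experts, top_k=2):
    # Toy deterministic router: derive top_k expert indices from the
    # token. A real MoE layer uses a learned gating network instead.
    h = sum(ord(c) for c in token)
    return sorted({(h + i) % n_experts for i in range(top_k)})

N_EXPERTS = 64   # total experts, more than fit in GPU memory at once
loaded = {}      # GPU-resident experts, pulled in on demand

for token in ["the", "cat", "sat"]:
    for idx in route(token, N_EXPERTS):
        if idx not in loaded:
            loaded[idx] = fetch_expert(idx)

# Only the experts the router actually selected were loaded.
print(sorted(loaded))  # [1, 2, 8, 9, 56, 57]
```

Because each token activates only top_k experts, the resident set tracks routing demand rather than total model size, which is what lets sparse models grow past what HBM alone can hold.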