The potential here with High-Bandwidth Flash is super cool. Effectively going from 8 or a dozen flash channels to a hundred or hundreds of channels would be amazing:
> The KAIST professor discussed an HBF unit having a capacity of 512 GB and a 1.638 TBps bandwidth.
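Back-of-the-envelope on what that bandwidth implies for channel count (the per-channel rate below is my assumption, roughly ONFI-class, not from the article):

```python
# Back-of-envelope: how many flash channels does 1.638 TB/s imply?
# The per-channel rate is an assumption (roughly ONFI-class speeds),
# not a figure from the article.
HBF_BANDWIDTH_GBPS = 1_638      # quoted: 1.638 TB/s
PER_CHANNEL_GBPS = 1.6          # assumed per-channel NAND transfer rate

channels = HBF_BANDWIDTH_GBPS / PER_CHANNEL_GBPS
print(f"~{channels:.0f} channels")   # ~1024, vs the ~8-16 on a typical SSD
```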
One weird thing about this is that it's still NAND flash, and NAND flash still has limited write (program/erase) cycles, often measured in the low thousands, with endurance rated as Drive Writes Per Day (DWPD) over a 5-year warranty. If you can load a model & just keep querying it, that's not a problem. Maybe the write volume is small enough to not be so bad, but my gut is that writing context here too might present difficulty.
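To put a number on the worry: a quick sketch, assuming a 3,000 P/E-cycle rating (ballpark TLC, my assumption) and the 512 GB / 1.638 TB/s unit quoted above:

```python
# How long would a 512 GB HBF unit last if written at its full 1.638 TB/s?
# Assumes a 3,000 P/E cycle rating (ballpark TLC -- my assumption) and
# ignores write amplification, which only makes this worse.
CAPACITY_GB = 512
PE_CYCLES = 3_000
WRITE_RATE_GBPS = 1_638                   # 1.638 TB/s expressed in GB/s

lifetime_writes_gb = CAPACITY_GB * PE_CYCLES          # ~1.5 PB total writes
seconds = lifetime_writes_gb / WRITE_RATE_GBPS
print(f"~{seconds / 60:.0f} minutes at full write rate")   # ~16 minutes
```

So a read-mostly workload (model weights) is fine; streaming context out at anything close to full bandwidth would chew through the part's endurance in minutes.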
I assume the use case is that you are an inference provider, and you put a bunch of models you might want to serve in the HBF to be able to quickly swap them in and out on demand.
I think the hope is to run directly off of HBF, eventually replacing RAM with it entirely. 1.5TB/s is a pretty solid number! It's not going to be easy; it doesn't just drop in as a replacement (the latency is vastly higher), but HBF replacing HBM for gobs of bandwidth is the intent, I believe.
Kioxia & Nvidia are already talking about 100M-IOPS SSDs directly attached to GPUs. This is less about running the model and more about offloading context for future use, but Nvidia is pushing KV cache to SSD. And they're using BlueField-4, which has PCIe on it, to attach SSDs and process there. https://blocksandfiles.com/2025/09/15/kioxia-100-million-iop... https://blocksandfiles.com/2026/01/06/nvidia-standardizes-gp... https://developer.nvidia.com/blog/introducing-nvidia-bluefie...
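For flavor, here's the shape of KV-cache offload as I understand it, not Nvidia's actual stack; the block geometry and file paths below are invented:

```python
# Sketch of KV-cache offload to flash: evict cold KV blocks as big
# sequential writes, then page them back when a session resumes so prefill
# doesn't have to recompute them. Geometry and paths are made up.
import numpy as np

N_LAYERS, N_HEADS, BLOCK_TOKENS, HEAD_DIM = 32, 8, 256, 128
BLOCK_SHAPE = (N_LAYERS, 2, N_HEADS, BLOCK_TOKENS, HEAD_DIM)  # 2 = K and V

def spill_block(block: np.ndarray, path: str) -> None:
    """Evict one KV block to an NVMe-backed file (sequential, flash-friendly)."""
    block.astype(np.float16).tofile(path)

def reload_block(path: str) -> np.ndarray:
    """Page a spilled KV block back in on a cache hit."""
    return np.fromfile(path, dtype=np.float16).reshape(BLOCK_SHAPE)

kv = np.random.rand(*BLOCK_SHAPE).astype(np.float16)
spill_block(kv, "/nvme/session-42.block0.kv")            # hypothetical path
assert np.array_equal(reload_block("/nvme/session-42.block0.kv"), kv)
```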
We've already got DeepSeek running straight off NVMe, with the weights living there. Slowly, but maybe this could scale. https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepsee...
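The trick there is mmap: the OS pages weights in from the SSD on first touch rather than loading the whole model into RAM. A minimal sketch of the pattern (llama.cpp's mmap mode is the real thing; the file name and shape here are invented):

```python
# Weights live on NVMe; the matmul faults pages in from the SSD on demand
# instead of preloading everything into RAM. File name/shape hypothetical.
import numpy as np

W = np.memmap("/nvme/weights.bin", dtype=np.float16,
              mode="r", shape=(4096, 4096))   # one hypothetical weight matrix

x = np.ones(4096, dtype=np.float16)
y = W @ x   # pages of W are read from flash as the multiply touches them
```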
Kioxia, for example, has AiSAQ, which works in a couple of places such as Milvus; it's not 100% clear to me exactly what's going on there, but it's trying to push work to the NVMe device. And with NVMe 2.1 having computational storage, I expect we'll see more work pushed to the SSD.
These aren't directly the same thing as HBF. A lot of it is caching, but also, I tend to think there's an aspiration of moving some work out of RAM, not merely being able to load into RAM faster.
Flash has limited write cycles. The faster you write, the faster it wears out. How do you overcome that?
They will probably use a simpler, more direct protocol than NVMe.
Can't wait to have another cache layer I have to think about.
Now I understand why NVMe flash drive prices have rocketed to triple their normal level in the last few months! The AI hyperscalers aren't just sucking up the wafer runs for memory; they're also monopolising the wafers for SSDs.
Sam Altman bought 40% of the world's supply of DRAM in an underhanded, secret deal with two large manufacturers. It will take years for supply to recover.
The best part is the wafers are being bought with no plans to use them; just to keep them in storage so that competition cannot easily access RAM. Supervillain shit, should have been the last straw for PG to publicly denounce Sam and for OpenAI to be sued by the US government for anticompetitive practices. All this does is harm the consumer. Of course that is never going to happen.
He didn't actually buy it, nor does he have the money to. He just "committed" to buying it at a later date to disrupt the supply chain for his competitors. It's scams all the way down.
Yeah, good clarification. But the deal is made nonetheless; for the time being we have to expect it to be carried out and act accordingly.
China also buys raw materials (metals, ...) with no intention of using them immediately
but they do so because they prefer holding commodities to US dollars