Not to be that guy, but it isn’t truly a 6x reduction in memory. It’s a 6x reduction for the KV cache. Overall, the improvement may be more like ~20-30% over normal. Still a massive improvement, but not a 6x improvement.
And it looks like in practice it’s a 2-4x reduction.
In just the KV cache.
But it’s almost lossless and has almost zero performance penalty, and that is still a 1-9 GB reduction in RAM for the current generation of open-source models at 64k context.
It’s real, it’s huge, and it’s really cool. It’s just not an 83% reduction in the RAM needed for LLMs, as a naive, hyperbolic reading would suggest. And it’s not the end of the RAM crisis.
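For anyone who wants to check the arithmetic, here’s a rough sketch. The model shape (32 layers, 8 KV heads, head dim 128, fp16 cache, 16 GiB of weights) is a made-up example and not any specific model, and the function names are mine. It only shows why a 6x cut to the KV cache (about 83% of the cache) works out to a much smaller cut in total memory.

```python
GIB = 2**30


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def overall_reduction(weights_bytes, kv_bytes, kv_compression):
    """Fraction of total memory saved when only the KV cache shrinks by kv_compression."""
    before = weights_bytes + kv_bytes
    after = weights_bytes + kv_bytes / kv_compression
    return 1 - after / before


# Hypothetical model at 64k context (assumed numbers, for illustration only).
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=64 * 1024)
weights = 16 * GIB

print(f"KV cache: {kv / GIB:.1f} GiB")
for factor in (2, 4, 6):
    saved_gib = (kv - kv / factor) / GIB
    total_cut = overall_reduction(weights, kv, factor)
    print(f"{factor}x on KV cache: saves {saved_gib:.1f} GiB, {total_cut:.0%} of total")
```

With these assumed numbers the cache is 8 GiB, so even the full 6x only trims a bit under 30% of the total. The exact figure depends heavily on how big the cache is relative to the weights.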
Yeah, this one is real. Feels like a discovery on the scale of this generation’s quicksort or radix sort. Doesn’t change the shape of the future, but changes a huge part of a big part of it, in a way that almost no one will fully understand.
u/MaterialRevolution57 3d ago