VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

Mar 5, 2025
Zihan Liu
,
Xinhao Luo
,
Junxian Guo
,
Wentao Ni
,
Yangjie Zhou
,
Yue Guan
,
Cong Guo
,
Weihao Cui
,
Yu Feng
,
Minyi Guo
,
Yuhao Zhu
,
Minjia Zhang
,
Jingwen Leng
,
Chen Jin
Abstract
In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called the codebook cache, which optimizes codebook access efficiency and supports the integration of VQ with various computations. The codebook cache adaptively stores different entries across the GPU’s memory hierarchy, including off-chip global memory, on-chip shared memory, and registers. Centered around the codebook cache, we design an efficient computation engine that optimizes memory traffic during computations involving codebooks. This compute engine adopts a codebook-centric dataflow together with fusion optimizations. Additionally, we provide adaptive heuristics to tailor parameter selection in our optimizations to diverse VQ configurations. Our optimizations achieve an average latency reduction of 46.13% compared to unoptimized versions. Compared to existing open-source implementations, our methods decrease latency by 64.36% to 99.1%. A final comparison with state-of-the-art element-wise quantization methods such as AWQ and KVQuant shows that VQ-LLM is practically viable, achieving latencies close to, or even better than, those at equivalent bit-widths while potentially offering greater accuracy.
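To make the codebook-cache idea concrete, the sketch below shows one simple way a VQ dequantization kernel can stage codebook entries in on-chip shared memory and fuse a lightweight computation into the lookup. This is a minimal illustration only, not the VQ-LLM implementation; all names, sizes, and the fused scaling step (`dequant_fused`, `ENTRY_DIM`, `NUM_ENTRIES`, `scale`) are hypothetical assumptions for the example.

```cuda
// Illustrative sketch (not the VQ-LLM code): stage the codebook in shared
// memory once per block, then dequantize codes with a fused scale so the
// intermediate dequantized tensor is never written to global memory.
#include <cuda_runtime.h>
#include <cstdint>

constexpr int ENTRY_DIM   = 8;    // elements per codebook entry (assumed)
constexpr int NUM_ENTRIES = 256;  // 8-bit codes -> 256 entries (assumed)

__global__ void dequant_fused(const uint8_t* __restrict__ codes,
                              const float*   __restrict__ codebook,
                              float*         __restrict__ out,
                              int num_codes, float scale) {
    // Cooperative load: every thread copies part of the codebook into
    // shared memory, so later lookups avoid repeated global-memory reads.
    __shared__ float cb[NUM_ENTRIES * ENTRY_DIM];
    for (int i = threadIdx.x; i < NUM_ENTRIES * ENTRY_DIM; i += blockDim.x)
        cb[i] = codebook[i];
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_codes) return;

    // Each code selects one ENTRY_DIM-wide vector; the scale is fused into
    // the copy rather than applied in a separate pass.
    const float* entry = &cb[codes[idx] * ENTRY_DIM];
    #pragma unroll
    for (int d = 0; d < ENTRY_DIM; ++d)
        out[idx * ENTRY_DIM + d] = entry[d] * scale;
}
```

In the paper's terms, this corresponds to the simplest placement choice (the whole codebook in shared memory); the actual framework additionally decides, per configuration, which entries belong in registers, shared memory, or global memory, and which downstream computation to fuse.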
Type
Publication
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2