VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

Mar 5, 2025
Zihan Liu
,
Xinhao Luo
,
Junxian Guo
,
Wentao Ni
,
Yangjie Zhou
,
Yue Guan
,
Cong Guo
,
Weihao Cui
,
Yu Feng
,
Minyi Guo
,
Yuhao Zhu
,
Minjia Zhang
,
Jingwen Leng
,
Chen Jin
Abstract
In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called the codebook cache, which optimizes codebook access efficiency and supports the integration of VQ with various computations. The codebook cache adaptively stores different entries across the GPU’s memory hierarchy, including off-chip global memory, on-chip shared memory, and registers. Centered around the codebook cache, we design an efficient computation engine that optimizes memory traffic during computations involving codebooks. This compute engine adopts a codebook-centric dataflow together with fusion optimizations. Additionally, we provide adaptive heuristics to tailor parameter selection in our optimizations to diverse VQ configurations. Our optimizations achieve an average latency reduction of 46.13% compared to unoptimized versions. Compared to existing open-source implementations, our methods decrease latency by 64.36% to 99.1%. A final comparison with state-of-the-art element-wise quantization methods such as AWQ and KVQuant shows that VQ-LLM is practically viable, achieving latencies close to, or even better than, those at equivalent bit-widths while potentially offering greater accuracy.
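To make the codebook-cache idea concrete, the sketch below shows one simple way a VQ dequantization kernel can stage codebook entries in on-chip shared memory and fuse a lightweight computation into the lookup. This is a minimal illustration only, not the VQ-LLM implementation; all names, sizes, and the fused scaling step (`dequant_fused`, `ENTRY_DIM`, `NUM_ENTRIES`, `scale`) are hypothetical assumptions for the example.

```cuda
// Illustrative sketch (not the VQ-LLM code): stage the codebook in shared
// memory once per block, then dequantize codes with a fused scale so the
// intermediate dequantized tensor is never written to global memory.
#include <cuda_runtime.h>
#include <cstdint>

constexpr int ENTRY_DIM   = 8;    // elements per codebook entry (assumed)
constexpr int NUM_ENTRIES = 256;  // 8-bit codes -> 256 entries (assumed)

__global__ void dequant_fused(const uint8_t* __restrict__ codes,
                              const float*   __restrict__ codebook,
                              float*         __restrict__ out,
                              int num_codes, float scale) {
    // Cooperative load: every thread copies part of the codebook into
    // shared memory, so later lookups avoid repeated global-memory reads.
    __shared__ float cb[NUM_ENTRIES * ENTRY_DIM];
    for (int i = threadIdx.x; i < NUM_ENTRIES * ENTRY_DIM; i += blockDim.x)
        cb[i] = codebook[i];
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_codes) return;

    // Each code selects one ENTRY_DIM-wide vector; the scale is fused into
    // the copy rather than applied in a separate pass.
    const float* entry = &cb[codes[idx] * ENTRY_DIM];
    #pragma unroll
    for (int d = 0; d < ENTRY_DIM; ++d)
        out[idx * ENTRY_DIM + d] = entry[d] * scale;
}
```

In the paper's terms, this corresponds to the simplest placement choice (the whole codebook in shared memory); the actual framework additionally decides, per configuration, which entries belong in registers, shared memory, or global memory, and which downstream computation to fuse.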
Type
Publication
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2