ggml: add Vulkan and SYCL backends for TQ3_0 KV cache quantization #7
Open
metalchef1 wants to merge 1 commit into
Add hardware-accelerated TQ3_0 support for Vulkan and SYCL backends, enabling TQ3_0 KV cache on Intel Arc GPU with 58% VRAM savings vs q8_0.

Vulkan backend:
- types.glsl: add block_tq3_0_packed16 struct and A_TYPE_PACKED16 macro
- dequant_funcs.glsl: add dequantize/dequantize4/get_dm for TQ3_0 using 32-point Walsh-Hadamard Transform with Max-Lloyd centroids
- copy_to_quant.comp: TQ3_0 quantization path (F32 -> TQ3_0)
- dequant_tq3_0.comp: standalone dequantize shader (TQ3_0 -> F32)
- flash_attn_base.glsl: TQ3_0 dequantize4 for flash attention K/V path
- vulkan-shaders-gen.cpp: compile TQ3_0 shaders (cpy, dequant, mul_mat_vec, get_rows, flash_attn scalar path)
- ggml-vulkan.cpp: pipeline creation for cpy/dequant/mul_mat_vec/flash_attn, supports_op entries for CPY/DUP/CONT/MUL_MAT/FLASH_ATTN_EXT

SYCL backend:
- dequantize.hpp: TQ3_0 WHT dequantize kernel
- dmmv.cpp: dot-matrix-vector kernel for TQ3_0 x F32
- convert.cpp: F32 -> TQ3_0 and TQ3_0 -> F32 copy kernels
- ggml-sycl.cpp: register TQ3_0 pipelines

Shared:
- ggml-common.h: add QR_TQ3_0 = 1 macro
- llama-context.cpp: enable flash_attn for TQ3_0 KV cache (was force-disabled)

Tested on Intel Arc Pro B60 (24GB) with Qwen3.6-35B-A3B-IQ4_XS:
  ctx=32768, K+V q8_0: 340 MiB
  ctx=32768, K+V tq3_0: 140 MiB (58.8% reduction, correct outputs verified)

TQ3_0 uses flash attention (FLASH_ATTN_EXT) since the K cache is accessed non-contiguously. Requires --cache-type-v tq3_0 alongside --cache-type-k tq3_0.
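For reviewers unfamiliar with the scheme, the description mentions a 32-point Walsh-Hadamard Transform combined with Max-Lloyd centroids. The following is a minimal CPU-side C++ sketch of that general idea, not the shader or kernel code from this PR. Everything specific in it is an assumption made for illustration: the centroid table (the textbook 8-level Lloyd-Max levels for a unit Gaussian), the RMS-based scale, the unpacked index storage, and the names `fwht32`, `BlockSketch`, `quantize_block` and `dequantize_block`. The actual TQ3_0 block layout and constants may differ.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Block size taken from the PR text ("32-point Walsh-Hadamard Transform").
constexpr std::size_t kBlock = 32;

// ASSUMPTION: textbook 8-level Lloyd-Max reconstruction levels for a unit
// Gaussian. The centroid table actually used by TQ3_0 is not shown in the PR.
constexpr std::array<float, 8> kCentroids = {
    -2.1520f, -1.3439f, -0.7560f, -0.2451f,
     0.2451f,  0.7560f,  1.3439f,  2.1520f};

// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(32) so that it is
// orthonormal and therefore its own inverse.
inline void fwht32(std::array<float, kBlock>& v) {
    for (std::size_t h = 1; h < kBlock; h *= 2) {
        for (std::size_t i = 0; i < kBlock; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j] = a + b;      // butterfly: sum
                v[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float norm = 1.0f / std::sqrt(static_cast<float>(kBlock));
    for (float& x : v) x *= norm;
}

// Index of the centroid nearest to x (ties resolve to the lower index).
inline std::uint8_t nearest_centroid(float x) {
    std::uint8_t best = 0;
    for (std::uint8_t k = 1; k < kCentroids.size(); ++k) {
        if (std::fabs(x - kCentroids[k]) < std::fabs(x - kCentroids[best])) {
            best = k;
        }
    }
    return best;
}

// HYPOTHETICAL block: one float scale plus 32 unpacked 3-bit indices.
struct BlockSketch {
    float scale;
    std::array<std::uint8_t, kBlock> idx;
};

// Quantize: rotate with the WHT, normalize by the block RMS, snap to centroids.
inline BlockSketch quantize_block(std::array<float, kBlock> x) {
    fwht32(x);
    float sum_sq = 0.0f;
    for (float c : x) sum_sq += c * c;
    BlockSketch b{};
    b.scale = std::sqrt(sum_sq / static_cast<float>(kBlock));
    for (std::size_t i = 0; i < kBlock; ++i) {
        const float normalized = b.scale > 0.0f ? x[i] / b.scale : 0.0f;
        b.idx[i] = nearest_centroid(normalized);
    }
    return b;
}

// Dequantize: look up centroids, rescale, then apply the inverse WHT
// (the same function, because the transform is orthonormal).
inline std::array<float, kBlock> dequantize_block(const BlockSketch& b) {
    std::array<float, kBlock> y{};
    for (std::size_t i = 0; i < kBlock; ++i) {
        y[i] = kCentroids[b.idx[i]] * b.scale;
    }
    fwht32(y);
    return y;
}
```

The orthonormal scaling is a design choice in this sketch: it lets one routine serve as both forward and inverse transform, which is convenient when the dequantize step has to be inlined into several kernels (copy, mat-vec, flash attention).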
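The reported reduction can be cross-checked by hand from the two measurements above. q8_0 stores 32 int8 values plus one fp16 scale per block (34 bytes, i.e. 8.5 bits per value). The bits-per-value figure derived for TQ3_0 below is an inference from the measured sizes, not a statement of the actual block layout:

$$
1 - \frac{140}{340} \approx 0.588,
\qquad
8.5 \times \frac{140}{340} = 3.5 \ \text{bits per value}
$$

A rate of 3.5 bits per value would be consistent with, for example, 3-bit indices plus one 16-bit scale per 32-element block, since $(32 \cdot 3 + 16)/32 = 3.5$, but the description does not state the layout.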