Burn is a next-generation tensor library and deep learning framework that doesn't compromise on flexibility, efficiency, or portability.
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
An efficient concurrent graph processing system
GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving
Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
LAMB go brrr
MLX + Metal implementation of mHC: Manifold-Constrained Hyper-Connections by DeepSeek-AI.
Assignment 3 for the "Parallel & Distributed Systems" course (ECE, AUTh) - Fall 2024
Noeris — autonomous kernel fusion discovery and Triton autotuning for LLM kernels, with deeper fusion of Gemma layers (A100/H100 wins).
Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, and FP8 GEMM.
Compile-time kernel fusion and expression trees as an Alpaka backend for boost.odeint. This is my team project, developed in collaboration with and under the supervision of HZDR.
High-Performance WebGPU Deep Learning Inference Engine: Zero Dependencies, Hand-Written WGSL Shaders, Kernel Fusion
Production-grade Triton kernel fusing residual add + RMSNorm + packed QKV projection into a single GPU launch for decoder-only transformer inference (Llama-3, Mistral, Qwen2). +2.4% tok/s, -1.5 GB VRAM on A10G.
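For orientation, the math that a residual-add + RMSNorm fusion collapses into one launch can be sketched as an unfused NumPy reference (names like `residual_rmsnorm` are illustrative, not taken from any of the repositories above):

```python
import numpy as np

def residual_rmsnorm(x, residual, weight, eps=1e-6):
    # Unfused reference for what a fused kernel computes in one launch:
    # 1) residual add, 2) RMS normalization, 3) learned elementwise scale.
    h = x + residual
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return (h / rms) * weight, h  # fused kernels also emit h for the next residual
```

A fused GPU kernel performs these steps in registers, avoiding two extra round-trips of `h` through global memory.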
High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.
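As a reference for what such a fused LayerNorm replaces, here is a minimal NumPy sketch of the `nn.LayerNorm` computation (illustrative only; the CUDA kernel above performs the same math with vectorized loads and warp-level reductions):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Reference LayerNorm over the last dimension, matching nn.LayerNorm:
    # normalize to zero mean / unit variance, then apply affine gamma, beta.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```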