Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.
Updated Mar 19, 2026 · Python
Production-Grade Autoresearch. Ideal for GPU kernels, ML model development, feature engineering, prompt engineering, and other optimizable code.
Custom Linux kernels purpose-built for Apple Mac hardware
Forge: Swarm Agents That Turn Slow PyTorch Into Fast CUDA/Triton Kernels
Open-source skill library for AI coding agents to write, optimize, and debug high-performance compute kernels across CUDA, Triton, and quantized workloads.
Automatic Triton kernel generation and optimization for Intel GPU, powered by Claude Code.
Extended TileLang as a unified DSL to enable high-performance kernel development for Near-Memory Computing, Distributed Memory AI Accelerators, and Networked Accelerators.
Noeris — autonomous kernel-fusion discovery and Triton autotuning for LLM kernels, with deeper fusion of Gemma layers (A100/H100 wins).
Enable autonomous AI agents to optimize LLM training code through iterative experiments, improving models overnight without manual intervention.
A collection of high-performance CUDA kernels and experiments for learning and optimizing GPU compute primitives.
🎓 CUDA HPC Kernel Optimization Lab: Progressive GEMM, FlashAttention, Tensor Core & CUDA 13 Features | A CUDA high-performance kernel optimization lab, from naive kernels to Tensor Cores
⚡ LLM-Speed: High-performance CUDA kernels for LLM inference — FlashAttention with O(N) memory, Tensor Core GEMM (95% cuBLAS), and seamless PyTorch integration. Supports Volta to Hopper GPUs.
HPC optimization and performance analysis of a compute kernel for the GCC compiler, using MAQAO.
The Ultimate Kernel Orchestration Suite for Windows. Optimized for low-latency development and high-priority workloads.
Optimized Ubuntu Touch for Lenovo Tab M8 HD (TB-8505F) - Kernel improvements, performance tuning, boot experience, and system optimizations for the MediaTek Helio A22 tablet
10,000-image LeNet-5 forward pass in ~28 ms on a single A40 via fused convolution and Tensor Cores (TF32).
Benchmark plots and performance analysis for upstream pto-kernels development
Optimize PyTorch GPU kernels by autonomously profiling, extracting, and improving Triton or CUDA C++ code for better performance and efficiency.
Skill pack for custom PyTorch MPS kernels on Apple Silicon (examples, tests, and optimization patterns).
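Several entries above (the progressive GEMM lab, the Tensor Core GEMM kernels) follow the same optimization ladder: start from a naive triple-loop matrix multiply, then block the computation into tiles for cache locality — the same tiling idea that shared-memory GEMM kernels in CUDA and Triton build on. A minimal NumPy sketch of that first step, purely illustrative and not taken from any listed repository:

```python
import numpy as np

def naive_matmul(A, B):
    """Naive triple-loop GEMM: computes one output element at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

def tiled_matmul(A, B, tile=32):
    """Blocked GEMM: accumulates tile x tile sub-blocks of C.

    Each (i0, j0) output tile is built from a sum over k0 of small
    sub-block products, which keeps the working set cache-resident.
    NumPy slicing past the array bounds is safe, so ragged edge
    tiles are handled automatically.
    """
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

# Both variants agree with the reference result.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 80))
assert np.allclose(naive_matmul(A, B), A @ B)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

On a GPU, the same blocking maps tiles to thread blocks (CUDA) or program instances (Triton), with the sub-blocks staged in shared memory or registers.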