Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.
Updated Mar 19, 2026 · Python
Production-Grade Autoresearch. Ideal for GPU kernels, ML model development, feature engineering, prompt engineering, and other optimizable code.
Custom Linux kernels purpose-built for Apple Mac hardware
Forge: Swarm Agents That Turn Slow PyTorch Into Fast CUDA/Triton Kernels
Open source skill library for AI coding agents to write, optimize, and debug high performance compute kernels across CUDA, Triton, and quantized workloads.
Automatic Triton kernel generation and optimization for Intel GPU, powered by Claude Code.
Extended TileLang as a unified DSL to enable high-performance kernel development for Near-Memory Computing, Distributed Memory AI Accelerators, and Networked Accelerators.
Noeris: autonomous kernel fusion discovery and Triton autotuning for LLM kernels, with deeper fusion of Gemma layers (A100/H100 wins).
Learn Triton by building FlashAttention from scratch with educational kernels, benchmarks, and bilingual docs
Enable autonomous AI agents to optimize LLM training code through iterative experiments, improving models overnight without manual intervention.
A collection of high-performance CUDA kernels and experiments for learning and optimizing GPU compute primitives.
CUDA kernel optimization lab for GEMM, FlashAttention, quantization, and GPU performance learning.
Optimized Ubuntu Touch for Lenovo Tab M8 HD (TB-8505F) - Kernel improvements, performance tuning, boot experience, and system optimizations for the MediaTek Helio A22 tablet
The Ultimate Kernel Orchestration Suite for Windows. Optimized for low-latency development and high-priority workloads.
10,000-image LeNet-5 forward pass in ~28 ms on a single A40 via fused convolution and Tensor Cores (TF32).
Systematic CUDA kernel engineering from SGEMM fundamentals to reusable kernels, advanced optimization experiments, and lightweight inference components. https://lessup.github.io/cuda-kernel-academy/
HPC optimization and performance analysis of a compute kernel for the GCC compiler using MAQAO.
Benchmark plots and performance analysis for upstream pto-kernels development
Optimize PyTorch GPU kernels by autonomously profiling, extracting, and improving Triton or CUDA C++ code for better performance and efficiency.
Skill pack for custom PyTorch MPS kernels on Apple Silicon (examples, tests, and optimization patterns).