Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.
Updated Mar 19, 2026 · Python
Production-Grade Autoresearch. Ideal for GPU kernels, ML model development, feature engineering, prompt engineering, and other optimizable code.
Custom Linux kernels purpose-built for Apple Mac hardware
Forge: Swarm Agents That Turn Slow PyTorch Into Fast CUDA/Triton Kernels
Open-source skill library for AI coding agents to write, optimize, and debug high-performance compute kernels across CUDA, Triton, and quantized workloads.
Automatic Triton kernel generation and optimization for Intel GPU, powered by Claude Code.
Extended TileLang as a unified DSL to enable high-performance kernel development for Near-Memory Computing, Distributed Memory AI Accelerators, and Networked Accelerators.
Noeris — autonomous kernel-fusion discovery and Triton autotuning for LLM kernels, with deeper fusion of Gemma layers (A100/H100 wins).
Enable autonomous AI agents to optimize LLM training code through iterative experiments, improving models overnight without manual intervention.
A collection of high-performance CUDA kernels and experiments for learning and optimizing GPU compute primitives.
🎓 CUDA HPC Kernel Optimization Lab: Progressive GEMM, FlashAttention, Tensor Core & CUDA 13 Features | A CUDA high-performance kernel optimization lab, from naive kernels to Tensor Cores
⚡ LLM-Speed: High-performance CUDA kernels for LLM inference — FlashAttention with O(N) memory, Tensor Core GEMM (95% cuBLAS), and seamless PyTorch integration. Supports Volta to Hopper GPUs.
HPC optimization and performance analysis of a compute kernel for the GCC compiler, using MAQAO.
The Ultimate Kernel Orchestration Suite for Windows. Optimized for low-latency development and high-priority workloads.
Optimized Ubuntu Touch for Lenovo Tab M8 HD (TB-8505F) - Kernel improvements, performance tuning, boot experience, and system optimizations for the MediaTek Helio A22 tablet
10,000-image LeNet-5 forward pass in ~28 ms on a single A40 via fused convolution and Tensor Cores (TF32).
Benchmark plots and performance analysis for upstream pto-kernels development
Optimize PyTorch GPU kernels by autonomously profiling, extracting, and improving Triton or CUDA C++ code for better performance and efficiency.
Skill pack for custom PyTorch MPS kernels on Apple Silicon (examples, tests, and optimization patterns).
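Several entries above (the progressive GEMM lab, the Tensor Core GEMM kernels) follow the same optimization ladder: start from a naive triple-loop matrix multiply, then block the computation into tiles for cache locality — the same tiling idea that shared-memory GEMM kernels in CUDA and Triton build on. A minimal NumPy sketch of that first step, purely illustrative and not taken from any listed repository:

```python
import numpy as np

def naive_matmul(A, B):
    """Naive triple-loop GEMM: computes one output element at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

def tiled_matmul(A, B, tile=32):
    """Blocked GEMM: accumulates tile x tile sub-blocks of C.

    Each (i0, j0) output tile is built from a sum over k0 of small
    sub-block products, which keeps the working set cache-resident.
    NumPy slicing past the array bounds is safe, so ragged edge
    tiles are handled automatically.
    """
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

# Both variants agree with the reference result.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 80))
assert np.allclose(naive_matmul(A, B), A @ B)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

On a GPU, the same blocking maps tiles to thread blocks (CUDA) or program instances (Triton), with the sub-blocks staged in shared memory or registers.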