
kernel-optimization

Here are 19 public repositories matching this topic.

🎓 CUDA HPC Kernel Optimization Lab: Progressive GEMM, FlashAttention, Tensor Core & CUDA 13 Features | A CUDA high-performance kernel optimization lab, from naïve kernels to Tensor Cores

  • Updated Apr 22, 2026
  • Cuda
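The "progressive GEMM" ladder this lab describes typically starts by moving from a naïve triple loop to a blocked (tiled) multiply, so that each tile of the inputs is reused while it is resident in fast memory (shared memory on a GPU, cache on a CPU). A minimal sketch of that first step, in plain Python for readability rather than the repo's actual CUDA code (function names and the tile size `T` are illustrative):

```python
def matmul_naive(A, B):
    """Naive GEMM: C[i][j] = sum_p A[i][p] * B[p][j], no data reuse."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C

def matmul_tiled(A, B, T=2):
    """Blocked GEMM: loop over T x T tiles so each tile of A and B is
    reused across many output elements before moving on. On a GPU this
    is where a tile would be staged in shared memory."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for p0 in range(0, k, T):            # walk the K dimension in tiles
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + T, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```

Both functions compute the same product; the tiled version only reorders the loop nest, which is the property later CUDA stages (shared-memory staging, register blocking, Tensor Core MMA) build on.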

⚡ LLM-Speed: High-performance CUDA kernels for LLM inference — FlashAttention with O(N) memory, Tensor Core GEMM reaching 95% of cuBLAS throughput, and seamless PyTorch integration. Supports Volta through Hopper GPUs.

  • Updated Apr 22, 2026
  • Python
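The O(N)-memory claim for FlashAttention comes from the online (streaming) softmax: instead of materializing the full N×N score matrix, each output row is built with a running maximum, running denominator, and running weighted sum that are rescaled as new keys arrive. A minimal single-row sketch of that trick in plain Python (the repo implements it as fused CUDA kernels; names here are illustrative):

```python
import math

def attention_row_online(q, K, V):
    """One attention output row via online softmax.
    State per row is O(1): running max `m`, running denominator `s`,
    and running value accumulator `o` — never the full score matrix."""
    d = len(q)
    m = -math.inf          # running max of scores seen so far
    s = 0.0                # running softmax denominator
    o = [0.0] * len(V[0])  # running weighted sum of value rows
    for k_vec, v_vec in zip(K, V):
        score = sum(qi * ki for qi, ki in zip(q, k_vec)) / math.sqrt(d)
        m_new = max(m, score)
        scale = math.exp(m - m_new)  # rescale old state to the new max
        w = math.exp(score - m_new)
        s = s * scale + w
        o = [oi * scale + w * vi for oi, vi in zip(o, v_vec)]
        m = m_new
    return [oi / s for oi in o]
```

The result matches the standard softmax(QKᵀ/√d)·V row exactly, because rescaling by `exp(m - m_new)` keeps all partial terms referenced to a common maximum, which is also what makes the computation numerically stable.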
