SJTU-IPADS/kunserve

KunServe

This repository hosts KunServe (Python orchestration, Ray, scheduling experiments) and FlashTransformer (high-performance dense Transformer inference kernels and model execution in C++/CUDA), used together for LLM serving research and reproducible evaluation.

Disclaimer

  • Academic and evaluation focus. The stack is intended for research, benchmarking, and prototyping, not as a drop-in replacement for commercial managed inference. Operability, security hardening, and long-term API stability are not goals of this release.

  • Dense models. The supported code paths currently target dense architectures. To add a new architecture, start from the existing Llama-oriented implementation under FlashTransformer/src/csrc/model/sota/ and extend it.

  • Upstream lineage. The C++ backend was forked and renamed to FlashTransformer in this tree; Python tooling is branded KunServe. Historical ties to earlier research systems are acknowledged in kunserve/README.md.

Quick start

git clone <this-repository> kunserve && cd kunserve

# Submodules: FlashTransformer is declared in the repo-root `.gitmodules`.
# Nested deps (flash-attention, flashinfer) stay in `FlashTransformer/.gitmodules`
# with paths relative to that directory — use recursive update from the root:
git submodule update --init --recursive

# Conda environment (see environment.yml at repo root)
conda env create -f environment.yml && conda activate kunserve

# Build FlashTransformer (CUDA toolchain required)
cmake -S FlashTransformer -B FlashTransformer/build -DBUILD_MODE=RELEASE && cmake --build FlashTransformer/build -j"$(nproc)"

pip install -e .

After a successful build you should have libflash_pybinding.so under FlashTransformer/build/lib (exact layout may depend on your CMake preset).
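A quick way to confirm the build output is to search the build tree for the binding. The following is a minimal sketch, not part of the repository: the helper name `find_pybinding` is hypothetical, and it searches recursively on the assumption that the output directory may differ between CMake presets.

```python
from pathlib import Path


def find_pybinding(build_dir, name="libflash_pybinding.so"):
    """Return every file under build_dir whose file name equals `name`."""
    root = Path(build_dir)
    if not root.is_dir():
        # Build directory is missing, so nothing can have been built there.
        return []
    return sorted(p for p in root.rglob(name) if p.is_file())


if __name__ == "__main__":
    # Path taken from the quick-start build command; adjust if you built elsewhere.
    hits = find_pybinding("FlashTransformer/build")
    if hits:
        for path in hits:
            print(f"found: {path}")
    else:
        print("libflash_pybinding.so not found; check the CMake output")
```

Run it from the repository root after the CMake build. An empty result usually means the build failed or wrote to a different directory.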

The Git submodule URL for FlashTransformer is set in .gitmodules to the FlashTransformer project name. If your hosting still serves the repository under an older remote name, override it locally, for example: git config submodule.FlashTransformer.url <your-clone-url>.
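Before overriding the URL, it can help to see what `.gitmodules` currently declares. The sketch below is an illustration under stated assumptions: `submodule_urls` is a hypothetical helper, and it relies on Python's `configparser` accepting the simple INI-like layout that `.gitmodules` files normally use. Git's config format is not strictly INI, so unusual files may need `git config -f .gitmodules --list` instead.

```python
import configparser


def submodule_urls(gitmodules_text):
    """Map each submodule name to the URL declared in .gitmodules text."""
    # Disable interpolation so a '%' inside a URL is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(gitmodules_text)
    urls = {}
    prefix = 'submodule "'
    for section in parser.sections():
        # Section headers look like: submodule "FlashTransformer"
        if section.startswith(prefix) and section.endswith('"'):
            name = section[len(prefix):-1]
            urls[name] = parser.get(section, "url", fallback=None)
    return urls


if __name__ == "__main__":
    with open(".gitmodules", encoding="utf-8") as fh:
        for name, url in submodule_urls(fh.read()).items():
            print(f"{name}: {url}")
```

If the printed URL does not match where your host serves the repository, apply the `git config submodule.FlashTransformer.url` override described above.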

Documentation

Roadmap

We will extend KunServe in the following directions:

  • Weight loading from host memory or SSD — broader storage tiers beyond today’s assumptions, for large models and flexible deployment.
  • Integrate SOTA inference engines (e.g. vLLM, SGLang) — optional backends alongside KunServe’s research-oriented execution stack, so benchmarks and scheduling ideas can target widely used runtimes.
  • CUDA graphs for efficient decoding — capture and replay steady-state decode for lower launch overhead where the graph constraints are acceptable.
  • Extend KunServe to RL rollout — reuse orchestration, batching, and tracing for rollout-heavy RL training loops, not only static serving benchmarks.

Acknowledgement

During development, KunServe learned from and borrowed modules from these projects:

  • DistServe & SwiftTransformer
  • Llumnix
  • vLLM
  • FlashInfer

Citation

If this codebase helps your research, please cite our paper using the following BibTeX entry:

@inproceedings{kunserve-eurosys26,
  author       = {Rongxin Cheng and Yuxin Lai and Xingda Wei and Rong Chen and Haibo Chen},
  title        = {KUNSERVE: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving},
  booktitle    = {Proceedings of the 21st European Conference on Computer Systems,
                  EuroSys 2026, Edinburgh, Scotland, UK, April 27--30, 2026},
  publisher    = {{ACM}},
  year         = {2026},
  doi          = {10.1145/3767295.3769348},
}

About

Code repo for paper: KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
