KunServe

This repository hosts KunServe (Python orchestration, Ray, scheduling experiments) and FlashTransformer (high-performance dense Transformer inference kernels and model execution in C++/CUDA), used together for LLM serving research and reproducible evaluation.

Disclaimer

Academic and evaluation focus. The stack is intended for research, benchmarking, and prototyping, not as a drop-in replacement for commercial managed inference. Operability, security hardening, and long-term API stability are not goals of this release.
Dense models. The supported code paths currently target dense architectures. For new topologies, start from the existing Llama-oriented implementation under FlashTransformer/src/csrc/model/sota/ and extend it from there.
Upstream lineage. The C++ backend was forked and renamed to FlashTransformer in this tree; Python tooling is branded KunServe. Historical ties to earlier research systems are acknowledged in kunserve/README.md.

Quick start

git clone <this-repository> kunserve && cd kunserve

# Submodules: FlashTransformer is declared in the repo-root `.gitmodules`.
# Nested deps (flash-attention, flashinfer) stay in `FlashTransformer/.gitmodules`
# with paths relative to that directory — use recursive update from the root:
git submodule update --init --recursive

# Conda environment (see environment.yml at repo root)
conda env create -f environment.yml && conda activate kunserve

# Build FlashTransformer (CUDA toolchain required)
cmake -S FlashTransformer -B FlashTransformer/build -DBUILD_MODE=RELEASE && cmake --build FlashTransformer/build -j"$(nproc)"

pip install -e .

After a successful build you should have libflash_pybinding.so under FlashTransformer/build/lib (exact layout may depend on your CMake preset).
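Because the exact output layout can vary, it may help to search the build tree for the library instead of assuming one fixed path. The following is a minimal sketch, not part of the repository: the helper name `find_pybinding` is hypothetical, and the only facts taken from this README are the file name `libflash_pybinding.so` and the build directory `FlashTransformer/build`.

```python
from pathlib import Path


def find_pybinding(build_dir, name="libflash_pybinding.so"):
    """Return every regular file called `name` under `build_dir`, sorted.

    Hypothetical helper for illustration; it only walks the directory tree
    and does not try to load or validate the shared library.
    """
    root = Path(build_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(name) if p.is_file())


if __name__ == "__main__":
    # Build directory as used in the quick-start commands above.
    hits = find_pybinding("FlashTransformer/build")
    if hits:
        for path in hits:
            print(path)
    else:
        print("libflash_pybinding.so not found; check the CMake build output")
```

Run it from the repository root after the CMake build. An empty result usually means the build did not finish or wrote its output to a different directory.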

The Git submodule URL for FlashTransformer is set in .gitmodules to the FlashTransformer project name. If your hosting still serves the repository under an older remote name, override it locally, for example: git config submodule.FlashTransformer.url <your-clone-url>.
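Before overriding anything, it can be useful to see which URL each submodule currently points to. The sketch below parses `.gitmodules`-style text with Python's standard `configparser`. The function name `submodule_urls` and the example URL in the usage note are assumptions for illustration and do not reflect the repository's real remote.

```python
import configparser


def submodule_urls(gitmodules_text):
    """Map each submodule name to its configured URL.

    Hypothetical helper: expects the usual `[submodule "name"]` sections
    with `path = ...` and `url = ...` keys.
    """
    # Disable interpolation so a literal '%' in a URL is not misread.
    parser = configparser.ConfigParser(interpolation=None)
    # Strip indentation so keys are never treated as continuation lines.
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser.read_string(cleaned)

    prefix = 'submodule "'
    urls = {}
    for section in parser.sections():
        if section.startswith(prefix) and section.endswith('"'):
            name = section[len(prefix):-1]
            urls[name] = parser.get(section, "url", fallback="")
    return urls
```

For example, `submodule_urls(open(".gitmodules").read())` returns a dictionary such as `{"FlashTransformer": "https://example.com/org/FlashTransformer.git"}` (placeholder URL). If the listed URL no longer resolves on your host, apply the `git config` override described above.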

Documentation

KunServe (Python layer, evaluation scope, acknowledgements): kunserve/README.md
FlashTransformer (native backend): FlashTransformer/README.md

Roadmap

We will extend KunServe in the following directions:

Weight loading from host memory or SSD — broader storage tiers beyond today’s assumptions, for large models and flexible deployment.
Integrate SOTA inference engines (e.g. vLLM, SGLang) — optional backends alongside KunServe’s research-oriented execution stack, so benchmarks and scheduling ideas can target widely used runtimes.
CUDA graphs for efficient decoding — capture and replay steady-state decode for lower launch overhead where the graph constraints are acceptable.
Extend KunServe to RL rollout — reuse orchestration, batching, and tracing for rollout-heavy RL training loops, not only static serving benchmarks.

Acknowledgement

During development, KunServe learned from and borrowed modules from these projects:

DistServe & SwiftTransformer
Llumnix
vLLM
FlashInfer

Citation

If this codebase helps your research, please cite our paper using the following BibTeX entry. Thanks!

@inproceedings{kunserve-eurosys26,
  author       = {Rongxin Cheng and Yuxin Lai and Xingda Wei and Rong Chen and Haibo Chen},
  title        = {KUNSERVE: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving},
  booktitle    = {Proceedings of the 21st European Conference on Computer Systems,
                  EuroSys 2026, Edinburgh, Scotland, UK, April 27--30, 2026},
  publisher    = {{ACM}},
  year         = {2026},
  doi          = {10.1145/3767295.3769348},
}
