# vllm-serve

Here are 37 public repositories matching this topic...

xerrors / mvllm

Intelligent load balancer for distributed vLLM server clusters

inference balancer llms vllm vllm-serve

Updated Oct 22, 2025
Python

Perpetue237 / agentsculptor

agentsculptor is an experimental AI-powered development agent designed to analyze, refactor, and extend Python projects automatically. It uses an OpenAI-like planner–executor loop on top of a vLLM backend, combining project context analysis, structured tool calls, and iterative refinement. It has only been tested with gpt-oss-120b via vLLM.

nlp open-source ai hackathon-project coding-assistant llms vllm agentic-ai vllm-serve gpt-oss gpt-oss-120b gpt-oss-20b vllm-server-config

Updated Sep 17, 2025
Python

KempnerInstitute / distributed-inference-vllm

Distributed Inference with vLLM

hpc slurm vllm llama3 qwen2-5 vllm-serve

Updated Apr 24, 2026
Shell

BudEcosystem / Awesome-vLLM-plugins

A curated list of plugins built on top of vLLM

plugins vllm vllm-operator vllm-serve vllm-integration vllm-plugins

Updated Dec 12, 2025

MekayelAnik / vllm-cpu

Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets

cpu-inference vllm llm-inference vllm-serve vllm-server

Updated Apr 19, 2026
Shell

project-david-ai / projectdavid-core

The core source files for this self-hostable successor to the OpenAI Assistants API. To contribute to the core logic, fork this repo or submit pull requests to it.

python docker self-hosted orchestration multi-agent firejail gdpr ai-platform llm vllm assistants-api rag-pipeline tool-calling vllm-serve openai-compatible

Updated Apr 24, 2026
Python

hadi-technology / vllm-mlops

Performant LLM inferencing on Kubernetes via vLLM

kubernetes digitalocean machine-learning mlops vllm vllm-serve

Updated Feb 11, 2025

brokedba / vllm-lab

This repository contains Terraform configuration for the vLLM production-stack on cloud-managed Kubernetes (K8s)

gke aks civo eks oke vllm llmcache vllm-operator vllm-serve vllm-production-stack

Updated Nov 10, 2025
HCL

kingabzpro / Deploying-the-Magistral-with-Modal

Deploy the Magistral-Small-2506 model using vLLM and Modal

modal mistral openai-api vllm-serve

Updated Jun 16, 2025
Python

iguanesolutions / qwen35-rp

Qwen 3.5 Reverse Proxy for handling instant / thinking modes and their variants automatically

inference reverse-proxy instant thinking openai-api llm vllm genai vllm-serve qwen3-5 sampling-parameters

Updated Apr 8, 2026
Go

SeungjaeLim / Efficient-Road-Repairs-System

[KAIST CS632] Road damage detection using YOLOv8 on Xilinx FPGA, repair estimation with vLLM-Serve Phi-3.5 FAISS RAG, and data management via GS1 EPCISv2 and React dashboard

react gs1 xilinx-fpga epcis faiss lmm rag yolov8 microsoft-phi3 vllm-serve

Updated Dec 19, 2024
Python

AbdulSametTurkmenoglu / vllm_rag_api

This project offers a production-ready RAG (Retrieval-Augmented Generation) API running on FastAPI, utilizing the high-performance vLLM engine.

rag llm vllm rag-chatbot vllm-serve

Updated Oct 31, 2025
Python

Aquiles-ai / load-test-vllm-gpt-oss-20b

Load testing openai/gpt-oss-20b with vLLM and Docker

docker load-testing vllm-serve gpt-oss-20b

Updated Sep 8, 2025
Python

mahimairaja / modal-qwen-3.5-9B

Deploy the SOTA Qwen 3.5 9B model serverless on Modal

modal llm vllm llm-inference vllm-serve qwen3-5 qwen9b

Updated Mar 4, 2026
Python

Rohit2sali / vllm-multi-tenant-llm-gateway

This is a multi-tenant large language model gateway built on vLLM. It is designed to serve many concurrent requests from many users. It uses vLLM as its engine to run the LLM, with a scheduler to schedule user queries and a limiter to cap each user's usage. It also uses LoRA adapters in vLLM.

machine-learning deep-learning lora multitenant inference-engine llm vllm vllm-serve

Updated Mar 5, 2026
Jupyter Notebook

HTAnh2003 / ViOCR-VLM-1B

This is an OCR model fine-tuned from Vintern1B (InternVL 1B), with 1 billion parameters. The model can recognize text in many different contexts, such as handwriting, printed text, and text on real-world objects.

docker llm vllm-serve

Updated Jun 9, 2025
HTML

SiliconLanguage / model-explorer-open-llm

A hybrid testbed for evaluating top open-source LLMs (like gpt-oss-20b and Llama 3.3) on local GPUs, cloud GPUs, and AWS Inferentia2/Trainium instances, focusing on vLLM optimization, capacity management, kernel bypass, and hardware-software co-design, as well as supporting infrastructure such as NCCL, RDMA, and NVMe-oF.

aws gpu rdma nvme kernel-bypass nccl gpudirect nvlink nvmeof llm vllm trainium vllm-serve inferentia2 software-hardware-co-design aws-ofi-nccl

Updated Apr 21, 2026
Python

rohitkt10 / vllm-bench

A reproducible benchmarking suite for vLLM inference. Measure latency, throughput, and VRAM across model configurations, quantization schemes, and deployment environments.

modal inference quantization llm vllm vllm-serve

Updated Jan 25, 2026
Python

Abhay-Sastha-S / Qwen3-TTS-myfi

High-concurrency async TTS server that serves Qwen3-TTS-12Hz-1.7B-Base with pre-computed speaker embeddings on a custom vLLM-Omni fork

tts vllm vllm-serve qwen3-tts vllm-omni

Updated Apr 17, 2026
Python

ai-art-dev99 / vLLM-efficient-serving-stack

Production-grade vLLM serving with an OpenAI-compatible API, per-request LoRA routing, KEDA autoscaling on Prometheus metrics, Grafana/OTel observability, and a benchmark comparing AWQ vs GPTQ vs GGUF.

grafana openai-api keda-scalers awq large-language-models vllm low-rank-adaptation vllm-serve

Updated Aug 30, 2025
Python
