This guide walks you through creating, validating, and using your first serving card.
# Clone the repo (PyPI package coming soon)
git clone https://github.com/zenprocess/servingcard
cd servingcard/packages/python
pip install -e .

This installs the servingcard CLI and the Python library.
A serving card captures the exact configuration and benchmark results for serving a model on specific hardware. Start by creating a YAML file:
servingcard: "1.0"
model: llama4-scout
variant: fp16-baseline
hardware: nvidia-rtx4090
framework: vllm
author: your-name
created: "2026-03-26"
method: manual
serving:
  tensor_parallel_size: 1
  gpu_memory_utilization: 0.90
  max_model_len: 32768
benchmark:
  single_stream:
    tok_s: 38.5
    ttft_ms: 420

Save this as my-first-card.yaml.
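To make the role of the serving section concrete, here is a minimal sketch that turns those three keys into a `vllm serve` command line. It assumes each key maps to the vLLM flag of the same name with underscores replaced by hyphens; the helper `vllm_serve_command` and the model identifier `your-org/your-model` are hypothetical and not part of the servingcard package.

```python
import shlex

# Serving section from the example card, as a plain dict so the sketch
# needs no third-party dependencies.
serving = {
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.90,
    "max_model_len": 32768,
}


def vllm_serve_command(model_id: str, serving: dict) -> str:
    """Build a `vllm serve` command line from a card's serving section.

    Assumption: every key corresponds to a vLLM flag spelled the same way
    with hyphens instead of underscores. Hypothetical helper, not part of
    the servingcard library.
    """
    args = ["vllm", "serve", model_id]
    for key, value in serving.items():
        args += [f"--{key.replace('_', '-')}", str(value)]
    return shlex.join(args)


# "your-org/your-model" is a placeholder for the real model name or path.
command = vllm_serve_command("your-org/your-model", serving)
print(command)
```

Keeping the mapping mechanical means a card stays the single source of truth: change a value in the YAML and the launch command follows.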
Run the validator to check your card:
servingcard validate my-first-card.yaml

If everything is correct, you will see:
VALID: my-first-card.yaml
If there are errors, the validator tells you exactly what to fix:
INVALID: my-first-card.yaml
- Missing required field: author
- Missing benchmark section
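The kind of check behind those messages can be sketched in a few lines. This is an illustration only: the real rules live in the servingcard validator, and the field list `ASSUMED_REQUIRED_FIELDS` and the helper `find_problems` are assumptions based on the example card, not the package's actual schema.

```python
# Hypothetical pre-check that mirrors the two error messages shown above.
# The required-field list is an assumption drawn from the example card.
ASSUMED_REQUIRED_FIELDS = ("servingcard", "model", "hardware", "framework", "author")


def find_problems(card: dict) -> list[str]:
    """Return human-readable problems for a card loaded as a dict."""
    problems = [
        f"Missing required field: {name}"
        for name in ASSUMED_REQUIRED_FIELDS
        if name not in card
    ]
    if "benchmark" not in card:
        problems.append("Missing benchmark section")
    return problems


# A card with no author and no benchmark section.
incomplete_card = {
    "servingcard": "1.0",
    "model": "llama4-scout",
    "hardware": "nvidia-rtx4090",
    "framework": "vllm",
}
for problem in find_problems(incomplete_card):
    print(f"- {problem}")
```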
View a summary of any serving card:
servingcard info my-first-card.yaml

Output:
Model: llama4-scout
Variant: fp16-baseline
Hardware: nvidia-rtx4090
Framework: vllm
Author: your-name
Method: manual
Benchmark:
Single stream: 38.5 tok/s
TTFT: 420 ms
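One way to read the two benchmark numbers together is as a rough latency estimate for a single request: time to first token plus the time to decode the remaining output. The sketch below assumes `tok_s` is the steady decode rate after the first token, which the card format may define differently; the function name and the 256-token output length are illustrative choices.

```python
def estimated_response_seconds(tok_s: float, ttft_ms: float, output_tokens: int) -> float:
    """Rough single-request latency: time to first token plus decode time.

    Assumption: tok_s is the steady decode rate after the first token.
    """
    return ttft_ms / 1000 + output_tokens / tok_s


# Values from the example card; 256 output tokens is an arbitrary choice.
estimate = estimated_response_seconds(tok_s=38.5, ttft_ms=420, output_tokens=256)
print(f"{estimate:.2f} s")
```

With the example card's figures this comes to roughly seven seconds for a 256-token reply.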
Load a serving card in Python:
from servingcard.schema import ServingCard
card = ServingCard.from_yaml("my-first-card.yaml")
print(card.model) # llama4-scout
print(card.hardware) # nvidia-rtx4090
if card.benchmark and card.benchmark.single_stream:
    print(f"{card.benchmark.single_stream.tok_s} tok/s")

Search for existing configs:
servingcard search --model qwen3-coder
servingcard search --hardware nvidia-gb10

- Read the Format Overview for details on every section
- Read the Benchmark Guide for how to run proper benchmarks
- Browse the registry for real-world examples
- Contribute your own serving card