An end-to-end observability demo built around a Python FastAPI service and an OpenTelemetry pipeline.
This repository is designed to be easy to run locally, easy to explain in interviews, and concrete enough to show backend + SRE thinking in one project.
In short: a small, locally runnable observability platform that shows how a Python FastAPI service is instrumented with OpenTelemetry, with traces, metrics, and logs shipped through a unified OTEL Collector into Grafana, Prometheus, Tempo, and Loki for observation and troubleshooting.
This project demonstrates:
- a containerized backend service with FastAPI
- distributed tracing with OpenTelemetry and Tempo
- structured JSON logging with Loki
- metrics collection with Prometheus
- centralized OTLP ingestion through the OpenTelemetry Collector
- Grafana provisioning for datasources and dashboards
```
FastAPI app
  -> OTLP traces  -> OTEL Collector -> Tempo      -> Grafana
  -> OTLP metrics -> OTEL Collector -> Prometheus -> Grafana
  -> OTLP logs    -> OTEL Collector -> Loki       -> Grafana
```
- Built a containerized observability demo platform around a FastAPI service and OpenTelemetry instrumentation.
- Implemented request tracing, structured JSON logging, and custom metrics with a unified OTLP pipeline through the OpenTelemetry Collector.
- Provisioned Grafana, Prometheus, Tempo, and Loki automatically with Docker Compose so the full stack can be started locally with one command.
- Added trace-to-log correlation through `trace_id`, making it possible to pivot from a failing request trace to the exact related logs.
- FastAPI
- OpenTelemetry
- OpenTelemetry Collector
- Prometheus
- Grafana
- Tempo
- Loki
- Docker Compose
- micromamba for local development
The backend exposes two endpoints:
- `GET /ok`: lightweight health-style endpoint
  - returns `200`
  - includes `trace_id` in both the JSON body and the `X-Trace-Id` response header
- `GET /slow`: simulates random latency
  - supports configurable failure probability
  - creates custom spans named `fake-db` and `external-call`
  - returns `trace_id` in both the JSON body and the `X-Trace-Id` response header
The application also emits:
- JSON logs with `trace_id` and `span_id`
- `requests_total{route,method,status}`
- `request_duration_seconds_bucket{route,method,status,le}`
- `inflight_requests`
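The latency and failure simulation behind `/slow` can be sketched in plain Python. The function below is a hypothetical stand-in whose names follow the query parameters shown later in this README, not the repository's actual handler:

```python
import random
from typing import Optional

def simulate_slow(min_ms: int = 100, max_ms: int = 900,
                  fail_rate: float = 0.0,
                  rng: Optional[random.Random] = None) -> tuple[int, bool]:
    """Pick a latency in [min_ms, max_ms] and decide whether to fail.

    A real handler would sleep for latency_ms, record it in the
    request_duration_seconds histogram, and return a 500 when failed.
    """
    rng = rng or random.Random()
    latency_ms = rng.randint(min_ms, max_ms)
    failed = rng.random() < fail_rate
    return latency_ms, failed
```

Passing an explicit `rng` keeps the behavior reproducible in tests while the default stays random in normal use.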
```
observability-platform/
  app/                 # FastAPI service, Dockerfile, Python dependencies
  otel-collector/      # OTEL Collector pipeline config
  prometheus/          # Prometheus scrape config
  grafana/             # Grafana provisioning and dashboard JSON
  tempo/               # Tempo trace backend config
  loki/                # Loki log backend config
  docker-compose.yml   # one-command local stack startup
  environment.yml      # micromamba development environment
  .env.example         # app and Grafana runtime settings
```
```
docker compose up --build
```

Default local endpoints:

- App: http://localhost:8000
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100
- Tempo: http://localhost:3200
- Collector metrics exporter: http://localhost:9464/metrics

Grafana credentials:

- username: `admin`
- password: `admin`
```
micromamba env create -f environment.yml
micromamba activate obs-platform
```

If you want to run the FastAPI app on your host while the rest of the stack stays in Docker:

```
export OTEL_EXPORTER_OTLP_BASE_ENDPOINT=http://localhost:4318
python app/main.py
```

Hit the endpoints a few times:

```
curl -s http://localhost:8000/ok | jq .
curl -s "http://localhost:8000/slow" | jq .
curl -s "http://localhost:8000/slow?min_ms=200&max_ms=1200&fail_rate=0.35" | jq .
curl -i "http://localhost:8000/slow?fail_rate=1"
```

Generate some load:

```
seq 200 | xargs -I{} -P 20 curl -s "http://localhost:8000/slow?min_ms=100&max_ms=900&fail_rate=0.2" >/dev/null
```

Then open Grafana and go to Dashboards → Observability Demo / Service Overview.
The dashboard includes:
- request rate
- error rate
- P95 latency
- inflight requests
- requests by route and status
- recent request logs
The dashboard uses:
```
histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[$__rate_interval])))
```
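What `histogram_quantile` computes can be illustrated in plain Python: given cumulative bucket counts, it finds the bucket containing the target rank and linearly interpolates inside it. This is a simplified sketch of the Prometheus behavior that ignores some edge cases:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound_le, cumulative_count) pairs sorted by bound,
    ending with (float("inf"), total_count) - like Prometheus _bucket series.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile falls in the +Inf bucket: return last finite bound.
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with 100 requests where 90 finished under 0.5s and all under 1.0s, the 0.95 quantile lands halfway through the 0.5-1.0s bucket, i.e. 0.75s, even though the average could look healthy.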
- Open `Explore`
- Select the `Tempo` datasource
- Query recent traces
- Open a `/slow` trace to inspect:
  - the FastAPI HTTP server span
  - the `fake-db` custom span
  - the `external-call` custom span
Use the `trace_id` from the response header or from a Tempo trace and query Loki with:

```
{service_name="demo-api"} | json | trace_id="PUT_TRACE_ID_HERE"
```
Grafana is provisioned so Tempo can link directly to Loki logs for the same trace.
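The mechanism behind this correlation is simple: every log line carries the active trace id as a field. The stdlib-only sketch below uses a contextvar as a stand-in for OpenTelemetry's span context; the real app would read the id from the current span:

```python
import contextvars
import json
import logging

# Stand-in for OpenTelemetry's current span context (assumption for this sketch).
current_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Attach the active trace_id to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Render records as single-line JSON, the shape Loki's json parser expects."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })

logger = logging.getLogger("demo-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("handled /slow")  # emits JSON carrying the trace_id
```

Because the same id appears in the span, the `X-Trace-Id` header, and every log line, the LogQL filter above can pull exactly the logs for one request.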
Average latency hides tail behavior. A service can have a good average while still serving a meaningful number of very slow requests. P95 gives a more useful operational signal.
This repository is a local demo and interview project, so the default is optimized for visibility and learning. In production, you would usually reduce the sampling ratio based on traffic volume and cost.
The sampling ratio is controlled by:
```
TRACE_SAMPLE_RATIO=1.0
```
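The idea behind ratio-based head sampling can be sketched as a deterministic decision on the trace id, so every service that sees the same trace agrees on keep vs. drop. This mirrors the spirit of OpenTelemetry's `TraceIdRatioBased` sampler, not its exact implementation:

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace iff the low 64 bits of its id fall below ratio * 2**64.

    Because the decision depends only on the trace id, every hop in a
    distributed trace makes the same choice, so kept traces stay complete.
    """
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound
```

With `TRACE_SAMPLE_RATIO=1.0` the bound covers the whole id space and every trace is kept, which is the right default for a demo and too expensive for most production traffic.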
Metrics are already aggregated and relatively cheap compared to full-fidelity traces. Request rate, error rate, and latency SLOs need complete counts to stay reliable.
- backend API implementation in Python with FastAPI
- observability-first service design
- OpenTelemetry instrumentation for traces, metrics, and logs
- containerized local platform setup with Docker Compose
- operational thinking around latency, error rate, structured logs, and trace correlation
- Grafana provisioning instead of manual dashboard setup
- How to use the OpenTelemetry Collector as a central ingestion layer instead of wiring each backend directly into the application.
- Why P95 latency is usually a more operationally useful signal than average latency.
- How structured logs become much more valuable when they share the same `trace_id` as traces.
- How to package a multi-service local platform so other engineers can run it with one command instead of manual setup.
Check whether the Collector metrics exporter has data:
```
curl -s http://localhost:9464/metrics | grep requests_total
```

Make sure you have called `/ok` or `/slow`, then inspect recent traces in Grafana Explore with the Tempo datasource.
First confirm the app is writing JSON logs:
```
docker compose logs app
```

If logs appear in stdout but not in Loki, inspect the OTEL Collector and Loki container logs next.

