The forkd controller exposes a JSON/HTTP API on 127.0.0.1:8889 by
default. Pass --tls-cert/--tls-key to serve HTTPS instead of
plain HTTP. All routes except /healthz require a bearer token when
the daemon is started with --token-file.
Authorization: Bearer <contents-of-token-file>

API versioning: every breaking change moves to a new /vN prefix. The
controller will support the previous major in parallel for one minor
release after the new one ships.
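
As an illustration of the bearer-token rule above, here is a minimal sketch that builds an authenticated request with the Python standard library. The helper names `build_request` and `read_token` are hypothetical, and stripping trailing whitespace from the token is an assumption about how the token file is written, not documented daemon behaviour.

```python
import urllib.request
from pathlib import Path

# Default listen address from the text above; use https:// when the
# daemon was started with --tls-cert/--tls-key.
BASE_URL = "http://127.0.0.1:8889"


def read_token(token_file):
    """Read the token from the same file that was passed to --token-file."""
    return Path(token_file).read_text()


def build_request(path, token=None, method="GET"):
    """Build a request for the controller, attaching the bearer token if given."""
    req = urllib.request.Request(BASE_URL + path, method=method)
    if token is not None:
        # Assumption: a trailing newline in the token file is not part of the token.
        req.add_header("Authorization", "Bearer " + token.strip())
    return req
```

A request for /healthz can be built without a token, since that route bypasses authentication; any other route would pass `token=read_token(...)` and then hand the request to `urllib.request.urlopen`.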
Liveness probe. Always returns 200. Bypasses authentication so that load balancers can probe the daemon without a credential.
{ "ok": true }

{ "version": "0.1.0", "api": "v1" }

Prometheus text exposition format. Stable metric names:

- forkd_snapshots_total (gauge): registered snapshots
- forkd_sandboxes_active (gauge): currently-alive child VMs
- forkd_build_info{version="X.Y.Z"}: always 1, label carries the build version
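
To show how the stable metric names above might be consumed, the following sketch parses a scrape body into a dictionary. `parse_metrics` is a hypothetical helper; it assumes samples carry no trailing timestamp and that label values contain no spaces, which holds for the three metrics listed but not for the exposition format in general.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {(name, labels): value}.

    Simplified on purpose: comment lines are skipped, timestamps are not
    supported, and labels are kept as the raw string between the braces.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, brace, labels = series.partition("{")
        samples[(name, labels.rstrip("}") if brace else "")] = float(value)
    return samples
```

With such a dictionary, the number of live child VMs would be `samples[("forkd_sandboxes_active", "")]`, and the build version can be read from the label string of the forkd_build_info series.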
Build a snapshot of a freshly booted parent VM and register it under
<tag>. Blocks for boot_wait_secs while userspace warms up inside
the guest.
Request:
{
  "tag": "py",
  "kernel": "/var/lib/forkd/kernels/vmlinux-6.1",
  "rootfs": "/var/lib/forkd/rootfs/python.ext4",
  "rw": true,
  "tap": "forkd-tap0",
  "boot_wait_secs": 10
}

Response (201 Created):
{
  "tag": "py",
  "dir": "/var/lib/forkd/snapshots/py",
  "created_at_unix": 1717000000
}

Errors:

- 400 Bad Request: invalid tag, missing kernel/rootfs, snapshot already exists
- 500 Internal Server Error: Firecracker boot/snapshot failure (see logs)
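
Because the build call blocks for boot_wait_secs, a client needs a timeout longer than that wait. The sketch below builds the POST /v1/snapshots request and a matching timeout. `snapshot_request` is a hypothetical helper, the 30-second margin is an arbitrary assumption, and the local check only mirrors the "missing kernel/rootfs" case; the daemon remains the authority on what counts as a valid tag.

```python
import json
import urllib.request


def snapshot_request(tag, kernel, rootfs, boot_wait_secs=10,
                     base="http://127.0.0.1:8889", **extra):
    """Return (request, timeout_secs) for building a snapshot.

    Extra keyword arguments such as rw=True or tap="forkd-tap0" are
    passed through to the JSON body unchanged.
    """
    if not tag or not kernel or not rootfs:
        # Fail early instead of waiting for a 400 from the daemon.
        raise ValueError("tag, kernel and rootfs are required")
    body = {"tag": tag, "kernel": kernel, "rootfs": rootfs,
            "boot_wait_secs": boot_wait_secs, **extra}
    req = urllib.request.Request(
        base + "/v1/snapshots",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The daemon blocks for boot_wait_secs; the margin is an assumption.
    timeout_secs = boot_wait_secs + 30
    return req, timeout_secs
```

The returned pair would be used as `urllib.request.urlopen(req, timeout=timeout_secs)`, with the bearer token header added when the daemon runs with --token-file.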
List registered snapshots: [SnapshotInfo, ...].
Remove the registry entry and delete the on-disk snapshot files.
Returns 204 No Content, or 404 Not Found if no such tag is registered and
no on-disk files exist.
Fork N children from a registered snapshot tag.
Request:
{
  "snapshot_tag": "py",
  "n": 10,
  "per_child_netns": true,
  "memory_limit_mib": 256,
  "live_fork": true
}

- n: 1 ≤ n ≤ 1000
- per_child_netns: when true, each child is placed in forkd-child-<i>; the host must have provisioned those namespaces via scripts/netns-setup.sh N first.
- memory_limit_mib: sets memory.max on a per-child cgroup v2 leaf. Requires cgroup v2 unified hierarchy and write access to /sys/fs/cgroup/forkd/.
- live_fork (v0.4+, default false): boot the sandbox with a memfd-backed RAM region so later POST .../branch calls can use mode: "live". Requires Linux ≥ 5.7 and the vendored Firecracker fork (see docs/VENDORED-FIRECRACKER.md). forkd doctor probes both prerequisites.
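
The field rules above can be checked on the client before the request is sent. This is a sketch only: `fork_body` is a hypothetical helper, and omitting optional fields when they are unset is an assumption about what the daemon accepts (only live_fork has a documented default).

```python
def fork_body(snapshot_tag, n, per_child_netns=None,
              memory_limit_mib=None, live_fork=None):
    """Build the JSON body for forking n children from snapshot_tag."""
    if not 1 <= n <= 1000:
        raise ValueError("n must satisfy 1 <= n <= 1000")
    if memory_limit_mib is not None and memory_limit_mib <= 0:
        raise ValueError("memory_limit_mib must be positive")
    body = {"snapshot_tag": snapshot_tag, "n": n}
    optional = {
        "per_child_netns": per_child_netns,
        "memory_limit_mib": memory_limit_mib,
        "live_fork": live_fork,
    }
    # Only send the optional fields the caller set explicitly.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

Setting `live_fork=True` here is what later allows a branch with mode: "live" on the resulting sandboxes.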
Response (201 Created): [SandboxInfo, ...].
List active sandboxes.
One sandbox's metadata.
Terminate. Kills the Firecracker process and removes the cgroup leaf.
Returns 204 No Content.
Round-trip to the guest agent inside the VM.
{ "pong": true, "numpy_version": "1.26.4", "pid": 1 }

Spawn a subprocess in the sandbox.
Request:
{ "args": ["python3", "-c", "print(2+2)"], "timeout_secs": 30 }
Response:
{ "stdout": "4\n", "stderr": "", "exit_code": 0 }

Evaluate a Python expression against the already-warmed interpreter running as PID 1.
Request: { "code": "numpy.zeros(5).sum()" }
Response: { "result": "0.0", "error": null, "exit_code": 0 }
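
The two response shapes above can be turned into ordinary return values or exceptions on the client side. A minimal sketch, with hypothetical helper names `check_exec` and `check_eval`; treating a non-zero exit_code or a non-null error as a failure is an interpretation of the example responses, not a documented contract.

```python
def check_exec(resp):
    """Return stdout of a subprocess response, raising on a non-zero exit."""
    if resp["exit_code"] != 0:
        raise RuntimeError(
            "exit code %d: %s" % (resp["exit_code"], resp.get("stderr", "")))
    return resp["stdout"]


def check_eval(resp):
    """Return the result string of an eval response, raising if error is set."""
    if resp.get("error") is not None:
        raise RuntimeError(str(resp["error"]))
    # In the example response the result is a string, so callers convert it.
    return resp["result"]
```

For the example eval response, `float(check_eval(resp))` gives the numeric value of the expression.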
Pause a running sandbox, snapshot its memory + vmstate to a new tag, resume it. The resulting snapshot is independent of the source's lifecycle — fork from it or delete it regardless of whether the source sandbox is still alive. Volumes from the source snapshot are inherited automatically, so grandchildren see the same persistent disks.
Request:
{ "tag": "checkpoint-1", "mode": "live", "wait": false }

- tag is optional. When unset the daemon generates branch-<source-id>-<unix-ts>. Must match ^[A-Za-z0-9_][A-Za-z0-9._-]{0,63}$.
- mode (v0.4+) is one of "full", "diff", "live". Defaults to "full" when unset. "live" requires the source sandbox to have been spawned with live_fork: true and the host to support UFFD_WP + memfd_create (forkd doctor probes both).
- diff: true is the legacy v0.3 equivalent of mode: "diff"; kept for compatibility. Mutually exclusive with mode; sending both yields 400 Bad Request.
- wait (v0.4+, default true) is only meaningful with mode: "live". When false, the daemon returns SnapshotInfo with status: "writing" as soon as the source resumes (~10 ms); the background memory copy finishes asynchronously and the snapshot's status flips to "ready" (or "failed"). Poll GET /v1/snapshots to detect completion.
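
The request rules above (tag pattern, allowed modes, and the mode/diff exclusion) can be mirrored in a small client-side check. `branch_body` is a hypothetical helper; the regular expression and mode names are taken from the text, while everything else is a sketch.

```python
import re

# Tag pattern quoted from the field description above.
TAG_RE = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9._-]{0,63}$")
MODES = ("full", "diff", "live")


def branch_body(tag=None, mode=None, diff=None, wait=None):
    """Build the JSON body for a branch request, rejecting invalid combinations."""
    if tag is not None and not TAG_RE.fullmatch(tag):
        raise ValueError("invalid tag: %r" % (tag,))
    if mode is not None and diff is not None:
        # The daemon answers 400 Bad Request when both are sent.
        raise ValueError("mode and diff are mutually exclusive")
    if mode is not None and mode not in MODES:
        raise ValueError("mode must be one of %s" % (MODES,))
    fields = {"tag": tag, "mode": mode, "diff": diff, "wait": wait}
    # Unset fields are left out so the daemon applies its defaults.
    return {k: v for k, v in fields.items() if v is not None}
```

Calling `branch_body()` with no arguments yields an empty body, which relies on the daemon generating the tag and defaulting mode to "full".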
Response (201 Created): SnapshotInfo with
branched_from set to the source sandbox id and pause_ms
populated with the measured pause window in milliseconds. With
mode: "live", also returns status ("writing" when
wait: false, otherwise "ready").
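
When a live branch is started with wait: false, the caller has to poll until the status leaves "writing". The sketch below shows one way to do that. `wait_until_ready` is a hypothetical helper, and `list_snapshots` is a caller-supplied function assumed to return the parsed body of GET /v1/snapshots; the interval and timeout values are arbitrary.

```python
import time


def wait_until_ready(tag, list_snapshots, interval=0.5, timeout=60.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the snapshot named tag is ready; raise on failure or timeout."""
    deadline = clock() + timeout
    while True:
        for snap in list_snapshots():
            if snap.get("tag") != tag:
                continue
            # status is omitted for synchronous snapshots, which are ready.
            status = snap.get("status", "ready")
            if status == "ready":
                return snap
            if status == "failed":
                raise RuntimeError("background copy failed for %r" % (tag,))
        if clock() >= deadline:
            raise TimeoutError("snapshot %r not ready after %ss" % (tag, timeout))
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays; in normal use the defaults are enough.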
Errors:
- 400 Bad Request: both mode and diff set
- 404 Not Found: source sandbox id not in live_vms
- 409 Conflict: tag already exists on disk; DELETE it first
- 409 Conflict: a BRANCH for this exact tag is already in flight
- 503 Service Unavailable: daemon at branch concurrency cap (default 4)
- 500 Internal Server Error: pause / snapshot / resume failure
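
Of the errors listed above, 503 is the one that signals a temporary condition (the concurrency cap), so it is a natural candidate for retrying. The following sketch retries only on 503 with exponential backoff. `branch_with_retry` is a hypothetical helper, `send` is a caller-supplied function assumed to return a `(status, body)` pair, and the retry policy itself is an assumption rather than documented guidance.

```python
import time


def branch_with_retry(send, attempts=5, base_delay=0.2, sleep=time.sleep):
    """Call send() until it returns something other than 503.

    Every other status, including 409 and 500, is handed back to the
    caller unchanged because retrying it blindly is unlikely to help.
    """
    status, body = None, None
    for attempt in range(attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return status, body
```

A 409 is deliberately not retried: per the list above it means either an existing tag that must be deleted first or a branch already in flight for the same tag.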
Pause-window semantics by mode. The source's user-visible pause:
- mode: "full": 0.5–8 s, whole guest RAM written.
- mode: "diff": ~200 ms idle source, sub-second for typical agent workloads (v0.3+; see bench/pause-window/RESULTS-v0.3.md).
- mode: "live": sub-50 ms; dirty pages captured asynchronously via UFFD_WP (v0.4+).
The source is paused at the vCPU level: kernel state and TCP sockets
are preserved, but application-level keepalives may time out for "full".
Modal's "branch" operation has comparable semantics to mode: "full".
If resume fails after a successful snapshot, the snapshot file is
intact and returned to the caller; the source sandbox may be left in
an unknown state. The controller logs this as a warning rather than
failing the request, because the user's primary expectation (a valid
new snapshot) has been met.
See docs/design/branching.md for the full
rationale, use cases, and follow-up roadmap.
SandboxInfo:
{
  "id": "sb-67a1b3-0000",
  "snapshot_tag": "py",
  "netns": "forkd-child-1",
  "guest_addr": "10.42.0.2:8888",
  "created_at_unix": 1717000123,
  "pid": 314159,
  "memory_limit_mib": 256
}

SnapshotInfo:
{
  "tag": "py",
  "dir": "/var/lib/forkd/snapshots/py",
  "created_at_unix": 1717000000,
  "branched_from": "sb-67a1b3-0000",
  "pause_ms": 1820,
  "status": "ready"
}

- branched_from is omitted when the snapshot was built from kernel + rootfs via POST /v1/snapshots; it is present (carrying the source sandbox id) only when the snapshot was produced via POST /v1/sandboxes/:id/branch. Use this field to trace snapshot lineage / audit.
- pause_ms is the measured source-VM pause window in milliseconds (pause() → resume() envelope). Omitted for snapshots not produced via BRANCH. This is the daemon's ground truth; the application-observed pause (TCP stalls, missed pings) can be longer due to OS retransmit timers.
- status (v0.4+, optional): "writing" while a live BRANCH's background memory copy is in flight (only seen with mode: "live" and wait: false), "ready" once the snapshot is consumable, "failed" if the background copy hit an error. Omitted on snapshots from Diff or Full BRANCH (they're synchronous, so the daemon only returns once they're ready).
Every 4xx and 5xx response carries:
{ "error": "human-readable message" }
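
Since every 4xx and 5xx response carries the same error shape, a client can extract the message in one place. A minimal sketch; `error_message` is a hypothetical helper, and falling back to a generic "HTTP <status>" string when the body is not the documented shape is a defensive assumption (for example, when a proxy in front of the daemon produced the response).

```python
import json


def error_message(status, raw_body):
    """Return the human-readable error for a 4xx/5xx response, else None."""
    if status < 400:
        return None
    fallback = "HTTP %d" % status
    try:
        payload = json.loads(raw_body)
    except (ValueError, TypeError):
        # Not JSON at all, or a body type json.loads cannot handle.
        return fallback
    if isinstance(payload, dict) and isinstance(payload.get("error"), str):
        return payload["error"]
    return fallback
```

With `urllib`, the status and body of a failed call are available on the raised `urllib.error.HTTPError` as `err.code` and `err.read()`.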