A container runtime written in Go demonstrating core Linux containerization technologies. Implements namespace isolation, cgroups resource limits, filesystem layering, security hardening, and container lifecycle management.
All 7 phases completed. Fully functional container runtime with comprehensive test coverage and CI/CD integration.
Metrics:
- Total codebase: ~6,400 lines of Go
- Test coverage: 32.3% overall (70–85% of non-root-only code)
- Unit tests: 700+
- Performance: ~10,000 containers/sec state creation rate
- Supported OS: Linux with cgroups v2
Prerequisites:
- Go 1.21+
- Linux kernel 5.13+ with cgroups v2 enabled
- Root privileges for container execution
Build:
go mod download
go build -o miniDockerRun a container:
sudo ./miniDocker run /var/lib/images/alpine /bin/shList containers:
./miniDocker psView logs:
./miniDocker logs <container-id>Container Process (PID 1)
|
+-- Namespace Isolation
| - UTS: hostname
| - PID: process tree
| - Network: network stack
| - Mount: filesystem
| - IPC: system V IPC
| - User: UID/GID mapping
|
+-- Filesystem Isolation (OverlayFS)
| - Lower: Read-only image
| - Upper: Read-write container layer
| - Merged: Union view inside container
|
+-- Resource Limits (Cgroups v2)
| - Memory: byte limit, OOM kill
| - CPU: quota/period throttling
| - PIDs: process count limit
| - Swap: optional swap limit
|
+-- Network Isolation
| - Bridge: miniBridge0 on host (172.20.0.1/16)
| - IPAM: allocates 172.20.x.x to containers
| - veth: virtual ethernet pair per container
|
+-- Security Hardening
- Capabilities: 27 dangerous ones dropped
- Seccomp BPF: ~200 syscalls blocked
- Path masking: /proc/kcore, /proc/mem, etc. hidden
- NO_NEW_PRIVS: prevents setuid/capability escalation
Re-execs the binary with an init sentinel to enter the container namespace,
then isolates UTS, PID, Network, IPC, Mount, and User namespaces. Maps
container UID 0 to the host UID without privilege escalation.
Key files: container/run.go, container/init.go
Namespace flags:
CLONE_NEWUTS - Isolated hostname/domainname
CLONE_NEWPID - Container init becomes PID 1
CLONE_NEWNET - Isolated network interfaces
CLONE_NEWIPC - Isolated IPC (message queues, semaphores, shared memory)
CLONE_NEWNS - Isolated mount tree
CLONE_NEWUSER - UID/GID mapping (container 0 → host UID)
Mounts an OverlayFS combining a read-only image layer with a per-container
read-write upper layer, then calls pivot_root to make the merged view the
container's root. Mounts /proc, /sys, /dev, /dev/pts, and /dev/shm.
Key files: fs/overlay.go, fs/mount.go
pivot_root sequence:
- Bind-mount newRoot onto itself (
MS_BIND | MS_REC) - Create
put_olddirectory inside newRoot syscall.PivotRoot(newRoot, putOld)chdir("/")inside new root- Unmount and remove
put_old
Creates a cgroup under /sys/fs/cgroup/miniDocker/<id>, writes resource
limits, then attaches the container process via cgroup.procs.
Key file: cgroups/cgroup.go
Supported limits:
| Resource | Minimum | Format |
|---|---|---|
| Memory | 4 MiB | 256m, 1g |
| CPU | — | 0.5, 2.0 (fraction of one core) |
| CPU weight | — | 1–10000 |
| PIDs | — | integer (0 = unlimited) |
| Swap | ≥ memory | same as memory |
Note on timing: Cgroups are attached after the container process starts.
There is a brief window between cmd.Start() and cgroup attachment during
which the process runs without resource limits. This is a known limitation of
the current architecture; production runtimes use CLONE_INTO_CGROUP or a
pipe-based synchronization to eliminate this window.
Creates a miniBridge0 bridge on the host (172.20.0.1/16), allocates a
container IP via IPAM, and establishes a veth pair connecting the container's
network namespace to the bridge. Enables IP forwarding and installs an
iptables MASQUERADE rule for outbound traffic.
Key files: network/bridge.go, network/ipam.go, network/veth.go
IPAM:
- Range: 172.20.0.2 – 172.20.255.254 (gateway 172.20.0.1 skipped)
- State:
/var/run/miniDocker/ipam.json(atomic write via.tmprename) - Thread-safe with
sync.Mutex; refreshes state from disk on each allocation
Applies a five-layer security model inside the container after the overlay
mount and before exec:
-
Path masking — bind-mounts
/dev/nullover 13 sensitive kernel paths (/proc/kcore,/proc/kmem,/sys/firmware, etc.) and remounts 5 paths read-only. Best-effort; missing paths are silently skipped. -
Capability dropping — removes 27 capabilities from bounding, effective, permitted, and inheritable sets. Notable drops:
CAP_SYS_ADMIN,CAP_SYS_PTRACE,CAP_SYS_MODULE,CAP_NET_ADMIN. -
NO_NEW_PRIVS —
prctl(PR_SET_NO_NEW_PRIVS, 1)prevents setuid binaries and file capabilities from elevating privilege. Inherited by children. -
Seccomp BPF — whitelist filter: ~150 safe syscalls allowed, everything else returns
EPERM. Notable omissions from the allowlist:SYS_IOCTL— no argument filtering, significant container-escape surfaceSYS_PRCTL— would allow installing a nested seccomp filter to re-allow blocked syscallsSYS_CLONE3— without argument filtering, can enable unsafe clone flags
Consequence:
PR_SET_NAMEandPR_GET_DUMPABLEare blocked, which may affect Go runtime thread naming and some Java/JVM diagnostics.
Key files: security/capabilities.go, security/seccomp.go,
security/paths.go
Full CLI with run, ps, logs, stop, rm, inspect, and stats
commands. State is persisted to JSON files under
/var/run/miniDocker/containers/<id>/.
State machine: created → running → exited | error
Key files: cmd/, state/
Note on stop: The PID stored in container state is the host-side PID of
the miniDocker run parent process, not the container's PID 1. Signaling it
is correct — the parent's exit triggers cleanup — but the naming in the code
reflects this explicitly.
Note on detached containers: Cleanup (overlay unmount, cgroup removal, IP
release) is deferred to when RunContainer returns. All containers are
therefore tied to the lifetime of their parent miniDocker run process. There
is no daemon mode.
Atomic state writes via .tmp-then-rename, state recovery on restart by
scanning the containers directory, log rotation at 10 MiB with up to 3
rotations, and comprehensive benchmarks.
Key files: state/state.go, state/logs.go, state/lifecycle.go
Usage: miniDocker <command> [options]
Container lifecycle:
run [options] <image-path> <cmd> [args...] Start a new container
stop [-t <sec>] [-f] <id> [...] Stop a running container
rm [-f] <id> [...] Remove a container
Inspection:
ps [-a] [-q] [--format table|json|ids] List containers
logs [-f] [--tail N] [--timestamps] <id> Fetch container logs
inspect [--pretty=false] <id> [...] Show container metadata
stats [-a] [--no-stream] [--interval ms] <id> Display resource usage
run options:
--memory <limit> Memory limit (e.g. 256m, 1g)
--swap <limit> Swap limit (must be >= memory)
--cpu <frac> CPU fraction (e.g. 0.5, 2.0)
--cpu-weight <n> CPU scheduling weight (1–10000)
--pids <n> Max PIDs (0 = unlimited)
# Unit tests (no root required)
go test -v ./tests/...
# Integration tests (requires root)
sudo go test -v ./tests/...
# Coverage report
go test -v -coverprofile=coverage.out ./tests/...
go tool cover -html=coverage.out -o coverage.html
# Benchmarks
sudo go test -v -bench=. -timeout 120s ./tests/...Coverage breakdown:
- 32.3% overall (expected for system software — most untested lines require root, live kernel features, or running containers)
- 70–85% of non-root-only code
- Root-only and kernel-dependent paths: integration tests or graceful skips
miniDocker/
main.go Entry point and command dispatcher
go.mod, go.sum Module dependencies
.github/workflows/ci.yml GitHub Actions CI/CD
container/
run.go Parent-side: namespace setup, process launch
init.go Child-side: overlay mount, pivot_root, security
fs/
overlay.go OverlayFS mounting
mount.go pivot_root and pseudo-filesystem mounts
cgroups/
cgroup.go Resource limit validation and application
network/
bridge.go miniBridge0 lifecycle and iptables rules
ipam.go IP address allocation and persistence
veth.go veth pair setup and container-side config
security/
capabilities.go Capability dropping (27 caps, all sets)
seccomp.go Seccomp BPF filter (~150-syscall whitelist)
paths.go Sensitive path masking and read-only remounts
state/
types.go ContainerStatus, ContainerConfig, error types
state.go Persistent state manager (JSON, RWMutex)
lifecycle.go State transition coordinator
logs.go Log file I/O and rotation
cmd/
run.go, ps.go, logs.go CLI commands
stop.go, rm.go CLI commands
inspect.go, stats.go CLI commands
helpers.go Shared formatting utilities
tests/
cmd_test.go CLI and helper unit tests
cgroups_test.go Resource limit validation tests
phase3_cgroups_test.go Additional cgroups tests
integration_test.go Full lifecycle tests (root-only)
network_bridge_veth_test.go Bridge and veth tests
network_ipam_test.go IPAM allocation tests
security_test.go Security layer tests
state_test.go State persistence tests
run_test.go Container execution tests
init_test.go Init process tests
- No daemon mode. Containers are tied to the lifetime of the
miniDocker runprocess. There is no background daemon or detach flag. - Cgroup timing window. Resource limits are applied after process start, not before. A brief uncontrolled window exists between fork and cgroup attachment.
- x86-64 only. The seccomp BPF filter validates
AUDIT_ARCH_X86_64at load time and kills the process on architecture mismatch. prctlblocked.SYS_PRCTLis excluded from the seccomp whitelist to prevent nested filter installation. This breaksPR_SET_NAME(thread naming) used by Go, Java, and some C runtimes. Workloads relying on thread names may behave unexpectedly.- No volume mounts. Only OverlayFS layering is supported; there is no
bind-mount or
-vequivalent. - No image pull. Images must be pre-existing directory trees on the host.
Container won't start:
- Verify the image path exists:
ls <image-path> - Check cgroup controllers:
cat /sys/fs/cgroup/cgroup.controllers - Confirm namespace support:
ls /proc/sys/kernel/unprivileged_userns_clone
Network not working:
- Check bridge:
ip link show miniBridge0 - Check IPAM state:
cat /var/run/miniDocker/ipam.json - Verify iptables:
sudo iptables -t nat -L POSTROUTING
Security verification:
- Seccomp: attempt
ptraceinside container — should returnEPERM - Path masking:
cat /proc/kcoreinside container — should read as empty - Capabilities:
cat /proc/self/status | grep Capinside container
man 2 clone,man 2 pivot_root,man 7 cgroups,man 2 seccomp,man 7 capabilities- Kernel cgroup v2 docs
- Kernel OverlayFS docs
- Kernel seccomp docs
- OCI Runtime Spec
golang.org/x/sys— syscall bindingsgithub.com/vishvananda/netlink— netlink interface