Skip to content

blue-samarth/MiniDocker

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

104 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

miniDocker: Container Runtime Implementation

A container runtime written in Go demonstrating core Linux containerization technologies. Implements namespace isolation, cgroups resource limits, filesystem layering, security hardening, and container lifecycle management.

Project Status

All 7 phases completed. Fully functional container runtime with comprehensive test coverage and CI/CD integration.

Metrics:

  • Total codebase: ~6,400 lines of Go
  • Test coverage: 32.3% overall (70–85% of non-root-only code)
  • Unit tests: 700+
  • Performance: ~10,000 containers/sec state creation rate
  • Supported OS: Linux with cgroups v2

Quick Start

Prerequisites:

  • Go 1.21+
  • Linux kernel 5.13+ with cgroups v2 enabled
  • Root privileges for container execution

Build:

go mod download
go build -o miniDocker

Run a container:

sudo ./miniDocker run /var/lib/images/alpine /bin/sh

List containers:

./miniDocker ps

View logs:

./miniDocker logs <container-id>

Architecture Overview

Container Process (PID 1)
  |
  +-- Namespace Isolation
  |    - UTS: hostname
  |    - PID: process tree
  |    - Network: network stack
  |    - Mount: filesystem
  |    - IPC: system V IPC
  |    - User: UID/GID mapping
  |
  +-- Filesystem Isolation (OverlayFS)
  |    - Lower: Read-only image
  |    - Upper: Read-write container layer
  |    - Merged: Union view inside container
  |
  +-- Resource Limits (Cgroups v2)
  |    - Memory: byte limit, OOM kill
  |    - CPU: quota/period throttling
  |    - PIDs: process count limit
  |    - Swap: optional swap limit
  |
  +-- Network Isolation
  |    - Bridge: miniBridge0 on host (172.20.0.1/16)
  |    - IPAM: allocates 172.20.x.x to containers
  |    - veth: virtual ethernet pair per container
  |
  +-- Security Hardening
       - Capabilities: 27 dangerous ones dropped
       - Seccomp BPF: ~200 syscalls blocked
       - Path masking: /proc/kcore, /proc/mem, etc. hidden
       - NO_NEW_PRIVS: prevents setuid/capability escalation

Implementation Phases

Phase 1: Process Execution and Namespace Isolation

Re-execs the binary with an init sentinel to enter the container namespace, then isolates UTS, PID, Network, IPC, Mount, and User namespaces. Maps container UID 0 to the host UID without privilege escalation.

Key files: container/run.go, container/init.go

Namespace flags:

CLONE_NEWUTS   - Isolated hostname/domainname
CLONE_NEWPID   - Container init becomes PID 1
CLONE_NEWNET   - Isolated network interfaces
CLONE_NEWIPC   - Isolated IPC (message queues, semaphores, shared memory)
CLONE_NEWNS    - Isolated mount tree
CLONE_NEWUSER  - UID/GID mapping (container 0 → host UID)

Phase 2: Filesystem Isolation

Mounts an OverlayFS combining a read-only image layer with a per-container read-write upper layer, then calls pivot_root to make the merged view the container's root. Mounts /proc, /sys, /dev, /dev/pts, and /dev/shm.

Key files: fs/overlay.go, fs/mount.go

pivot_root sequence:

  1. Bind-mount newRoot onto itself (MS_BIND | MS_REC)
  2. Create put_old directory inside newRoot
  3. syscall.PivotRoot(newRoot, putOld)
  4. chdir("/") inside new root
  5. Unmount and remove put_old

Phase 3: Resource Control via Cgroups v2

Creates a cgroup under /sys/fs/cgroup/miniDocker/<id>, writes resource limits, then attaches the container process via cgroup.procs.

Key file: cgroups/cgroup.go

Supported limits:

Resource Minimum Format
Memory 4 MiB 256m, 1g
CPU 0.5, 2.0 (fraction of one core)
CPU weight 110000
PIDs integer (0 = unlimited)
Swap ≥ memory same as memory

Note on timing: Cgroups are attached after the container process starts. There is a brief window between cmd.Start() and cgroup attachment during which the process runs without resource limits. This is a known limitation of the current architecture; production runtimes use CLONE_INTO_CGROUP or a pipe-based synchronization to eliminate this window.

Phase 4: Network Stack

Creates a miniBridge0 bridge on the host (172.20.0.1/16), allocates a container IP via IPAM, and establishes a veth pair connecting the container's network namespace to the bridge. Enables IP forwarding and installs an iptables MASQUERADE rule for outbound traffic.

Key files: network/bridge.go, network/ipam.go, network/veth.go

IPAM:

  • Range: 172.20.0.2 – 172.20.255.254 (gateway 172.20.0.1 skipped)
  • State: /var/run/miniDocker/ipam.json (atomic write via .tmp rename)
  • Thread-safe with sync.Mutex; refreshes state from disk on each allocation

Phase 5: Security Hardening

Applies a five-layer security model inside the container after the overlay mount and before exec:

  1. Path masking — bind-mounts /dev/null over 13 sensitive kernel paths (/proc/kcore, /proc/kmem, /sys/firmware, etc.) and remounts 5 paths read-only. Best-effort; missing paths are silently skipped.

  2. Capability dropping — removes 27 capabilities from bounding, effective, permitted, and inheritable sets. Notable drops: CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_SYS_MODULE, CAP_NET_ADMIN.

  3. NO_NEW_PRIVSprctl(PR_SET_NO_NEW_PRIVS, 1) prevents setuid binaries and file capabilities from elevating privilege. Inherited by children.

  4. Seccomp BPF — whitelist filter: ~150 safe syscalls allowed, everything else returns EPERM. Notable omissions from the allowlist:

    • SYS_IOCTL — no argument filtering, significant container-escape surface
    • SYS_PRCTL — would allow installing a nested seccomp filter to re-allow blocked syscalls
    • SYS_CLONE3 — without argument filtering, can enable unsafe clone flags

    Consequence: PR_SET_NAME and PR_GET_DUMPABLE are blocked, which may affect Go runtime thread naming and some Java/JVM diagnostics.

Key files: security/capabilities.go, security/seccomp.go, security/paths.go

Phase 6: CLI and Container Management

Full CLI with run, ps, logs, stop, rm, inspect, and stats commands. State is persisted to JSON files under /var/run/miniDocker/containers/<id>/.

State machine: created → running → exited | error

Key files: cmd/, state/

Note on stop: The PID stored in container state is the host-side PID of the miniDocker run parent process, not the container's PID 1. Signaling it is correct — the parent's exit triggers cleanup — but the naming in the code reflects this explicitly.

Note on detached containers: Cleanup (overlay unmount, cgroup removal, IP release) is deferred to when RunContainer returns. All containers are therefore tied to the lifetime of their parent miniDocker run process. There is no daemon mode.

Phase 7: State Persistence

Atomic state writes via .tmp-then-rename, state recovery on restart by scanning the containers directory, log rotation at 10 MiB with up to 3 rotations, and comprehensive benchmarks.

Key files: state/state.go, state/logs.go, state/lifecycle.go

CLI Reference

Usage: miniDocker <command> [options]

Container lifecycle:
  run     [options] <image-path> <cmd> [args...]   Start a new container
  stop    [-t <sec>] [-f] <id> [...]               Stop a running container
  rm      [-f] <id> [...]                          Remove a container

Inspection:
  ps      [-a] [-q] [--format table|json|ids]      List containers
  logs    [-f] [--tail N] [--timestamps] <id>      Fetch container logs
  inspect [--pretty=false] <id> [...]              Show container metadata
  stats   [-a] [--no-stream] [--interval ms] <id>  Display resource usage

run options:
  --memory  <limit>    Memory limit (e.g. 256m, 1g)
  --swap    <limit>    Swap limit (must be >= memory)
  --cpu     <frac>     CPU fraction (e.g. 0.5, 2.0)
  --cpu-weight <n>     CPU scheduling weight (1–10000)
  --pids    <n>        Max PIDs (0 = unlimited)

Testing

# Unit tests (no root required)
go test -v ./tests/...

# Integration tests (requires root)
sudo go test -v ./tests/...

# Coverage report
go test -v -coverprofile=coverage.out ./tests/...
go tool cover -html=coverage.out -o coverage.html

# Benchmarks
sudo go test -v -bench=. -timeout 120s ./tests/...

Coverage breakdown:

  • 32.3% overall (expected for system software — most untested lines require root, live kernel features, or running containers)
  • 70–85% of non-root-only code
  • Root-only and kernel-dependent paths: integration tests or graceful skips

Directory Structure

miniDocker/
  main.go                       Entry point and command dispatcher
  go.mod, go.sum                Module dependencies
  .github/workflows/ci.yml      GitHub Actions CI/CD

  container/
    run.go                      Parent-side: namespace setup, process launch
    init.go                     Child-side: overlay mount, pivot_root, security

  fs/
    overlay.go                  OverlayFS mounting
    mount.go                    pivot_root and pseudo-filesystem mounts

  cgroups/
    cgroup.go                   Resource limit validation and application

  network/
    bridge.go                   miniBridge0 lifecycle and iptables rules
    ipam.go                     IP address allocation and persistence
    veth.go                     veth pair setup and container-side config

  security/
    capabilities.go             Capability dropping (27 caps, all sets)
    seccomp.go                  Seccomp BPF filter (~150-syscall whitelist)
    paths.go                    Sensitive path masking and read-only remounts

  state/
    types.go                    ContainerStatus, ContainerConfig, error types
    state.go                    Persistent state manager (JSON, RWMutex)
    lifecycle.go                State transition coordinator
    logs.go                     Log file I/O and rotation

  cmd/
    run.go, ps.go, logs.go      CLI commands
    stop.go, rm.go              CLI commands
    inspect.go, stats.go        CLI commands
    helpers.go                  Shared formatting utilities

  tests/
    cmd_test.go                 CLI and helper unit tests
    cgroups_test.go             Resource limit validation tests
    phase3_cgroups_test.go      Additional cgroups tests
    integration_test.go         Full lifecycle tests (root-only)
    network_bridge_veth_test.go Bridge and veth tests
    network_ipam_test.go        IPAM allocation tests
    security_test.go            Security layer tests
    state_test.go               State persistence tests
    run_test.go                 Container execution tests
    init_test.go                Init process tests

Known Limitations

  • No daemon mode. Containers are tied to the lifetime of the miniDocker run process. There is no background daemon or detach flag.
  • Cgroup timing window. Resource limits are applied after process start, not before. A brief uncontrolled window exists between fork and cgroup attachment.
  • x86-64 only. The seccomp BPF filter validates AUDIT_ARCH_X86_64 at load time and kills the process on architecture mismatch.
  • prctl blocked. SYS_PRCTL is excluded from the seccomp whitelist to prevent nested filter installation. This breaks PR_SET_NAME (thread naming) used by Go, Java, and some C runtimes. Workloads relying on thread names may behave unexpectedly.
  • No volume mounts. Only OverlayFS layering is supported; there is no bind-mount or -v equivalent.
  • No image pull. Images must be pre-existing directory trees on the host.

Troubleshooting

Container won't start:

  • Verify the image path exists: ls <image-path>
  • Check cgroup controllers: cat /sys/fs/cgroup/cgroup.controllers
  • Confirm namespace support: ls /proc/sys/kernel/unprivileged_userns_clone

Network not working:

  • Check bridge: ip link show miniBridge0
  • Check IPAM state: cat /var/run/miniDocker/ipam.json
  • Verify iptables: sudo iptables -t nat -L POSTROUTING

Security verification:

  • Seccomp: attempt ptrace inside container — should return EPERM
  • Path masking: cat /proc/kcore inside container — should read as empty
  • Capabilities: cat /proc/self/status | grep Cap inside container

References

About

Container runtime in Go — namespaces, cgroups v2, OverlayFS, seccomp, and veth networking from scratch

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages