
sandbox: slim image from 1.65GB to ~570MB #230

Open
samcm wants to merge 5 commits into master from worktree-slim-sandbox-image


samcm commented Jun 17, 2026


Trims the sandbox image from 1.65GB to ~570MB.

  • requirements.in: drop libs nothing in the product references (polars, scikit-learn, statsmodels, networkx, dask, and the redundant viz stacks altair/vl-convert/bokeh/plotnine/pygwalker/kaleido). The only advertised dataframe type is pandas; matplotlib/seaborn cover chart export (the visualization eval path) and plotly stays for interactive HTML. Recompiled the hash-locked requirements.txt.
  • Replace pyarrow (~140MB of bundled Arrow C++) with fastparquet (~9MB). Nothing uses Arrow format and clickhouse-connect runs on its native format; the only parquet use is the getting-started example's engine-agnostic df.to_parquet()/pd.read_parquet(), which pandas auto-routes to fastparquet.
  • Multi-stage build: dependencies install into a staging prefix that is copied wholesale onto the final image's /usr/local, so neither uv nor a C toolchain ships, while console scripts and package data are preserved. Installs use --only-binary=:all:, so a missing wheel fails loudly rather than compiling an unlocked sdist; build-essential is gone entirely.
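
The parquet fallback in the second bullet can be modelled in a few lines. This is a simplified sketch of the preference order behind pandas' `engine="auto"` (assumed here to be pyarrow first, then fastparquet), not pandas' own code; `pick_parquet_engine` and its `is_installed` parameter are hypothetical names used only for illustration.

```python
from importlib.util import find_spec
from typing import Callable


def _installed(module: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module) is not None


def pick_parquet_engine(is_installed: Callable[[str], bool] = _installed) -> str:
    """Pick the first available engine, mirroring the assumed 'auto' order."""
    for engine in ("pyarrow", "fastparquet"):
        if is_installed(engine):
            return engine
    raise ImportError("no parquet engine available: install pyarrow or fastparquet")
```

With pyarrow removed from the image, the first candidate is skipped and fastparquet is chosen, which is why engine-agnostic `df.to_parquet()` / `pd.read_parquet()` calls should keep working unchanged.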

Note: plotly static export (kaleido) was already broken in the shipped image because it needs a Chrome install that was never present, so dropping kaleido removes dead weight rather than a working capability.

Trim the sandbox Python deps to a curated core and stop persisting
build-essential in the final image.

- requirements.in: drop unused heavy libs (polars, scikit-learn,
  statsmodels, networkx, dask, fastparquet) and redundant viz stacks
  (altair, vl-convert-python, bokeh, plotnine, pygwalker, kaleido).
  Nothing in product code, docs, or LLM-facing content references them;
  the only advertised dataframe type is pandas. matplotlib/seaborn cover
  static chart export (the visualization eval path); plotly is kept for
  interactive HTML. kaleido is dropped because plotly static export
  already required a Chrome install that was never present.
- Dockerfile: install build-essential only to compile any sdist-only
  deps and purge it in the same layer so it never lands in the image.

site-packages 1090MB -> 551MB; build tools removed from the final image.

github-actions Bot commented Jun 17, 2026


🐼 Smoke eval — 5fff6b9: ✅ 6/6 pass

📊 Interactive report — tokens p50 13,922 · tokens/solve 13,842.

Reference points: worktree-slim-sandbox-image@8d0f828 100% · worktree-slim-sandbox-image@890f67f 100% · worktree-slim-sandbox-image@f3fc642 100%.

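| question | result | tokens | tools |
| --- | --- | --- | --- |
| forky_node_coverage | ✅ | 13,334 | 4 |
| tracoor_node_coverage | ✅ | 12,927 | 3 |
| mainnet_block_arrival_p50 | ✅ | 15,541 | 8 |
| list_datasources | ✅ | 12,061 | 2 |
| block_count_24h | ✅ | 14,676 | 10 |
| missed_slots_24h | ✅ | 14,510 | 6 |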
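
The two summary figures in the report line can be reproduced from the table above. The sketch below assumes that "tokens p50" is the median over the six runs and that "tokens/solve" is the mean over solved questions (all six here); the report's actual definitions are not stated in this thread.

```python
from statistics import mean, median

# Token counts copied from the smoke-eval table.
tokens = {
    "forky_node_coverage": 13_334,
    "tracoor_node_coverage": 12_927,
    "mainnet_block_arrival_p50": 15_541,
    "list_datasources": 12_061,
    "block_count_24h": 14_676,
    "missed_slots_24h": 14_510,
}

p50 = median(tokens.values())      # midpoint of the two middle values
per_solve = mean(tokens.values())  # assumed definition of tokens/solve

print(f"tokens p50 {p50:,.0f} · tokens/solve {round(per_solve):,}")
```

Under those assumptions both values agree with the 13,922 and 13,842 quoted in the comment.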
🔭 Langfuse traces (6 runs; ⚠️ = failed)

The report walks this branch's commits against the master baseline and the most recent release. A self-contained copy is in the run's eval-smoke-* artifact.

samcm added 3 commits June 17, 2026 13:41
Move uv + build-essential into a builder stage and copy only the
resolved site-packages into the final image. Neither uv (~50MB) nor the
C toolchain ships anymore, and splitting the toolchain install out of the
dependency install keeps each layer cached independently: editing
requirements.txt no longer re-runs the build-essential apt install.

Final image: 749MB -> 701MB (1.65GB baseline). Session-mode shell helpers
(sh/sleep/find/mkdir/chmod/base64/rm) and the full python stack verified
present.

pyarrow bundles the entire Arrow C++ stack (~140MB: libarrow, Flight,
Substrait, Acero, compute, parquet) but nothing in product code uses
Arrow format or parquet beyond a MIME-type mapping. clickhouse-connect
runs fine on its native format without it. fastparquet (~9MB + cramjam)
gives pandas the same read_parquet/to_parquet capability for
sandbox-authored code, and pandas auto-selects it when pyarrow is absent.

Final image: 701MB -> 574MB (1.65GB baseline).
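
The size figures quoted so far can be cross-checked with simple arithmetic. This sketch treats the 1.65GB baseline as 1650MB, which is an assumption about the units used in the commit messages.

```python
# Image sizes in MB as reported in the commit messages.
BASELINE_MB = 1650  # 1.65GB, assumed to mean 1650MB
BEFORE_BUILDER_STAGE_MB = 749
AFTER_BUILDER_STAGE_MB = 701
AFTER_PYARROW_SWAP_MB = 574


def savings(before: int, after: int) -> tuple[int, float]:
    """Return (MB saved, percent saved relative to `before`)."""
    saved = before - after
    return saved, 100 * saved / before


for label, before, after in [
    ("builder stage", BEFORE_BUILDER_STAGE_MB, AFTER_BUILDER_STAGE_MB),
    ("pyarrow -> fastparquet", AFTER_BUILDER_STAGE_MB, AFTER_PYARROW_SWAP_MB),
    ("total vs baseline", BASELINE_MB, AFTER_PYARROW_SWAP_MB),
]:
    saved_mb, pct = savings(before, after)
    print(f"{label}: -{saved_mb}MB ({pct:.1f}%)")
```

The 48MB saved by the builder stage is close to the ~50MB attributed to uv, and the 127MB saved by the swap is in line with removing ~140MB of pyarrow while adding fastparquet and cramjam.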
…oolchain)

Address review feedback on the multi-stage build:

- Install deps into a staging prefix and COPY the whole /install onto
  /usr/local, so the final image keeps console scripts and wheel data
  files, not just site-packages.
- Add --only-binary=:all: so a missing wheel fails the build loudly
  instead of silently compiling an sdist against builder-only libraries.
- Since every dependency now resolves to a prebuilt wheel and the
  ethpandaops package is pure Python, drop build-essential entirely from
  the builder (smaller, faster, deterministic).

Final image: 568MB (1.65GB baseline). Verified: shell helpers for
session mode, full python stack, parquet via fastparquet, and chart
export all work; uv and the C toolchain are absent.
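
A verification pass like the one described could be scripted roughly as follows. This is a minimal sketch, not the check that was actually run: the module lists are taken from the PR description (using the usual import names, e.g. `sklearn` for scikit-learn), and `gcc`, `cc`, and `make` are assumed stand-ins for the C toolchain.

```python
import shutil
from importlib.util import find_spec

# Import names for packages the PR keeps, drops, and tools it removes.
KEPT = ["pandas", "matplotlib", "seaborn", "plotly", "fastparquet"]
DROPPED = ["polars", "sklearn", "statsmodels", "networkx", "dask",
           "pyarrow", "altair", "bokeh", "plotnine", "pygwalker", "kaleido"]
ABSENT_TOOLS = ["uv", "gcc", "cc", "make"]  # assumed proxies for the toolchain


def audit(kept: list[str], dropped: list[str], tools: list[str]) -> list[str]:
    """Return a list of human-readable problems; empty means the image looks right."""
    problems = [f"missing module: {m}" for m in kept if find_spec(m) is None]
    problems += [f"unexpected module: {m}" for m in dropped if find_spec(m) is not None]
    problems += [f"unexpected tool: {t}" for t in tools if shutil.which(t) is not None]
    return problems
```

Inside the sandbox image, `audit(KEPT, DROPPED, ABSENT_TOOLS)` returning an empty list would correspond to the "full python stack present, uv and the C toolchain absent" claim above. It checks only that modules can be located, not that parquet round-trips or chart export succeed.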
samcm changed the title from "sandbox: slim image from 1.65GB to ~750MB" to "sandbox: slim image from 1.65GB to ~570MB" Jun 17, 2026
The fastparquet floor was carelessly set to 2024.11.0; bump it to
2026.3.0 (resolves to 2026.5.0, released 2026-05-15). Recompile the lock
with --exclude-newer=2026-06-03 so every resolved package is recent but
at least two weeks old rather than a brand-new release. No other pins
changed - they were already settled.
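
The cutoff date follows from the two-week rule. The sketch below assumes the lock was recompiled on the PR date, 2026-06-17; `exclude_newer_cutoff` is a hypothetical helper, not part of uv.

```python
from datetime import date, timedelta


def exclude_newer_cutoff(today: date, min_age_days: int = 14) -> date:
    """Latest release date a package may have and still be at least `min_age_days` old."""
    return today - timedelta(days=min_age_days)


cutoff = exclude_newer_cutoff(date(2026, 6, 17))  # assumed compile date
fastparquet_release = date(2026, 5, 15)           # 2026.5.0, per the commit message

print(f"--exclude-newer={cutoff.isoformat()}")
print("fastparquet 2026.5.0 eligible:", fastparquet_release <= cutoff)
```

Under that assumption the cutoff works out to 2026-06-03, and the fastparquet 2026.5.0 release falls before it.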