VC3D: matter-compressor render cache (one .mca per volume, mc_cache resident) + fetch/render perf #1036

Open
SuperOptimizer wants to merge 67 commits into ScrollPrize:main from SuperOptimizer:mca_cache

@SuperOptimizer

Contributor

What

Replaces VC3D's render caching with matter-compressor: remote (and local) chunks are fetched once, re-encoded into ONE persistent volume.mca per volume (all LODs, crash-safe, appendable), and served as 16³ blocks through mc_cache (sharded S3FIFO resident cache inside the library). VC3D-side caching scaffolding is deleted; the ChunkCache keeps only per-256³-region fetch status plus the async orchestration (worker pool, listeners). Non-uint8 volumes keep the legacy byte-LRU path.

Vendored matter-compressor is at 1e53dbb (format v6: per-axis dims, self-contained blocks, SIMD decode). Stale/older-format cache files are detected and rebuilt automatically.

Fetch path

  • per-region single-flight; worker-pool tasks interleave across regions so 48 I/O workers fetch distinct regions concurrently (workers are ~85% network-blocked)
  • remote sharded zarr reads: shard index table cached after one GET — one S3 round trip per chunk instead of two (halves cold-fetch latency)
  • 128³-chunk sources assemble their 256³ region with parallel sub-fetches
  • download stats/status bar count only bytes actually pulled from the source; archive serves are not "downloading"

Rendering / frontend

Profile-driven (perf + --profile logs, before/after in the commit messages):

  • interactive preview system deleted (~420 lines): no more transformed-stale-frame previews, no 140ms settle timer, no 250/500ms chunk-ready coalescing windows (which starved repaints while chunks streamed). Interactions schedule normal full renders behind the single 16ms debounce; cache-side chunk-ready callbacks are throttled to ~30Hz with a guaranteed final notify
  • speculative prefetch deleted: render-worker tryGetChunk misses queue exactly what requested frames need; the mca archive makes refetches cheap
  • stale-render coalescing: a submit while the worker is busy only invalidates the in-flight frame if view parameters changed (fingerprint); data-only refreshes display it. Discarded renders: 19% → 5%
  • pool submission moved outside the cache mutex; ChunkKey hash packs to 64 bits + one mix; shift/mask chunk indexing in the trilinear sampler hot path
  • main-thread stalls (formerly up to 284ms in submitRender) are gone — the UI thread no longer registers in CPU profiles
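The ~30Hz chunk-ready throttle with a guaranteed final notify, mentioned in the list above, could look roughly like the sketch below. The class name `NotifyThrottle`, its methods, and the injected millisecond clock are hypothetical; the cache's actual implementation is not shown in this PR text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: rate-limit "chunk ready" callbacks, but never lose the last one.
class NotifyThrottle {
public:
    explicit NotifyThrottle(std::int64_t minIntervalMs) : intervalMs_(minIntervalMs) {
        assert(minIntervalMs > 0);
    }

    // Called whenever a chunk becomes ready. Returns true if listeners should be
    // notified right now; otherwise the notification is remembered as pending.
    bool onChunkReady(std::int64_t nowMs) {
        if (!everFired_ || nowMs - lastFireMs_ >= intervalMs_) {
            lastFireMs_ = nowMs;
            everFired_ = true;
            pending_ = false;
            return true;
        }
        pending_ = true;
        return false;
    }

    // Called when the fetch queue drains. Returns true if a suppressed
    // notification is still owed (the "guaranteed final notify").
    bool onDrain(std::int64_t nowMs) {
        if (!pending_) return false;
        pending_ = false;
        lastFireMs_ = nowMs;
        everFired_ = true;
        return true;
    }

private:
    std::int64_t intervalMs_;
    std::int64_t lastFireMs_ = 0;
    bool everFired_ = false;
    bool pending_ = false;
};
```

With a 33ms interval this yields roughly 30 notifications per second during a burst, plus one extra on drain if the last arrival was suppressed.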

Also

  • fixed a segfault on every exit: gnutls/libtasn1 DSO destructors free through mimalloc after its teardown in _dl_fini; VC3D now runs real teardown then _exit()

Measured end state

Steady-state (warm archive): blocks decode from volume.mca via mc_cache, zero network, ~4% CPU in mc block decode. Cold exploration: c3d decode + mc encode dominate pool threads (once per region ever, amortizes out). Render frames are 94% sampler kernel — follow-up territory (pan reuse, kernel work).

Known: test_volume_local exact-roundtrip fails on this branch (reads go through the lossy q=8 mca; pre-existing design question, not a regression).

SuperOptimizer added 15 commits June 9, 2026 15:16
Vendor SuperOptimizer/matter-compressor @ ab0649c into libs/. This snapshot has the
appendable dense-node archive: mc_writer_open / mc_append_chunk_raw /
mc_append_chunk_compressed / mc_writer_close (persistent, crash-safe, reopened across
runs) + mc_open_streaming (byte-source range-GET reader). Foundation for the mca
streaming/re-encode cache wired into the chunk fetch path.
…nto vc_core

- re-vendor matter-compressor @ 1524688 (unified mc_archive read+write handle,
  vendoring-friendly CMake)
- add_subdirectory(libs/matter-compressor) + link matter_compressor into vc_core
- MatterArchive: RAII C++ wrapper around mc_archive_open/append_chunk_raw/
  chunk_offset/decode_block/close. Storage/encode unit 256^3; decode/serve unit 16^3
  (the granularity the resident chunk cache will key on). Append is thread-safe; decode
  serialized by the underlying archive (codec quality is process-global).
MatterCacheFetcher decorates each level's source (zarr/c3d) fetcher to re-express the
volume through one persistent matter-compressor (.mca) archive: 'fetch native, serve
mca-native'.

- The volume is reported to the ChunkCache at mca's native 16^3 chunk granularity (the
  resident cache holds 4KB blocks).
- fetch(16^3 key) -> enclosing 256^3 mca region. On a miss, fetch the SOURCE's native
  chunks covering it (256^3 c3d = 1:1; 128^3 zarr-v2 = eager 2x2x2 coalesce), assemble
  one 256^3 u8 buffer, encode it into the .mca once. Then decode the requested 16^3
  block out of the .mca.
- One .mca holds all chunks at all 8 LODs and persists across runs (skips re-fetch on a
  warm cache). Region-materialization is memoized + checks the persisted archive.
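As an illustration of the "eager 2x2x2 coalesce" step, here is a minimal sketch of copying one cubic source chunk into its slot of a larger cubic region buffer. `placeSubChunk` is a hypothetical helper, and the dense z-major layout is an assumption; edges are parameters so the same code covers 128^3 into 256^3.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copy a dense cubic sub-chunk (edge srcEdge) into a dense
// cubic region buffer (edge regionEdge). (sz, sy, sx) is the sub-chunk's index
// inside the region, e.g. 0..1 per axis for 128^3 chunks in a 256^3 region.
// Assumes both buffers are uint8 in z-major (z, y, x) order.
inline void placeSubChunk(std::vector<std::uint8_t>& region, std::size_t regionEdge,
                          const std::vector<std::uint8_t>& sub, std::size_t srcEdge,
                          std::size_t sz, std::size_t sy, std::size_t sx)
{
    assert(srcEdge > 0 && regionEdge % srcEdge == 0);
    assert(region.size() == regionEdge * regionEdge * regionEdge);
    assert(sub.size() == srcEdge * srcEdge * srcEdge);
    assert((sz + 1) * srcEdge <= regionEdge);
    assert((sy + 1) * srcEdge <= regionEdge);
    assert((sx + 1) * srcEdge <= regionEdge);

    for (std::size_t z = 0; z < srcEdge; ++z) {
        for (std::size_t y = 0; y < srcEdge; ++y) {
            // One contiguous x-row per (z, y) pair.
            const std::size_t dst =
                ((sz * srcEdge + z) * regionEdge + (sy * srcEdge + y)) * regionEdge + sx * srcEdge;
            const std::size_t src = (z * srcEdge + y) * srcEdge;
            std::memcpy(&region[dst], &sub[src], srcEdge);
        }
    }
}
```

Sub-chunks that are missing simply leave their slot at the buffer's initial value (zero in the commit's description).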

Wired in createChunkCache, gated by VCA_MCA_CACHE=<path.mca> (+ VCA_MCA_QUALITY,
default 8); uint8 volumes only (mca is u8). Off by default. Full VC3D builds + links.
…cache dir

The mca integration was in ZarrChunkFetcher's createChunkCache, which VC3D does NOT call
-- the live path is Volume::createChunkCache. Move it there:
- ONE persistent volume.mca lives in the volume's existing remote cache dir
  (remoteCacheRoot_/id()), NOT /tmp -- same place other chunks are cached.
- When mca engages it REPLACES the old per-chunk-file persistent cache (mca IS the
  persistence), so persistentCachePath is left unset; the old per-chunk cache remains
  only as the fallback when mca is off (non-uint8, local, or VCA_NO_MCA_CACHE).
- Default ON for remote uint8 volumes; VCA_NO_MCA_CACHE disables; VCA_MCA_QUALITY sets q.

applyMatterCache() factored into ZarrChunkFetcher (wrap fetchers + 16^3 LevelInfo).
Also deleted the stale 503GB .vca .vcacache disk-cache artifact (superseded by mca).
Rip out the entire per-chunk-file persistent cache. On-disk caching is now ONLY the
single per-volume matter-compressor archive (volume.mca), and ALL volumes -- remote
AND local -- go through it.

- ChunkCache: deleted readPersistent/readPersistentEmpty/queuePersistentWrite/
  queuePersistentEmptyWrite/writePersistent/writePersistentEmpty/persistentPath/
  persistentEmptyPath/persistentCacheBytes(dir)/persistentCacheWriterPool + the
  Entry::persisted field + the persist-on-evict path. The fetch worker just calls
  fetch() (the MatterCacheFetcher owns mca read/write).
- Options::persistentCachePath -> Options::mcaPath (path to the single volume.mca);
  the 'disk' cache-size stat now reports that file's size (a throttled file_size),
  not a recursive dir scan of millions of per-chunk files.
- ChunkFetchResult: dropped persistentBytes/hasPersistentBytes; IChunkFetcher dropped
  persistentCacheExtension/decodePersistentBytes; ZarrChunkFetcher dropped its
  encoded-c3d persistent path.
- Volume::createChunkCache: mca for every volume (remote -> remote cache dir; local ->
  <dataset>.mcacache/), uint8 only, VCA_NO_MCA_CACHE disables. Viewer no longer sets a
  per-chunk persistent path.
- Tests: removed the per-chunk-cache TEST_CASEs (kept the unrelated ones); all pass.

Also deleted the stale 503GB .vca .vcacache and the 81GB per-chunk chunk_cache on disk.
…port

Root cause: Volume::createChunkCache resolved the mca dir from Volume::remoteCacheRoot_,
but that member was empty for the render path's volumes -- the GUI resolves the cache
root (remoteCacheRootForState: /volpkgs|/ephemeral|settings) and the old code passed it
via Options, which I'd removed. So cacheDir came up empty, applyMatterCache was skipped,
and the slow raw S3-streaming path ran (pegging CPU + 22 S3 conns) while 'disk' read 0
(options_.mcaPath unset).

Fix: add Volume::setRemoteCacheRoot(); the viewer pushes the GUI-resolved root into the
volume before createChunkCache builds the cache. Now mca engages (verified: 'mca cache
enabled', one volume.mca written + grown, chunks served from it, disk stat reads its
size). Added diagnostics: createChunkCache logs isRemote/cacheDir/levels, applyMatterCache
logs why it skips.
Two fixes for 'spinning, barely downloading' (CPU-bound, no progress):
- Re-vendor lock-free mc_archive_decode_block (was serializing every 16^3 decode on one
  mutex -> all cache-IO threads spun on one lock).
- MatterCacheFetcher::ensureRegion: per-256^3-region single-flight. A render touches up
  to 4096 16^3 blocks of the same region nearly simultaneously; before, EACH thread
  passed the not-done check and redundantly re-fetched the same 8 source chunks +
  re-encoded the same 256^3 region. Now the first thread claims the region (InFlight),
  assembles+encodes it once, and publishes Present/Absent; the rest wait on a condvar for
  that one assembly. Eliminates the redundant fetch/encode storm.
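The per-region single-flight described above can be sketched as follows. This is not the MatterCacheFetcher code; `RegionSingleFlight` and its `ensure` method are hypothetical names, and the sketch assumes `assemble` does not throw (a throwing callback would leave the region InFlight forever).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

enum class RegionState { InFlight, Present, Absent };

// Hypothetical sketch: the first caller for a key runs the work, everyone else
// waits on the condition variable and reuses the published result.
class RegionSingleFlight {
public:
    // `assemble` returns true if the region contained data (Present) or false
    // if it turned out to be empty (Absent). It runs at most once per key.
    RegionState ensure(std::uint64_t key, const std::function<bool()>& assemble) {
        std::unique_lock<std::mutex> lock(mu_);
        auto it = states_.find(key);
        if (it == states_.end()) {
            states_.emplace(key, RegionState::InFlight);  // claim the region
            lock.unlock();
            const bool hasData = assemble();              // fetch + encode, outside the lock
            lock.lock();
            const RegionState result = hasData ? RegionState::Present : RegionState::Absent;
            states_[key] = result;
            cv_.notify_all();                             // wake the waiters
            return result;
        }
        cv_.wait(lock, [&] { return states_.at(key) != RegionState::InFlight; });
        return states_.at(key);
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::unordered_map<std::uint64_t, RegionState> states_;
};
```

Running the work outside the lock is what lets other regions make progress while one region is being assembled.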
Brings the unified archive format v4 (self-contained blocks, trained priors),
the mc_cache sharded CLOCK/NRU decoded-block cache, and lock-free decode.
- MatterArchive owns an mc_cache bound to the archive; decodeBlock is a
  mc_cache_get_copy. Stale/incompatible volume.mca is deleted and rebuilt.
- ChunkCache in mca mode keeps NO decoded bytes and tracks status per 256^3
  REGION (corner-block key): one entry + one fetch task per region instead of
  4096 per-block entries duplicating what the archive + mc_cache already know.
  Resolved blocks decode straight from mc_cache. Legacy byte-LRU remains only
  for non-uint8 volumes (mca is u8).
- fetch throughput: pool tasks interleave across regions (rank priority),
  48 I/O workers (they are mostly network-blocked), parallel sub-chunk
  assembly for 128^3 sources, prefetch keys snapped to region corners.
- pool submission moved outside the cache mutex; chunk-ready callbacks
  throttled to ~30Hz with a guaranteed final notify on drain.
- ChunkKey hash packs to 64 bits (3 lod + 3x20 coord bits) + one mix.
- download stats count only bytes actually pulled from the source; blocks
  served from the on-disk archive are not downloads.
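A sketch of the key packing described above (3 LOD bits plus three 20-bit coordinates, then one mix). The field order and the particular mix function are assumptions; only the bit budget comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical key layout; field names are illustrative.
struct ChunkKey {
    std::uint32_t lod;      // 0..7
    std::uint32_t z, y, x;  // chunk coordinates, each < 2^20
};

// Pack into 63 bits: [lod:3][z:20][y:20][x:20].
constexpr std::uint64_t packChunkKey(const ChunkKey& k) {
    return (static_cast<std::uint64_t>(k.lod & 0x7u) << 60) |
           (static_cast<std::uint64_t>(k.z & 0xFFFFFu) << 40) |
           (static_cast<std::uint64_t>(k.y & 0xFFFFFu) << 20) |
            static_cast<std::uint64_t>(k.x & 0xFFFFFu);
}

// One multiply/xor-shift round (assumed; the PR does not name its mix).
// Every step is invertible, so distinct packed keys give distinct hashes.
constexpr std::uint64_t mix64(std::uint64_t v) {
    v ^= v >> 32;
    v *= 0x9E3779B97F4A7C15ull;
    v ^= v >> 29;
    return v;
}

constexpr std::uint64_t hashChunkKey(const ChunkKey& k) {
    return mix64(packChunkKey(k));
}
```

Because the packed value is unique per key, equality checks can compare the packed 64-bit integers directly.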
…unk)

Remote sharded reads did two ranged GETs per chunk (16-byte index entry +
payload); over S3 the index round trip doubled per-chunk latency. Fetch the
whole index table once per shard and serve entries from RAM.
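The idea of fetching a shard's index table once and serving entries from RAM might be sketched like this. `ShardIndexCache` and the injected `FetchTable` callback are hypothetical, and the entry layout (two little-endian uint64 values, offset then length) is an assumption consistent with the 16-byte entries mentioned above. The sketch is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct IndexEntry {
    std::uint64_t offset;  // byte offset of the chunk payload inside the shard
    std::uint64_t nbytes;  // payload length
};

// Decode a little-endian uint64 regardless of host endianness.
inline std::uint64_t readLe64(const std::uint8_t* p) {
    std::uint64_t v = 0;
    for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
    return v;
}

// Hypothetical sketch: one ranged GET per shard for the whole index table,
// then every per-chunk lookup is a memory read.
class ShardIndexCache {
public:
    using FetchTable = std::function<std::vector<std::uint8_t>(const std::string& shard)>;

    explicit ShardIndexCache(FetchTable fetch) : fetch_(std::move(fetch)) {}

    IndexEntry entry(const std::string& shard, std::size_t chunkInShard) {
        auto it = tables_.find(shard);
        if (it == tables_.end())
            it = tables_.emplace(shard, fetch_(shard)).first;  // the only network trip
        const std::vector<std::uint8_t>& table = it->second;
        assert((chunkInShard + 1) * 16 <= table.size());
        const std::uint8_t* p = table.data() + chunkInShard * 16;
        return IndexEntry{readLe64(p), readLe64(p + 8)};
    }

private:
    FetchTable fetch_;
    std::unordered_map<std::string, std::vector<std::uint8_t>> tables_;
};
```

After the first lookup in a shard, each chunk costs only the payload GET.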
- interactive preview deleted outright (~420 lines): coarse axis-slice
  preview, stable-frame transform previews, 50ms preview rate cap, 140ms
  settle timer. Interactions schedule normal full-quality renders.
- chunk-ready 250/500ms restart-on-arrival windows deleted; they starved
  repaints for as long as chunks kept streaming. The cache-side 30Hz
  throttle + the 16ms render debounce are the only coalescing layers.
- speculative prefetch disabled (visible-set warming, viewport halos,
  normal-direction neighbors): render-worker tryGetChunk misses queue
  exactly what requested frames need; the mca archive makes refetches cheap.
- prefetch key enumeration is region-granular via IChunkedArray::prefetchShape
  (was 512x more keys than designed for after the 16^3 re-expression).
- stale-render fix: a busy-time submit only invalidates the in-flight frame
  when view params actually changed (fingerprint); data-only refreshes let it
  display. Discarded renders drop from 19% to 5%.
- status bar only reports downloading when bytes move from the source.
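The stale-render rule above (invalidate the in-flight frame only when the view fingerprint changes) could be sketched as below. `ViewParams`, `RenderCoalescer`, and the FNV-1a fingerprint are hypothetical stand-ins; the viewer's real parameter set and hash are not shown in this PR text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical view description; the real viewer has more parameters.
struct ViewParams {
    float cx, cy, cz;
    float scale;
    std::int32_t lod;
};

// FNV-1a over raw bytes, used here only as a simple stand-in fingerprint.
inline std::uint64_t fnv1a(std::uint64_t h, const void* data, std::size_t n) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    for (std::size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= 1099511628211ull;
    }
    return h;
}

inline std::uint64_t fingerprint(const ViewParams& v) {
    std::uint64_t h = 14695981039346656037ull;
    h = fnv1a(h, &v.cx, sizeof v.cx);
    h = fnv1a(h, &v.cy, sizeof v.cy);
    h = fnv1a(h, &v.cz, sizeof v.cz);
    h = fnv1a(h, &v.scale, sizeof v.scale);
    h = fnv1a(h, &v.lod, sizeof v.lod);
    return h;
}

class RenderCoalescer {
public:
    // A worker starts rendering a frame for view `v`.
    void start(const ViewParams& v) {
        busy_ = true;
        stale_ = false;
        inflightFp_ = fingerprint(v);
    }
    // A submit arrives while the worker is busy: only a changed view makes
    // the in-flight frame stale; a data-only refresh leaves it displayable.
    void submitWhileBusy(const ViewParams& v) {
        assert(busy_);
        if (fingerprint(v) != inflightFp_) stale_ = true;
    }
    // The worker finished. Returns true if the frame should be displayed.
    bool finish() {
        assert(busy_);
        busy_ = false;
        return !stale_;
    }

private:
    bool busy_ = false;
    bool stale_ = false;
    std::uint64_t inflightFp_ = 0;
};
```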
Every exit crashed in _dl_fini: gnutls/libtasn1 destructors free through
mimalloc after its own teardown. Run real teardown (CWindow dtor, settings),
flush, _exit().
Chunk shapes are powers of two (16 mca, 128/256 zarr); replaces 3 idivs per
voxel read in trilinear/nearest sampling.
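For power-of-two chunk edges, the division and modulo in the sampler reduce to a shift and a mask. A minimal sketch, with hypothetical names (`ChunkAxis`, `makeAxis`):

```cpp
#include <cassert>
#include <cstdint>

// Precomputed per-axis constants for a power-of-two chunk edge.
struct ChunkAxis {
    unsigned shift;      // log2(edge)
    std::uint32_t mask;  // edge - 1
};

inline ChunkAxis makeAxis(std::uint32_t edge) {
    assert(edge != 0 && (edge & (edge - 1)) == 0);  // must be a power of two
    unsigned shift = 0;
    while ((1u << shift) < edge) ++shift;
    return ChunkAxis{shift, edge - 1};
}

// Equivalent to v / edge, without an integer divide.
inline std::uint32_t chunkIndex(std::uint32_t v, ChunkAxis a) { return v >> a.shift; }

// Equivalent to v % edge.
inline std::uint32_t localOffset(std::uint32_t v, ChunkAxis a) { return v & a.mask; }
```

For example, voxel coordinate 1000 with a 16-wide chunk lands in chunk 62 at local offset 8.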
v6: per-axis dims, per-chunk q, xxh64, per-volume priors; SIMD decode kernels;
S3FIFO default mc_cache eviction (scan-resistant for render loops).
MatterArchive opens with the real volume shape instead of a padded cube;
existing v4 caches are auto-recreated.
The prefetch paths were disabled; remove them: plane/surface halos,
normal-direction neighbors, visible-set warming, the surface tile prefetch
cache, and IChunkedArray::prefetchShape (only consumer). ~550 lines.
@vercel

vercel Bot commented Jun 10, 2026

Someone is attempting to deploy a commit to the scroll Team on Vercel.

A member of the Team first needs to authorize it.

SuperOptimizer and others added 5 commits June 10, 2026 10:19
…ored tests

Upstream consolidated 11 source files into one header + one implementation;
VC vendors just the pair (same code, same format — existing caches unaffected).
…p,cpp}

One file pair for the whole mca layer: the archive/mc_cache RAII wrapper and
the IChunkFetcher decorator that fills it from the source volume.
@SuperOptimizer

Contributor Author

@codex review

@chatgpt-codex-connector

💡 Codex Review

const int srcEdge = opened.chunkShapes[i][0]; // assume cubic native chunks

P1 Badge Preserve non-cubic Zarr chunk shapes

This enables the MCA wrapper for every uint8 pyramid but collapses each source chunk shape to opened.chunkShapes[i][0]; the rest of MatterCacheFetcher then computes source chunk indices and copies srcEdge_^3 bytes as if all axes had that same edge. openLocalZarrPyramid/openHttpZarrPyramid preserve arbitrary 3-D chunk shapes from Zarr metadata, so a valid volume chunked like {64,64,128} will have half of each x row ignored and neighboring source chunk coordinates computed incorrectly once MCA is enabled. Please either skip MCA unless all chunk dimensions are equal and divide 256, or pass the full chunk shape through the fetcher.


} else if (!path_.empty()) {
cacheDir = path_.parent_path() / (path_.filename().string() + ".mcacache");

P1 Badge Invalidate local MCA archives after writes

Creating a persistent .mcacache/volume.mca for local volumes makes mutable local datasets serve stale data: after a region has been cached, Volume::writeZYX updates the Zarr files and only calls invalidateCache() (line 1819), which resets the in-memory ChunkCache but does not remove or update this sibling archive. The next read reopens the old archive and MatterCacheFetcher::ensureRegion trusts archive_->hasChunk, so edited chunks can keep returning pre-write bytes until the cache directory is manually deleted. Please avoid persistent MCA for local mutable volumes or invalidate the corresponding archive contents on writes.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

SuperOptimizer added 2 commits June 10, 2026 15:54
Vendors SuperOptimizer/libs3 @01577e0 (minimal C23 S3 client: GET/HEAD/range,
SigV4, env/INI/SSO/IMDSv2 credential chain, retries, process abort flag).
utils::HttpClient keeps its API but becomes a ~180-line adapter; aws_auth.cpp
and the 500-line curl implementation are deleted. Consumers unchanged.
Credential discovery now finds SSO profiles without AWS_PROFILE exported.
…tion

- applyMatterCache requires cubic source chunks whose edge divides 256
  (the region assembler's assumption); other shapes fall back to raw cache.
- Volume::invalidateCache (write paths only) deletes the local sibling
  .mcacache so edited local volumes never serve stale pre-write bytes.
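The chunk-shape guard in the first bullet amounts to a small predicate. The function names below are hypothetical; the rule itself (cubic chunks whose edge divides 256, on every level) is taken from the commit message.

```cpp
#include <array>
#include <cassert>
#include <vector>

// True if a source chunk shape is cubic and its edge divides the region edge.
inline bool mcaCompatibleChunk(const std::array<int, 3>& shape, int regionEdge = 256) {
    assert(regionEdge > 0);
    return shape[0] > 0 && shape[0] == shape[1] && shape[1] == shape[2] &&
           regionEdge % shape[0] == 0;
}

// The mca path is only taken if every pyramid level passes the check;
// otherwise the caller would fall back to the raw cache.
inline bool mcaCompatiblePyramid(const std::vector<std::array<int, 3>>& levelChunkShapes) {
    if (levelChunkShapes.empty()) return false;
    for (const auto& shape : levelChunkShapes)
        if (!mcaCompatibleChunk(shape)) return false;
    return true;
}
```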
@SuperOptimizer

Contributor Author

Both codex P1s addressed in 8daf20c:

  • Non-cubic chunk shapes: applyMatterCache now requires cubic source chunks whose edge divides 256 on every level (the region assembler's assumption) and falls back to the raw cache otherwise. No known producer emits non-cubic chunks, so this is a guard rather than full anisotropic support.
  • Stale local archives after writes: Volume::invalidateCache — which is called from every write path (writeZYX, removeChunk, pyramid rebuild) and nowhere else — now also deletes the local sibling .mcacache, so reads after an edit rebuild from the updated zarr. Remote archives are unaffected (immutable sources).

@SuperOptimizer

Contributor Author

@codex review

Reads go through the lossy mca render cache (q=8); the exact-value checks
predate that. MAE < 16 / PSNR > 22 dB (measured: ~7.7 / ~26 dB).
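For reference, the two metrics in that tolerance check can be computed as in the sketch below, using the standard definitions for 8-bit data: MAE is the mean of |ref - got|, and PSNR is 10·log10(255² / MSE). The helper names are hypothetical; the thresholds are the ones quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct ErrorStats {
    double mae;     // mean absolute error, in gray levels
    double psnrDb;  // peak signal-to-noise ratio, in dB (infinite if identical)
};

inline ErrorStats compareU8(const std::vector<std::uint8_t>& ref,
                            const std::vector<std::uint8_t>& got) {
    assert(!ref.empty() && ref.size() == got.size());
    double sumAbs = 0.0, sumSq = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - static_cast<double>(got[i]);
        sumAbs += std::fabs(d);
        sumSq += d * d;
    }
    const double n = static_cast<double>(ref.size());
    const double mse = sumSq / n;
    const double psnr = (mse == 0.0) ? std::numeric_limits<double>::infinity()
                                     : 10.0 * std::log10(255.0 * 255.0 / mse);
    return ErrorStats{sumAbs / n, psnr};
}

// Thresholds quoted in the commit message: MAE < 16 and PSNR > 22 dB.
inline bool withinLossyTolerance(const ErrorStats& s) {
    return s.mae < 16.0 && s.psnrDb > 22.0;
}
```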

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 95f5993ce7

Comment on lines +182 to +183
if (sub.status != ChunkFetchStatus::Found || sub.bytes.empty())
return; // missing/air sub-chunk -> stays zero in the region

P1 Badge Propagate sub-fetch failures before encoding the region

When a 256³ MCA region is assembled from multiple native chunks (e.g. 128³ remote zarr), this treats every non-Found sub-fetch the same as an intentionally missing/air chunk. If one subrequest returns HttpError/DecodeError while another subchunk succeeds, anyData becomes true and the partially zero-filled region is appended to volume.mca, so later reads (and future runs) serve permanently corrupted zeros instead of retrying or surfacing the error.

Comment thread volume-cartographer/core/src/Volume.cpp Outdated
std::vector<vc::render::ChunkCache::LevelInfo> mcaLevels;
const bool mcaDisabled = std::getenv("VCA_NO_MCA_CACHE") != nullptr;
bool mcaOn = false;
if (!mcaDisabled && !cacheDir.empty()) {

P1 Badge Keep local volume reads exact by default

This enables the default q=8 MCA path for local volumes too, but Volume::readZYX reads through chunkedCache() and matter-compressor is lossy at that quality. For any local uint8 dataset with a cache directory (including immediately after writeZYX), API reads and samples now return reconstructed values rather than the exact zarr bytes; the cache should be restricted to render-only/remote/opt-in or use a lossless mode for local editable volumes.

SuperOptimizer added 26 commits June 10, 2026 20:06
Bring the new matter-compressor streaming layer into volume-cartographer:
- vendor src/mc_zarr.c, mc_volume.c, c3d.c (+ headers) from matter-compressor
- add matter_compressor_volume static lib (links VC's libs/libs3 + zstd + c3d),
  built whether standalone or vendored so the VC3D render shim can link it
- update libs/libs3 to latest upstream (async batch API, s3_get_range_into,
  range coalescing); was a strict superset, all existing callers unaffected
- order add_subdirectory(libs3) before matter-compressor (volume lib links it)
- adopt the archive determinism flags (-ffp-contract=off, no -ffast-math) on
  both matter_compressor and the c3d TU, matching upstream

mc_volume is the remote-zarr stream/transcode/cache/prefetch layer that will
replace VC3D's MatterCache render path. Builds clean in VC (dev-gcc).
The mc_volume-backed render adapter: a thin IChunkedArray pass-through with
no entry table / LRU / fetchers. tryGetChunk -> mc_volume_try_block (present:
copy the 16^3 block; absent: async kick + MissQueued -> coarser-LOD fallback);
getChunkBlocking -> mc_volume_get_block; chunk-ready listeners driven by
mc_volume's transcode-complete callback; stats from mc_volume_get_stats.

Replaces the ChunkCache+MatterArchive+ZarrChunkFetcher stack on the render
path. Links matter_compressor_volume into vc_core; builds clean (dev-gcc).
Step 3a: the GUI render path now serves remote zarr (c3d/blosc) volumes from
mc_volume instead of ChunkCache+MatterArchive+ZarrChunkFetcher.

- Volume::createChunkCache returns IChunkedArray; for remote non-.mca URLs it
  builds a McVolumeArray (mc_volume) and returns it. .mca-mirror + local-zarr
  paths unchanged for now.
- Lift the GUI-facing surface onto IChunkedArray: Stats, stats(), shardBatch(),
  prefetchShardBlocking(), beginViewRequest() (default no-ops). ChunkCache and
  McVolumeArray both implement it; the sampler already took IChunkedArray&.
- Retype _chunkArray / chunkedCache_ / VolumePrefetcher cache / GUI helpers from
  ChunkCache to IChunkedArray. The GUI calls only interface methods now.

VC3D + vc_render_tifxyz + vc_cache_prefetch build+link clean (dev-gcc).
ChunkCache/ZarrChunkFetcher remain for the tracer (Chunked3d) path.
VC3D's render path now goes entirely through matter-compressor. renderFrame
unifies plane and quad onto surf->gen() -> McVolumeArray::render() (mc_render),
deleting ChunkedPlaneSampler (.cpp/.hpp + 4 tests, ~2350 lines), the coverage
mask, and the C++ composite layer-stack loops. Vendor mc_render/mc_sample/mc_s3.

Fixes:
- extern "C" guards on mc_render.h/mc_sample.h (undefined mc_render_pick_lod).
- QuadSurface::gen() returns a non-continuous ROI view; clone to continuous
  before handing the flat ptr to mc_render (was shearing/streaking the quad).
- Pyramid is power-of-two with chunk-padded shapes, so LOD scale stays 2^L.
…split)

Pull upstream a88266a (mc_render 3D resampling: surface volumes + oriented
boxes) and the decode-vs-encode timing split into VC's vendored copy.
…ter_compressor

Upstream 9a28fb7 folded mc_sample.{c,h}, mc_render.{c,h}, mc_sample_internal.h
into the single matter_compressor.{c,h} pair. Mirror that: delete the 5 folded
files, sync all sources, drop them from CMake (matter_compressor = just
matter_compressor.c), and include only mc_volume.h in McVolumeArray (it pulls in
matter_compressor.h). Also picks up the mc-decode/mc-download thread naming.
Upstream folded mc_volume/mc_zarr/mc_s3 into one matter_compressor.{c,h} pair.
Mirror that into VC's vendored copy: delete the 6 folded files, collapse the
CMake to a single matter_compressor target (matter_compressor.c + c3d.c, links
libs3 + zstd), keep matter_compressor_volume as an ALIAS so consumers resolve.
McVolumeArray now includes matter_compressor.h.

Wire up the runtime RAM-cache controls the merged TU exposes:
- McVolumeArray::stats() populates the decoded-RAM gauge (cache_used/cap_blocks)
  and a download-rate estimate (net_bytes delta / wall-clock, light EMA).
- IChunkedArray::setDecodedByteCapacity + McVolumeArray impl -> mc_volume_set_cache_bytes.
- Settings dialog applies the RAM cache GB live (resize the active volume's cache).
The .vca/.mca export + recompression flow lives in matter-compressor now; the
old SigV4/HttpClient-based zarr recompressor is dead. Removes one http_fetch +
c3d consumer.
createChunkCache now opens every volume via mc_volume — remote zarr streams +
transcodes into a local .mca (as before), and local zarr directories use
mc_volume's new local-filesystem source (sibling .mcacache dir). One render/
cache path; no ChunkCache/MatterCache/ZarrChunkFetcher construction. The zarr
metadata openers stay for shape/dtype discovery (zarrOpen/NewFromUrl).
…ccessor)

Picks up mc_volume local-filesystem source (file_read) and mc_volume_get_level_meta
so VC can read per-level pyramid metadata straight from an opened mc_volume.
mc_render composites inside matter-compressor now (min/mean/max/alpha along the
normal), so VC's per-pixel C++ compositing is dead: delete Compositing.cpp
(LayerStack, CompositeMethod::*, compositeLayerStack, methodRequiresLayerStorage,
buildTfLut256, computeLightingFactor) + test_compositing — zero non-test callers.
Keep CompositeParams/CompositeRenderSettings (GUI settings carriers mapped to
mc_render params). Also drop orphaned _values/_coverage viewer members.
Both libs/c3d/c3d.{c,h} and libs/matter-compressor/src/c3d.{c,h} were byte-identical.
Delete the libs/c3d copy; the c3d target (utils_c3d_codec's dep) now compiles
matter-compressor's c3d.c. One source of truth, matching the 'matter-compressor
owns the codec' direction.
…rrChunkFetcher)

McVolumeArray (matter-compressor) is now the ONLY render/cache path for every
volume. Delete ChunkCache, MatterCache, ZarrChunkFetcher (.cpp/.hpp) + their
tests. Replace ChunkCache::Options with a minimal DecodedCacheOptions (only
decodedByteCapacity is consumed). New ZarrMetadata module reads pyramid
shapes/dtype/fill via utils::zarr for zarrOpen/NewFromUrl (no fetcher). Migrate
ChunkedTensor::openChunkedArrayCache (tracer local path), vc_cache_prefetch, and
vc_render_tifxyz to McVolumeArray/IChunkedArray. Remote .mca probe stubbed
(GUI-unreachable).

Co-developed via isolated worktree agent.
…set ladder)

- reuse one c3d decoder per decode-pool thread (~14% off each decode)
- cap decode pool at nproc/2 (memory-bandwidth-bound; ~40% lower per-region latency)
- upstream calibrated preset ladder
The download rate was a poll-to-poll net_bytes delta that snapped to 0 whenever
a poll fell between s3_get_batch arrivals (batches land every few seconds), and
the status bar also gated on remoteFetchesInFlight which flickers to 0 between
request bursts. Now: average bytes over a sliding 2s window, hold the last
nonzero rate for 3s of idle before declaring 0, and gate the readout on the
rate (not the in-flight count). Shows a steady 'downloading @ X MB/s' while
data is actually streaming.
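A sketch of the sliding-window rate with a hold period, as described above. `DownloadRate` and its injected millisecond clock are hypothetical; the 2s window and 3s hold come from the commit message, and the hold is measured here from the last poll that saw a nonzero rate, which is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

class DownloadRate {
public:
    static constexpr std::int64_t kWindowMs = 2000;  // averaging window
    static constexpr std::int64_t kHoldMs = 3000;    // keep last nonzero rate this long

    // Record `bytes` that arrived from the source at time `nowMs`.
    void addBytes(std::int64_t nowMs, std::uint64_t bytes) {
        samples_.push_back(Sample{nowMs, bytes});
    }

    // Rate in bytes per second to display at time `nowMs`.
    double rate(std::int64_t nowMs) {
        // Drop samples that fell out of the window.
        while (!samples_.empty() && samples_.front().tMs <= nowMs - kWindowMs)
            samples_.pop_front();
        std::uint64_t sum = 0;
        for (const Sample& s : samples_) sum += s.bytes;
        const double r = static_cast<double>(sum) * 1000.0 / static_cast<double>(kWindowMs);
        if (r > 0.0) {
            lastNonzero_ = r;
            lastNonzeroAtMs_ = nowMs;
            return r;
        }
        // Window is empty: hold the last nonzero reading through burst gaps.
        if (lastNonzero_ > 0.0 && nowMs - lastNonzeroAtMs_ < kHoldMs)
            return lastNonzero_;
        lastNonzero_ = 0.0;
        return 0.0;
    }

private:
    struct Sample {
        std::int64_t tMs;
        std::uint64_t bytes;
    };
    std::deque<Sample> samples_;
    double lastNonzero_ = 0.0;
    std::int64_t lastNonzeroAtMs_ = 0;
};
```

The status bar would then show "downloading" whenever `rate()` is nonzero, independent of the in-flight request count.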
…eadout

- per-region interactive fetch (no whole-shard download), net_bytes counts real
  transferred bytes
- McVolumeArray download-rate: sliding 2s window held through batch-burst gaps;
  status bar gates on rate not the flickery in-flight count
Fixes regions wrongly poisoned as air on transient read errors — they now stay
ABSENT and retry, only marking ZERO when S3 confirms air (c3d index air marker /
v2 404).
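The distinction between a transient failure and confirmed air can be made explicit when combining the sub-fetch results of one region. The enums and `combineSubFetches` below are hypothetical; the rule that any transient error keeps the region ABSENT (so nothing partial is persisted) is an assumption that goes beyond what the commit message states.

```cpp
#include <cassert>
#include <vector>

enum class FetchOutcome { Data, ConfirmedAir, TransientError };
enum class RegionCoverage { Absent, Present, Zero };

// Decide what to record for a region after all of its sub-fetches returned.
//  - any transient error  -> Absent (retry later, persist nothing)
//  - every sub-chunk air  -> Zero   (the source confirmed there is no data)
//  - otherwise            -> Present (air sub-chunks stay zero in the region)
inline RegionCoverage combineSubFetches(const std::vector<FetchOutcome>& outcomes) {
    assert(!outcomes.empty());
    bool anyData = false;
    for (FetchOutcome o : outcomes) {
        if (o == FetchOutcome::TransientError) return RegionCoverage::Absent;
        if (o == FetchOutcome::Data) anyData = true;
    }
    return anyData ? RegionCoverage::Present : RegionCoverage::Zero;
}
```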
…ladder)

- staging queue decoupled from decode via a 2GB byte budget (network saturates
  ahead of CPU-bound decode); mc_volume_set_staging_bytes runtime API
- upstream calibrated compression preset ladder (MC_PRESET_ARCHIVAL..PREVIEW),
  pulled down but not yet used by VC
…render scratch reuse

- vendored matter-compressor: per-thread mc_codec_ctx (eliminates ~14% TLS
  overhead) + single-flight region fetch (re-decode redundancy 15x -> 1.00x)
- McVolumeArray::render: reuse thread-local pts/nrm scratch instead of a fresh
  13MB vector per frame (cuts render-worker page-fault churn)
…decode)

Fixes ~4% render CPU wasted rebuilding the quant table (4096 powf) on every
mc_cache-miss block read.
Cache (matter-compressor) is now fully lock-free on the render path:
- delete per-shard mutex; mc_cache_update partitions fill by shard
  ownership (worker t owns shards t::nt) so every mutation is
  single-owner. Reworked the async-update ticket the same way.
- getters (get/get_copy/contains/best_lod) are bare lock-free probes;
  removed rd_mu (single-owner reader decode) and encoder bm_mu (stripes
  write byte-disjoint bitmap regions). mc lock-ops 77->33, none in the
  cache/render path; survivors are all off-tick (ftruncate, S3 footer,
  download<->decode queue handoff, one-time init).
- coverage memo: mc_archive_chunk_coverage is an O(1) hash probe instead
  of a per-block mc_resolve_chunk tree walk; backfills from the tree.
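The "worker t owns shards t::nt" partition from the first bullet is a strided assignment: each shard has exactly one writer, so no per-shard lock is needed for mutation. A minimal sketch with hypothetical function names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The single worker allowed to mutate `shard`.
inline std::size_t ownerOf(std::size_t shard, std::size_t numWorkers) {
    assert(numWorkers > 0);
    return shard % numWorkers;
}

// All shards owned by `worker`: worker, worker + nt, worker + 2*nt, ...
inline std::vector<std::size_t> ownedShards(std::size_t worker, std::size_t numWorkers,
                                            std::size_t numShards) {
    assert(numWorkers > 0 && worker < numWorkers);
    std::vector<std::size_t> out;
    for (std::size_t s = worker; s < numShards; s += numWorkers) out.push_back(s);
    return out;
}
```

During a cache update, each worker would filter the incoming blocks to those whose shard it owns and apply only those.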

Render-hitch fix: mc_block_range was O(n^2) over a chunk's blocks
(per-block 512B popcount + O(slot) prefix sum). Replaced with a
thread-local per-chunk bi->offset index built in one O(4096) pass;
O(1)/block. Also memoize the chunk_offset tree walk per chunk in
src_archive. Fresh-working-set spike 348ms->~35ms, worker median
79->25ms. Residual ~6ms is pure AVX2 trilinear (already optimal).
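The block-offset index can be illustrated with a simplified model: a 4096-bit presence bitmap (512 bytes) plus a list of payload sizes for the present blocks, stored back to back in bit order. That layout is an assumption for illustration and is not the actual .mca format. One pass over the bitmap yields an offset per block, so later lookups are O(1).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlocksPerChunk = 4096;            // 16^3 blocks in a 256^3 chunk
using PresenceBitmap = std::array<std::uint64_t, kBlocksPerChunk / 64>;  // 512 bytes
constexpr std::uint32_t kAbsentBlock = 0xFFFFFFFFu;

// Build offsets[bi] = byte offset of block bi's payload, or kAbsentBlock.
// slotSizes[i] is the payload size of the i-th present block (assumed layout).
inline std::vector<std::uint32_t> buildOffsetIndex(const PresenceBitmap& present,
                                                   const std::vector<std::uint32_t>& slotSizes) {
    std::vector<std::uint32_t> offsets(kBlocksPerChunk, kAbsentBlock);
    std::uint32_t running = 0;  // prefix sum of payload sizes so far
    std::size_t slot = 0;       // number of present blocks seen so far
    for (std::size_t bi = 0; bi < kBlocksPerChunk; ++bi) {
        const bool isPresent = (present[bi / 64] >> (bi % 64)) & 1u;
        if (!isPresent) continue;
        assert(slot < slotSizes.size());
        offsets[bi] = running;
        running += slotSizes[slot++];
    }
    return offsets;
}
```

Computing the same offset per block on demand would repeat the popcount and prefix sum for every lookup, which is the quadratic behaviour the commit removes.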

Render scheduling: the global tick is the only scheduler.
- chunk arrival never schedules a render; the tick renders while data is
  in flight (whole-pipeline depth collated at thaw into a plain frozen
  inflight_snapshot -- queued+downloading+decode-queue+undrained misses,
  read during frozen render, no atomic/lock).
- removed the worker-busy render queue: if a render worker is still busy
  at tick time, skip this tick and re-arm dirty; no completion
  self-scheduling. Kills the worker-finished re-render chains
  (was ~320/session, now 0).
- fold per-viewer timers into the global clock; Constants.hpp for the
  tick/debounce constants.
- clear the stale latency-origin stamp on superseded frames.
… gen)

Pull back the correctness fixes found by the upstream mc test suite:
content-key the per-chunk block-offset index (arc+off+xxh64) so a stale
index isn't served after chunk_off reuse/re-append, and add an archive
gen counter that invalidates src_archive's chunk_off memo on publish.
matter-compressor is now the single render/composite/colormap engine for
the GUI. Net ~1000 lines of VC3D render code deleted.

- mc_colormap (vendored): window/level + colormap LUT as baked static
  [256][3] gray->RGB tables (viridis/magma/fire bit-identical to OpenCV,
  + gray/r/g/b/cyan/magenta tints). The viewer calls mc_colormap_lut/
  mc_colormap_id instead of buildWindowLevelColormapLut.
- delete BOTH postprocess pipelines: util/PostProcess (CLAHE/raking/
  stretch/iso/remove-small-components) and render/PostProcess (the LUT
  builder mc replaced) + their tests. Drop SampleParams::postProcess and
  Volume::applyOptionalPostProcess (CLI 2D post, also gone).
- slim CompositeParams/CompositeRenderSettings to the mc-mapped fields:
  method, alphaMin/Opacity, percentile, and the SHADED knobs (light dir,
  ambient/diffuse/specular/shininess/absorption/shadow/sss/curvature).
  Deleted beerLambert, lighting, volume gradients, pre/post transfer
  functions, dvr/pbr, etc. -- mc's SHADED mode subsumes the relief/raking
  /lighting passes.
- remove the streamingCompositeUnsupported fallback; every method is an
  mc reduction now. Extend the method map: stddev=5 shaded=6 percentile=7
  depth=8. McVolumeArray::render takes a ShadeParams* for those modes.
- Colormaps.cpp keeps only the UI registry (specs/resolve/entries);
  makeColors/applyPackedLut pixel funcs removed (no live callers).

GUI verified: 281 frames rendered through the mc path, no errors.
CLI tools (Slicing.cpp, zarr reader) still use legacy paths -- deferred.
SuperOptimizer added 3 commits June 11, 2026 21:38
vendor: sync matter-compressor (1a62a52) -- fast thaw (region-granular
absent miss + ABSENT coverage memo + single-call fill), per-pixel LOD
fallback, pointer-into-arena sampler, LIFO decode queue.

VC3D side:
- McVolumeArray slice path uses mc_render_points_par_lod with native L0
  coords: per-pixel coarser-LOD fallback (coarse-not-black on zoom/pan,
  render never decodes). Composite path still single-level. ShadeParams
  pass-through for SHADED/PERCENTILE.
- THAW-GATE (correctness): onGlobalTick skips thaw/freeze/render while any
  viewer's async render worker is busy. The pointer-into-arena sampler
  holds raw arena pointers across the async render; thawing (the only
  mutator) mid-render would evict/realloc under them -> use-after-free.
  Gating restores the freeze/thaw invariant: no mutation while frozen
  reads are live. (renderWorkerBusy() accessor added.)
- "downloading N" status now reads regions_inflight (stable pipeline
  depth) while the render gate reads workPending (+ undrained misses), so
  the status no longer ping-pongs 0..thousands with the per-frame miss set.
- SurfacePatchIndex rtree rebuild moved to a dedicated 1-thread QThreadPool
  (was on the global pool shared with render bands -> a 1.6s startup
  rebuild stole render workers). Internals untouched; rebuild now timed.
Drops the per-entry 4KB block-memo buffer: the sampler caches the cache
arena pointer directly (owns_ptr source) instead of a 1MB/8MB per-band-
per-frame buffer. LOD sample L4 27ms->8ms, L0 13ms->5ms; kills the
sampler-alloc page-fault storm that was ~25% of render-thread time.
At the start of each global tick, before freeze, every viewer predicts the
256^3 regions it will sample this frame and the tick submits the downloads
up front -- so data front-runs the render instead of being discovered as a
miss a cycle late.

Prediction (predictWorkingSet): run the surface's OWN gen() at 1/8 res over
the exact render viewport, so we get only the IN-VIEWPORT coords (gen clips
to the screen rect; a warped sheet's off-screen patch is excluded). Map
each coord -> 16^3 block -> 256^3 region, dedup at region granularity. This
replaced an earlier control-grid walk that sprayed ~46k regions across the
volume (a folded sheet's padded control window) and flooded the download
stack ("downloading 45000"). Now ~hundreds of regions, stack depth maxes ~50.

Submit via McVolumeArray::prefetchChunks -> mc_volume_request_region (cheap,
absent-only, no decode). Cross-viewer + inflight/present dedup is handled by
the deduping download stack in mc, not here. vendor: sync mc 63e9601.
VCA_NO_PREDICT disables for A/B.
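The coordinate-to-region mapping with region-granular dedup can be sketched as below. `Voxel` and `predictRegions` are hypothetical names, and skipping negative coordinates is an assumption about out-of-volume samples; the real predictWorkingSet works on the output of the surface's gen().

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// A sampled voxel coordinate at level 0 (hypothetical type).
struct Voxel {
    std::int32_t z, y, x;
};

using RegionId = std::array<std::int32_t, 3>;  // (z, y, x) in units of 256 voxels

// Map each coordinate to its 16^3 block, then to the enclosing 256^3 region,
// and keep each region only once.
inline std::set<RegionId> predictRegions(const std::vector<Voxel>& coords) {
    std::set<RegionId> regions;
    for (const Voxel& v : coords) {
        if (v.z < 0 || v.y < 0 || v.x < 0) continue;  // outside the volume (assumed)
        const std::int32_t bz = v.z >> 4, by = v.y >> 4, bx = v.x >> 4;  // 16^3 block
        regions.insert(RegionId{bz >> 4, by >> 4, bx >> 4});             // 256^3 region
    }
    return regions;
}
```

Deduplicating at region granularity keeps the submitted set in the hundreds even when tens of thousands of coordinates are sampled.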