VC3D: matter-compressor render cache (one .mca per volume, mc_cache resident) + fetch/render perf by SuperOptimizer · Pull Request #1036 · ScrollPrize/villa

SuperOptimizer · 2026-06-10T15:17:10Z

What

Replaces VC3D's render caching with matter-compressor: remote (and local) chunks are fetched once, re-encoded into ONE persistent volume.mca per volume (all LODs, crash-safe, appendable), and served as 16³ blocks through mc_cache (sharded S3FIFO resident cache inside the library). VC3D-side caching scaffolding is deleted; the ChunkCache keeps only per-256³-region fetch status plus the async orchestration (worker pool, listeners). Non-uint8 volumes keep the legacy byte-LRU path.

Vendored matter-compressor is at 1e53dbb (format v6: per-axis dims, self-contained blocks, SIMD decode). Stale/older-format cache files are detected and rebuilt automatically.

Fetch path

- Per-region single-flight; worker-pool tasks interleave across regions so 48 I/O workers fetch distinct regions concurrently (workers are ~85% network-blocked).
- Remote sharded zarr reads: the shard index table is cached after one GET, so each chunk costs one S3 round trip instead of two (halves cold-fetch latency).
- 128³-chunk sources assemble their 256³ region with parallel sub-fetches.
- Download stats and the status bar count only bytes actually pulled from the source; archive serves are not "downloading".

Rendering / frontend

Profile-driven (perf + --profile logs, before/after in the commit messages):

- Interactive preview system deleted (~420 lines): no more transformed-stale-frame previews, no 140ms settle timer, no 250/500ms chunk-ready coalescing windows (which starved repaints while chunks streamed). Interactions schedule normal full renders behind the single 16ms debounce; cache-side chunk-ready callbacks are throttled to ~30Hz with a guaranteed final notify.
- Speculative prefetch deleted: render-worker tryGetChunk misses queue exactly what requested frames need; the mca archive makes refetches cheap.
- Stale-render coalescing: a submit while the worker is busy only invalidates the in-flight frame if view parameters changed (fingerprint); data-only refreshes display it. Discarded renders: 19% → 5%.
- Pool submission moved outside the cache mutex; ChunkKey hash packs to 64 bits + one mix; shift/mask chunk indexing in the trilinear sampler hot path.
- Main-thread stalls (formerly up to 284ms in submitRender) are gone; the UI thread no longer registers in CPU profiles.

Also

fixed a segfault on every exit: gnutls/libtasn1 DSO destructors free through mimalloc after its teardown in _dl_fini; VC3D now runs real teardown then _exit()

Measured end state

Steady-state (warm archive): blocks decode from volume.mca via mc_cache, zero network, ~4% CPU in mc block decode. Cold exploration: c3d decode + mc encode dominate pool threads (once per region ever, amortizes out). Render frames are 94% sampler kernel — follow-up territory (pan reuse, kernel work).

Known: test_volume_local exact-roundtrip fails on this branch (reads go through the lossy q=8 mca; pre-existing design question, not a regression).

Vendor SuperOptimizer/matter-compressor @ ab0649c into libs/. This snapshot has the appendable dense-node archive: mc_writer_open / mc_append_chunk_raw / mc_append_chunk_compressed / mc_writer_close (persistent, crash-safe, reopened across runs) + mc_open_streaming (byte-source range-GET reader). Foundation for the mca streaming/re-encode cache wired into the chunk fetch path.

…nto vc_core

- re-vendor matter-compressor @ 1524688 (unified mc_archive read+write handle, vendoring-friendly CMake)
- add_subdirectory(libs/matter-compressor) + link matter_compressor into vc_core
- MatterArchive: RAII C++ wrapper around mc_archive_open/append_chunk_raw/chunk_offset/decode_block/close. Storage/encode unit 256^3; decode/serve unit 16^3 (the granularity the resident chunk cache will key on). Append is thread-safe; decode serialized by the underlying archive (codec quality is process-global).

MatterCacheFetcher decorates each level's source (zarr/c3d) fetcher to re-express the volume through one persistent matter-compressor (.mca) archive: 'fetch native, serve mca-native'.

- The volume is reported to the ChunkCache at mca's native 16^3 chunk granularity (the resident cache holds 4KB blocks).
- fetch(16^3 key) -> enclosing 256^3 mca region. On a miss, fetch the SOURCE's native chunks covering it (256^3 c3d = 1:1; 128^3 zarr-v2 = eager 2x2x2 coalesce), assemble one 256^3 u8 buffer, encode it into the .mca once. Then decode the requested 16^3 block out of the .mca.
- One .mca holds all chunks at all 8 LODs and persists across runs (skips re-fetch on a warm cache). Region-materialization is memoized + checks the persisted archive.

Wired in createChunkCache, gated by VCA_MCA_CACHE=<path.mca> (+ VCA_MCA_QUALITY, default 8); uint8 volumes only (mca is u8). Off by default. Full VC3D builds + links.
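The key mapping described above (16^3 block -> enclosing 256^3 region -> covering source chunks) can be sketched as follows. This is an illustration only: `Idx3`, `enclosingRegion` and `sourceChunksFor` are hypothetical names, not the identifiers used in MatterCacheFetcher, and the sketch assumes non-negative indices and cubic source chunks whose edge divides 256.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical index triple; units are chunk indices at a single LOD.
struct Idx3 { int64_t z, y, x; };

constexpr int kBlockEdge  = 16;   // decode/serve unit
constexpr int kRegionEdge = 256;  // storage/encode unit

// 256^3 region index that encloses a 16^3 block index (16 blocks per axis).
inline Idx3 enclosingRegion(const Idx3& block) {
    assert(block.z >= 0 && block.y >= 0 && block.x >= 0);
    const int64_t n = kRegionEdge / kBlockEdge;
    return {block.z / n, block.y / n, block.x / n};
}

// Source chunk indices covering one region: 1 for 256^3 sources,
// 2x2x2 = 8 for 128^3 sources.
inline std::vector<Idx3> sourceChunksFor(const Idx3& region, int srcEdge) {
    assert(srcEdge > 0 && kRegionEdge % srcEdge == 0);
    const int64_t n = kRegionEdge / srcEdge;
    std::vector<Idx3> out;
    out.reserve(static_cast<std::size_t>(n * n * n));
    for (int64_t dz = 0; dz < n; ++dz)
        for (int64_t dy = 0; dy < n; ++dy)
            for (int64_t dx = 0; dx < n; ++dx)
                out.push_back({region.z * n + dz, region.y * n + dy, region.x * n + dx});
    return out;
}
```

For a 128^3 source, block (17, 0, 31) falls in region (1, 0, 1), which is assembled from source chunks (2..3, 0..1, 2..3).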

…cache dir

The mca integration was in ZarrChunkFetcher's createChunkCache, which VC3D does NOT call -- the live path is Volume::createChunkCache. Move it there:

- ONE persistent volume.mca lives in the volume's existing remote cache dir (remoteCacheRoot_/id()), NOT /tmp -- same place other chunks are cached.
- When mca engages it REPLACES the old per-chunk-file persistent cache (mca IS the persistence), so persistentCachePath is left unset; the old per-chunk cache remains only as the fallback when mca is off (non-uint8, local, or VCA_NO_MCA_CACHE).
- Default ON for remote uint8 volumes; VCA_NO_MCA_CACHE disables; VCA_MCA_QUALITY sets q.

applyMatterCache() factored into ZarrChunkFetcher (wrap fetchers + 16^3 LevelInfo). Also deleted the stale 503GB .vca .vcacache disk-cache artifact (superseded by mca).

Rip out the entire per-chunk-file persistent cache. On-disk caching is now ONLY the single per-volume matter-compressor archive (volume.mca), and ALL volumes -- remote AND local -- go through it.

- ChunkCache: deleted readPersistent/readPersistentEmpty/queuePersistentWrite/queuePersistentEmptyWrite/writePersistent/writePersistentEmpty/persistentPath/persistentEmptyPath/persistentCacheBytes(dir)/persistentCacheWriterPool + the Entry::persisted field + the persist-on-evict path. The fetch worker just calls fetch() (the MatterCacheFetcher owns mca read/write).
- Options::persistentCachePath -> Options::mcaPath (path to the single volume.mca); the 'disk' cache-size stat now reports that file's size (a throttled file_size), not a recursive dir scan of millions of per-chunk files.
- ChunkFetchResult: dropped persistentBytes/hasPersistentBytes; IChunkFetcher dropped persistentCacheExtension/decodePersistentBytes; ZarrChunkFetcher dropped its encoded-c3d persistent path.
- Volume::createChunkCache: mca for every volume (remote -> remote cache dir; local -> <dataset>.mcacache/), uint8 only, VCA_NO_MCA_CACHE disables. Viewer no longer sets a per-chunk persistent path.
- Tests: removed the per-chunk-cache TEST_CASEs (kept the unrelated ones); all pass.

Also deleted the stale 503GB .vca .vcacache and the 81GB per-chunk chunk_cache on disk.

…port

Root cause: Volume::createChunkCache resolved the mca dir from Volume::remoteCacheRoot_, but that member was empty for the render path's volumes -- the GUI resolves the cache root (remoteCacheRootForState: /volpkgs|/ephemeral|settings) and the old code passed it via Options, which I'd removed. So cacheDir came up empty, applyMatterCache was skipped, and the slow raw S3-streaming path ran (pegging CPU + 22 S3 conns) while 'disk' read 0 (options_.mcaPath unset).

Fix: add Volume::setRemoteCacheRoot(); the viewer pushes the GUI-resolved root into the volume before createChunkCache builds the cache. Now mca engages (verified: 'mca cache enabled', one volume.mca written + grown, chunks served from it, disk stat reads its size).

Added diagnostics: createChunkCache logs isRemote/cacheDir/levels, applyMatterCache logs why it skips.

Two fixes for 'spinning, barely downloading' (CPU-bound, no progress):

- Re-vendor lock-free mc_archive_decode_block (was serializing every 16^3 decode on one mutex -> all cache-IO threads spun on one lock).
- MatterCacheFetcher::ensureRegion: per-256^3-region single-flight. A render touches up to 4096 16^3 blocks of the same region nearly simultaneously; before, EACH thread passed the not-done check and redundantly re-fetched the same 8 source chunks + re-encoded the same 256^3 region. Now the first thread claims the region (InFlight), assembles+encodes it once, and publishes Present/Absent; the rest wait on a condvar for that one assembly. Eliminates the redundant fetch/encode storm.
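A minimal sketch of the per-region single-flight pattern described above, using a mutex, a state map and a condition variable. `RegionSingleFlight` and the `materialize` callback are hypothetical stand-ins for ensureRegion and its fetch+encode step; the real code may be structured differently. The sketch memoizes Absent permanently, so a real implementation would still need a retry policy for transient failures.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

enum class RegionState { InFlight, Present, Absent };

class RegionSingleFlight {
public:
    // Runs `materialize` at most once per region key. Concurrent callers for
    // the same key block until the claiming thread publishes Present/Absent.
    // Returns true when the region ended up Present.
    bool ensure(uint64_t regionKey, const std::function<bool()>& materialize) {
        std::unique_lock<std::mutex> lock(mu_);
        if (states_.find(regionKey) == states_.end()) {
            states_.emplace(regionKey, RegionState::InFlight);  // claim it
            lock.unlock();
            bool ok = false;
            try {
                ok = materialize();  // slow fetch + encode, outside the lock
            } catch (...) {
                publish(regionKey, false);  // never leave waiters hanging
                throw;
            }
            publish(regionKey, ok);
            return ok;
        }
        // Someone else owns (or already finished) this region: wait for it.
        cv_.wait(lock, [&] { return states_.at(regionKey) != RegionState::InFlight; });
        return states_.at(regionKey) == RegionState::Present;
    }

private:
    void publish(uint64_t regionKey, bool ok) {
        {
            std::lock_guard<std::mutex> g(mu_);
            states_[regionKey] = ok ? RegionState::Present : RegionState::Absent;
        }
        cv_.notify_all();
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::unordered_map<uint64_t, RegionState> states_;
};
```

Running the slow work outside the lock is what lets workers for different regions proceed concurrently while same-region callers collapse onto one assembly.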

Brings the unified archive format v4 (self-contained blocks, trained priors), the mc_cache sharded CLOCK/NRU decoded-block cache, and lock-free decode.

- MatterArchive owns an mc_cache bound to the archive; decodeBlock is a mc_cache_get_copy. Stale/incompatible volume.mca is deleted and rebuilt.
- ChunkCache in mca mode keeps NO decoded bytes and tracks status per 256^3 REGION (corner-block key): one entry + one fetch task per region instead of 4096 per-block entries duplicating what the archive + mc_cache already know. Resolved blocks decode straight from mc_cache. Legacy byte-LRU remains only for non-uint8 volumes (mca is u8).
- fetch throughput: pool tasks interleave across regions (rank priority), 48 I/O workers (they are mostly network-blocked), parallel sub-chunk assembly for 128^3 sources, prefetch keys snapped to region corners.
- pool submission moved outside the cache mutex; chunk-ready callbacks throttled to ~30Hz with a guaranteed final notify on drain.
- ChunkKey hash packs to 64 bits (3 lod + 3x20 coord bits) + one mix.
- download stats count only bytes actually pulled from the source; blocks served from the on-disk archive are not downloads.
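The ChunkKey packing mentioned above (3 LOD bits + 3x20 coordinate bits, then one mix) could look roughly like this. The field layout and the multiply/xor-shift mixer are assumptions for illustration; the commit message does not say which bit order or mixer the real hash uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical key layout: LOD in [0, 8), each coordinate in [0, 2^20).
struct ChunkKey { uint32_t lod, z, y, x; };

// Pack into one 64-bit word (63 bits used): [lod:3][z:20][y:20][x:20].
inline uint64_t packKey(const ChunkKey& k) {
    assert(k.lod < 8u);
    assert(k.z < (1u << 20) && k.y < (1u << 20) && k.x < (1u << 20));
    return (uint64_t(k.lod) << 60) | (uint64_t(k.z) << 40) |
           (uint64_t(k.y) << 20) | uint64_t(k.x);
}

// One mixing step (assumed): multiply by an odd constant, fold high bits down.
// Both operations are bijective, so distinct packed keys never collide here.
inline uint64_t hashKey(const ChunkKey& k) {
    uint64_t h = packKey(k) * 0x9E3779B97F4A7C15ull;
    h ^= h >> 32;
    return h;
}
```

Packing first means equality and hashing both operate on a single integer, which avoids combining four separate field hashes per lookup.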

…unk) Remote sharded reads did two ranged GETs per chunk (16-byte index entry + payload); over S3 the index round trip doubled per-chunk latency. Fetch the whole index table once per shard and serve entries from RAM.
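As an illustration of serving index entries from RAM, the sketch below parses a whole shard index table once and answers lookups locally. It assumes the Zarr v3 sharding layout of 16-byte entries (little-endian uint64 offset, then uint64 nbytes, with all-ones marking an empty chunk) and that any checksum trailer has already been stripped; `ShardIndex` and `ByteRange` are hypothetical names, not VC3D types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ByteRange { uint64_t offset, nbytes; };

// Read a little-endian uint64 regardless of host endianness.
inline uint64_t loadLE64(const uint8_t* p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
    return v;
}

class ShardIndex {
public:
    // `table` is the raw index fetched with ONE ranged GET for the shard.
    explicit ShardIndex(const std::vector<uint8_t>& table) {
        assert(table.size() % 16 == 0);
        entries_.reserve(table.size() / 16);
        for (std::size_t i = 0; i + 16 <= table.size(); i += 16)
            entries_.push_back({loadLE64(&table[i]), loadLE64(&table[i + 8])});
    }

    // Fills `out` with the payload byte range of inner chunk `i`.
    // Returns false when the chunk is out of range or marked empty.
    bool lookup(std::size_t i, ByteRange& out) const {
        if (i >= entries_.size()) return false;
        const ByteRange& e = entries_[i];
        if (e.offset == UINT64_MAX && e.nbytes == UINT64_MAX) return false;
        out = e;
        return true;
    }

private:
    std::vector<ByteRange> entries_;
};
```

With one `ShardIndex` cached per shard, each chunk read needs only the payload GET for the range returned by `lookup`.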

- interactive preview deleted outright (~420 lines): coarse axis-slice preview, stable-frame transform previews, 50ms preview rate cap, 140ms settle timer. Interactions schedule normal full-quality renders.
- chunk-ready 250/500ms restart-on-arrival windows deleted; they starved repaints for as long as chunks kept streaming. The cache-side 30Hz throttle + the 16ms render debounce are the only coalescing layers.
- speculative prefetch disabled (visible-set warming, viewport halos, normal-direction neighbors): render-worker tryGetChunk misses queue exactly what requested frames need; the mca archive makes refetches cheap.
- prefetch key enumeration is region-granular via IChunkedArray::prefetchShape (was 512x more keys than designed for after the 16^3 re-expression).
- stale-render fix: a busy-time submit only invalidates the in-flight frame when view params actually changed (fingerprint); data-only refreshes let it display. Discarded renders drop from 19% to 5%.
- status bar only reports downloading when bytes move from the source.
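The stale-render rule above can be sketched as a small state machine: fingerprint the view parameters, and mark the in-flight frame for discard only when a busy-time submit carries a different fingerprint. `ViewParams`, `RenderSlot` and the FNV-1a fingerprint are hypothetical; the actual viewer fields and hash are not shown in this PR text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical subset of the parameters that define what a frame looks like.
struct ViewParams { float scale; float offsetX; float offsetY; int32_t slice; };

// FNV-1a over each field's bytes (field by field, so padding is never hashed).
inline void mixBytes(uint64_t& h, const void* p, std::size_t n) {
    const unsigned char* b = static_cast<const unsigned char*>(p);
    for (std::size_t i = 0; i < n; ++i) { h ^= b[i]; h *= 1099511628211ull; }
}

inline uint64_t fingerprint(const ViewParams& v) {
    uint64_t h = 14695981039346656037ull;
    mixBytes(h, &v.scale, sizeof v.scale);
    mixBytes(h, &v.offsetX, sizeof v.offsetX);
    mixBytes(h, &v.offsetY, sizeof v.offsetY);
    mixBytes(h, &v.slice, sizeof v.slice);
    return h;
}

class RenderSlot {
public:
    // Returns true when the frame currently rendering must be discarded.
    bool submit(const ViewParams& v) {
        const uint64_t fp = fingerprint(v);
        if (!busy_) {            // idle: this submit starts a render
            busy_ = true;
            inflightFp_ = fp;
            discard_ = false;
            return false;
        }
        if (fp != inflightFp_) discard_ = true;  // view changed: frame is stale
        return discard_;                         // data-only refresh keeps it
    }

    // Called when the worker finishes; returns true if the frame is displayed.
    bool finish() {
        busy_ = false;
        const bool show = !discard_;
        discard_ = false;
        return show;
    }

private:
    bool busy_ = false;
    bool discard_ = false;
    uint64_t inflightFp_ = 0;
};
```

A data-only refresh (same fingerprint) lets the in-flight frame display, which is the case that previously accounted for most discarded renders.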

Every exit crashed in _dl_fini: gnutls/libtasn1 destructors free through mimalloc after its own teardown. Run real teardown (CWindow dtor, settings), flush, _exit().

Chunk shapes are powers of two (16 mca, 128/256 zarr); replaces 3 idivs per voxel read in trilinear/nearest sampling.
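A sketch of the shift/mask replacement for division and modulo, valid only because the chunk edge is a power of two. `splitCoord` and `log2Pow2` are illustrative names, not the sampler's own.

```cpp
#include <cassert>
#include <cstdint>

struct ChunkAddr {
    uint32_t chunk;  // which chunk along this axis
    uint32_t local;  // voxel offset inside that chunk
};

// log2 of a power-of-two edge; computed once per level, not per voxel.
inline unsigned log2Pow2(uint32_t edge) {
    assert(edge != 0 && (edge & (edge - 1)) == 0);
    unsigned s = 0;
    while ((1u << s) < edge) ++s;
    return s;
}

// Equivalent to {v / edge, v % edge} for edge == 1u << shift, with no idiv.
inline ChunkAddr splitCoord(uint32_t v, unsigned shift) {
    return {v >> shift, v & ((1u << shift) - 1u)};
}
```

For example, voxel 300 with a 16-voxel edge maps to chunk 18, offset 12; applying this per axis removes the three integer divisions per voxel read.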

v6: per-axis dims, per-chunk q, xxh64, per-volume priors; SIMD decode kernels; S3FIFO default mc_cache eviction (scan-resistant for render loops). MatterArchive opens with the real volume shape instead of a padded cube; existing v4 caches are auto-recreated.

The prefetch paths were disabled; remove them: plane/surface halos, normal-direction neighbors, visible-set warming, the surface tile prefetch cache, and IChunkedArray::prefetchShape (only consumer). ~550 lines.


…ored tests Upstream consolidated 11 source files into one header + one implementation; VC vendors just the pair (same code, same format — existing caches unaffected).

…cache

…p,cpp} One file pair for the whole mca layer: the archive/mc_cache RAII wrapper and the IChunkFetcher decorator that fills it from the source volume.

SuperOptimizer · 2026-06-10T15:43:24Z

@codex review

chatgpt-codex-connector · 2026-06-10T15:47:41Z

💡 Codex Review

villa/volume-cartographer/core/src/render/ZarrChunkFetcher.cpp

Line 435 in 056f6ee

const int srcEdge = opened.chunkShapes[i][0]; // assume cubic native chunks

Preserve non-cubic Zarr chunk shapes

This enables the MCA wrapper for every uint8 pyramid but collapses each source chunk shape to opened.chunkShapes[i][0]; the rest of MatterCacheFetcher then computes source chunk indices and copies srcEdge_^3 bytes as if all axes had that same edge. openLocalZarrPyramid/openHttpZarrPyramid preserve arbitrary 3-D chunk shapes from Zarr metadata, so a valid volume chunked like {64,64,128} will have half of each x row ignored and neighboring source chunk coordinates computed incorrectly once MCA is enabled. Please either skip MCA unless all chunk dimensions are equal and divide 256, or pass the full chunk shape through the fetcher.

villa/volume-cartographer/core/src/Volume.cpp

Lines 1502 to 1503 in 056f6ee

    
} else if (!path_.empty()) {
    cacheDir = path_.parent_path() / (path_.filename().string() + ".mcacache");

Invalidate local MCA archives after writes

Creating a persistent .mcacache/volume.mca for local volumes makes mutable local datasets serve stale data: after a region has been cached, Volume::writeZYX updates the Zarr files and only calls invalidateCache() (line 1819), which resets the in-memory ChunkCache but does not remove or update this sibling archive. The next read reopens the old archive and MatterCacheFetcher::ensureRegion trusts archive_->hasChunk, so edited chunks can keep returning pre-write bytes until the cache directory is manually deleted. Please avoid persistent MCA for local mutable volumes or invalidate the corresponding archive contents on writes.


Vendors SuperOptimizer/libs3 @01577e0 (minimal C23 S3 client: GET/HEAD/range, SigV4, env/INI/SSO/IMDSv2 credential chain, retries, process abort flag). utils::HttpClient keeps its API but becomes a ~180-line adapter; aws_auth.cpp and the 500-line curl implementation are deleted. Consumers unchanged. Credential discovery now finds SSO profiles without AWS_PROFILE exported.

…tion

- applyMatterCache requires cubic source chunks whose edge divides 256 (the region assembler's assumption); other shapes fall back to raw cache.
- Volume::invalidateCache (write paths only) deletes the local sibling .mcacache so edited local volumes never serve stale pre-write bytes.

SuperOptimizer · 2026-06-10T16:00:24Z

Both codex P1s addressed in 8daf20c:

- Non-cubic chunk shapes: applyMatterCache now requires cubic source chunks whose edge divides 256 on every level (the region assembler's assumption) and falls back to the raw cache otherwise. No known producer emits non-cubic chunks, so this is a guard rather than full anisotropic support.
- Stale local archives after writes: Volume::invalidateCache, which is called from every write path (writeZYX, removeChunk, pyramid rebuild) and nowhere else, now also deletes the local sibling .mcacache, so reads after an edit rebuild from the updated zarr. Remote archives are unaffected (immutable sources).
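The chunk-shape guard amounts to a check like the one below. This is a sketch, not the applyMatterCache code; `chunkShapeSupported` and `allLevelsSupported` are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <vector>

// A source chunk shape is usable by the 256^3 region assembler only if it is
// cubic and its edge divides 256 evenly.
inline bool chunkShapeSupported(const std::array<int, 3>& shape) {
    return shape[0] > 0 && shape[0] == shape[1] && shape[1] == shape[2] &&
           256 % shape[0] == 0;
}

// The mca path engages only when every pyramid level passes the check;
// otherwise the caller falls back to the raw cache.
inline bool allLevelsSupported(const std::vector<std::array<int, 3>>& levels) {
    for (const auto& shape : levels)
        if (!chunkShapeSupported(shape)) return false;
    return !levels.empty();
}
```

Under this check {128,128,128} and {256,256,256} pass, while {64,64,128} (non-cubic) and {96,96,96} (edge does not divide 256) fall back.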

SuperOptimizer · 2026-06-10T16:02:54Z

@codex review

Reads go through the lossy mca render cache (q=8); the exact-value checks predate that. The test now requires MAE < 16 and PSNR > 22 dB (measured: MAE ~7.7, PSNR ~26 dB).
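For reference, the two metrics can be computed for uint8 data as sketched below, using the standard definitions (MAE as the mean absolute difference, PSNR as 10·log10(255² / MSE)). `Fidelity`, `compareU8` and `withinTolerance` are hypothetical names; the actual test helper is not shown in this PR text.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Fidelity { double mae; double psnrDb; };

// Compare a reference buffer with its lossy round-trip.
inline Fidelity compareU8(const std::vector<uint8_t>& ref,
                          const std::vector<uint8_t>& got) {
    assert(ref.size() == got.size() && !ref.empty());
    double absSum = 0.0, sqSum = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = double(ref[i]) - double(got[i]);
        absSum += std::fabs(d);
        sqSum += d * d;
    }
    const double n = double(ref.size());
    const double mse = sqSum / n;
    // Identical buffers have zero error, i.e. infinite PSNR.
    const double psnr = mse > 0.0 ? 10.0 * std::log10(255.0 * 255.0 / mse)
                                  : std::numeric_limits<double>::infinity();
    return {absSum / n, psnr};
}

// Thresholds quoted in the commit message: MAE < 16, PSNR > 22 dB.
inline bool withinTolerance(const Fidelity& f) {
    return f.mae < 16.0 && f.psnrDb > 22.0;
}
```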

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 95f5993ce7


chatgpt-codex-connector · 2026-06-10T16:07:40Z

+        if (sub.status != ChunkFetchStatus::Found || sub.bytes.empty())
+            return;   // missing/air sub-chunk -> stays zero in the region


Propagate sub-fetch failures before encoding the region

When a 256³ MCA region is assembled from multiple native chunks (e.g. 128³ remote zarr), this treats every non-Found sub-fetch the same as an intentionally missing/air chunk. If one subrequest returns HttpError/DecodeError while another subchunk succeeds, anyData becomes true and the partially zero-filled region is appended to volume.mca, so later reads (and future runs) serve permanently corrupted zeros instead of retrying or surfacing the error.


chatgpt-codex-connector · 2026-06-10T16:07:40Z

+    std::vector<vc::render::ChunkCache::LevelInfo> mcaLevels;
+    const bool mcaDisabled = std::getenv("VCA_NO_MCA_CACHE") != nullptr;
+    bool mcaOn = false;
+    if (!mcaDisabled && !cacheDir.empty()) {


Keep local volume reads exact by default

This enables the default q=8 MCA path for local volumes too, but Volume::readZYX reads through chunkedCache() and matter-compressor is lossy at that quality. For any local uint8 dataset with a cache directory (including immediately after writeZYX), API reads and samples now return reconstructed values rather than the exact zarr bytes; the cache should be restricted to render-only/remote/opt-in or use a lossless mode for local editable volumes.


Bring the new matter-compressor streaming layer into volume-cartographer:

- vendor src/mc_zarr.c, mc_volume.c, c3d.c (+ headers) from matter-compressor
- add matter_compressor_volume static lib (links VC's libs/libs3 + zstd + c3d), built whether standalone or vendored so the VC3D render shim can link it
- update libs/libs3 to latest upstream (async batch API, s3_get_range_into, range coalescing); was a strict superset, all existing callers unaffected
- order add_subdirectory(libs3) before matter-compressor (volume lib links it)
- adopt the archive determinism flags (-ffp-contract=off, no -ffast-math) on both matter_compressor and the c3d TU, matching upstream

mc_volume is the remote-zarr stream/transcode/cache/prefetch layer that will replace VC3D's MatterCache render path. Builds clean in VC (dev-gcc).

The mc_volume-backed render adapter: a thin IChunkedArray pass-through with no entry table / LRU / fetchers. tryGetChunk -> mc_volume_try_block (present: copy the 16^3 block; absent: async kick + MissQueued -> coarser-LOD fallback); getChunkBlocking -> mc_volume_get_block; chunk-ready listeners driven by mc_volume's transcode-complete callback; stats from mc_volume_get_stats. Replaces the ChunkCache+MatterArchive+ZarrChunkFetcher stack on the render path. Links matter_compressor_volume into vc_core; builds clean (dev-gcc).

Step 3a: the GUI render path now serves remote zarr (c3d/blosc) volumes from mc_volume instead of ChunkCache+MatterArchive+ZarrChunkFetcher.

- Volume::createChunkCache returns IChunkedArray; for remote non-.mca URLs it builds a McVolumeArray (mc_volume) and returns it. .mca-mirror + local-zarr paths unchanged for now.
- Lift the GUI-facing surface onto IChunkedArray: Stats, stats(), shardBatch(), prefetchShardBlocking(), beginViewRequest() (default no-ops). ChunkCache and McVolumeArray both implement it; the sampler already took IChunkedArray&.
- Retype _chunkArray / chunkedCache_ / VolumePrefetcher cache / GUI helpers from ChunkCache to IChunkedArray. The GUI calls only interface methods now.

VC3D + vc_render_tifxyz + vc_cache_prefetch build+link clean (dev-gcc). ChunkCache/ZarrChunkFetcher remain for the tracer (Chunked3d) path.

VC3D's render path now goes entirely through matter-compressor. renderFrame unifies plane and quad onto surf->gen() -> McVolumeArray::render() (mc_render), deleting ChunkedPlaneSampler (.cpp/.hpp + 4 tests, ~2350 lines), the coverage mask, and the C++ composite layer-stack loops. Vendor mc_render/mc_sample/mc_s3.

Fixes:

- extern "C" guards on mc_render.h/mc_sample.h (undefined mc_render_pick_lod).
- QuadSurface::gen() returns a non-continuous ROI view; clone to continuous before handing the flat ptr to mc_render (was shearing/streaking the quad).
- Pyramid is power-of-two with chunk-padded shapes, so LOD scale stays 2^L.

…split) Pull upstream a88266a (mc_render 3D resampling: surface volumes + oriented boxes) and the decode-vs-encode timing split into VC's vendored copy.

…ter_compressor Upstream 9a28fb7 folded mc_sample.{c,h}, mc_render.{c,h}, mc_sample_internal.h into the single matter_compressor.{c,h} pair. Mirror that: delete the 5 folded files, sync all sources, drop them from CMake (matter_compressor = just matter_compressor.c), and include only mc_volume.h in McVolumeArray (it pulls in matter_compressor.h). Also picks up the mc-decode/mc-download thread naming.

Upstream folded mc_volume/mc_zarr/mc_s3 into one matter_compressor.{c,h} pair. Mirror that into VC's vendored copy: delete the 6 folded files, collapse the CMake to a single matter_compressor target (matter_compressor.c + c3d.c, links libs3 + zstd), keep matter_compressor_volume as an ALIAS so consumers resolve. McVolumeArray now includes matter_compressor.h.

Wire up the runtime RAM-cache controls the merged TU exposes:

- McVolumeArray::stats() populates the decoded-RAM gauge (cache_used/cap_blocks) and a download-rate estimate (net_bytes delta / wall-clock, light EMA).
- IChunkedArray::setDecodedByteCapacity + McVolumeArray impl -> mc_volume_set_cache_bytes.
- Settings dialog applies the RAM cache GB live (resize the active volume's cache).

The .vca/.mca export + recompression flow lives in matter-compressor now; the old SigV4/HttpClient-based zarr recompressor is dead. Removes one http_fetch + c3d consumer.

createChunkCache now opens every volume via mc_volume — remote zarr streams + transcodes into a local .mca (as before), and local zarr directories use mc_volume's new local-filesystem source (sibling .mcacache dir). One render/ cache path; no ChunkCache/MatterCache/ZarrChunkFetcher construction. The zarr metadata openers stay for shape/dtype discovery (zarrOpen/NewFromUrl).

…ccessor) Picks up mc_volume local-filesystem source (file_read) and mc_volume_get_level_meta so VC can read per-level pyramid metadata straight from an opened mc_volume.

mc_render composites inside matter-compressor now (min/mean/max/alpha along the normal), so VC's per-pixel C++ compositing is dead: delete Compositing.cpp (LayerStack, CompositeMethod::*, compositeLayerStack, methodRequiresLayerStorage, buildTfLut256, computeLightingFactor) + test_compositing — zero non-test callers. Keep CompositeParams/CompositeRenderSettings (GUI settings carriers mapped to mc_render params). Also drop orphaned _values/_coverage viewer members.

Both libs/c3d/c3d.{c,h} and libs/matter-compressor/src/c3d.{c,h} were byte-identical. Delete the libs/c3d copy; the c3d target (utils_c3d_codec's dep) now compiles matter-compressor's c3d.c. One source of truth, matching the 'matter-compressor owns the codec' direction.

…rrChunkFetcher) McVolumeArray (matter-compressor) is now the ONLY render/cache path for every volume. Delete ChunkCache, MatterCache, ZarrChunkFetcher (.cpp/.hpp) + their tests. Replace ChunkCache::Options with a minimal DecodedCacheOptions (only decodedByteCapacity is consumed). New ZarrMetadata module reads pyramid shapes/dtype/fill via utils::zarr for zarrOpen/NewFromUrl (no fetcher). Migrate ChunkedTensor::openChunkedArrayCache (tracer local path), vc_cache_prefetch, and vc_render_tifxyz to McVolumeArray/IChunkedArray. Remote .mca probe stubbed (GUI-unreachable). Co-developed via isolated worktree agent.

…set ladder)

- reuse one c3d decoder per decode-pool thread (~14% off each decode)
- cap decode pool at nproc/2 (memory-bandwidth-bound; ~40% lower per-region latency)
- upstream calibrated preset ladder

The download rate was a poll-to-poll net_bytes delta that snapped to 0 whenever a poll fell between s3_get_batch arrivals (batches land every few seconds), and the status bar also gated on remoteFetchesInFlight which flickers to 0 between request bursts. Now: average bytes over a sliding 2s window, hold the last nonzero rate for 3s of idle before declaring 0, and gate the readout on the rate (not the in-flight count). Shows a steady 'downloading @ X MB/s' while data is actually streaming.
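One way to realize the sliding-window readout described above is sketched here: keep recent samples of the cumulative byte counter, average over the last 2 s, and keep showing the last nonzero rate for up to 3 s once the window reports zero. `DownloadRate` is a hypothetical class with an injected clock for testability; the actual McVolumeArray implementation may differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

class DownloadRate {
public:
    // Feed the cumulative byte counter at time `nowSec` (monotonic seconds).
    // Returns the rate to display, in bytes per second.
    double sample(double nowSec, uint64_t totalBytes) {
        samples_.push_back({nowSec, totalBytes});
        // Drop samples older than the window, always keeping the newest one.
        while (samples_.size() > 1 && nowSec - samples_.front().t > kWindowSec)
            samples_.pop_front();

        const Sample& oldest = samples_.front();
        const double dt = nowSec - oldest.t;
        const double rate = dt > 0.0 ? double(totalBytes - oldest.bytes) / dt : 0.0;

        if (rate > 0.0) {            // bytes moved inside the window
            lastRate_ = rate;
            lastNonzeroT_ = nowSec;
            return rate;
        }
        // Window is idle: hold the last nonzero reading through short gaps.
        if (lastRate_ > 0.0 && nowSec - lastNonzeroT_ <= kHoldSec) return lastRate_;
        lastRate_ = 0.0;
        return 0.0;
    }

private:
    struct Sample { double t; uint64_t bytes; };
    static constexpr double kWindowSec = 2.0;
    static constexpr double kHoldSec = 3.0;
    std::deque<Sample> samples_;
    double lastRate_ = 0.0;
    double lastNonzeroT_ = 0.0;
};
```

Gating the status readout on this value rather than on an in-flight request count avoids the flicker between request bursts.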

…eadout

- per-region interactive fetch (no whole-shard download), net_bytes counts real transferred bytes
- McVolumeArray download-rate: sliding 2s window held through batch-burst gaps; status bar gates on rate not the flickery in-flight count

…eue)

Fixes regions wrongly poisoned as air on transient read errors — they now stay ABSENT and retry, only marking ZERO when S3 confirms air (c3d index air marker / v2 404).

…ladder)

- staging queue decoupled from decode via a 2GB byte budget (network saturates ahead of CPU-bound decode); mc_volume_set_staging_bytes runtime API
- upstream calibrated compression preset ladder (MC_PRESET_ARCHIVAL..PREVIEW), pulled down but not yet used by VC

…render scratch reuse

- vendored matter-compressor: per-thread mc_codec_ctx (eliminates ~14% TLS overhead) + single-flight region fetch (re-decode redundancy 15x -> 1.00x)
- McVolumeArray::render: reuse thread-local pts/nrm scratch instead of a fresh 13MB vector per frame (cuts render-worker page-fault churn)

…decode) Fixes ~4% render CPU wasted rebuilding the quant table (4096 powf) on every mc_cache-miss block read.

Cache (matter-compressor) is now fully lock-free on the render path:

- delete per-shard mutex; mc_cache_update partitions fill by shard ownership (worker t owns shards t::nt) so every mutation is single-owner. Reworked the async-update ticket the same way.
- getters (get/get_copy/contains/best_lod) are bare lock-free probes; removed rd_mu (single-owner reader decode) and encoder bm_mu (stripes write byte-disjoint bitmap regions). mc lock-ops 77->33, none in the cache/render path; survivors are all off-tick (ftruncate, S3 footer, download<->decode queue handoff, one-time init).
- coverage memo: mc_archive_chunk_coverage is an O(1) hash probe instead of a per-block mc_resolve_chunk tree walk; backfills from the tree.

Render-hitch fix: mc_block_range was O(n^2) over a chunk's blocks (per-block 512B popcount + O(slot) prefix sum). Replaced with a thread-local per-chunk bi->offset index built in one O(4096) pass; O(1)/block. Also memoize the chunk_offset tree walk per chunk in src_archive. Fresh-working-set spike 348ms->~35ms, worker median 79->25ms. Residual ~6ms is pure AVX2 trilinear (already optimal).

Render scheduling: the global tick is the only scheduler.

- chunk arrival never schedules a render; the tick renders while data is in flight (whole-pipeline depth collated at thaw into a plain frozen inflight_snapshot -- queued+downloading+decode-queue+undrained misses, read during frozen render, no atomic/lock).
- removed the worker-busy render queue: if a render worker is still busy at tick time, skip this tick and re-arm dirty; no completion self-scheduling. Kills the worker-finished re-render chains (was ~320/session, now 0).
- fold per-viewer timers into the global clock; Constants.hpp for the tick/debounce constants.
- clear the stale latency-origin stamp on superseded frames.

… gen) Pull back the correctness fixes found by the upstream mc test suite: content-key the per-chunk block-offset index (arc+off+xxh64) so a stale index isn't served after chunk_off reuse/re-append, and add an archive gen counter that invalidates src_archive's chunk_off memo on publish.

matter-compressor is now the single render/composite/colormap engine for the GUI. Net ~1000 lines of VC3D render code deleted.

- mc_colormap (vendored): window/level + colormap LUT as baked static [256][3] gray->RGB tables (viridis/magma/fire bit-identical to OpenCV, + gray/r/g/b/cyan/magenta tints). The viewer calls mc_colormap_lut/mc_colormap_id instead of buildWindowLevelColormapLut.
- delete BOTH postprocess pipelines: util/PostProcess (CLAHE/raking/stretch/iso/remove-small-components) and render/PostProcess (the LUT builder mc replaced) + their tests. Drop SampleParams::postProcess and Volume::applyOptionalPostProcess (CLI 2D post, also gone).
- slim CompositeParams/CompositeRenderSettings to the mc-mapped fields: method, alphaMin/Opacity, percentile, and the SHADED knobs (light dir, ambient/diffuse/specular/shininess/absorption/shadow/sss/curvature). Deleted beerLambert, lighting, volume gradients, pre/post transfer functions, dvr/pbr, etc. -- mc's SHADED mode subsumes the relief/raking/lighting passes.
- remove the streamingCompositeUnsupported fallback; every method is an mc reduction now. Extend the method map: stddev=5 shaded=6 percentile=7 depth=8. McVolumeArray::render takes a ShadeParams* for those modes.
- Colormaps.cpp keeps only the UI registry (specs/resolve/entries); makeColors/applyPackedLut pixel funcs removed (no live callers).

GUI verified: 281 frames rendered through the mc path, no errors. CLI tools (Slicing.cpp, zarr reader) still use legacy paths -- deferred.

vendor: sync matter-compressor (1a62a52) -- fast thaw (region-granular absent miss + ABSENT coverage memo + single-call fill), per-pixel LOD fallback, pointer-into-arena sampler, LIFO decode queue.

VC3D side:

- McVolumeArray slice path uses mc_render_points_par_lod with native L0 coords: per-pixel coarser-LOD fallback (coarse-not-black on zoom/pan, render never decodes). Composite path still single-level. ShadeParams pass-through for SHADED/PERCENTILE.
- THAW-GATE (correctness): onGlobalTick skips thaw/freeze/render while any viewer's async render worker is busy. The pointer-into-arena sampler holds raw arena pointers across the async render; thawing (the only mutator) mid-render would evict/realloc under them -> use-after-free. Gating restores the freeze/thaw invariant: no mutation while frozen reads are live. (renderWorkerBusy() accessor added.)
- "downloading N" status now reads regions_inflight (stable pipeline depth) while the render gate reads workPending (+ undrained misses), so the status no longer ping-pongs 0..thousands with the per-frame miss set.
- SurfacePatchIndex rtree rebuild moved to a dedicated 1-thread QThreadPool (was on the global pool shared with render bands -> a 1.6s startup rebuild stole render workers). Internals untouched; rebuild now timed.

Drops the per-entry 4KB block-memo buffer: the sampler caches the cache arena pointer directly (owns_ptr source) instead of a 1MB/8MB per-band- per-frame buffer. LOD sample L4 27ms->8ms, L0 13ms->5ms; kills the sampler-alloc page-fault storm that was ~25% of render-thread time.

At the start of each global tick, before freeze, every viewer predicts the 256^3 regions it will sample this frame and the tick submits the downloads up front -- so data front-runs the render instead of being discovered as a miss a cycle late.

Prediction (predictWorkingSet): run the surface's OWN gen() at 1/8 res over the exact render viewport, so we get only the IN-VIEWPORT coords (gen clips to the screen rect; a warped sheet's off-screen patch is excluded). Map each coord -> 16^3 block -> 256^3 region, dedup at region granularity.

This replaced an earlier control-grid walk that sprayed ~46k regions across the volume (a folded sheet's padded control window) and flooded the download stack ("downloading 45000"). Now ~hundreds of regions, stack depth maxes ~50.

Submit via McVolumeArray::prefetchChunks -> mc_volume_request_region (cheap, absent-only, no decode). Cross-viewer + inflight/present dedup is handled by the deduping download stack in mc, not here.

vendor: sync mc 63e9601. VCA_NO_PREDICT disables for A/B.
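The coord -> block -> region mapping with region-granular dedup might be sketched as below. `Vec3f`, `regionKey` and `predictRegions` are hypothetical names; the sketch assumes L0 voxel coordinates, volume dimensions below 2^28 per axis, and that out-of-volume or NaN samples are simply skipped.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Vec3f { float x, y, z; };

// Pack a region index (one 256^3 cell) into a single dedup key.
inline uint64_t regionKey(uint32_t rz, uint32_t ry, uint32_t rx) {
    return (uint64_t(rz) << 40) | (uint64_t(ry) << 20) | uint64_t(rx);
}

// Map sampled coordinates to the distinct 256^3 regions they touch,
// preserving first-seen order so nearer-to-first samples are requested first.
inline std::vector<uint64_t> predictRegions(const std::vector<Vec3f>& coords,
                                            uint32_t dimZ, uint32_t dimY,
                                            uint32_t dimX) {
    std::unordered_set<uint64_t> seen;
    std::vector<uint64_t> out;
    for (const Vec3f& p : coords) {
        // Range checks in float first: rejects NaN and keeps the casts safe.
        if (!(p.x >= 0.0f && p.x < float(dimX))) continue;
        if (!(p.y >= 0.0f && p.y < float(dimY))) continue;
        if (!(p.z >= 0.0f && p.z < float(dimZ))) continue;
        const uint32_t x = uint32_t(p.x), y = uint32_t(p.y), z = uint32_t(p.z);
        // voxel -> 16^3 block is >> 4; block -> 256^3 region is another >> 4.
        const uint64_t key = regionKey(z >> 8, y >> 8, x >> 8);
        if (seen.insert(key).second) out.push_back(key);
    }
    return out;
}
```

Deduplicating at region granularity keeps the submitted set proportional to the viewport's footprint rather than to the number of sampled points.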

SuperOptimizer added 15 commits June 9, 2026 15:16

vendor: sync matter-compressor to upstream 2666f4d (format v4, mc_cache)

d9bec9f

Brings the unified archive format v4 (self-contained blocks, trained priors), the mc_cache sharded CLOCK/NRU decoded-block cache, and lock-free decode.

VC3D: skip DSO finalizers on exit (mimalloc vs gnutls teardown segfault)

e2803d0

Every exit crashed in _dl_fini: gnutls/libtasn1 destructors free through mimalloc after its own teardown. Run real teardown (CWindow dtor, settings), flush, _exit().

render: shift/mask chunk indexing in the sampler hot path

c19d5fe

Chunk shapes are powers of two (16 mca, 128/256 zarr); replaces 3 idivs per voxel read in trilinear/nearest sampling.
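A minimal sketch of the shift/mask idea, assuming power-of-two chunk edges as stated above. ChunkAxis, makeChunkAxis, chunkIndex and chunkOffset are illustrative names, not the sampler's real identifiers.

```cpp
#include <cstdint>

// Per-axis precomputed constants for a power-of-two chunk edge.
struct ChunkAxis {
    uint32_t shift;  // log2(chunk edge)
    uint32_t mask;   // chunk edge - 1
};

constexpr bool isPow2(uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Computed once per volume, outside the hot path. Caller must pass a power of two.
constexpr ChunkAxis makeChunkAxis(uint32_t chunkEdge) {
    uint32_t shift = 0;
    while ((1u << shift) < chunkEdge) ++shift;
    return {shift, chunkEdge - 1};
}

// Hot path: one shift and one AND instead of an integer divide and modulo.
constexpr uint32_t chunkIndex(ChunkAxis a, uint32_t voxel) { return voxel >> a.shift; }
constexpr uint32_t chunkOffset(ChunkAxis a, uint32_t voxel) { return voxel & a.mask; }

static_assert(isPow2(16) && isPow2(128) && isPow2(256), "chunk edges named in the commit");
```

With one such pair per axis, a voxel lookup needs three shifts and three ANDs, which is where the three integer divides per read go away.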

VC3D: delete dead speculative-prefetch code

8c5e332

The prefetch paths were disabled; remove them: plane/surface halos, normal-direction neighbors, visible-set warming, the surface tile prefetch cache, and IChunkedArray::prefetchShape (only consumer). ~550 lines.

SuperOptimizer requested a review from hendrikschilling as a code owner June 10, 2026 15:17

SuperOptimizer and others added 5 commits June 10, 2026 10:19

Merge branch 'ScrollPrize:main' into mca_cache

aed197f

vendor: unified matter_compressor.{h,c} (upstream 4b5e7ed); drop vendored tests

10e34e4

Upstream consolidated 11 source files into one header + one implementation; VC vendors just the pair (same code, same format; existing caches unaffected).

Merge branch 'mca_cache' of github.com:SuperOptimizer/villa into mca_cache

5206f96

vendor: drop matter-compressor README + .gitignore (keep LICENSE)

7d6e3a1

render: unify MatterArchive + MatterCacheFetcher into MatterCache.{hpp,cpp}

056f6ee

One file pair for the whole mca layer: the archive/mc_cache RAII wrapper and the IChunkFetcher decorator that fills it from the source volume.

SuperOptimizer added 2 commits June 10, 2026 15:54

test: judge writeZYX/readZYX round-trip by MAE/PSNR, not exact equality

95f5993

Reads go through the lossy mca render cache (q=8); the exact-value checks predate that. MAE < 16 / PSNR > 22 dB (measured: ~7.7 / ~26 dB).
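The tolerance check described above can be sketched like this. It is not the repository's test code: Fidelity, compareU8 and withinTolerance are hypothetical names, and PSNR is computed against a peak of 255, which is an assumption that fits uint8 data.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Fidelity {
    double mae;     // mean absolute error
    double psnrDb;  // peak signal-to-noise ratio in dB (peak assumed 255)
};

// Compares a written buffer with what was read back through a lossy path.
Fidelity compareU8(const std::vector<uint8_t>& written, const std::vector<uint8_t>& readBack) {
    assert(written.size() == readBack.size() && !written.empty());
    double absSum = 0.0, sqSum = 0.0;
    for (std::size_t i = 0; i < written.size(); ++i) {
        const double d = double(written[i]) - double(readBack[i]);
        absSum += std::fabs(d);
        sqSum += d * d;
    }
    const double n = double(written.size());
    const double mse = sqSum / n;
    const double psnr = (mse == 0.0) ? std::numeric_limits<double>::infinity()
                                     : 10.0 * std::log10(255.0 * 255.0 / mse);
    return {absSum / n, psnr};
}

// Thresholds quoted in the commit message: MAE < 16 and PSNR > 22 dB.
bool withinTolerance(const Fidelity& f) { return f.mae < 16.0 && f.psnrDb > 22.0; }
```

Checking both metrics guards against two different failures: MAE catches a uniform drift, while PSNR is more sensitive to a few large outliers.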

chatgpt-codex-connector Bot reviewed Jun 10, 2026

SuperOptimizer added 26 commits June 10, 2026 20:06

vendor: sync matter-compressor (3D resampling + decode/encode timing split)

df53eed

Pull upstream a88266a (mc_render 3D resampling: surface volumes + oriented boxes) and the decode-vs-encode timing split into VC's vendored copy.

apps: delete vc_zarr_recompress CLI (superseded by matter-compressor)

c4e4ac3

The .vca/.mca export + recompression flow lives in matter-compressor now; the old SigV4/HttpClient-based zarr recompressor is dead. Removes one http_fetch + c3d consumer.

vendor: sync matter-compressor (local file_read source + level-meta accessor)

440a906

Picks up mc_volume local-filesystem source (file_read) and mc_volume_get_level_meta so VC can read per-level pyramid metadata straight from an opened mc_volume.

vendor: sync matter-compressor (cap decode pool at nproc/2)

2e18fb3

vendor: sync matter-compressor (decoder reuse + decode-pool cap + preset ladder)

a3e2ce8

- reuse one c3d decoder per decode-pool thread (~14% off each decode)
- cap decode pool at nproc/2 (memory-bandwidth-bound; ~40% lower per-region latency)
- upstream calibrated preset ladder

vendor: sync matter-compressor (batched per-region download + deep queue)

7aff1b4

vendor: sync matter-compressor (config API for pipeline tuning)

7bff47b

vendor: sync matter-compressor (ZERO only on confirmed-air S3 answer)

8a105f1

Fixes regions wrongly poisoned as air on transient read errors: they now stay ABSENT and retry, and are marked ZERO only when S3 confirms air (c3d index air marker / v2 404).
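The rule in this fix can be read as a small state transition. The sketch below uses hypothetical enums and function names (RegionState, FetchResult, nextState, shouldRetry), not matter-compressor's real types; only the ABSENT and ZERO labels come from the commit message.

```cpp
// Hypothetical model of a region's cache state.
enum class RegionState { Absent, Zero, Present };

// Hypothetical classification of one fetch attempt.
enum class FetchResult { Data, ConfirmedAir, TransientError };

// A region becomes Zero only on a confirmed-air answer; a transient error
// leaves it unchanged so that a later attempt can still succeed.
constexpr RegionState nextState(RegionState current, FetchResult result) {
    switch (result) {
        case FetchResult::Data:           return RegionState::Present;
        case FetchResult::ConfirmedAir:   return RegionState::Zero;
        case FetchResult::TransientError: return current;
    }
    return current;
}

// Only Absent regions are worth requesting again.
constexpr bool shouldRetry(RegionState s) { return s == RegionState::Absent; }
```

The key property is that Zero is terminal while Absent is not, so misclassifying an error as air would permanently hide real data.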

vendor: sync matter-compressor (per-thread ctx for render-path block decode)

5a66461

Fixes ~4% render CPU wasted rebuilding the quant table (4096 powf) on every mc_cache-miss block read.
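One way to avoid rebuilding a table on every miss is to keep it in a per-thread context, sketched below. DecodeCtx and threadCtx are hypothetical, and the table formula is a placeholder chosen only so the example compiles; the real quantisation math in matter-compressor is not shown in this PR.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical per-thread decode context that memoises a 4096-entry table.
struct DecodeCtx {
    std::array<float, 4096> quant{};
    float builtFor = -1.0f;  // quality value the table was last built for
    int rebuilds = 0;        // counts rebuilds, for illustration only

    const std::array<float, 4096>& table(float q) {
        assert(q > 0.0f);
        if (q != builtFor) {
            for (std::size_t i = 0; i < quant.size(); ++i)
                quant[i] = std::pow(2.0f, q * float(i) / 4096.0f);  // placeholder formula
            builtFor = q;
            ++rebuilds;
        }
        return quant;  // reused as long as q is unchanged
    }
};

// One context per thread: no locking, and the table survives across block reads.
inline DecodeCtx& threadCtx() {
    thread_local DecodeCtx ctx;
    return ctx;
}
```

Repeated calls to threadCtx().table(q) with the same q on one thread then cost a single comparison after the first build.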

SuperOptimizer force-pushed the mca_cache branch from 4f6119c to db373d1 June 11, 2026 20:35

SuperOptimizer added 3 commits June 11, 2026 21:38

SuperOptimizer wants to merge 67 commits into ScrollPrize:main from SuperOptimizer:mca_cache

		if (sub.status != ChunkFetchStatus::Found || sub.bytes.empty())
			return; // missing/air sub-chunk -> stays zero in the region
