VC3D: matter-compressor render cache (one .mca per volume, mc_cache resident) + fetch/render perf #1036
SuperOptimizer wants to merge 67 commits into
Vendor SuperOptimizer/matter-compressor @ ab0649c into libs/. This snapshot has the appendable dense-node archive: mc_writer_open / mc_append_chunk_raw / mc_append_chunk_compressed / mc_writer_close (persistent, crash-safe, reopened across runs) + mc_open_streaming (byte-source range-GET reader). Foundation for the mca streaming/re-encode cache wired into the chunk fetch path.
…nto vc_core
- re-vendor matter-compressor @ 1524688 (unified mc_archive read+write handle, vendoring-friendly CMake)
- add_subdirectory(libs/matter-compressor) + link matter_compressor into vc_core
- MatterArchive: RAII C++ wrapper around mc_archive_open/append_chunk_raw/chunk_offset/decode_block/close. Storage/encode unit 256^3; decode/serve unit 16^3 (the granularity the resident chunk cache will key on). Append is thread-safe; decode is serialized by the underlying archive (codec quality is process-global).
MatterCacheFetcher decorates each level's source (zarr/c3d) fetcher to re-express the volume through one persistent matter-compressor (.mca) archive: 'fetch native, serve mca-native'.
- The volume is reported to the ChunkCache at mca's native 16^3 chunk granularity (the resident cache holds 4KB blocks).
- fetch(16^3 key) -> enclosing 256^3 mca region. On a miss, fetch the SOURCE's native chunks covering it (256^3 c3d = 1:1; 128^3 zarr-v2 = eager 2x2x2 coalesce), assemble one 256^3 u8 buffer, encode it into the .mca once. Then decode the requested 16^3 block out of the .mca.
- One .mca holds all chunks at all 8 LODs and persists across runs (skips re-fetch on a warm cache). Region-materialization is memoized + checks the persisted archive.
Wired in createChunkCache, gated by VCA_MCA_CACHE=<path.mca> (+ VCA_MCA_QUALITY, default 8); uint8 volumes only (mca is u8). Off by default. Full VC3D builds + links.
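The block-to-region mapping described above can be sketched as follows. This is a minimal illustration, not the PR's code: `Coord3`, `regionOfBlock`, and `sourceChunksForRegion` are hypothetical names, and the sketch assumes cubic source chunks whose edge divides 256.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical index type for illustration; not the PR's declarations.
struct Coord3 { std::uint32_t z, y, x; };

constexpr std::uint32_t kBlockEdge  = 16;   // decode/serve unit (16^3)
constexpr std::uint32_t kRegionEdge = 256;  // storage/encode unit (256^3)

// Index of the 256^3 region that encloses a given 16^3 block index.
constexpr Coord3 regionOfBlock(Coord3 block) {
    constexpr std::uint32_t perEdge = kRegionEdge / kBlockEdge;  // 16 blocks per region edge
    return {block.z / perEdge, block.y / perEdge, block.x / perEdge};
}

// Source-chunk indices needed to assemble one region.
// Assumes srcEdge divides 256: 256 -> 1 chunk (1:1), 128 -> 2x2x2 = 8 chunks.
std::vector<Coord3> sourceChunksForRegion(Coord3 region, std::uint32_t srcEdge) {
    const std::uint32_t n = kRegionEdge / srcEdge;
    std::vector<Coord3> out;
    out.reserve(static_cast<std::size_t>(n) * n * n);
    for (std::uint32_t dz = 0; dz < n; ++dz)
        for (std::uint32_t dy = 0; dy < n; ++dy)
            for (std::uint32_t dx = 0; dx < n; ++dx)
                out.push_back({region.z * n + dz, region.y * n + dy, region.x * n + dx});
    return out;
}
```

With a 128^3 source, one region miss therefore turns into eight source fetches, which is the "eager 2x2x2 coalesce" case.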
…cache dir
The mca integration was in ZarrChunkFetcher's createChunkCache, which VC3D does NOT call -- the live path is Volume::createChunkCache. Move it there:
- ONE persistent volume.mca lives in the volume's existing remote cache dir (remoteCacheRoot_/id()), NOT /tmp -- same place other chunks are cached.
- When mca engages it REPLACES the old per-chunk-file persistent cache (mca IS the persistence), so persistentCachePath is left unset; the old per-chunk cache remains only as the fallback when mca is off (non-uint8, local, or VCA_NO_MCA_CACHE).
- Default ON for remote uint8 volumes; VCA_NO_MCA_CACHE disables; VCA_MCA_QUALITY sets q.
applyMatterCache() factored into ZarrChunkFetcher (wrap fetchers + 16^3 LevelInfo). Also deleted the stale 503GB .vca .vcacache disk-cache artifact (superseded by mca).
Rip out the entire per-chunk-file persistent cache. On-disk caching is now ONLY the single per-volume matter-compressor archive (volume.mca), and ALL volumes -- remote AND local -- go through it.
- ChunkCache: deleted readPersistent/readPersistentEmpty/queuePersistentWrite/queuePersistentEmptyWrite/writePersistent/writePersistentEmpty/persistentPath/persistentEmptyPath/persistentCacheBytes(dir)/persistentCacheWriterPool + the Entry::persisted field + the persist-on-evict path. The fetch worker just calls fetch() (the MatterCacheFetcher owns mca read/write).
- Options::persistentCachePath -> Options::mcaPath (path to the single volume.mca); the 'disk' cache-size stat now reports that file's size (a throttled file_size), not a recursive dir scan of millions of per-chunk files.
- ChunkFetchResult: dropped persistentBytes/hasPersistentBytes; IChunkFetcher dropped persistentCacheExtension/decodePersistentBytes; ZarrChunkFetcher dropped its encoded-c3d persistent path.
- Volume::createChunkCache: mca for every volume (remote -> remote cache dir; local -> <dataset>.mcacache/), uint8 only, VCA_NO_MCA_CACHE disables. Viewer no longer sets a per-chunk persistent path.
- Tests: removed the per-chunk-cache TEST_CASEs (kept the unrelated ones); all pass.
Also deleted the stale 503GB .vca .vcacache and the 81GB per-chunk chunk_cache on disk.
…port Root cause: Volume::createChunkCache resolved the mca dir from Volume::remoteCacheRoot_, but that member was empty for the render path's volumes -- the GUI resolves the cache root (remoteCacheRootForState: /volpkgs|/ephemeral|settings) and the old code passed it via Options, which I'd removed. So cacheDir came up empty, applyMatterCache was skipped, and the slow raw S3-streaming path ran (pegging CPU + 22 S3 conns) while 'disk' read 0 (options_.mcaPath unset). Fix: add Volume::setRemoteCacheRoot(); the viewer pushes the GUI-resolved root into the volume before createChunkCache builds the cache. Now mca engages (verified: 'mca cache enabled', one volume.mca written + grown, chunks served from it, disk stat reads its size). Added diagnostics: createChunkCache logs isRemote/cacheDir/levels, applyMatterCache logs why it skips.
Two fixes for 'spinning, barely downloading' (CPU-bound, no progress):
- Re-vendor lock-free mc_archive_decode_block (was serializing every 16^3 decode on one mutex -> all cache-IO threads spun on one lock).
- MatterCacheFetcher::ensureRegion: per-256^3-region single-flight. A render touches up to 4096 16^3 blocks of the same region nearly simultaneously; before, EACH thread passed the not-done check and redundantly re-fetched the same 8 source chunks + re-encoded the same 256^3 region. Now the first thread claims the region (InFlight), assembles+encodes it once, and publishes Present/Absent; the rest wait on a condvar for that one assembly. Eliminates the redundant fetch/encode storm.
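A minimal sketch of the per-region single-flight pattern described above, using a mutex and a condition variable. The class name, the `assemble` callback, and the 64-bit region key are assumptions for illustration; only the InFlight/Present/Absent states come from the commit message.

```cpp
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

enum class RegionState { InFlight, Present, Absent };

// Hypothetical helper: the first caller for a region runs assemble(),
// every other caller for the same region waits for that one result.
class RegionSingleFlight {
public:
    // assemble() returns true if the region held data (Present), false otherwise (Absent).
    // Note: this sketch does not handle assemble() throwing; the region would stay InFlight.
    RegionState ensure(std::uint64_t regionKey, const std::function<bool()>& assemble) {
        std::unique_lock<std::mutex> lock(mu_);
        if (states_.find(regionKey) == states_.end()) {
            states_.emplace(regionKey, RegionState::InFlight);  // claim the region
            lock.unlock();
            const bool hasData = assemble();                    // fetch + encode outside the lock
            lock.lock();
            const RegionState result = hasData ? RegionState::Present : RegionState::Absent;
            states_[regionKey] = result;                        // publish
            cv_.notify_all();
            return result;
        }
        // Someone else claimed it: wait until it is no longer InFlight.
        cv_.wait(lock, [&] { return states_.at(regionKey) != RegionState::InFlight; });
        return states_.at(regionKey);
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::unordered_map<std::uint64_t, RegionState> states_;
};
```

The expensive work runs with the lock released, so threads asking for other regions are not blocked by one region's assembly.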
Brings the unified archive format v4 (self-contained blocks, trained priors), the mc_cache sharded CLOCK/NRU decoded-block cache, and lock-free decode.
- MatterArchive owns an mc_cache bound to the archive; decodeBlock is a mc_cache_get_copy. Stale/incompatible volume.mca is deleted and rebuilt.
- ChunkCache in mca mode keeps NO decoded bytes and tracks status per 256^3 REGION (corner-block key): one entry + one fetch task per region instead of 4096 per-block entries duplicating what the archive + mc_cache already know. Resolved blocks decode straight from mc_cache. Legacy byte-LRU remains only for non-uint8 volumes (mca is u8).
- fetch throughput: pool tasks interleave across regions (rank priority), 48 I/O workers (they are mostly network-blocked), parallel sub-chunk assembly for 128^3 sources, prefetch keys snapped to region corners.
- pool submission moved outside the cache mutex; chunk-ready callbacks throttled to ~30Hz with a guaranteed final notify on drain.
- ChunkKey hash packs to 64 bits (3 lod + 3x20 coord bits) + one mix.
- download stats count only bytes actually pulled from the source; blocks served from the on-disk archive are not downloads.
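One way the "3 lod + 3x20 coord bits + one mix" key hash could look. The `ChunkKey` field layout here is hypothetical, and the mixer is the splitmix64 finalizer, swapped in as an assumption because the commit message does not say which mix is used.

```cpp
#include <cstdint>

// Hypothetical key layout for illustration.
struct ChunkKey { std::uint32_t lod, z, y, x; };

// Pack 3 LOD bits + 3 x 20 coordinate bits into one 64-bit word (63 bits used).
constexpr std::uint64_t packKey(const ChunkKey& k) {
    constexpr std::uint64_t m20 = (1ull << 20) - 1;
    return (static_cast<std::uint64_t>(k.lod & 0x7u) << 60) |
           ((k.z & m20) << 40) | ((k.y & m20) << 20) | (k.x & m20);
}

// splitmix64 finalizer: a single bijective mix over the packed word (assumed choice).
constexpr std::uint64_t mix64(std::uint64_t v) {
    v ^= v >> 30; v *= 0xbf58476d1ce4e5b9ull;
    v ^= v >> 27; v *= 0x94d049bb133111ebull;
    v ^= v >> 31;
    return v;
}

struct ChunkKeyHash {
    std::uint64_t operator()(const ChunkKey& k) const { return mix64(packKey(k)); }
};
```

Because the packing is injective for coordinates below 2^20 and the mix is bijective, distinct keys in that range never collide before the hash table reduces the value to a bucket index.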
…unk) Remote sharded reads did two ranged GETs per chunk (16-byte index entry + payload); over S3 the index round trip doubled per-chunk latency. Fetch the whole index table once per shard and serve entries from RAM.
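A sketch of the "fetch the index table once per shard" idea. `ShardIndexCache` and its `Loader` callback are hypothetical; the loader stands in for whatever single ranged GET retrieves the whole table, and the 16-byte entry is assumed to be an offset/length pair.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One 16-byte index entry (assumed layout: payload offset + payload length).
struct IndexEntry { std::uint64_t offset; std::uint64_t nbytes; };

class ShardIndexCache {
public:
    // Hypothetical loader: performs ONE ranged GET for a shard's whole index table.
    using Loader = std::function<std::vector<IndexEntry>(const std::string& shardKey)>;

    explicit ShardIndexCache(Loader loader) : loader_(std::move(loader)) {}

    // First lookup for a shard loads its table; later lookups are served from RAM.
    // For brevity the lock is held across the load; real code would likely avoid that.
    IndexEntry entry(const std::string& shardKey, std::size_t chunkInShard) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = tables_.find(shardKey);
        if (it == tables_.end())
            it = tables_.emplace(shardKey, loader_(shardKey)).first;
        return it->second.at(chunkInShard);
    }

private:
    Loader loader_;
    std::mutex mu_;
    std::unordered_map<std::string, std::vector<IndexEntry>> tables_;
};
```

Each chunk read then costs one payload GET instead of an index GET plus a payload GET.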
- interactive preview deleted outright (~420 lines): coarse axis-slice preview, stable-frame transform previews, 50ms preview rate cap, 140ms settle timer. Interactions schedule normal full-quality renders.
- chunk-ready 250/500ms restart-on-arrival windows deleted; they starved repaints for as long as chunks kept streaming. The cache-side 30Hz throttle + the 16ms render debounce are the only coalescing layers.
- speculative prefetch disabled (visible-set warming, viewport halos, normal-direction neighbors): render-worker tryGetChunk misses queue exactly what requested frames need; the mca archive makes refetches cheap.
- prefetch key enumeration is region-granular via IChunkedArray::prefetchShape (was 512x more keys than designed for after the 16^3 re-expression).
- stale-render fix: a busy-time submit only invalidates the in-flight frame when view params actually changed (fingerprint); data-only refreshes let it display. Discarded renders drop from 19% to 5%.
- status bar only reports downloading when bytes move from the source.
Every exit crashed in _dl_fini: gnutls/libtasn1 destructors free through mimalloc after its own teardown. Run real teardown (CWindow dtor, settings), flush, _exit().
Chunk shapes are powers of two (16 mca, 128/256 zarr); replaces 3 idivs per voxel read in trilinear/nearest sampling.
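For illustration, a power-of-two chunk edge lets the chunk index and the in-chunk offset be computed with a shift and a mask instead of a division and a modulo. The helper names below are hypothetical.

```cpp
#include <cstdint>

struct ChunkAddr { std::uint32_t chunk; std::uint32_t offset; };

constexpr bool isPow2(std::uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// log2 of a power-of-two edge, computed once per level rather than per voxel.
constexpr std::uint32_t log2Pow2(std::uint32_t edge) {
    std::uint32_t s = 0;
    while ((1u << s) < edge) ++s;
    return s;
}

// Equivalent to {coord / edge, coord % edge} when edge == 1 << shift.
constexpr ChunkAddr splitCoord(std::uint32_t coord, std::uint32_t shift) {
    return {coord >> shift, coord & ((1u << shift) - 1u)};
}
```

For example, with a 128-wide chunk the shift is 7, so coordinate 300 lands in chunk 2 at offset 44.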
v6: per-axis dims, per-chunk q, xxh64, per-volume priors; SIMD decode kernels; S3FIFO default mc_cache eviction (scan-resistant for render loops). MatterArchive opens with the real volume shape instead of a padded cube; existing v4 caches are auto-recreated.
The prefetch paths were disabled; remove them: plane/surface halos, normal-direction neighbors, visible-set warming, the surface tile prefetch cache, and IChunkedArray::prefetchShape (only consumer). ~550 lines.
…ored tests Upstream consolidated 11 source files into one header + one implementation; VC vendors just the pair (same code, same format — existing caches unaffected).
…p,cpp} One file pair for the whole mca layer: the archive/mc_cache RAII wrapper and the IChunkFetcher decorator that fills it from the source volume.
@codex review
💡 Codex Review
This enables the MCA wrapper for every uint8 pyramid but collapses each source chunk shape to …
villa/volume-cartographer/core/src/Volume.cpp, lines 1502 to 1503 in 056f6ee
Creating a persistent …
Vendors SuperOptimizer/libs3 @01577e0 (minimal C23 S3 client: GET/HEAD/range, SigV4, env/INI/SSO/IMDSv2 credential chain, retries, process abort flag). utils::HttpClient keeps its API but becomes a ~180-line adapter; aws_auth.cpp and the 500-line curl implementation are deleted. Consumers unchanged. Credential discovery now finds SSO profiles without AWS_PROFILE exported.
…tion
- applyMatterCache requires cubic source chunks whose edge divides 256 (the region assembler's assumption); other shapes fall back to raw cache.
- Volume::invalidateCache (write paths only) deletes the local sibling .mcacache so edited local volumes never serve stale pre-write bytes.
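The eligibility rule above (cubic chunk, edge divides 256) can be expressed as a small predicate. The function name is hypothetical; only the rule itself comes from the commit message.

```cpp
#include <cstdint>

// True when a source chunk shape satisfies the region assembler's assumption:
// all three edges equal, and that edge divides the 256-wide region evenly.
constexpr bool mcaEligibleChunk(std::uint32_t cz, std::uint32_t cy, std::uint32_t cx) {
    return cz != 0 && cz == cy && cy == cx && (256u % cz) == 0;
}
```

Shapes such as 128x128x64 or 96^3 fail the check and would take the raw-cache fallback.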
Both codex P1s addressed in 8daf20c:
@codex review
Reads go through the lossy mca render cache (q=8); the exact-value checks predate that. MAE < 16 / PSNR > 22 dB (measured: ~7.7 / ~26 dB).
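A sketch of how the MAE and PSNR thresholds above could be computed for uint8 data, using the standard definition PSNR = 10·log10(255² / MSE). The function and struct names are hypothetical, not the test's actual helpers.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct ErrorStats { double mae; double psnrDb; };

// Compare a reference buffer with its lossy reconstruction.
// Assumes both buffers have the same, nonzero length.
ErrorStats compareU8(const std::vector<std::uint8_t>& ref,
                     const std::vector<std::uint8_t>& got) {
    double absSum = 0.0, sqSum = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - static_cast<double>(got[i]);
        absSum += std::fabs(d);
        sqSum += d * d;
    }
    const double n = static_cast<double>(ref.size());
    const double mse = sqSum / n;
    const double psnr = (mse == 0.0) ? std::numeric_limits<double>::infinity()
                                     : 10.0 * std::log10(255.0 * 255.0 / mse);
    return {absSum / n, psnr};
}

// The tolerance quoted in the commit message: MAE < 16 and PSNR > 22 dB.
bool withinTolerance(const ErrorStats& s) { return s.mae < 16.0 && s.psnrDb > 22.0; }
```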
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 95f5993ce7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
if (sub.status != ChunkFetchStatus::Found || sub.bytes.empty())
    return; // missing/air sub-chunk -> stays zero in the region
Propagate sub-fetch failures before encoding the region
When a 256³ MCA region is assembled from multiple native chunks (e.g. 128³ remote zarr), this treats every non-Found sub-fetch the same as an intentionally missing/air chunk. If one subrequest returns HttpError/DecodeError while another subchunk succeeds, anyData becomes true and the partially zero-filled region is appended to volume.mca, so later reads (and future runs) serve permanently corrupted zeros instead of retrying or surfacing the error.
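One possible shape for the fix, sketched under assumptions: the enum below is a stand-in for the real ChunkFetchStatus (the `Missing` value and the `RegionOutcome`/`classifyRegion` names are hypothetical). The point is to decide the region's fate from all sub-fetch results before anything is encoded.

```cpp
#include <vector>

// Stand-in for the real status enum; Missing is a hypothetical name for "air".
enum class ChunkFetchStatus { Found, Missing, HttpError, DecodeError };

enum class RegionOutcome {
    Encode,  // at least one sub-chunk had data and none failed: safe to append
    AllAir,  // every sub-chunk is confirmed missing/air
    Retry    // at least one sub-fetch failed: do NOT persist, retry later
};

// Any hard failure vetoes the encode, so zeros from a transient error
// are never written into the persistent archive.
RegionOutcome classifyRegion(const std::vector<ChunkFetchStatus>& subs) {
    bool anyData = false;
    for (ChunkFetchStatus s : subs) {
        if (s == ChunkFetchStatus::HttpError || s == ChunkFetchStatus::DecodeError)
            return RegionOutcome::Retry;
        if (s == ChunkFetchStatus::Found)
            anyData = true;
    }
    return anyData ? RegionOutcome::Encode : RegionOutcome::AllAir;
}
```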
std::vector<vc::render::ChunkCache::LevelInfo> mcaLevels;
const bool mcaDisabled = std::getenv("VCA_NO_MCA_CACHE") != nullptr;
bool mcaOn = false;
if (!mcaDisabled && !cacheDir.empty()) {
Keep local volume reads exact by default
This enables the default q=8 MCA path for local volumes too, but Volume::readZYX reads through chunkedCache() and matter-compressor is lossy at that quality. For any local uint8 dataset with a cache directory (including immediately after writeZYX), API reads and samples now return reconstructed values rather than the exact zarr bytes; the cache should be restricted to render-only/remote/opt-in or use a lossless mode for local editable volumes.
Bring the new matter-compressor streaming layer into volume-cartographer:
- vendor src/mc_zarr.c, mc_volume.c, c3d.c (+ headers) from matter-compressor
- add matter_compressor_volume static lib (links VC's libs/libs3 + zstd + c3d), built whether standalone or vendored so the VC3D render shim can link it
- update libs/libs3 to latest upstream (async batch API, s3_get_range_into, range coalescing); was a strict superset, all existing callers unaffected
- order add_subdirectory(libs3) before matter-compressor (volume lib links it)
- adopt the archive determinism flags (-ffp-contract=off, no -ffast-math) on both matter_compressor and the c3d TU, matching upstream
mc_volume is the remote-zarr stream/transcode/cache/prefetch layer that will replace VC3D's MatterCache render path. Builds clean in VC (dev-gcc).
The mc_volume-backed render adapter: a thin IChunkedArray pass-through with no entry table / LRU / fetchers. tryGetChunk -> mc_volume_try_block (present: copy the 16^3 block; absent: async kick + MissQueued -> coarser-LOD fallback); getChunkBlocking -> mc_volume_get_block; chunk-ready listeners driven by mc_volume's transcode-complete callback; stats from mc_volume_get_stats. Replaces the ChunkCache+MatterArchive+ZarrChunkFetcher stack on the render path. Links matter_compressor_volume into vc_core; builds clean (dev-gcc).
Step 3a: the GUI render path now serves remote zarr (c3d/blosc) volumes from mc_volume instead of ChunkCache+MatterArchive+ZarrChunkFetcher.
- Volume::createChunkCache returns IChunkedArray; for remote non-.mca URLs it builds a McVolumeArray (mc_volume) and returns it. .mca-mirror + local-zarr paths unchanged for now.
- Lift the GUI-facing surface onto IChunkedArray: Stats, stats(), shardBatch(), prefetchShardBlocking(), beginViewRequest() (default no-ops). ChunkCache and McVolumeArray both implement it; the sampler already took IChunkedArray&.
- Retype _chunkArray / chunkedCache_ / VolumePrefetcher cache / GUI helpers from ChunkCache to IChunkedArray. The GUI calls only interface methods now.
VC3D + vc_render_tifxyz + vc_cache_prefetch build+link clean (dev-gcc). ChunkCache/ZarrChunkFetcher remain for the tracer (Chunked3d) path.
VC3D's render path now goes entirely through matter-compressor. renderFrame unifies plane and quad onto surf->gen() -> McVolumeArray::render() (mc_render), deleting ChunkedPlaneSampler (.cpp/.hpp + 4 tests, ~2350 lines), the coverage mask, and the C++ composite layer-stack loops. Vendor mc_render/mc_sample/mc_s3.
Fixes:
- extern "C" guards on mc_render.h/mc_sample.h (undefined mc_render_pick_lod).
- QuadSurface::gen() returns a non-continuous ROI view; clone to continuous before handing the flat ptr to mc_render (was shearing/streaking the quad).
- Pyramid is power-of-two with chunk-padded shapes, so LOD scale stays 2^L.
…split) Pull upstream a88266a (mc_render 3D resampling: surface volumes + oriented boxes) and the decode-vs-encode timing split into VC's vendored copy.
…ter_compressor
Upstream 9a28fb7 folded mc_sample.{c,h}, mc_render.{c,h}, mc_sample_internal.h
into the single matter_compressor.{c,h} pair. Mirror that: delete the 5 folded
files, sync all sources, drop them from CMake (matter_compressor = just
matter_compressor.c), and include only mc_volume.h in McVolumeArray (it pulls in
matter_compressor.h). Also picks up the mc-decode/mc-download thread naming.
Upstream folded mc_volume/mc_zarr/mc_s3 into one matter_compressor.{c,h} pair.
Mirror that into VC's vendored copy: delete the 6 folded files, collapse the
CMake to a single matter_compressor target (matter_compressor.c + c3d.c, links
libs3 + zstd), keep matter_compressor_volume as an ALIAS so consumers resolve.
McVolumeArray now includes matter_compressor.h.
Wire up the runtime RAM-cache controls the merged TU exposes:
- McVolumeArray::stats() populates the decoded-RAM gauge (cache_used/cap_blocks)
and a download-rate estimate (net_bytes delta / wall-clock, light EMA).
- IChunkedArray::setDecodedByteCapacity + McVolumeArray impl -> mc_volume_set_cache_bytes.
- Settings dialog applies the RAM cache GB live (resize the active volume's cache).
The .vca/.mca export + recompression flow lives in matter-compressor now; the old SigV4/HttpClient-based zarr recompressor is dead. Removes one http_fetch + c3d consumer.
createChunkCache now opens every volume via mc_volume — remote zarr streams + transcodes into a local .mca (as before), and local zarr directories use mc_volume's new local-filesystem source (sibling .mcacache dir). One render/ cache path; no ChunkCache/MatterCache/ZarrChunkFetcher construction. The zarr metadata openers stay for shape/dtype discovery (zarrOpen/NewFromUrl).
…ccessor) Picks up mc_volume local-filesystem source (file_read) and mc_volume_get_level_meta so VC can read per-level pyramid metadata straight from an opened mc_volume.
mc_render composites inside matter-compressor now (min/mean/max/alpha along the normal), so VC's per-pixel C++ compositing is dead: delete Compositing.cpp (LayerStack, CompositeMethod::*, compositeLayerStack, methodRequiresLayerStorage, buildTfLut256, computeLightingFactor) + test_compositing — zero non-test callers. Keep CompositeParams/CompositeRenderSettings (GUI settings carriers mapped to mc_render params). Also drop orphaned _values/_coverage viewer members.
Both libs/c3d/c3d.{c,h} and libs/matter-compressor/src/c3d.{c,h} were byte-identical.
Delete the libs/c3d copy; the c3d target (utils_c3d_codec's dep) now compiles
matter-compressor's c3d.c. One source of truth, matching the 'matter-compressor
owns the codec' direction.
…rrChunkFetcher) McVolumeArray (matter-compressor) is now the ONLY render/cache path for every volume. Delete ChunkCache, MatterCache, ZarrChunkFetcher (.cpp/.hpp) + their tests. Replace ChunkCache::Options with a minimal DecodedCacheOptions (only decodedByteCapacity is consumed). New ZarrMetadata module reads pyramid shapes/dtype/fill via utils::zarr for zarrOpen/NewFromUrl (no fetcher). Migrate ChunkedTensor::openChunkedArrayCache (tracer local path), vc_cache_prefetch, and vc_render_tifxyz to McVolumeArray/IChunkedArray. Remote .mca probe stubbed (GUI-unreachable). Co-developed via isolated worktree agent.
…set ladder)
- reuse one c3d decoder per decode-pool thread (~14% off each decode)
- cap decode pool at nproc/2 (memory-bandwidth-bound; ~40% lower per-region latency)
- upstream calibrated preset ladder
The download rate was a poll-to-poll net_bytes delta that snapped to 0 whenever a poll fell between s3_get_batch arrivals (batches land every few seconds), and the status bar also gated on remoteFetchesInFlight which flickers to 0 between request bursts. Now: average bytes over a sliding 2s window, hold the last nonzero rate for 3s of idle before declaring 0, and gate the readout on the rate (not the in-flight count). Shows a steady 'downloading @ X MB/s' while data is actually streaming.
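A minimal sketch of the smoothing described above: average over a sliding 2 s window and hold the last nonzero rate through 3 s of idle. The class and its interface are hypothetical; timestamps are passed in as seconds so the logic is easy to test.

```cpp
#include <cstdint>
#include <deque>

class DownloadRate {
public:
    // Feed the cumulative byte counter at time nowSec; returns bytes/sec to display.
    double update(double nowSec, std::uint64_t totalBytes) {
        samples_.push_back({nowSec, totalBytes});
        // Drop samples that fell out of the window, always keeping the newest one.
        while (samples_.size() > 1 && nowSec - samples_.front().t > kWindowSec)
            samples_.pop_front();

        const Sample& oldest = samples_.front();
        const Sample& newest = samples_.back();
        double rate = 0.0;
        if (newest.t > oldest.t)
            rate = static_cast<double>(newest.bytes - oldest.bytes) / (newest.t - oldest.t);

        if (rate > 0.0) {
            lastRate_ = rate;
            lastNonzeroSec_ = nowSec;
            return rate;
        }
        // No bytes in the window: hold the last nonzero rate through short gaps.
        if (lastRate_ > 0.0 && nowSec - lastNonzeroSec_ < kHoldSec)
            return lastRate_;
        lastRate_ = 0.0;
        return 0.0;
    }

private:
    struct Sample { double t; std::uint64_t bytes; };
    static constexpr double kWindowSec = 2.0;  // averaging window
    static constexpr double kHoldSec = 3.0;    // idle time before reporting 0
    std::deque<Sample> samples_;
    double lastRate_ = 0.0;
    double lastNonzeroSec_ = 0.0;
};
```

A status bar that gates on this value rather than on an in-flight count stays steady between batch arrivals.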
…eadout
- per-region interactive fetch (no whole-shard download), net_bytes counts real transferred bytes
- McVolumeArray download-rate: sliding 2s window held through batch-burst gaps; status bar gates on rate not the flickery in-flight count
Fixes regions wrongly poisoned as air on transient read errors — they now stay ABSENT and retry, only marking ZERO when S3 confirms air (c3d index air marker / v2 404).
…ladder)
- staging queue decoupled from decode via a 2GB byte budget (network saturates ahead of CPU-bound decode); mc_volume_set_staging_bytes runtime API
- upstream calibrated compression preset ladder (MC_PRESET_ARCHIVAL..PREVIEW), pulled down but not yet used by VC
…render scratch reuse
- vendored matter-compressor: per-thread mc_codec_ctx (eliminates ~14% TLS overhead) + single-flight region fetch (re-decode redundancy 15x -> 1.00x)
- McVolumeArray::render: reuse thread-local pts/nrm scratch instead of a fresh 13MB vector per frame (cuts render-worker page-fault churn)
…decode) Fixes ~4% render CPU wasted rebuilding the quant table (4096 powf) on every mc_cache-miss block read.
Cache (matter-compressor) is now fully lock-free on the render path:
- delete per-shard mutex; mc_cache_update partitions fill by shard ownership (worker t owns shards t::nt) so every mutation is single-owner. Reworked the async-update ticket the same way.
- getters (get/get_copy/contains/best_lod) are bare lock-free probes; removed rd_mu (single-owner reader decode) and encoder bm_mu (stripes write byte-disjoint bitmap regions). mc lock-ops 77->33, none in the cache/render path; survivors are all off-tick (ftruncate, S3 footer, download<->decode queue handoff, one-time init).
- coverage memo: mc_archive_chunk_coverage is an O(1) hash probe instead of a per-block mc_resolve_chunk tree walk; backfills from the tree.
Render-hitch fix: mc_block_range was O(n^2) over a chunk's blocks (per-block 512B popcount + O(slot) prefix sum). Replaced with a thread-local per-chunk bi->offset index built in one O(4096) pass; O(1)/block. Also memoize the chunk_offset tree walk per chunk in src_archive. Fresh-working-set spike 348ms->~35ms, worker median 79->25ms. Residual ~6ms is pure AVX2 trilinear (already optimal).
Render scheduling: the global tick is the only scheduler.
- chunk arrival never schedules a render; the tick renders while data is in flight (whole-pipeline depth collated at thaw into a plain frozen inflight_snapshot -- queued+downloading+decode-queue+undrained misses, read during frozen render, no atomic/lock).
- removed the worker-busy render queue: if a render worker is still busy at tick time, skip this tick and re-arm dirty; no completion self-scheduling. Kills the worker-finished re-render chains (was ~320/session, now 0).
- fold per-viewer timers into the global clock; Constants.hpp for the tick/debounce constants.
- clear the stale latency-origin stamp on superseded frames.
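The render-hitch fix replaces a per-block popcount and prefix sum with an index built in one pass. A simplified sketch of that idea follows; it maps each present block to its ordinal slot rather than to a byte offset, and the names and the 4096-bit presence bitmap layout are assumptions, not matter-compressor's actual structures.

```cpp
#include <array>
#include <cstdint>

constexpr int kBlocksPerChunk = 4096;                             // 16x16x16 blocks of 16^3
using Bitmap = std::array<std::uint64_t, kBlocksPerChunk / 64>;   // 4096 bits = 512 bytes

struct BlockIndex {
    std::array<std::int16_t, kBlocksPerChunk> slot;  // ordinal among present blocks, -1 if absent
    int present;                                     // number of present blocks
};

// One O(4096) pass over the bitmap; afterwards each lookup is a single array read.
BlockIndex buildBlockIndex(const Bitmap& bm) {
    BlockIndex idx{};
    int next = 0;
    for (int b = 0; b < kBlocksPerChunk; ++b) {
        const bool set = ((bm[b >> 6] >> (b & 63)) & 1u) != 0;
        idx.slot[b] = set ? static_cast<std::int16_t>(next++) : static_cast<std::int16_t>(-1);
    }
    idx.present = next;
    return idx;
}
```

Building the table once per chunk and reusing it for every block in that chunk is what turns the per-block cost from O(slot) into O(1).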
… gen) Pull back the correctness fixes found by the upstream mc test suite: content-key the per-chunk block-offset index (arc+off+xxh64) so a stale index isn't served after chunk_off reuse/re-append, and add an archive gen counter that invalidates src_archive's chunk_off memo on publish.
matter-compressor is now the single render/composite/colormap engine for the GUI. Net ~1000 lines of VC3D render code deleted.
- mc_colormap (vendored): window/level + colormap LUT as baked static [256][3] gray->RGB tables (viridis/magma/fire bit-identical to OpenCV, + gray/r/g/b/cyan/magenta tints). The viewer calls mc_colormap_lut/mc_colormap_id instead of buildWindowLevelColormapLut.
- delete BOTH postprocess pipelines: util/PostProcess (CLAHE/raking/stretch/iso/remove-small-components) and render/PostProcess (the LUT builder mc replaced) + their tests. Drop SampleParams::postProcess and Volume::applyOptionalPostProcess (CLI 2D post, also gone).
- slim CompositeParams/CompositeRenderSettings to the mc-mapped fields: method, alphaMin/Opacity, percentile, and the SHADED knobs (light dir, ambient/diffuse/specular/shininess/absorption/shadow/sss/curvature). Deleted beerLambert, lighting, volume gradients, pre/post transfer functions, dvr/pbr, etc. -- mc's SHADED mode subsumes the relief/raking/lighting passes.
- remove the streamingCompositeUnsupported fallback; every method is an mc reduction now. Extend the method map: stddev=5 shaded=6 percentile=7 depth=8. McVolumeArray::render takes a ShadeParams* for those modes.
- Colormaps.cpp keeps only the UI registry (specs/resolve/entries); makeColors/applyPackedLut pixel funcs removed (no live callers).
GUI verified: 281 frames rendered through the mc path, no errors. CLI tools (Slicing.cpp, zarr reader) still use legacy paths -- deferred.
4f6119c to db373d1
vendor: sync matter-compressor (1a62a52) -- fast thaw (region-granular absent miss + ABSENT coverage memo + single-call fill), per-pixel LOD fallback, pointer-into-arena sampler, LIFO decode queue.
VC3D side:
- McVolumeArray slice path uses mc_render_points_par_lod with native L0 coords: per-pixel coarser-LOD fallback (coarse-not-black on zoom/pan, render never decodes). Composite path still single-level. ShadeParams pass-through for SHADED/PERCENTILE.
- THAW-GATE (correctness): onGlobalTick skips thaw/freeze/render while any viewer's async render worker is busy. The pointer-into-arena sampler holds raw arena pointers across the async render; thawing (the only mutator) mid-render would evict/realloc under them -> use-after-free. Gating restores the freeze/thaw invariant: no mutation while frozen reads are live. (renderWorkerBusy() accessor added.)
- "downloading N" status now reads regions_inflight (stable pipeline depth) while the render gate reads workPending (+ undrained misses), so the status no longer ping-pongs 0..thousands with the per-frame miss set.
- SurfacePatchIndex rtree rebuild moved to a dedicated 1-thread QThreadPool (was on the global pool shared with render bands -> a 1.6s startup rebuild stole render workers). Internals untouched; rebuild now timed.
Drops the per-entry 4KB block-memo buffer: the sampler caches the cache arena pointer directly (owns_ptr source) instead of a 1MB/8MB per-band-per-frame buffer. LOD sample L4 27ms->8ms, L0 13ms->5ms; kills the sampler-alloc page-fault storm that was ~25% of render-thread time.
At the start of each global tick, before freeze, every viewer predicts the
256^3 regions it will sample this frame and the tick submits the downloads
up front -- so data front-runs the render instead of being discovered as a
miss a cycle late.
Prediction (predictWorkingSet): run the surface's OWN gen() at 1/8 res over
the exact render viewport, so we get only the IN-VIEWPORT coords (gen clips
to the screen rect; a warped sheet's off-screen patch is excluded). Map
each coord -> 16^3 block -> 256^3 region, dedup at region granularity. This
replaced an earlier control-grid walk that sprayed ~46k regions across the
volume (a folded sheet's padded control window) and flooded the download
stack ("downloading 45000"). Now ~hundreds of regions, stack depth maxes ~50.
Submit via McVolumeArray::prefetchChunks -> mc_volume_request_region (cheap,
absent-only, no decode). Cross-viewer + inflight/present dedup is handled by
the deduping download stack in mc, not here. vendor: sync mc 63e9601.
VCA_NO_PREDICT disables for A/B.
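The coord -> 16^3 block -> 256^3 region mapping with region-level dedup could look roughly like this. `Vec3f`, `predictRegions`, and the 3x20-bit region key are hypothetical; the coordinates are assumed to be L0 voxel positions produced by a low-resolution gen() pass, which is not shown.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Vec3f { float z, y, x; };  // hypothetical voxel-space sample position

// Map each sampled coordinate to its enclosing 256^3 region and keep each region once.
std::vector<std::uint64_t> predictRegions(const std::vector<Vec3f>& coords) {
    std::unordered_set<std::uint64_t> seen;
    std::vector<std::uint64_t> regions;
    for (const Vec3f& c : coords) {
        // Skip invalid samples (negative or NaN), e.g. pixels the surface does not cover.
        if (!(c.z >= 0.0f) || !(c.y >= 0.0f) || !(c.x >= 0.0f)) continue;
        // voxel -> 16^3 block (>> 4) -> 256^3 region (>> 4 again).
        const std::uint64_t rz = (static_cast<std::uint32_t>(c.z) >> 4) >> 4;
        const std::uint64_t ry = (static_cast<std::uint32_t>(c.y) >> 4) >> 4;
        const std::uint64_t rx = (static_cast<std::uint32_t>(c.x) >> 4) >> 4;
        const std::uint64_t key = (rz << 40) | (ry << 20) | rx;
        if (seen.insert(key).second) regions.push_back(key);
    }
    return regions;
}
```

Deduplicating before submission keeps the request list proportional to the number of regions on screen rather than to the number of sampled pixels.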
What
Replaces VC3D's render caching with matter-compressor: remote (and local) chunks are fetched once, re-encoded into ONE persistent volume.mca per volume (all LODs, crash-safe, appendable), and served as 16³ blocks through mc_cache (sharded S3FIFO resident cache inside the library). VC3D-side caching scaffolding is deleted; the ChunkCache keeps only per-256³-region fetch status plus the async orchestration (worker pool, listeners). Non-uint8 volumes keep the legacy byte-LRU path.
Vendored matter-compressor is at 1e53dbb (format v6: per-axis dims, self-contained blocks, SIMD decode). Stale/older-format cache files are detected and rebuilt automatically.
Fetch path
Rendering / frontend
Profile-driven (perf + --profile logs, before/after in the commit messages):
- … tryGetChunk misses queue exactly what requested frames need; the mca archive makes refetches cheap
- … submitRender) are gone -- the UI thread no longer registers in CPU profiles
Also
- … _dl_fini; VC3D now runs real teardown then _exit()
Measured end state
Steady-state (warm archive): blocks decode from volume.mca via mc_cache, zero network, ~4% CPU in mc block decode. Cold exploration: c3d decode + mc encode dominate pool threads (once per region ever, amortizes out). Render frames are 94% sampler kernel -- follow-up territory (pan reuse, kernel work).
Known: test_volume_local exact-roundtrip fails on this branch (reads go through the lossy q=8 mca; pre-existing design question, not a regression).