AdaWorldAPI · Jun 24, 2026
diff --git a/Dockerfile b/Dockerfile
@@ -73,6 +73,28 @@ RUN git clone https://github.com/AdaWorldAPI/lance-graph.git \
  && git clone --depth 1 https://github.com/AdaWorldAPI/ndarray.git \
  && git clone --depth 1 https://github.com/AdaWorldAPI/neo4j-rs.git
 
+# CPU baseline: x86-64-v4 (the 4th microarch level — AVX-512F/BW/CD/DQ/VL on top
+# of v3's AVX2+FMA). This is the compile FLOOR; it flips on `target_feature =
+# "avx512f"`, so q2-ndarray's `simd.rs` dispatch selects its native `simd_avx512`
+# backend (`__m512`/`__m512d`/`__m512i`) instead of the v3 AVX2 default.
+#
+# BF16 + AMX 16x16 tile GEMM are NOT gated by this flag — they ride q2-ndarray's
+# CPU-AGNOSTIC runtime autodetect polyfill (`simd_caps()` + the AMX `arch_prctl`
+# XTILEDATA enable + CPU-model detect). The polyfill opportunistically lights them
+# up only when the *runtime* host actually has them, and always keeps the AVX2 /
+# scalar paths it compiled in as fallback. So: AVX-512 = compile baseline here;
+# BF16/AMX = runtime-detected; everything below v4 = polyfill fallback.
+#
+# ⚠ REQUIREMENT: a v4 floor makes the binary REQUIRE AVX-512 at run time. It
+# SIGILLs on the first AVX-512 instruction executed on a host without it, whether
+# an explicit `__m512` op or compiler-autovectorised code (the PR #170 failure
+# mode, one level up). The Railway *build* machine needs no AVX-512 (compiling !=
+# running), but the *deploy* host does. AMX
+# additionally needs a Sapphire/Emerald/Granite
+# Rapids Xeon at run time; on anything older the autodetect simply skips AMX (that
+# is the agnostic polyfill working as intended, not an error). If a deploy target
+# may lack AVX-512, drop this to `x86-64-v3` and rely on runtime dispatch for the
+# AVX-512/AMX paths — one portable binary, same hot paths when the silicon allows.
+ENV CARGO_BUILD_RUSTFLAGS="-C target-cpu=x86-64-v4"
+
 # Build the q2 binary with embedded frontend
 WORKDIR /build/q2
 RUN cargo build --release -p cockpit-server --features embed-cockpit,planner \

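Because the v4 floor makes AVX-512 a hard run-time requirement, a deploy host can be checked before the image is promoted. The sketch below is one way to do that, assuming a Linux host that exposes CPU feature flags in `/proc/cpuinfo`. The function names (`parse_flags`, `check_host`) are hypothetical and not part of q2 or q2-ndarray; the optional flag names are the Linux kernel's spellings for AVX-512 BF16 and AMX.

```python
# Preflight sketch (assumption: Linux /proc/cpuinfo format; helper names are
# hypothetical, not part of q2 or q2-ndarray).

# The five AVX-512 subsets that x86-64-v4 adds on top of v3.
V4_FLAGS = frozenset({"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"})

# Runtime-detected extras; their absence is not an error for a v4 binary.
OPTIONAL_FLAGS = frozenset({"avx512_bf16", "amx_tile", "amx_bf16"})


def parse_flags(cpuinfo_text):
    """Return the feature flags of the first CPU listed in cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def check_host(cpuinfo_text):
    """Report whether the host satisfies the v4 floor and which extras it has."""
    flags = parse_flags(cpuinfo_text)
    missing = sorted(V4_FLAGS - flags)
    return {
        "v4_ok": not missing,                        # safe to run a v4 binary
        "missing": missing,                          # v4 flags the host lacks
        "optional_present": sorted(OPTIONAL_FLAGS & flags),
    }
```

On a real host this would be called as `check_host(open("/proc/cpuinfo").read())`. A result with `v4_ok` false points at the `x86-64-v3` fallback the Dockerfile comment describes. A present `amx_tile` flag only shows CPU support; the XTILEDATA permission request mentioned in the comment is still needed at run time.
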
diff --git a/claude-notes/plans/2026-06-24-fma-torso-bodyparts3d-splat.md b/claude-notes/plans/2026-06-24-fma-torso-bodyparts3d-splat.md
@@ -108,3 +108,112 @@ Validates the design before wiring it into the render. Next increments:
       (node_row-bounded + normal-oriented = crisp colours in the render)
 - [ ] animation: deform node anchors -> motion-skinned gaussians follow
       (Motion-Blender GS; the partonomy is the rig)
+
+## Best shading + lazylock + adaptive-FPS + SPL4 (branch claude/torso-shading)
+
+User: "best possible shading and lazylock buffering to mitigate batching", then
+"adaptive framerate prediction + SIMD batching + v4", then the key insight: "the
+Motion is fixed Rotation ... so it could easily prebuffer 270 frames for 90 FPS".
+Scoping answers: framerate = BOTH (render-loop throttle now + codec P-frames as
+the SPL4 motion track); PR scope = all of the above incl SPL4 in one push.
+
+### Infra fact ("GitHub uses Cargo not Dockerfile?")
+q2 CI = pure Cargo+npm (`cargo fmt`/`xtask lint`/`clippy -D warnings`/`nextest`,
+wasm-pack/npm). The only `docker` in CI is `docker image prune` (free runner disk).
+The root `/Dockerfile` is Railway-deploy ONLY (`q2-cockpit` embeds the Vite cockpit,
+clones lance-graph/ndarray/neo4j-rs for the graph hot path). This splat feature does
+not touch the Dockerfile.
+- [x] **Dockerfile CPU baseline -> x86-64-v4** (user ask): `ENV
+      CARGO_BUILD_RUSTFLAGS="-C target-cpu=x86-64-v4"` before the cockpit-server
+      build. Flips `target_feature="avx512f"` so q2-ndarray's `simd.rs` picks the
+      native `simd_avx512` backend. BF16+AMX tile GEMM ride ndarray's runtime
+      autodetect polyfill (`simd_caps()` + AMX arch_prctl/model-detect) — not gated
+      by the flag, lit only when the host has them, AVX2/scalar fallback always
+      compiled. ⚠ v4 = AVX-512 REQUIRED at runtime (SIGILL otherwise, the PR#170
+      mode one level up); AMX needs Sapphire/Emerald/Granite Rapids at runtime
+      (autodetect skips it otherwise = agnostic working as intended). Documented the
+      `x86-64-v3` fallback in the Dockerfile for non-AVX-512 deploy targets.
+
+### Shading (the lit look) — DONE
+- [x] Render driver (scratchpad, ndarray 1.95, OUT of q2 workspace): shade AT
+      RECONSTRUCTION from the per-vertex normal already in SPL2 — hemisphere ambient
+      (sky/ground) + key diffuse (n·L, L fixed in WORLD so camera orbits a still
+      light = consistent turntable) + soft fill. Shading MULTIPLIES the flat palette
+      colour, so the codec-free per-structure colour story is intact. 20-frame
+      shaded turntable rendered (9s/frame) → JPEG (67 KB/frame) →
+      cockpit/public/torso-frames/. Verified in-cockpit: volumetric depth, colours
+      preserved, no Warhol blob.
+
+### Prebuffer = the answer to BOTH (A) and (B)  [the user's insight]
+The demo motion is a FIXED, periodic, deterministic camera rotation. So you neither
+ADAPT the framerate nor PREDICT motion frame-by-frame — you PRECOMPUTE the closed
+loop once and replay → every frame free → guaranteed 90 fps. This is exactly the
+x265 GOP idea: a periodic camera path is a closed Group-of-Pictures; prebuffer the
+GOP, replay forever. It is ALSO the honest SPL4 (B) motion source: the orbit is a
+real known closed trajectory, so the 270 rotation steps ARE its P-frames — NO
+synthetic breathing deformation needed (drop that demo).
+- [ ] /torso turntable: bump FRAME_COUNT 20 → 270 over an exact 360° (step
+      360°/270 ≈ 1.33°, 3 s per loop at 90 fps). Frame N would equal frame 0, so
+      bake frames 0..N-1 only for a seamless loop. Re-bake at the higher count
+      (background). Ship-size lever: 67 KB/frame × 270 ≈ 18 MB JPEG → offer WebM
+      encode (~3 MB) as the compaction. Mandatory here because CPU EWA splat is
+      9s/frame — live render impossible; prebuffer is THE technique, not an optim.
+- note: the live WebGL points view is already real-time; prebuffering full
+      framebuffers there is VRAM-prohibitive (270×810×1080×4 ≈ 945 MB) — so the
+      live-view win is lazylock + adaptive-FPS, and image-prebuffer stays on /torso.
+
+### Live views light up + lazylock + adaptive-FPS
+- [ ] /torso-live (TorsoSplat) + /torso-map (TorsoMap): decode SPL2 `normal 3i8`
+      into an aNormal attribute (both skip it today); port hemisphere+diffuse+fill
+      into the FRAG. Same L → CPU frames and live WebGL agree.
+- [ ] LazyLock build-once buffer: build geometry (pos+aColor+aNormal+aRow) ONCE;
+      mutate only via uniforms + draw-RANGE, never rebuild.
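+- [ ] Adaptive-FPS: EMA of rAF delta; over budget → shrink draw-range over the
+      Morton-ordered buffer + drop pixelRatio; recover when cheap; log active
+      fraction (no silent decimation). Caveat: a plain Morton prefix is one
+      contiguous spatial block, so prefix = uniform spatial subsample holds only
+      if the buffer is stored coarse-to-fine (e.g. stride-interleaved).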
+
+### SPL4 — ship the codec (static I-frame real, motion track reserved)
+- [ ] `spl_codec.py`: WRITE a real `.spl4` (helix-Morton order, per-node anchor
+      I-frame, motion-from-anchor + zig-zag residual, anchor-predicted palette colour
+      = 0 per-gaussian bytes, normals). Header `motion_track_count` (0 static) reserves
+      the P-frame slot without a format bump (RESERVE-DON'T-RECLAIM).
+- [ ] TS `decodeSpl4`: inverse — reconstruct pos/normal/rgb/row at load; all 3 views
+      switch to SPL4.
+- [ ] Fold deferred #55 nits: `import math` → module top; fix "round-trips it"
+      docstring; TorsoMap `ray.params.Points` mutate-not-replace.
+- [ ] (B) motion track = orbit-as-motion P-frames (above); ship the FORMAT slot +
+      decode contract; the camera trajectory is the demonstrator (honest, not faked).
+
+### Verify + ship
+- [ ] `cd cockpit && npm run build` (tsc clean); inspect shaded turntable + live
+      view; codec round-trip RMSE unchanged. Commit incrementally on
+      claude/torso-shading; ASK before push (GIT PUSH POLICY).
+
+## v4 — is_a-PRIMARY whole-body anatomical atlas (major pivot, 2026-06-24)
+
+Operator-driven pivot, several corrections of my assumptions:
+1. **Use is_a, not part-of, for classification + names.** part-of is REGIONAL
+   (walk up a muscle -> chest wall -> thorax, never "muscular system") and its
+   names aren't canonical. is_a is the TYPE tree: every structure resolves up to
+   its canonical type (`pectoralis minor` -> ... -> `muscle organ`); is_a ships
+   canonical names; is_a's mesh set is a SUPERSET of part-of (2234 vs 1258 FJ,
+   +976) with finer organ segmentation (no single "aorta"/"heart" — split into
+   ascending/arch/descending/abdominal, each its own mesh). Downloaded the 142 MB
+   is_a obj package + the small is_a relation/name txts.
+2. **container:identity / DN->GUID addressing.** tissue = walk the is_a TYPE tree
+   to the first type keyword (O(depth) walk, O(1) once cached) = the
+   DistinguishedName path, which
+   MATERIALISES to a numeric container:identity GUID (container = tissue class).
+   Stored per node: `tissue`, `is_a` (DN path, upper-ontology stripped),
+   `container`, `identity`, `guid`.
+3. **Whole body is the goal — NO spatial torso filter.** Region focus (torso, an
+   organ) is a future SELECT -> CAMERA-ZOOM feature on the full-body splat, driven
+   O(1) by each node's centroid+bbox in the SoA, not a bake-time clip.
+4. **Performance is the point.** Whole body = 602,341 gaussians / 1658 is_a
+   structures / 12.6 MB (414 arteries, 382 muscles, 221 veins, 203 bones, 126
+   nerves, full viscera). The deliberate load that motivates lazylock +
+   adaptive-FPS (live views) and the prebuffered turntable (CPU EWA).
+- bake = `bake_torso_splat.py` v4 (is_a-primary). Tissue atlas palette + depth-peel
+  opacity. Driver orientation fixed (+90 about X; head was landing down).
+- [ ] re-render upright whole-body turntable -> /torso; live views already decode
+      the unchanged SPL2 (extra nodes.json fields are ignored) — light them +
+      lazylock + adaptive-FPS to show + mitigate the 602K load.
+- research: `claude-notes/research/2026-06-24-torso-anatomy-coverage-gap.md`.
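
The adaptive-FPS item in the plan (EMA of the frame delta, shrink the draw range when over budget, recover when cheap, expose the active fraction for logging) can be sketched as a small controller. This is an illustration in Python rather than the cockpit's TypeScript. The class name, smoothing factor, shrink/grow rates, headroom threshold and minimum fraction are all assumptions; only the 90 fps budget and the 602,341 point count come from the notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveDrawRange:
    """Hypothetical controller: maps smoothed frame time to a draw-range length."""

    total_points: int
    budget_ms: float = 1000.0 / 90.0   # 90 fps target from the plan
    alpha: float = 0.1                 # EMA smoothing factor (assumed)
    shrink: float = 0.85               # multiplicative back-off (assumed)
    grow: float = 1.05                 # multiplicative recovery (assumed)
    headroom: float = 0.8              # recover only below 80% of budget (assumed)
    min_fraction: float = 0.1          # never draw less than 10% (assumed)
    ema_ms: Optional[float] = None
    fraction: float = 1.0              # active fraction; the caller should log it

    def update(self, frame_ms: float) -> int:
        """Feed one frame delta in ms; return how many points to draw next."""
        if self.ema_ms is None:
            self.ema_ms = frame_ms
        else:
            self.ema_ms = self.alpha * frame_ms + (1.0 - self.alpha) * self.ema_ms

        if self.ema_ms > self.budget_ms:
            # Over budget: draw a shorter prefix of the prebuilt buffer.
            self.fraction = max(self.min_fraction, self.fraction * self.shrink)
        elif self.ema_ms < self.headroom * self.budget_ms:
            # Comfortably under budget: recover towards the full buffer.
            self.fraction = min(1.0, self.fraction * self.grow)

        # A prefix only subsamples space uniformly if the buffer is ordered
        # coarse-to-fine; that ordering is assumed here, not enforced.
        return max(1, int(self.total_points * self.fraction))
```

The gap between the budget and the headroom threshold is a deliberate dead band: without it the fraction would oscillate every frame around the budget. Only the draw count changes per frame, which fits the build-once buffer rule (mutate via uniforms and draw range, never rebuild).
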