Direct .insv dual-fisheye → pinhole rig ingestion (skip the equirectangular intermediate) #1

Draft
kfarr wants to merge 6 commits into main from claude/insv-dual-fisheye-rig

kfarr commented Jun 11, 2026

Why

360° scenes processed through the equirectangular (ER) path show weak/missing ground, and both suspected causes are real and compound each other:

  1. The ER projection stretches the poles. By the time Insta360 Studio has stitched the dual fisheyes into a 2:1 image, the ground directly under the camera is smeared across the entire bottom pixel row.
  2. The ER virtual rig never looks down. pano_sfm.py renders 12 views at pitches (−35°, 0°, +35°) with 90° FOV — nothing below −80° pitch is covered, so the nadir gets zero direct observations.
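
Both numbers above follow from simple arithmetic. A minimal sketch, with illustrative function names that are not part of pano_sfm.py: the lowest pitch a view reaches along its vertical centerline is its pitch minus half the FOV, and an equirectangular row oversamples longitude by roughly 1/cos(pitch).

```python
import math


def lowest_covered_pitch_deg(view_pitch_deg: float, fov_deg: float) -> float:
    """Lowest pitch reached along the vertical centerline of a pinhole view."""
    return view_pitch_deg - fov_deg / 2.0


def er_horizontal_stretch(pitch_deg: float) -> float:
    """Approximate longitude oversampling of an equirectangular row at this pitch."""
    return 1.0 / math.cos(math.radians(pitch_deg))


# ER rig from the description: lowest row of views at -35 deg pitch, 90 deg FOV.
lowest = lowest_covered_pitch_deg(-35.0, 90.0)

# Half-angle of the cone around the nadir that no view reaches.
uncovered_half_angle = 90.0 - abs(lowest)

# Stretch of the ER image at the edge of that uncovered cone.
stretch_at_edge = er_horizontal_stretch(lowest)
```

With these inputs the lowest covered pitch is −80°, which leaves a 10° half-angle cone around the nadir unobserved, and the ER rows at that latitude are already stretched more than fivefold.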

This PR adds a path that crops virtual pinhole views straight out of the two raw fisheye sensor streams in an .insv file (the approach used by LichtFeld's fisheye mode):

  • Native sensor pixels everywhere — an equidistant fisheye has roughly uniform angular resolution, and the nadir sits ~90° off each lens axis, well inside a ~200° lens.
  • Default grid: 3×3 lens-local views at ±60° with 75° crops → 9 per lens, 18 per frame pair. The down-pitched views put the nadir ~30° from their axes, so the ground is covered by up to 6 views per frame pair instead of 0.
  • Insta360 Studio is no longer needed: upload the .insv directly.
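
The nadir-coverage claim can be checked with a few lines of numpy. This is a sketch under stated assumptions, not the code in fisheye_projection.py: the lens is treated as upright, the lens frame is x right / y down / z along the lens axis, each view's rotation is yaw about y followed by pitch about x, and the helper names are hypothetical.

```python
import numpy as np


def rot_x(a: float) -> np.ndarray:
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_y(a: float) -> np.ndarray:
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def view_to_lens(yaw_deg: float, pitch_deg: float) -> np.ndarray:
    """Rotation taking view-frame rays into the lens frame (assumed convention)."""
    return rot_y(np.radians(yaw_deg)) @ rot_x(np.radians(pitch_deg))


def views_seeing_nadir(angles_deg=(-60.0, 0.0, 60.0), crop_fov_deg=75.0):
    """Return (yaw, pitch, off_axis_deg) for each grid view whose square crop contains the nadir."""
    nadir = np.array([0.0, 1.0, 0.0])  # +y is down, 90 deg off the lens axis
    half_tan = np.tan(np.radians(crop_fov_deg / 2.0))
    hits = []
    for pitch in angles_deg:
        for yaw in angles_deg:
            x, y, z = view_to_lens(yaw, pitch).T @ nadir  # nadir in the view frame
            if z > 1e-9 and abs(x / z) <= half_tan and abs(y / z) <= half_tan:
                hits.append((yaw, pitch, float(np.degrees(np.arccos(z)))))
    return hits


hits = views_seeing_nadir()
```

Under these assumptions the three down-pitched views of one lens each see the nadir 30° off their axis, inside the 37.5° half-FOV, which is where the figure of up to 6 views per frame pair comes from.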

What's in here

  • fisheye_projection.py — equidistant fisheye model (+ optional Kannala-Brandt k1–k4), MEI (unified omnidirectional) model for the factory calibration, lens-local view grids, rig rotations with per-lens mounting corrections, cv2.remap grid construction. Pure numpy/scipy, fully unit-tested.
  • insv_calibration.py — Insta360 per-unit factory calibration parsing: a minimal protobuf wire walker (no protobuf dependency) reads the trailer metadata record's offset_v3 (20 fields/lens, layout per telemetry-parser) and the .insv.pb sidecar's extended calibration (X5; 27 fields/lens incl. k4 + thin-prism, layout validated by insv-stitch against in-camera stitching), then rescales from the per-model reference resolution to the demuxed stream (window_crop_info-aware, centered aspect-fit fallback). Best-effort: any failure falls back to the idealized model.
  • insv_extract.py — ffmpeg demux of the three known .insv layouts (dual-stream single file; two-file _00_/_10_ pairs; side-by-side single stream with a dark-corner guard against mis-feeding ER video). Best-effort trailer parsing (ExifTool/Sub-Etha layout).
  • fisheye_sfm.py — renders the views with two mask sets (SfM masks = lens-validity ∩ closest-view partition, to avoid duplicate features in overlaps; training masks = validity only, so gsplat doesn't train on black corners), builds the 18-camera two-lens rig, runs the shared SfM. Lens intrinsics precedence: explicit --insv_calibration JSON > factory calibration > idealized model.
  • pano_sfm.py refactor — feature extraction → rig config → matching → mapping extracted into run_rig_sfm_pipeline(), shared by both pipelines. No behavior change to the ER path.
  • vid2scene.py: --insv_fisheye (auto-enabled for .insv inputs), --insv_lens_fov, --insv_calibration, --insv_no_factory_calibration. Image budget matches the ER path (~89 frame pairs × 18 ≈ 1,600 images at the 800 default).
  • Tests — 51 unit tests (vid2scene_core/tests/): projection round-trips with distortion, MEI projection math, nadir-coverage of the view grid, remap validity, trailer record walking, protobuf wire decoding, both factory-calibration layouts, X4/X5 reference-to-stream scaling, companion-file detection.
  • docs/insv_fisheye.md — usage, calibration resolution order, factory-calibration sources, JSON schema, limitations.
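
For orientation, here is a minimal sketch of the equidistant model with optional Kannala-Brandt terms, using the same polynomial form as OpenCV's fisheye model. The function names and signatures are illustrative and are not the fisheye_projection.py API; the idealized focal length assumes the image circle is inscribed in the frame and spans the full lens FOV.

```python
import numpy as np


def idealized_focal_px(width: int, height: int, fov_deg: float = 200.0) -> float:
    """Focal length (px/rad) when the inscribed circle spans the full lens FOV."""
    return (min(width, height) / 2.0) / np.radians(fov_deg / 2.0)


def project_equidistant(rays, f, cx, cy, k=(0.0, 0.0, 0.0, 0.0), fov_deg=200.0):
    """Project lens-frame rays (z along the lens axis) to pixels.

    r = f * theta_d, with theta_d = theta * (1 + k1 t^2 + k2 t^4 + k3 t^6 + k4 t^8).
    Returns (uv, valid), where valid marks rays inside the lens FOV cone.
    """
    rays = np.asarray(rays, dtype=float)
    x, y, z = rays[..., 0], rays[..., 1], rays[..., 2]
    theta = np.arctan2(np.hypot(x, y), z)  # angle off the lens axis
    phi = np.arctan2(y, x)                 # azimuth around the lens axis
    t2 = theta * theta
    theta_d = theta * (1.0 + k[0] * t2 + k[1] * t2**2 + k[2] * t2**3 + k[3] * t2**4)
    r = f * theta_d
    uv = np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=-1)
    valid = theta <= np.radians(fov_deg / 2.0)
    return uv, valid


f = idealized_focal_px(3840, 3840)
uv, valid = project_equidistant(
    [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]], f, 1920.0, 1920.0
)
```

With a 200° lens on a 3840×3840 stream, a ray 90° off-axis lands at 90% of the image-circle radius, so the nadir is comfortably inside the valid region.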

Validated

  • All 51 unit tests pass.
  • End-to-end smoke test on a synthetic dual-stream file with pycolmap 3.13.0 (the version the worker pins): demux → render → feature extraction → rig config → sequential matching → mapping all execute; 18 cameras / 1 rig / N frames land correctly in the COLMAP database.
  • Mask conventions verified empirically against pycolmap 3.13: masks are found under both name.png and name.png.png, and an image without a mask is skipped with MASK_ERROR once mask_path is set — the processor therefore writes a mask for every rendered image.
  • The refactored ER path was smoke-tested end-to-end (renders → SfM steps run unchanged).
  • Factory-calibration parsing is exercised against synthetic data matching the community-documented layouts (telemetry-parser's prost definitions; insv-stitch's X5 findings, which were validated against in-camera stitching output).

Needs validation on real footage (why this is a draft)

  • Factory calibration on real recordings. Parsing follows community-documented layouts but hasn't run against a real X4/X5 file yet. Specifically to confirm: the reference-resolution scaling on models other than X4/X5, and the sign convention of the sub-degree mounting corrections. --insv_no_factory_calibration gives an immediate A/B fallback to the idealized model (principal point at center, inscribed circle, 200° FOV).
  • Rear-lens mounting uses the factory yaw/pitch/roll corrections when available, otherwise exactly 180° yaw / 0° roll (overridable in the calibration JSON).
  • Lens baseline (~2–3 cm) is ignored, same as the ER path; the factory per-lens translation is parsed but not applied (observed as zeros in community dumps; a metric value would also pin reconstruction scale, which needs deliberate handling).
  • No IMU use yet (horizon leveling, rolling-shutter correction), no SAM3 ego-masking on the fisheye path, no GPS extraction from the trailer.
  • Server upload flow only exposes equirectangular; .insv uploads through the web UI need a form field + pass-through (auto-detection already works at the pipeline level). Same for the cog/Modal wrapper (separate repo).

Suggested A/B test

Same Bernal .insv clip three ways: (a) ER path as-is, (b) ER path with training_max_num_gaussians=3M, (c) this path. If (c) fills in the ground where (a)/(b) don't, the input geometry was the bottleneck, as suspected. With factory calibration now in, (c) can additionally be run with --insv_no_factory_calibration to isolate how much the calibration itself contributes.

🤖 Generated with Claude Code

claude and others added 3 commits June 11, 2026 11:41
Move feature extraction, rig configuration, sequential matching, and
mapping into run_rig_sfm_pipeline so pipelines that render virtual
perspective views from other sources (e.g. dual fisheye streams) can
reuse the same SfM machinery. No behavior change for the
equirectangular path.

https://claude.ai/code/session_01MdiAmjGY3SEAQLHsxKBVac

Process Insta360 .insv recordings straight from their two raw fisheye
sensor streams instead of requiring a pre-stitched equirectangular
video. The equirectangular intermediate stretches the poles of the
sphere, and the ER virtual rig's lowest pitch (-35 deg, 90 deg FOV)
never sees below -80 deg, so the ground under the camera gets zero
direct, full-resolution observations. Cropping pinhole views directly
from each fisheye keeps native sensor pixels and the default view grid
(3x3 at +/-60 deg per lens, 75 deg crops, 18 views per frame pair)
covers the nadir with up to 6 views per frame pair.

- fisheye_projection.py: equidistant fisheye model with optional
  Kannala-Brandt distortion terms, lens-local view grids, and
  cv2.remap grid construction (pure numpy/scipy, unit-tested)
- insv_extract.py: ffmpeg demuxing of the three known .insv layouts
  (dual-stream, two-file _00_/_10_ pairs, side-by-side single stream)
  plus best-effort trailer metadata parsing for logging
- fisheye_sfm.py: renders the virtual views with validity and
  closest-view partition masks, builds the two-lens rig config, and
  runs the shared rig SfM pipeline
- vid2scene.py: --insv_fisheye / --insv_lens_fov / --insv_calibration
  flags, auto-enabled for .insv inputs
- tests: 29 unit tests for the projection math and container parsing

Lens intrinsics default to an idealized model; per-unit calibration
can be supplied as JSON (docs/insv_fisheye.md). Factory calibration
and IMU parsing from the trailer are follow-ups.

https://claude.ai/code/session_01MdiAmjGY3SEAQLHsxKBVac
…ings

Replace the idealized-lens assumption with Insta360's per-unit factory
calibration whenever it can be read from the recording itself, removing
the main registration risk flagged in the original PR.

Every Insta360 camera embeds an MEI (unified omnidirectional) camera
model per lens: mirror parameter xi, fx/fy/cx/cy, radial k1..k4,
tangential p1/p2 and thin-prism s1..s4 distortion, plus sub-degree
per-lens mounting corrections. Two sources are read, in order of
preference:

- the .insv.pb sidecar (X5; 27 fields per lens incl. k4 and thin-prism
  terms), layout validated by insv-stitch against in-camera stitching
- the trailer metadata record's offset_v3 string (20 fields per lens),
  layout per telemetry-parser's prost definitions
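
As a reference for the model named above, a minimal sketch of the distortion-free core of the MEI (unified omnidirectional) projection: normalize the ray onto the unit sphere, shift the projection center by xi along the axis, then apply the pinhole intrinsics. The radial, tangential and thin-prism terms are deliberately omitted because their exact form in Insta360's calibration is not stated here, and the function name is illustrative rather than the MeiLensModel interface.

```python
import numpy as np


def mei_project(rays, xi, fx, fy, cx, cy, fov_deg=200.0):
    """Distortion-free MEI projection of lens-frame rays (z along the lens axis).

    Returns (uv, valid). The FOV cone decides validity, because the projection
    itself stays finite for almost every direction once xi >= 1.
    """
    rays = np.asarray(rays, dtype=float)
    unit = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    x, y, z = unit[..., 0], unit[..., 1], unit[..., 2]
    denom = z + xi
    # Avoid dividing by zero for the single degenerate direction (z == -xi).
    safe = np.where(np.abs(denom) > 1e-12, denom, np.nan)
    uv = np.stack([fx * x / safe + cx, fy * y / safe + cy], axis=-1)
    theta = np.arccos(np.clip(z, -1.0, 1.0))
    valid = (denom > 1e-12) & (theta <= np.radians(fov_deg / 2.0))
    return uv, valid


a = np.radians(120.0)
uv, valid = mei_project(
    [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [np.sin(a), 0.0, np.cos(a)], [0.0, 0.0, -2.0]],
    xi=1.0, fx=1000.0, fy=1000.0, cx=1920.0, cy=1920.0,
)
```

The third ray, 120° off-axis, still projects to a finite pixel with xi = 1, so only the FOV cone rejects it.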

Implementation:
- insv_calibration.py: minimal protobuf wire-format walker (no protobuf
  dependency), offset_v3 / .pb sidecar parsers, and reference-resolution
  to stream-resolution conversion (window_crop_info-aware, centered
  aspect-fit fallback). Best-effort throughout: any failure falls back
  to the idealized model.
- fisheye_projection.MeiLensModel: the MEI forward projection with the
  same project_rays interface as the equidistant model; the FOV cone
  bounds validity since the projection itself accepts nearly all
  directions for xi >= 1. get_lens_from_rig_rotations moved here from
  fisheye_sfm and extended with per-lens mounting corrections (keeps
  tests pycolmap-free).
- fisheye_sfm.py: precedence explicit JSON > factory > idealized;
  factory mounting corrections feed the rig config.
- --insv_no_factory_calibration / --no_factory_calibration escape hatch
  in vid2scene.py and fisheye_sfm.py.
- 22 new unit tests (51 total): wire decoding, both calibration string
  layouts, full-frame cx normalization, X4/X5 scaling cases, MEI
  projection math, rig corrections.
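
A dependency-free wire walker of the kind described above fits in a few lines. This is a generic sketch of the protobuf wire format (varint tags, wire types 0, 1, 2 and 5), not the code in insv_calibration.py, and it makes no assumption about which field numbers hold offset_v3 or the sidecar calibration.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint longer than 64 bits")


def walk_fields(buf: bytes):
    """Yield (field_number, wire_type, value) for each top-level field.

    Varints are returned as ints; fixed-width and length-delimited fields are
    returned as raw bytes for the caller to interpret or recurse into.
    """
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire = tag >> 3, tag & 0x07
        if wire == 0:
            value, pos = read_varint(buf, pos)
        elif wire == 1:
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire == 5:
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire}")
        if pos > len(buf):
            raise ValueError("truncated field")
        yield field, wire, value


# Field 1 = varint 150, field 2 = the string "testing".
fields = list(walk_fields(bytes.fromhex("089601120774657374696e67")))
```

Raising on any malformed input keeps the walker compatible with the best-effort policy: the caller can catch ValueError and fall back.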

Pending real-footage validation (documented in docs/insv_fisheye.md):
reference scaling on models other than X4/X5, and mounting-correction
sign conventions (sub-degree, so low risk either way).

https://claude.ai/code/session_01MdiAmjGY3SEAQLHsxKBVac

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

kfarr commented Jun 12, 2026

Downstream wiring to make this testable through the 3DStreet app (generator#splat → "360 → Splat (.insv)") is now in draft PRs: vid2scene-cog#1 (insv_fisheye pass-through + image pin cog-phase2 = this PR's head) and 3dstreet#1673 (model entry + splat-tab UI). Deploy order and the real-footage test plan are in both PR descriptions.

kfarr and others added 3 commits June 12, 2026 12:35
First validation against real X5 footage (fw v1.9.6_build1) found two
gaps that silently dropped factory calibration to the idealized
fallback:

- The trailer chains an id-0 record at the top whose payload is an
  index table of (uint16 id, uint32 size, uint32 offset) entries,
  offsets relative to the trailer data start, small ids = legacy >> 8.
  Records below it no longer follow the strict payload+descriptor
  chain, so the walk now resolves everything through the table instead
  of treating id 0 as a terminator.

- offset_v3 writes 19 fields per lens (no per-lens flag) plus trailing
  file-level values, vs the X4-era 20. parse_offset_v3 now tries both
  layouts and validates the reference-dimension slots, where a
  misaligned block lands lens_type/flag-scale values.
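
The index-table lookup in the first fix can be sketched with the struct module. The entry layout (uint16 id, uint32 size, uint32 offset) comes from the description above; little-endian byte order, the helper names, and the choice to ignore a trailing partial entry are assumptions of this sketch.

```python
import struct
from typing import NamedTuple

# Assumption: little-endian, packed with no padding (10 bytes per entry).
ENTRY = struct.Struct("<HII")


class IndexEntry(NamedTuple):
    record_id: int
    size: int
    offset: int  # relative to the start of the trailer data


def parse_index_table(payload: bytes) -> list:
    """Decode the id-0 record payload into index entries."""
    usable = len(payload) - len(payload) % ENTRY.size  # drop a trailing partial entry
    return [IndexEntry(*fields) for fields in ENTRY.iter_unpack(payload[:usable])]


def slice_record(trailer_data: bytes, entry: IndexEntry) -> bytes:
    """Return the record payload an entry points at, with bounds checking."""
    end = entry.offset + entry.size
    if end > len(trailer_data):
        raise ValueError(f"record {entry.record_id} runs past the trailer data")
    return trailer_data[entry.offset:end]


# Synthetic example: two records laid out back to back.
table = ENTRY.pack(3, 16, 0) + ENTRY.pack(1, 8, 16)
entries = parse_index_table(table + b"\x00\x00")
record = slice_record(b"A" * 16 + b"B" * 8, entries[1])
```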

With both fixes the real recording loads end-to-end from the trailer:
principal points land within 5 px of the 3840x3840 stream center
(full-frame cx shift + 5376->5312 window-crop scaling both verified),
fx scales 4280->3094, and mount corrections surface the ~90 deg
portrait-sensor roll confirmed by inspecting the demuxed frames -
sub-degree-only assumptions would have broken this camera.
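
The fx figure can be reproduced from the sizes quoted above. A minimal sketch, assuming the window crop is centered in the reference frame and that intrinsics scale by stream width over crop width; the function name and the 2688 px input principal point (the reference-frame center) are illustrative, not values read from the recording.

```python
def rescale_intrinsics(fx, cx, ref_width, crop_width, stream_width):
    """Map a focal length and principal point from the reference frame to the stream."""
    crop_x = (ref_width - crop_width) / 2.0  # assumption: centered window crop
    scale = stream_width / crop_width
    return fx * scale, (cx - crop_x) * scale


# X5 numbers from this commit: 5376 reference, 5312 window crop, 3840 stream.
fx_stream, cx_stream = rescale_intrinsics(
    fx=4280.0, cx=5376 / 2.0, ref_width=5376, crop_width=5312, stream_width=3840
)
```

With these inputs fx_stream rounds to 3094, matching the value above, and a principal point at the reference center maps to the stream center at 1920 px.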

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Validated against a real X4 recording (fw v1.9.21_build5): newer X4
firmware also writes the v3 index trailer and 19-field offset_v3
blocks, with a per-lens landscape 8000x6000 reference (no halving) and
lens 1 cx in 16000-wide full-frame coordinates. Principal points land
within ~5 px of the 2880x2880 stream center after normalization, and
the demuxed frames confirm the same ~90 deg portrait-sensor roll as
the X5.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The idealized-lens fallback is no longer silent: real X4/X5 sensors are
portrait-mounted (~90 deg roll), which the idealized model doesn't know,
so a job that silently degraded would burn a full SfM+training run on a
rig that can't register. run_insv_sfm now raises
FactoryCalibrationError before any heavy work (frame extraction, render,
SfM, training) when no calibration parses; the idealized model remains
available as an explicit opt-in (--insv_no_factory_calibration) or via a
calibration JSON.
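
The resulting resolution order can be summarized as below. Only the FactoryCalibrationError name and the precedence come from this PR; the resolver function, its parameters, and the returned labels are a hypothetical sketch, not the run_insv_sfm implementation.

```python
class FactoryCalibrationError(RuntimeError):
    """Raised when factory calibration was expected but could not be parsed."""


def resolve_lens_calibration(explicit_json=None, factory=None, no_factory_calibration=False):
    """Pick the calibration source: explicit JSON > factory > idealized (opt-in only)."""
    if explicit_json is not None:
        return "explicit", explicit_json
    if no_factory_calibration:
        # Explicit opt-in to the idealized model, e.g. for an A/B comparison.
        return "idealized", None
    if factory is None:
        # Fail before frame extraction, rendering, SfM or training start.
        raise FactoryCalibrationError(
            "no factory calibration parsed; pass a calibration JSON or "
            "--insv_no_factory_calibration to use the idealized model"
        )
    return "factory", factory


source, _ = resolve_lens_calibration(factory={"xi": 1.0})
```

Calling the resolver before any expensive step means a missing calibration costs seconds rather than a full SfM and training run.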

load_factory_calibration now logs which rung of the ladder broke (no
trailer / no file_info record / no offset_v3 / unparseable offset_v3)
and, for the unparseable case, the raw offset_v3 string verbatim - that
string is everything needed to add support for an unknown layout, as
the X4-20-field vs X5-fw1.9-19-field split demonstrated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>