# Per-frame Metadata New Feature

Follow-on to #1585. This proposes how to attach **per-frame metadata** to a DIVE dataset and show
capture-time telemetry (one record per image, for example timestamp, latitude/longitude, water depth,
and altitude) in a playhead-synced panel. It targets web (Python/Girder) and desktop
(Electron/TypeScript).

Feedback welcome.

## Summary

Per-frame telemetry is observed, read-only data that describes the imagery, much like resolution or
frame rate. We propose treating it as a **property of the media**:

- read it from its source file when a dataset opens,
- show it in a panel synced to the playhead,
- and persist nothing derived.

The source file the user already has next to the imagery is the single source of truth. Everything
else is an in-memory, read-time projection of it. This keeps the feature small: a read-only viewer
with no new storage, no import step, and no data that can drift out of sync.

## Approach

| Decision      | Choice                                                        | Why                                                                                                                                                                                                                                             |
| ------------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Storage model | Property of the media, not an annotation                      | It is read-only observed data describing the imagery; it gains nothing from the editable annotation/track system.                                                                                                                               |
| Ingest        | None; the raw file rides along and is read on load            | No import step means no second copy to maintain, so the file on disk stays the one source of truth.                                                                                                                                             |
| Source (v1)   | Delimited text files only (`.txt` / `.csv` next to the media) | The target data is filename-keyed text; embedded sources add a parser with no v1 payoff.                                                                                                                                                        |
| Join          | Filename value-match                                          | Each image is a named file and the telemetry supplies the filename column; joining by row order is fragile against re-sort, gaps, and partial uploads.                                                                                          |
| Persistence   | Derive on load, write nothing durable                         | The source file is the single source of truth and a persisted copy can only drift. The client requests a frame window so memory stays bounded however long the dataset is (see Scale and access); making large server reads cheap is follow-up. |
| Serve         | One read-time endpoint (`loadFrameMetadata`)                  | A single windowed read serves both backends; nothing is precomputed or stored to answer it.                                                                                                                                                     |
| Display       | 2 columns at the top of the Dataset Metadata side panel.                              | The values only make sense against the current frame, so they ride the playhead like the rest of the viewer.                                                                                                                                    |
| Export        | Out of v1 (see Future work)                                   | Users already hold the source file and can join it downstream by filename, so export can wait until a consumer needs it.                                                                                                                        |

## Non-goals

| Item                                           | Why                                                                                                                                                                                                                                                                                                                                                                                                                     |
| ---------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Editing telemetry                              | Telemetry is observed, not authored.                                                                                                                                                                                                                                                                                                                                                                                    |
| Whole-frame classification and per-frame notes | These are editable, authored labels, a different kind of data from read-only observed telemetry. They are annotation-shaped (a whole-frame classification is a detection with a class and confidence; a note is a per-frame attribute) and, if built, belong in the annotation system rather than this media-property channel. Folding them in would mix editable authored data into a read-only, write-nothing design. |

Deferred features that the design is built to accommodate (export, embedded sources, charting,
training integration, and out-of-folder source selection) are described under Future work.

## Technical design

A prototype branch has validated the parse, serve, and panel path; the code below is the proposed
shape, with existing DIVE primitives referenced where we build on them.

**Principle:** infer per-frame metadata on load and never persist a duplicate. The `.txt` / `.csv` the
user dropped next to the imagery is the only stored form; everything else is an in-memory, read-time
projection.

### Source adapter seam

All source-specific logic sits behind one boundary that produces a normalized, frame-keyed table:

```
adapter(dataset, camera) -> records   // records: frame -> { field: value }, raw strings
```

- v1 ships one adapter: the text-file reader. A source is a delimited file with a header row, a
  filename column, and one row per image. The delimiter is sniffed (whitespace, comma, or tab);
  the NOAA samples are space-delimited `.txt`:

  ```
  port_image date time latitude longitude water_depth altitude ... starboard_image
  20191009.154056.00082_rect_color.tif 2019/10/09 15:40:56.1122 46.575870 -124.603094 192.80 2.78 ... 20191009.154056.00081_rect_color.tif
  ```

  The filename column is required: it joins each row to a frame by value (robust against re-sort,
  gaps, and partial uploads), lets discovery pick the file out of other text files, and never
  misaligns (a row matching no image is dropped, not shifted onto the wrong frame). A row may carry
  more than one filename column (stereo: `port_image` and `starboard_image`); each child folder
  matches its own column against its own media, so one shared row binds to both cameras (see
  Per-camera routing). Video, which has no per-frame filenames, is out of v1 and joins differently
  (see Future work).

- KLV or EXIF later is a second or third adapter behind the same boundary; the panel and load flow do
  not change.
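As a rough illustration of the text-file adapter's parsing step, the sketch below sniffs the delimiter from the header row and splits the file into a header and rows of raw strings. The `sniff_delimiter` helper and this body for `parse_table` are assumptions made for illustration, not the prototype's code. Checking for a tab first, then a comma, then falling back to whitespace is one plausible order. The sample is the NOAA row above, cut down to its first four columns.

```python
import csv


def sniff_delimiter(header_line):
    """Return a tab or comma if the header uses one, else None for whitespace runs."""
    if '\t' in header_line:
        return '\t'
    if ',' in header_line:
        return ','
    return None


def parse_table(text):
    """Split delimited text into (header, rows); every cell stays a raw string."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return [], []
    delimiter = sniff_delimiter(lines[0])
    if delimiter is None:
        # Space-delimited: split on any run of whitespace.
        table = [line.split() for line in lines]
    else:
        reader = csv.reader(lines, delimiter=delimiter)
        table = [[cell.strip() for cell in row] for row in reader]
    return table[0], table[1:]


sample = (
    "port_image date time latitude\n"
    "20191009.154056.00082_rect_color.tif 2019/10/09 15:40:56.1122 46.575870\n"
)
header, rows = parse_table(sample)
```

Values are left as strings on purpose, matching the pass-through display described later.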

### Read path (server)

The read path sniffs, dispatches on source, and routes each record to the current frame index:

```python
def load_frame_metadata_records(folder, user):
    children = multicam_children(folder)                  # None for a single dataset
    cameras = {}
    for child in (children or [folder]):
        # Match DIVE's runtime camera key: 'singleCam' for one camera, the camera name for multicam.
        camera = child['name'] if children else 'singleCam'
        media = valid_images(child, user)                 # ordered, gives frame index
        media_keys = valid_image_names_dict(media)        # name -> frame
        frames = cameras.setdefault(camera, {})
        for fname, values in read_source(folder, child, media_keys, user):
            frame = media_keys.get(normalize_key(fname))
            if frame is not None:
                frames[frame] = values                    # collision guard omitted for brevity
    return {'cameras': cameras}


def read_source(folder, child, media_keys, user):
    # v1: file only. Candidates are co-located .txt/.csv; discrimination is in is_frame_metadata.
    for item in candidate_table_items(folder, child):     # parent root plus this child folder
        text = download(item)
        if is_frame_metadata(text, media_keys):           # rejects DIVE formats, requires a filename match
            header, rows = parse_table(text)
            jc = find_join_columns(header, rows, media_keys)[0]
            return ((row[jc], dict(zip(header, row))) for row in rows)
    return ()                                              # exif / klv adapters slot in here later
```

Each branch yields `(filename-or-frame, fields)`, and the common loop routes by content to the
current frame index, so reordering media never stales the result.

**Per-camera routing and collisions.** Routing falls out of the per-child loop: `find_join_columns`
re-runs per child against that child's `media_keys`, so each camera self-selects its own filename
column. A stereo file at the dataset root with both `port_image` and `starboard_image` therefore
binds each row to both cameras (port matches its column, starboard matches its own), which is what
shared telemetry wants. The one guard: if a `(camera, frame)` gets two different value sets, it is
skipped rather than guessed (identical values are fine).
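The collision guard omitted from the read-path code could look roughly like the following. This is a minimal sketch under assumptions: `merge_frame_records` is a hypothetical helper name, it handles one camera at a time, and the depth values are made up for the example.

```python
def merge_frame_records(pairs):
    """Build frame -> values, dropping any frame that receives conflicting values."""
    frames = {}
    conflicted = set()
    for frame, values in pairs:
        if frame in conflicted:
            continue  # already known to be ambiguous; stay skipped
        if frame in frames and frames[frame] != values:
            # Two different value sets for one frame: skip rather than guess.
            del frames[frame]
            conflicted.add(frame)
        else:
            frames[frame] = values  # first sighting, or an identical repeat
    return frames


merged = merge_frame_records([
    (0, {'water_depth': '192.80'}),
    (0, {'water_depth': '192.80'}),  # identical repeat: kept
    (1, {'water_depth': '192.80'}),
    (1, {'water_depth': '193.10'}),  # conflict: frame 1 is skipped
])
```

Tracking conflicted frames separately keeps a third row for the same frame from quietly reinstating it.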

This reuses existing DIVE helpers (`valid_images`, `valid_image_names_dict` in
`server/dive_server/crud.py`) and the dataset serve endpoint pattern. The parser helpers
(`is_frame_metadata`, `parse_table`, `find_join_columns`, `normalize_key`) are new, in a
`server/dive_utils/serializers/frame_metadata.py` module.

### Source discovery and discrimination

A dataset folder already holds other delimited files, so the reader must pick the telemetry file
without grabbing an annotation file by mistake. Candidates are the dataset folder's own `.txt` /
`.csv` entries plus the parent root for multicam (on web these are Girder items in the folder; on
desktop they are files in the directory). The dangerous collision is the **VIAME annotation CSV**,
whose second column is the image identifier, so its values match the media filenames exactly. A
sniffer that only looked for "a column that matches the filenames" would select the annotation file.
`is_frame_metadata` therefore applies a fail-safe ladder, cheapest first:

1. **Extension filter.** Keep only `.txt` / `.csv`. This alone drops every JSON file (`meta.json`,
   `multiCam.json`, `*.dive.json`, COCO), calibration files (`.npz`, `.cam`, `.yml`, `.zip`), and
   `.pipe` pipelines. App-generated lists (`*_images.txt`, `labels.txt`, intermediate output CSVs)
   are written to a temporary working directory during a job and never land in the dataset folder, so
   they are not candidates.
2. **Reject DIVE's own formats by content.** Run the existing VIAME detector and skip anything that
   parses as VIAME: its rows are comment-headed (`# 1: Detection or Track-id`), have at least nine
   columns, and begin with an integer track id followed by a float bounding box
   (`load_csv_as_tracks_and_attributes` in `server/dive_utils/serializers/viame.py`). A telemetry file
   has a plain header row of field names and no `#` comment header, so it passes. Reusing DIVE's
   parser as a negative filter is more robust than inventing a header sniff.
3. **Require a positive filename match.** Among survivors, accept only a file with a column whose
   values match the media basenames (`normalize_key` strips extensions because `valid_image_names_dict`
   keys images without them). This is self-selecting: an unrelated `readme.txt` or `notes.csv` has no
   such column and is ignored without configuration.
4. **Require a payload, and skip ambiguity.** The matched file must carry at least one column beyond
   the join column, which rejects a bare image-list file. If two distinct files still value-match, the
   reader does not guess; it skips rather than attach the wrong one.

The effect is that the telemetry file needs no special name: the user drops it next to the imagery,
the annotation CSV is removed at rung 2, and whatever positively matches the filenames is the source.
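Rungs 3 and 4 of the ladder can be sketched as below. The function names `normalize_key` and `find_join_columns` come from this proposal, but the bodies here are assumptions, as is the helper `passes_match_and_payload`. The sketch treats a column as a join column when at least one of its values matches a media key, which is a guess at the threshold and not a settled rule.

```python
import os


def normalize_key(name):
    """Strip the extension so a telemetry value compares against extension-less media keys."""
    return os.path.splitext(name.strip())[0]


def find_join_columns(header, rows, media_keys):
    """Return indices of columns in which at least one value matches a media key."""
    return [
        index
        for index in range(len(header))
        if any(index < len(row) and normalize_key(row[index]) in media_keys for row in rows)
    ]


def passes_match_and_payload(header, rows, media_keys):
    """Rungs 3 and 4: a filename column must match, and a non-join column must exist."""
    join_columns = find_join_columns(header, rows, media_keys)
    return bool(join_columns) and len(header) > len(join_columns)


media_keys = {'20191009.154056.00082_rect_color': 0}  # name -> frame
telemetry = passes_match_and_payload(
    ['port_image', 'latitude'],
    [['20191009.154056.00082_rect_color.tif', '46.575870']],
    media_keys,
)
bare_list = passes_match_and_payload(
    ['port_image'], [['20191009.154056.00082_rect_color.tif']], media_keys
)
readme = passes_match_and_payload(['notes'], [['hello']], media_keys)
```

Here `telemetry` is accepted, while `bare_list` fails the payload rung and `readme` fails the filename match.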

### Serve contract

Both backends satisfy one `loadFrameMetadata` contract. It takes a frame range (see Scale and
access) and returns the matching records keyed `camera -> frame -> values`. A frame's entry is just
the values read for it; there is no separate schema or status payload, and the keys may differ from
one frame to the next:

```jsonc
{
  "cameras": {
    "port": {
      "0": { "date": "2019/10/09", "latitude": "46.575870", "...": "..." },
      "1": { "date": "2019/10/09", "latitude": "46.575912", "...": "..." },
    },
    "starboard": { "0": { "...": "..." } },
  },
}
```

The camera map sits under a `cameras` namespace so later top-level keys (for example `status`) cannot
collide with a camera id. Single-camera datasets use the one key `singleCam`. Only frames with data
appear, so a windowed read returns just the present frames. The panel renders whatever keys a frame
carries, so the client stays a pass-through viewer.
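To make the windowed read concrete, the sketch below trims a parsed `camera -> frame -> values` map to a frame range and produces the response shape shown above. The helper name `window_records`, the `start` / `end` parameter names, and the inclusive bounds are all assumptions; the proposal does not fix how the range is expressed.

```python
def window_records(cameras, start, end):
    """Return only frames in [start, end], keyed camera -> frame -> values."""
    return {
        'cameras': {
            camera: {
                str(frame): values  # JSON object keys are strings
                for frame, values in sorted(frames.items())
                if start <= frame <= end
            }
            for camera, frames in cameras.items()
        }
    }


parsed = {
    'singleCam': {
        0: {'latitude': '46.575870'},
        5: {'latitude': '46.575912'},
        9: {'latitude': '46.575912'},
    }
}
response = window_records(parsed, 4, 9)
```

Frames with no record simply do not appear, so a window over a sparse region returns a small or empty map for that camera.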

### Client representation

The inference result is plain reactive state that serves as the per-session cache and nothing heavier:

- a `ref` holding the served window, the `cameras` object refetched around the playhead as it moves
  (see Scale and access). It is already keyed `camera -> frame -> values`, so it is the cache as-is,
  with no re-indexing step;
- a `currentRows` computed off the playhead frame and the selected camera
  (`cameras[selectedCamera]?.[frame]`), showing the active frame's values in the order they appear,
  whatever keys are present.

It is deliberately not an annotation store: we keep it out of the annotation and attribute stores,
which carry edit/save/revision semantics this read-only data does not need.

**Why not the attribute system.** Attributes look like the natural home, but they require a track
parent: `belongs` is only `track` or `detection` (`client/src/use/AttributeTypes.ts`), and values
live at `track.features[frame].attributes[key]`. Telemetry has no track, so this would mean a
fabricated whole-dataset track with an invented box, columns added as schema-polluting attribute
definitions, and an editable save lifecycle, all against a read-only, write-nothing goal.

### Scale and access

A dataset can run to hundreds of thousands of frames with tens of fields each. The panel only needs
the current frame, so the client **requests a window around the playhead** and refetches as the user
scrubs, which bounds client memory however long the dataset is.

The window bounds client memory and transfer, not server work: a text source is not seekable by
frame, so v1 still parses the whole file per request to build the frame->row map. The remaining
concern is therefore server-side and is **follow-up work**: parse the source once when the dataset
opens and serve windows from the result, rather than re-parsing per request (which only bites a very
long source on a multi-worker web deployment). The fix is a shared cache of the parsed result, and it
changes neither the client nor the serve contract.
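The client-side windowing logic is small enough to sketch. It is written in Python here for consistency with the server examples, although the real client is TypeScript. The names `window_for` and `needs_refetch` and the `half_width` parameter are hypothetical, and refetching only once the playhead leaves the loaded window is one possible policy, not a decided one.

```python
def window_for(frame, half_width, total_frames):
    """Clamp a window of +/- half_width frames around the playhead to the dataset."""
    start = max(0, frame - half_width)
    end = min(total_frames - 1, frame + half_width)
    return start, end


def needs_refetch(frame, loaded_window):
    """Refetch when nothing is loaded yet or the playhead has left the loaded window."""
    if loaded_window is None:
        return True
    start, end = loaded_window
    return not (start <= frame <= end)


loaded = window_for(0, 50, 1000)        # first load at the start of the dataset
stay = needs_refetch(10, loaded)        # still inside the window
moved = needs_refetch(51, loaded)       # scrubbed past the window edge
```

Client memory then scales with `half_width`, not with the dataset length.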

### Display

The values appear at the top of the Dataset Metadata side panel. The current frame number and filename
are already shown in the playback controls via `FileNameTimeDisplay`, so the panel does not repeat
them. The panel shows:

- the active frame's values in the order they arrive, verbatim (pass-through strings, no type
  inference), whatever keys are present that frame;
- empty states: platform unsupported, no metadata for the dataset (with a hint to drop a telemetry
  file next to the imagery), no metadata for the current frame.

Multicam display selects the active camera's records.

### Cross-backend

Web (Python) and desktop (TypeScript) are mirrored implementations governed by shared fixtures (sample
`.txt` files plus expected parsed output). The desktop resolver re-reads at load and likewise writes
nothing.

## Future work

Each item below is out of v1, but the design leaves a clean path to it.

- **Selecting a source file from another location.** Add an explicit "select telemetry file" action
  that records a pointer to the chosen source (a local path on desktop, an item id on web), which the
  resolver reads instead of sniffing the dataset folder. This follows an existing DIVE pattern:
  calibration is not inferable from folder contents, so the dataset stores a `calibrationItemId`
  pointer in `meta.json` (`MultiCamMetaStorage` in `server/dive_utils/models.py`); an out-of-folder
  telemetry source would store an analogous pointer in the same place. The contrast also explains why
  v1 needs no pointer: telemetry sitting next to the imagery is inferable, so it is sniffed rather than
  referenced. The adapter seam already separates "where the source is" from "how it is parsed," so this
  is a new way to locate a source rather than a new pipeline, and it still writes no derived copy (the
  pointer references the source, not a parsed duplicate).
- **Export to KWCOCO.** Serialize the loaded records into an `info.dive_frame_metadata` block keyed by
  file name and advertised in `info.dive_extensions`. The normalized frame-to-fields table is already
  built on load, so export is just serializing what is already in memory.
- **Embedded sources (KLV in video, EXIF in images).** Add one adapter per source behind the existing
  seam. The boundary is already frame-keyed and the read path already dispatches on source kind, so
  the panel, serve contract, and client cache are unchanged.
- **Video telemetry.** A single video has no per-frame filenames, so a video adapter joins on a
  frame-index or timestamp column (the filename column survives as a constant video name for
  discovery). Timestamp is the safer key, since DIVE transcodes video and can renumber frames. The
  preferred long-term path is embedded KLV (above), frame-synced by construction; a text sidecar
  joined by timestamp is a second-class option. Either is a new adapter behind the seam, so the panel,
  serve contract, and client are unchanged.
- **Over-time charts (for example depth or altitude against time).** Point a timeline chart at the
  same in-memory cache. Because the data is already a frame-keyed reactive store, a chart is just a
  second reader of it rather than a new data path.
- **Telemetry as a model training input.** Emit the values in the KWCOCO export above so an external
  trainer that reads COCO can join them to each image by file name. Values are kept as pass-through
  strings joinable by file name, so nothing is lost or coerced on the way out.
- **Server-side caching for very large datasets.** Hold the once-parsed source in a shared cache so
  windowed reads never re-parse a large file, which matters mainly on a multi-worker web deployment.
  The client and serve contract are already frame-windowed, so this is a server-only change (see Scale
  and access).

