Per-frame Metadata New Feature #1725


@PaulHax

Follow-on to #1585. This proposes how to attach per-frame metadata to a DIVE dataset and show
capture-time telemetry (one record per image, for example timestamp, latitude/longitude, water depth,
and altitude) in a playhead-synced panel. It targets web (Python/Girder) and desktop
(Electron/TypeScript).

Feedback welcome.

Summary

Per-frame telemetry is observed, read-only data that describes the imagery, much like resolution or
frame rate. We propose treating it as a property of the media:

  • read it from its source file when a dataset opens,
  • show it in a panel synced to the playhead,
  • and persist nothing derived.

The source file the user already has next to the imagery is the single source of truth. Everything
else is an in-memory, read-time projection of it. This keeps the feature small: a read-only viewer
with no new storage, no import step, and no data that can drift out of sync.

Approach

| Decision | Choice | Why |
| --- | --- | --- |
| Storage model | Property of the media, not an annotation | It is read-only observed data describing the imagery; it gains nothing from the editable annotation/track system. |
| Ingest | None; the raw file rides along and is read on load | No import step means no second copy to maintain, so the file on disk stays the one source of truth. |
| Source (v1) | Delimited text files only (.txt / .csv next to the media) | The target data is filename-keyed text; embedded sources add a parser with no v1 payoff. |
| Join | Filename value-match | Each image is a named file and the telemetry supplies the filename column; joining by row order is fragile against re-sort, gaps, and partial uploads. |
| Persistence | Derive on load, write nothing durable | The source file is the single source of truth and a persisted copy can only drift. The client requests a frame window so memory stays bounded however long the dataset is (see Scale and access); making large server reads cheap is follow-up. |
| Serve | One read-time endpoint (loadFrameMetadata) | A single windowed read serves both backends; nothing is precomputed or stored to answer it. |
| Display | 2 columns at the top of the Dataset Metadata side panel | The values only make sense against the current frame, so they ride the playhead like the rest of the viewer. |
| Export | Out of v1 (see Future work) | Users already hold the source file and can join it downstream by filename, so export can wait until a consumer needs it. |

Non-goals

| Item | Why |
| --- | --- |
| Editing telemetry | Telemetry is observed, not authored. |
| Whole-frame classification and per-frame notes | These are editable, authored labels, a different kind of data from read-only observed telemetry. They are annotation-shaped (a whole-frame classification is a detection with a class and confidence; a note is a per-frame attribute) and, if built, belong in the annotation system rather than this media-property channel. Folding them in would mix editable authored data into a read-only, write-nothing design. |

Deferred features that the design is built to accommodate (export, embedded sources, charting,
training integration, and out-of-folder source selection) are described under Future work.

Technical design

A prototype branch has validated the parse, serve, and panel path; the code below is the proposed
shape, with existing DIVE primitives referenced where we build on them.

Principle: infer per-frame metadata on load and never persist a duplicate. The .txt / .csv the
user dropped next to the imagery is the only stored form; everything else is an in-memory, read-time
projection.

Source adapter seam

All source-specific logic sits behind one boundary that produces a normalized, frame-keyed table:

adapter(dataset, camera) -> records   // records: frame -> { field: value }, raw strings
  • v1 ships one adapter: the text-file reader. A source is a delimited file with a header row, a
    filename column, and one row per image. The delimiter is sniffed (whitespace, comma, or tab);
    the NOAA samples are space-delimited .txt:

    port_image date time latitude longitude water_depth altitude ... starboard_image
    20191009.154056.00082_rect_color.tif 2019/10/09 15:40:56.1122 46.575870 -124.603094 192.80 2.78 ... 20191009.154056.00081_rect_color.tif
    

    The filename column is required: it joins each row to a frame by value (robust against re-sort,
    gaps, and partial uploads), lets discovery pick the file out of other text files, and never
    misaligns (a row matching no image is dropped, not shifted onto the wrong frame). A row may carry
    more than one filename column (stereo: port_image and starboard_image); each child folder
    matches its own column against its own media, so one shared row binds to both cameras (see
    Per-camera routing). Video, which has no per-frame filenames, is out of v1 and joins differently
    (see Future work).

  • KLV or EXIF later is a second or third adapter behind the same boundary; the panel and load flow do
    not change.
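
As an illustration of the text-file adapter's first step, here is a minimal sketch of what parse_table could look like. The helper sniff_delimiter, the comma-then-tab-then-whitespace precedence, and the choice to drop rows whose width does not match the header are assumptions made for this example, not the prototype's behaviour.

```python
import csv
import io


def sniff_delimiter(header_line):
    """Return ',' or a tab if the header contains one, else None for whitespace runs."""
    if "," in header_line:
        return ","
    if "\t" in header_line:
        return "\t"
    return None


def parse_table(text):
    """Split delimited text into (header, rows), keeping every value a raw string."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return [], []
    delimiter = sniff_delimiter(lines[0])
    if delimiter is None:
        # Space-delimited source: split on any run of whitespace.
        table = [line.split() for line in lines]
    else:
        reader = csv.reader(io.StringIO("\n".join(lines)), delimiter=delimiter)
        table = [[cell.strip() for cell in row] for row in reader]
    header, rows = table[0], table[1:]
    # Assumption: a row of the wrong width is dropped rather than misaligned.
    return header, [row for row in rows if len(row) == len(header)]
```

Values stay strings end to end, which matches the pass-through display described later; no numeric or date parsing happens here.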

Read path (server)

The read path sniffs, dispatches on source, and routes each record to the current frame index:

def load_frame_metadata_records(folder, user):
    children = multicam_children(folder)                  # None for a single dataset
    cameras = {}
    for child in (children or [folder]):
        # Match DIVE's runtime camera key: 'singleCam' for one camera, the camera name for multicam.
        camera = child['name'] if children else 'singleCam'
        media = valid_images(child, user)                 # ordered, gives frame index
        media_keys = valid_image_names_dict(media)        # name -> frame
        frames = cameras.setdefault(camera, {})
        for fname, values in read_source(folder, child, media_keys, user):
            frame = media_keys.get(normalize_key(fname))
            if frame is not None:
                frames[frame] = values                    # collision guard omitted for brevity
    return {'cameras': cameras}


def read_source(folder, child, media_keys, user):
    # v1: file only. Candidates are co-located .txt/.csv; discrimination is in is_frame_metadata.
    for item in candidate_table_items(folder, child):     # parent root plus this child folder
        text = download(item)
        if is_frame_metadata(text, media_keys):           # rejects DIVE formats, requires a filename match
            header, rows = parse_table(text)
            jc = find_join_columns(header, rows, media_keys)[0]
            return ((row[jc], dict(zip(header, row))) for row in rows)
    return ()                                              # exif / klv adapters slot in here later

Each branch yields (filename-or-frame, fields), and the common loop routes each record by content to
the current frame index, so reordering media never leaves the result stale.

Per-camera routing and collisions. Routing falls out of the per-child loop: find_join_columns
re-runs per child against that child's media_keys, so each camera self-selects its own filename
column. A stereo file at the dataset root with both port_image and starboard_image therefore
binds each row to both cameras (port matches its column, starboard matches its own), which is what
shared telemetry wants. The one guard: if a (camera, frame) gets two different value sets, it is
skipped rather than guessed (identical values are fine).
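
The collision guard omitted from the read path above could look roughly like the sketch below. route_records is a hypothetical name, and the rule that a conflicted frame stays excluded even if a later row repeats the first value set is an assumption of this example.

```python
def route_records(records, media_keys, normalize_key):
    """Map (filename, values) pairs to {frame: values}, skipping conflicting frames."""
    frames = {}
    conflicted = set()
    for fname, values in records:
        frame = media_keys.get(normalize_key(fname))
        if frame is None or frame in conflicted:
            continue  # no matching image, or already known to be ambiguous
        if frame in frames and frames[frame] != values:
            # Two different value sets for one frame: skip it rather than guess.
            del frames[frame]
            conflicted.add(frame)
            continue
        frames[frame] = values  # first sighting, or an identical repeat
    return frames
```

Calling this once per child with that child's media_keys gives the per-camera behaviour described above: a row that names no image in this camera is simply ignored.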

This reuses existing DIVE helpers (valid_images, valid_image_names_dict in
server/dive_server/crud.py) and the dataset serve endpoint pattern. The parser helpers
(is_frame_metadata, parse_table, find_join_columns, normalize_key) are new, in a
server/dive_utils/serializers/frame_metadata.py module.

Source discovery and discrimination

A dataset folder already holds other delimited files, so the reader must pick the telemetry file
without grabbing an annotation file by mistake. Candidates are the dataset folder's own .txt /
.csv entries plus the parent root for multicam (on web these are Girder items in the folder; on
desktop they are files in the directory). The dangerous collision is the VIAME annotation CSV,
whose second column is the image identifier, so its values match the media filenames exactly. A
sniffer that only looked for "a column that matches the filenames" would select the annotation file.
is_frame_metadata therefore applies a fail-safe ladder, cheapest first:

  1. Extension filter. Keep only .txt / .csv. This alone drops every JSON file (meta.json,
    multiCam.json, *.dive.json, COCO), calibration files (.npz, .cam, .yml, .zip), and
    .pipe pipelines. App-generated lists (*_images.txt, labels.txt, intermediate output CSVs)
    are written to a temporary working directory during a job and never land in the dataset folder, so
    they are not candidates.
  2. Reject DIVE's own formats by content. Run the existing VIAME detector and skip anything that
    parses as VIAME: the file is comment-headed (# 1: Detection or Track-id), and its rows have at
    least nine columns, beginning with an integer track id and carrying a float bounding box
    (load_csv_as_tracks_and_attributes in server/dive_utils/serializers/viame.py). A telemetry file
    has a plain header row of field names and no # comment header, so it passes. Reusing DIVE's
    parser as a negative filter is more robust than inventing a header sniff.
  3. Require a positive filename match. Among survivors, accept only a file with a column whose
    values match the media basenames (normalize_key strips extensions because valid_image_names_dict
    keys images without them). This is self-selecting: an unrelated readme.txt or notes.csv has no
    such column and is ignored without configuration.
  4. Require a payload, and skip ambiguity. The matched file must carry at least one column beyond
    the join column, which rejects a bare image-list file. If two distinct files still value-match, the
    reader does not guess; it skips rather than attach the wrong one.

The effect is that the telemetry file needs no special name: the user drops it next to the imagery,
the annotation CSV is removed at rung 2, and whatever positively matches the filenames is the source.
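
Rungs 3 and 4 of the ladder can be sketched as below. The function bodies are illustrative guesses at the new helpers, passes_match_and_payload is a hypothetical name, and the threshold of "at least one matching value" for a join column is an assumption; the real helpers may require a stricter match.

```python
import os


def normalize_key(name):
    """Basename without its final extension, e.g. 'a.b.tif' -> 'a.b'."""
    return os.path.splitext(os.path.basename(name.strip()))[0]


def find_join_columns(header, rows, media_keys):
    """Return indexes of columns with at least one value naming a known image."""
    return [
        index
        for index in range(len(header))
        if any(
            normalize_key(row[index]) in media_keys
            for row in rows
            if index < len(row)
        )
    ]


def passes_match_and_payload(header, rows, media_keys):
    """Rungs 3 and 4: a filename column exists and at least one other column remains."""
    join_columns = find_join_columns(header, rows, media_keys)
    return bool(join_columns) and len(header) > len(join_columns)
```

Because only the final extension is stripped, a dotted NOAA name such as 20191009.154056.00082_rect_color.tif keeps its full stem and still matches the media key.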

Serve contract

One loadFrameMetadata contract satisfied by both backends, taking a frame range (see Scale and
access) and returning the matching records keyed camera -> frame -> values. A frame's entry is just
the values read for it; there is no separate schema or status payload, and the keys may differ from
one frame to the next:

{
  "cameras": {
    "port": {
      "0": { "date": "2019/10/09", "latitude": "46.575870", "...": "..." },
      "1": { "date": "2019/10/09", "latitude": "46.575912", "...": "..." }
    },
    "starboard": { "0": { "...": "..." } }
  }
}

The camera map sits under a cameras namespace so later top-level keys (for example status) cannot
collide with a camera id. Single-camera datasets use the one key singleCam. Only frames with data
appear, so a windowed read returns just the present frames. The panel renders whatever keys a frame
carries, so the client stays a pass-through viewer.
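
A minimal sketch of how a backend might cut the parsed records down to the requested window before serializing. window_records is a hypothetical name, and treating the range as inclusive on both ends is an assumption; the contract above only says the call takes a frame range.

```python
def window_records(cameras, start, end):
    """Return only frames in [start, end] per camera, with JSON-style string keys."""
    return {
        "cameras": {
            camera: {
                str(frame): values
                for frame, values in sorted(frames.items())
                if start <= frame <= end
            }
            for camera, frames in cameras.items()
        }
    }
```

Frames without data never appear in the input map, so they are absent from the output too, and a camera with nothing in the window comes back as an empty object.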

Client representation

The inference result is held as plain reactive state, a per-session cache and nothing heavier:

  • a ref holding the served window, the cameras object refetched around the playhead as it moves
    (see Scale and access). It is already keyed camera -> frame -> values, so it is the cache as-is,
    with no re-indexing step;
  • a currentRows computed off the playhead frame and the selected camera
    (cameras[selectedCamera]?.[frame]), showing the active frame's values in the order they appear,
    whatever keys are present.

It is deliberately not an annotation store: we keep it out of the annotation and attribute stores,
which carry edit/save/revision semantics this read-only data does not need.

Why not the attribute system. Attributes look like the natural home, but they require a track
parent: belongs is only track or detection (client/src/use/AttributeTypes.ts), and values
live at track.features[frame].attributes[key]. Telemetry has no track, so this would mean a
fabricated whole-dataset track with an invented box, columns added as schema-polluting attribute
definitions, and an editable save lifecycle, all against a read-only, write-nothing goal.

Scale and access

A dataset can run to hundreds of thousands of frames with tens of fields each. The panel only needs
the current frame, so the client requests a window around the playhead and refetches as the user
scrubs, which bounds client memory however long the dataset is.

The window bounds client memory and transfer, not server work: a text source is not seekable by
frame, so v1 still parses the whole file per request to build the frame->row map. The remaining
concern is therefore server-side and is follow-up work: parse the source once when the dataset
opens and serve windows from the result, rather than re-parsing per request (which only bites a very
long source on a multi-worker web deployment). The fix is a shared cache of the parsed result, and it
changes neither the client nor the serve contract.
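
The client-side window arithmetic is small enough to sketch. The real client is TypeScript; this is the same logic written in Python for consistency with the server code, and window_for, needs_refetch, half_width, and margin are all hypothetical names and parameters chosen for the example.

```python
def window_for(frame, half_width, frame_count):
    """Inclusive (start, end) window centred on the playhead, clamped to the dataset."""
    start = max(0, frame - half_width)
    end = min(frame_count - 1, frame + half_width)
    return start, end


def needs_refetch(frame, window, margin, frame_count):
    """True when no window is loaded or the playhead is close to an unclamped edge."""
    if window is None:
        return True
    start, end = window
    # An edge that already sits on the dataset boundary has nothing more to fetch.
    near_start = start > 0 and frame - start < margin
    near_end = end < frame_count - 1 and end - frame < margin
    return near_start or near_end
```

Refetching slightly before the playhead reaches the edge keeps the panel populated during scrubbing, while ignoring clamped edges avoids repeated requests at the first and last frames.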

Display

At the top of the Dataset Metadata side panel (the current frame number and filename are already shown in the playback
controls via FileNameTimeDisplay, so the panel does not repeat them):

  • the active frame's values in the order they arrive, verbatim (pass-through strings, no type
    inference), whatever keys are present that frame;
  • empty states: platform unsupported, no metadata for the dataset (with a hint to drop a telemetry
    file next to the imagery), no metadata for the current frame.

Multicam display selects the active camera's records.

Cross-backend

Web (Python) and desktop (TypeScript) are mirrored implementations governed by shared fixtures (sample
.txt files plus expected parsed output). The desktop resolver re-reads at load and likewise writes
nothing.

Future work

Each item below is out of v1, but the design leaves a clean path to it.

  • Selecting a source file from another location. Add an explicit "select telemetry file" action
    that records a pointer to the chosen source (a local path on desktop, an item id on web), which the
    resolver reads instead of sniffing the dataset folder. This follows an existing DIVE pattern:
    calibration is not inferable from folder contents, so the dataset stores a calibrationItemId
    pointer in meta.json (MultiCamMetaStorage in server/dive_utils/models.py); an out-of-folder
    telemetry source would store an analogous pointer in the same place. The contrast also explains why
    v1 needs no pointer: telemetry sitting next to the imagery is inferable, so it is sniffed rather than
    referenced. The adapter seam already separates "where the source is" from "how it is parsed," so this
    is a new way to locate a source rather than a new pipeline, and it still writes no derived copy (the
    pointer references the source, not a parsed duplicate).
  • Export to KWCOCO. Serialize the loaded records into an info.dive_frame_metadata block keyed by
    file name and advertised in info.dive_extensions. The normalized frame-to-fields table is already
    built on load, so export is just serializing what is already in memory.
  • Embedded sources (KLV in video, EXIF in images). Add one adapter per source behind the existing
    seam. The boundary is already frame-keyed and the read path already dispatches on source kind, so
    the panel, serve contract, and client cache are unchanged.
  • Video telemetry. A single video has no per-frame filenames, so a video adapter joins on a
    frame-index or timestamp column (the filename column survives as a constant video name for
    discovery). Timestamp is the safer key, since DIVE transcodes video and can renumber frames. The
    preferred long-term path is embedded KLV (above), frame-synced by construction; a text sidecar
    joined by timestamp is a second-class option. Either is a new adapter behind the seam, so the panel,
    serve contract, and client are unchanged.
  • Over-time charts (for example depth or altitude against time). Point a timeline chart at the
    same in-memory cache. Because the data is already a frame-keyed reactive store, a chart is just a
    second reader of it rather than a new data path.
  • Telemetry as a model training input. Emit the values in the KWCOCO export above so an external
    trainer that reads COCO can join them to each image by file name. Values are kept as pass-through
    strings joinable by file name, so nothing is lost or coerced on the way out.
  • Server-side caching for very large datasets. Hold the once-parsed source in a shared cache so
    windowed reads never re-parse a large file, which matters mainly on a multi-worker web deployment.
    The client and serve contract are already frame-windowed, so this is a server-only change (see Scale
    and access).
