Per-frame Metadata New Feature
Follow-on to #1585. This proposes how to attach per-frame metadata to a DIVE dataset: show
capture-time telemetry (one record per image, for example timestamp, latitude/longitude, water depth,
and altitude) in a playhead-synced panel. It targets both web (Python/Girder) and desktop
(Electron/TypeScript).
Feedback welcome.
Summary
Per-frame telemetry is observed, read-only data that describes the imagery, much like resolution or
frame rate. We propose treating it as a property of the media:
- read it from its source file when a dataset opens,
- show it in a panel synced to the playhead,
- and persist nothing derived.
The source file the user already has next to the imagery is the single source of truth. Everything
else is an in-memory, read-time projection of it. This keeps the feature small: a read-only viewer
with no new storage, no import step, and no data that can drift out of sync.
Approach
| Decision | Choice | Why |
| --- | --- | --- |
| Storage model | Property of the media, not an annotation | It is read-only observed data describing the imagery; it gains nothing from the editable annotation/track system. |
| Ingest | None; the raw file rides along and is read on load | No import step means no second copy to maintain, so the file on disk stays the one source of truth. |
| Source (v1) | Delimited text files only (.txt / .csv next to the media) | The target data is filename-keyed text; embedded sources add a parser with no v1 payoff. |
| Join | Filename value-match | Each image is a named file and the telemetry supplies the filename column; joining by row order is fragile against re-sort, gaps, and partial uploads. |
| Persistence | Derive on load, write nothing durable | The source file is the single source of truth and a persisted copy can only drift. The client requests a frame window so memory stays bounded however long the dataset is (see Scale and access); making large server reads cheap is follow-up. |
| Serve | One read-time endpoint (loadFrameMetadata) | A single windowed read serves both backends; nothing is precomputed or stored to answer it. |
| Display | 2 columns at the top of the Dataset Metadata side panel. | The values only make sense against the current frame, so they ride the playhead like the rest of the viewer. |
| Export | Out of v1 (see Future work) | Users already hold the source file and can join it downstream by filename, so export can wait until a consumer needs it. |
Non-goals
| Item | Why |
| --- | --- |
| Editing telemetry | Telemetry is observed, not authored. |
| Whole-frame classification and per-frame notes | These are editable, authored labels, a different kind of data from read-only observed telemetry. They are annotation-shaped (a whole-frame classification is a detection with a class and confidence; a note is a per-frame attribute) and, if built, belong in the annotation system rather than this media-property channel. Folding them in would mix editable authored data into a read-only, write-nothing design. |
Deferred features that the design is built to accommodate (export, embedded sources, charting,
training integration, and out-of-folder source selection) are described under Future work.
Technical design
A prototype branch has validated the parse, serve, and panel path; the code below is the proposed
shape, with existing DIVE primitives referenced where we build on them.
Principle: infer per-frame metadata on load and never persist a duplicate. The .txt / .csv the
user dropped next to the imagery is the only stored form; everything else is an in-memory, read-time
projection.
Source adapter seam
All source-specific logic sits behind one boundary that produces a normalized, frame-keyed table:
    adapter(dataset, camera) -> records   // records: frame -> { field: value }, raw strings

- v1 ships one adapter: the text-file reader. A source is a delimited file with a header row, a
filename column, and one row per image. The delimiter is sniffed (whitespace, comma, or tab);
the NOAA samples are space-delimited .txt:

    port_image date time latitude longitude water_depth altitude ... starboard_image
    20191009.154056.00082_rect_color.tif 2019/10/09 15:40:56.1122 46.575870 -124.603094 192.80 2.78 ... 20191009.154056.00081_rect_color.tif

The filename column is required: it joins each row to a frame by value (robust against re-sort,
gaps, and partial uploads), lets discovery pick the file out of other text files, and never
misaligns (a row matching no image is dropped, not shifted onto the wrong frame). A row may carry
more than one filename column (stereo: port_image and starboard_image); each child folder
matches its own column against its own media, so one shared row binds to both cameras (see
Per-camera routing). Video, which has no per-frame filenames, is out of v1 and joins differently
(see Future work).
- KLV or EXIF later is a second or third adapter behind the same boundary; the panel and load flow do
not change.
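As a rough illustration of the text-file adapter's parsing step, the sketch below sniffs the delimiter from the header row and returns raw string cells. It is not the prototype's code: `sniff_delimiter` is a hypothetical helper, this `parse_table` is only an assumed shape for the helper of that name, and dropping rows whose width disagrees with the header is an assumption rather than a stated rule.

```python
import csv


def sniff_delimiter(header_line):
    """Pick a delimiter from the header row: tab, then comma, else whitespace."""
    if "\t" in header_line:
        return "\t"
    if "," in header_line:
        return ","
    return None  # None means "split on runs of whitespace"


def parse_table(text):
    """Return (header, rows); every cell stays a raw string (no type inference)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return [], []
    delimiter = sniff_delimiter(lines[0])
    if delimiter is None:
        table = [ln.split() for ln in lines]
    else:
        reader = csv.reader(lines, delimiter=delimiter)
        table = [[cell.strip() for cell in row] for row in reader]
    header, rows = table[0], table[1:]
    # Assumption: a row whose width disagrees with the header is dropped
    # rather than risk shifting values under the wrong field name.
    rows = [row for row in rows if len(row) == len(header)]
    return header, rows


sample = (
    "port_image date time latitude longitude\n"
    "20191009.154056.00082_rect_color.tif 2019/10/09 15:40:56.1122 46.575870 -124.603094\n"
)
header, rows = parse_table(sample)
records = [dict(zip(header, row)) for row in rows]
```

Whitespace splitting only works while no field contains a space, which holds for the sample row shown here (date and time are separate columns).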
Read path (server)
The read path sniffs, dispatches on source, and routes each record to the current frame index:
    def load_frame_metadata_records(folder, user):
        children = multicam_children(folder)  # None for a single dataset
        cameras = {}
        for child in (children or [folder]):
            # Match DIVE's runtime camera key: 'singleCam' for one camera, the camera name for multicam.
            camera = child['name'] if children else 'singleCam'
            media = valid_images(child, user)           # ordered, gives frame index
            media_keys = valid_image_names_dict(media)  # name -> frame
            frames = cameras.setdefault(camera, {})
            for fname, values in read_source(folder, child, media_keys, user):
                frame = media_keys.get(normalize_key(fname))
                if frame is not None:
                    frames[frame] = values  # collision guard omitted for brevity
        return {'cameras': cameras}


    def read_source(folder, child, media_keys, user):
        # v1: file only. Candidates are co-located .txt/.csv; discrimination is in is_frame_metadata.
        for item in candidate_table_items(folder, child):  # parent root plus this child folder
            text = download(item)
            if is_frame_metadata(text, media_keys):  # rejects DIVE formats, requires a filename match
                header, rows = parse_table(text)
                jc = find_join_columns(header, rows, media_keys)[0]
                return ((row[jc], dict(zip(header, row))) for row in rows)
        return ()  # exif / klv adapters slot in here later

Each branch yields (filename-or-frame, fields), and the common loop routes by content to the
current frame index, so reordering media never stales the result.
Per-camera routing and collisions. Routing falls out of the per-child loop: find_join_columns
re-runs per child against that child's media_keys, so each camera self-selects its own filename
column. A stereo file at the dataset root with both port_image and starboard_image therefore
binds each row to both cameras (port matches its column, starboard matches its own), which is what
shared telemetry wants. The one guard: if a (camera, frame) gets two different value sets, it is
skipped rather than guessed (identical values are fine).
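The collision guard that the read-path code omits for brevity could look roughly like the following. This is a sketch under stated assumptions, not the prototype's code: `route_records` is a hypothetical name, and `normalize_key` is assumed to strip the directory and the final extension.

```python
import os


def normalize_key(name):
    # Assumed behavior: drop any directory part and the final extension.
    return os.path.splitext(os.path.basename(name))[0]


def route_records(pairs, media_keys):
    """Map (filename, values) pairs onto frame indices for one camera.

    A frame that receives two different value sets is dropped entirely;
    repeats of identical values are accepted.
    """
    frames = {}
    conflicted = set()
    for fname, values in pairs:
        frame = media_keys.get(normalize_key(fname))
        if frame is None or frame in conflicted:
            continue  # no matching image, or frame already known to be ambiguous
        if frame in frames and frames[frame] != values:
            del frames[frame]  # skip rather than guess which row is right
            conflicted.add(frame)
            continue
        frames[frame] = values
    return frames


media_keys = {"a": 0, "b": 1}
pairs = [
    ("a.tif", {"depth": "1"}),
    ("b.tif", {"depth": "2"}),
    ("b.tif", {"depth": "3"}),  # conflicts with the previous row for frame 1
    ("a.tif", {"depth": "1"}),  # identical repeat, accepted
    ("c.tif", {"depth": "9"}),  # matches no image, dropped
]
routed = route_records(pairs, media_keys)
```

Tracking conflicted frames in a set keeps a later third row from quietly re-populating a frame that was already judged ambiguous.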
This reuses existing DIVE helpers (valid_images, valid_image_names_dict in
server/dive_server/crud.py) and the dataset serve endpoint pattern. The parser helpers
(is_frame_metadata, parse_table, find_join_columns, normalize_key) are new, in a
server/dive_utils/serializers/frame_metadata.py module.
Source discovery and discrimination
A dataset folder already holds other delimited files, so the reader must pick the telemetry file
without grabbing an annotation file by mistake. Candidates are the dataset folder's own .txt /
.csv entries plus the parent root for multicam (on web these are Girder items in the folder; on
desktop they are files in the directory). The dangerous collision is the VIAME annotation CSV,
whose second column is the image identifier, so its values match the media filenames exactly. A
sniffer that only looked for "a column that matches the filenames" would select the annotation file.
is_frame_metadata therefore applies a fail-safe ladder, cheapest first:
- Extension filter. Keep only .txt / .csv. This alone drops every JSON file (meta.json,
multiCam.json, *.dive.json, COCO), calibration files (.npz, .cam, .yml, .zip), and
.pipe pipelines. App-generated lists (*_images.txt, labels.txt, intermediate output CSVs)
are written to a temporary working directory during a job and never land in the dataset folder, so
they are not candidates.
- Reject DIVE's own formats by content. Run the existing VIAME detector and skip anything that
parses as VIAME: its rows are comment-headed (# 1: Detection or Track-id), have at least nine
columns, and begin with an integer track id followed by a float bounding box
(load_csv_as_tracks_and_attributes in server/dive_utils/serializers/viame.py). A telemetry file
has a plain header row of field names and no # comment header, so it passes. Reusing DIVE's
parser as a negative filter is more robust than inventing a header sniff.
- Require a positive filename match. Among survivors, accept only a file with a column whose
values match the media basenames (normalize_key strips extensions because valid_image_names_dict
keys images without them). This is self-selecting: an unrelated readme.txt or notes.csv has no
such column and is ignored without configuration.
- Require a payload, and skip ambiguity. The matched file must carry at least one column beyond
the join column, which rejects a bare image-list file. If two distinct files still value-match, the
reader does not guess; it skips rather than attach the wrong one.
The effect is that the telemetry file needs no special name: the user drops it next to the imagery,
the annotation CSV is removed at rung 2, and whatever positively matches the filenames is the source.
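The four rungs can be sketched end to end as below. Everything here is illustrative: `select_source` and `looks_like_viame` are hypothetical names, the VIAME check is a crude stand-in for DIVE's real parser (it only looks for a leading `#`), the tiny `parse_table` exists only to keep the example self-contained, and treating a column as a join column when any of its values matches a media name is an assumption.

```python
import os


def normalize_key(name):
    return os.path.splitext(os.path.basename(name))[0]


def parse_table(text):
    # Minimal parser for the example: comma-delimited if the header has a comma,
    # otherwise split on whitespace.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return [], []
    if "," in lines[0]:
        table = [[c.strip() for c in ln.split(",")] for ln in lines]
    else:
        table = [ln.split() for ln in lines]
    return table[0], table[1:]


def find_join_columns(header, rows, media_keys):
    """Indices of columns where at least one value matches a media basename."""
    return [
        i for i in range(len(header))
        if any(normalize_key(row[i]) in media_keys for row in rows if i < len(row))
    ]


def looks_like_viame(text):
    # Stand-in only: the proposal reuses DIVE's VIAME parser here.
    return text.lstrip().startswith("#")


def select_source(candidates, media_keys, is_dive_format=looks_like_viame):
    """candidates: list of (name, text). Return the chosen name, or None."""
    matches = {}
    for name, text in candidates:
        if os.path.splitext(name)[1].lower() not in (".txt", ".csv"):
            continue  # rung 1: extension filter
        if is_dive_format(text):
            continue  # rung 2: reject DIVE's own formats
        header, rows = parse_table(text)
        join_cols = find_join_columns(header, rows, media_keys)
        if not join_cols:
            continue  # rung 3: no positive filename match
        if len(header) <= len(join_cols):
            continue  # rung 4: no payload beyond the join column(s)
        matches.setdefault(text, name)
    # Two distinct matching files are ambiguous: skip rather than guess.
    return next(iter(matches.values())) if len(matches) == 1 else None


media_keys = {"img1": 0, "img2": 1}
candidates = [
    ("meta.json", "{}"),
    ("annotations.csv", "# 1: Detection or Track-id\n0,img1.tif,0,1,2,3,4,1.0,-1\n"),
    ("images.txt", "image\nimg1.tif\nimg2.tif\n"),
    ("readme.txt", "hello world\nsome notes here\n"),
    ("nav.txt", "image depth\nimg1.tif 3.2\nimg2.tif 3.4\n"),
]
chosen = select_source(candidates, media_keys)
```

In this example only `nav.txt` survives all four rungs; adding a second, different file that also matches the filenames makes the result `None`.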
Serve contract
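One loadFrameMetadata contract is satisfied by both backends. It takes a frame range (see Scale and
access) and returns the matching records keyed camera -> frame -> values. A frame's entry is just
the values read for it; there is no separate schema or status payload, and the keys may differ from
one frame to the next:

    {
      "cameras": {
        "port": {
          "0": { "date": "2019/10/09", "latitude": "46.575870", "...": "..." },
          "1": { "date": "2019/10/09", "latitude": "46.575912", "...": "..." }
        },
        "starboard": { "0": { "...": "..." } }
      }
    }
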
The camera map sits under a cameras namespace so later top-level keys (for example status) cannot
collide with a camera id. Single-camera datasets use the one key singleCam. Only frames with data
appear, so a windowed read returns just the present frames. The panel renders whatever keys a frame
carries, so the client stays a pass-through viewer.
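One way the windowed read could shape its response is sketched below. `window_response` is a hypothetical helper, and treating the frame range as inclusive at both ends is an assumption; the proposal does not fix that detail.

```python
def window_response(cameras, start, end):
    """Build the serve payload for frames in [start, end], inclusive (assumed).

    `cameras` maps camera -> {frame index (int) -> values}. Frame keys are
    emitted as strings because JSON object keys are always strings.
    """
    return {
        "cameras": {
            camera: {
                str(frame): values
                for frame, values in sorted(frames.items())
                if start <= frame <= end
            }
            for camera, frames in cameras.items()
        }
    }


parsed = {
    "singleCam": {
        0: {"latitude": "46.575870"},
        5: {"latitude": "46.575912"},
        9: {"latitude": "46.575950"},
    }
}
payload = window_response(parsed, 4, 9)
```

Frames with no record simply do not appear in the result, so the client needs no placeholder handling beyond a missing key.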
Client representation
The inferred result lives in plain reactive state, a per-session cache and nothing heavier:
- a ref holding the served window, the cameras object refetched around the playhead as it moves
(see Scale and access). It is already keyed camera -> frame -> values, so it is the cache as-is,
with no re-indexing step;
- a currentRows computed off the playhead frame and the selected camera
(cameras[selectedCamera]?.[frame]), showing the active frame's values in the order they appear,
whatever keys are present.
It is deliberately not an annotation store: we keep it out of the annotation and attribute stores,
which carry edit/save/revision semantics this read-only data does not need.
Why not the attribute system. Attributes look like the natural home, but they require a track
parent: belongs is only track or detection (client/src/use/AttributeTypes.ts), and values
live at track.features[frame].attributes[key]. Telemetry has no track, so this would mean a
fabricated whole-dataset track with an invented box, columns added as schema-polluting attribute
definitions, and an editable save lifecycle, all against a read-only, write-nothing goal.
Scale and access
A dataset can run to hundreds of thousands of frames with tens of fields each. The panel only needs
the current frame, so the client requests a window around the playhead and refetches as the user
scrubs, which bounds client memory however long the dataset is.
The window bounds client memory and transfer, not server work: a text source is not seekable by
frame, so v1 still parses the whole file per request to build the frame->row map. The remaining
concern is therefore server-side and is follow-up work: parse the source once when the dataset
opens and serve windows from the result, rather than re-parsing per request (which only bites a very
long source on a multi-worker web deployment). The fix is a shared cache of the parsed result, and it
changes neither the client nor the serve contract.
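The client-side windowing decision can be sketched as follows. The real client is TypeScript; Python is used here only to match the server code in this proposal. The names `window_for` and `needs_refetch`, and the `radius` and `margin` parameters, are hypothetical.

```python
def window_for(frame, radius, last_frame):
    """Frame range to request around the playhead, clamped to the dataset."""
    return max(0, frame - radius), min(last_frame, frame + radius)


def needs_refetch(frame, window, margin, last_frame):
    """True when the playhead has left the window or is within `margin` of an
    edge that still has frames beyond it."""
    if window is None:
        return True  # nothing loaded yet
    start, end = window
    if frame < start or frame > end:
        return True
    near_start = start > 0 and frame - start < margin
    near_end = end < last_frame and end - frame < margin
    return near_start or near_end


window = window_for(100, radius=50, last_frame=1000)
```

Refetching slightly before the playhead reaches the edge is one way to avoid an empty panel while a request is in flight; client memory stays proportional to `radius` regardless of dataset length.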
Display
At the top of the Dataset Metadata side panel (the current frame number and filename are already shown in the playback
controls via FileNameTimeDisplay, so the panel does not repeat them):
- the active frame's values in the order they arrive, verbatim (pass-through strings, no type
inference), whatever keys are present that frame;
- empty states: platform unsupported, no metadata for the dataset (with a hint to drop a telemetry
file next to the imagery), no metadata for the current frame.
Multicam display selects the active camera's records.
Cross-backend
Web (Python) and desktop (TypeScript) are mirrored implementations governed by shared fixtures (sample
.txt files plus expected parsed output). The desktop resolver re-reads at load and likewise writes
nothing.
Future work
Each item below is out of v1, but the design leaves a clean path to it.
- Selecting a source file from another location. Add an explicit "select telemetry file" action
that records a pointer to the chosen source (a local path on desktop, an item id on web), which the
resolver reads instead of sniffing the dataset folder. This follows an existing DIVE pattern:
calibration is not inferable from folder contents, so the dataset stores a calibrationItemId
pointer in meta.json (MultiCamMetaStorage in server/dive_utils/models.py); an out-of-folder
telemetry source would store an analogous pointer in the same place. The contrast also explains why
v1 needs no pointer: telemetry sitting next to the imagery is inferable, so it is sniffed rather than
referenced. The adapter seam already separates "where the source is" from "how it is parsed," so this
is a new way to locate a source rather than a new pipeline, and it still writes no derived copy (the
pointer references the source, not a parsed duplicate).
- Export to KWCOCO. Serialize the loaded records into an info.dive_frame_metadata block keyed by
file name and advertised in info.dive_extensions. The normalized frame-to-fields table is already
built on load, so export is just serializing what is already in memory.
- Embedded sources (KLV in video, EXIF in images). Add one adapter per source behind the existing
seam. The boundary is already frame-keyed and the read path already dispatches on source kind, so
the panel, serve contract, and client cache are unchanged.
- Video telemetry. A single video has no per-frame filenames, so a video adapter joins on a
frame-index or timestamp column (the filename column survives as a constant video name for
discovery). Timestamp is the safer key, since DIVE transcodes video and can renumber frames. The
preferred long-term path is embedded KLV (above), frame-synced by construction; a text sidecar
joined by timestamp is a second-class option. Either is a new adapter behind the seam, so the panel,
serve contract, and client are unchanged.
- Over-time charts (for example depth or altitude against time). Point a timeline chart at the
same in-memory cache. Because the data is already a frame-keyed reactive store, a chart is just a
second reader of it rather than a new data path.
- Telemetry as a model training input. Emit the values in the KWCOCO export above so an external
trainer that reads COCO can join them to each image by file name. Values are kept as pass-through
strings joinable by file name, so nothing is lost or coerced on the way out.
- Server-side caching for very large datasets. Hold the once-parsed source in a shared cache so
windowed reads never re-parse a large file, which matters mainly on a multi-worker web deployment.
The client and serve contract are already frame-windowed, so this is a server-only change (see Scale
and access).