Stream data table rows into context files and render each context once #20

Open
krlittle wants to merge 2 commits into main from prethink-ExportContext-stream-and-memoize

krlittle commented Jun 1, 2026

Summary

  • ExportContext now renders each context once per cycle and caches the rendered output on its accumulator. generate() and getVisitor() reuse that cached output instead of re-aggregating the same data tables for every visited context file and again in the forced second export cycle. With F context CSVs, reads per table drop from 2 × (F + 2) to 2.
  • Rows now stream straight from the store into the CSV writer one at a time instead of being collected into a List first. Column headers come from DataTable.getType(), so no rows are needed up front.
  • Output is byte-identical: only the redundant re-reads and the per-render row buffering are removed. The cached value is the finished output string (which becomes the generated PlainText and must exist anyway), never the table's rows.
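
The render-once caching described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: RenderCache, getOrRender, and startCycle are hypothetical names, and the real accumulator in ExportContext may be shaped differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the recipe's accumulator (not the real ExportContext type).
class RenderCache {
    // Finished output (CSV or markdown) keyed by context name.
    private final Map<String, String> renderedByContext = new LinkedHashMap<>();
    private int renderCount = 0;

    // Returns the cached output, invoking the renderer only on the first request per cycle.
    String getOrRender(String contextName, Function<String, String> renderer) {
        return renderedByContext.computeIfAbsent(contextName, name -> {
            renderCount++;
            return renderer.apply(name);
        });
    }

    // Drops cached output so the next cycle renders again.
    void startCycle() {
        renderedByContext.clear();
    }

    // Number of times a renderer actually ran, across all cycles.
    int renderCount() {
        return renderCount;
    }
}
```

In this sketch, any number of getOrRender calls for the same context within a cycle runs the renderer once; calling startCycle() before the second cycle allows exactly one more render.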

Problem

ExportContext reads its referenced data tables back to write .moderne/context/*.csv. It re-read and re-rendered each table once in generate(), once for every context CSV visited in getVisitor(), and again in the forced second cycle: roughly 2 × (F + 2) full reads per table, or about 42 for a repo emitting 19 context CSVs. Each read materialized the whole table into a List (via aggregateMatchingTables) and built the entire CSV as one String. For large repositories (one row per method or class), both CPU time and the export phase's peak memory scale with table size, and that cost was paid dozens of times per repo.
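
As a quick check of the read-count estimate, the arithmetic can be written out as below. The class and method names are illustrative only; the formula and the value F = 19 are taken from the description above.

```java
// Worked arithmetic for the read-count estimate; names here are illustrative.
class ReadEstimate {
    // Reads per table before the change, using the estimate 2 × (F + 2).
    static int readsBefore(int contextCsvCount) {
        return 2 * (contextCsvCount + 2);
    }

    // Reads per table after the change: one per export cycle, two cycles.
    static int readsAfter() {
        return 2;
    }

    public static void main(String[] args) {
        int f = 19; // the example repo from the description
        System.out.println(readsBefore(f));                // prints 42
        System.out.println(readsAfter());                  // prints 2
        System.out.println(readsBefore(f) / readsAfter()); // prints 21
    }
}
```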

Solution

  • Render each context a single time per cycle, cache the resulting CSV/markdown on the accumulator, and have generate()/getVisitor() read from that cache.
  • Build each CSV by streaming getRows(...) directly into the writer one row at a time rather than buffering a List.
  • Pairs with rewrite#7858 ("Stream CsvDataTableStore.getRows from disk lazily instead of buffering the whole table"), which makes CsvDataTableStore.getRows itself parse lazily. Once that is released, the store side streams too and a whole table is never held in memory at any layer.
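
The row-at-a-time CSV writing can be sketched as below, under stated assumptions: rows arrive as a Stream of string lists, cells are non-null, and quoting follows the common double-quote convention. StreamingCsvSketch and its methods are hypothetical and do not reproduce the writer or escaping rules used in this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: writes a header line, then each row as it is pulled from the stream.
class StreamingCsvSketch {
    static void write(List<String> headers, Stream<List<String>> rows, Writer out) {
        // Close the stream when done in case it is backed by a file or other resource.
        try (Stream<List<String>> closing = rows) {
            writeLine(headers, out);
            Iterator<List<String>> it = closing.iterator();
            while (it.hasNext()) {
                // Only the current row is held here; no List of all rows is built.
                writeLine(it.next(), out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeLine(List<String> cells, Writer out) throws IOException {
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                out.write(',');
            }
            out.write(escape(cells.get(i)));
        }
        out.write('\n');
    }

    // Quotes a cell containing a comma, quote, or line break; embedded quotes are doubled.
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }
}
```

Because the headers are passed separately, the header line can be written before any row is read, which matches the point that column headers do not depend on the rows.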

Test plan

  • Existing ExportContextTest passes unchanged, including the exact CSV-row + markdown content assertions in aggregatesRowsFromMultipleInstancesOfSameDataTable and the cycle-trigger regression — confirming byte-identical output
  • Adds aggregatesEachReferencedTableExactlyOncePerRun, which installs a counting DataTableStore and asserts each referenced table is read once per export cycle (2 total), not once per visited context file
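
A counting store of the kind the new test describes might look roughly like this. RowSource is a hypothetical one-method interface standing in for the store; the real DataTableStore API and the actual test code are not reproduced here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical minimal store interface (not the real DataTableStore).
interface RowSource {
    Stream<List<String>> getRows(String tableName);
}

// Decorator that records how many times each table is read, then delegates.
class CountingRowSource implements RowSource {
    private final RowSource delegate;
    private final Map<String, Integer> reads = new HashMap<>();

    CountingRowSource(RowSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public Stream<List<String>> getRows(String tableName) {
        reads.merge(tableName, 1, Integer::sum);
        return delegate.getRows(tableName);
    }

    // Number of getRows calls seen for the given table; 0 if it was never read.
    int readsOf(String tableName) {
        return reads.getOrDefault(tableName, 0);
    }
}
```

A test built on this idea would run the export through the counting wrapper and assert that readsOf(...) equals 2 for each referenced table, regardless of how many context files were visited.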
