Skip to content

Stream CsvDataTableStore.getRows from disk lazily instead of buffering the whole table#7858

Open
krlittle wants to merge 1 commit into
mainfrom
lazy-csv-datatable-getrows
Open

Stream CsvDataTableStore.getRows from disk lazily instead of buffering the whole table#7858
krlittle wants to merge 1 commit into
mainfrom
lazy-csv-datatable-getrows

Conversation

@krlittle
Copy link
Copy Markdown
Contributor

@krlittle krlittle commented Jun 1, 2026

What's changed?

CsvDataTableStore.getRows(...) read every matching CSV file fully into a List<Object> and returned that list's stream, so reading back a large data table held all of its rows in memory at once. It now selects the matching file(s) by their @name/@group comment header and parses each file lazily, one row at a time via the Univocity parser, behind a closeable Stream. Rows are produced on demand, so a whole table is never materialized in memory.

The change is intended to be behavior-preserving:

  • same row order, prefix/suffix column stripping, group / @name matching, and typed-row (vs raw String[]) results;
  • the writer is still closed eagerly at the getRows call, so a mid-run read still finalizes the file — only row parsing is deferred;
  • error handling is unchanged: open/parse failures (which are unchecked) propagate, and only a stream-close IOException is swallowed.

What's your motivation?

Recipes that read their own data tables back — e.g. to export or aggregate them — can produce very large tables (one row per method/class across a large repository). Buffering the entire table into a List before the consumer sees a single row makes peak memory scale with table size; streaming bounds the store side to one row at a time.

Checklist

  • I've added unit tests to cover both positive and negative cases
  • I've read and applied the recipe conventions and best practices
  • I've used the IntelliJ IDEA auto-formatter on affected files

* Rows are produced one at a time, so a whole table's rows are never held in
* memory at once.
*/
private Stream<Object> streamRows(Path path, @Nullable RowMetadata meta, int prefixCount, int suffixCount) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since CsvDataTableStore is used directly in the CLI you'll have to do a release which includes this before this will work at runtime. Recipe modules built with this running on un-updated CLI may encounter NoSuchMethodException here... unless you've verified that isn't true, you might want to guard invocation of this method with reflection until CLI with this method have been released for a little while

@github-project-automation github-project-automation Bot moved this from In Progress to Ready to Review in OpenRewrite Jun 1, 2026
Copy link
Copy Markdown
Contributor

@natedanner natedanner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lazy streaming looks right and the happy path is fine (both RecipeRun callers collect(), fully draining). Two things worth addressing:

  1. The returned stream now holds an open file handle until it's drained/closed. flatMap closes an inner stream only once drained, and the outer stream's close() doesn't reach a still-open inner stream — so a short-circuiting caller (findFirst/limit) leaks a handle even with try-with-resources. InMemoryDataTableStore returns a detached stream, so callers can't assume either way. Worth documenting on DataTableStore.getRows that the stream must be fully consumed/closed.
  2. No tests for the new semantics; existing ones all fully drain, so a short-circuit leak wouldn't be caught.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Ready to Review

Development

Successfully merging this pull request may close these issues.

3 participants