Stream CsvDataTableStore.getRows from disk lazily instead of buffering the whole table by krlittle · Pull Request #7858 · openrewrite/rewrite

krlittle · 2026-06-01T19:28:09Z

What's changed?

CsvDataTableStore.getRows(...) read every matching CSV file fully into a List<Object> and returned that list's stream, so reading back a large data table held all of its rows in memory at once. It now selects the matching file(s) by their @name/@group comment header and parses each file lazily, one row at a time via the Univocity parser, behind a closeable Stream. Rows are produced on demand, so a whole table is never materialized in memory.
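As a rough illustration of the shape described above, here is a minimal sketch of a lazily parsed, closeable row stream. It is not the code in this PR: it uses a plain `BufferedReader` and a naive comma split instead of the Univocity parser, and the `#`-prefixed comment header, the single column-header row, and the names `LazyCsvRows` and `streamRows` are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class LazyCsvRows {

    // Hypothetical helper: streams the data rows of one CSV file on demand.
    // Assumes comment lines start with '#' and one column-header row follows them.
    static Stream<String[]> streamRows(Path csv) {
        BufferedReader reader;
        try {
            reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // open failures propagate unchecked
        }
        return reader.lines()                       // lazy: reads a line only when asked
                .filter(line -> !line.startsWith("#"))
                .skip(1)                            // drop the column-header row
                .map(line -> line.split(",", -1))   // naive split, no quoting support
                .onClose(() -> {
                    try {
                        reader.close();
                    } catch (IOException ignored) {
                        // only a failure while closing is swallowed
                    }
                });
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("rows", ".csv");
        Files.write(csv, Arrays.asList("# @name demo.Table", "# @group demo", "a,b", "1,2", "3,4"));
        // try-with-resources releases the file handle even if the consumer stops early
        try (Stream<String[]> rows = streamRows(csv)) {
            rows.forEach(r -> System.out.println(String.join("|", r)));
        } finally {
            Files.delete(csv);
        }
    }
}
```

The reader is opened eagerly and only the parsing is deferred, which mirrors the split between "open now, parse later" described above. The cost is that the caller becomes responsible for closing the stream.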

The change is intended to be behavior-preserving:

same row order, prefix/suffix column stripping, group / @name matching, and typed-row (vs raw String[]) results;
the writer is still closed eagerly at the getRows call, so a mid-run read still finalizes the file — only row parsing is deferred;
error handling is unchanged: open/parse failures (which are unchecked) propagate, and only a stream-close IOException is swallowed.

What's your motivation?

Recipes that read their own data tables back — e.g. to export or aggregate them — can produce very large tables (one row per method/class across a large repository). Buffering the entire table into a List before the consumer sees a single row makes peak memory scale with table size; streaming bounds the store side to one row at a time.

Checklist

I've added unit tests to cover both positive and negative cases
I've read and applied the recipe conventions and best practices
I've used the IntelliJ IDEA auto-formatter on affected files


sambsnyd · 2026-06-01T19:52:47Z

+     * Rows are produced one at a time, so a whole table's rows are never held in
+     * memory at once.
+     */
+    private Stream<Object> streamRows(Path path, @Nullable RowMetadata meta, int prefixCount, int suffixCount) {


Since CsvDataTableStore is used directly in the CLI, you'll have to do a release that includes this before it will work at runtime. Recipe modules built against this and run on an un-updated CLI may encounter NoSuchMethodException here. Unless you've verified that isn't true, you might want to guard invocation of this method with reflection until a CLI release that includes it has been out for a little while.
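A reflective guard of the kind suggested here could look roughly like the sketch below. Everything in it is hypothetical: `OldStore` and `NewStore` stand in for an older and a newer runtime version of a class, and `rows` / `rowsLazily` are invented method names, not the OpenRewrite API.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

public class ReflectiveGuard {

    // Hypothetical stand-in for the class as shipped in an older runtime.
    public static class OldStore {
        public List<String> rows() {
            return Arrays.asList("r1", "r2");
        }
    }

    // Hypothetical stand-in for the class after the new method was added.
    public static class NewStore extends OldStore {
        public List<String> rowsLazily() {
            return Arrays.asList("lazy-r1", "lazy-r2");
        }
    }

    // Prefer the new method when the runtime class has it, else fall back.
    @SuppressWarnings("unchecked")
    static List<String> readRows(OldStore store) {
        try {
            Method m = store.getClass().getMethod("rowsLazily");
            return (List<String>) m.invoke(store);
        } catch (NoSuchMethodException e) {
            return store.rows(); // older runtime: method not present
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readRows(new OldStore()));
        System.out.println(readRows(new NewStore()));
    }
}
```

Looking the method up by name keeps the caller's bytecode free of a direct reference to the new method, so the caller still links against a runtime that predates it.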

natedanner

Lazy streaming looks right and the happy path is fine (both RecipeRun callers collect(), fully draining). Two things worth addressing:

The returned stream now holds an open file handle until it's drained/closed. flatMap closes an inner stream only once drained, and the outer stream's close() doesn't reach a still-open inner stream — so a short-circuiting caller (findFirst/limit) leaks a handle even with try-with-resources. InMemoryDataTableStore returns a detached stream, so callers can't assume either way. Worth documenting on DataTableStore.getRows that the stream must be fully consumed/closed.
No tests for the new semantics; existing ones all fully drain, so a short-circuit leak wouldn't be caught.
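One possible shape for a test of the short-circuit case is sketched below, under stated assumptions. `ShortCircuitClose` and `openRows` are hypothetical, the flag stands in for a real file handle, and the sketch uses a single resource-backed stream rather than the flatMap composition discussed above, so it only shows how a close handler can be observed and does not reproduce or confirm the flatMap behavior.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

public class ShortCircuitClose {

    // Hypothetical resource-backed stream: the flag flips when the stream is closed,
    // standing in for releasing a file handle.
    static Stream<String> openRows(AtomicBoolean closed) {
        BufferedReader reader = new BufferedReader(new StringReader("row1\nrow2\nrow3"));
        return reader.lines().onClose(() -> closed.set(true));
    }

    // A short-circuiting terminal operation by itself does not run close handlers.
    static boolean closedAfterFindFirstOnly() {
        AtomicBoolean closed = new AtomicBoolean();
        Stream<String> rows = openRows(closed);
        rows.findFirst();
        return closed.get();
    }

    // try-with-resources calls close(), which runs the handler.
    static boolean closedAfterTryWithResources() {
        AtomicBoolean closed = new AtomicBoolean();
        try (Stream<String> rows = openRows(closed)) {
            rows.findFirst();
        }
        return closed.get();
    }

    public static void main(String[] args) {
        System.out.println(closedAfterFindFirstOnly());
        System.out.println(closedAfterTryWithResources());
    }
}
```

A test against the real store would need to observe the actual file handle, for example by checking that the file can be deleted or rewritten after the stream is closed.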

Stream CsvDataTableStore.getRows from disk lazily instead of bufferin…

4952e61

…g the whole table

github-project-automation Bot added this to OpenRewrite Jun 1, 2026

github-project-automation Bot moved this to In Progress in OpenRewrite Jun 1, 2026

moderne-meeseeks Bot assigned krlittle Jun 1, 2026

krlittle mentioned this pull request Jun 1, 2026

Stream data table rows into context files and render each context once openrewrite/rewrite-prethink#20

Open

2 tasks

sambsnyd approved these changes Jun 1, 2026


github-project-automation Bot moved this from In Progress to Ready to Review in OpenRewrite Jun 1, 2026

natedanner reviewed Jun 1, 2026


krlittle mentioned this pull request Jun 1, 2026

Render each context once per cycle instead of re-reading per visited file openrewrite/rewrite-prethink#21

Draft

2 tasks

krlittle wants to merge 1 commit into main from lazy-csv-datatable-getrows


3 participants
