refactor: abstract CustomizableDataflow into pandas/polars subclasses by paddymul · Pull Request #548 · buckaroo-data/buckaroo

paddymul · 2026-02-23T12:18:34Z

Summary

Makes CustomizableDataflow an abstract base class with 4 abstract methods: _compute_processed_result, _build_error_dataframe, _get_summary_sd, _df_to_obj
Creates PandasCustomizableDataflow (new file) and PolarsCustomizableDataflow (new file) as concrete implementations
Widgets select their backend via a dataflow_klass class attribute instead of inheriting from the pandas widget
Removes pandas imports from dataflow.py and dataflow_extras.py base classes

Motivation

This decouples the dataflow pipeline from pandas, enabling pandas to become an optional dependency. Previously PolarsBuckarooWidget inherited from the pandas widget and overrode methods — now each backend has its own clean dataflow subclass.

Files changed

File	Change
`buckaroo/dataflow/dataflow.py`	Make `CustomizableDataflow` abstract, remove pandas imports
`buckaroo/dataflow/dataflow_extras.py`	Remove pandas imports, duck-type `sort_index` check
`buckaroo/dataflow/pandas_dataflow.py`	New: `PandasCustomizableDataflow`
`buckaroo/dataflow/polars_dataflow.py`	New: `PolarsCustomizableDataflow`
`buckaroo/buckaroo_widget.py`	Add `dataflow_klass`, use it in `InnerDataFlow`
`buckaroo/polars_buckaroo.py`	Set `dataflow_klass = PolarsCustomizableDataflow`
`buckaroo/server/data_loading.py`	Use `PandasCustomizableDataflow`
`tests/unit/dataflow/*.py`	Update imports to use `PandasCustomizableDataflow`

Test plan

All 586 unit tests pass locally (5 skipped)
CI passes

🤖 Generated with Claude Code

github-actions · 2026-02-23T12:20:35Z

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22353825774

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22353825774

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.9.dev22353825774" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

paddymul · 2026-02-23T14:30:03Z

Learnings & Self-Review

Was it an improvement?

Structurally, yes. The core win is that dataflow.py and dataflow_extras.py no longer import pandas. This is the prerequisite for making pandas an optional dependency. Before this PR, PolarsBuckarooWidget inherited from BuckarooWidget (a pandas widget) and monkey-patched methods like _df_to_obj and _build_error_dataframe. Now each backend has a clean subclass.

The dataflow_klass attribute pattern is cleaner than the old InnerDataFlow approach that delegated _df_to_obj back to the widget via closure — that mixed widget concerns into the dataflow layer.

However, there's significant code duplication. _compute_processed_result and _get_summary_sd are identical between PandasCustomizableDataflow and PolarsCustomizableDataflow. Only _build_error_dataframe (pd.DataFrame vs pl.DataFrame) and _df_to_obj (direct vs .to_pandas()) actually differ. A follow-up could keep only those two as abstract and hoist the shared logic back to the base class.

Performance implications around serialization

No change for pandas users. The _df_to_obj path is identical — pd_to_obj(self.sampling_klass.serialize_sample(df)).

The polars path is unchanged too — PolarsCustomizableDataflow._df_to_obj does the same df.to_pandas() conversion that PolarsBuckarooWidget._df_to_obj did before. The infinite scroll path (_handle_payload_args) writes polars directly to parquet and was not touched.

One subtle behavioral change in dataflow_extras.py: The base Sampling.pre_stats_sample used to call sampled.sort_index() on pandas DataFrames after sampling. That was removed from the base class (no pandas import), but PdSampling in buckaroo_widget.py already overrides pre_stats_sample with its own sort_index() call, so pandas widget path is unaffected. serialize_sample was changed from isinstance(sampled, pd.DataFrame) to hasattr(sampled, 'sort_index') — duck typing that works without importing pandas.

What this PR doesn't do yet

Doesn't make pandas optional at import time — serialization_utils.py and many other files still import pandas at module level
PolarsCustomizableDataflow still imports pandas (for the isinstance check in _df_to_obj)
autocleaning.py still defines SENTINEL_DF_1/SENTINEL_DF_2 with pandas — used in production code
_compute_processed_result and _get_summary_sd are copy-pasted between both subclasses — follow-up should hoist shared logic

Make CustomizableDataflow an abstract base class with concrete PandasCustomizableDataflow and PolarsCustomizableDataflow subclasses. Widgets select their backend via a dataflow_klass attribute. This decouples pandas from the core dataflow module, enabling pandas to become an optional dependency in the future. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

paddymul · 2026-02-24T13:54:42Z

@codex

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02ecf492b0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-24T13:58:14Z

buckaroo/buckaroo_widget.py

        self.exception = None
        kls = self.__class__
-        class InnerDataFlow(CustomizableDataflow):
+        class InnerDataFlow(kls.dataflow_klass):


Route _df_to_obj through widget overrides

This refactor removed the InnerDataFlow._df_to_obj bridge, so _df_to_obj implementations on widget subclasses are no longer used during serialization. That breaks built-in override paths like GeopandasBase._df_to_obj (buckaroo/geopandas_buckaroo.py), which exists to coerce geopandas frames before pd_to_obj; after this change, serialization falls back to PandasCustomizableDataflow._df_to_obj and bypasses that conversion, causing geopandas widget payload generation to fail or produce incorrect data.

Useful? React with 👍 / 👎.

paddymul temporarily deployed to testpypi February 23, 2026 12:19 — with GitHub Actions Inactive

paddymul temporarily deployed to testpypi February 23, 2026 16:07 — with GitHub Actions Inactive

paddymul temporarily deployed to testpypi February 23, 2026 16:32 — with GitHub Actions Inactive

paddymul force-pushed the refactor/abstract-dataflow branch from cd36705 to 1360e5e Compare February 24, 2026 03:51

paddymul temporarily deployed to testpypi February 24, 2026 03:52 — with GitHub Actions Inactive

paddymul force-pushed the refactor/abstract-dataflow branch from 1360e5e to 2db1042 Compare February 24, 2026 04:01

paddymul temporarily deployed to testpypi February 24, 2026 04:02 — with GitHub Actions Inactive

simple commit so openAI does a review

02ecf49

paddymul temporarily deployed to testpypi February 24, 2026 13:54 — with GitHub Actions Inactive

chatgpt-codex-connector bot reviewed Feb 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

refactor: abstract CustomizableDataflow into pandas/polars subclasses#548

refactor: abstract CustomizableDataflow into pandas/polars subclasses#548
paddymul wants to merge 2 commits intomainfrom
refactor/abstract-dataflow

paddymul commented Feb 23, 2026

Uh oh!

github-actions bot commented Feb 23, 2026 •

edited

Loading

Uh oh!

paddymul commented Feb 23, 2026

Uh oh!

paddymul commented Feb 24, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

paddymul commented Feb 23, 2026

Summary

Motivation

Files changed

Test plan

Uh oh!

github-actions bot commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 TestPyPI package published

MCP server for Claude Code

Uh oh!

paddymul commented Feb 23, 2026

Learnings & Self-Review

Was it an improvement?

Performance implications around serialization

What this PR doesn't do yet

Uh oh!

paddymul commented Feb 24, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions bot commented Feb 23, 2026 •

edited

Loading