Skip to content

Comments

refactor: abstract CustomizableDataflow into pandas/polars subclasses#548

Open
paddymul wants to merge 2 commits intomainfrom
refactor/abstract-dataflow
Open

refactor: abstract CustomizableDataflow into pandas/polars subclasses#548
paddymul wants to merge 2 commits intomainfrom
refactor/abstract-dataflow

Conversation

@paddymul
Copy link
Collaborator

Summary

  • Makes CustomizableDataflow an abstract base class with 4 abstract methods: _compute_processed_result, _build_error_dataframe, _get_summary_sd, _df_to_obj
  • Creates PandasCustomizableDataflow (new file) and PolarsCustomizableDataflow (new file) as concrete implementations
  • Widgets select their backend via a dataflow_klass class attribute instead of inheriting from the pandas widget
  • Removes pandas imports from dataflow.py and dataflow_extras.py base classes

Motivation

This decouples the dataflow pipeline from pandas, enabling pandas to become an optional dependency. Previously PolarsBuckarooWidget inherited from the pandas widget and overrode methods — now each backend has its own clean dataflow subclass.

Files changed

File Change
buckaroo/dataflow/dataflow.py Make CustomizableDataflow abstract, remove pandas imports
buckaroo/dataflow/dataflow_extras.py Remove pandas imports, duck-type sort_index check
buckaroo/dataflow/pandas_dataflow.py New: PandasCustomizableDataflow
buckaroo/dataflow/polars_dataflow.py New: PolarsCustomizableDataflow
buckaroo/buckaroo_widget.py Add dataflow_klass, use it in InnerDataFlow
buckaroo/polars_buckaroo.py Set dataflow_klass = PolarsCustomizableDataflow
buckaroo/server/data_loading.py Use PandasCustomizableDataflow
tests/unit/dataflow/*.py Update imports to use PandasCustomizableDataflow

Test plan

  • All 586 unit tests pass locally (5 skipped)
  • CI passes

🤖 Generated with Claude Code

@github-actions
Copy link

github-actions bot commented Feb 23, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22353825774

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22353825774

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.9.dev22353825774" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

@paddymul
Copy link
Collaborator Author

Learnings & Self-Review

Was it an improvement?

Structurally, yes. The core win is that dataflow.py and dataflow_extras.py no longer import pandas. This is the prerequisite for making pandas an optional dependency. Before this PR, PolarsBuckarooWidget inherited from BuckarooWidget (a pandas widget) and monkey-patched methods like _df_to_obj and _build_error_dataframe. Now each backend has a clean subclass.

The dataflow_klass attribute pattern is cleaner than the old InnerDataFlow approach that delegated _df_to_obj back to the widget via closure — that mixed widget concerns into the dataflow layer.

However, there's significant code duplication. _compute_processed_result and _get_summary_sd are identical between PandasCustomizableDataflow and PolarsCustomizableDataflow. Only _build_error_dataframe (pd.DataFrame vs pl.DataFrame) and _df_to_obj (direct vs .to_pandas()) actually differ. A follow-up could keep only those two as abstract and hoist the shared logic back to the base class.

Performance implications around serialization

No change for pandas users. The _df_to_obj path is identical — pd_to_obj(self.sampling_klass.serialize_sample(df)).

The polars path is unchanged tooPolarsCustomizableDataflow._df_to_obj does the same df.to_pandas() conversion that PolarsBuckarooWidget._df_to_obj did before. The infinite scroll path (_handle_payload_args) writes polars directly to parquet and was not touched.

One subtle behavioral change in dataflow_extras.py: The base Sampling.pre_stats_sample used to call sampled.sort_index() on pandas DataFrames after sampling. That was removed from the base class (no pandas import), but PdSampling in buckaroo_widget.py already overrides pre_stats_sample with its own sort_index() call, so pandas widget path is unaffected. serialize_sample was changed from isinstance(sampled, pd.DataFrame) to hasattr(sampled, 'sort_index') — duck typing that works without importing pandas.

What this PR doesn't do yet

  • Doesn't make pandas optional at import time — serialization_utils.py and many other files still import pandas at module level
  • PolarsCustomizableDataflow still imports pandas (for the isinstance check in _df_to_obj)
  • autocleaning.py still defines SENTINEL_DF_1/SENTINEL_DF_2 with pandas — used in production code
  • _compute_processed_result and _get_summary_sd are copy-pasted between both subclasses — follow-up should hoist shared logic

Make CustomizableDataflow an abstract base class with concrete
PandasCustomizableDataflow and PolarsCustomizableDataflow subclasses.
Widgets select their backend via a dataflow_klass attribute.

This decouples pandas from the core dataflow module, enabling
pandas to become an optional dependency in the future.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@paddymul
Copy link
Collaborator Author

@codex

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02ecf492b0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

self.exception = None
kls = self.__class__
class InnerDataFlow(CustomizableDataflow):
class InnerDataFlow(kls.dataflow_klass):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Route _df_to_obj through widget overrides

This refactor removed the InnerDataFlow._df_to_obj bridge, so _df_to_obj implementations on widget subclasses are no longer used during serialization. That breaks built-in override paths like GeopandasBase._df_to_obj (buckaroo/geopandas_buckaroo.py), which exists to coerce geopandas frames before pd_to_obj; after this change, serialization falls back to PandasCustomizableDataflow._df_to_obj and bypasses that conversion, causing geopandas widget payload generation to fail or produce incorrect data.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant