Implement DataFrame.count(axis=1) on the GPU #23016
DataFrame.count(axis=1) on the GPU
📝 Walkthrough (summary by CodeRabbit)
Changes: Row-wise count behavior
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
python/cudf/cudf/tests/dataframe/methods/test_reductions.py (1)
252-268: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Extend `count(axis=1)` coverage to the dtypes the PR claims to support.
The PR states `count(axis=1)` matches pandas across datetime `NaT`, pandas nullable, and Arrow-backed dtypes, but the parametrization only exercises numpy columns with `None`/`np.nan`. Consider adding cases for a datetime column containing `NaT`, a pandas nullable/Arrow-backed column with `<NA>`, and an empty (0-row) frame to guard the row-wise path and the `numeric_only` filtering against regressions.
As per path instructions: "Ensure test files provide comprehensive edge case coverage (empty, all-null, single-element, mixed types)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@python/cudf/cudf/tests/dataframe/methods/test_reductions.py` around lines 252 - 268: Extend `test_dataframe_count_axis1` to cover the dtypes and edge cases claimed by the PR: add parametrized inputs for a datetime column with `NaT`, a pandas nullable/Arrow-backed column with `<NA>`, and an empty 0-row DataFrame. Update the expected comparisons against pandas `DataFrame.count(axis=1, numeric_only=...)` so the row-wise path and `numeric_only` filtering are validated for these cases without changing the existing `assert_eq` flow.
Source: Path instructions
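As a rough illustration of the edge cases the review asks for, the sketch below builds a few pandas frames and records the pandas result that a GPU implementation would be compared against. It is not the PR's test: the `EDGE_CASES` name, the frame contents, and the `results` dictionary are hypothetical, and only pandas is used so the snippet runs without a GPU. An Arrow-backed case would presumably follow the same pattern (for example a column created with `dtype="int64[pyarrow]"`), but it is left out here because it requires pyarrow.

```python
import pandas as pd

# Hypothetical edge-case inputs; names and values are illustrative only.
EDGE_CASES = {
    # Datetime column containing NaT next to a plain integer column.
    "datetime_nat": pd.DataFrame(
        {"t": pd.to_datetime(["2022-01-01", None]), "x": [1, 2]}
    ),
    # pandas nullable columns containing <NA>.
    "nullable_na": pd.DataFrame(
        {
            "i": pd.array([1, None], dtype="Int64"),
            "f": pd.array([1.5, None], dtype="Float64"),
        }
    ),
    # Empty (0-row) frame with one column.
    "empty": pd.DataFrame({"a": pd.Series([], dtype="float64")}),
}

# pandas reference results for both numeric_only settings.
results = {}
for name, pdf in EDGE_CASES.items():
    for numeric_only in (False, True):
        results[(name, numeric_only)] = pdf.count(axis=1, numeric_only=numeric_only)
        # In the real test, the GPU result would be compared here, e.g.
        # (assumption, not run in this sketch):
        #   assert_eq(cudf.from_pandas(pdf).count(axis=1, numeric_only=numeric_only),
        #             results[(name, numeric_only)])

for key, value in results.items():
    print(key, value.tolist(), value.dtype)
```

Keeping the inputs in one mapping makes it easy to feed them to an existing parametrized test without changing how the comparison itself is written.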
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: abfa9332-f852-4f8b-ad9c-9d5619f9eaee
📒 Files selected for processing (2)
python/cudf/cudf/core/dataframe.py
python/cudf/cudf/tests/dataframe/methods/test_reductions.py
Description
`DataFrame.count(axis=1)` previously raised `NotImplementedError("Only axis=0 is currently supported")`. Under `cudf.pandas` this triggered a CPU fallback that copies the entire DataFrame from device to host before running the reduction in pandas, which is very expensive for wide or string-heavy frames.
This PR implements `count(axis=1)` directly on the GPU as the row-wise sum of each column's validity (`Σ col.notnull()`), returning an `int64` `Series` indexed by the frame's row index. `numeric_only` is now also honored for both axes (it was previously silently ignored).
Behavior
- Matches pandas across pandas nullable (`Int64`/`boolean`/`Float64`/`StringDtype`), Arrow-backed (`*[pyarrow]`), and mixed-dtype frames, including `NaN`/`<NA>`/`NaT` (counted as missing) and `numeric_only=True`.
- The result dtype is `int64`, matching pandas.
Benchmark
NYC parking violations 2022 (~15.4M rows; 7 string + 3 int + 1 datetime columns), `df.count(axis=1)`, compared in two configurations:
- `cudf.pandas` before (CPU fallback)
- `cudf.pandas` after (this PR, on GPU)
The whole-frame device→host copy is eliminated; the reduction now stays on the GPU.
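To make the validity-sum formulation from the description concrete, here is a minimal CPU-only sketch in pandas. It is not the cudf implementation: the `rowwise_count` helper is hypothetical, the `select_dtypes` call only approximates pandas' `numeric_only` filtering, and the example frame is made up. It shows the idea of adding one 0/1 validity mask per column and keeping the frame's row index.

```python
import numpy as np
import pandas as pd


def rowwise_count(df: pd.DataFrame, numeric_only: bool = False) -> pd.Series:
    """Hypothetical helper: per-row count of non-missing values as a sum of validity masks."""
    if numeric_only:
        # Approximation of pandas' numeric_only filter: keep numeric and bool columns.
        df = df.select_dtypes(include=["number", "bool"])
    # Start from zeros so a frame with no columns still yields one int64 entry per row.
    total = pd.Series(np.zeros(len(df), dtype="int64"), index=df.index)
    for _, col in df.items():
        # notna() is False for None, NaN, <NA>, and NaT; add it as 0/1.
        total += col.notna().to_numpy().astype("int64")
    return total


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": ["x", None, None],
        "c": pd.to_datetime(["2022-01-01", None, "2022-01-03"]),
    },
    index=[10, 20, 30],
)

counts = rowwise_count(df)
numeric_counts = rowwise_count(df, numeric_only=True)
print(counts.tolist())          # per-row non-missing counts over all columns
print(numeric_counts.tolist())  # same, restricted to the numeric column "a"
```

Because each column contributes an independent mask, the reduction never needs the column values themselves, which is consistent with the description's point that the data can stay on the device.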
Reproducer
Checklist