Implement DataFrame.count(axis=1) on the GPU#23016

Open
galipremsagar wants to merge 1 commit into rapidsai:main from galipremsagar:count

@galipremsagar galipremsagar commented Jun 27, 2026

Contributor

Description

DataFrame.count(axis=1) previously raised NotImplementedError ("Only axis=0 is currently supported"). Under cudf.pandas, this triggered a CPU fallback that copied the entire DataFrame from device to host before running the reduction in pandas, which is very expensive for wide or string-heavy frames.

This PR implements count(axis=1) directly on the GPU as the row-wise sum of each column's validity (Σ col.notnull()), returning an int64 Series indexed by the frame's row index. numeric_only is now also honored for both axes (it was previously silently ignored).
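The "row-wise sum of each column's validity" idea can be sketched in plain pandas, so it runs without a GPU. This is only an illustration of the technique, not the code added to cudf; `rowwise_count` is a hypothetical helper name, and the sample frame is made up.

```python
import numpy as np
import pandas as pd


def rowwise_count(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: per-row count of non-missing values."""
    # Start from zeros so a frame without columns still yields one entry per row.
    total = pd.Series(np.zeros(len(df), dtype="int64"), index=df.index)
    # DataFrame.items() yields (name, column) pairs, one per column.
    for _, column in df.items():
        # notna() is the validity mask; casting to int64 lets the masks be summed.
        total = total + column.notna().astype("int64")
    return total


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": ["x", None, None],
        "c": np.array(["2022-01-01", "NaT", "2022-01-03"], dtype="datetime64[ns]"),
    },
    index=[10, 20, 30],
)
result = rowwise_count(df)
reference = df.count(axis=1)
print(result.tolist(), reference.tolist())
```

The accumulator keeps the frame's row index and an int64 dtype, which is the result shape the description promises.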

Behavior

  • Matches pandas across numpy, pandas nullable (Int64 / boolean / Float64 / StringDtype), Arrow-backed (*[pyarrow]), and mixed-dtype frames, including NaN / <NA> / NaT (counted as missing) and numeric_only=True.
  • Result dtype is int64, matching pandas.
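For reference, the pandas behavior that these bullets target can be checked with a small frame mixing numpy and pandas nullable columns. The sketch below uses plain pandas only (Arrow-backed columns are omitted so that it does not depend on pyarrow), and the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "i": pd.array([1, None, 3], dtype="Int64"),          # <NA> in row 1
        "f": pd.array([1.5, None, None], dtype="Float64"),   # <NA> in rows 1 and 2
        "b": pd.array([True, False, None], dtype="boolean"), # <NA> in row 2
        "t": np.array(["NaT", "2022-01-02", "2022-01-03"], dtype="datetime64[ns]"),
        "x": [np.nan, 1.0, np.nan],                          # NaN in rows 0 and 2
    }
)

# NaN, <NA> and NaT are all treated as missing by count().
counts = df.count(axis=1)
print(counts.tolist(), counts.dtype)
```

Running the same script through `python -m cudf.pandas` should, per the description above, produce the same values without a CPU fallback.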

Benchmark

NYC parking violations 2022 (~15.4M rows; 7 string + 3 int + 1 datetime columns), df.count(axis=1):

| | time | speedup |
|---|---|---|
| cudf.pandas before (CPU fallback) | 3.93 s | |
| pure pandas | 1.40 s | |
| cudf.pandas after (this PR, on GPU) | 0.023 s | ~170× vs fallback, ~60× vs pandas |

The whole-frame device→host copy is eliminated: the reduction now stays on the GPU.
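The speedup figures follow directly from the three timings reported for this benchmark. A quick check of the arithmetic, using only those reported numbers (the variable names are illustrative):

```python
# Timings reported in the benchmark, in seconds.
fallback_s = 3.93   # cudf.pandas before (CPU fallback)
pandas_s = 1.40     # pure pandas
gpu_s = 0.023       # cudf.pandas after (this PR, on GPU)

speedup_vs_fallback = fallback_s / gpu_s
speedup_vs_pandas = pandas_s / gpu_s
print(f"{speedup_vs_fallback:.1f}x vs fallback, {speedup_vs_pandas:.1f}x vs pandas")
```

Both ratios agree with the rounded ~170× and ~60× figures quoted in the benchmark.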

Reproducer
import time

import pandas as pd

# run with:  python -m cudf.pandas bench.py
# data:      https://data.rapids.ai/datasets/nyc_parking/nyc_parking_violations_2022.parquet
df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=[
        "Registration State", "Violation Code", "Vehicle Body Type", "Vehicle Make",
        "Violation Time", "Violation County", "Vehicle Year", "Violation Description",
        "Issue Date", "Summons Number",
    ],
)
df["Issue Date"] = df["Issue Date"].astype("datetime64[ms]")
df["issue_weekday"] = df["Issue Date"].dt.weekday

df.count(axis=1)  # warm-up run, excluded from the timing

# Time a single call with a monotonic, high-resolution clock.
start = time.perf_counter()
df.count(axis=1)
print(f"{time.perf_counter() - start:.4f} s")
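For readers without the parking-violations file, a synthetic variant of the reproducer might look like the sketch below. The column names, row count, and the `best_of` helper are assumptions made for illustration; the timings it prints are not comparable to the benchmark numbers above.

```python
import time

import numpy as np
import pandas as pd


def best_of(fn, repeats=5):
    """Hypothetical helper: best wall-clock time of fn() over several runs."""
    fn()  # warm-up run, excluded from the timing
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame(
    {
        # Object column with missing entries, standing in for the string columns.
        "state": rng.choice(np.array(["NY", "NJ", None], dtype=object), size=n_rows),
        "code": rng.integers(0, 100, size=n_rows),
        "issued": pd.Timestamp("2022-01-01")
        + pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D"),
    }
)

elapsed = best_of(lambda: df.count(axis=1))
print(f"{elapsed:.4f} s")
```

As with the original reproducer, running the file with `python -m cudf.pandas` exercises the cudf.pandas path, while plain `python` measures pandas.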

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot Bot commented Jun 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Python Affects Python cuDF API. label Jun 27, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 27, 2026
@galipremsagar galipremsagar added bug Something isn't working non-breaking Non-breaking change labels Jun 27, 2026
@galipremsagar galipremsagar changed the title from "Implement count=1" to "Implement DataFrame.count(axis=1) on the GPU" Jun 27, 2026
@galipremsagar galipremsagar marked this pull request as ready for review June 27, 2026 00:41
@galipremsagar galipremsagar requested a review from a team as a code owner June 27, 2026 00:41
@galipremsagar galipremsagar requested review from Matt711 and bdice June 27, 2026 00:41
@galipremsagar galipremsagar added the 3 - Ready for Review Ready for review by team label Jun 27, 2026
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • DataFrame.count() now supports row-wise counting with axis=1.
    • Added support for numeric_only when counting, matching expected results on numeric-only data.
  • Bug Fixes

    • count(axis=1) no longer raises an error and now returns per-row non-null counts.
    • Improved behavior for empty dataframes with no columns.
  • Tests

    • Added coverage to validate count(axis=1) against pandas across mixed data and missing values.

Walkthrough

DataFrame.count now supports axis=1 row-wise counts, applies numeric_only column filtering, updates the docstring example, and adds tests that compare the new behavior with pandas.

Changes

Row-wise count behavior

| Layer / File(s) | Summary |
|---|---|
| Count implementation and docs (python/cudf/cudf/core/dataframe.py) | DataFrame.count now handles axis=1, selects numeric columns when numeric_only=True, sums non-null masks across columns, handles empty-column frames, and updates the docstring example. |
| Row-wise count validation (python/cudf/cudf/tests/dataframe/methods/test_reductions.py) | The axis=1 unsupported-op list no longer includes count, and new parametrized cases compare count(axis=1, numeric_only=...) against pandas across mixed inputs. |
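The empty-column handling mentioned in the summary has a pandas counterpart that the new code presumably needs to match. A small pandas-only sketch of the two ways a frame can end up with no columns to count (the frames are invented for illustration):

```python
import pandas as pd

# Case 1: the frame has rows but no columns at all.
no_columns = pd.DataFrame(index=["r0", "r1", "r2"])
counts = no_columns.count(axis=1)

# Case 2: numeric_only=True filters out every column of a string-only frame.
strings_only = pd.DataFrame({"s": ["a", None]})
filtered = strings_only.count(axis=1, numeric_only=True)

print(counts.tolist(), filtered.tolist())
```

In both cases pandas returns one zero per row with an int64 dtype and the original row index, rather than an empty result.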

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding DataFrame.count(axis=1) support on the GPU. |
| Description check | ✅ Passed | The description accurately matches the implemented row-wise count support, numeric_only behavior, and benchmark motivation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
python/cudf/cudf/tests/dataframe/methods/test_reductions.py (1)

252-268: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Extend count(axis=1) coverage to the dtypes the PR claims to support.

The PR states count(axis=1) matches pandas across datetime NaT, pandas nullable, and Arrow-backed dtypes, but the parametrization only exercises numpy columns with None/np.nan. Consider adding cases for a datetime column containing NaT, a pandas nullable/Arrow-backed column with <NA>, and an empty (0-row) frame to guard the row-wise path and the numeric_only filtering against regressions.

As per path instructions: "Ensure test files provide comprehensive edge case coverage (empty, all-null, single-element, mixed types)".
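The extra cases requested here could be expressed as data along the following lines. This is a pandas-only sketch: the case names and expected values are illustrative, and in the actual test each frame would be converted to cudf and compared with pandas through the existing assert_eq flow rather than against hand-written lists.

```python
import numpy as np
import pandas as pd

# name -> (frame, expected counts for numeric_only=False, expected for numeric_only=True)
CASES = {
    "datetime_with_nat": (
        pd.DataFrame(
            {
                "t": np.array(["2022-01-01", "NaT"], dtype="datetime64[ns]"),
                "x": [1.0, np.nan],
            }
        ),
        [2, 0],
        [1, 0],
    ),
    "nullable_with_na": (
        pd.DataFrame(
            {
                "i": pd.array([1, None], dtype="Int64"),
                "s": pd.array(["a", None], dtype="string"),
            }
        ),
        [2, 0],
        [1, 0],
    ),
    "zero_rows": (
        pd.DataFrame(
            {
                "a": np.array([], dtype="float64"),
                "s": np.array([], dtype="object"),
            }
        ),
        [],
        [],
    ),
}

checked = 0
for name, (pdf, expected_all, expected_numeric) in CASES.items():
    # Confirm the hand-written expectations against pandas itself.
    assert pdf.count(axis=1).tolist() == expected_all, name
    assert pdf.count(axis=1, numeric_only=True).tolist() == expected_numeric, name
    checked += 1
print(f"{checked} cases checked")
```

Keeping the frames in a dictionary makes it straightforward to feed them to a parametrized test later, with the keys serving as test ids.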

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf/cudf/tests/dataframe/methods/test_reductions.py` around lines 252
- 268, Extend test_dataframe_count_axis1 to cover the dtypes and edge cases
claimed by the PR: add parametrized inputs for a datetime column with NaT, a
pandas nullable/Arrow-backed column with <NA>, and an empty 0-row DataFrame.
Update the expected comparisons against pandas DataFrame.count(axis=1,
numeric_only=...) so the row-wise path and numeric_only filtering are validated
for these cases without changing the existing assert_eq flow.

Source: Path instructions

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: abfa9332-f852-4f8b-ad9c-9d5619f9eaee

📥 Commits

Reviewing files that changed from the base of the PR and between c979f58 and 1099327.

📒 Files selected for processing (2)
  • python/cudf/cudf/core/dataframe.py
  • python/cudf/cudf/tests/dataframe/methods/test_reductions.py
