Fix cudf.pandas --line-profile clobbering __file__#23017
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
📝 WalkthroughSummary by CodeRabbit
Walkthrough
Changescudf.pandas line-profile execution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/merge |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cudf/cudf_pandas_tests/test_main.py (1)
9-17: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Drop
shell=Truefrom_run_python.This helper currently treats
commandas shell text, so valid temp script paths with spaces or shell metacharacters will be split/mangled by the shell. In this test file the immediate risk is flaky/broken path handling, and it also leaves a latent injection footgun for future callers. Pass an argv list tosubprocess.check_outputinstead.Proposed fix
+import sys import subprocess import tempfile import textwrap @@ -def _run_python(*, cudf_pandas, command): - executable = "python " - if cudf_pandas: - executable += "-m cudf.pandas " - return subprocess.check_output( - executable + command, - shell=True, - text=True, - encoding="utf-8", - ) +def _run_python(*, cudf_pandas, command): + argv = [sys.executable] + if cudf_pandas: + argv.extend(["-m", "cudf.pandas"]) + argv.extend(command) + return subprocess.check_output(argv, text=True, encoding="utf-8")- res = _run_python(cudf_pandas=True, command=f.name) - expect = _run_python(cudf_pandas=False, command=f.name) + res = _run_python(cudf_pandas=True, command=[f.name]) + expect = _run_python(cudf_pandas=False, command=[f.name])🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf/cudf_pandas_tests/test_main.py` around lines 9 - 17, The _run_python helper currently builds a shell command string and passes shell=True to subprocess.check_output, which can mangle paths and is unsafe for future callers. Update _run_python to construct and pass an argv list instead of a single command string, keeping the cudf_pandas toggle behavior in the executable selection and preserving the existing text/encoding handling.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@python/cudf/cudf_pandas_tests/test_main.py`:
- Around line 9-17: The _run_python helper currently builds a shell command
string and passes shell=True to subprocess.check_output, which can mangle paths
and is unsafe for future callers. Update _run_python to construct and pass an
argv list instead of a single command string, keeping the cudf_pandas toggle
behavior in the executable selection and preserving the existing text/encoding
handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 459cdb8b-337f-45c9-8951-84983363779d
📒 Files selected for processing (1)
python/cudf/cudf_pandas_tests/test_main.py
Description
closes #23010
python -m cudf.pandas --line-profile <script>writes an instrumented copy of the script to a temporary file and executes it viarunpy.run_path(<temp>), which sets__file__to that temporary path. Scripts that locate sibling resources relative to__file__(e.g.Path(__file__).resolve().parent.parent / "data" / "file.parquet") then resolve to the wrong location and fail:The same script runs fine without
--line-profile(it is executed directly, so__file__is correct).Root cause / fix
The per-line profiler needs the executed code object's filename to be the instrumented temp file — it reads source lines via
inspect.stack().code_context(→linecacheonco_filename) and shifts line numbers back to the original — so the temp filename can't simply be swapped for the real one.This PR keeps the code object's filename pointed at the temp file (per-line profiling output and tracebacks are unchanged) but executes it in a
__main__module whose__file__— andsys.argv[0]— refer to the original script. This matches the behavior of running without--line-profile, so scripts that resolve paths relative to__file__keep working.Tests
Adds
test_run_cudf_pandas_line_profile_preserves_file: runspython -m cudf.pandas --line-profileon a script that reads a sibling file via__file__and asserts it succeeds (and that__file__is the original script path). The function profiler (--profile) was never affected — it executes the original script directly.Checklist