Benchmark Quick Start Guide

This guide provides step-by-step instructions to run the Python vs Cython performance benchmark.

Prerequisites

  • Python 3.8 or higher
  • A C compiler for building Cython extensions (GCC on Linux; see Troubleshooting for macOS and Windows)
  • pip package manager

Installation Steps

1. Clone the Repository

git clone https://github.com/nasirus/bench_python_c.git
cd bench_python_c

2. Install Dependencies

pip install -r requirements.txt

This will install:

  • pandas (DataFrame operations)
  • pyarrow (Parquet file I/O)
  • numpy (Numerical computations)
  • Cython (C extension compilation)

3. Build the Cython Extension

python setup.py build_ext --inplace

This compiles the Cython code (cython_impl.pyx) into a C extension module that can be imported by Python.

Running the Benchmark

Full Benchmark

python benchmark.py

This will:

  1. Run both Python and Cython implementations 5 times each
  2. Generate 1 million row DataFrames in each iteration
  3. Write each DataFrame to a Parquet file with snappy compression
  4. Calculate and display performance statistics
  5. Clean up generated files automatically
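The timing part of these steps can be sketched as follows. This is a minimal illustration, not the code in benchmark.py: the function name time_phases, its parameters, and the stand-in workloads are all hypothetical, and the real benchmark generates DataFrames and writes Parquet files instead.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_phases(generate: Callable[[], object],
                write: Callable[[object], object],
                iterations: int = 5) -> Dict[str, float]:
    """Time the generation and writing phases separately over several runs.

    Hypothetical helper: benchmark.py may structure its timing differently.
    """
    gen_times: List[float] = []
    write_times: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        data = generate()
        gen_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        write(data)
        write_times.append(time.perf_counter() - start)

    # Total time per iteration is generation plus writing.
    totals = [g + w for g, w in zip(gen_times, write_times)]
    return {
        "avg_generation": statistics.mean(gen_times),
        "avg_writing": statistics.mean(write_times),
        "avg_total": statistics.mean(totals),
        "min_total": min(totals),
        "max_total": max(totals),
    }


# Stand-in workloads so the sketch runs without pandas or pyarrow.
stats = time_phases(lambda: list(range(100_000)),
                    lambda rows: sum(rows),
                    iterations=3)
for name, seconds in stats.items():
    print(f"{name}: {seconds:.4f}s")
```

Using time.perf_counter rather than time.time gives a monotonic, high-resolution clock, which is the usual choice for short benchmark intervals.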

Validation Tests

python test_validation.py

This verifies that:

  1. Both implementations produce identical DataFrames
  2. Parquet files are written and read correctly
  3. Data integrity is maintained

Understanding the Output

Benchmark Output Example

================================================================================
BENCHMARK RESULTS
================================================================================

Pure Python Implementation:
  Average Generation Time:  0.9574s
  Average Writing Time:     0.1696s
  Average Total Time:       1.1270s
  Min Total Time:           1.1170s
  Max Total Time:           1.1508s

Cython Implementation:
  Average Generation Time:  0.1371s
  Average Writing Time:     0.1562s
  Average Total Time:       0.2934s
  Min Total Time:           0.2904s
  Max Total Time:           0.2956s

Performance Comparison:
  Generation Speedup:       6.98x
  Writing Speedup:          1.09x
  Total Speedup:            3.84x

  Cython is 284.1% faster than Pure Python
================================================================================

Key Metrics Explained

  • Generation Time: Time to create the 1M row DataFrame
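  • Writing Time: Time to write the DataFrame to a Parquet file
  • Total Time: Combined generation + writing time
  • Speedup: Ratio of Python time / Cython time
  • Percentage Improvement: (Speedup - 1) × 100, shown as the "Cython is N% faster" line. The percentage is computed from the unrounded speedup, so 3.84x appears as 284.1% rather than 284.0%.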
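As a worked check, the two formulas can be applied to the averages from the example output above. The function names here are illustrative and are not taken from benchmark.py.

```python
def speedup(python_seconds: float, cython_seconds: float) -> float:
    """Ratio of Python time to Cython time."""
    return python_seconds / cython_seconds


def percent_faster(ratio: float) -> float:
    """Percentage improvement: (speedup - 1) * 100."""
    return (ratio - 1) * 100


# Averages copied from the example output above.
generation = speedup(0.9574, 0.1371)
writing = speedup(0.1696, 0.1562)
total = speedup(1.1270, 0.2934)

print(f"Generation Speedup: {generation:.2f}x")  # 6.98x
print(f"Writing Speedup:    {writing:.2f}x")     # 1.09x
print(f"Total Speedup:      {total:.2f}x")       # 3.84x
print(f"Cython is {percent_faster(total):.1f}% faster than Pure Python")  # 284.1%
```

Note that the total speedup is not the average of the two phase speedups: writing takes about the same time in both implementations, so it dilutes the generation gain.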

File Descriptions

File                 Purpose
benchmark.py         Main benchmark runner with statistics
python_impl.py       Pure Python implementation
cython_impl.pyx      Cython optimized implementation
setup.py             Build configuration for Cython
test_validation.py   Correctness validation tests
requirements.txt     Python package dependencies
README.md            Complete documentation

Customizing the Benchmark

You can modify the benchmark parameters by editing benchmark.py:

# Change number of iterations
num_iterations = 5  # Default: 5

# Change DataFrame size
num_rows = 1_000_000  # Default: 1 million

Troubleshooting

Build Errors

If you encounter build errors, ensure you have:

  • A C compiler installed (GCC on Linux, Xcode Command Line Tools on macOS, MSVC on Windows)
  • Python development headers (python3-dev on Ubuntu)

Import Errors

If you get import errors, make sure:

  1. You've built the Cython extension: python setup.py build_ext --inplace
  2. You're running from the project root directory
  3. All dependencies are installed: pip install -r requirements.txt
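A quick way to see which of these checks is failing is to ask Python which modules it can locate. The sketch below assumes the compiled extension is importable as cython_impl (inferred from the cython_impl.pyx filename) and that it is run from the project root; the helper name missing_modules is hypothetical.

```python
import importlib.util
from typing import Iterable, List

# Import names of the dependencies, plus the compiled extension.
# "cython_impl" is an assumption based on the .pyx filename.
REQUIRED = ["pandas", "pyarrow", "numpy", "Cython", "cython_impl"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All required modules were found.")
```

If only cython_impl is reported, the extension has most likely not been built or the script is not being run from the project root. If the other names appear, reinstall the dependencies.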

Performance Variations

Benchmark results may vary based on:

  • CPU speed and architecture
  • Available RAM
  • Disk I/O speed
  • System load
  • Python version
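When comparing results across machines, it can help to record some of these factors next to the timings. The following is a minimal sketch using only the standard library; the function name describe_environment is hypothetical and is not part of the repository. Available RAM, disk speed, and system load are not covered, since the standard library has no portable way to read them.

```python
import os
import platform


def describe_environment() -> dict:
    """Collect basic facts about the interpreter and machine."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```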

Next Steps

  • Experiment with different DataFrame sizes
  • Add more data types and transformations
  • Profile specific bottlenecks
  • Compare with other optimization approaches (Numba, PyPy)

Support

For issues or questions, please open an issue on GitHub.