Python vs Cython Performance Benchmark

A comprehensive benchmark comparing pure Python and Cython implementations for generating and writing a 1 million row pandas DataFrame to a parquet file with snappy compression.

Overview

This benchmark evaluates the performance difference between:

  • Pure Python: Standard Python implementation using pandas and list comprehensions
  • Cython: Optimized implementation using Cython with C-level optimizations, static typing, and numpy arrays

Benchmark Task

The benchmark measures the time to:

  1. Generate a pandas DataFrame with 1,000,000 rows containing:

    • id: Sequential integer IDs
    • value1: Computed float values (i * 2.5)
    • value2: Computed float values (sqrt(i))
    • category: String categories (10 unique values)
    • flag: Boolean values (alternating True/False)
  2. Write the DataFrame to a parquet file with snappy compression
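
The task above can be sketched in pure Python as follows. This is a minimal illustration, not the repository's actual python_impl.py. The function names build_columns and write_parquet, the category labels cat_0 to cat_9, ids starting at 0, and the flag being True for even ids are all assumptions; the source only specifies the column formulas, 10 unique categories, and alternating flags.

```python
import math


def build_columns(n_rows):
    """Build the five benchmark columns as plain Python lists."""
    # Hypothetical category labels; the source only says there are 10 unique values.
    categories = [f"cat_{k}" for k in range(10)]
    return {
        # Assumption: ids start at 0.
        "id": list(range(n_rows)),
        "value1": [i * 2.5 for i in range(n_rows)],
        "value2": [math.sqrt(i) for i in range(n_rows)],
        "category": [categories[i % 10] for i in range(n_rows)],
        # Assumption: the alternation starts with True at id 0.
        "flag": [i % 2 == 0 for i in range(n_rows)],
    }


def write_parquet(columns, path):
    """Write the columns to a snappy-compressed parquet file via pandas/pyarrow."""
    # Imported here so build_columns stays usable without pandas installed.
    import pandas as pd

    df = pd.DataFrame(columns)
    df.to_parquet(path, engine="pyarrow", compression="snappy", index=False)
```

Calling write_parquet(build_columns(1_000_000), "output.parquet") would run the full task; the file name is arbitrary.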

Installation

Prerequisites

  • Python 3.8+
  • A C compiler such as GCC (for building the Cython extension)

Setup

  1. Clone the repository:
git clone https://github.com/nasirus/bench_python_c.git
cd bench_python_c
  2. Install dependencies:
pip install -r requirements.txt
  3. Build the Cython extension:
python setup.py build_ext --inplace

Usage

Run the benchmark:

python benchmark.py

The benchmark will:

  • Run each implementation 5 times
  • Calculate average, minimum, and maximum times
  • Display detailed performance comparison
  • Clean up generated parquet files automatically
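
The repeat-and-summarize procedure described above might look roughly like this. It is a sketch only: time_runs and format_stats are hypothetical names, not functions from benchmark.py, and the workload at the bottom is a stand-in for the real generate-and-write step.

```python
import time
from statistics import mean


def time_runs(func, n_runs=5):
    """Call func n_runs times and return average, minimum, and maximum wall time."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return {
        "avg": mean(durations),
        "min": min(durations),
        "max": max(durations),
        "runs": n_runs,
    }


def format_stats(label, stats):
    """Render one result line in seconds with four decimal places."""
    return (
        f"{label}: avg {stats['avg']:.4f}s, "
        f"min {stats['min']:.4f}s, max {stats['max']:.4f}s "
        f"over {stats['runs']} runs"
    )


# Stand-in workload; a real run would pass the generate-and-write function.
stats = time_runs(lambda: sum(i * i for i in range(10_000)))
print(format_stats("demo", stats))
```

time.perf_counter is used here because it is a monotonic, high-resolution clock intended for measuring short durations.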

Benchmark Results

Test Environment

  • Python 3.12.3
  • pandas 2.2.3
  • pyarrow 18.0.0
  • Cython 3.0.11
  • numpy 2.1.3

Performance Results

Pure Python Implementation:

  • Average Generation Time: 0.9574s
  • Average Writing Time: 0.1696s
  • Average Total Time: 1.1270s

Cython Implementation:

  • Average Generation Time: 0.1371s
  • Average Writing Time: 0.1562s
  • Average Total Time: 0.2934s

Performance Comparison

Metric               | Speedup
DataFrame Generation | 6.98x
Parquet Writing      | 1.09x
Total                | 3.84x

The Cython implementation is 284.1% faster than pure Python for the complete workflow: it runs at 3.84x the speed, which cuts total time by about 74%.
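
The speedup figures can be reproduced from the average times reported above. In this short sketch, speedup is the ratio of pure Python time to Cython time, and "percent faster" is (speedup - 1) * 100; the helper names are illustrative.

```python
# Average times in seconds, copied from the results above.
python_times = {"generation": 0.9574, "writing": 0.1696, "total": 1.1270}
cython_times = {"generation": 0.1371, "writing": 0.1562, "total": 0.2934}


def speedup(baseline, optimized):
    """How many times faster the optimized run is than the baseline."""
    return baseline / optimized


def percent_faster(baseline, optimized):
    """Speedup expressed as a percentage gain: (speedup - 1) * 100."""
    return (speedup(baseline, optimized) - 1.0) * 100.0


for phase in python_times:
    s = speedup(python_times[phase], cython_times[phase])
    print(f"{phase}: {s:.2f}x")

print(f"{percent_faster(python_times['total'], cython_times['total']):.1f}% faster")
```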

Key Findings

  1. DataFrame Generation: Cython shows the most significant improvement (~7x faster) due to:

    • Static typing and C-level loops
    • Direct numpy array manipulation
    • Elimination of Python interpreter overhead
    • Use of C math functions (sqrt)
  2. Parquet Writing: Minimal difference (~1.09x) because:

    • Both implementations use the same pyarrow engine
    • I/O operations are dominated by compression and disk writes
    • Limited optimization opportunities at Python level
  3. Overall Performance: Cython provides a ~3.84x speedup, driven almost entirely by the generation step, which makes it well suited to:

    • Data generation and transformation tasks
    • Compute-intensive operations
    • Processing large datasets

Implementation Details

Pure Python (python_impl.py)

  • Uses standard Python loops and list comprehensions
  • Relies on pandas DataFrame constructor
  • Simple and readable implementation

Cython (cython_impl.pyx)

  • Uses static typing with cdef
  • Pre-allocates numpy arrays for efficiency
  • Utilizes C math functions from libc.math
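  • Disables bounds checking and negative-index wraparound (Cython's boundscheck and wraparound directives) for maximum performance
  • Employs C division (Cython's cdivision directive), which skips Python's zero-division and sign-adjustment checks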

Files

  • benchmark.py: Main benchmark runner script
  • python_impl.py: Pure Python implementation
  • cython_impl.pyx: Cython implementation
  • setup.py: Build script for Cython extension
  • requirements.txt: Python dependencies

License

MIT License
