
Adds a benchmark using the California housing dataset from scikit-learn. #10

Merged
merged 1 commit on Aug 28, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -17,3 +17,5 @@ venv/

dask-worker-space/*
.pre-commit-config.yaml

.DS_Store
10 changes: 10 additions & 0 deletions README.md
@@ -50,6 +50,16 @@ def list_pow(values: List[float], factor: float) -> List[float]:
[Scaler](https://github.com/citi/scaler) or Dask.


## Benchmarks

**Parfun efficiently parallelizes short-duration functions**.

When running a short 0.28-second ML training function on an AMD EPYC 7313 16-core processor, Parfun achieves a
**7.4x speedup**. The source code for this experiment is available [here](benchmarks/california_housing.py).

![Benchmark Results](benchmarks/california_housing_results.svg)
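Conceptually, the benchmark's `df_by_row` partitioner splits the input DataFrame into row-wise chunks, trains a model on each chunk in parallel, and combines the fitted models. A minimal stand-alone sketch of the splitting step, using plain pandas (the toy data and chunk count are illustrative, not Parfun's actual implementation):

```python
import pandas as pd

# Toy DataFrame standing in for the housing dataset.
df = pd.DataFrame({"feature": range(6), "target": range(6)})

# Split into row-wise chunks, as df_by_row does conceptually.
n_chunks = 3
chunk_size = -(-len(df) // n_chunks)  # ceiling division
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Each worker would train on one chunk; the fitted models are then combined.
print([len(chunk) for chunk in chunks])  # [2, 2, 2]
```

Concatenating the chunks back together reproduces the original DataFrame, so no rows are lost or duplicated by the split.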


## Quick Start
The built-in Sphinx documentation contains detailed usage instructions, implementation details, and an exhaustive
API reference.
99 changes: 99 additions & 0 deletions benchmarks/california_housing.py
@@ -0,0 +1,99 @@
"""
Trains a decision tree regressor on the California housing dataset from scikit-learn.

Measures the training time with and without splitting the training dataset using Parfun.
"""

import argparse
import json
import timeit

from typing import List

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.base import RegressorMixin
from sklearn.tree import DecisionTreeRegressor

from parfun.decorators import parfun
from parfun.entry_point import BACKEND_REGISTRY, set_parallel_backend_context
from parfun.partition.api import per_argument
from parfun.partition.dataframe import df_by_row


class MeanRegressor(RegressorMixin):
    """Combines several fitted regressors by averaging their predictions."""

    def __init__(self, regressors: List[RegressorMixin]) -> None:
        super().__init__()
        self._regressors = regressors

    def predict(self, X):
        # axis=0 averages across regressors, keeping one prediction per sample.
        return np.mean(
            [regressor.predict(X) for regressor in self._regressors],
            axis=0,
        )


@parfun(
    split=per_argument(dataframe=df_by_row),
    combine_with=lambda regressors: MeanRegressor(list(regressors)),
)
def train_regressor(
    dataframe: pd.DataFrame, feature_names: List[str], target_name: str
) -> RegressorMixin:
    regressor = DecisionTreeRegressor()
    regressor.fit(dataframe[feature_names], dataframe[[target_name]])

    return regressor


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("n_workers", action="store", type=int)
    parser.add_argument(
        "--backend",
        type=str,
        choices=BACKEND_REGISTRY.keys(),
        default="local_multiprocessing",
    )
    parser.add_argument(
        "--backend_args",
        type=str,
        default="{}",
    )

    args = parser.parse_args()

    dataset = fetch_california_housing(download_if_missing=True)

    feature_names = dataset["feature_names"]
    target_name = dataset["target_names"][0]

    dataframe = pd.DataFrame(dataset["data"], columns=feature_names)
    dataframe[target_name] = dataset["target"]

    N_MEASURES = 5

    with set_parallel_backend_context("local_single_process"):
        # Warm-up run, then measure the average sequential training time.
        regressor = train_regressor(dataframe, feature_names, target_name)

        duration = timeit.timeit(
            lambda: train_regressor(dataframe, feature_names, target_name),
            number=N_MEASURES,
        ) / N_MEASURES

        print("Duration sequential:", duration)

    backend_args = {"max_workers": args.n_workers, **json.loads(args.backend_args)}

    with set_parallel_backend_context(args.backend, **backend_args):
        # Warm-up run, then measure the average parallel training time.
        regressor = train_regressor(dataframe, feature_names, target_name)

        duration = timeit.timeit(
            lambda: train_regressor(dataframe, feature_names, target_name),
            number=N_MEASURES,
        ) / N_MEASURES

        print("Duration parallel:", duration)
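One subtlety in the combine step: when averaging the predictions of several regressors, `np.mean` needs `axis=0` to produce one averaged value per sample rather than collapsing everything to a single scalar. A minimal illustration with toy prediction arrays:

```python
import numpy as np

# Hypothetical per-sample predictions from two regressors.
predictions = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]

# Without axis=0, everything collapses to a single scalar.
print(np.mean(predictions))          # 3.0

# With axis=0, we keep one averaged prediction per sample.
print(np.mean(predictions, axis=0))  # [2. 3. 4.]
```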
52 changes: 52 additions & 0 deletions benchmarks/california_housing_results.svg
2 changes: 1 addition & 1 deletion parfun/about.py
@@ -1 +1 @@
__version__ = "6.0.6"
__version__ = "6.0.7"