Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add python onehot script #4

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mesejo
Copy link
Contributor

@mesejo mesejo commented Jan 9, 2024

Adds the Python script for benchmarking the one-hot encoding of pandas. This does not include the loading of the CSV.
Two observations:

  • I didn’t include the loading of the CSV in the timings of the Python function, but I do believe that the timings of rust include the loading of the CSV. Perhaps is worth including in the table both timings for Python (with CSV loading and without) in the blog post
  • This is not directly an apple-to-apple comparison since our one-hot function returns a one-column dictionary, and pandas returns a matrix. We could mention that is also one of the strengths of DataFusion the ability to process/work with nested datatypes such as dictionaries (see this)

@mesejo mesejo self-assigned this Jan 9, 2024
@dlovell
Copy link

dlovell commented Jan 11, 2024

@mesejo The blog post says the benchmark includes both reading the csv and application of onehot. Should this benchmark include the csv reading as well?

@mesejo mesejo force-pushed the feat/add-python-one-hot-encoding-script-bench branch from a70d15f to 009bdc1 Compare January 11, 2024 14:41
@mesejo
Copy link
Contributor Author

mesejo commented Jan 11, 2024

@mesejo The blog post says the benchmark includes both reading the csv and application of onehot. Should this benchmark include the csv reading as well?

Yes, you are correct. Updated the script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants