@simonjayhawkins
Member

Planning to remove the padding code from _wrap_result eventually, but until then we can skip it when get_dummies returns an integer array.

Adding tests and benchmarks as a precursor to potential changes to _wrap_result (#41372).

I have a working implementation for ArrowStringArray using pyarrow native functions, but it is slower than the object fallback, so I am leaving that for a follow-up.

       before           after         ratio
     [4ec6925c]       [091b0b02]
     <master>         <get_dummies>
-      2.58±0.02s         655±10ms     0.25  strings.Dummies.time_get_dummies('arrow_string')
-      2.58±0.03s          643±9ms     0.25  strings.Dummies.time_get_dummies('string')
-      2.59±0.07s          638±7ms     0.25  strings.Dummies.time_get_dummies('str')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
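
For readers unfamiliar with the method being benchmarked, here is a minimal pure-Python sketch of the kind of result Series.str.get_dummies produces: each string is split on a separator and turned into a row of 0/1 indicators. The name get_dummies_sketch and the use of None for a missing value are assumptions made for illustration; this is not the pandas implementation and it does not touch _wrap_result.

```python
def get_dummies_sketch(values, sep="|"):
    """Return (columns, rows); rows[i][j] is 1 if columns[j] occurs in values[i].

    Hypothetical helper for illustration only, not pandas code.
    """
    # Split every entry into its set of tags. None stands in for a missing
    # value and contributes no tags; empty tags are dropped.
    tag_sets = [
        set(v.split(sep)) - {""} if v is not None else set()
        for v in values
    ]
    # The output columns are the sorted union of all tags seen.
    columns = sorted(set().union(*tag_sets))
    # Every row has exactly one integer per column, so the result is
    # already rectangular.
    rows = [[int(col in tags) for col in columns] for tags in tag_sets]
    return columns, rows


columns, rows = get_dummies_sketch(["a|b", "a", None, "b|c"])
print(columns)
for row in rows:
    print(row)
```

In this sketch every row already has one integer entry per column, which is consistent with the point above that an integer result from get_dummies has nothing left to pad.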

@simonjayhawkins added the Performance and Strings labels May 13, 2021
@simonjayhawkins added this to the 1.3 milestone May 13, 2021
@jreback merged commit 3846040 into pandas-dev:master May 13, 2021
@jreback
Contributor

jreback commented May 13, 2021

great

@simonjayhawkins deleted the get_dummies branch May 14, 2021 09:17
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021