Support generating dtypes, index_value etc lazily for DataFrame chunks #2756

Merged (7 commits) on Mar 11, 2022

qinxuye (Collaborator) commented Feb 25, 2022

What do these changes do?

This PR improves the performance of building tileables and chunks by generating DataFrame chunk metadata such as dtypes and index_value lazily.

Test code:

import mars.tensor as mt
import mars.dataframe as md
from mars.utils import Timer
from mars.core.graph.builder.utils import build_graph


df = md.DataFrame(mt.random.rand(5000, 10, chunk_size=(2, 5)), columns=list('abcdefghij'))
df = df[df['a'] < 0.8]

with Timer() as timer:
    g = build_graph([df], tile=True)
    print(len(g))
print('cost', timer.duration, 'seconds')
Build time (seconds):

Master    This PR
6.476     3.231
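
A minimal sketch of what "lazy" generation means here, assuming a simplified model in which each chunk stores only a recipe for its dtypes and builds them on first access. `LazyChunkMeta` and `slice_dtypes` are hypothetical names made up for illustration; this is not the Mars implementation.

```python
from typing import Callable, Dict, Optional


class LazyChunkMeta:
    """Hypothetical stand-in for a DataFrame chunk's metadata (not a Mars class)."""

    def __init__(self, make_dtypes: Callable[[], Dict[str, str]]):
        self._make_dtypes = make_dtypes  # recipe, stored but not run at build time
        self._dtypes: Optional[Dict[str, str]] = None
        self.build_count = 0  # how many times the recipe actually ran

    @property
    def dtypes(self) -> Dict[str, str]:
        # Compute on first access, then reuse the cached result.
        if self._dtypes is None:
            self._dtypes = self._make_dtypes()
            self.build_count += 1
        return self._dtypes


def slice_dtypes(parent: Dict[str, str], columns: str) -> Dict[str, str]:
    """Pick the dtypes of the columns that belong to one chunk."""
    return {c: parent[c] for c in columns}


parent = {c: "float64" for c in "abcdefghij"}

# Shape (5000, 10) with chunk_size=(2, 5) gives 2500 x 2 = 5000 chunks.
chunks = [
    LazyChunkMeta(lambda cols=cols: slice_dtypes(parent, cols))
    for _ in range(2500)
    for cols in ("abcde", "fghij")
]

first = chunks[0].dtypes  # only this chunk pays the cost
built_after_one_access = sum(c.build_count for c in chunks)
print(len(chunks), "chunks,", built_after_one_access, "dtypes built")
```

With many small chunks, as in the test above, building the graph no longer has to materialize metadata for every chunk up front; the cost is paid only for chunks whose metadata is actually read.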

Related issue number

#2791

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

@qinxuye qinxuye added the type: enhancement request label Feb 25, 2022
@qinxuye qinxuye added this to the v0.9.0b2 milestone Feb 25, 2022
@qinxuye qinxuye added this to In progress in Misc via automation Feb 25, 2022
@qinxuye qinxuye added this to PR-In progress in v0.9 Release via automation Feb 25, 2022
@qinxuye qinxuye force-pushed the enh/expr-opt branch 3 times, most recently from f96d832 to 41d0be1 Compare March 3, 2022 06:51
@qinxuye qinxuye changed the title WIP: Improve performance of building tileables and chunks WIP: Introducing generating dtypes, index_value etc lazily for DataFrame chunks Mar 4, 2022
@qinxuye qinxuye removed this from the v0.9.0b2 milestone Mar 4, 2022
@qinxuye qinxuye changed the title WIP: Introducing generating dtypes, index_value etc lazily for DataFrame chunks WIP: Generating dtypes, index_value etc lazily for DataFrame chunks Mar 4, 2022
@qinxuye qinxuye changed the title WIP: Generating dtypes, index_value etc lazily for DataFrame chunks WIP: Support generating dtypes, index_value etc lazily for DataFrame chunks Mar 7, 2022
@qinxuye qinxuye force-pushed the enh/expr-opt branch 2 times, most recently from 51bb663 to afaa0ed Compare March 7, 2022 09:11
@qinxuye qinxuye changed the title WIP: Support generating dtypes, index_value etc lazily for DataFrame chunks Support generating dtypes, index_value etc lazily for DataFrame chunks Mar 7, 2022
@qinxuye qinxuye added this to the v0.9.0rc1 milestone Mar 7, 2022
@qinxuye qinxuye force-pushed the enh/expr-opt branch 2 times, most recently from 1607157 to dc6bab5 Compare March 7, 2022 14:17
@qinxuye qinxuye marked this pull request as ready for review March 7, 2022 16:14
@qinxuye qinxuye force-pushed the enh/expr-opt branch 3 times, most recently from 9656cd9 to 3971b47 Compare March 10, 2022 08:16
qinxuye (Collaborator, Author) commented Mar 10, 2022

Results of the cb test:

[ 50.00%] · For pymars commit 49c67952 (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.24-cloudpickle-defusedxml-numexpr-numpy-pandas-psutil-pyyaml-scikit-learn-scipy-sqlalchemy-tornado
[ 62.50%] ··· ...otes.EmptyRemotesExecutionSuite.time_remotes         6.43±0.02s
[ 75.00%] ··· ...h_builder.ChunkGraphBuilderSuite.time_filter           613±40ms
[ 75.00%] · For pymars commit e1b8c9dd (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.24-cloudpickle-defusedxml-numexpr-numpy-pandas-psutil-pyyaml-scikit-learn-scipy-sqlalchemy-tornado
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.24-cloudpickle-defusedxml-numexpr-numpy-pandas-psutil-pyyaml-scikit-learn-scipy-sqlalchemy-tornado
[ 87.50%] ··· ...otes.EmptyRemotesExecutionSuite.time_remotes         6.78±0.03s
[100.00%] ··· ...h_builder.ChunkGraphBuilderSuite.time_filter         1.18±0.01s

Both benchmarks improve: EmptyRemotesExecutionSuite.time_remotes drops from 6.78s to 6.43s, and ChunkGraphBuilderSuite.time_filter drops from 1.18s to 613ms, roughly half the time.
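
For reference, the speedups implied by the numbers reported in this PR can be worked out as below. The timings are copied from the test script result and the benchmark output above; the `speedup` helper is just illustrative arithmetic.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Ratio of old to new wall time; values above 1 mean the new code is faster."""
    return before_s / after_s


# (before, after) in seconds, taken from the figures quoted in this PR.
timings = {
    "build_graph test script": (6.476, 3.231),
    "EmptyRemotesExecutionSuite.time_remotes": (6.78, 6.43),
    "ChunkGraphBuilderSuite.time_filter": (1.18, 0.613),
}

for name, (before, after) in timings.items():
    saved_pct = 100 * (1 - after / before)
    print(f"{name}: {speedup(before, after):.2f}x, {saved_pct:.1f}% less time")
```

The two graph-building measurements come out at close to 2x, while time_remotes improves by about 5%.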

qinxuye (Collaborator, Author) commented Mar 11, 2022

Ready for review now.

wjsi (Member) left a comment:

LGTM

hekaisheng (Contributor) left a comment:

LGTM

Misc automation moved this from In progress to Reviewer approved Mar 11, 2022
@hekaisheng hekaisheng merged commit 4806727 into mars-project:master Mar 11, 2022
Misc automation moved this from Reviewer approved to Done Mar 11, 2022
v0.9 Release automation moved this from PR-In progress to PR-Done Mar 11, 2022
@qinxuye qinxuye deleted the enh/expr-opt branch March 11, 2022 07:58