add PySparkOvertimeMonitor to avoid exceeding time budget #923
Conversation
@microsoft-github-policy-service agree company="Microsoft"
@levscaut Thank you for the contribution! Could you remove the file
Also, please remove the unnecessary changes in the test folders; only check in your code/docs changes.
@thinkall Thanks for the advice! I have committed a cleaner version.
A test on macOS failed: https://github.com/microsoft/FLAML/actions/runs/4219786733/jobs/7325478207#step:6:84
This file is not needed.
in https://github.com/microsoft/FLAML/actions/runs/4220014982/jobs/7325940952
Looks good to me.
Co-authored-by: Chi Wang <[email protected]>
Related work items: microsoft#493, microsoft#777, microsoft#820, microsoft#837, microsoft#843, microsoft#848, microsoft#849, microsoft#850, microsoft#853, microsoft#855, microsoft#857, microsoft#869, microsoft#870, microsoft#888, microsoft#894, microsoft#923, microsoft#924, microsoft#925, microsoft#934, microsoft#952, microsoft#962, microsoft#973, microsoft#975, microsoft#995
Why are these changes needed?
For very large datasets where a single trial takes a long time to complete, the actual running time can be far longer than the time budget. It is common for a new trial to start just as the time budget is about to run out, and the current management system then has to wait until that trial completes. In the worst case, the actual running time is the time budget plus the duration of one full trial. For large datasets this extra cost is unacceptable.
To keep the time cost within the time budget, this PR introduces PySparkOvertimeMonitor: a context manager that starts a monitor thread to cancel Spark jobs once the running time exceeds the time budget. When the running jobs are cancelled, you can still fetch the last finished result.
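For illustration, here is a minimal sketch of such a monitor. The class name matches the PR, but the constructor arguments, the polling interval, and the use of `SparkContext.cancelAllJobs()` are assumptions about the approach, not the actual FLAML implementation:

```python
import threading
import time


class PySparkOvertimeMonitor:
    """Sketch: cancel running Spark jobs once the time budget is exhausted.

    The real class lives in flaml.tune.spark.utils; the argument names
    below (time_start, time_budget_s, force_cancel, sc) are illustrative.
    """

    def __init__(self, time_start, time_budget_s, force_cancel, sc):
        self._deadline = time_start + time_budget_s
        self._force_cancel = force_cancel
        self._sc = sc  # an active pyspark.SparkContext, or None
        self._stop = threading.Event()

    def _monitor(self):
        # Poll until the deadline passes or the wrapped block finishes.
        while not self._stop.wait(timeout=0.1):
            if time.time() >= self._deadline:
                # cancelAllJobs() is a real SparkContext method; it stops
                # all active jobs so control returns to the caller promptly.
                self._sc.cancelAllJobs()
                break

    def __enter__(self):
        if self._force_cancel and self._sc is not None:
            self._thread = threading.Thread(target=self._monitor, daemon=True)
            self._thread.start()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._force_cancel and self._sc is not None:
            self._stop.set()
            self._thread.join()
        return False  # do not suppress exceptions from the wrapped block
```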
You can set the argument force_cancel to True in the AutoML settings to enable the monitor; it is False by default.
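For example, assuming a typical AutoML setup (the force_cancel flag comes from this PR; the data, task, and other settings below are placeholders):

```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True)  # placeholder dataset

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    time_budget=30,     # seconds
    use_spark=True,     # run trials as Spark jobs
    force_cancel=True,  # enable PySparkOvertimeMonitor (default: False)
)
```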
PySparkOvertimeMonitor lives in flaml.tune.spark.utils. As a context manager, it wraps the running parts of both the sequential and the Spark-parallel code paths in flaml.tune, as sketched below. It is a no-op when force_cancel is False or Spark is not installed.
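Conceptually, the wrapping looks like this (a sketch reusing the class from the earlier snippet; time_start, time_budget_s, and evaluation_function are illustrative names, not the actual variables in flaml.tune):

```python
# Inside the tuning loop, each trial evaluation is wrapped so that
# overrunning Spark jobs are cancelled as soon as the budget expires.
with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel, sc):
    result = evaluation_function(config)  # placeholder trial runner
```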
I also added a test, test_overtime, in test/spark/test_overtime.py. In that test the actual running time exceeds the time budget by only about 0.2s.
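The shape of such a test, as a sketch (the dataset, budget, and tolerance here are illustrative, not the exact file contents):

```python
import time

from flaml import AutoML
from sklearn.datasets import load_iris


def test_overtime():
    time_budget = 10  # seconds; placeholder value
    X_train, y_train = load_iris(return_X_y=True)  # placeholder dataset
    automl = AutoML()
    start = time.time()
    automl.fit(
        X_train,
        y_train,
        task="classification",
        time_budget=time_budget,
        use_spark=True,
        force_cancel=True,
    )
    elapsed = time.time() - start
    # With the monitor enabled, overshoot should stay small
    # (~0.2s observed in the PR's test run).
    assert elapsed <= time_budget + 1
```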
Checks