
add PySparkOvertimeMonitor to avoid exceeding time budget #923

Merged · 21 commits · Feb 24, 2023

Conversation

levscaut (Collaborator)

@thinkall

Why are these changes needed?

For a very large dataset, where a single trial takes a long time to complete, the actual running time may far exceed the time budget. Typically, when the time budget is about to run out, a new trial has just been started, and the current management system must wait until that trial completes. In the worst case, the actual running time is the time budget plus the duration of one full trial. For large datasets this extra time cost is unacceptable.

To keep the time cost within the time budget, this PR introduces PySparkOvertimeMonitor. It is a context manager that creates a monitor thread to terminate Spark jobs once the running time exceeds the time budget. When the running jobs are cancelled, you can still fetch the last finished result.

You can enable the monitor by setting the argument force_cancel to True in the AutoML settings; it defaults to False.

    automl_experiment = AutoML()
    automl_settings = {
        ...,
        "force_cancel": True,
    }

    automl_experiment.fit(**automl_settings)

I put PySparkOvertimeMonitor in flaml.tune.spark.utils. Acting as a context manager, it wraps the running parts of both the sequential and the Spark-parallel paths in flaml.tune. Don't worry: it is a no-op when force_cancel is False or Spark is not installed.

with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel, parallel=parallel):
    results = parallel(
        delayed(evaluation_function)(trial_to_run.config)
        for trial_to_run in trials_to_run
    )

with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel):
    result = evaluation_function(trial_to_run.config)
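
The monitor's internals are not shown in this PR description; as a rough illustration of the idea only (hypothetical class and `cancel_jobs` callback, not the actual PySparkOvertimeMonitor, which wires up the real Spark job cancellation), a budget-enforcing context manager might be sketched like this:

```python
import threading
import time


class OvertimeMonitor:
    """Minimal sketch of a budget-enforcing context manager.

    Hypothetical illustration only; the real PySparkOvertimeMonitor in
    flaml.tune.spark.utils cancels the actual Spark jobs.
    """

    def __init__(self, time_start, time_budget_s, force_cancel, cancel_jobs=None):
        self.deadline = time_start + time_budget_s
        self.force_cancel = force_cancel
        # Callback invoked on timeout; in the real monitor this would be
        # the Spark job-cancellation call.
        self.cancel_jobs = cancel_jobs
        self.cancelled = False
        self._stop = threading.Event()

    def _watch(self):
        # Poll until the budget is exhausted or the wrapped body finishes.
        while not self._stop.is_set():
            if time.time() >= self.deadline:
                self.cancelled = True
                if self.cancel_jobs is not None:
                    self.cancel_jobs()
                return
            self._stop.wait(0.1)

    def __enter__(self):
        # The monitor thread only runs when force_cancel is enabled.
        if self.force_cancel:
            self._thread = threading.Thread(target=self._watch, daemon=True)
            self._thread.start()
        return self

    def __exit__(self, *exc):
        if self.force_cancel:
            self._stop.set()
            self._thread.join()
        return False
```

With force_cancel disabled the context manager does nothing, mirroring the no-op behavior described above.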

I also wrote a test called test_overtime in test/spark/test_overtime.py. In that test you can see the actual running time exceeds the time budget by only about 0.2s.
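
The shape of that check (a hypothetical helper, not the actual test, which runs an AutoML fit on Spark) can be sketched as:

```python
import time


def check_budget(run_trials, time_budget_s, slack_s=0.5):
    """Assert that run_trials() finishes within the time budget plus a
    small slack. Sketch of what test_overtime verifies; helper name and
    slack value are illustrative assumptions."""
    start = time.time()
    run_trials()
    elapsed = time.time() - start
    assert elapsed < time_budget_s + slack_s, (
        f"overran budget: {elapsed:.2f}s vs {time_budget_s}s"
    )
    return elapsed
```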


levscaut (Collaborator, Author)

@microsoft-github-policy-service agree company="Microsoft"

thinkall (Collaborator)

@levscaut Thank you for the contribution!

Could you remove the file automl_pyspark_df.py:Zone.Identifier?

https://github.com/microsoft/FLAML/actions/runs/4192036436/jobs/7267230259#step:2:425

thinkall (Collaborator)

Also, please remove unnecessary changes in test folders, only check in your code/docs changes.

levscaut (Collaborator, Author)

@thinkall Thanks for the advice! I have committed a cleaner version.

thinkall (Collaborator)

FAILED test/nlp/test_autohf.py::test_hf_data - ZeroDivisionError: float division by zero
FAILED test/spark/test_overtime.py::test_overtime - assert 1.2553856372833252 < 1
 +  where 1.2553856372833252 = abs((11.255385637283325 - 10))

in https://github.com/microsoft/FLAML/actions/runs/4220014982/jobs/7325940952

Review threads (resolved) on: flaml/tune/spark/utils.py, flaml/tune/tune.py, test/spark/custom_mylearner.py, test/spark/test_overtime.py
qingyun-wu (Contributor) left a comment
Looks good to me.

Review threads (resolved) on: flaml/automl/automl.py, test/spark/custom_mylearner.py, .gitignore