add PySparkOvertimeMonitor to avoid exceeding time budget #923
Conversation
@microsoft-github-policy-service agree company="Microsoft"
@levscaut Thank you for the contribution! Could you remove the file
Also, please remove the unnecessary changes in the test folders; only check in your code/docs changes.
@thinkall Thanks for the advice! I have committed a cleaner version.
A test on macOS failed: https://github.com/microsoft/FLAML/actions/runs/4219786733/jobs/7325478207#step:6:84
This file is not needed.
in https://github.com/microsoft/FLAML/actions/runs/4220014982/jobs/7325940952
Looks good to me.
Co-authored-by: Chi Wang <[email protected]>
Related work items: microsoft#493, microsoft#777, microsoft#820, microsoft#837, microsoft#843, microsoft#848, microsoft#849, microsoft#850, microsoft#853, microsoft#855, microsoft#857, microsoft#869, microsoft#870, microsoft#888, microsoft#894, microsoft#923, microsoft#924, microsoft#925, microsoft#934, microsoft#952, microsoft#962, microsoft#973, microsoft#975, microsoft#995
Why are these changes needed?
For very large datasets where a single trial takes a long time to complete, the actual running time can be far longer than the time budget. It is common for a new trial to start just as the time budget is about to run out, and the current management system then has to wait until that trial completes. In the worst case, the actual running time is the time budget plus the duration of one full trial. For large datasets this extra cost is unacceptable.
To keep the time cost within the time budget, this PR introduces PySparkOvertimeMonitor: a context manager that starts a monitor thread to cancel Spark jobs once the running time exceeds the time budget. When the running jobs are cancelled, you can still fetch the last finished result.
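For illustration, here is a minimal sketch of such a monitor. The class name matches the PR, but the constructor arguments, the polling interval, and the use of `SparkContext.cancelAllJobs()` are assumptions about the approach, not the actual FLAML implementation:

```python
import threading
import time


class PySparkOvertimeMonitor:
    """Sketch: cancel running Spark jobs once the time budget is exhausted.

    The real class lives in flaml.tune.spark.utils; the argument names
    below (time_start, time_budget_s, force_cancel, sc) are illustrative.
    """

    def __init__(self, time_start, time_budget_s, force_cancel, sc):
        self._deadline = time_start + time_budget_s
        self._force_cancel = force_cancel
        self._sc = sc  # an active pyspark.SparkContext, or None
        self._stop = threading.Event()

    def _monitor(self):
        # Poll until the deadline passes or the wrapped block finishes.
        while not self._stop.wait(timeout=0.1):
            if time.time() >= self._deadline:
                # cancelAllJobs() is a real SparkContext method; it stops
                # all active jobs so control returns to the caller promptly.
                self._sc.cancelAllJobs()
                break

    def __enter__(self):
        if self._force_cancel and self._sc is not None:
            self._thread = threading.Thread(target=self._monitor, daemon=True)
            self._thread.start()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._force_cancel and self._sc is not None:
            self._stop.set()
            self._thread.join()
        return False  # do not suppress exceptions from the wrapped block
```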
You can set the argument force_cancel to True in the AutoML settings to enable the monitor; it is False by default.
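For example, assuming a typical AutoML setup (the force_cancel flag comes from this PR; the data, task, and other settings below are placeholders):

```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True)  # placeholder dataset

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    time_budget=30,     # seconds
    use_spark=True,     # run trials as Spark jobs
    force_cancel=True,  # enable PySparkOvertimeMonitor (default: False)
)
```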
PySparkOvertimeMonitor lives in flaml.tune.spark.utils. As a context manager, it wraps the running parts of both the sequential and the Spark-parallel code paths in flaml.tune, as sketched below. It is a no-op when force_cancel is False or Spark is not installed.
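Conceptually, the wrapping looks like this (a sketch reusing the class from the earlier snippet; time_start, time_budget_s, and evaluation_function are illustrative names, not the actual variables in flaml.tune):

```python
# Inside the tuning loop, each trial evaluation is wrapped so that
# overrunning Spark jobs are cancelled as soon as the budget expires.
with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel, sc):
    result = evaluation_function(config)  # placeholder trial runner
```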
I also added a test, test_overtime, in test/spark/test_overtime.py. In that test the actual running time exceeds the time budget by only about 0.2s.
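The shape of such a test, as a sketch (the dataset, budget, and tolerance here are illustrative, not the exact file contents):

```python
import time

from flaml import AutoML
from sklearn.datasets import load_iris


def test_overtime():
    time_budget = 10  # seconds; placeholder value
    X_train, y_train = load_iris(return_X_y=True)  # placeholder dataset
    automl = AutoML()
    start = time.time()
    automl.fit(
        X_train,
        y_train,
        task="classification",
        time_budget=time_budget,
        use_spark=True,
        force_cancel=True,
    )
    elapsed = time.time() - start
    # With the monitor enabled, overshoot should stay small
    # (~0.2s observed in the PR's test run).
    assert elapsed <= time_budget + 1
```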
Checks