
Trial is being repeated with the exact same results #17257

Open
lucaguarro opened this issue Jul 22, 2021 · 2 comments
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), tune (Tune-related issues)

@lucaguarro

I am using an ASHAScheduler defined like so:

from ray.tune.schedulers import ASHAScheduler

scheduler = ASHAScheduler(
    metric="eval_s_f1",
    mode="max",
    max_t=1000,
    grace_period=30,
    reduction_factor=1.5)

and passing it to huggingface's ray-tune integration function like so:

# the loggers passed below come from Ray Tune
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger

trainer.hyperparameter_search(
    hp_space=lambda _: tune_config_ASHA,
    backend="ray",
    compute_objective=my_objective,
    direction="maximize",
    n_trials=num_samples,
    resources_per_trial={
        "cpu": 2,
        "gpu": gpus_per_trial
    },
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    stop={"training_iteration": 1} if smoke_test else None,
    progress_reporter=reporter,
    local_dir="~/ray_results/",
    name="tune_transformer",
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
    time_budget_s=60*60*10)  # 10 hours

The issue is that one of the trials appears to be re-executing. I think this because the currently running trial is yielding metrics that have already been reported to wandb for the same training iterations. In addition, the currently running trial is not having its results reported to wandb, even though the previous trials in the same run were. What is going on? Did my run get corrupted somehow?

I find this to be a strange bug, so I am not sure what info I should provide. Let me know if there are any other details I could provide that would help.

@richardliaw added the bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), and tune (Tune-related issues) labels on Jul 22, 2021
@richardliaw
Contributor

Hey @lucaguarro, thanks for reporting. The WandbLogger may not be set up correctly here.

  1. Could you provide the output that you're seeing on the terminal along with the wandb output?
  2. It'd be great to also note the expected behavior.
  3. Is there a small repro script that I can run to reproduce this issue?
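
For reference, here is a minimal sketch of how the Tune WandbLogger is usually wired up (this assumes Ray ~1.4; my_trainable, the project name, and the API key file path are placeholders, not taken from your setup):

from ray import tune
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger

def my_trainable(config):
    # placeholder trainable: report a dummy metric so the logger has something to log
    for _ in range(10):
        tune.report(eval_s_f1=config["lr"])

tune.run(
    my_trainable,
    config={
        "lr": tune.loguniform(1e-5, 1e-3),
        # the WandbLogger reads its settings from this "wandb" sub-dict of the config
        "wandb": {
            "project": "my_project",             # placeholder project name
            "api_key_file": "~/.wandb_api_key",  # placeholder path to a wandb API key
        },
    },
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
)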

@lucaguarro
Author

lucaguarro commented Jul 22, 2021

  1. I can provide an example. First, note that 1 training iteration corresponds to 5 logging steps, so training iteration 45 corresponds to step 45*5 = 225.
    This is the Ray Tune output for iteration 45:
    [screenshot: Ray Tune progress output for iteration 45]
    And this is the WandB output for the same iteration (step 225):
    [screenshot: WandB chart at step 225]
    Where I have placed the cursor shows the value for that iteration. As you can see, the accuracy, loss, and f1 scores are all the same as in the Ray Tune output; these values had already been reported to WandB. In fact, WandB reports that the trial is finished.

  2. The expected behavior would be for Ray Tune to allocate resources to a pending trial instead of one that has already been run.

  3. The current script I am running would take a long time to confirm whether it reproduces this bug (it has been running for about 7 hours now). I can see if I can get the same effect with a less compute-intensive script; a rough sketch of what that might look like is below.
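
A rough sketch of the lighter repro I have in mind (assuming Ray ~1.4; the dummy objective just sleeps and reports a synthetic eval_s_f1, and the wandb project name and API key path are placeholders):

import random
import time

from ray import tune
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger
from ray.tune.schedulers import ASHAScheduler

def dummy_objective(config):
    # cheap stand-in for the transformer training loop
    score = 0.0
    for _ in range(1000):
        time.sleep(0.1)
        score = max(score, random.random() * config["lr"])
        tune.report(eval_s_f1=score)

scheduler = ASHAScheduler(
    metric="eval_s_f1",
    mode="max",
    max_t=1000,
    grace_period=30,
    reduction_factor=1.5)

tune.run(
    dummy_objective,
    config={
        "lr": tune.loguniform(1e-5, 1e-1),
        "wandb": {
            "project": "tune_dupe_repro",        # placeholder project name
            "api_key_file": "~/.wandb_api_key",  # placeholder path to a wandb API key
        },
    },
    num_samples=20,
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
    time_budget_s=60 * 30,  # 30 minutes instead of 10 hours
    local_dir="~/ray_results/",
    name="tune_dupe_repro")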

@krfricke added the P2 (Important issue, but not time-critical) label and removed the P1 (Issue that should be fixed within a few weeks) label on Apr 26, 2022