
Trial is being repeated with the exact same results #17257

Open
lucaguarro opened this issue Jul 22, 2021 · 2 comments
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), tune (Tune-related issues)

@lucaguarro

I am using an ASHAScheduler defined like so:

from ray.tune.schedulers import ASHAScheduler

scheduler = ASHAScheduler(
    metric="eval_s_f1",
    mode="max",
    max_t=1000,
    grace_period=30,
    reduction_factor=1.5)

and passing it to huggingface's ray-tune integration function like so:

# the loggers passed below come from Ray Tune
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger

trainer.hyperparameter_search(
    hp_space=lambda _: tune_config_ASHA,
    backend="ray",
    compute_objective=my_objective,
    direction="maximize",
    n_trials=num_samples,
    resources_per_trial={
        "cpu": 2,
        "gpu": gpus_per_trial
    },
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    stop={"training_iteration": 1} if smoke_test else None,
    progress_reporter=reporter,
    local_dir="~/ray_results/",
    name="tune_transformer",
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
    time_budget_s=60*60*10)  # 10 hours

The issue is that one of the trials appears to be re-executing. I think this because the currently running trial is yielding metrics that have already been reported to wandb for the same training iterations. In addition, the currently running trial is not having its results reported to wandb, even though the previous trials in the same run were. What is going on? Did my run get corrupted somehow?

I find this to be a strange bug, so I am not sure what info I should provide. Let me know if there are any other details I could provide that would help.

@richardliaw added the bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), and tune (Tune-related issues) labels on Jul 22, 2021
@richardliaw
Contributor

Hey @lucaguarro, thanks for reporting. The WandbLogger may not be set up correctly here.

  1. Could you provide the output that you're seeing on the terminal along with the wandb output?
  2. It'd be great to also note the expected behavior.
  3. Is there a small repro script that I can run to reproduce this issue?
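
For reference, here is a minimal sketch of how the Tune WandbLogger is usually wired up (this assumes Ray ~1.4; my_trainable, the project name, and the API key file path are placeholders, not taken from your setup):

from ray import tune
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger

def my_trainable(config):
    # placeholder trainable: report a dummy metric so the logger has something to log
    for _ in range(10):
        tune.report(eval_s_f1=config["lr"])

tune.run(
    my_trainable,
    config={
        "lr": tune.loguniform(1e-5, 1e-3),
        # the WandbLogger reads its settings from this "wandb" sub-dict of the config
        "wandb": {
            "project": "my_project",             # placeholder project name
            "api_key_file": "~/.wandb_api_key",  # placeholder path to a wandb API key
        },
    },
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
)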

@lucaguarro
Author

lucaguarro commented Jul 22, 2021

  1. I can provide an example. First, note that 1 training iteration corresponds to 5 logging steps, so training iteration 45 corresponds to step 45*5 = 225.
    This is the Ray Tune output for iteration 45:
    [screenshot: Ray Tune progress output for iteration 45]
    And this is the WandB output for the same iteration (step 225):
    [screenshot: WandB chart at step 225]
    Where I have placed the cursor shows the value for that iteration. As you can see, the accuracy, loss, and f1 scores are all the same as in the Ray Tune output; these values had already been reported to WandB. In fact, WandB reports that the trial is finished.

  2. The expected behavior would be for Ray Tune to allocate resources to a pending trial instead of one that has already been run.

  3. The current script I am running would take a long time to confirm whether it reproduces this bug (it has been running for about 7 hours now). I can see if I can get the same effect with a less compute-intensive script; a rough sketch of what that might look like is below.
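
A rough sketch of the lighter repro I have in mind (assuming Ray ~1.4; the dummy objective just sleeps and reports a synthetic eval_s_f1, and the wandb project name and API key path are placeholders):

import random
import time

from ray import tune
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger
from ray.tune.schedulers import ASHAScheduler

def dummy_objective(config):
    # cheap stand-in for the transformer training loop
    score = 0.0
    for _ in range(1000):
        time.sleep(0.1)
        score = max(score, random.random() * config["lr"])
        tune.report(eval_s_f1=score)

scheduler = ASHAScheduler(
    metric="eval_s_f1",
    mode="max",
    max_t=1000,
    grace_period=30,
    reduction_factor=1.5)

tune.run(
    dummy_objective,
    config={
        "lr": tune.loguniform(1e-5, 1e-1),
        "wandb": {
            "project": "tune_dupe_repro",        # placeholder project name
            "api_key_file": "~/.wandb_api_key",  # placeholder path to a wandb API key
        },
    },
    num_samples=20,
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
    time_budget_s=60 * 30,  # 30 minutes instead of 10 hours
    local_dir="~/ray_results/",
    name="tune_dupe_repro")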

@krfricke added the P2 (Important issue, but not time-critical) label and removed the P1 (Issue that should be fixed within a few weeks) label on Apr 26, 2022