
mlflow_mixin should nest mlflow runs. This prevents the Ray Tune + MLflow scenario from working on Azure ML. #19909

Open
1 of 2 tasks
bstollnitz opened this issue Oct 30, 2021 · 5 comments
Labels
bug Something that is supposed to be working; but isn't P2 Important issue, but not time-critical tune Tune-related issues

Comments

@bstollnitz

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

I tried to use Ray Tune + MLflow + Azure ML by following the "MLflow Mixin API" approach detailed in these docs: https://docs.ray.io/en/latest/tune/tutorials/tune-mlflow.html#mlflow-mixin-api, and then running training on Azure. Typically Azure understands MLflow nested runs and is able to show separate graphs for the metrics in each child run. However, if I add Ray Tune to the mix, the metric readings from all tune trials get dumped into a single graph on a single run. This makes the Ray Tune + MLflow + Azure ML scenario unusable.

Setting nested=True when the mlflow run is started in the mlflow_mixin might fix this issue.
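
For reference, this is the plain-MLflow nested-run pattern (without Ray Tune) that Azure ML renders correctly; a minimal sketch with placeholder parameter and metric names:

import mlflow

with mlflow.start_run():
    for trial in range(3):
        # Each child run is started with nested=True, so MLflow records the
        # mlflow.parentRunId tag that Azure ML uses to group runs together.
        with mlflow.start_run(nested=True):
            mlflow.log_param("trial_index", trial)
            mlflow.log_metric("loss", 1.0 / (trial + 1))  # placeholder metric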

Versions / Dependencies

python==3.8.10
ray[tune]==1.6.0
mlflow==1.20.2
azureml-core==1.34.0
azureml-pipeline==1.34.0
azureml-mlflow==1.34.0
azureml-defaults

Reproduction script

Here's the minimal scenario that reproduces the issue: https://docs.ray.io/en/latest/tune/tutorials/tune-mlflow.html#mlflow-mixin-api
A simple verification would be to look at the mlruns output and make sure that the tune trial runs all have a parent ID. With this in place, the scenario should work on Azure ML.
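
One way to do that check programmatically; a minimal sketch assuming a local ./mlruns directory and a placeholder experiment name:

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="file:./mlruns")  # assumption: local file store
experiment = client.get_experiment_by_name("mixin_example")  # placeholder experiment name
for run in client.search_runs([experiment.experiment_id]):
    # Every tune trial run should carry this tag if nesting is set up correctly.
    print(run.info.run_id, run.data.tags.get("mlflow.parentRunId"))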

In case you want to verify on Azure ML, here are instructions on how to submit a training job: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-cli
Please feel free to reach out if you'd like me to verify on Azure ML.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@bstollnitz bstollnitz added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 30, 2021
@stale

stale bot commented Feb 27, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Feb 27, 2022
@bstollnitz
Author

@amogkam - It seems that you wrote the original code for this feature. Can you please take a look?

@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 7, 2022
@maggielkx

maggielkx commented Mar 11, 2022

Hi @bstollnitz, I am in a similar scenario: MLflow + Ray Tune + Azure, where the child runs are not automatically logged in Azure experiments. The current workaround I have is:

from ray.tune.integration.mlflow import mlflow_mixin
import mlflow

@mlflow_mixin
def train_func(config):
    # The run started by the mixin acts as the parent run for this trial.
    run_id = mlflow.active_run().info.run_id
    with mlflow.start_run(nested=True):
        # <your_own_child_run_code>
        # Tag the child run so that Azure ML links it back to its parent.
        mlflow.set_tag("mlflow.parentRunId", run_id)

Basically, I checked the source code of tune.run() and the mlflow_mixin function, then printed the mlflow attributes at the beginning of my own train_func(), and I noticed that although the nested run is set to True, Ray cannot override the mlflow instance that Azure uses under the hood. Therefore I set the parent run ID tag manually so that Azure recognizes each sub-run. Hope it helps!
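
For completeness, here is a sketch of how a trainable like the one above is typically launched with the mixin, following the Ray 1.x docs linked in the issue; the experiment name below is a placeholder:

from ray import tune
import mlflow

analysis = tune.run(
    train_func,
    config={
        # Required by @mlflow_mixin: tells Ray which MLflow experiment/server to use.
        "mlflow": {
            "experiment_name": "my_experiment",  # placeholder name
            "tracking_uri": mlflow.get_tracking_uri(),
        },
        # ... your hyperparameter search space goes here as well ...
    },
)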

@amogkam amogkam added tune Tune-related issues P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 18, 2022
@WaterKnight1998

WaterKnight1998 commented Jul 12, 2022

@maggielkx This also worked for me, thank you very much :)

It would be good if tune.run(...).best_config returned the ID of the MLflow run. I tried logging the run ID in tune.report but it didn't work :(
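
Not a proper fix, but one possible way to recover the child runs afterwards is MLflow's search API; a sketch assuming the mlflow.parentRunId tag from the workaround above is set and that you recorded the parent run ID somewhere:

import mlflow

parent_run_id = "..."  # the run ID of the parent (mixin) run, recorded elsewhere

# Returns a pandas DataFrame with one row per child run of that parent.
# By default this searches the currently active experiment.
children = mlflow.search_runs(
    filter_string=f"tags.`mlflow.parentRunId` = '{parent_run_id}'"
)
print(children[["run_id", "status"]])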

@steveepreston

steveepreston commented Sep 26, 2024

Any update on this?
Because @mlflow_mixin is deprecated, and it seems setup_mlflow() doesn't support nested=True.
