mlflow integration causes 429 error responses from Databricks when running multiple parallel jobs/ops #22633
Comments
I'm going to take a stab at creating a proper fix for this. Essentially, we need to keep state about the mlflow run id so subsequent resource inits don't have to fetch it again (even across multiprocess executions). Ideally, this would be set in the context of the run so it lives with the rest of the job state. My plan is to add an optional parameter for the mlflow run id (similar to the current parent run id value). When creating the config for the job you could pass that value, and subsequent inits would already have it in the context. If that parameter is not passed, then the first mlflow resource init would update the context for the job. If I can't update the context in the resource init, I'll probably leave the current logic to preserve backwards compatibility. I'm getting my local env set up and will report back when I have some updates. A rough sketch of the idea follows.
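A minimal sketch of that idea, assuming a hypothetical `mlflow_run_id` config field on a custom resource (this is not the current dagster-mlflow API): if a run id is supplied, the init re-attaches to the existing run instead of calling mlflow to search for and start one.

```python
# Hypothetical sketch of the proposed fix, not the shipped dagster-mlflow resource:
# reuse a known mlflow run id instead of hitting the mlflow API on every init.
import mlflow
from dagster import Field, InitResourceContext, resource


@resource(
    config_schema={
        "experiment_name": str,
        "mlflow_run_id": Field(str, is_required=False),  # hypothetical new field
    }
)
def cached_mlflow_tracking(init_context: InitResourceContext):
    run_id = init_context.resource_config.get("mlflow_run_id")
    if run_id is not None:
        # Re-attach to the existing run: no search/create calls are made,
        # so parallel resource inits stay under the Databricks QPS limit.
        mlflow.start_run(run_id=run_id)
        return mlflow
    # First init (no run id passed yet): keep the current behaviour and
    # start a fresh run, ideally stashing its id back into the run context.
    mlflow.set_experiment(init_context.resource_config["experiment_name"])
    mlflow.start_run()
    return mlflow
```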
PR up here: #22718
Dagster version
1.7.4
What's the issue?
Databricks MLflow has a maximum of 3 queries per second (QPS).
The Dagster mlflow integration makes ~3 different calls to the mlflow API whenever the resource initializes. These calls are redundant in the broader context of the job, since they only fetch the mlflow run id and start the run.
When you run multiple jobs with multiple fanned-out ops, you easily hit the QPS limit. This causes the resource init to fail and thus the entire job to fail as well.
What did you expect to happen?
Running jobs in parallel would succeed without issue. A proper fix would probably include better state handling so subsequent ops in a job don't need to query mlflow.
Some level of auto-retry (either at the op level or around the individual requests) would be ideal too; a sketch of what that could look like follows.
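As an illustration of the auto-retry idea (not something dagster-mlflow does today; the helper name and backoff values are made up), the mlflow calls made during resource init could be wrapped in exponential backoff on 429 responses:

```python
# Illustrative retry helper; assumes MlflowException exposes the HTTP status
# of the underlying 429 response. Not part of dagster-mlflow.
import time

import mlflow
from mlflow.exceptions import MlflowException


def with_backoff(fn, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Call fn, retrying with exponential backoff on rate-limit (429) errors."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except MlflowException as exc:
            # Re-raise anything that isn't a rate limit, or if retries are exhausted.
            if exc.get_http_status_code() != 429 or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Usage: start the run with retries instead of failing on the first 429.
run = with_backoff(mlflow.start_run, run_name="resilient_run")
```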
How to reproduce?
Create a job with multiple ops that use the dagster-mlflow integration (see the reproduction sketch after these steps)
Use a sensor to kick off many of those jobs in a single batch of run requests
See how some of the ops fail in resource init and thus fail the entire job
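A minimal reproduction sketch of those steps (the job, op, and sensor names plus the `experiment_name` value are illustrative):

```python
# Reproduction sketch: a job with several fanned-out ops that all require the
# dagster-mlflow resource, plus a sensor that emits many RunRequests at once.
from dagster import RunRequest, job, op, sensor
from dagster_mlflow import mlflow_tracking


@op(required_resource_keys={"mlflow"})
def train_shard(context):
    # Any mlflow call works; the 429s come from the resource init itself.
    context.resources.mlflow.log_metric("touched", 1)


@job(
    resource_defs={"mlflow": mlflow_tracking.configured({"experiment_name": "qps_repro"})}
)
def fanout_job():
    for i in range(10):  # each op triggers its own resource init
        train_shard.alias(f"train_shard_{i}")()


@sensor(job=fanout_job)
def kick_off_many(_context):
    # Many runs land at once; each op's resource init calls mlflow,
    # quickly exceeding the Databricks QPS limit and returning 429s.
    for i in range(5):
        yield RunRequest(run_key=f"repro_{i}")
```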
Deployment type
Dagster Helm chart
Deployment details
No response
Additional information
Error message from the Dagster UI for a failed resource init (our Databricks instance id has been redacted):
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.