Make job fail fast when container starting failed #635

Merged
merged 1 commit into from
Jan 19, 2022

Conversation

zuston
Member

@zuston zuston commented Jan 19, 2022

Why

Currently, when a container fails to start on a NodeManager, the TonY AM does not detect it, and the job won't fail until the registration timeout is reached.

How

Make the job fail fast as soon as a container fails to start.
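The idea above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code merged in this PR: `FailFastSketch`, `ContainerStartListener`, `JobMonitor`, and `awaitStartFailure` are hypothetical names, and container IDs are plain strings. The callback name `onStartContainerError` mirrors the one in Hadoop YARN's `NMClientAsync` callback handler; the sketch assumes the AM is notified through a callback of that kind.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of fail-fast on container start failure; not TonY's actual classes. */
class FailFastSketch {

    /** Stand-in for the callback an AM receives when a NodeManager cannot start a container. */
    interface ContainerStartListener {
        void onStartContainerError(String containerId, Throwable cause);
    }

    /** Records the first start failure and wakes up whoever is waiting for registration. */
    static final class JobMonitor implements ContainerStartListener {
        private final AtomicReference<String> failureMessage = new AtomicReference<>();
        private final CountDownLatch failed = new CountDownLatch(1);

        @Override
        public void onStartContainerError(String containerId, Throwable cause) {
            String message = "Container " + containerId + " failed to start: " + cause.getMessage();
            // Keep only the first failure, then release the waiter immediately.
            if (failureMessage.compareAndSet(null, message)) {
                failed.countDown();
            }
        }

        /**
         * Blocks for at most timeoutMillis. Returns true as soon as a start failure
         * has been reported, or false if the timeout elapses without one.
         */
        boolean awaitStartFailure(long timeoutMillis) {
            try {
                return failed.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }

        /** The recorded failure message, or null if no container start has failed. */
        String failureMessage() {
            return failureMessage.get();
        }
    }
}
```

In a loop that would otherwise sleep until the registration timeout, the AM could instead wait on `awaitStartFailure` and mark the job as failed the moment it returns true, rather than after the full timeout.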

@zuston
Member Author

zuston commented Jan 19, 2022

@oliverhu Could you help review this?

@zuston zuston requested a review from oliverhu January 19, 2022 03:22
@zuston
Member Author

zuston commented Jan 19, 2022

By the way, I found that when the requested resource is invalid, TonY won't request it again.
But I think TonY could request resources again when a container fails to start, since in some cases the container's node may have been marked as lost.
But for now the job will fail directly, and it has to wait for all resources to be ready before starting.

@oliverhu
Member

makes sense..

@zuston zuston merged commit 61184f2 into tony-framework:master Jan 19, 2022
@zuston zuston mentioned this pull request Jan 21, 2022
2 participants