An Indexed Job mode that allows every index to execute #109712

@ahg-g

What would you like to be added?

A mode of operation for Jobs with .spec.completionMode="Indexed" that allows every index to execute.

Currently this is not possible: when a job reaches its .spec.backoffLimit, its active pods are deleted and the job is declared failed, so no new pods are created for the indexes that have not executed yet. This happens more often when parallelism < completions.

I can think of two open issues:

  1. How to decide when to stop retrying a failed index. One approach is to apply backoffLimit at the index level. Tracking a retry count for every index in the job status would be challenging, so one solution is to give backoffLimit min semantics: each index is guaranteed at least backoffLimit retries. The job controller tracks the count for each index in memory, and the status records only which indexes reached the limit, as a bitmap.
  2. Job failure status. In the simplest case we could just declare the job failed if at least one index failed, but we could also introduce an API that lets users tune this (perhaps based on a percentage or a min number of indexes).
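The two points above can be sketched roughly as follows. This is a minimal illustration, not the job controller's actual code: `indexTracker`, `recordFailure`, `simulateRestart`, `jobFailed`, and the `maxFailedIndexes` parameter are all hypothetical names, and treating the tunable threshold as a maximum count of failed indexes is only one possible reading of the proposal.

```go
package main

import "fmt"

// indexTracker is a hypothetical structure for illustration only.
type indexTracker struct {
	backoffLimit int    // retries each index is guaranteed at minimum
	failures     []int  // per-index failure counts, kept in controller memory only
	exhausted    []bool // the "bitmap" of indexes that hit the limit (would live in status)
}

func newIndexTracker(completions, backoffLimit int) *indexTracker {
	return &indexTracker{
		backoffLimit: backoffLimit,
		failures:     make([]int, completions),
		exhausted:    make([]bool, completions),
	}
}

// recordFailure notes one pod failure for the index and reports
// whether that index may still be retried.
func (t *indexTracker) recordFailure(index int) bool {
	if t.exhausted[index] {
		return false
	}
	t.failures[index]++
	if t.failures[index] > t.backoffLimit {
		t.exhausted[index] = true
		return false
	}
	return true
}

// simulateRestart models a controller restart: the in-memory counts are
// lost, the persisted bitmap survives. This is why the limit only has
// "min" semantics: an index may end up with more retries than backoffLimit.
func (t *indexTracker) simulateRestart() {
	t.failures = make([]int, len(t.failures))
}

// failedCount returns how many indexes reached the limit.
func (t *indexTracker) failedCount() int {
	n := 0
	for _, e := range t.exhausted {
		if e {
			n++
		}
	}
	return n
}

// jobFailed applies an assumed tunable threshold. With maxFailedIndexes = 0
// it is the simplest case: the job fails if at least one index failed.
func (t *indexTracker) jobFailed(maxFailedIndexes int) bool {
	return t.failedCount() > maxFailedIndexes
}

func main() {
	t := newIndexTracker(3, 1) // 3 completions, 1 retry per index
	fmt.Println("retry index 1:", t.recordFailure(1))
	fmt.Println("retry index 1:", t.recordFailure(1))
	fmt.Println("failed indexes:", t.failedCount())
	fmt.Println("job failed (strict):", t.jobFailed(0))
	fmt.Println("job failed (tolerate one):", t.jobFailed(1))
}
```

In this sketch, indexes 0 and 2 remain eligible to run after index 1 exhausts its retries, which is the behavior this issue asks for.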

Related comment to this issue: #109131 (comment)

Why is this needed?

There are cases where the indexes represent independent operations, so it is desirable to execute all of them before declaring the job complete.

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
wg/batch: Categorizes an issue or PR as relevant to WG Batch.

Projects

Status

Closed
