Support for Deploying Terminating Pods as Brokers #437

Merged -- 39 commits merged into project-akri:main on Jan 13, 2022

Conversation

@kate-goldenring (Contributor) commented Dec 10, 2021

What this PR does / why we need it:
This adds support for deploying Jobs to devices discovered by the Akri Controller. Previously, Akri only supported deploying Pods that never terminated. Adding Jobs enables more scenarios for Akri, such as device management. It follows the design in this proposal project-akri/akri-docs#17. Some specific changes to call out:

  • The Configuration CRD: (Breaking change) The brokerPodSpec section of the Configuration changed to a brokerType enum with brokerPodSpec and brokerJobSpec options (see the sketch after this list).
  • The Controller: Logic to deploy a Pod or Job depending on the brokerType of the Configuration. Logic to handle the brokerType of a Configuration being updated -- will bring down and redeploy workloads and services.
  • The Agent: Logic to not bring down a Configuration if the brokerType changed.
  • Helm Charts: In rbac.yaml, extends the Controller's ClusterRole to enable it to create, delete, and modify Jobs. Enables providing a JobSpec in Configurations, where completions, parallelism, and backoffLimit are configurable. All other JobSpec customizations will need to be done by creating a Configuration YAML and modifying it; instructions for this can be added to our docs, just as they are for PodSpecs. Removed the default image for debugEcho -- docs will need to be updated for this.
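
For reference, here is a minimal sketch of the new Configuration shape. The field names (spec.brokerSpec.brokerPodSpec / spec.brokerSpec.brokerJobSpec) follow the validation paths used later in this PR; the apiVersion, discovery handler details, and values are illustrative only:

    apiVersion: akri.sh/v0
    kind: Configuration
    metadata:
      name: akri-debug-echo-foo
    spec:
      discoveryHandler:
        name: debugEcho
      brokerSpec:
        # Exactly one of brokerPodSpec or brokerJobSpec should be provided
        brokerJobSpec:
          completions: 1
          parallelism: 1
          backoffLimit: 2
          template:
            spec:
              containers:
              - name: broker
                image: busybox
              restartPolicy: Never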

Docs
I put in a PR with docs updating Pod deployment in tandem with this one: project-akri/akri-docs#21
It also contains documentation on Jobs.

Tests
This PR includes unit tests. Should we extend our e2e workflows to test Jobs? Maybe we just test a subset of our platform/K8s distro matrix for Jobs to prevent too many runs. If so, can that be tabled to a later PR so as not to overload this one?

Special notes for your reviewer:
I went ahead and built and published some containers for folks to try out this implementation. You can run this on this Kubernetes playground if you do not have a cluster:

  1. Clone and checkout this branch. Then package the helm chart locally:

    git clone git@github.com:kate-goldenring/akri.git --branch strategy-jobs akri-jobs
    helm package akri-jobs/deployment/helm
    
  2. Install Akri with a debugEcho Configuration that will deploy a Job to each (foo0 and foo1) discovered device. The Job will echo "Hello World" and then sleep for 5 seconds before terminating.

    helm install akri akri-0.8.0.tgz \
      --set agent.allowDebugEcho=true \
      --set debugEcho.discovery.enabled=true \
      --set debugEcho.discovery.image.tag=latest \
      --set controller.image.repository=ghcr.io/kate-goldenring/job-brokers/controller \
      --set controller.image.tag=v0.8.0-20220112_182137-amd64 \
      --set agent.image.repository=ghcr.io/kate-goldenring/job-brokers/agent \
      --set agent.image.tag=v0.8.0-20220107_184416-amd64 \
      --set debugEcho.configuration.brokerJob.image.repository=busybox \
      --set debugEcho.configuration.brokerJob.command[0]="sh" \
      --set debugEcho.configuration.brokerJob.command[1]="-c" \
      --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello World'" \
      --set debugEcho.configuration.brokerJob.command[3]="sleep 5" \
      --set debugEcho.configuration.enabled=true
    
  3. Say you are feeling more exuberant and want the Job to echo "Hello Amazing World"; you can do so by changing the brokerSpec. Upgrade the installation and watch Akri delete all the resources (Instances, Jobs) and recreate them, deploying the new Job (run kubectl logs on the Pods to verify).

    helm upgrade akri akri-0.8.0.tgz \
      --set agent.allowDebugEcho=true \
      --set debugEcho.discovery.enabled=true \
      --set debugEcho.discovery.image.tag=latest \
      --set controller.image.repository=ghcr.io/kate-goldenring/job-brokers/controller \
      --set controller.image.tag=v0.8.0-20220112_182137-amd64 \
      --set agent.image.repository=ghcr.io/kate-goldenring/job-brokers/agent \
      --set agent.image.tag=v0.8.0-20220107_184416-amd64 \
      --set debugEcho.configuration.brokerJob.image.repository=busybox \
      --set debugEcho.configuration.brokerJob.command[0]="sh" \
      --set debugEcho.configuration.brokerJob.command[1]="-c" \
      --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello Amazing World'" \
      --set debugEcho.configuration.brokerJob.command[3]="sleep 5" \
      --set debugEcho.configuration.enabled=true
    
  4. Test out some of the Job settings by modifying the Job to run twice per device (completions=2) in parallel (parallelism=2). These Helm settings map directly to the equivalently named fields of the Kubernetes JobSpec, as sketched after the command below.

    helm upgrade akri akri-0.8.0.tgz \
      --set agent.allowDebugEcho=true \
      --set debugEcho.discovery.enabled=true \
      --set debugEcho.discovery.image.tag=latest \
      --set controller.image.repository=ghcr.io/kate-goldenring/job-brokers/controller \
      --set controller.image.tag=v0.8.0-20220112_182137-amd64 \
      --set agent.image.repository=ghcr.io/kate-goldenring/job-brokers/agent \
      --set agent.image.tag=v0.8.0-20220107_184416-amd64 \
      --set debugEcho.configuration.brokerJob.image.repository=busybox \
      --set debugEcho.configuration.brokerJob.command[0]="sh" \
      --set debugEcho.configuration.brokerJob.command[1]="-c" \
      --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello Amazing World'" \
      --set debugEcho.configuration.brokerJob.command[3]="sleep 5" \
      --set debugEcho.configuration.brokerJob.parallelism=2 \
      --set debugEcho.configuration.brokerJob.completions=2 \
      --set debugEcho.configuration.enabled=true
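
For reference, the brokerJob Helm values above translate roughly to the following fields of the generated JobSpec (a sketch; the container name and restartPolicy shown here are illustrative, not taken from this PR):

    brokerSpec:
      brokerJobSpec:
        completions: 2        # the Job's Pod must run to completion twice per device
        parallelism: 2        # both completions may run at the same time
        template:
          spec:
            containers:
            - name: broker    # illustrative name
              image: busybox
              command: ["sh", "-c", "echo 'Hello Amazing World'", "sleep 5"]
            restartPolicy: Never   # illustrative; Jobs require Never or OnFailure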
    
  5. Pods are still deployed as usual, and they can also be updated.

    helm install akri akri-0.8.0.tgz \
      --set agent.allowDebugEcho=true \
      --set debugEcho.discovery.enabled=true \
      --set debugEcho.discovery.image.tag=latest \
      --set controller.image.repository=ghcr.io/kate-goldenring/job-brokers/controller \
      --set controller.image.tag=v0.8.0-20220112_182137-amd64 \
      --set agent.image.repository=ghcr.io/kate-goldenring/job-brokers/agent \
      --set agent.image.tag=v0.8.0-20220107_184416-amd64 \
      --set debugEcho.configuration.brokerPod.image.repository=nginx \
      --set debugEcho.configuration.enabled=true
    

Note on future extension
Many deployment scenarios can be fulfilled with Jobs just by adjusting parallelism (the number of identical Pods that should run concurrently) and completions (the number of times the Job's Pods need to successfully complete). However, an additional scenario could be enabled by adding support in the Controller for setting podAntiAffinity (by Instance name).
For example, you could specify that you want 3 Jobs running concurrently (parallelism=3 and completions=3) but at most 1 on each node. Limiting the Job to one Pod per node can be done by setting podAntiAffinity in the JobSpec's Pod template like so:

  spec:
    template:
      metadata:
        labels:
          akri.sh/instance: akri-instance-ha5h
      spec:
        containers:
        - name: pi
          image: perl
          command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        restartPolicy: Never
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: akri.sh/instance
                  operator: In
                  values:
                  - akri-instance-ha5h
              topologyKey: "kubernetes.io/hostname"

This would need to be done by the controller rather than in the Helm charts as it is likely desirable to set podAntiAffinity per device/instance rather than per Configuration.

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • version has been updated appropriately (./version.sh)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

@kate-goldenring (Contributor Author) commented:

It would be good to figure out how to break these big PRs into several smaller ones. No action needed for this one.

One way would be to make the Configuration CRD change in a first PR and then add the implementation in another.

@kate-goldenring (Contributor Author) commented Jan 6, 2022:

> When a job spec is changed, do you kill old jobs and schedule new jobs?
>
> edit: I think I found the answer to this is yes. Do we need to rethink this? What if a firmware update of a device is in progress and you kill the job? You might brick the device that is being flashed.

Good point. One thing we could do is never delete a Job -- let them stick around as "Completed" (which may be useful anyway for log purposes) -- and just deploy the new Job with an anti-affinity that keeps it from running so long as another Job is running.

);
}
Some(broker) => {
if !should_recreate_config(&config, &prev_config_state) {
Collaborator:

Sorry, it has been a while since I've looked at this code ... should_recreate_config is confusing to me.

Here we are only acting if it returns false. It seems like we used to do nothing if it returned false.

The name is also confusing to me. Are we actually recreating a Configuration? Or are we just executing code as though the Configuration is deleted and recreated?

@kate-goldenring (Contributor Author):

I agree the name is unhelpful. It works in the Agent's context but not the Controller's. It may be best to rename it something like broker_changed or, more specifically, only_broker_changed. This check is done by the Agent upon a Configuration change: if true, it does not recreate the Configuration but rather lets the change stand (no further action). The Controller does the same check, and if true, it redeploys the brokers to reflect the new Configuration. This is our first step at more explicit Configuration modification handling. For this PR, we could leave it out and stick to the old way of deleting and reapplying a Configuration, but that means Instances will be brought down and redeployed. We need to figure out how to handle this gracefully, and this was my first stab at it. It does mean, though, that if the Controller goes down for some reason, a Configuration may be applied that does not reflect the system (since the Controller was not there to redeploy the brokers).

@kate-goldenring (Contributor Author) commented Jan 7, 2022:

@bfjelds @jiria @romoh on this note, I am thinking that, at least for this PR, we should remove support for handling broker changes and stick to our old approach of the Agent deleting and reapplying modified Configurations. This would make this PR smaller, as it would remove the Configuration watcher from the Controller, removing the "Configuration Event" section of the flow diagram here for the time being.

Contributor:

All in favor of splitting into smaller PRs.

Another thought on how to handle Controller restarts: we could cache the Configuration that was used to create a broker as part of a broker annotation. Then the Controller could check whether the broker is based on the latest Configuration or not.

@kate-goldenring (Contributor Author):

Good point. That can also be done simply via the Job name. I am using the Configuration generation in the Job name. However, someone could feasibly delete a Configuration and reapply a new one while the Controller is down, and then the Controller would wrongly think it has the right state.
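
For illustration only (not part of this PR), a broker Job that embeds the owning Configuration's metadata.generation in its name might look like the following, letting the Controller compare that generation against the current Configuration; the exact naming pattern here is hypothetical:

    apiVersion: batch/v1
    kind: Job
    metadata:
      # hypothetical pattern: <instance name>-<configuration generation>-job
      name: akri-instance-ha5h-2-job
      labels:
        akri.sh/instance: akri-instance-ha5h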

@kate-goldenring (Contributor Author) commented:

> When a job spec is changed, do you kill old jobs and schedule new jobs?
> edit: I think I found the answer to this is yes. Do we need to rethink this? What if a firmware update of a device is in progress and you kill the job? You might brick the device that is being flashed.
>
> Good point. One thing we could do is never delete a Job -- let them stick around as "Completed" (which may be useful anyway for log purposes) -- and just deploy the new Job with an anti-affinity that keeps it from running so long as another Job is running.

@jiria anti-affinity cannot be set in a JobSpec; however, we could set it in the Pod specs as illustrated in the "Note on future extension" section of the PR description above. That said, it would restrict any two Pods of a Job from running on the same node, which seems okay, but there may be an (unlikely) scenario where someone wants two identical Pods of a Job (parallelism=2+) to run on the same node at once.

Resolved review threads: controller/src/util/instance_action.rs (outdated), shared/src/k8s/job.rs, shared/src/k8s/job.rs
@kate-goldenring (Contributor Author) commented Jan 7, 2022:

@romoh @jiria @bfjelds to reduce the size of this PR, I have removed the Configuration watcher (config_action.rs, configuration_state.rs) from the Controller and am ignoring brokerSpec changes in the Agent (config_action.rs) -- motivated by this note. Not only was this the most confusing section (the Controller and Agent coordinating Configuration recreation, letting Jobs finish before deleting them, managing Configuration state), but it is also hard to justify: it was added only to enable modifying a brokerSpec without bringing down and recreating Instances. Since Instances are still directly tied to a Configuration, this seems unnecessary. I have updated the PR description -- the examples point to newly built containers. There should now be much less to review. I also updated the proposal and added a section on concerns over gracefully handling Configuration changes. Gracefully handling Configuration changes is a discussion we can leave for another proposal and PR.

@jiria (Contributor) commented Jan 11, 2022:

> When a job spec is changed, do you kill old jobs and schedule new jobs?
> edit: I think I found the answer to this is yes. Do we need to rethink this? What if a firmware update of a device is in progress and you kill the job? You might brick the device that is being flashed.
>
> Good point. One thing we could do is never delete a Job -- let them stick around as "Completed" (which may be useful anyway for log purposes) -- and just deploy the new Job with an anti-affinity that keeps it from running so long as another Job is running.
>
> @jiria anti-affinity cannot be set in a JobSpec; however, we could set it in the Pod specs as illustrated in the "Note on future extension" section of the PR description above. That said, it would restrict any two Pods of a Job from running on the same node, which seems okay, but there may be an (unlikely) scenario where someone wants two identical Pods of a Job (parallelism=2+) to run on the same node at once.

Why do we need the anti-affinity in this case?

Re cleanup, IMHO some cleanup might be useful, but perhaps not killing the running Jobs.

Also, from another point of view, we might not want to have two firmware updates running at the same time. Is there a good way for the two Jobs to signal each other?

@kate-goldenring (Contributor Author) commented:

> When a job spec is changed, do you kill old jobs and schedule new jobs?
> edit: I think I found the answer to this is yes. Do we need to rethink this? What if a firmware update of a device is in progress and you kill the job? You might brick the device that is being flashed.
>
> Good point. One thing we could do is never delete a Job -- let them stick around as "Completed" (which may be useful anyway for log purposes) -- and just deploy the new Job with an anti-affinity that keeps it from running so long as another Job is running.
>
> @jiria anti-affinity cannot be set in a JobSpec; however, we could set it in the Pod specs as illustrated in the "Note on future extension" section of the PR description above. That said, it would restrict any two Pods of a Job from running on the same node, which seems okay, but there may be an (unlikely) scenario where someone wants two identical Pods of a Job (parallelism=2+) to run on the same node at once.
>
> Why do we need the anti-affinity in this case?
>
> Re cleanup, IMHO some cleanup might be useful, but perhaps not killing the running Jobs.
>
> Also, from another point of view, we might not want to have two firmware updates running at the same time. Is there a good way for the two Jobs to signal each other?

@jiria per this comment and the changes associated with it, I removed all graceful handling of Configuration changes from this PR. As this thread shows, we have plenty to discuss here. For now, Jobs will be deleted immediately upon Configuration changes, just like Pods are; we don't do anything graceful for them currently either. See this diagram from the proposal PR for a visual. As to your questions, "anti-affinity" was the signaling I was hoping for, i.e., I was looking for a way for a Job to know another Job is running and therefore not run. Anti-affinity cannot do that for Jobs. I agree that we should think this through for firmware update settings and make a plan.

// Misplaced `resources`
// Valid: .request.object.spec.brokerSpec.brokerPodSpec.containers[*].resources
// Invalid: .request.object.spec.brokerSpec.brokerPodSpec.resources
const INVALID_BROKER_POD_SPEC: &str = r#"
Collaborator:

I think it would be good to have an invalid test where both Job and Pod specs are defined.

@kate-goldenring (Contributor Author):

My understanding is that that type of request would never even get to the admission controller, because it would never be correctly applied to the cluster. I believe a Configuration must first be applied successfully (i.e., match the expected CRD format) and is then passed to the admission controller as an AdmissionReview. So this test is better suited as a unit test in configuration.rs, affirming that such a request does not pass the expectation that it is of the Configuration CRD type.

@kate-goldenring (Contributor Author):

I had the wrong thinking around this. Currently, our CRD does not specify oneOf with regards to brokerSpec. It seems that we cannot use the oneOf structural schema specifier, since we need to set x-kubernetes-preserve-unknown-fields: true. We should do more investigation here to see whether our Configuration CRD can prevent specifying both broker types -- that way the error would be rejected at apply time rather than caught by our application, which errors out. For now, here is the test that asserts that serde will catch this and error out the Agent: 20422dc
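
For context, the invalid case under discussion is a Configuration whose brokerSpec supplies both broker types, e.g. (illustrative snippet; values are placeholders):

    spec:
      brokerSpec:
        brokerPodSpec:
          containers:
          - name: broker
            image: nginx
        brokerJobSpec:
          template:
            spec:
              containers:
              - name: broker
                image: busybox
              restartPolicy: Never

Since the CRD schema cannot express the oneOf constraint here, it is the Agent's serde deserialization that rejects this today (per the linked test).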

@kate-goldenring (Contributor Author) commented:

@romoh @jiria this should be ready for a final look

@kate-goldenring merged commit efffe84 into project-akri:main on Jan 13, 2022
@adithyaj mentioned this pull request on Nov 7, 2022