Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: enable horizontal autoscaling of Liquid Legions v2 Mill #1697

Merged
merged 1 commit into from
Jul 26, 2024

Conversation

SanjayVas
Copy link
Member

@SanjayVas SanjayVas commented Jul 19, 2024

This converts the LLv2 Mill daemon to a Job which processes a single work item. Jobs are scheduled via a new MillJobScheduler daemon.

@wfa-reviewable
Copy link

This change is Reviewable

@SanjayVas
Copy link
Member Author

This is an alternate approach to #1686

@SanjayVas SanjayVas changed the base branch from sanjayvas-mill-job-base to main July 19, 2024 00:08
@SanjayVas SanjayVas force-pushed the sanjayvas-mill-job-scheduler branch from f32e504 to 6e0ae9c Compare July 22, 2024 22:18
@SanjayVas SanjayVas force-pushed the sanjayvas-mill-job-scheduler branch 5 times, most recently from ba048d1 to 46efae8 Compare July 23, 2024 01:03
@SanjayVas SanjayVas force-pushed the sanjayvas-mill-job-scheduler branch from 46efae8 to 6bf7845 Compare July 23, 2024 18:22
@SanjayVas SanjayVas force-pushed the sanjayvas-mill-job-scheduler branch from 6bf7845 to 17941b7 Compare July 23, 2024 18:35
Copy link
Member

@kungfucraig kungfucraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 58 of 87 files at r1, 5 of 8 files at r2, 8 of 14 files at r4, 2 of 5 files at r5.
Reviewable status: 70 of 87 files reviewed, 8 unresolved discussions (waiting on @SanjayVas)


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 69 at r6 (raw file):

/** Scheduler for Mill Kubernetes Jobs. */
class MillJobScheduler(

Any reason the scheduler has to be LLv2 specific?


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 72 at r6 (raw file):

  private val duchyId: String,
  private val computationsStub: ComputationsGrpcKt.ComputationsCoroutineStub,
  private val deploymentName: String,

nit: It's not clear what this is for without reading the code.

I wrote the above and then happened to skip to the bottom of the file and saw the flag definitions. A note in the ktdoc indicating these fields are documented as command line flags below would helop.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 115 at r6 (raw file):

    val computationType: ComputationType = computationTypeConfig.computationType

    val activeJobCount: Int =

This would avoid the reference to the issue.

Code snippet:

ownedJobs.filter({it.status.active ?:0) > 0}).count() 

src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 122 at r6 (raw file):

      }
    if (activeJobCount >= millType.maximumConcurrency) {
      logger.fine { "Not scheduling: Maximum concurrency limit reached for $millType" }

It'd be helpful to have the activeJobCount, and maybe the limit on the log line.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 153 at r6 (raw file):

    logger.info { "Scheduled Job $jobName for Computation $claimedComputationId" }
    if (activeJobCount + 1 >= millType.maximumConcurrency) {
      logger.info { "Mill type $millType is now at maximum concurrency" }

nit: add the max concurrency to the log line


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 162 at r6 (raw file):

  }

  private suspend fun Collection<V1Job>.deleteOverLimit(limit: Int, predicate: (V1Job) -> Boolean) {

When do we expect this function to have work to do?


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/shareshuffle/HonestMajorityShareShuffleMillFlags.kt line 17 at r6 (raw file):

package org.wfanet.measurement.duchy.deploy.common.daemon.mill.shareshuffle

import java.io.File

Why are these changes required in this PR?


src/main/terraform/gcloud/examples/duchy/main.tf line 74 at r6 (raw file):

  service_account = module.common.cluster_service_account
  machine_type    = "c2-standard-4"
  max_node_count  = 20

Is this the same value we expect to be passed to the scheduler? If so, is it possible for the scheduler to read it from the k8s API?

Copy link
Member Author

@SanjayVas SanjayVas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 70 of 87 files reviewed, 6 unresolved discussions (waiting on @kungfucraig)


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 69 at r6 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

Any reason the scheduler has to be LLv2 specific?

It isn't. It's intended to support multiple mill types. Indeed, that's the reason for the MillType enum. I'm just adding HMSS in a follow up PR.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 72 at r6 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

nit: It's not clear what this is for without reading the code.

I wrote the above and then happened to skip to the bottom of the file and saw the flag definitions. A note in the ktdoc indicating these fields are documented as command line flags below would helop.

Done. Added reference to Flags from KDoc.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 115 at r6 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

This would avoid the reference to the issue.

Done. Slightly better syntax though by passing the predicate to count rather than using filter.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 122 at r6 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

It'd be helpful to have the activeJobCount, and maybe the limit on the log line.

Done.

Note that some of these are at the FINE level.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 162 at r6 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

When do we expect this function to have work to do?

Added some documentation.

The basic issue is that we don't want to leave a bunch of completed Jobs/Pods around. Rather than using the completed TTL, we instead have a configurable history limit. This is similar to the approach used by CronJob.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/shareshuffle/HonestMajorityShareShuffleMillFlags.kt line 17 at r6 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

Why are these changes required in this PR?

I assume you mean the change to extract the common flags into MillFlags? It's been a longstanding TODO, and in this case it really helped when trying to figure out which flags needed to be moved/copied to the scheduler.

I would have done it in a pre-factoring PR, but had already moved some other things around such as the packages. If you feel strongly, I can still split it out as pre-factoring.


src/main/terraform/gcloud/examples/duchy/main.tf line 74 at r6 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

Is this the same value we expect to be passed to the scheduler? If so, is it possible for the scheduler to read it from the k8s API?

As discussed offline, this is not necessarily the same. It'll depend on the chosen machine types vs. the configured resource requirements for each job. It'll also depend on what else you're running in the node pool (or for EKS, node group), for example if you want to use the same pool for both HMSS and LLv2 Mill Jobs.

Copy link
Member

@kungfucraig kungfucraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 87 files at r1, 1 of 8 files at r2, 1 of 5 files at r5, 1 of 2 files at r7, all commit messages.
Reviewable status: 83 of 87 files reviewed, 1 unresolved discussion (waiting on @SanjayVas)


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 115 at r6 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

Done. Slightly better syntax though by passing the predicate to count rather than using filter.

Indeed. Didn't realize it took a predicate.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 75 at r7 (raw file):

 * See [MillJobScheduler.Flags] for additional parameter documentation.
 */
class MillJobScheduler(

How is this being tested?


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/shareshuffle/HonestMajorityShareShuffleMillFlags.kt line 17 at r6 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

I assume you mean the change to extract the common flags into MillFlags? It's been a longstanding TODO, and in this case it really helped when trying to figure out which flags needed to be moved/copied to the scheduler.

I would have done it in a pre-factoring PR, but had already moved some other things around such as the packages. If you feel strongly, I can still split it out as pre-factoring.

It's fine. I can live with it. :)

Copy link
Member

@kungfucraig kungfucraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 87 files at r1, 2 of 3 files at r6, 1 of 2 files at r7.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @SanjayVas)


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 75 at r7 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

How is this being tested?

We discussed this offline. Manual testing is probably okay. We also know that KinD tests the happy path. Perhaps there's some way to mock a few things and have some basic tests to show that this class manages concurrency and some of the other clean up tasks correctly?

Copy link
Member Author

@SanjayVas SanjayVas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 83 of 92 files reviewed, all discussions resolved (waiting on @kungfucraig and @stevenwarejones)


src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt line 75 at r7 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

We discussed this offline. Manual testing is probably okay. We also know that KinD tests the happy path. Perhaps there's some way to mock a few things and have some basic tests to show that this class manages concurrency and some of the other clean up tasks correctly?

Indeed, the Kubernetes tests exercise this but don't verify specific behavior (e.g. that the max concurrency limit is respected). I managed to add a unit test by mocking KubernetesClient.

@SanjayVas SanjayVas force-pushed the sanjayvas-mill-job-scheduler branch 2 times, most recently from fd8c443 to 89557f5 Compare July 26, 2024 00:51
Copy link
Member

@kungfucraig kungfucraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 10 files at r8, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @stevenwarejones)

@SanjayVas SanjayVas force-pushed the sanjayvas-mill-job-scheduler branch from 89557f5 to 36493c8 Compare July 26, 2024 21:07
Copy link
Collaborator

@stevenwarejones stevenwarejones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 66 of 87 files at r1, 4 of 8 files at r2, 7 of 14 files at r4, 3 of 5 files at r5, 2 of 3 files at r6, 9 of 10 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @SanjayVas)

@SanjayVas SanjayVas force-pushed the sanjayvas-mill-job-scheduler branch 2 times, most recently from c524b86 to be9f7fe Compare July 26, 2024 22:08
Copy link
Member Author

@SanjayVas SanjayVas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 13 files at r10.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @kungfucraig and @stevenwarejones)

@SanjayVas SanjayVas enabled auto-merge (squash) July 26, 2024 22:14
@SanjayVas SanjayVas merged commit 0a1d25f into main Jul 26, 2024
4 checks passed
@SanjayVas SanjayVas deleted the sanjayvas-mill-job-scheduler branch July 26, 2024 22:35
ple13 pushed a commit that referenced this pull request Aug 16, 2024
This converts the LLv2 Mill daemon to a Job which processes a single work item. Jobs are scheduled via a new MillJobScheduler daemon.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants