-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: enable horizontal autoscaling of Liquid Legions v2 Mill #1697
Conversation
This is an alternate approach to #1686 |
f32e504
to
6e0ae9c
Compare
ba048d1
to
46efae8
Compare
46efae8
to
6bf7845
Compare
6bf7845
to
17941b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 58 of 87 files at r1, 5 of 8 files at r2, 8 of 14 files at r4, 2 of 5 files at r5.
Reviewable status: 70 of 87 files reviewed, 8 unresolved discussions (waiting on @SanjayVas)
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 69 at r6 (raw file):
/** Scheduler for Mill Kubernetes Jobs. */ class MillJobScheduler(
Any reason the scheduler has to be LLv2 specific?
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 72 at r6 (raw file):
private val duchyId: String, private val computationsStub: ComputationsGrpcKt.ComputationsCoroutineStub, private val deploymentName: String,
nit: It's not clear what this is for without reading the code.
I wrote the above and then happened to skip to the bottom of the file and saw the flag definitions. A note in the ktdoc indicating these fields are documented as command line flags below would helop.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 115 at r6 (raw file):
val computationType: ComputationType = computationTypeConfig.computationType val activeJobCount: Int =
This would avoid the reference to the issue.
Code snippet:
ownedJobs.filter({it.status.active ?:0) > 0}).count()
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 122 at r6 (raw file):
} if (activeJobCount >= millType.maximumConcurrency) { logger.fine { "Not scheduling: Maximum concurrency limit reached for $millType" }
It'd be helpful to have the activeJobCount, and maybe the limit on the log line.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 153 at r6 (raw file):
logger.info { "Scheduled Job $jobName for Computation $claimedComputationId" } if (activeJobCount + 1 >= millType.maximumConcurrency) { logger.info { "Mill type $millType is now at maximum concurrency" }
nit: add the max concurrency to the log line
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 162 at r6 (raw file):
} private suspend fun Collection<V1Job>.deleteOverLimit(limit: Int, predicate: (V1Job) -> Boolean) {
When do we expect this function to have work to do?
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/shareshuffle/HonestMajorityShareShuffleMillFlags.kt
line 17 at r6 (raw file):
package org.wfanet.measurement.duchy.deploy.common.daemon.mill.shareshuffle import java.io.File
Why are these changes required in this PR?
src/main/terraform/gcloud/examples/duchy/main.tf
line 74 at r6 (raw file):
service_account = module.common.cluster_service_account machine_type = "c2-standard-4" max_node_count = 20
Is this the same value we expect to be passed to the scheduler? If so, is it possible for the scheduler to read it from the k8s API?
17941b7
to
012442f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 70 of 87 files reviewed, 6 unresolved discussions (waiting on @kungfucraig)
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 69 at r6 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
Any reason the scheduler has to be LLv2 specific?
It isn't. It's intended to support multiple mill types. Indeed, that's the reason for the MillType enum. I'm just adding HMSS in a follow up PR.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 72 at r6 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
nit: It's not clear what this is for without reading the code.
I wrote the above and then happened to skip to the bottom of the file and saw the flag definitions. A note in the ktdoc indicating these fields are documented as command line flags below would helop.
Done. Added reference to Flags from KDoc.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 115 at r6 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
This would avoid the reference to the issue.
Done. Slightly better syntax though by passing the predicate to count
rather than using filter
.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 122 at r6 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
It'd be helpful to have the activeJobCount, and maybe the limit on the log line.
Done.
Note that some of these are at the FINE level.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 162 at r6 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
When do we expect this function to have work to do?
Added some documentation.
The basic issue is that we don't want to leave a bunch of completed Jobs/Pods around. Rather than using the completed TTL, we instead have a configurable history limit. This is similar to the approach used by CronJob.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/shareshuffle/HonestMajorityShareShuffleMillFlags.kt
line 17 at r6 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
Why are these changes required in this PR?
I assume you mean the change to extract the common flags into MillFlags? It's been a longstanding TODO, and in this case it really helped when trying to figure out which flags needed to be moved/copied to the scheduler.
I would have done it in a pre-factoring PR, but had already moved some other things around such as the packages. If you feel strongly, I can still split it out as pre-factoring.
src/main/terraform/gcloud/examples/duchy/main.tf
line 74 at r6 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
Is this the same value we expect to be passed to the scheduler? If so, is it possible for the scheduler to read it from the k8s API?
As discussed offline, this is not necessarily the same. It'll depend on the chosen machine types vs. the configured resource requirements for each job. It'll also depend on what else you're running in the node pool (or for EKS, node group), for example if you want to use the same pool for both HMSS and LLv2 Mill Jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 11 of 87 files at r1, 1 of 8 files at r2, 1 of 5 files at r5, 1 of 2 files at r7, all commit messages.
Reviewable status: 83 of 87 files reviewed, 1 unresolved discussion (waiting on @SanjayVas)
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 115 at r6 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Done. Slightly better syntax though by passing the predicate to
count
rather than usingfilter
.
Indeed. Didn't realize it took a predicate.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 75 at r7 (raw file):
* See [MillJobScheduler.Flags] for additional parameter documentation. */ class MillJobScheduler(
How is this being tested?
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/shareshuffle/HonestMajorityShareShuffleMillFlags.kt
line 17 at r6 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
I assume you mean the change to extract the common flags into MillFlags? It's been a longstanding TODO, and in this case it really helped when trying to figure out which flags needed to be moved/copied to the scheduler.
I would have done it in a pre-factoring PR, but had already moved some other things around such as the packages. If you feel strongly, I can still split it out as pre-factoring.
It's fine. I can live with it. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 87 files at r1, 2 of 3 files at r6, 1 of 2 files at r7.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @SanjayVas)
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 75 at r7 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
How is this being tested?
We discussed this offline. Manual testing is probably okay. We also know that KinD tests the happy path. Perhaps there's some way to mock a few things and have some basic tests to show that this class manages concurrency and some of the other clean up tasks correctly?
012442f
to
6239034
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 83 of 92 files reviewed, all discussions resolved (waiting on @kungfucraig and @stevenwarejones)
src/main/kotlin/org/wfanet/measurement/duchy/deploy/common/daemon/mill/MillJobScheduler.kt
line 75 at r7 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
We discussed this offline. Manual testing is probably okay. We also know that KinD tests the happy path. Perhaps there's some way to mock a few things and have some basic tests to show that this class manages concurrency and some of the other clean up tasks correctly?
Indeed, the Kubernetes tests exercise this but don't verify specific behavior (e.g. that the max concurrency limit is respected). I managed to add a unit test by mocking KubernetesClient.
fd8c443
to
89557f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 10 files at r8, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @stevenwarejones)
89557f5
to
36493c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 66 of 87 files at r1, 4 of 8 files at r2, 7 of 14 files at r4, 3 of 5 files at r5, 2 of 3 files at r6, 9 of 10 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @SanjayVas)
c524b86
to
be9f7fe
Compare
be9f7fe
to
c1425b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 13 files at r10.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @kungfucraig and @stevenwarejones)
This converts the LLv2 Mill daemon to a Job which processes a single work item. Jobs are scheduled via a new MillJobScheduler daemon.
This converts the LLv2 Mill daemon to a Job which processes a single work item. Jobs are scheduled via a new MillJobScheduler daemon.