Commit 0295de2

Merge pull request #4036 from ruiwen-zhao/parallel-beta
KEP-3673: Promote Parallel Image Pull Limit to Beta
2 parents 3821d25 + d1a39f1 commit 0295de2

3 files changed

+77
-15
lines changed

+2
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
kep-number: 3673
22
alpha:
33
approver: "@wojtek-t"
4+
beta:
5+
approver: "@wojtek-t"

keps/sig-node/3673-kubelet-parallel-image-pull-limit/README.md

+73-13
@@ -109,6 +109,7 @@ tags, and then generate with `hack/update-toc.sh`.
109109
- [Scalability](#scalability)
110110
- [Troubleshooting](#troubleshooting)
111111
- [Implementation History](#implementation-history)
112+
- [Alpha](#alpha-1)
112113
- [Drawbacks](#drawbacks)
113114
- [Alternatives](#alternatives)
114115
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
@@ -345,8 +346,14 @@ This can inform certain test coverage improvements that we want to do before
345346
extending the production code to implement this enhancement.
346347
-->
347348

348-
New unit test will be added to image_manager_test.go.
349-
- `k8s.io/kubernetes/pkg/kubelet/images/puller.go`: `01/05/2023` - `100.0`
349+
A new unit test was added to `image_manager_test.go` along with the Alpha implementation.
350+
- `k8s.io/kubernetes/pkg/kubelet/images/image_manager.go`: `05/29/2023` - `97.3`
351+
352+
The unit test covers the following cases:
353+
354+
1. Kubelet sends image pull requests to the container runtime as long as the number of in-flight pulls is equal to or below `MaxParallelImagePulls`.
355+
2. Kubelet blocks further image pull requests from being sent to the container runtime once `MaxParallelImagePulls` is reached.
356+
3. If some image pulls get stuck, other image pull requests can still be sent to the container runtime.
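The three cases above can be illustrated with a minimal sketch of a bounded counter built on a buffered channel. The `pullLimiter` type and its methods are hypothetical names chosen for this illustration and are not the kubelet's actual implementation; the sketch also uses a non-blocking `tryAcquire` so the outcome is easy to check, whereas a real puller would more likely wait for a free slot.

```go
package main

import "fmt"

// pullLimiter is a hypothetical stand-in for an in-memory parallel pull
// counter. It is an illustration only, not the kubelet's real type.
type pullLimiter struct {
	tokens chan struct{}
}

// newPullLimiter creates a limiter that admits at most
// maxParallelImagePulls concurrent pulls.
func newPullLimiter(maxParallelImagePulls int) *pullLimiter {
	return &pullLimiter{tokens: make(chan struct{}, maxParallelImagePulls)}
}

// tryAcquire reports whether one more pull may be sent to the container
// runtime right now, and reserves a slot if so.
func (l *pullLimiter) tryAcquire() bool {
	select {
	case l.tokens <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees one slot after a pull has finished.
func (l *pullLimiter) release() {
	<-l.tokens
}

func main() {
	l := newPullLimiter(2)
	fmt.Println("pull 1 admitted:", l.tryAcquire())
	fmt.Println("pull 2 admitted:", l.tryAcquire())
	fmt.Println("pull 3 admitted:", l.tryAcquire()) // limit reached
	l.release()                                     // pull 1 finishes
	fmt.Println("pull 3 admitted after release:", l.tryAcquire())
}
```

With a limit of 3 and one pull that never calls `release` (a stuck pull), two slots remain available, which corresponds to the third case.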
350357

351358
##### Integration tests
352359

@@ -374,7 +381,9 @@ https://storage.googleapis.com/k8s-triage/index.html
374381
We expect no non-infra related flakes in the last month as a GA graduation criteria.
375382
-->
376383

377-
A new node_e2e test with `serialize-image-pulls==false` will be added to make sure that when maxParallelImagePulls is reached, all further image pulls will be blocked.
384+
A new node_e2e test with `serialize-image-pulls==false` will be added to test parallel image pull limits:
385+
1. When maxParallelImagePulls is reached, all further image pulls will be blocked.
386+
2. Verify the behavior when the same image is pulled in parallel, which will happen when image pull policy is `Always`.
378387

379388
- <test>: <link to test coverage>
380389

@@ -385,6 +394,7 @@ A new node_e2e test with `serialize-image-pulls==false` will be added to make s
385394

386395
#### Beta
387396
- Gather feedback from developers and surveys
397+
- Add e2e test to cover the parallel image pull case
388398

389399
#### GA
390400
- Gather feedback from real-world usage from kubernetes vendors.
@@ -585,6 +595,7 @@ https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05
585595
This section must be completed when targeting beta to a release.
586596
-->
587597

598+
588599
###### How can a rollout or rollback fail? Can it impact already running workloads?
589600

590601
<!--
@@ -597,13 +608,22 @@ rollout. Similarly, consider large clusters and how enablement/disablement
597608
will rollout across nodes.
598609
-->
599610

611+
This is an opt-in feature, and it does not change any default behavior. If there is any bug in this feature, image pulls might fail.
612+
No running workloads will be impacted.
613+
614+
Note that changing MaxParallelImagePulls requires a kubelet restart. Since the parallel image pull counter
615+
is maintained in memory, restarting kubelet will reset the counter and potentially allow more image pulls than the limit.
616+
600617
###### What specific metrics should inform a rollback?
601618

602619
<!--
603620
What signals should users be paying attention to when the feature is young
604621
that might indicate a serious problem?
605622
-->
606623

624+
In the worst case, image pulls might fail. Users can monitor image pull k8s events and the `kubelet_runtime_operations_errors_total` metric to see if there is an increase
625+
of image pull failures.
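As a rough illustration of the check described above, the following sketch sums the `kubelet_runtime_operations_errors_total` samples labeled `operation_type="pull_image"` from two scrapes in Prometheus text format and compares them. The function name `pullImageErrors` and the sample values are assumptions made for this example; a real setup would more likely query a monitoring system than parse scrape text by hand.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// pullImageErrors sums the samples of kubelet_runtime_operations_errors_total
// whose label set contains operation_type="pull_image". The input is assumed
// to be Prometheus text exposition format.
func pullImageErrors(metricsText string) float64 {
	const name = "kubelet_runtime_operations_errors_total"
	var total float64
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, name+"{") ||
			!strings.Contains(line, `operation_type="pull_image"`) {
			continue
		}
		// The sample value is the first field after the closing brace.
		rest := strings.Fields(line[strings.LastIndex(line, "}")+1:])
		if len(rest) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(rest[0], 64); err == nil {
			total += v
		}
	}
	return total
}

func main() {
	// Made-up scrape samples, for illustration only.
	before := `kubelet_runtime_operations_errors_total{operation_type="pull_image"} 3
kubelet_runtime_operations_errors_total{operation_type="create_container"} 1`
	after := `kubelet_runtime_operations_errors_total{operation_type="pull_image"} 7
kubelet_runtime_operations_errors_total{operation_type="create_container"} 1`

	increase := pullImageErrors(after) - pullImageErrors(before)
	fmt.Println("pull_image errors since last scrape:", increase)
}
```

A sustained increase between scrapes after enabling the feature would be the kind of signal that could inform a rollback.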
626+
607627
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
608628

609629
<!--
@@ -612,12 +632,33 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
612632
are missing a bunch of machinery and tooling and can't do that now.
613633
-->
614634

635+
This is an opt-in feature, and it does not change any default behavior. We manually tested enabling and disabling this feature by changing kubelet config and
636+
restarting kubelet.
637+
638+
The manual test steps are as follows:
639+
640+
1. Create a one-node 1.27 k8s cluster, which has MaxParallelImagePulls support but the value is nil (no limit) by default.
641+
2. Manually change the MaxParallelImagePulls setting by SSH-ing to the node and adding the following to the kubelet config:
642+
```
serializeImagePulls: false
maxParallelImagePulls: 2
```
646+
3. Deploy three pods, each with a different container image, to the one-node cluster. All three images are 5GB. The relatively large size ensures there is enough time between image pulling events and makes the behavior easier to observe.
647+
4. Run `kubectl get events` and observe that exactly two images finish pulling first, and then the remaining image finishes.
648+
5. Manually change the MaxParallelImagePulls setting by SSH-ing to the node again and removing the `serializeImagePulls` and `maxParallelImagePulls` entries.
649+
6. Deploy two pods, each with a different container image, to the cluster. Both images are 5GB, and they are different from the three images deployed in step 3.
650+
7. Run `kubectl get events` and observe that exactly one image finishes pulling first, and then the remaining image finishes.
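The ordering expected in steps 4 and 7 can be reasoned about with a small idealized model. The sketch below assumes every pull takes a fixed time regardless of how many run concurrently (real pulls share bandwidth, so actual timings will differ), and it models the configuration after step 5 as a limit of 1. The function name `finishTimes` and the duration values are made up for illustration.

```go
package main

import "fmt"

// finishTimes returns the completion time of each pull when pulls are started
// in order and at most limit of them run at once. durations are in arbitrary
// time units; limit must be at least 1. This is an idealized model, not a
// measurement of kubelet behavior.
func finishTimes(durations []int, limit int) []int {
	slots := make([]int, limit) // time at which each slot becomes free
	out := make([]int, len(durations))
	for i, d := range durations {
		// Pick the slot that frees up earliest.
		best := 0
		for s := range slots {
			if slots[s] < slots[best] {
				best = s
			}
		}
		slots[best] += d
		out[i] = slots[best]
	}
	return out
}

func main() {
	// Step 4: three equally sized images, maxParallelImagePulls = 2.
	fmt.Println(finishTimes([]int{10, 10, 10}, 2)) // [10 10 20]

	// Step 7: two equally sized images, pulled one at a time.
	fmt.Println(finishTimes([]int{10, 10}, 1)) // [10 20]
}
```

In the first case two pulls complete together and the third completes later, matching the observation in step 4; in the second case the pulls complete one after another, matching step 7.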
651+
652+
653+
615654
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
616655

617656
<!--
618657
Even if applying deprecation policies, they may still surprise some users.
619658
-->
620659

660+
No.
661+
621662
### Monitoring Requirements
622663

623664
<!--
@@ -634,6 +675,10 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
634675
checking if there are objects with field X set) may be a last resort. Avoid
635676
logs or events for this purpose.
636677
-->
678+
Image pulling is managed by kubelet, and does not affect how workloads run. That said, when parallel image pulling is enabled (`serializeImagePulls` is set to false), an operator will observe that
679+
a pod could start while kubelet is still pulling images for another pod.
680+
681+
To observe the effect of different `MaxParallelImagePulls` settings, please refer to the next section.
637682

638683
###### How can someone using this feature know that it is working for their instance?
639684

@@ -646,13 +691,11 @@ and operation of this feature.
646691
Recall that end users cannot usually observe component logs or access metrics.
647692
-->
648693

649-
- [ ] Events
650-
- Event Reason:
651-
- [ ] API .status
652-
- Condition name:
653-
- Other field:
654-
- [ ] Other (treat as last resort)
655-
- Details:
694+
- [X] Events
695+
- Event Reason: Pulling
696+
697+
Assuming `MaxParallelImagePulls` is set to _X_, an operator can look at the container runtime log and see _X_ PullImageRequests sent to the container runtime at the same time.
698+
If the image pulls take roughly the same amount of time, an operator can look at the k8s events and see _X_ images finish pulling at roughly the same time.
656699

657700
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
658701

@@ -677,15 +720,19 @@ question.
677720
Pick one more of these and delete the rest.
678721
-->
679722

680-
- [ ] Metrics
681-
- Metric name:
682-
- [Optional] Aggregation method:
723+
We can rely on the existing metrics on image pull to determine if this feature has any impact on image pulling.
724+
725+
- [X] Metrics
726+
- Metric name: kubelet_runtime_operations_errors_total
727+
- [Optional] Aggregation method: operation_type=pull_image
683728
- Components exposing the metric:
684729
- [ ] Other (treat as last resort)
685730
- Details:
686731

687732
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
688733

734+
No.
735+
689736
<!--
690737
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
691738
implementation difficulties, etc.).
@@ -699,6 +746,8 @@ This section must be completed when targeting beta to a release.
699746

700747
###### Does this feature depend on any specific services running in the cluster?
701748

749+
No.
750+
702751
<!--
703752
Think about both cluster-level services (e.g. metrics-server) as well
704753
as node-level agents (e.g. specific version of CRI). Focus on external or
@@ -817,8 +866,12 @@ details). For now, we leave it here.
817866

818867
###### How does this feature react if the API server and/or etcd is unavailable?
819868

869+
N/A. This feature does not rely on any component other than kubelet.
870+
820871
###### What are other known failure modes?
821872

873+
No known failure modes.
874+
822875
<!--
823876
For each of them, fill in the following information by copying the below template:
824877
- [Failure mode brief description]
@@ -834,6 +887,9 @@ For each of them, fill in the following information by copying the below templat
834887

835888
###### What steps should be taken if SLOs are not being met to determine the problem?
836889

890+
If this feature impacts image pulling, the user should unset MaxParallelImagePulls (i.e. set MaxParallelImagePulls to nil),
891+
or set `serializeImagePulls` to true to enable serial image pulling.
892+
837893
## Implementation History
838894

839895
<!--
@@ -847,6 +903,10 @@ Major milestones might include:
847903
- when the KEP was retired or superseded
848904
-->
849905

906+
### Alpha
907+
908+
The Alpha feature was implemented in 1.27.
909+
850910
## Drawbacks
851911

852912
<!--

keps/sig-node/3673-kubelet-parallel-image-pull-limit/kep.yaml

+2-2
@@ -12,12 +12,12 @@ approvers:
1212
- "@mrunalp"
1313

1414
# The target maturity stage in the current dev cycle for this KEP.
15-
stage: alpha
15+
stage: beta
1616

1717
# The most recent milestone for which work toward delivery of this KEP has been
1818
# done. This can be the current (upcoming) milestone, if it is being actively
1919
# worked on.
20-
latest-milestone: "v1.27"
20+
latest-milestone: "v1.28"
2121

2222
# The milestone at which this feature was, or is targeted to be, at each stage.
2323
milestone:
