feat: use named array instead of array in normalizing score #80901

draveness · 2019-08-02T11:32:26Z

/kind feature
/sig scheduling
/priority important-soon
/assign @bsalamat @ahg-g

What this PR does / why we need it:

I was refactoring the interpod-affinity priority with scheduling framework and found out it is quite hard to connect the node which passing to the Score plugin to the NodeScoreList passing to the NormalizeScore with index.

// nodename here
func (ipa *InterPodAffinityPlugin) Score(pc *framework.PluginContext, pod *v1.Pod, nodeName string) (int, *framework.Status) {

// index to score here
func (ipa *InterPodAffinityPlugin) NormalizeScore(pc *framework.PluginContext, pod *v1.Pod, scores framework.NodeScoreList) *framework.Status {

The priority functions are using a HostPriority which contains node name and score, it would be better to provide the node name information in normalizing score phase.

We have two options for optimising this:

Use a nodeName -> score map, which this PR try to introduce
Use an array of HostPriority liked struct, ex:

type NodeScore struct {
  Name string
  Score int
}

I recalled that in the API convention documents, kubernetes prefer the named array to map. I could go with the second option if it is the convention we must follow here, but I prefer the first for now. What do you think of this? @bsalamat @ahg-g

Which issue(s) this PR fixes:

ref: #80667

Does this PR introduce a user-facing change?:

Use a named array instead of a score array in normalizing-score phase.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

draveness · 2019-08-02T12:25:39Z

/retest

alculquicondor · 2019-08-02T15:21:06Z

/hold
Changes to the framework interfaces are considered user facing. Can we open an issue for this to start a discussion?

alculquicondor · 2019-08-02T15:25:09Z

/cc @liu-cong
who has been looking into normalize score interface as well

k8s-ci-robot · 2019-08-02T15:25:10Z

@alculquicondor: GitHub didn't allow me to request PR reviews from the following users: liu-cong.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @liu-cong
who has been looking into normalize score interface as well

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

draveness · 2019-08-02T17:51:23Z

Changes to the framework interfaces are considered user facing. Can we open an issue for this to start a discussion?

I sent this PR to start a discussion about normalizing score phase as we met problems when transforming the original priority functions into the scheduling framework. Feel free to leave comments and opinions in this PR.

liu-cong · 2019-08-02T20:24:17Z

/assign

liu-cong · 2019-08-02T20:47:45Z

This does involve an interface change, please document it in Does this PR introduce a user-facing change?.

This seems to be a reasonable change. I cannot think of any objections to it besides more memory usage. I will wait for opinions from others.

alculquicondor · 2019-08-02T20:54:11Z

It's not clear to me why is the change needed. Normalization should be totally independent from the node information.

draveness · 2019-08-03T03:38:10Z

It's not clear to me why is the change needed. Normalization should be totally independent from the node information.

Ideally, normalization should be totally independent of the node information, many of the priority functions now indeed do not use node name in HostPriority, but we have some real cases that need other nodeName and information during the normalizing phase.

Currently, the reduce phase in priority function is similar to the normalizing phase in the scheduling framework, these reduce function accepts a list of nodeName and score as an argument, but the normalize score function only accepts a list of score which lacks node name information for now.

If you checked out the priority functions with reducing phase, some of them use nodeName to get more information, ex: SelectorSpread use nodeName to get zoneID.

And we need to provide node names in the normalizing phase because of the following reasons:

Keep consistent with the original reduce priority functions (HostPriority), otherwise, we can't migrate some of them to the scheduling framework, ex: SelectorSpread, InterPodAffinity and EvenPodSpreading
Good for debugging purpose, ex:

klog.Infof("%v -> %v: EvenPodsSpreadPriority, Score: (%d)", pod.Name, nodeName, *score)
klog.Infof("%v -> %v: InterPodAffinityPriority, Score: (%d)", pod.Name, node.Name, int(fScore))
"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore),

InterPodAffinity and EvenPodsSpread priorities are using deprecated pattern and quite different from the other priority function, in this WIP PR #80898, I tried to refactor it with the scheduling framework and met this problem. Even if we could find another approach to refactor them without node name, SelectorSpread also need node name to get zone id.

Based on the reasons above, I'd like to add node name to normalizing score phase.

cc/ @bsalamat @ahg-g

ahg-g

Can you elaborate why you are recommending a map over a "named array"?

ahg-g · 2019-08-06T10:44:04Z

pkg/scheduler/framework/v1alpha1/framework.go

 			score, status := pl.Score(pc, pod, nodes[index].Name)
 			if !status.IsSuccess() {
 				errCh.SendErrorWithCancel(fmt.Errorf(status.Message()), cancel)
 				return
 			}
-			pluginToNodeScoreMap[pl.Name()][index] = score
+			pluginToNodeScoreMap[pl.Name()][nodeName] = score


I suspect you need to synchronize access when writing to the map (a pre-allocated slice doesn't) since this practically adds a new key-value to it.

draveness · 2019-08-06T10:59:12Z

Can you elaborate why you are recommending a map over a "named array"?

I don't have a preference. What do you think of these two options?

ahg-g · 2019-08-06T11:51:51Z

Can you elaborate why you are recommending a map over a "named array"?

I don't have a preference. What do you think of these two options?

I think it is better to follow the sig-arch recommendation.

draveness · 2019-08-06T12:12:11Z

I think it is better to follow the sig-arch recommendation.

I'll make the change.

draveness · 2019-08-07T16:24:50Z

/retest

…

On Aug 8, 2019, 12:19 AM +0800, Kubernetes Prow Robot ***@***.***>, wrote: @draveness: The following test failed, say /retest to rerun them all: Test name Commit Details Rerun command pull-kubernetes-integration 58dda26 link /test pull-kubernetes-integration Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

alculquicondor · 2019-08-07T17:54:55Z

/lgtm

Huang-Wei

@draveness I may miss something, but my impression is that this PR still doesn't resolve the core issue of the deprecated priorities - which is: the nodes slice here are the ones that passed Predicates. So the Score() function still doesn't have the handle of all nodes info?

	// Run the Score plugins.
	scoresMap, scoreStatus := framework.RunScorePlugins(pluginContext, pod, nodes)

(The nodes come from one of the parameters of PrioritizeNodes())

Huang-Wei · 2019-08-07T22:13:25Z

pkg/scheduler/core/generic_scheduler.go

-	for _, scoreList := range scoresMap {
-		for i := range nodes {
-			result[i].Score += scoreList[i]
+	for _, nodeScoreList := range scoresMap {


Can this be merged into previous for i := range nodes loop, after L806?

Good catch, Done

draveness · 2019-08-08T00:31:55Z

@draveness I may miss something, but my impression is that this PR still doesn't resolve the core issue of the deprecated priorities - which is: the nodes slice here are the ones that passed Predicates. So the Score() function still doesn't have the handle of all nodes info?

We could get nodes from the frameworkHandle.NodeInfoSnapshot(), I could give an implementation of refactoring interpod-affinity for review after this gets merged. If it does not work well and something goes wrong, I'll propose to add nodes to Score function. And even if it doesn't work well, use a named array instead could make the API easier to use. Does this make sense to you? @Huang-Wei

/lgtm

All being said, I am not opposed to having a named array (map) to keep score of nodes. I think it will be easier to use.

/hold cancel

I'll cancel the hold

Huang-Wei · 2019-08-08T00:42:11Z

We could get nodes from the frameworkHandle.NodeInfoSnapshot()

Looks good.

I could give an implementation of refactoring interpod-affinity for review after this gets merged.

Thanks. Let's see if it works.

Huang-Wei · 2019-08-08T00:44:35Z

/lgtm

Will leave /approval to @bsalamat or @ahg-g .

draveness · 2019-08-08T02:55:44Z

/retest

draveness · 2019-08-08T16:49:53Z

Friendly ping @bsalamat and @ahg-g for approval. Thanks!

bsalamat · 2019-08-09T01:04:13Z

pkg/scheduler/framework/v1alpha1/interface.go

@@ -31,10 +31,16 @@ import (
 type Code int

 // NodeScoreList declares a list of nodes and their scores.
-type NodeScoreList []int
+type NodeScoreList []NodeScore


Wouldn't it be better to use a map from node names to node score (map[string]int) instead of NodeScoreList? With the current data structure, finding the score of a node requires iterating over a NodeScoreList.

Wouldn't it be better to use a map from node names to node score (map[string]int) instead of NodeScoreList? With the current data structure, finding the score of a node requires iterating over a NodeScoreList.

I don't have a preference here

I recalled that in the API convention documents, kubernetes prefer the named array to map. I could go with the second option if it is the convention we must follow here, but I prefer the first for now. What do you think of this?

We have to think carefully about this question: "Do we need to find score of a particular node in our code?" If the answer is no, we should definitely go with the current List instead of a map. Lists are faster to access compared to maps if we don't need to search for something.

We have to think carefully about this question: "Do we need to find score of a particular node in our code?"

+1

If the answer is no, we should definitely go with the current List instead of a map. Lists are faster to access compared to maps if we don't need to search for something.

For the implementation of InterPodAffinity and EvenPodSpread, we don't need to find the score of a particular node in NormalizeScore, and we only use node name in the named array (or map) to find score from the internal data structure.

Unless we have a strong reason to use map (which I doubt we will), my recommendation is to use a list.

ahg-g · 2019-08-09T19:21:38Z

pkg/scheduler/core/generic_scheduler.go

-		for i := range nodes {
-			result[i].Score += scoreList[i]
+		for j := range scoresMap {
+			result[i].Score += scoresMap[j][i].Score


I suspect that the original code will be faster because it has better cache locality when reading scores from the scoresMap.

We need more evidence on this, before that I prefer to keep it this way.

In theory sequential (or strided) reads are the fastest form of memory access patterns because of prefetching and cache locality.

The other thing in this nested loop that could make it slower is that it makes n*m map lookups (n=number of nodes, m = number of priority functions/score plugins), while if we make the node loop inside, the number of scoresMap lookups will be zero. So it might be even faster (and easier to read) to flip the whole thing like this:

for i := range nodes { result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0}) } for j, priorityList := range priorityConfigs { for i := range nodes { result[i].Score += priorityList[i].Score * priorityConfigs[j].Weight } } for _, scoreList := range scoresMap { for i := range nodes { result[i].Score += scoreList[i].Score } }

We need more evidence on this, before that I prefer to keep it this way.
The theory says this change makes things slower, so if you don't have evidence to the contrary, then we shouldn't make the change, especially that it is not related to the purpose of the PR "using named array in normalize scores".

In theory sequential (or strided) reads are the fastest form of memory access patterns because of prefetching and cache locality.

I know the theory of the cache locality and here are the benchmark numbers:

package main import "testing" func locality(result []int, scoresMap map[int][]int, anotherScoresMap map[int][]int) { for _, scores := range scoresMap { for i := range scores { result[i] = result[i] + scores[i] } } for _, scores := range anotherScoresMap { for i := range scores { result[i] = result[i] + scores[i] } } } func nonlocality(result []int, scoresMap map[int][]int, anotherScoresMap map[int][]int) { for i := range result { for j := range scoresMap { result[i] = result[i] + scoresMap[j][i] } for j := range anotherScoresMap { result[i] = result[i] + anotherScoresMap[j][i] } } } func BenchmarkLocality(b *testing.B) { result := make([]int, 5000) scores := make(map[int][]int, 10) anotherScores := make(map[int][]int, 10) for i := 0; i < 10; i++ { scores[i] = make([]int, 5000) anotherScores[i] = make([]int, 5000) for j := 0; j < 5000; j++ { scores[i][j] = j anotherScores[i][j] = j } } b.ResetTimer() for i := 0; i < b.N; i++ { locality(result, scores, anotherScores) } } func BenchmarkNonLocality(b *testing.B) { result := make([]int, 5000) scores := make(map[int][]int, 10) anotherScores := make(map[int][]int, 10) for i := 0; i < 10; i++ { scores[i] = make([]int, 5000) anotherScores[i] = make([]int, 5000) for j := 0; j < 5000; j++ { scores[i][j] = j anotherScores[i][j] = j } } b.ResetTimer() for i := 0; i < b.N; i++ { locality(result, scores, anotherScores) } } $ go test -bench=. goos: darwin goarch: amd64 pkg: github.com/golang BenchmarkLocality-8 30000 44607 ns/op BenchmarkNonLocality-8 30000 44245 ns/op PASS

According to the benchmarks, I prefer to keep it this way.

Strange, it is either golang's compiler is so good, or new architectures have such powerful memory controllers. I would like to do the same test in C++ to see if I get different results.

Do you have a theory why they are the same?

The thing is that the number of priorities (j values) is very small, and so is the number of nodes. We are only talking about ~200kb, which can fit in the caches pretty easily.

Minor note in the posted benchmark: it should be map[string][]int. but again, we are only doing lookups among 10 values.

On the other hand, I don't understand the motivation to do the change. I'm not going to hold the review.

I re ran the test and set the sizes to two orders of magnitude bigger, no difference.

The issue about cache is not only about size, but also what you move to the cache (how much of the cache line being moved from memory is actually used), so it is about effective memory bandwidth.

ahg-g · 2019-08-09T19:22:50Z

pkg/scheduler/framework/v1alpha1/interface.go

@@ -31,10 +31,16 @@ import (
 type Code int

 // NodeScoreList declares a list of nodes and their scores.
-type NodeScoreList []int
+type NodeScoreList []NodeScore


Unless we have a strong reason to use map (which I doubt we will), my recommendation is to use a list.

draveness · 2019-08-12T09:02:02Z

Hi @bsalamat, do you have any thoughts on this?

ahg-g · 2019-08-13T07:30:18Z

/lgtm

draveness · 2019-08-13T14:25:05Z

/lgtm

Hi @ahg-g, can I get approval on this or I have to bother Bobby?

ahg-g · 2019-08-14T04:45:49Z

/approve

k8s-ci-robot · 2019-08-14T04:46:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, draveness

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

draveness · 2019-08-14T06:27:26Z

/retest

…

On Aug 14, 2019, 2:18 PM +0800, Kubernetes Prow Robot ***@***.***>, wrote: @draveness: The following test failed, say /retest to rerun them all: Test name Commit Details Rerun command pull-kubernetes-integration aa5f9fd link /test pull-kubernetes-integration Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

draveness · 2019-08-14T07:19:42Z

/retest

k8s-ci-robot assigned ahg-g Aug 2, 2019

k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Aug 2, 2019

k8s-ci-robot assigned bsalamat Aug 2, 2019

draveness mentioned this pull request Aug 2, 2019

feat: refactor interpod affinity with the scheduling framework #80898

Closed

k8s-ci-robot requested review from ahg-g and resouer August 2, 2019 11:33

k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Aug 2, 2019

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2019

k8s-ci-robot assigned liu-cong Aug 2, 2019

draveness mentioned this pull request Aug 6, 2019

feat: remove normalize score section in the scheduling framwork kubernetes/enhancements#1178

Merged

ahg-g reviewed Aug 6, 2019

View reviewed changes

draveness force-pushed the feature/use-map-instead-of-array branch from b152920 to 68dc5da Compare August 6, 2019 12:33

k8s-ci-robot assigned alculquicondor Aug 7, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2019

Huang-Wei reviewed Aug 7, 2019

View reviewed changes

feat: use named array instead of score array in normalizing score

aa5f9fd

draveness force-pushed the feature/use-map-instead-of-array branch from 58dda26 to aa5f9fd Compare August 8, 2019 00:20

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2019

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 8, 2019

k8s-ci-robot assigned Huang-Wei Aug 8, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2019

draveness mentioned this pull request Aug 8, 2019

Refactor InterPodAffinity with the Scheduling Framework #80667

Closed

4 tasks

bsalamat reviewed Aug 9, 2019

View reviewed changes

ahg-g reviewed Aug 9, 2019

View reviewed changes

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2019

k8s-ci-robot merged commit 1906650 into kubernetes:master Aug 14, 2019

k8s-ci-robot added this to the v1.16 milestone Aug 14, 2019

draveness deleted the feature/use-map-instead-of-array branch August 14, 2019 10:42

feat: use named array instead of array in normalizing score #80901

feat: use named array instead of array in normalizing score #80901

Conversation

draveness commented Aug 2, 2019 • edited Loading

draveness commented Aug 2, 2019

alculquicondor commented Aug 2, 2019

alculquicondor commented Aug 2, 2019

k8s-ci-robot commented Aug 2, 2019

draveness commented Aug 2, 2019 • edited Loading

liu-cong commented Aug 2, 2019

liu-cong commented Aug 2, 2019

alculquicondor commented Aug 2, 2019

draveness commented Aug 3, 2019 • edited Loading

ahg-g left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

draveness commented Aug 6, 2019

ahg-g commented Aug 6, 2019

draveness commented Aug 6, 2019

draveness commented Aug 7, 2019 via email

alculquicondor commented Aug 7, 2019

Huang-Wei left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

draveness commented Aug 8, 2019 • edited Loading

Huang-Wei commented Aug 8, 2019

Huang-Wei commented Aug 8, 2019

draveness commented Aug 8, 2019

draveness commented Aug 8, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

draveness Aug 9, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

draveness Aug 13, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alculquicondor Aug 13, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

draveness commented Aug 12, 2019

ahg-g commented Aug 13, 2019

draveness commented Aug 13, 2019

ahg-g commented Aug 14, 2019

k8s-ci-robot commented Aug 14, 2019

draveness commented Aug 14, 2019 via email

draveness commented Aug 14, 2019

draveness commented Aug 2, 2019 •

edited

Loading

draveness commented Aug 2, 2019 •

edited

Loading

draveness commented Aug 3, 2019 •

edited

Loading

draveness commented Aug 8, 2019 •

edited

Loading

draveness Aug 9, 2019 •

edited

Loading

draveness Aug 13, 2019 •

edited

Loading

alculquicondor Aug 13, 2019 •

edited

Loading