Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use named array instead of array in normalizing score #80901

Merged

Conversation

draveness
Copy link
Contributor

@draveness draveness commented Aug 2, 2019

/kind feature
/sig scheduling
/priority important-soon
/assign @bsalamat @ahg-g

What this PR does / why we need it:

I was refactoring the interpod-affinity priority with scheduling framework and found out it is quite hard to connect the node which passing to the Score plugin to the NodeScoreList passing to the NormalizeScore with index.

// nodename here
func (ipa *InterPodAffinityPlugin) Score(pc *framework.PluginContext, pod *v1.Pod, nodeName string) (int, *framework.Status) {

// index to score here
func (ipa *InterPodAffinityPlugin) NormalizeScore(pc *framework.PluginContext, pod *v1.Pod, scores framework.NodeScoreList) *framework.Status { 

The priority functions are using a HostPriority which contains node name and score, it would be better to provide the node name information in normalizing score phase.

We have two options for optimising this:

  1. Use a nodeName -> score map, which this PR try to introduce
  2. Use an array of HostPriority liked struct, ex:
type NodeScore struct {
  Name string
  Score int
}

I recalled that in the API convention documents, kubernetes prefer the named array to map. I could go with the second option if it is the convention we must follow here, but I prefer the first for now. What do you think of this? @bsalamat @ahg-g

Which issue(s) this PR fixes:

ref: #80667

Does this PR introduce a user-facing change?:

Use a named array instead of a score array in normalizing-score phase.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Aug 2, 2019
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 2, 2019
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Aug 2, 2019
@draveness
Copy link
Contributor Author

/retest

@alculquicondor
Copy link
Member

/hold
Changes to the framework interfaces are considered user facing. Can we open an issue for this to start a discussion?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2019
@alculquicondor
Copy link
Member

/cc @liu-cong
who has been looking into normalize score interface as well

@k8s-ci-robot
Copy link
Contributor

@alculquicondor: GitHub didn't allow me to request PR reviews from the following users: liu-cong.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @liu-cong
who has been looking into normalize score interface as well

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@draveness
Copy link
Contributor Author

draveness commented Aug 2, 2019

Changes to the framework interfaces are considered user facing. Can we open an issue for this to start a discussion?

I sent this PR to start a discussion about normalizing score phase as we met problems when transforming the original priority functions into the scheduling framework. Feel free to leave comments and opinions in this PR.

@liu-cong
Copy link
Contributor

liu-cong commented Aug 2, 2019

/assign

@liu-cong
Copy link
Contributor

liu-cong commented Aug 2, 2019

This does involve an interface change, please document it in Does this PR introduce a user-facing change?.

This seems to be a reasonable change. I cannot think of any objections to it besides more memory usage. I will wait for opinions from others.

@alculquicondor
Copy link
Member

It's not clear to me why is the change needed. Normalization should be totally independent from the node information.

@draveness
Copy link
Contributor Author

draveness commented Aug 3, 2019

It's not clear to me why is the change needed. Normalization should be totally independent from the node information.

Ideally, normalization should be totally independent of the node information, many of the priority functions now indeed do not use node name in HostPriority, but we have some real cases that need other nodeName and information during the normalizing phase.

Currently, the reduce phase in priority function is similar to the normalizing phase in the scheduling framework, these reduce function accepts a list of nodeName and score as an argument, but the normalize score function only accepts a list of score which lacks node name information for now.

If you checked out the priority functions with reducing phase, some of them use nodeName to get more information, ex: SelectorSpread use nodeName to get zoneID.

And we need to provide node names in the normalizing phase because of the following reasons:

  1. Keep consistent with the original reduce priority functions (HostPriority), otherwise, we can't migrate some of them to the scheduling framework, ex: SelectorSpread, InterPodAffinity and EvenPodSpreading
  2. Good for debugging purpose, ex:
klog.Infof("%v -> %v: EvenPodsSpreadPriority, Score: (%d)", pod.Name, nodeName, *score)
klog.Infof("%v -> %v: InterPodAffinityPriority, Score: (%d)", pod.Name, node.Name, int(fScore))
"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore),

InterPodAffinity and EvenPodsSpread priorities are using deprecated pattern and quite different from the other priority function, in this WIP PR #80898, I tried to refactor it with the scheduling framework and met this problem. Even if we could find another approach to refactor them without node name, SelectorSpread also need node name to get zone id.

Based on the reasons above, I'd like to add node name to normalizing score phase.

cc/ @bsalamat @ahg-g

Copy link
Member

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate why you are recommending a map over a "named array"?

score, status := pl.Score(pc, pod, nodes[index].Name)
if !status.IsSuccess() {
errCh.SendErrorWithCancel(fmt.Errorf(status.Message()), cancel)
return
}
pluginToNodeScoreMap[pl.Name()][index] = score
pluginToNodeScoreMap[pl.Name()][nodeName] = score
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you need to synchronize access when writing to the map (a pre-allocated slice doesn't) since this practically adds a new key-value to it.

@draveness
Copy link
Contributor Author

Can you elaborate why you are recommending a map over a "named array"?

I don't have a preference. What do you think of these two options?

@ahg-g
Copy link
Member

ahg-g commented Aug 6, 2019

Can you elaborate why you are recommending a map over a "named array"?

I don't have a preference. What do you think of these two options?

I think it is better to follow the sig-arch recommendation.

@draveness
Copy link
Contributor Author

I think it is better to follow the sig-arch recommendation.

I'll make the change.

@draveness draveness force-pushed the feature/use-map-instead-of-array branch from b152920 to 68dc5da Compare August 6, 2019 12:33
@draveness
Copy link
Contributor Author

draveness commented Aug 7, 2019 via email

@alculquicondor
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2019
Copy link
Member

@Huang-Wei Huang-Wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@draveness I may miss something, but my impression is that this PR still doesn't resolve the core issue of the deprecated priorities - which is: the nodes slice here are the ones that passed Predicates. So the Score() function still doesn't have the handle of all nodes info?

	// Run the Score plugins.
	scoresMap, scoreStatus := framework.RunScorePlugins(pluginContext, pod, nodes)

(The nodes come from one of the parameters of PrioritizeNodes())

for _, scoreList := range scoresMap {
for i := range nodes {
result[i].Score += scoreList[i]
for _, nodeScoreList := range scoresMap {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be merged into previous for i := range nodes loop, after L806?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, Done

@draveness draveness force-pushed the feature/use-map-instead-of-array branch from 58dda26 to aa5f9fd Compare August 8, 2019 00:20
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2019
@draveness
Copy link
Contributor Author

draveness commented Aug 8, 2019

@draveness I may miss something, but my impression is that this PR still doesn't resolve the core issue of the deprecated priorities - which is: the nodes slice here are the ones that passed Predicates. So the Score() function still doesn't have the handle of all nodes info?

We could get nodes from the frameworkHandle.NodeInfoSnapshot(), I could give an implementation of refactoring interpod-affinity for review after this gets merged. If it does not work well and something goes wrong, I'll propose to add nodes to Score function. And even if it doesn't work well, use a named array instead could make the API easier to use. Does this make sense to you? @Huang-Wei

/lgtm

All being said, I am not opposed to having a named array (map) to keep score of nodes. I think it will be easier to use.

/hold cancel

I'll cancel the hold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 8, 2019
@Huang-Wei
Copy link
Member

We could get nodes from the frameworkHandle.NodeInfoSnapshot()

Looks good.

I could give an implementation of refactoring interpod-affinity for review after this gets merged.

Thanks. Let's see if it works.

@Huang-Wei
Copy link
Member

/lgtm

Will leave /approval to @bsalamat or @ahg-g .

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2019
@draveness
Copy link
Contributor Author

/retest

@draveness
Copy link
Contributor Author

Friendly ping @bsalamat and @ahg-g for approval. Thanks!

@@ -31,10 +31,16 @@ import (
type Code int

// NodeScoreList declares a list of nodes and their scores.
type NodeScoreList []int
type NodeScoreList []NodeScore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to use a map from node names to node score (map[string]int) instead of NodeScoreList? With the current data structure, finding the score of a node requires iterating over a NodeScoreList.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to use a map from node names to node score (map[string]int) instead of NodeScoreList? With the current data structure, finding the score of a node requires iterating over a NodeScoreList.

I don't have a preference here

I recalled that in the API convention documents, kubernetes prefer the named array to map. I could go with the second option if it is the convention we must follow here, but I prefer the first for now. What do you think of this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to think carefully about this question: "Do we need to find score of a particular node in our code?" If the answer is no, we should definitely go with the current List instead of a map. Lists are faster to access compared to maps if we don't need to search for something.

Copy link
Contributor Author

@draveness draveness Aug 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to think carefully about this question: "Do we need to find score of a particular node in our code?"

+1

If the answer is no, we should definitely go with the current List instead of a map. Lists are faster to access compared to maps if we don't need to search for something.

For the implementation of InterPodAffinity and EvenPodSpread, we don't need to find the score of a particular node in NormalizeScore, and we only use node name in the named array (or map) to find score from the internal data structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we have a strong reason to use map (which I doubt we will), my recommendation is to use a list.

for i := range nodes {
result[i].Score += scoreList[i]
for j := range scoresMap {
result[i].Score += scoresMap[j][i].Score
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that the original code will be faster because it has better cache locality when reading scores from the scoresMap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need more evidence on this, before that I prefer to keep it this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory sequential (or strided) reads are the fastest form of memory access patterns because of prefetching and cache locality.

The other thing in this nested loop that could make it slower is that it makes n*m map lookups (n=number of nodes, m = number of priority functions/score plugins), while if we make the node loop inside, the number of scoresMap lookups will be zero. So it might be even faster (and easier to read) to flip the whole thing like this:

for i := range nodes {
		result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
}

for j, priorityList := range priorityConfigs {
		for i := range nodes {
			result[i].Score += priorityList[i].Score * priorityConfigs[j].Weight
		}
}

for _, scoreList := range scoresMap {
		for i := range nodes {
			result[i].Score += scoreList[i].Score
		}
}

We need more evidence on this, before that I prefer to keep it this way.
The theory says this change makes things slower, so if you don't have evidence to the contrary, then we shouldn't make the change, especially that it is not related to the purpose of the PR "using named array in normalize scores".

Copy link
Contributor Author

@draveness draveness Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory sequential (or strided) reads are the fastest form of memory access patterns because of prefetching and cache locality.

I know the theory of the cache locality and here are the benchmark numbers:

package main

import "testing"

func locality(result []int, scoresMap map[int][]int, anotherScoresMap map[int][]int) {
	for _, scores := range scoresMap {
		for i := range scores {
			result[i] = result[i] + scores[i]
		}
	}

	for _, scores := range anotherScoresMap {
		for i := range scores {
			result[i] = result[i] + scores[i]
		}
	}
}

func nonlocality(result []int, scoresMap map[int][]int, anotherScoresMap map[int][]int) {
	for i := range result {
		for j := range scoresMap {
			result[i] = result[i] + scoresMap[j][i]
		}

		for j := range anotherScoresMap {
			result[i] = result[i] + anotherScoresMap[j][i]
		}
	}
}

func BenchmarkLocality(b *testing.B) {
	result := make([]int, 5000)
	scores := make(map[int][]int, 10)
	anotherScores := make(map[int][]int, 10)

	for i := 0; i < 10; i++ {
		scores[i] = make([]int, 5000)
		anotherScores[i] = make([]int, 5000)
		for j := 0; j < 5000; j++ {
			scores[i][j] = j
			anotherScores[i][j] = j
		}
	}

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		locality(result, scores, anotherScores)
	}
}

func BenchmarkNonLocality(b *testing.B) {
	result := make([]int, 5000)
	scores := make(map[int][]int, 10)
	anotherScores := make(map[int][]int, 10)

	for i := 0; i < 10; i++ {
		scores[i] = make([]int, 5000)
		anotherScores[i] = make([]int, 5000)
		for j := 0; j < 5000; j++ {
			scores[i][j] = j
			anotherScores[i][j] = j
		}
	}

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		locality(result, scores, anotherScores)
	}
}

$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/golang
BenchmarkLocality-8      	   30000	     44607 ns/op
BenchmarkNonLocality-8   	   30000	     44245 ns/op
PASS

According to the benchmarks, I prefer to keep it this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange, it is either golang's compiler is so good, or new architectures have such powerful memory controllers. I would like to do the same test in C++ to see if I get different results.

Do you have a theory why they are the same?

Copy link
Member

@alculquicondor alculquicondor Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that the number of priorities (j values) is very small, and so is the number of nodes. We are only talking about ~200kb, which can fit in the caches pretty easily.

Minor note in the posted benchmark: it should be map[string][]int. but again, we are only doing lookups among 10 values.

On the other hand, I don't understand the motivation to do the change. I'm not going to hold the review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re ran the test and set the sizes to two orders of magnitude bigger, no difference.

The issue about cache is not only about size, but also what you move to the cache (how much of the cache line being moved from memory is actually used), so it is about effective memory bandwidth.

@@ -31,10 +31,16 @@ import (
type Code int

// NodeScoreList declares a list of nodes and their scores.
type NodeScoreList []int
type NodeScoreList []NodeScore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we have a strong reason to use map (which I doubt we will), my recommendation is to use a list.

@draveness
Copy link
Contributor Author

Hi @bsalamat, do you have any thoughts on this?

@ahg-g
Copy link
Member

ahg-g commented Aug 13, 2019

/lgtm

@draveness
Copy link
Contributor Author

/lgtm

Hi @ahg-g, can I get approval on this or I have to bother Bobby?

@ahg-g
Copy link
Member

ahg-g commented Aug 14, 2019

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, draveness

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2019
@draveness
Copy link
Contributor Author

draveness commented Aug 14, 2019 via email

@draveness
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit 1906650 into kubernetes:master Aug 14, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 14, 2019
@draveness draveness deleted the feature/use-map-instead-of-array branch August 14, 2019 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants