This repository was archived by the owner on Sep 2, 2024. It is now read-only.

README for new MT Scheduler with pluggable policies #888

Merged
knative-prow-robot merged 3 commits into knative-extensions:main from aavarghese:doc
Oct 4, 2021

Conversation

@aavarghese
Contributor

Signed-off-by: aavarghese avarghese@us.ibm.com

Continuation of #768

Proposed Changes

  • README

Release Note


Docs

@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Sep 22, 2021
@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 22, 2021
@aavarghese
Contributor Author

/cc @lionelvillard

@codecov

codecov bot commented Sep 22, 2021

Codecov Report

Merging #888 (ea56516) into main (3f66360) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #888   +/-   ##
=======================================
  Coverage   75.01%   75.01%           
=======================================
  Files         152      152           
  Lines        7080     7080           
=======================================
  Hits         5311     5311           
  Misses       1485     1485           
  Partials      284      284           

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f66360...ea56516. Read the comment docs.

@lionelvillard
Contributor

@aavarghese can you fix the linter errors? thx!

Member

@pierDipi left a comment


Thanks, I love this document!

Comment thread pkg/common/scheduler/README.md Outdated
1. **Pod failure**:
When a pod/replica in a StatefulSet goes down for some reason (but its node and zone are healthy), the StatefulSet spins up a new replica with the same pod identity almost immediately (the pod can come up on a different node).
All existing vreplica placements will still be valid and no rebalancing is needed.
There shouldn’t be any degradation in Kafka message processing.
Member


This is not really true: a consumer group rebalance could degrade message processing, especially when the Kafka Consumer Incremental Rebalance Protocol is not being used (which, afaik, is not implemented in Sarama).

Contributor Author


@pierDipi the pod set being referred to here is only the eventing scheduler adapter pods where vreplicas are placed. Since the pod will restart, the same placements can be kept without a rebalancing of the vreps.
I agree with you about the consumer group rebalancing and degradation, but that may or may not happen here if the Kafka pods are affected as well.
I hope I'm not missing anything...

Member


> @pierDipi the pod set being referred to here is only the eventing scheduler adapter pods where vreplicas are placed. Since the pod will restart, the same placements can be kept without a rebalancing of the vreps.

This is what "All existing vreplica placements will still be valid and no rebalancing is needed." is saying; I agree, and it's clear to me why.

I was referring to "There shouldn’t be any degradation in Kafka message processing."

> I agree with you about the consumer group rebalancing and degradation, but that may or may not happen here if the Kafka pods are affected as well.
> I hope I'm not missing anything...

So, are you saying that if a pod where vreplicas are placed goes down, it won't trigger a consumer group rebalance that affects message processing?

In the worst-case scenario, I'd expect something like this to happen (happy to be wrong):

  1. Pod goes down
  2. A new pod comes up (same name)
  3. Kafka broker sees a new consumer that wants to join the group -> rebalance
  4. Kafka detects that the consumer that was consuming messages in the dead pod (1) is not sending heartbeats anymore -> rebalance (again)

At least one rebalance happens; 2 in the worst case, since terminationGracePeriodSeconds = 0 < "time for Kafka to detect that a consumer is dead".

Is the above not possible? If yes, does that count as a degradation for Kafka message processing?
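The race described above comes down to the StatefulSet's terminationGracePeriodSeconds versus the consumer's session timeout. A minimal sketch of where that knob lives (all names and values here are illustrative, not taken from the actual manifests):

```yaml
# Hypothetical StatefulSet fragment for the adapter pods.
# With terminationGracePeriodSeconds: 0 the old consumer never leaves the
# group cleanly, so the broker only notices it is gone once session.timeout.ms
# expires -- which is what allows the second rebalance in step 4 above.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka-source-adapter   # hypothetical name
spec:
  replicas: 3
  template:
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: adapter
          image: example.com/adapter:latest   # placeholder
```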

Contributor Author


You're right. This is absolutely possible.

I made an assumption that the same sticky pod (when restarted) would have the same consumer member ID and, using static membership, would get the same assignment back.

I don't have any numbers to quantify the extent of degradation for these recovery scenarios. We'll need to run some performance tests to measure latency. Thank you for catching this @pierDipi !!
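For reference, the static-membership assumption corresponds to Kafka's KIP-345 consumer settings; a sketch of the relevant client configuration (values illustrative), assuming a client that supports it:

```properties
# Static membership (KIP-345): a consumer that rejoins with the same
# group.instance.id within session.timeout.ms gets its old assignment back
# without triggering a full rebalance.
group.id=knative-adapter-group      # illustrative group name
group.instance.id=adapter-pod-0     # stable ID tied to the StatefulSet pod identity
session.timeout.ms=30000            # must outlive the pod restart window
```

Whether this applies here depends on the client library; support for static membership in Sarama at the time of this discussion is a separate question.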

Signed-off-by: aavarghese <avarghese@us.ibm.com>
Comment thread pkg/common/scheduler/README.md
Comment thread pkg/common/scheduler/README.md Outdated
@aavarghese force-pushed the doc branch 2 times, most recently from 3c4da3d to b49ece3 on September 30, 2021 15:05
Signed-off-by: aavarghese <avarghese@us.ibm.com>
Comment thread pkg/common/scheduler/README.md
Signed-off-by: aavarghese <avarghese@us.ibm.com>
@lionelvillard
Contributor

/approve
/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2021
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aavarghese, lionelvillard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2021
@knative-prow-robot knative-prow-robot merged commit 7b363a2 into knative-extensions:main Oct 4, 2021
