
README for new MT Scheduler with pluggable policies #888

Merged: 3 commits merged into knative-extensions:main on Oct 4, 2021

Conversation

aavarghese
Contributor

Signed-off-by: aavarghese <avarghese@us.ibm.com>

Continuation of #768

Proposed Changes

  • README

Release Note


Docs

@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Sep 22, 2021
@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 22, 2021
@aavarghese
Contributor Author

/cc @lionelvillard

@codecov

codecov bot commented Sep 22, 2021

Codecov Report

Merging #888 (ea56516) into main (3f66360) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #888   +/-   ##
=======================================
  Coverage   75.01%   75.01%           
=======================================
  Files         152      152           
  Lines        7080     7080           
=======================================
  Hits         5311     5311           
  Misses       1485     1485           
  Partials      284      284           


@lionelvillard
Contributor

@aavarghese can you fix the linter errors? thx!

@pierDipi pierDipi (Member) left a comment

Thanks, I love this document!

1. **Pod failure**:
When a pod/replica in a StatefulSet goes down for some reason (but its node and zone are healthy), the StatefulSet spins up a new replica with the same pod identity (the pod can come up on a different node) almost immediately.
All existing vreplica placements will still be valid and no rebalancing is needed.
There shouldn’t be any degradation in Kafka message processing.
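A minimal sketch of why the excerpt above says no rebalancing of vreplicas is needed, assuming the scheduler keys placements by the stable StatefulSet pod name; the `Placement` type, field names, and pod names below are illustrative, not the actual eventing scheduler API:

```go
package main

import "fmt"

// Placement records how many vreplicas of a source are assigned to a pod.
// The type is illustrative; the real scheduler API may differ.
type Placement struct {
	PodName   string // stable StatefulSet identity, e.g. "kafka-source-dispatcher-2"
	VReplicas int32
}

// placementsStillValid returns true when every placed pod name still exists
// in the current set of StatefulSet pods. A pod that crashed and was
// recreated keeps its ordinal name, so its placements remain valid.
func placementsStillValid(placements []Placement, livePods map[string]bool) bool {
	for _, p := range placements {
		if !livePods[p.PodName] {
			return false
		}
	}
	return true
}

func main() {
	placements := []Placement{{PodName: "kafka-source-dispatcher-2", VReplicas: 3}}

	// Pod "kafka-source-dispatcher-2" crashed and was recreated by the
	// StatefulSet with the same name (possibly on a different node).
	livePods := map[string]bool{
		"kafka-source-dispatcher-0": true,
		"kafka-source-dispatcher-1": true,
		"kafka-source-dispatcher-2": true, // same identity after restart
	}

	fmt.Println("rebalance of vreplicas needed:", !placementsStillValid(placements, livePods))
}
```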
pierDipi (Member) commented:
This is not really true: a consumer group rebalance could degrade message processing, especially when the Kafka consumer incremental rebalance protocol is not being used (which, AFAIK, is not implemented in Sarama).
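For context, a minimal Sarama consumer-group configuration sketch; at the time of this discussion Sarama shipped only eager rebalance strategies (range, round-robin, sticky), so any member joining or leaving pauses the whole group while partitions are reassigned. The broker address and group ID below are placeholders:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// Eager strategies revoke all partitions in the group on every join/leave
	// before reassigning them, which pauses consumption (the degradation
	// discussed above). Sticky keeps assignments stable across rebalances,
	// but it is still an eager (stop-the-world) protocol, not the
	// cooperative/incremental one.
	cfg.Consumer.Group.Rebalance.Strategy = sarama.BalanceStrategySticky

	client, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "my-group", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```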

aavarghese (Contributor, Author) replied:

@pierDipi the pods being referred to here are only the eventing scheduler adapter pods where vreplicas are placed. Since the pod will restart with the same identity, the same placements can be kept without rebalancing the vreplicas.
I agree with you about the consumer group rebalancing and degradation, but that may or may not happen here, depending on whether the Kafka pods are affected as well.
I hope I'm not missing anything...

pierDipi (Member) replied:

> @pierDipi the pods being referred to here are only the eventing scheduler adapter pods where vreplicas are placed. Since the pod will restart with the same identity, the same placements can be kept without rebalancing the vreplicas.

This is what "All existing vreplica placements will still be valid and no rebalancing is needed." is saying; I agree, and it's clear to me why.

I was referring to "There shouldn’t be any degradation in Kafka message processing."

> I agree with you about the consumer group rebalancing and degradation, but that may or may not happen here, depending on whether the Kafka pods are affected as well.
> I hope I'm not missing anything...

So, are you saying that if a pod where vreplicas are placed goes down, that won't trigger a consumer group rebalance that affects message processing?

In the worst-case scenario, I'd expect something like this to happen (happy to be wrong):

  1. Pod goes down
  2. A new pod comes up (same name)
  3. The Kafka broker sees a new consumer that wants to join the group -> rebalance
  4. Kafka detects that the consumer that was consuming messages in the dead pod (1) is no longer sending heartbeats -> rebalance (again)

At least one rebalance happens; two in the worst case, since terminationGracePeriodSeconds = 0 < "time for Kafka to detect that a consumer is dead".

Is the above not possible? And if it is possible, does that count as a degradation of Kafka message processing?
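To make the worst case above concrete, a rough timing sketch: with terminationGracePeriodSeconds = 0 the old consumer never leaves the group cleanly, so the broker only evicts it once the session timeout expires, which is the second rebalance. The timeout and heartbeat values are assumptions in the range of Sarama's defaults, not measured from this code base:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed values; the StatefulSet in question uses a zero grace period.
	terminationGracePeriod := 0 * time.Second
	sessionTimeout := 10 * time.Second   // broker waits this long for missing heartbeats
	heartbeatInterval := 3 * time.Second // consumer heartbeat cadence

	// Rebalance 1: the restarted pod (same name, new consumer member) joins
	// the group as soon as it comes back up.
	// Rebalance 2: because the old consumer was killed without a clean
	// LeaveGroup (grace period shorter than the session timeout), the broker
	// only notices it is gone after sessionTimeout, triggering another
	// rebalance.
	if terminationGracePeriod < sessionTimeout {
		fmt.Printf("expect a second rebalance ~%s after the pod is killed (heartbeats every %s)\n",
			sessionTimeout, heartbeatInterval)
	}
}
```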

aavarghese (Contributor, Author) replied:

You're right. This is absolutely possible.

I made an assumption that the same sticky pod (when restarted) would have the same consumer member ID and, using static membership, would get the same assignment.

I don't have any numbers to quantify the extent of degradation for these recovery scenarios. I'll need to do some performance runs to measure latency. Thank you for catching this @pierDipi !!
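For reference, static membership (KIP-345) is the mechanism that would let a restarted pod reclaim its previous assignment: the consumer presents a stable group.instance.id, and the broker hands back the old assignment without a full rebalance as long as the member rejoins within the session timeout. Sarama did not expose this at the time of this discussion; the hypothetical sketch below assumes a newer client that provides Consumer.Group.InstanceId, and derives the ID from a POD_NAME environment variable (both placeholders here):

```go
package main

import (
	"log"
	"os"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// Static membership (KIP-345): derive a stable group.instance.id from the
	// StatefulSet pod name so a restarted replica reclaims its old assignment
	// without triggering a rebalance. The InstanceId field is only available
	// in Sarama releases newer than this discussion; treat this as a sketch.
	cfg.Consumer.Group.InstanceId = os.Getenv("POD_NAME")

	client, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "my-group", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```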

Signed-off-by: aavarghese <avarghese@us.ibm.com>
@aavarghese aavarghese force-pushed the doc branch 2 times, most recently from 3c4da3d to b49ece3 on September 30, 2021 15:05
Signed-off-by: aavarghese <avarghese@us.ibm.com>
Signed-off-by: aavarghese <avarghese@us.ibm.com>
@lionelvillard
Contributor

/approve
/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2021
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aavarghese, lionelvillard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2021
@knative-prow-robot knative-prow-robot merged commit 7b363a2 into knative-extensions:main Oct 4, 2021