Make sure deployments never underscale on rollouts #11140
Conversation
Codecov Report
@@ Coverage Diff @@
## main #11140 +/- ##
==========================================
+ Coverage 87.69% 87.75% +0.06%
==========================================
Files 191 190 -1
Lines 9193 9173 -20
==========================================
- Hits 8062 8050 -12
+ Misses 879 869 -10
- Partials 252 254 +2
Continue to review full report at Codecov.
/test bla
@markusthoemmes: The specified target(s) for /test bla were not found. Use /test all to run all available jobs.
In response to this: /test bla
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-knative-serving-upgrade-tests
@@ -217,6 +219,13 @@ var (
			},
		},
		ProgressDeadlineSeconds: ptr.Int32(0),
		Strategy: appsv1.DeploymentStrategy{
			Type: appsv1.RollingUpdateDeploymentStrategyType,
			RollingUpdate: &appsv1.RollingUpdateDeployment{
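The diff is truncated before the RollingUpdateDeployment fields. As a minimal sketch of the kind of rolling-update settings that keep a Deployment from ever dropping below its desired replica count, something like the following would work; the concrete values (maxUnavailable of 0, maxSurge of 1) are illustrative assumptions, not necessarily the exact values this PR ships.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// noUnderscaleStrategy builds a rolling-update strategy that never lets the
// Deployment dip below its desired replica count: new pods are surged in
// first, and old pods are only removed once their replacements are ready.
// The surge of 1 is an illustrative assumption, not the value from the PR.
func noUnderscaleStrategy() appsv1.DeploymentStrategy {
	zero := intstr.FromInt(0) // never tolerate a missing pod during the roll
	one := intstr.FromInt(1)  // allow one pod above `replicas` while rolling

	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxUnavailable: &zero,
			MaxSurge:       &one,
		},
	}
}

func main() {
	fmt.Printf("%+v\n", noUnderscaleStrategy())
}
```

With maxUnavailable pinned to 0, the Deployment controller has to bring a surged replacement pod to Ready before it is allowed to start terminating an old one.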
I actually discussed this with Matt some two years ago, though I wanted to use even more aggressive settings :)
More aggressive in which ways?
/test pull-knative-serving-upgrade-tests
Sometimes we DO update the revision deployment, for example for config changes and for updates to the queue-proxy. We want those updates to have as little interruption as possible, so this requires the pods to be rolled in a way that there are never fewer pods than requested.
Force-pushed from e2cb911 to ef3a40d
/unhold
/assign @julz @vagababov @dprotaso
I'm pretty sure this is fine and what we should do, but can we quantify (maybe with a quick experiment?) what this will mean in terms of the number of pods before/after the change for users/operators? Especially in contexts where you're charging Actual Money per replica (but even if not, it's still how many extra nodes you need during an upgrade), that seems like important info. tl;dr: is the result here that we end up with one or two more pods during a deploy, or do we end up with a significant fraction of the total extra while the deployment rolls out?
@julz see the comment on that in the intro. For the K8s rollout mechanism, Terminating is equal to "Gone". Since our pods take around 40s to completely shut down, you're looking at doubling the number of pods in the cluster for any rollout taking <40s. That is also true for today's behavior. The only tangible difference is that, with this in place, the rollout will start by adding containers and only start removing once those added ones are ready. If you're charging for Terminating pods as well while they shut down, I don't think anything changes substantially. If you're only charging for ready pods, you might see 1 or 2 extra pods in this case.
Or to put it more simply: with this, you're tracking 0/+2 extra ready pods, whereas before we've been tracking roughly -1/+1 ready pods.
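A rough way to sanity-check those before/after numbers: the Deployment controller resolves maxSurge by rounding up and maxUnavailable by rounding down against the replica count, so with maxUnavailable at 0 a rollout can only ever add ready pods on top of replicas. The sketch below is illustrative only; the 25% surge and the replica counts are assumed values, and it uses the intstr helper from apimachinery to mirror that rounding.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Illustrative values: a percentage-based surge and no tolerated unavailability.
	maxSurge := intstr.FromString("25%")
	maxUnavailable := intstr.FromInt(0)

	for _, replicas := range []int{1, 4, 10} {
		// maxSurge rounds up and maxUnavailable rounds down, mirroring how the
		// Deployment controller resolves these fields against the replica count.
		surge, _ := intstr.GetScaledValueFromIntOrPercent(&maxSurge, replicas, true)
		unavail, _ := intstr.GetScaledValueFromIntOrPercent(&maxUnavailable, replicas, false)

		fmt.Printf("replicas=%d: ready pods stay within [%d, %d] during a rollout\n",
			replicas, replicas-unavail, replicas+surge)
	}
}
```

For replicas=10 this prints a band of [10, 13] ready pods, which matches the "0/+N extra ready pods" framing above.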
that's the line I was looking for 👍
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dprotaso
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hello @markusthoemmes, what ways do we offer to update the queue-proxy? Using "kubectl edit deploy"?
Proposed Changes
Ref #11092
Sometimes we DO update the revision deployment, for example for config changes and for updates to the queue-proxy. We want those updates to have as little interruption as possible, so this requires the pods to be rolled in a way that there are never fewer pods than requested.
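If someone wants to observe the "never fewer pods than requested" property directly during an upgrade, one hypothetical approach is to poll the revision Deployment's status while the rollout runs and flag any dip of ready replicas below the desired count. The sketch below uses client-go; the kubeconfig loading, namespace, and deployment name are placeholder assumptions, not anything defined in this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// watchRollout polls a Deployment and reports whenever its ready-replica count
// dips below the desired replica count, i.e. whenever a rollout underscales.
func watchRollout(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	for {
		d, err := client.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		desired := int32(1)
		if d.Spec.Replicas != nil {
			desired = *d.Spec.Replicas
		}
		if d.Status.ReadyReplicas < desired {
			fmt.Printf("underscaled: %d/%d ready\n", d.Status.ReadyReplicas, desired)
		} else {
			fmt.Printf("ok: %d/%d ready (%d pods total)\n", d.Status.ReadyReplicas, desired, d.Status.Replicas)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}

func main() {
	// Load the local kubeconfig; this is a standalone sketch, not part of the PR.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Hypothetical revision deployment name; substitute one from your cluster.
	if err := watchRollout(ctx, client, "default", "hello-00001-deployment"); err != nil {
		fmt.Println("stopped:", err)
	}
}
```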
Context on the rollout speed: This does not significantly impact the speed at which a rollout is done. As soon as the old pods are Terminating, the next batch will be rolled, so this just makes sure we never dip below the replicas setting of the Deployment. Since our pods take ~40s to shut down anyway, this likely doesn't even significantly impact the amount of resources used by the rollout.
Release Note