这是indexloc提供的服务,不要输入任何密码
Skip to content

Be more discerning about which errors cause mesh fall-back #11174

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 13, 2021

Conversation

julz
Copy link
Member

@julz julz commented Apr 13, 2021

Before this PR, we would fall back to service scraping if all pods we tried to scrape errored. In many cases this heuristic is fine, but with a small number (e.g. one) of pods, it can lead to us falling back to service scraping unnecessarily if an unrelated error occurs. Once we've turned off direct scraping, we never turn it back on, so this is unfortunate.

This PR tightens the check to ensure that all of the errors we saw were compatible with having been caused by the mesh being enabled before falling back to service scraping.

/hold for prow to tell me if this actually works with mesh, I might've guessed the wrong status code for mesh errors

Release Note

Tightens the heuristic for mesh being abled in the service scraper. We now expect all errors to be related to mesh (i.e. 503 status code). This prevents accidentally falling in to service scrape mode when errors are encountered for other reasons.

@knative-prow-robot knative-prow-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 13, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Apr 13, 2021
@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/autoscale labels Apr 13, 2021
@julz
Copy link
Member Author

julz commented Apr 13, 2021

/test pull-knative-serving-istio-stable-mesh
/test pull-knative-serving-istio-stable-mesh-tls

@codecov
Copy link

codecov bot commented Apr 13, 2021

Codecov Report

Merging #11174 (34cec38) into main (1b007c5) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #11174      +/-   ##
==========================================
+ Coverage   87.73%   87.75%   +0.02%     
==========================================
  Files         190      190              
  Lines        9171     9188      +17     
==========================================
+ Hits         8046     8063      +17     
+ Misses        871      870       -1     
- Partials      254      255       +1     
Impacted Files Coverage Δ
pkg/autoscaler/metrics/http_scrape_client.go 87.50% <100.00%> (+2.88%) ⬆️
pkg/autoscaler/metrics/stats_scraper.go 91.77% <100.00%> (+0.32%) ⬆️
pkg/reconciler/configuration/configuration.go 86.71% <0.00%> (-1.57%) ⬇️
pkg/activator/config/store.go 91.30% <0.00%> (ø)
pkg/reconciler/revision/config/store.go 100.00% <0.00%> (ø)
pkg/reconciler/autoscaling/config/store.go 100.00% <0.00%> (ø)
pkg/reconciler/configuration/config/store.go 100.00% <0.00%> (ø)
pkg/activator/handler/handler.go 95.65% <0.00%> (+0.53%) ⬆️
pkg/activator/net/revision_backends.go 91.32% <0.00%> (+0.91%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b007c5...34cec38. Read the comment docs.

@julz
Copy link
Member Author

julz commented Apr 13, 2021

passing the failure through the tests-we-know-dont-work-reliably-on-mesh filter, above is actually success! yay

@julz julz force-pushed the scrapedthistogether branch from 0d06832 to 15649df Compare April 13, 2021 16:15
Previous to this commit, we would fall back to service scraping if all
pods we tried to scrape errored. In many cases this heuristic is fine,
but with a small number (e.g. one) of pods, it can lead to us falling
back to service scraping unnecesarily. Once we've turned off direct
scraping, we never turn it back on, so this is unfortunate.

This commit tightens the check to ensure that all of the errors we saw
were compatible with having been caused by the mesh being enabled.
@julz julz force-pushed the scrapedthistogether branch from 15649df to 34cec38 Compare April 13, 2021 16:17
@julz julz changed the title WIP: Be more discerning about which errors cause mesh fall-back Be more discerning about which errors cause mesh fall-back Apr 13, 2021
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 13, 2021
@julz
Copy link
Member Author

julz commented Apr 13, 2021

Should be good to go, but I force pushed so lets rerun the tests (which will fail again, unfortunately, but hopefully only in expected ways) for safety:

/unhold
/test pull-knative-serving-istio-stable-mesh
/test pull-knative-serving-istio-stable-mesh-tls

/assign @markusthoemmes @vagababov PTAL :)

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 13, 2021
@julz
Copy link
Member Author

julz commented Apr 13, 2021

looks like rolling deployment upgrade failure again FYI @markusthoemmes

// TODO(julz) consider other tests, e.g. looking for upstream header in response.
func isMeshError(err error) bool {
var errWithStatus errorWithStatusCode
if errors.As(err, &errWithStatus) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you're into code golfing

return errors.As(err, &errWithStatus) && errWithStatus.statusCode == meshErrorStatusCode

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2021
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vagababov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2021
@julz
Copy link
Member Author

julz commented Apr 13, 2021

/test pull-knative-serving-upgrade-tests

@julz
Copy link
Member Author

julz commented Apr 13, 2021

Again 😢

/test pull-knative-serving-upgrade-tests

@knative-prow-robot
Copy link
Contributor

knative-prow-robot commented Apr 13, 2021

@julz: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-knative-serving-istio-stable-mesh 34cec38 link /test pull-knative-serving-istio-stable-mesh

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@julz
Copy link
Member Author

julz commented Apr 13, 2021

Another known flake. We really need to chase these down :(.

/test pull-knative-serving-istio-stable-no-mesh

@knative-prow-robot knative-prow-robot merged commit 01ef03f into knative:main Apr 13, 2021
@vagababov
Copy link
Contributor

Yay! finally

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/autoscale cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants