
Refine delay jitter for exponential backoff #7206


Conversation

YuriyHolinko
Contributor

@YuriyHolinko YuriyHolinko commented Mar 17, 2025

Adjusted the jitter calculation to improve the randomness and distribution of delays in the exponential backoff logic.

The current implementation is not truly exponential backoff: it draws a random delay in the range (0, upperBound), and only upperBound grows exponentially, so in some cases we generate relatively low delays for retries.

The code also links to https://github.com/grpc/proposal/blob/master/A6-client-retries.md#exponential-backoff as the exponential backoff implementation being followed, but the actual source code was different.
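
To illustrate the difference, here is a rough sketch of the two delay calculations (the constants and the 0.8 to 1.2 jitter range are illustrative assumptions, not values taken from this PR):

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

class BackoffSketch {
  static final long INITIAL_BACKOFF_NANOS = TimeUnit.SECONDS.toNanos(1);
  static final long MAX_BACKOFF_NANOS = TimeUnit.SECONDS.toNanos(5);
  static final double MULTIPLIER = 1.5;

  // Old behavior: the delay is uniform in (0, upperBound); only the upper bound grows
  // exponentially, so even a late attempt can draw a delay of just a few milliseconds.
  static long oldDelayNanos(int attempt) {
    long upperBound = Math.min(
        (long) (INITIAL_BACKOFF_NANOS * Math.pow(MULTIPLIER, attempt - 1)), MAX_BACKOFF_NANOS);
    return ThreadLocalRandom.current().nextLong(upperBound);
  }

  // Proposed behavior: the delay itself grows exponentially, and jitter only perturbs it
  // around the current backoff value.
  static long newDelayNanos(int attempt) {
    long currentBackoff = Math.min(
        (long) (INITIAL_BACKOFF_NANOS * Math.pow(MULTIPLIER, attempt - 1)), MAX_BACKOFF_NANOS);
    double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
    return (long) (jitter * currentBackoff);
  }
}

With the old calculation, the expected delay for an attempt is only half of the exponential upper bound and can be arbitrarily small; with the proposed calculation it stays close to the exponential value.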

@YuriyHolinko YuriyHolinko requested a review from a team as a code owner March 17, 2025 15:11

codecov bot commented Mar 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.93%. Comparing base (490173b) to head (2fe22d8).
Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #7206   +/-   ##
=========================================
  Coverage     89.93%   89.93%           
- Complexity     6676     6678    +2     
=========================================
  Files           750      750           
  Lines         20168    20176    +8     
  Branches       1978     1978           
=========================================
+ Hits          18138    18146    +8     
  Misses         1435     1435           
  Partials        595      595           


@jack-berg
Member

Is this fixing the problem discussed in #7004?

@YuriyHolinko
Contributor Author

YuriyHolinko commented Mar 17, 2025

Is this fixing the problem discussed in #7004?

Yeah, it looks like the same issue.
I didn't think anyone else was affected by it, but since a few users are hitting it, this will be a very useful feature/bug fix.

Member

@jack-berg jack-berg left a comment


The new logic for computing backoff looks good. Just wondering why touch the other logic, which needs to be pretty carefully considered.

long currentBackoffNanos =
    Math.min(nextBackoffNanos, retryPolicy.getMaxBackoff().toNanos());
long backoffNanos = (long) (randomJitter.get() * currentBackoffNanos);
nextBackoffNanos = (long) (currentBackoffNanos * retryPolicy.getBackoffMultiplier());
Member


Should also update the implementation in JdkHttpSender.

I know it's not ideal that there are two implementations... Maybe it's worth adding a utility function to RetryPolicy that computes the backoff for a given attempt N. The signature might look like:

public long computeBackoffNanosForAttempt(int attempt, Random randomSource) {...}

It wouldn't be as efficient as the current implementation, but...

  • it's such a tiny amount of compute that it hardly matters
  • the compute is trivial compared to the overall cost of preparing and executing an HTTP request
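
For illustration, a rough sketch of what that utility could look like on RetryPolicy (the method does not exist today; getMaxBackoff() and getBackoffMultiplier() appear in the diff above, while getInitialBackoff() and the jitter range are assumptions):

// Hypothetical helper sketched from the suggested signature (randomSource is a java.util.Random).
public long computeBackoffNanosForAttempt(int attempt, Random randomSource) {
  // Recompute the capped exponential backoff for attempt N from scratch instead of
  // carrying nextBackoffNanos between attempts.
  double uncapped =
      getInitialBackoff().toNanos() * Math.pow(getBackoffMultiplier(), attempt - 1);
  long currentBackoffNanos = (long) Math.min(uncapped, (double) getMaxBackoff().toNanos());
  // Apply multiplicative jitter around the current backoff (range is illustrative).
  double jitter = 0.8 + randomSource.nextDouble() * 0.4;
  return (long) (jitter * currentBackoffNanos);
}

Each sender could then call retryPolicy.computeBackoffNanosForAttempt(attempt, random) instead of keeping its own nextBackoffNanos state.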

Contributor Author

@YuriyHolinko YuriyHolinko Mar 20, 2025


Thanks, I added the code for JdkHttpSender as well.

I decided against adding that method for calculating the backoff delay. Looking at the code, I'd say we could build more abstractions for sending requests and checking responses and exceptions, but I'm not sure that's really helpful, so let's probably keep the duplicated approach we had before.

try {
  response = chain.proceed(chain.request());
  if (response != null) {
Member


What's the motivation for changing this part of the logic?

Contributor Author

@YuriyHolinko YuriyHolinko Mar 20, 2025


If the response is null and no exception happened, the code fails at the throw exception; line because exception is null. When the response is null, it's not something transient.

Member


I suppose this is possible, but I haven't seen a null response in practice. If it does occur, we can simply add a null check immediately after response = chain.proceed(chain.request()).
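
A minimal sketch of that shape inside the interceptor (the exception message and the choice to throw are illustrative, not the merged change):

response = chain.proceed(chain.request());
if (response == null) {
  // A null response is not a transient failure, so don't retry; failing here avoids
  // reaching the final throw exception; statement with a null exception.
  throw new IOException("chain.proceed(chain.request()) returned a null response");
}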

Contributor Author

@YuriyHolinko YuriyHolinko Mar 20, 2025


That's exactly what I did in this PR 😃
Also, the null check existed before this change, so I intentionally kept it (though I suspect it could just be dropped).

The previous code also had an issue: if the second-to-last attempt returned a retryable response but the last attempt got a retryable exception, the method still returned the earlier response, which is not good; the last state (the exception) should be returned.
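
To illustrate the last-state point, a sketch of the loop shape being described (shouldRetry, maxAttempts, and the backoff sleep are placeholders, not the actual RetryInterceptor code):

Response response = null;
IOException exception = null;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
  // Reset both per attempt so whatever is returned or thrown reflects the last attempt,
  // not a retryable response left over from an earlier one.
  response = null;
  exception = null;
  try {
    response = chain.proceed(chain.request());
  } catch (IOException e) {
    exception = e;
  }
  if (!shouldRetry(response, exception) || attempt == maxAttempts) {
    break;
  }
  // ...sleep for the jittered backoff before the next attempt...
}
if (response != null) {
  return response;
}
// (the null-response edge case discussed above would still need its own check here)
throw exception;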

@YuriyHolinko
Contributor Author

YuriyHolinko commented Mar 20, 2025

The new logic for computing backoff looks good. Just wondering why touch the other logic, which needs to be pretty carefully considered.

I replied in the threads; please let me know if there are still any unanswered questions.

@jack-berg
Member

Thanks!

@jack-berg jack-berg merged commit 3c12e3a into open-telemetry:main Mar 25, 2025
28 checks passed