feat(spanner): add GCP standard otel attributes #11652
Conversation
Force-pushed 737e400 to 782047b
Force-pushed 9d27fab to 4261d64
Force-pushed 4261d64 to 2fc0115
olavloite left a comment
Could we also modify some of the existing test cases and/or add new test cases that verify that the additional span attributes are present? The main reason is not that I think they are missing, but to prevent accidental regressions in the future where someone removes them, or changes the key of some of them, by accident.
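A hedged sketch of what such a regression test could look like, using OpenTelemetry's in-memory exporter to assert on exported attributes. The expected key (db.statement) is taken from the benchmark discussion further down; the real test in the PR will differ in setup and coverage:

package spanner

import (
	"context"
	"testing"

	sppb "cloud.google.com/go/spanner/apiv1/spannerpb"
	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// TestSetSpanAttributesRegression guards against the attributes (or their
// keys) being removed or renamed by accident.
func TestSetSpanAttributesRegression(t *testing.T) {
	exporter := tracetest.NewInMemoryExporter()
	tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
	_, span := tp.Tracer("test").Start(context.Background(), "test-span")

	setSpanAttributes(span, &sppb.ExecuteSqlRequest{Sql: "SELECT 1"})
	span.End()

	got := map[attribute.Key]string{}
	for _, s := range exporter.GetSpans() {
		for _, kv := range s.Attributes {
			got[kv.Key] = kv.Value.Emit()
		}
	}
	if got["db.statement"] != "SELECT 1" {
		t.Errorf("db.statement = %q, want %q", got["db.statement"], "SELECT 1")
	}
}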
metrics := parseServerTimingHeader(md)
mt.currOp.currAttempt.setServerTimingMetrics(metrics)
latencies := parseServerTimingHeader(md)
span := otrace.SpanFromContext(ctx)
Is this guaranteed to always return a useful value? Or put another way: what if the customer is not using tracing? Do we need to execute these steps here? And if we still do, is there any chance of errors?
We need parseServerTimingHeader anyway for the native metrics; in this PR we just pass its result to setGFEAndAFESpanAttributes, and if the customer is not using tracing the attributes are not captured at all:
if !span.IsRecording() {
	return
}
And yes, it always returns a useful value, for both the cloud path and the direct path.
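For context on that answer: the OpenTelemetry Go API returns a non-nil no-op span from trace.SpanFromContext when the context carries no span, and IsRecording() reports false for it, so the early return shown above also covers untraced requests. A minimal illustration:

package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/trace"
)

func main() {
	// No tracer provider configured and no span in the context:
	// SpanFromContext still returns a usable no-op span, never nil.
	span := trace.SpanFromContext(context.Background())
	fmt.Println(span.IsRecording()) // false, so the attribute code is skipped
}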
// setSpanAttributes dynamically sets span attributes based on the request type.
func setSpanAttributes[T any](span oteltrace.Span, req T) {
	if !span.IsRecording() {
ah, ok. I guess this answers my question above.
spanner/grpc_client.go (outdated)
}
for t, v := range latencyMap {
	span.SetAttributes(
		attribute.Float64(t[:3]+".latency_ms", float64(v.Nanoseconds())/1e6),
nit: this could use a small comment explaining what t[:3] does (gfe or afe latency). And are we absolutely sure that this will never change in the future? What if we introduce a new header key in the future that is shorter, and someone is still using this version of the client? Maybe we should only do the t[:3] truncation for the well-known header keys, and include the full header key for other values (or skip them completely).
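One possible shape of that suggestion, sketched under the assumption that the well-known header keys start with "gfe"/"afe"; this is only an illustration of the comment, not the code that ended up in the PR:

import (
	"strings"
	"time"

	"go.opentelemetry.io/otel/attribute"
	oteltrace "go.opentelemetry.io/otel/trace"
)

// setLatencyAttributes shortens only the well-known gfe/afe keys and keeps the
// full header key for anything else, as suggested in the review comment.
func setLatencyAttributes(span oteltrace.Span, latencyMap map[string]time.Duration) {
	for t, v := range latencyMap {
		name := t // unknown keys: keep the full header key
		if strings.HasPrefix(t, "gfe") || strings.HasPrefix(t, "afe") {
			name = t[:3] // well-known keys: "gfe.latency_ms" / "afe.latency_ms"
		}
		span.SetAttributes(attribute.Float64(name+".latency_ms", float64(v.Nanoseconds())/1e6))
	}
}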
var attrs []attribute.KeyValue

if r, ok := any(req).(interface {
How expensive are these checks? They are executed for each request, and there is no way to enable/disable them other than disabling tracing. It is useful information, so I would lean towards enabling it by default, but should we include an option to disable it?
This is only executed for each request when span.IsRecording() is true. It uses generics, which is faster and cleaner than plain interface assertions.
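For reference, the pattern under discussion looks roughly like the sketch below (reconstructed, not the exact PR code): the generic function probes the request with small per-capability interface assertions, so only requests that expose the relevant getters pay for building the corresponding attributes.

import (
	sppb "cloud.google.com/go/spanner/apiv1/spannerpb"
	"go.opentelemetry.io/otel/attribute"
	oteltrace "go.opentelemetry.io/otel/trace"
)

// setSpanAttributesSketch illustrates the generic-plus-assertion approach:
// each capability is probed with a cheap type assertion on any(req).
func setSpanAttributesSketch[T any](span oteltrace.Span, req T) {
	if !span.IsRecording() {
		return
	}
	var attrs []attribute.KeyValue
	if r, ok := any(req).(interface{ GetRequestOptions() *sppb.RequestOptions }); ok {
		if opts := r.GetRequestOptions(); opts != nil {
			if tag := opts.GetTransactionTag(); tag != "" {
				attrs = append(attrs, attribute.String("transaction.tag", tag))
			}
			if tag := opts.GetRequestTag(); tag != "" {
				attrs = append(attrs, attribute.String("statement.tag", tag))
			}
		}
	}
	if r, ok := any(req).(interface{ GetSql() string }); ok {
		if sql := r.GetSql(); sql != "" {
			attrs = append(attrs, attribute.String("db.statement", sql))
		}
	}
	if len(attrs) > 0 {
		span.SetAttributes(attrs...)
	}
}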
I added a benchmark test, BenchmarkSetSpanAttributes.
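A hedged sketch of how such a benchmark can be structured; the actual BenchmarkSetSpanAttributes in the PR covers more request types and reports percentiles. It exercises the function against both a recording span and a span that is no longer recording:

import (
	"context"
	"testing"

	sppb "cloud.google.com/go/spanner/apiv1/spannerpb"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func BenchmarkSetSpanAttributesSketch(b *testing.B) {
	tp := sdktrace.NewTracerProvider() // default sampler records root spans
	tracer := tp.Tracer("benchmark")
	req := &sppb.ExecuteSqlRequest{Sql: "SELECT 1"}

	b.Run("recording", func(b *testing.B) {
		_, span := tracer.Start(context.Background(), "test")
		defer span.End()
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			setSpanAttributes(span, req)
		}
	})

	b.Run("not-recording", func(b *testing.B) {
		_, span := tracer.Start(context.Background(), "test")
		span.End() // an ended span no longer records
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			setSpanAttributes(span, req)
		}
	})
}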
High-Level Summary
- Not Recording is Cheap: When the span is not recording, the function's overhead is negligible and constant (~64 ns, 0 allocations). The function correctly returns early, making it very efficient in this state.
- Recording Cost is Proportional to Work: When the span is recording, the performance cost (latency and allocations) is directly proportional to the number and type of attributes being set on the span.
- Tail Latency is High: The p99 latencies are significantly higher than the median (p50) latencies. This indicates that while the function is fast on average, there are occasional spikes where it can take much longer (e.g., 2.6 µs vs a median of 166 ns for ExecuteSqlRequest). This is common in tracing systems and can be caused by factors like garbage collection, context switching, or internal locking within the tracing SDK.
Detailed Breakdown
1. Not-Recording Spans
Across all request types, the performance is consistently fast and predictable:
- Latency: ~64 ns
- Allocations: 0 allocs/op (the 8 bytes reported are likely from the benchmark harness itself).
- p50/p90/p99: All hover around 41-42 ns.
This confirms that the if !span.IsRecording() check is effective at making the function call a cheap no-op when tracing is disabled or sampled out.
2. Recording Spans
Here, performance varies based on the request type, as expected.
- ExecuteBatchDmlRequest (worst case):
  - Latency: ~380 ns (p50: 250 ns, p99: 2.75 µs)
  - Allocations: 6 allocs/op (708 B)
  - Reason: This is the most expensive case because the function iterates over a slice of DML statements, creates a new slice of strings (sqls), and then adds multiple attributes. This involves more logic and memory allocation than the other types.
- ExecuteSqlRequest & ReadRequest:
  - Latency: ~270-280 ns (p50: 166 ns, p99: ~2.6 µs)
  - Allocations: 3 allocs/op (612 B)
  - Reason: These requests set three attributes (transaction.tag, statement.tag, and db.statement or db.table), leading to moderate latency and allocations.
- CommitRequest:
  - Latency: ~200 ns (p50: 125 ns, p99: ~2.0 µs)
  - Allocations: 2 allocs/op (302 B)
  - Reason: This is faster as it only sets two attributes from the request options.
- PartitionQueryRequest & PartitionReadRequest:
  - Latency: ~130 ns (p50: 83 ns, p99: ~0.4 µs)
  - Allocations: 1 alloc/op (123 B)
  - Reason: You correctly identified that these requests don't always have RequestOptions. After your change removing them, these requests only set one attribute (db.statement or db.table), making them much faster and more memory-efficient than the other requests.
- Requests with no attributes (BatchWriteRequest, empty struct):
  - Latency: ~80 ns
  - Allocations: 0 allocs/op
  - Reason: The performance is very close to the not-recording case. The slightly higher latency comes from the additional type assertion checks inside the function, which ultimately find no attributes to set.
Conclusion
The setSpanAttributes function is well-optimized. The primary performance cost is incurred only when a span is actively recording, and this cost scales linearly with the complexity of the request object. The worst-case median latency is low (~250 ns for a batch DML request), though tail latencies can be an order of magnitude higher. For requests that don't set any attributes, the overhead is minimal.
@olavloite I updated the tests.
Key Changes: