[confgenerator] Add Otel Logging metrics to AgentSelfMetrics.
#1957
Conversation
confgenerator/agentmetrics.go (Outdated)

otel.AggregateLabels("sum", "response_code")),
),
// Aggregating delta metrics from fluent-bit and otel logging will.
// Keep the initial value so the resulting cumulative metric at the end of the pipeline.
I think we need to use drop or we'll double-count fluent bit values if otel restarts and fluent-bit does not, no?
we'll double-count fluent bit values if otel restarts and fluent-bit does not
This won't happen because the mechanism that keeps the "historic metric count" for the summed metrics is the "deltatocumulative" processor, which will reset if otel restarts.
Note: drop could almost work, but we would lose all the "logs sent on startup" count from the first metric datapoint.
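To make the role of deltatocumulative concrete, here is a minimal Go sketch of the in-memory accumulation described above; the types, timestamps and values are illustrative only, not the collector's implementation:

```go
package main

import "fmt"

// cumulative holds the in-memory state kept per series; it is lost
// (and the start time reset) whenever otel restarts.
type cumulative struct {
	start string
	sum   float64
}

func (c *cumulative) add(delta float64, now string) (float64, string, string) {
	if c.start == "" {
		c.start = now // first point seen after (re)start defines the start time
	}
	c.sum += delta
	return c.sum, c.start, now
}

func main() {
	c := &cumulative{}
	scrapes := []struct {
		delta float64
		at    string
	}{{1, "t1"}, {1, "t2"}, {1, "t3"}}
	for _, s := range scrapes {
		sum, start, end := c.add(s.delta, s.at)
		fmt.Printf("cumulative %v (%s, %s)\n", sum, start, end)
	}
	// If otel restarts here, c is recreated: the running total and start time
	// begin again from scratch, which is the "reset" referred to above.
}
```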
I don't understand why this wouldn't double-count. Consider the following scrapes:
- scrape 1, delta 1 (because keep), cumulative 1 (t0-t1)
- scrape 2, delta 1 (2-1 = 1), cumulative 2 (t0-t2)
- scrape 3, delta 1 (3-1 = 1), cumulative 3 (t0-t3)
- otel restarts
- scrape 3, delta 3 (because keep), cumulative 3 (t4-t5)
The scrape after the restart should be reporting a cumulative 0 because we already reported those increments, not a cumulative 3 with a fresh start time (which is considered as additional to the previous cumulative 3)
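A small Go sketch of the arithmetic in this example, under the assumption (illustrative, not any actual backend code) that a cumulative series with a fresh start time is counted as a new counter on top of the previous one:

```go
package main

import "fmt"

// point is an exported cumulative datapoint from the example above.
type point struct {
	start, end string
	value      float64
}

func main() {
	exported := []point{
		{"t0", "t1", 1}, // scrape 1
		{"t0", "t2", 2}, // scrape 2
		{"t0", "t3", 3}, // scrape 3
		{"t4", "t5", 3}, // first scrape after the otel restart, with initial_value = keep
	}

	// Each distinct start time is treated as a fresh counter, so its last
	// value is added to the total on top of the earlier run.
	lastPerStart := map[string]float64{}
	for _, p := range exported {
		lastPerStart[p.start] = p.value
	}
	total := 0.0
	for _, v := range lastPerStart {
		total += v
	}
	fmt.Println("total counted:", total) // 6, although only 3 log entries were ever sent
}
```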
Consider, for contrast, the current released Ops Agent behaviour in the same example. I added some updates in the time ranges assuming t(n+1) - t(n) = 1 minute regularity to simplify the example :
- scrape at t1, cumulative 1 (t1, t1)
  - first scrape of a prometheus metric has start_time == time
- scrape at t2, cumulative 2 (t1, t2)
- scrape at t3, cumulative 3 (t1, t3)
- otel restarts at t4
- scrape at t5, cumulative 3 (t5, t5)
  - first scrape of a prometheus metric has start_time == time
  - the scrape gets 3 (because fluent-bit didn't restart)
  - reported cumulative 3 with fresh timestamp t5
The current implementation in this PR with keep will result in the same behaviour as shown in this example.
The scrape after the restart should be reporting a cumulative 0 because we already reported those increments, not a cumulative 3 with a fresh start time (which is considered as additional to the previous cumulative 3)
Following the semantics of a "cumulative metric" the range (t5, t6) shouldn't have any data from scrapes 1, 2 or 3, because there were no observations of an experiment ("logs sent") in that time range. This is "semantically" correct, but enforcing this fails when considering other edge cases. For example, if otel takes 3 minutes to start, the first scrape will get "log counts" from an unknown time range since fluent-bit started.
The main idea of why using keep doesn't double count is that the initial datapoint value after a restart only represents data of an "unknown" time range (e.g. (?, t5)) that happened before the scrape.
The provided example can be rewritten like this:
- scrape at t1, delta 1 (because keep), cumulative 1 (t0, t1)
  - after “aligning” all delta points to a 1m window we have a start_time = t0
- scrape at t2, delta 1 (2-1 = 1), cumulative 2 (t0, t2)
- scrape at t3, delta 1 (3-1 = 1), cumulative 3 (t0, t3)
- otel restarts at t4
  - deltatocumulative loses in-memory data about the previous datapoints
- scrape at t5, delta 3 (because keep), cumulative 3 (t4, t5)
  - deltatocumulative observes a new metric
  - after “aligning” all delta points to a 1m window we have a start_time = t4
If you're accurately describing the currently released Ops Agent behavior, then that behavior is also buggy. I would have expected us to be correctly using process_start_time as the start time for our fluent-bit metrics today.
You're correct that using drop would potentially result in undercounting if there were logs uploaded while otel was not running. I think that's the lesser of two evils, since it's only a problem if otel is not running for a long time, whereas overcounting is a problem every single time otel restarts.
If you're accurately describing the currently released Ops Agent behavior, then that behavior is also buggy. I would have expected us to be correctly using process_start_time as the start time for our fluent-bit metrics today.
Now I realized I was missing a piece of the (puzzle) pipeline. The OA behaviour example I was describing was from observing the debugexporter logs before exporting the metrics.
The googlecloudexporter has a feature, cumulative_normalization, enabled by default. This feature stores "reset" points (startTime == 0 or startTime == endTime or currValue < prevValue) and uses them as an "anchor" to subtract the subsequent point values from. This is the same behaviour as auto in the cumulativetodeltaprocessor (which is very similar to drop with additional heuristics).
Note: fluent-bit does have process_start_time, but otel doesn't. Neither is configured to be used in the prometheus receiver.
Current OA released behaviour:
- scrape at t1, cumulative 1 (t1, t1) -> not exported
  - first scrape of a prometheus metric has start_time == time
  - stored as anchor by googlecloudexporter
- scrape at t2, cumulative 2 (t1, t2) -> exported as cumulative 1 (2 - 1) with (t1, t2)
- scrape at t3, cumulative 3 (t1, t3) -> exported as cumulative 2 (3 - 1) with (t1, t3)
- otel restarts at t4
- scrape at t5, cumulative 3 (t5, t5) -> not exported
  - first scrape of a prometheus metric has start_time == time
  - stored as anchor by googlecloudexporter
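For illustration, a minimal Go sketch of the anchor behaviour described above (a simplification of the idea, not the googlecloudexporter code):

```go
package main

import "fmt"

type point struct {
	start, end string
	value      float64
}

// normalize keeps "reset" points (start == end, or a value drop) as anchors
// instead of exporting them, and subtracts the anchor value from later points.
func normalize(points []point) []point {
	var out []point
	var anchor, prev *point
	for i := range points {
		p := points[i]
		if prev == nil || p.start == p.end || p.value < prev.value {
			anchor = &points[i] // new anchor, not exported
		} else {
			out = append(out, point{anchor.end, p.end, p.value - anchor.value})
		}
		prev = &points[i]
	}
	return out
}

func main() {
	scrapes := []point{
		{"t1", "t1", 1}, // anchor, not exported
		{"t1", "t2", 2}, // exported as 1 (t1, t2)
		{"t1", "t3", 3}, // exported as 2 (t1, t3)
		{"t5", "t5", 3}, // first scrape after restart: new anchor, not exported
	}
	for _, p := range normalize(scrapes) {
		fmt.Printf("exported cumulative %v (%s, %s)\n", p.value, p.start, p.end)
	}
}
```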
You're correct that using drop would potentially result in undercounting if there were logs uploaded while otel was not running. I think that's the lesser of two evils, since it's only a problem if otel is not running for a long time, whereas overcounting is a problem every single time otel restarts.
I agree with this! Undercounting is the lesser of two evils. The current OA released behaviour also undercounts. Thank you for pointing this out!
Solution
After all these details it's clear to me that using keep was double-counting logs. Thank you for pointing this out and providing examples @quentinmit!
I think the best solution is to set auto in the cumulativetodeltaprocessor which does the same thing as cumulative_normalization. Note that since we are aligning all points to have endTime - startTime = 1 min, the export normalization will never be applied.
Discussed offline with @quentinmit . The additional heuristics that "auto" uses only happen for the first point. "drop" is the correct setting for initial_value in cumulativetodelta.
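As a rough illustration of what initial_value = drop means for the first point seen after a restart (a sketch of the idea, not the cumulativetodelta processor's code):

```go
package main

import "fmt"

// toDelta turns a cumulative counter into deltas, dropping the first observed
// point (and any reset) instead of emitting its full value as a delta.
func toDelta(values []float64) []float64 {
	var deltas []float64
	havePrev := false
	var prev float64
	for _, v := range values {
		if !havePrev || v < prev {
			// first point of the series, or a counter reset: drop it
			prev, havePrev = v, true
			continue
		}
		deltas = append(deltas, v-prev)
		prev = v
	}
	return deltas
}

func main() {
	// fluent-bit counter values scraped after an otel restart: the pre-existing
	// count of 3 is dropped rather than re-reported, so nothing is double-counted.
	fmt.Println(toDelta([]float64{3, 4, 6})) // [1 2]
}
```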
confgenerator/agentmetrics.go (Outdated)

"fluentbit_log_entry_count", "fluentbit_log_entry_retry_count", "fluentbit_request_count",
),
otel.TransformationMetrics(
[]otel.TransformQuery{
Nit: You don't need to create a slice just to pass variadic arguments here.
(Also, yikes, the implementation of otel.TransformationMetrics currently performs all metric statements, then all datapoint statements, then all scope statements, instead of performing them in the order passed. I don't think that's relevant for this processor, but it's a big footgun. And the Statement field is a string instead of an ottl.Statement... Add it to the pile of tech debt, I guess.)
Done! Addressed the nit in d2cbb8f.
Description
Collect the following "Otel Logging" metrics and transform them to use as Logging Self Metrics. This PR refactors AgentSelfMetrics to accommodate recent changes in the Ops Agent architecture.

This is a summary of the new Logging-specific Self Metrics transformations (see the sketch below):
- agent/log_entry_count is a sum of the following transformations:
  - otelcol_exporter_sent_log_records -> agent/log_entry_count with response_code = 200
  - otelcol_exporter_send_failed_log_records -> agent/log_entry_count with response_code = 400
  - fluentbit_stackdriver_proc_records_total -> agent/log_entry_count with status as response_code
- agent/log_entry_retry_count is a sum of the following transformations:
  - otelcol_exporter_send_failed_log_records -> agent/log_entry_retry_count with response_code = 400
  - fluentbit_stackdriver_retried_records_total -> agent/log_entry_retry_count with status as response_code
- agent/request_count is a sum of the following transformations:
  - grpc.client.attempt.duration with grpc.target ~= logging.googleapis -> agent/request_count with response_code mapped from grpc.status
  - fluentbit_stackdriver_requests_total -> agent/request_count with status as response_code

This is a summary of the changes:
- Refactors AgentSelfMetrics with separate pipelines for fluent_bit, otel, logging_metrics and ops_agent.
- Self Logs pipeline when fluent-bit is running idle and otel logging is enabled.
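Purely as an illustration of the label mapping summarized above (a minimal Go sketch; the mapping type is hypothetical and this is not the confgenerator implementation):

```go
package main

import "fmt"

// mapping describes how one source metric feeds a self metric; an empty
// responseCode means the source metric's status label is reused as response_code.
type mapping struct {
	target       string
	responseCode string
}

// logEntryCount mirrors the agent/log_entry_count summary above.
var logEntryCount = map[string]mapping{
	"otelcol_exporter_sent_log_records":        {"agent/log_entry_count", "200"},
	"otelcol_exporter_send_failed_log_records": {"agent/log_entry_count", "400"},
	"fluentbit_stackdriver_proc_records_total": {"agent/log_entry_count", ""},
}

func main() {
	for src, m := range logEntryCount {
		code := m.responseCode
		if code == "" {
			code = "<status label>"
		}
		fmt.Printf("%s -> %s (response_code = %s)\n", src, m.target, code)
	}
}
```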
Related issue
b/415094454
How has this been tested?
Checklist: