
Conversation

@franciscovalentecastro
Contributor

@franciscovalentecastro franciscovalentecastro commented May 27, 2025

Description

Collect the following "Otel Logging" metrics and transform them for use as Logging Self Metrics. This PR refactors AgentSelfMetrics to accommodate recent changes in the Ops Agent architecture.

This is a summary of the new logging-specific self metrics transformations (an illustrative mapping table follows the list):

  • agent/log_entry_count is a sum of the following transformations:

    • otelcol_exporter_sent_log_records -> agent/log_entry_count with response_code = 200
    • otelcol_exporter_send_failed_log_records -> agent/log_entry_count with response_code = 400
    • fluentbit_stackdriver_proc_records_total -> agent/log_entry_count with status as response_code
  • agent/log_entry_retry_count is a sum of the following transformations:

    • otelcol_exporter_send_failed_log_records -> agent/log_entry_retry_count with response_code = 400
    • fluentbit_stackdriver_retried_records_total -> agent/log_entry_retry_count with status as response_code
  • agent/request_count is a sum of the following transformations:

    • grpc.client.attempt.duration with grpc.target ~= logging.googleapis -> agent/request_count with response_code mapped from grpc.status
    • fluentbit_stackdriver_requests_total -> agent/request_count with status as response_code
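
For illustration only, the mapping above can be written out as a plain-Go table like the sketch below. The metric and label names come straight from the list; the struct, helper, and output format are hypothetical and are not how the PR implements it (the PR expresses these transformations through the confgenerator's otel pipeline helpers).

```go
// Illustrative sketch of the logging self-metrics renames described above.
// Only the metric/label names are from the PR; the rest is hypothetical.
package main

import "fmt"

// rename maps one source metric to an Ops Agent self metric and says how its
// response_code label is derived.
type rename struct {
	source    string // metric emitted by otel or fluent-bit
	target    string // self metric reported by the Ops Agent
	fixedCode string // fixed response_code; empty means it is copied from the source's status label
}

var renames = []rename{
	{"otelcol_exporter_sent_log_records", "agent/log_entry_count", "200"},
	{"otelcol_exporter_send_failed_log_records", "agent/log_entry_count", "400"},
	{"fluentbit_stackdriver_proc_records_total", "agent/log_entry_count", ""},
	{"otelcol_exporter_send_failed_log_records", "agent/log_entry_retry_count", "400"},
	{"fluentbit_stackdriver_retried_records_total", "agent/log_entry_retry_count", ""},
	{"grpc.client.attempt.duration", "agent/request_count", ""}, // grpc.status mapped to response_code
	{"fluentbit_stackdriver_requests_total", "agent/request_count", ""},
}

func main() {
	for _, r := range renames {
		code := r.fixedCode
		if code == "" {
			code = "<from source status label>"
		}
		fmt.Printf("%-45s -> %-28s response_code=%s\n", r.source, r.target, code)
	}
}
```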

This is a summary of the changes:

  • Refactor AgentSelfMetrics with separate pipelines for fluent_bit, otel, logging_metrics and ops_agent.
  • Keep the Self Logs pipeline when fluent-bit is running idle and otel logging is enabled.

Related issue

b/415094454

How has this been tested?

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@franciscovalentecastro franciscovalentecastro changed the title [Draft] Testing updates to get Otel Logging AgentSelfMetrics. [confgenerator] Add Otel Logging metrics to AgentSelfMetrics. May 29, 2025
@franciscovalentecastro franciscovalentecastro marked this pull request as ready for review June 2, 2025 16:00
@franciscovalentecastro
Contributor Author

franciscovalentecastro commented Jun 3, 2025

After exploring the implementation of the monitoring metrics further, it seems better to implement the metric googlecloudlogging/log_count, which would be similar to googlecloudmonitoring/point_count, instead of setting response_code manually in agent/log_entry_count. See the already implemented metric: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/exporter/collector/metrics.go#L186. CC @quentinmit .

@franciscovalentecastro franciscovalentecastro force-pushed the fcovalente-otel-logging-self-metrics branch from e397156 to 38cf69b Compare July 3, 2025 22:01
@franciscovalentecastro franciscovalentecastro force-pushed the fcovalente-otel-logging-self-metrics branch from 38cf69b to 359e194 Compare August 19, 2025 18:50
@franciscovalentecastro franciscovalentecastro force-pushed the fcovalente-otel-logging-self-metrics branch from 3b4e3de to 8aad915 Compare August 27, 2025 21:09
otel.AggregateLabels("sum", "response_code")),
),
// Aggregating delta metrics from fluent-bit and otel logging will
// keep the initial value, so the resulting cumulative metric at the end of the pipeline includes it.
Member

I think we need to use drop or we'll double-count fluent bit values if otel restarts and fluent-bit does not, no?

Contributor Author

@franciscovalentecastro franciscovalentecastro Sep 9, 2025

we'll double-count fluent bit values if otel restarts and fluent-bit does not

This won't happen because the mechanism that keeps the "historic metric count" for the summed metrics is the "deltatocumulative" processor, which will reset if otel restarts.

Note: drop could almost work, but we would lose all the "logs sent on startup" count from the first metric datapoint.

Member

I don't understand why this wouldn't double-count. Consider the following scrapes:

  • scrape 1, delta 1 (because keep), cumulative 1 (t0-t1)
  • scrape 2, delta 1 (2-1 = 1), cumulative 2 (t0-t2)
  • scrape 3, delta 1 (3-1 = 1), cumulative 3 (t0-t3)
  • otel restarts
  • scrape 3, delta 3 (because keep), cumulative 3 (t4-t5)

The scrape after the restart should be reporting a cumulative 0 because we already reported those increments, not a cumulative 3 with a fresh start time (which is considered as additional to the previous cumulative 3)

Contributor Author

Consider, for contrast, the currently released Ops Agent behaviour in the same example. I added some updates in the time ranges, assuming t(n+1) - t(n) = 1 minute regularity to simplify the example:

  • scrape at t1, cumulative 1 (t1, t1)
    • first scrape of a prometheus metric has start_time == time
  • scrape at t2, cumulative 2 (t1, t2)
  • scrape at t3, cumulative 3 (t1, t3)
  • otel restarts at t4
  • scrape at t5, cumulative 3 (t5, t5)
    • first scrape of a prometheus metric has start_time == time
    • the scrape gets 3 (because fluent-bit didn't restart)
    • reported cumulative 3 with fresh timestamp t5

The current implementation in this PR with keep will result in the same behaviour as shown in this example.

The scrape after the restart should be reporting a cumulative 0 because we already reported those increments, not a cumulative 3 with a fresh start time (which is considered as additional to the previous cumulative 3)

Following the semantics of a "cumulative metric", the range (t5, t6) shouldn't have any data from scrapes 1, 2 or 3, because there were no observations of an experiment ("logs sent") in that time range. This is "semantically" correct, but enforcing it fails when considering other edge cases. For example, if otel takes 3 minutes to start, the first scrape will get "log counts" from an unknown time range since fluent-bit started.

The main idea of why using keep doesn't double count is that the initial datapoint value after a restart only represents data of an "unknown" time range (e.g. (?, t5)) that happened before the scrape.

The provided example can be rewritten like this:

  • scrape at t1, delta 1 (because keep), cumulative 1 (t0, t1),
    • after “aligning” all delta points to a 1m window we have a start_time = t0.
  • scrape at t2, delta 1 (2-1 = 1), cumulative 2 (t0, t2)
  • scrape at t3, delta 1 (3-1 = 1), cumulative 3 (t0, t3)
  • otel restarts at t4,
    • deltatocumulative loses in-memory data about the previous datapoints
  • scrape at t5, delta 3 (because keep), cumulative 3 (t4, t5),
    • deltatocumulative observes a new metric.
    • after “aligning” all delta points to 1m window we have a start_time = t4

Member

If you're accurately describing the currently released Ops Agent behavior, then that behavior is also buggy. I would have expected us to be correctly using process_start_time as the start time for our fluent-bit metrics today.

You're correct that using drop would potentially result in undercounting if there were logs uploaded while otel was not running. I think that's the lesser of two evils, since it's only a problem if otel is not running for a long time, whereas overcounting is a problem every single time otel restarts.

Contributor Author

@franciscovalentecastro franciscovalentecastro Sep 12, 2025

If you're accurately describing the currently released Ops Agent behavior, then that behavior is also buggy. I would have expected us to be correctly using process_start_time as the start time for our fluent-bit metrics today.

Now I realize I was missing a piece of the (puzzle) pipeline. The OA behaviour example I was describing came from observing the debugexporter logs before the metrics were exported.

The googlecloudexporter has a feature, cumulative_normalization [1], enabled by default. This feature stores "reset" points (startTime == 0 or startTime == endTime or currValue < prevValue) and uses them as an "anchor" that is subtracted from the subsequent point values. This is the same behaviour as auto [2] in the cumulativetodeltaprocessor (which is very similar to drop with additional heuristics).

Note: fluent-bit does have process_start_time, but otel doesn't. Neither is configured to be used in the prometheus receiver.

Current OA released behaviour (a small normalization sketch follows this list):

  • scrape at t1, cumulative 1 (t1, t1) -> not exported
    • first scrape of a prometheus metric has start_time == time
    • stored as anchor by googlecloudexporter
  • scrape at t2, cumulative 2 (t1, t2) -> exported as cumulative 1 (2 - 1) with (t1, t2)
  • scrape at t3, cumulative 3 (t1, t3) -> exported as cumulative 2 (3 - 1) with (t1, t3)
  • otel restarts at t4
  • scrape at t5, cumulative 3 (t5, t5) -> not exported
    • first scrape of a prometheus metric has start_time == time
    • stored as anchor by googlecloudexporter
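
To make the anchor behaviour concrete, here is a minimal sketch (my own simplified model, not the exporter's real code) of the normalization described above, reproducing the exported values in this example:

```go
// Minimal sketch (simplified model, not the googlecloudexporter's real code) of
// cumulative_normalization: a "reset" point becomes an anchor and is not
// exported, and later points are exported with the anchor's value subtracted.
package main

import "fmt"

type point struct {
	start, end string  // timestamps kept as labels for readability
	value      float64 // cumulative counter value as scraped
}

// normalize returns the points that would actually be exported.
func normalize(points []point) []point {
	var exported []point
	var anchor, prev *point
	for i := range points {
		p := points[i]
		// Simplified reset detection: first point, start == end, or value decreased.
		reset := prev == nil || p.start == p.end || p.value < prev.value
		prev = &points[i]
		if reset {
			anchor = &points[i] // stored as anchor, not exported
			continue
		}
		exported = append(exported, point{anchor.start, p.end, p.value - anchor.value})
	}
	return exported
}

func main() {
	scrapes := []point{
		{"t1", "t1", 1}, // first scrape: start_time == time, becomes the anchor
		{"t1", "t2", 2},
		{"t1", "t3", 3},
		// otel restarts at t4
		{"t5", "t5", 3}, // fresh start_time == time, becomes a new anchor
	}
	for _, p := range normalize(scrapes) {
		fmt.Printf("exported cumulative %.0f (%s, %s)\n", p.value, p.start, p.end)
	}
	// Output matches the example: cumulative 1 (t1, t2) and cumulative 2 (t1, t3);
	// the post-restart scrape at t5 is only stored as a new anchor.
}
```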

You're correct that using drop would potentially result in undercounting if there were logs uploaded while otel was not running. I think that's the lesser of two evils, since it's only a problem if otel is not running for a long time, whereas overcounting is a problem every single time otel restarts.

I agree with this! Undercounting is the lesser of two evils. The current OA released behaviour also undercounts. Thank you for pointing this out!

Solution

After all these details it's clear to me that using keep was double-counting logs. Thank you for pointing this out and providing examples, @quentinmit!

I think the best solution is to set auto in the cumulativetodeltaprocessor, which does the same thing as cumulative_normalization. Note that since we are aligning all points to have endTime - startTime = 1 min, the exporter normalization will never be applied.

Footnotes

  1. https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/googlecloudexporter#configuration-reference

  2. https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/cumulativetodeltaprocessor/internal/tracking/tracker_test.go#L110-L125

Contributor Author

Discussed offline with @quentinmit. The additional heuristics that "auto" uses only apply to the first point. "drop" is the correct setting for initial_value in cumulativetodelta.
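
To summarize the thread, here is a small self-contained simulation (my own simplified model of the cumulativetodelta -> deltatocumulative chain, not the processors' actual code) of why keep overcounts across an otel restart while drop only loses the counts from before each run's first scrape:

```go
// Self-contained simulation (simplified model, not the real processors) of how
// the initial_value setting behaves across an otel restart while fluent-bit
// keeps running and its counters keep growing.
package main

import "fmt"

// cumulativeToDelta converts one otel run's scraped cumulative values to deltas.
// mode is "keep" (emit the first value as a delta) or "drop" (skip it).
func cumulativeToDelta(scrapes []float64, mode string) []float64 {
	var deltas []float64
	for i, v := range scrapes {
		if i == 0 {
			if mode == "keep" {
				deltas = append(deltas, v)
			}
			continue
		}
		deltas = append(deltas, v-scrapes[i-1])
	}
	return deltas
}

// reportedTotal sums the deltas of each otel run; each run's cumulative gets a
// fresh start time, so the backend effectively adds the runs together.
func reportedTotal(runs [][]float64, mode string) float64 {
	total := 0.0
	for _, run := range runs {
		for _, d := range cumulativeToDelta(run, mode) {
			total += d
		}
	}
	return total
}

func main() {
	// fluent-bit's counter as seen by otel: 1, 2, 3 before the restart,
	// then 3, 4 after otel restarts (fluent-bit itself did not restart).
	runs := [][]float64{{1, 2, 3}, {3, 4}}
	fmt.Printf("actual logs sent: 4\n")
	fmt.Printf("reported with keep: %.0f (overcounts)\n", reportedTotal(runs, "keep"))
	fmt.Printf("reported with drop: %.0f (slightly undercounts)\n", reportedTotal(runs, "drop"))
}
```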

"fluentbit_log_entry_count", "fluentbit_log_entry_retry_count", "fluentbit_request_count",
),
otel.TransformationMetrics(
[]otel.TransformQuery{
Member

Nit: You don't need to create a slice just to pass variadic arguments here.

(Also, yikes, the implementation of otel.TransformationMetrics currently performs all metric statements, then all datapoint statements, then all scope statements, instead of performing them in the order passed. I don't think that's relevant for this processor, but it's a big footgun. And the Statement field is a string instead of an ottl.Statement... Add it to the pile of tech debt, I guess.)

Contributor Author

Done! Addressed the nit in d2cbb8f.

@franciscovalentecastro franciscovalentecastro force-pushed the fcovalente-otel-logging-self-metrics branch from 5ad6cc7 to d2cbb8f Compare October 15, 2025 16:26
@franciscovalentecastro franciscovalentecastro added the kokoro:force-run Forces kokoro to run integration tests on a CL label Oct 20, 2025
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Oct 20, 2025
@franciscovalentecastro franciscovalentecastro added the kokoro:force-run Forces kokoro to run integration tests on a CL label Oct 20, 2025
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Oct 20, 2025
@franciscovalentecastro franciscovalentecastro merged commit dae879b into master Oct 20, 2025
62 checks passed
@franciscovalentecastro franciscovalentecastro deleted the fcovalente-otel-logging-self-metrics branch October 20, 2025 19:13