[confgenerator] Add Otel Logging metrics to AgentSelfMetrics.
#1957
Conversation
confgenerator/agentmetrics.go (Outdated)

otel.AggregateLabels("sum", "response_code")),
),
// Aggregating delta metrics from fluent-bit and otel logging will.
// Keep the initial value so the resulting cumulative metric at the end of the pipeline.
I think we need to use drop or we'll double-count fluent bit values if otel restarts and fluent-bit does not, no?
we'll double-count fluent bit values if otel restarts and fluent-bit does not
This won't happen because the mechanism that keeps the "historic metric count" for the summed metrics is the "deltatocumulative" processor, which will reset if otel restarts.
Note: drop could almost work, but we would lose all the "logs sent on startup" count from the first metric datapoint.
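To make the role of deltatocumulative concrete, here is a minimal Go sketch of the in-memory accumulation described above; the types, timestamps and values are illustrative only, not the collector's implementation:

```go
package main

import "fmt"

// cumulative holds the in-memory state kept per series; it is lost
// (and the start time reset) whenever otel restarts.
type cumulative struct {
	start string
	sum   float64
}

func (c *cumulative) add(delta float64, now string) (float64, string, string) {
	if c.start == "" {
		c.start = now // first point seen after (re)start defines the start time
	}
	c.sum += delta
	return c.sum, c.start, now
}

func main() {
	c := &cumulative{}
	scrapes := []struct {
		delta float64
		at    string
	}{{1, "t1"}, {1, "t2"}, {1, "t3"}}
	for _, s := range scrapes {
		sum, start, end := c.add(s.delta, s.at)
		fmt.Printf("cumulative %v (%s, %s)\n", sum, start, end)
	}
	// If otel restarts here, c is recreated: the running total and start time
	// begin again from scratch, which is the "reset" referred to above.
}
```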
I don't understand why this wouldn't double-count. Consider the following scrapes:
- scrape 1, delta 1 (because keep), cumulative 1 (t0-t1)
- scrape 2, delta 1 (2-1 = 1), cumulative 2 (t0-t2)
- scrape 3, delta 1 (3-1 = 1), cumulative 3 (t0-t3)
- otel restarts
- scrape 3, delta 3 (because keep), cumulative 3 (t4-t5)
The scrape after the restart should be reporting a cumulative 0 because we already reported those increments, not a cumulative 3 with a fresh start time (which is considered as additional to the previous cumulative 3)
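A small Go sketch of the arithmetic in this example, under the assumption (illustrative, not any actual backend code) that a cumulative series with a fresh start time is counted as a new counter on top of the previous one:

```go
package main

import "fmt"

// point is an exported cumulative datapoint from the example above.
type point struct {
	start, end string
	value      float64
}

func main() {
	exported := []point{
		{"t0", "t1", 1}, // scrape 1
		{"t0", "t2", 2}, // scrape 2
		{"t0", "t3", 3}, // scrape 3
		{"t4", "t5", 3}, // first scrape after the otel restart, with initial_value = keep
	}

	// Each distinct start time is treated as a fresh counter, so its last
	// value is added to the total on top of the earlier run.
	lastPerStart := map[string]float64{}
	for _, p := range exported {
		lastPerStart[p.start] = p.value
	}
	total := 0.0
	for _, v := range lastPerStart {
		total += v
	}
	fmt.Println("total counted:", total) // 6, although only 3 log entries were ever sent
}
```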
Consider, for contrast, the current released Ops Agent behaviour in the same example. I added some updates in the time ranges assuming t(n+1) - t(n) = 1 minute regularity to simplify the example :
- scrape at t1, cumulative 1 (t1, t1)
  - first scrape of a prometheus metric has start_time == time
- scrape at t2, cumulative 2 (t1, t2)
- scrape at t3, cumulative 3 (t1, t3)
- otel restarts at t4
- scrape at t5, cumulative 3 (t5, t5)
  - first scrape of a prometheus metric has start_time == time
  - the scrape gets 3 (because fluent-bit didn't restart)
  - reported cumulative 3 with fresh timestamp t5
The current implementation in this PR with keep will result in the same behaviour as shown in this example.
The scrape after the restart should be reporting a cumulative 0 because we already reported those increments, not a cumulative 3 with a fresh start time (which is considered as additional to the previous cumulative 3)
Following the semantics of a "cumulative metric" the range (t5, t6) shouldn't have any data from scrapes 1, 2 or 3, because there were no observations of an experiment ("logs sent") in that time range. This is "semantically" correct, but enforcing this fails when considering other edge cases. For example, if otel takes 3 minutes to start, the first scrape will get "log counts" from an unknown time range since fluent-bit started.
The main idea of why using keep doesn't double count is that the initial datapoint value after a restart only represents data of an "unknown" time range (e.g. (?, t5)) that happened before the scrape.
The provided example can be rewritten like this:
- scrape at t1, delta 1 (because keep), cumulative 1 (t0, t1)
  - after “aligning” all delta points to a 1m window we have a start_time = t0
- scrape at t2, delta 1 (2-1 = 1), cumulative 2 (t0, t2)
- scrape at t3, delta 1 (3-1 = 1), cumulative 3 (t0, t3)
- otel restarts at t4
  - deltatocumulative loses in-memory data about the previous datapoints
- scrape at t5, delta 3 (because keep), cumulative 3 (t4, t5)
  - deltatocumulative observes a new metric
  - after “aligning” all delta points to a 1m window we have a start_time = t4
If you're accurately describing the currently released Ops Agent behavior, then that behavior is also buggy. I would have expected us to be correctly using process_start_time as the start time for our fluent-bit metrics today.
You're correct that using drop would potentially result in undercounting if there were logs uploaded while otel was not running. I think that's the lesser of two evils, since it's only a problem if otel is not running for a long time, whereas overcounting is a problem every single time otel restarts.
If you're accurately describing the currently released Ops Agent behavior, then that behavior is also buggy. I would have expected us to be correctly using process_start_time as the start time for our fluent-bit metrics today.
Now I realized I was missing a piece of the (puzzle) pipeline. The OA behaviour example I was describing was from observing the debugexporter logs before exporting the metrics.
The googlecloudexporter has a feature, cumulative_normalization, enabled by default. This feature stores "reset" points (startTime == 0 or startTime == endTime or currValue < prevValue) and uses them as an "anchor" to subtract the subsequent point values from. This is the same behaviour as auto in the cumulativetodeltaprocessor (which is very similar to drop with additional heuristics).
Note: fluent-bit does have process_start_time, but otel doesn't. Neither is configured to be used in the prometheus receiver.
Current OA released behaviour:
- scrape at t1, cumulative 1 (t1, t1) -> not exported
  - first scrape of a prometheus metric has start_time == time
  - stored as anchor by googlecloudexporter
- scrape at t2, cumulative 2 (t1, t2) -> exported as cumulative 1 (2 - 1) with (t1, t2)
- scrape at t3, cumulative 3 (t1, t3) -> exported as cumulative 2 (3 - 1) with (t1, t3)
- otel restarts at t4
- scrape at t5, cumulative 3 (t5, t5) -> not exported
  - first scrape of a prometheus metric has start_time == time
  - stored as anchor by googlecloudexporter
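For illustration, a minimal Go sketch of the anchor behaviour described above (a simplification of the idea, not the googlecloudexporter code):

```go
package main

import "fmt"

type point struct {
	start, end string
	value      float64
}

// normalize keeps "reset" points (start == end, or a value drop) as anchors
// instead of exporting them, and subtracts the anchor value from later points.
func normalize(points []point) []point {
	var out []point
	var anchor, prev *point
	for i := range points {
		p := points[i]
		if prev == nil || p.start == p.end || p.value < prev.value {
			anchor = &points[i] // new anchor, not exported
		} else {
			out = append(out, point{anchor.end, p.end, p.value - anchor.value})
		}
		prev = &points[i]
	}
	return out
}

func main() {
	scrapes := []point{
		{"t1", "t1", 1}, // anchor, not exported
		{"t1", "t2", 2}, // exported as 1 (t1, t2)
		{"t1", "t3", 3}, // exported as 2 (t1, t3)
		{"t5", "t5", 3}, // first scrape after restart: new anchor, not exported
	}
	for _, p := range normalize(scrapes) {
		fmt.Printf("exported cumulative %v (%s, %s)\n", p.value, p.start, p.end)
	}
}
```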
You're correct that using drop would potentially result in undercounting if there were logs uploaded while otel was not running. I think that's the lesser of two evils, since it's only a problem if otel is not running for a long time, whereas overcounting is a problem every single time otel restarts.
I agree with this! Undercounting is the lesser of two evils. The current OA released behaviour also undercounts. Thank you for pointing this out!
Solution
After all these details it's clear to me that using keep was double-counting logs. Thank you for pointing this out and providing examples @quentinmit!
I think the best solution is to set auto in the cumulativetodeltaprocessor which does the same thing as cumulative_normalization. Note that since we are aligning all points to have endTime - startTime = 1 min, the export normalization will never be applied.
Discussed offline with @quentinmit . The additional heuristics that "auto" uses only happen for the first point. "drop" is the correct setting for initial_value in cumulativetodelta.
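As a rough illustration of what initial_value = drop means for the first point seen after a restart (a sketch of the idea, not the cumulativetodelta processor's code):

```go
package main

import "fmt"

// toDelta turns a cumulative counter into deltas, dropping the first observed
// point (and any reset) instead of emitting its full value as a delta.
func toDelta(values []float64) []float64 {
	var deltas []float64
	havePrev := false
	var prev float64
	for _, v := range values {
		if !havePrev || v < prev {
			// first point of the series, or a counter reset: drop it
			prev, havePrev = v, true
			continue
		}
		deltas = append(deltas, v-prev)
		prev = v
	}
	return deltas
}

func main() {
	// fluent-bit counter values scraped after an otel restart: the pre-existing
	// count of 3 is dropped rather than re-reported, so nothing is double-counted.
	fmt.Println(toDelta([]float64{3, 4, 6})) // [1 2]
}
```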
confgenerator/agentmetrics.go (Outdated)

"fluentbit_log_entry_count", "fluentbit_log_entry_retry_count", "fluentbit_request_count",
),
otel.TransformationMetrics(
[]otel.TransformQuery{
Nit: You don't need to create a slice just to pass variadic arguments here.
(Also, yikes, the implementation of otel.TransformationMetrics currently performs all metric statements, then all datapoint statements, then all scope statements, instead of performing them in the order passed. I don't think that's relevant for this processor, but it's a big footgun. And the Statement field is a string instead of an ottl.Statement... Add it to the pile of tech debt, I guess.)
Done! Addressed the nit in d2cbb8f.
Description
Collect the following "Otel Logging" metrics and transform them to use as Logging Self Metrics. This PR refactors AgentSelfMetrics to accommodate recent changes in the Ops Agent architecture.

This is a summary of the new Logging-specific Self Metrics transformations (see the sketch below):
- agent/log_entry_count is a sum of the following transformations:
  - otelcol_exporter_sent_log_records -> agent/log_entry_count with response_code = 200
  - otelcol_exporter_send_failed_log_records -> agent/log_entry_count with response_code = 400
  - fluentbit_stackdriver_proc_records_total -> agent/log_entry_count with status as response_code
- agent/log_entry_retry_count is a sum of the following transformations:
  - otelcol_exporter_send_failed_log_records -> agent/log_entry_retry_count with response_code = 400
  - fluentbit_stackdriver_retried_records_total -> agent/log_entry_retry_count with status as response_code
- agent/request_count is a sum of the following transformations:
  - grpc.client.attempt.duration with grpc.target ~= logging.googleapis -> agent/request_count with response_code mapped from grpc.status
  - fluentbit_stackdriver_requests_total -> agent/request_count with status as response_code

This is a summary of the changes:
- Refactors AgentSelfMetrics with separate pipelines for fluent_bit, otel, logging_metrics and ops_agent.
- Self Logs pipeline when fluent-bit is running idle and otel logging is enabled.
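Purely as an illustration of the label mapping summarized above (a minimal Go sketch; the mapping type is hypothetical and this is not the confgenerator implementation):

```go
package main

import "fmt"

// mapping describes how one source metric feeds a self metric; an empty
// responseCode means the source metric's status label is reused as response_code.
type mapping struct {
	target       string
	responseCode string
}

// logEntryCount mirrors the agent/log_entry_count summary above.
var logEntryCount = map[string]mapping{
	"otelcol_exporter_sent_log_records":        {"agent/log_entry_count", "200"},
	"otelcol_exporter_send_failed_log_records": {"agent/log_entry_count", "400"},
	"fluentbit_stackdriver_proc_records_total": {"agent/log_entry_count", ""},
}

func main() {
	for src, m := range logEntryCount {
		code := m.responseCode
		if code == "" {
			code = "<status label>"
		}
		fmt.Printf("%s -> %s (response_code = %s)\n", src, m.target, code)
	}
}
```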
Related issue
b/415094454
How has this been tested?
Checklist: