otel spans for llm requests and overall taskrun #43

dexhorthy · 2025-03-26T01:49:24Z

still needs

spans for tool calls
working on a little refactoring, but learned my lesson on that one last time

this should work out of the box with the default otel stack

https://www.loom.com/share/b6c27b635e544f9dbdf5ed6c863bf8ef

Important

Add OpenTelemetry spans for TaskRun lifecycle and LLM requests, update CRDs, and refactor status handling.

OpenTelemetry Integration:
- Add tracing for TaskRun lifecycle and LLM requests in taskrun_controller.go.
- Introduce endTaskRunSpan() to close spans on errors or completion.
- Use Tracer in TaskRunReconciler for span creation.
CRD Updates:
- Add SpanContext to TaskRunStatus and TaskRunToolCallStatus in taskrun_types.go and taskruntoolcall_types.go.
- Update CRD YAMLs to include SpanContext fields.
Refactoring:
- Replace string status with TaskRunStatusStatus type in taskrun_types.go.
- Remove Initializing status from TaskRunStatus enum.
Miscellaneous:
- Update main.go to initialize OpenTelemetry tracer and meter.
- Rename otel import to kubechainotel in main.go.
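
For readers who have not opened the diff, here is a minimal sketch of how a span context might be persisted on the status so later reconciliations can pick the trace back up. The `spanContext,omitempty` tag comes from the diff in this PR; the `SpanContext` field tags (`traceID`, `spanID`), the `encodeStatus` helper, and the sample IDs are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SpanContext holds hex-encoded trace and span IDs.
// The JSON tags on these two fields are assumed, not taken from the PR.
type SpanContext struct {
	TraceID string `json:"traceID,omitempty"`
	SpanID  string `json:"spanID,omitempty"`
}

// TaskRunStatus is trimmed to the one field relevant here.
type TaskRunStatus struct {
	SpanContext *SpanContext `json:"spanContext,omitempty"`
}

// encodeStatus is a hypothetical helper that serializes the status to JSON.
func encodeStatus(s TaskRunStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// A nil pointer with omitempty is dropped from the output entirely.
	fmt.Println(encodeStatus(TaskRunStatus{}))

	// Once a root span exists, its IDs are stored as hex strings.
	fmt.Println(encodeStatus(TaskRunStatus{
		SpanContext: &SpanContext{
			TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
			SpanID:  "00f067aa0ba902b7",
		},
	}))
}
```

Using a pointer with `omitempty` keeps the field out of the serialized status until a span has actually been started.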

This description was created by Ellipsis for 7ea1dc5. It will automatically update as commits are pushed.

ellipsis-dev

❌ Changes requested. Reviewed everything up to 7ea1dc5 in 2 minutes and 32 seconds

More details

Looked at 500 lines of code in 8 files
Skipped 0 files when reviewing.
Skipped posting 8 drafted comments based on config settings.

1. kubechain/internal/controller/taskrun/taskrun_controller.go:649

Draft comment:
When parsing the stored span IDs (lines 652-657), if conversion fails, consider logging the error so that tracing issues are more easily debugged. This would help catch issues early if the stored hex values are invalid.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
While error logging could help with debugging, the code is handling the errors gracefully by continuing execution. The errors would only occur if the stored IDs are invalid, which is an edge case. The code already has a fallback to use the current context if parsing fails. Adding logs may add noise without providing much value.
The comment identifies a potential observability gap that could make debugging harder. Silent failures can be problematic in distributed tracing systems.
However, this is a non-critical path - failed span parsing just means we lose some tracing context but doesn't impact functionality. The code handles the failure case appropriately.
The comment should be deleted as it suggests adding non-essential logging for an edge case that is already handled gracefully.

2. kubechain/internal/controller/taskrun/taskrun_controller.go:556

Draft comment:
The comment on ending the span immediately (line 553-556) indicates this is a temporary workaround. Documenting a plan for maintaining continuous spans across reconciliations would clarify intent for future developers.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
The comment suggests documenting a plan for future developers, which is not a specific code suggestion or a request for a test. It seems to be more about ensuring future clarity, which violates the rule against asking the PR author to update documentation or confirm intentions.

3. kubechain/internal/controller/taskrun/taskrun_controller.go:696

Draft comment:
When recording and setting error status on the child span (lines 696-700), ensure that any additional context or metadata needed for troubleshooting is added. This can help with deeper insights in monitoring systems.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
This comment is asking the PR author to ensure that additional context or metadata is added for troubleshooting, which is a form of asking them to double-check their work. This violates the rule against asking the author to ensure behavior is intended or to double-check things.

4. kubechain/internal/controller/taskrun/taskrun_controller.go:519

Draft comment:
Note that ending the span immediately (line 526) creates separate spans per reconciliation. Ensure this is the intended trace model.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50%
None

5. kubechain/internal/controller/taskrun/taskrun_controller.go:650

Draft comment:
If parsing the parent's SpanContext fails (lines 652-657), consider logging the error to help diagnose trace linkage issues.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50%
The suggestion is about a real code change in the PR. The current code silently fails which could make debugging tracing issues harder. However, these parsing errors would be rare edge cases since the trace/span IDs are generated by the system itself. The benefit of adding logging here is relatively minor.
Am I underestimating the importance of trace debugging? Silent failures in observability code could be more problematic than I'm assuming.
While trace debugging is important, these specific parsing errors would only occur if there was corruption in the stored trace IDs, which is extremely unlikely since they are system-generated. The logging would rarely if ever be useful.
The comment suggests a minor improvement that would rarely be useful. While technically valid, it's not important enough to warrant a code change.

6. kubechain/internal/controller/taskrun/taskrun_controller.go:687

Draft comment:
Good use of a child span to track the LLM request. The error handling portion (lines 688-700) properly records errors in the span.
Reason this comment was not posted:
Confidence changes required: 10% <= threshold 50%
None

7. kubechain/cmd/main.go:198

Draft comment:
The error message on line 198 reads "to initialize metrics certificate watcher". It would be clearer and more consistent with the other error messages if it said "failed to initialize metrics certificate watcher".
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

8. kubechain/internal/controller/taskrun/taskrun_controller.go:103

Draft comment:
Minor typographical error: In the comment at line 103, please change 'dont' to 'don't' for proper grammar.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_fgZ5hIVJVu5UfLBC


ellipsis-dev · 2025-03-26T01:52:02Z

kubechain/api/v1alpha1/taskrun_types.go

 	SpanContext *SpanContext `json:"spanContext,omitempty"`
 }

+type TaskRunStatusStatus string


Consider using the existing TaskStatusType instead of creating a new type with identical values.

TaskStatusType type and constants (task_types.go)
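
One possible way to act on this suggestion is a Go type alias, sketched below. The constant names and values here are placeholders, since the real `TaskStatusType` constants are not shown in this thread, and `describe` is a hypothetical helper. Whether an alias plays well with CRD generation would need checking; using `TaskStatusType` directly is the simpler alternative.

```go
package main

import "fmt"

// TaskStatusType stands in for the existing type in task_types.go.
type TaskStatusType string

// Placeholder constants; the real values are not shown in this PR thread.
const (
	TaskStatusReady TaskStatusType = "Ready"
	TaskStatusError TaskStatusType = "Error"
)

// TaskRunStatusStatus is declared as an alias, so it is the same type
// as TaskStatusType rather than a second type with identical values.
type TaskRunStatusStatus = TaskStatusType

// describe accepts the shared type; values of the alias pass without conversion.
func describe(s TaskStatusType) string {
	return "status=" + string(s)
}

func main() {
	var s TaskRunStatusStatus = TaskStatusReady
	fmt.Println(describe(s))
}
```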

ellipsis-dev · 2025-03-26T01:52:02Z

kubechain/internal/controller/taskrun/taskrun_controller.go

+	var childSpan trace.Span
+
+	// Use controller's tracer if available, otherwise get the global tracer
+	tracer := r.Tracer


The pattern of checking if r.Tracer is nil and then calling otel.GetTracerProvider().Tracer("taskrun") is repeated several times. Consider refactoring this logic into a helper function to reduce duplication and improve maintainability.
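
A rough sketch of the helper this comment is asking for. To keep it runnable without the OpenTelemetry module, `Tracer`, `namedTracer`, and `globalTracer` are stand-ins defined here; in the controller the fallback would be the `otel.GetTracerProvider().Tracer("taskrun")` call quoted above, and the method name `getTracer` is an assumption.

```go
package main

import "fmt"

// Tracer is a minimal stand-in for the OpenTelemetry tracer interface.
type Tracer interface {
	Name() string
}

// namedTracer is a toy implementation used only for this sketch.
type namedTracer struct{ name string }

func (t namedTracer) Name() string { return t.name }

// globalTracer stands in for otel.GetTracerProvider().Tracer(name).
func globalTracer(name string) Tracer {
	return namedTracer{name: "global/" + name}
}

// TaskRunReconciler is trimmed to the one field relevant here.
type TaskRunReconciler struct {
	Tracer Tracer
}

// getTracer returns the controller's tracer if one was injected,
// otherwise it falls back to the global tracer.
func (r *TaskRunReconciler) getTracer() Tracer {
	if r.Tracer != nil {
		return r.Tracer
	}
	return globalTracer("taskrun")
}

func main() {
	injected := &TaskRunReconciler{Tracer: namedTracer{name: "injected"}}
	fallback := &TaskRunReconciler{}
	fmt.Println(injected.getTracer().Name())
	fmt.Println(fallback.getTracer().Name())
}
```

Each call site then shrinks to a single `r.getTracer()` call, and the fallback name lives in one place.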

ellipsis-dev · 2025-03-26T01:52:02Z

kubechain/internal/controller/taskrun/taskrun_controller.go

+	var spanID trace.SpanID
+
+	// Convert hex strings to byte arrays
+	traceIDBytes, err := trace.TraceIDFromHex(taskRun.Status.SpanContext.TraceID)


Consider logging errors when converting TraceID/SpanID (lines 382-390) to aid debugging if conversion fails.
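
A minimal sketch of what logging those conversion failures could look like. It uses only the standard library as a stand-in for `trace.TraceIDFromHex` and its span ID counterpart, and checks the length and non-zero rules from W3C Trace Context (16-byte trace IDs, 8-byte span IDs). `parseHexID` and `parentIDs` are hypothetical names, and unlike the spec this sketch does not reject uppercase hex.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"log"
)

// parseHexID decodes s into exactly n bytes and rejects an all-zero ID.
func parseHexID(s string, n int) ([]byte, error) {
	b, err := hex.DecodeString(s)
	if err != nil {
		return nil, fmt.Errorf("decode %q: %w", s, err)
	}
	if len(b) != n {
		return nil, fmt.Errorf("id %q has %d bytes, want %d", s, len(b), n)
	}
	for _, v := range b {
		if v != 0 {
			return b, nil
		}
	}
	return nil, fmt.Errorf("id %q is all zeros", s)
}

// parentIDs decodes the stored IDs. On failure it logs the reason and
// returns ok=false so the caller can fall back to the current context.
func parentIDs(traceHex, spanHex string) ([]byte, []byte, bool) {
	traceID, err := parseHexID(traceHex, 16)
	if err != nil {
		log.Printf("invalid stored trace ID, continuing without parent: %v", err)
		return nil, nil, false
	}
	spanID, err := parseHexID(spanHex, 8)
	if err != nil {
		log.Printf("invalid stored span ID, continuing without parent: %v", err)
		return nil, nil, false
	}
	return traceID, spanID, true
}

func main() {
	_, _, ok := parentIDs("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
	fmt.Println("valid IDs usable:", ok)

	_, _, ok = parentIDs("not-hex", "00f067aa0ba902b7")
	fmt.Println("corrupt trace ID usable:", ok)
}
```

The fallback behavior stays the same as in the PR; the only change is that a bad stored value leaves a log line behind instead of failing silently.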

balanceiskey

Lookin' good to me, excited to see this put to work

dexhorthy added 5 commits March 25, 2025 17:03

spantypes, etc

51ee6ce

types propagated

a612f56

llm request span working

45323b3

wip on ending spans

b299cd9

spans for llm requests

7ea1dc5

dexhorthy requested review from allisoneer and balanceiskey March 26, 2025 01:49

ellipsis-dev bot reviewed Mar 26, 2025

ok one little refactor

c3422fd

balanceiskey approved these changes Mar 26, 2025

dexhorthy merged commit 89e67b2 into humanlayer:main Mar 26, 2025
5 checks passed
