add-agent-monitoring-alerting #401

saahil-mehta · 2025-10-21T08:46:33Z

Add Monitoring & Alerting for Agent Deployments

Overview

Relates to #142. This PR adds monitoring and alerting infrastructure to the agent-starter-pack, giving users production-grade observability out of the box. The implementation is platform-aware, agent-aware, and configurable through user prompts during project creation and through Terraform variables afterwards.

What's New

User-Facing Features

1. Optional Email Alert Notifications

Interactive prompt during agent-starter-pack create asks for an email address
If provided, alerts are delivered via email + Cloud Console
If skipped, alerts are console-only (no email noise for dev environments)
User sees clear confirmation of their choice with visual feedback
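The prompt behaviour described above could be sketched roughly as follows. This is an illustration only, not the code in create.py: the function names `normalise_alert_email` and `confirmation_message` are hypothetical, and the sketch assumes that a blank answer is the only way to skip the prompt.

```python
from typing import Optional


def normalise_alert_email(raw: str) -> Optional[str]:
    """Return the trimmed address, or None when the prompt was skipped.

    Hypothetical helper: the real CLI may validate the address further.
    """
    email = raw.strip()
    return email or None


def confirmation_message(email: Optional[str]) -> str:
    """Build the feedback line shown to the user after the prompt."""
    if email is None:
        # Skipped: alerts stay console-only.
        return "Email notifications disabled. Alerts will only appear in Cloud Console."
    return f"Alerts will be sent to: {email}"
```

Treating "skipped" as `None` rather than an empty string keeps the later decision (create a notification channel or not) a simple truthiness check.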

2. Configurable Alert Thresholds
All thresholds are exposed as Terraform variables with sensible defaults:

Latency alerts: P95 threshold (default: 3000ms)
Error rate alerts: Error count per 5-min window (default: 10 errors)
Retriever latency alerts (Agentic RAG only): P99 threshold (default: 10000ms)
Agent error rate (Cloud Run): Errors per second (default: 0.5/sec)

Users can customise these in deployment/terraform/dev/vars/env.tfvars after project creation.
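As a rough illustration of how the defaults map onto a tfvars file, the sketch below models the four thresholds and renders them as `name = value` lines. The variable names are assumptions made for this example; the actual names in variables.tf may differ. Only the default values come from the list above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AlertThresholds:
    # Field names are hypothetical; defaults are the ones listed in this PR.
    latency_p95_ms: int = 3000
    error_count_per_5min: int = 10
    retriever_latency_p99_ms: int = 10000
    agent_error_rate_per_sec: float = 0.5


def to_tfvars(thresholds: AlertThresholds) -> str:
    """Render the thresholds as `name = value` lines in tfvars syntax."""
    return "\n".join(
        f"{name} = {value}" for name, value in asdict(thresholds).items()
    )
```

For example, `to_tfvars(AlertThresholds(latency_p95_ms=5000))` would produce the same four lines with only the latency threshold changed.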

Infrastructure Added

Universal Log-Based Metrics (All Agents, All Platforms)

Agent Operation Count: Tracks all agent operations with operation type labels
Agent Error Count by Category: Categorised errors (LLM_FAILURE, TOOL_FAILURE, RETRIEVER_FAILURE, etc.)

Agentic RAG-Specific Metrics

Retriever Latency Distribution: P50/P95/P99 retrieval performance with histogram buckets
Document Count Distribution: Number of documents retrieved per call
Retriever Latency Alert: Fires when P99 > threshold (default 10s)
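To make the P99 condition concrete, here is a minimal sketch that evaluates it over raw latency samples using the nearest-rank percentile. It is a simplification: Cloud Monitoring derives percentiles from histogram buckets rather than raw samples, and the function names here are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct in the range 0-100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Rank of the smallest sample with at least pct% of values at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def retriever_alert_fires(
    latencies_ms: Sequence[float], threshold_ms: float = 10000.0
) -> bool:
    """True when the P99 retrieval latency exceeds the threshold."""
    return percentile(latencies_ms, 99) > threshold_ms
```

Note that a single slow call among 100 does not trip the alert under this definition; at least two of 100 samples must exceed the threshold.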

Agent Engine (Reasoning Engine) Platform

Latency Alert: P95 request latency monitoring using native platform metrics
Error Rate Alert: Fires when log-based error count exceeds threshold in 5-min window
Dashboard: five charts for every agent, plus two more for Agentic RAG (seven in total):
- Request count (requests/sec)
- Request latency (P50/P95/P99)
- CPU allocation
- Memory allocation
- Agent errors by category
- Retriever latency (Agentic RAG only)
- Documents retrieved per call (Agentic RAG only)
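The error rate alert's windowed condition can be illustrated with a small sketch that counts error timestamps in the trailing five minutes. The real evaluation happens inside Cloud Monitoring against the log-based metric; the function names and the strict "greater than" comparison are assumptions based on the wording "exceeds threshold".

```python
from datetime import datetime, timedelta
from typing import Iterable


def errors_in_window(
    timestamps: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
) -> int:
    """Count error events that fall inside the trailing window ending at `now`."""
    start = now - window
    return sum(1 for ts in timestamps if start < ts <= now)


def error_alert_fires(
    timestamps: Iterable[datetime], now: datetime, threshold: int = 10
) -> bool:
    """True when the windowed error count exceeds the threshold (default 10)."""
    return errors_in_window(timestamps, now) > threshold
```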

Cloud Run Platform

Latency Alert: P95 request latency using Cloud Run native metrics
5xx Error Rate Alert: Monitors 5xx response codes
Agent Error Alert: Log-based agent errors with rate threshold

Technical Details

Terraform Structure

New file: deployment/terraform/dev/monitoring.tf (757 lines)
New file: deployment/terraform/monitoring.tf (prod equivalent)
Modified: deployment/terraform/dev/variables.tf (added 4 monitoring variables)
Modified: deployment/terraform/dev/vars/env.tfvars (added default threshold values)
Modified: deployment/terraform/dev/apis.tf (added monitoring.googleapis.com)

Python CLI Integration

agent_starter_pack/cli/commands/create.py: Added interactive email prompt
agent_starter_pack/cli/utils/template.py: Thread alert_notification_email through template processing
tests/cli/commands/test_create.py: Updated test mocks to handle new prompt

Smart Templating

Uses Jinja2 conditionals to render appropriate resources based on:
- cookiecutter.deployment_target (agent_engine vs cloud_run)
- cookiecutter.agent_name (agentic_rag gets extra retriever metrics)
Notification channel only created if email provided (using Terraform count)
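The selection logic above lives in Jinja2 conditionals and a Terraform `count`, but its shape can be shown in plain Python. The resource labels below are hypothetical shorthand for the resources listed under "Infrastructure Added", not the actual Terraform resource names.

```python
from typing import List, Optional


def select_monitoring_resources(
    deployment_target: str, agent_name: str, alert_email: Optional[str]
) -> List[str]:
    """Return illustrative labels for the resources that would be rendered."""
    # Universal log-based metrics: all agents, all platforms.
    resources = ["agent_operation_count_metric", "agent_error_count_metric"]

    if agent_name == "agentic_rag":
        resources += [
            "retriever_latency_metric",
            "retriever_document_count_metric",
            "retriever_latency_alert",
        ]

    if deployment_target == "agent_engine":
        resources += ["latency_alert", "error_rate_alert", "dashboard"]
    elif deployment_target == "cloud_run":
        resources += ["latency_alert", "5xx_error_rate_alert", "agent_error_alert"]
    else:
        raise ValueError(f"unknown deployment target: {deployment_target}")

    # Mirrors the Terraform count idiom: one channel if an email was given, else none.
    if alert_email:
        resources.append("email_notification_channel")
    return resources
```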

Metric Design Decisions

Why log-based metrics for agent telemetry?

Platform-agnostic: Works on both Agent Engine and Cloud Run
Flexible: Can extract any JSON payload attribute from structured logs
Extensible: Users can add custom agent metrics by logging with the right labels
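As a sketch of what "logging with the right labels" might look like, the helper below builds a single-line JSON log entry carrying an error category. The field names `error_category` and `service_name` are assumptions for illustration; the log-based metric filters in monitoring.tf define which payload attributes are actually extracted.

```python
import json


def build_error_log(service_name: str, error_category: str, message: str) -> str:
    """Serialise one structured error entry as a single JSON line.

    Field names other than `severity` and `message` are hypothetical.
    """
    entry = {
        "severity": "ERROR",
        "message": message,
        "service_name": service_name,
        "error_category": error_category,  # e.g. LLM_FAILURE, TOOL_FAILURE
    }
    return json.dumps(entry)
```

Writing such a line to stdout, for instance with `print(build_error_log("my-agent", "TOOL_FAILURE", "search tool timed out"))`, is one common way to get structured payloads into Cloud Logging on managed runtimes.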

Why native metrics for platform SLOs?

Accuracy: Platform-provided metrics are the source of truth
Performance: No additional overhead from log processing
Consistency: Aligns with Google Cloud best practices

Alert auto-close: 30 minutes

Prevents alert fatigue from transient issues
Long enough to investigate without losing context
Configurable via alert_strategy.auto_close if users want different behaviour

Test Coverage

All tests pass:

✅ 95/95 CLI tests (including new email prompt flow)
✅ Ruff linting
✅ Mypy type checking
✅ Import ordering fixed

Migration Notes

Existing Projects

This is a template change only - existing deployed agents are unaffected
Users can retrofit monitoring by:
1. Copying the new monitoring.tf files
2. Adding the monitoring variables
3. Running terraform apply

New Projects

Zero additional effort required
Users just need to answer the email prompt during creation
Monitoring deploys automatically with the agent infrastructure

Example Usage

$ agent-starter-pack create my-agent

# ... after other prompts ...

Monitoring & Alerting Setup
Configure email notifications for production alerts (optional).
Email for alert notifications: ops-team@example.com
✓ Alerts will be sent to: ops-team@example.com

# Or skip it:
Email for alert notifications:
⚠ Email notifications disabled. Alerts will only appear in Cloud Console.

After deployment, users get:

Real-time dashboards in Cloud Monitoring
Automatic alerts when thresholds are breached
Structured logs for debugging with labels.service_name filtering

Checklist

Added user-facing email prompt with clear feedback
Created comprehensive monitoring.tf for both dev and prod
Added configurable threshold variables with sensible defaults
Platform-specific alerts (Agent Engine vs Cloud Run)
Agent-specific metrics (Agentic RAG retriever monitoring)
Updated CLI tests with new prompt flow
All linting and type checking passes
Tested with both email provided and skipped scenarios


gemini-code-assist · 2025-10-21T08:46:58Z

Summary of Changes

Hello @saahil-mehta, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability of agents deployed via the starter pack by integrating robust monitoring and alerting capabilities. It provides a foundational Terraform setup for tracking agent performance, identifying errors, and receiving timely notifications, ensuring better operational insights and reliability for deployed agents.

Highlights

Monitoring Infrastructure: Introduced new Terraform files (monitoring.tf) to establish comprehensive monitoring and alerting for deployed agents, including log-based metrics, alert policies, and a pre-configured Google Cloud Monitoring dashboard.
Service Enablement: Enabled the monitoring.googleapis.com service in apis.tf files to support the new monitoring features.
Configurable Alerts: Added new Terraform variables (variables.tf, vars/env.tfvars) to allow users to configure alert notification emails and customize thresholds for latency and error rates.
CLI Integration: Updated the create command in the CLI to interactively prompt users for an alert notification email during agent setup, streamlining the configuration process.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a comprehensive monitoring and alerting solution for the agent starter pack. It adds log-based metrics, alert policies for key performance indicators like latency and error rates, and a pre-configured monitoring dashboard. The changes are applied to both dev and prod/staging environments. The CLI has also been updated to allow users to configure an email address for alert notifications.

The implementation is solid, but there is a significant amount of code duplication between the Terraform configurations for the different environments. I have added a comment suggesting a refactoring into a reusable Terraform module to improve maintainability. Other than that, the changes are well-executed and a valuable addition to the project.

agent_starter_pack/base_template/deployment/terraform/monitoring.tf

gemini-code-assist

Code Review

This pull request introduces a comprehensive and well-structured monitoring and alerting capability for the agent starter pack. The changes are extensive, covering Terraform infrastructure for metrics, alerts, and dashboards, as well as updates to the Python CLI to support configuration during project creation. The implementation is robust, leveraging a smart combination of native platform metrics and flexible log-based metrics, with thoughtful considerations for different deployment targets and agent types. The code is of high quality, and the feature is a valuable addition. I have one suggestion to improve the robustness of the Terraform dependencies.

agent_starter_pack/base_template/deployment/terraform/modules/monitoring/main.tf


…etrics Add conditional depends_on entries for agent_retriever_latency and agent_retriever_document_count metrics when agent_name is agentic_rag. This prevents potential race conditions during terraform apply where the dashboard could be created before the log-based metrics exist. The dashboard references these metrics by name in filter strings, so Terraform cannot automatically detect the dependency. Explicit depends_on ensures proper resource creation order.


saahil-mehta · 2025-11-03T10:01:22Z

Prompt for monitoring email addition:

Passing tests:

@eliasecchig @allen-stephen

allen-stephen · 2025-11-04T21:29:25Z

/gcbrun

saahil-mehta added 14 commits October 20, 2025 18:13

feat : agent monitoring usage + modules

229f883

feat : agent monitoring vars

6cd0400

feat : agent monitoring apis

42a639a

feat(var) : notification email

8ed27f5

feat(vars) : user configurable thresholds

51a787a

feat : user configurable email alerting

ce49d29

update : alerting vars

82a41f4

update : doc clarity

acd627b

refactor : fmt

600c5b4

Revert "feat : user configurable email alerting"

ec677ad

This reverts commit ce49d29.

refactor : format

2f01ebe

feat : (optional) email for alerting

f8e5361

feat : alerting email

c4bfc46

fix : flow hiccup + improve coherence

900ef46

gemini-code-assist bot reviewed Oct 21, 2025

agent_starter_pack/base_template/deployment/terraform/monitoring.tf (resolved)

refactor : monitoring module + usage

376c415

gemini-code-assist bot reviewed Oct 21, 2025

agent_starter_pack/base_template/deployment/terraform/modules/monitoring/main.tf (resolved)

saahil-mehta added 12 commits November 3, 2025 08:47

feat : agent monitoring usage + modules

b180d74

feat : agent monitoring vars

2205e6f

feat : agent monitoring apis

b597ffd

feat(var) : notification email

8b81302

feat(vars) : user configurable thresholds

c24a471

feat : user configurable email alerting

c04d70a

update : alerting vars

bbc058e

update : doc clarity

35e4346

refactor : fmt

ea1bea8

Revert "feat : user configurable email alerting"

5149686

This reverts commit ce49d29.

refactor : format

7378b91

feat : (optional) email for alerting

c929ab1

saahil-mehta added 5 commits November 3, 2025 08:47

feat : alerting email

9c0aef6

fix : flow hiccup + improve coherence

9f20b51

refactor : monitoring module + usage

de64105

Merge branch 'agent-monitoring' of https://github.com/saahil-mehta/ag…

133bec3

…ent-starter-pack into agent-monitoring
