add-agent-monitoring-alerting #401
base: main
This reverts commit ce49d29.
Summary of Changes
Hello @saahil-mehta, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the observability of agents deployed via the starter pack by integrating robust monitoring and alerting capabilities. It provides a foundational Terraform setup for tracking agent performance, identifying errors, and receiving timely notifications, ensuring better operational insights and reliability for deployed agents.
Highlights
Code Review
This pull request introduces a comprehensive monitoring and alerting solution for the agent starter pack. It adds log-based metrics, alert policies for key performance indicators like latency and error rates, and a pre-configured monitoring dashboard. The changes are applied to both dev and prod/staging environments. The CLI has also been updated to allow users to configure an email address for alert notifications.
The implementation is solid, but there is a significant amount of code duplication between the Terraform configurations for the different environments. I have added a comment suggesting a refactoring into a reusable Terraform module to improve maintainability. Other than that, the changes are well-executed and a valuable addition to the project.
Code Review
This pull request introduces a comprehensive and well-structured monitoring and alerting capability for the agent starter pack. The changes are extensive, covering Terraform infrastructure for metrics, alerts, and dashboards, as well as updates to the Python CLI to support configuration during project creation. The implementation is robust, leveraging a smart combination of native platform metrics and flexible log-based metrics, with thoughtful considerations for different deployment targets and agent types. The code is of high quality, and the feature is a valuable addition. I have one suggestion to improve the robustness of the Terraform dependencies.
agent_starter_pack/base_template/deployment/terraform/modules/monitoring/main.tf
This reverts commit ce49d29.
…etrics Add conditional depends_on entries for agent_retriever_latency and agent_retriever_document_count metrics when agent_name is agentic_rag. This prevents potential race conditions during terraform apply where the dashboard could be created before the log-based metrics exist. The dashboard references these metrics by name in filter strings, so Terraform cannot automatically detect the dependency. Explicit depends_on ensures proper resource creation order.
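The ordering problem described in this commit can be sketched as follows. This is a minimal illustration, not the PR's actual code: only the metric name `agent_retriever_latency` comes from the commit message, while the dashboard resource name, the log field, the filter, and the bucket settings are assumptions. Terraform's `depends_on` takes a static list of references, so the condition on `agent_name` presumably lives in the cookiecutter template; the sketch shows what a rendered `agentic_rag` project might contain.

```hcl
# Sketch only: the metric name comes from the commit message; every other
# name, filter, and value here is an illustrative assumption.

resource "google_logging_metric" "agent_retriever_latency" {
  name            = "agent_retriever_latency"
  filter          = "jsonPayload.retriever_latency_ms:*" # hypothetical log field
  value_extractor = "EXTRACT(jsonPayload.retriever_latency_ms)"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "DISTRIBUTION"
    unit        = "ms"
  }

  bucket_options {
    exponential_buckets {
      num_finite_buckets = 64
      growth_factor      = 2
      scale              = 1
    }
  }
}

resource "google_monitoring_dashboard" "agent_dashboard" {
  dashboard_json = jsonencode({
    displayName = "Agent Monitoring (sketch)"
    gridLayout = {
      widgets = [{
        title = "Retriever latency (p95)"
        xyChart = {
          dataSets = [{
            timeSeriesQuery = {
              timeSeriesFilter = {
                # The metric appears only by name inside a string, so
                # Terraform sees no reference to the resource above.
                filter = "metric.type=\"logging.googleapis.com/user/agent_retriever_latency\""
                aggregation = {
                  alignmentPeriod  = "60s"
                  perSeriesAligner = "ALIGN_PERCENTILE_95"
                }
              }
            }
          }]
        }
      }]
    }
  })

  # Explicit ordering: create the log-based metric before the dashboard.
  depends_on = [google_logging_metric.agent_retriever_latency]
}
```

Without the `depends_on` line, Terraform is free to create both resources in parallel, which is the race the commit message describes.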
…ent-starter-pack into agent-monitoring
/gcbrun
Add Monitoring & Alerting for Agent Deployments
Overview
Relates to #142. This PR adds monitoring and alerting infrastructure to the agent-starter-pack, giving users production-grade observability out of the box. The implementation is platform-aware, agent-aware, and fully configurable through user prompts during project creation.
What's New
User-Facing Features
1. Optional Email Alert Notifications
agent-starter-pack create asks for an email address
2. Configurable Alert Thresholds
All thresholds are exposed as Terraform variables with sensible defaults:
Users can customise these in deployment/terraform/dev/vars/env.tfvars after project creation.
Infrastructure Added
Universal Log-Based Metrics (All Agents, All Platforms)
Agentic RAG-Specific Metrics
Agent Engine (Reasoning Engine) Platform
Cloud Run Platform
Technical Details
Terraform Structure
deployment/terraform/dev/monitoring.tf (757 lines)
deployment/terraform/monitoring.tf (prod equivalent)
deployment/terraform/dev/variables.tf (added 4 monitoring variables)
deployment/terraform/dev/vars/env.tfvars (added default threshold values)
deployment/terraform/dev/apis.tf (added monitoring.googleapis.com)
Python CLI Integration
agent_starter_pack/cli/commands/create.py: Added interactive email prompt
agent_starter_pack/cli/utils/template.py: Thread alert_notification_email through template processing
tests/cli/commands/test_create.py: Updated test mocks to handle new prompt
Smart Templating
cookiecutter.deployment_target (agent_engine vs cloud_run)
cookiecutter.agent_name (agentic_rag gets extra retriever metrics)
count)
Metric Design Decisions
Why log-based metrics for agent telemetry?
Why native metrics for platform SLOs?
Alert auto-close: 30 minutes
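As a rough illustration of where the 30-minute auto-close sits in an alert policy, here is a minimal sketch. The resource name, metric name, filter, and threshold value are hypothetical and not taken from the PR; only the `alert_strategy.auto_close` setting reflects the behaviour described here.

```hcl
# Sketch only: resource name, metric, and threshold are hypothetical.
resource "google_monitoring_alert_policy" "agent_error_rate" {
  display_name = "Agent error rate (sketch)"
  combiner     = "OR"

  conditions {
    display_name = "Error rate above threshold"

    condition_threshold {
      filter          = "metric.type=\"logging.googleapis.com/user/agent_error_count\" AND resource.type=\"cloud_run_revision\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.5
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }

  alert_strategy {
    # Open incidents close once the policy has seen no data for 30 minutes.
    auto_close = "1800s"
  }
}
```

The value is a duration string in seconds, so a one-hour auto-close would be `"3600s"`.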
alert_strategy.auto_close if users want different behaviour
Test Coverage
All tests pass:
Migration Notes
Existing Projects
monitoring.tf files
terraform apply
New Projects
Example Usage
After deployment, users get:
labels.service_name filtering
Related Documentation
The monitoring infrastructure automatically creates:
Users can find their dashboard at:
Checklist