Conversation

beengud (Contributor) commented Oct 7, 2025

Summary

Configured Prometheus in mop-edge and mop-cloud environments to remote write metrics to the central Mimir instance for long-term storage and multi-tenancy support. Added ingress endpoints for direct Prometheus access in both environments.

Changes

  • mop-edge: Added Prometheus ingress (prometheus-edge.gudo11y.local) and remote write configuration to Mimir
  • mop-cloud: Added Prometheus ingress (prometheus-cloud.gudo11y.local) and remote write configuration to Mimir
  • mop-cloud spec: Updated spec.json to use orbstack context and monitoring namespace (previously minikube/default)
  • Remote Write: Both environments configured to push metrics to http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push
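
For reference, the remote write block added to each environment's kps.jsonnet looks roughly like this (a sketch: the URL and relabel config are as committed, while the surrounding values structure is abbreviated and follows kube-prometheus-stack conventions):

    {
      values+:: {
        prometheus+: {
          prometheusSpec+: {
            remoteWrite: [
              {
                url: 'http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push',
                writeRelabelConfigs: [
                  { sourceLabels: ['__name__'], regex: '.*', action: 'keep' },
                ],
              },
            ],
          },
        },
      },
    }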

Architecture

Prometheus instances in edge and cloud environments now write metrics to the central Mimir instance, enabling:

  • Centralized long-term metrics storage
  • Unified querying across environments
  • Reduced local storage requirements in edge/cloud
  • Foundation for multi-tenant observability

Test plan

  • [x] Applied mop-edge configuration with tk apply
  • [x] Verified Prometheus ingress created successfully
  • [x] Verified prometheus-server pod is running
  • [x] Confirmed remote write configuration in kps.jsonnet
  • [ ] Verify metrics are flowing to Mimir (check Mimir ingester logs)
  • [ ] Access Prometheus UI at prometheus-edge.gudo11y.local
  • [ ] Query metrics in central Grafana from Mimir datasource

🤖 Generated with Claude Code

beengud merged commit dc4e4c8 into main Oct 7, 2025 (1 check passed)

claude bot commented Oct 7, 2025

Pull Request Review

✅ Strengths

  1. Clear Architecture: The remote write configuration follows the distributed observability pattern well, with edge/cloud Prometheus instances forwarding to central Mimir.

  2. Consistent Implementation: Both mop-cloud and mop-edge environments have identical remote write configurations, which is good for maintainability.

  3. Good Documentation: PR description clearly explains the architecture and benefits (centralized storage, unified querying, reduced local storage).

  4. Proper Namespace Correction: The mop-cloud/spec.json update to use monitoring namespace and orbstack context aligns with the mop-central environment.


🔍 Issues and Recommendations

Critical Issues

  1. Multitenancy Disabled in Mimir (tanka/environments/mop-central/mimir.jsonnet:14)

    multitenancy_enabled: false,
    • Mimir has multitenancy disabled, but you're pushing metrics from multiple environments without tenant headers
    • Risk: Metrics from mop-edge and mop-cloud will mix together without isolation
    • Recommendation: Either:
      • Enable multitenancy and add an X-Scope-OrgID header in the remote write config (see the sketch after this list)
      • Or explicitly acknowledge single-tenant mode is intended for this architecture
  2. Missing Tenant ID in Remote Write (kps.jsonnet:22 in both files)

    remoteWrite: [
      {
        url: 'http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push',
        // Missing: headers for tenant identification
      }
    ]
    • Without tenant headers, you can't distinguish metrics by environment in Mimir
    • Recommendation: Add tenant headers:
    remoteWrite: [
      {
        url: 'http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push',
        headers: {
          'X-Scope-OrgID': 'mop-edge'  // or 'mop-cloud'
        },
        writeRelabelConfigs: [...]
      }
    ]
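
As a concrete sketch of the first option above (the object structure around mimir.jsonnet:14 is not shown in this diff, so only the flag itself is illustrated):

    {
      // Wherever the existing `multitenancy_enabled: false` sits (mimir.jsonnet:14);
      // tenants are then derived from the X-Scope-OrgID header sent by each environment.
      multitenancy_enabled: true,
    }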

Security Concerns

  1. Unencrypted Cross-Cluster Communication

    • Using http:// for remote write instead of https://
    • Risk: Metrics in transit are unencrypted (acceptable for local dev, risky for production)
    • Recommendation: Document that TLS should be enabled for production or use a service mesh (like the Linkerd already in mop-central)
  2. No Authentication on Remote Write

    • Missing basicAuth, bearerToken, or other auth mechanisms
    • Risk: Any pod can push metrics to Mimir
    • Recommendation: Add authentication, especially important if multitenancy is enabled
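
A sketch addressing both points (it assumes a Kubernetes Secret named mimir-remote-write-creds with username/password keys in the Prometheus namespace; the field names follow the Prometheus Operator RemoteWriteSpec that kube-prometheus-stack passes through):

    remoteWrite: [
      {
        // Switch to https:// (or lean on the Linkerd mesh) outside local dev.
        url: 'http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push',
        basicAuth: {
          username: { name: 'mimir-remote-write-creds', key: 'username' },
          password: { name: 'mimir-remote-write-creds', key: 'password' },
        },
      },
    ],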

Configuration Issues

  1. Redundant Write Relabel Config (kps.jsonnet:24-30 in both files)

    writeRelabelConfigs: [
      {
        sourceLabels: ['__name__'],
        regex: '.*',
        action: 'keep',
      },
    ]
    • This relabel config keeps all metrics (matches everything with .*)
    • Issue: It's functionally a no-op and adds unnecessary configuration
    • Recommendation: Either remove it or make it meaningful (e.g., filter specific metrics, add environment labels); see the sketch after this list
  2. Inconsistent Namespace Usage

    • mop-edge uses common.namespace ('mop') but hardcodes 'monitoring' in Mimir URL
    • mop-cloud uses 'monitoring' directly
    • Issue: Creates coupling between environments; if Mimir moves namespaces, both configs need updates
    • Recommendation: Extract Mimir endpoint to a shared config in common.libsonnet:
    // In common.libsonnet
    central:: {
      grafana_domain:: 'grafana.gudo11y.local',
      mimir_push_endpoint:: 'http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push',
    },
  3. Missing Retention Configuration

    • Edge/cloud Prometheus still use the chart's default retention (10d in kube-prometheus-stack unless overridden)
    • Issue: Defeats the purpose of remote write if local storage requirements remain high
    • Recommendation: Reduce retention since metrics are centralized:
    prometheusSpec+: {
      retention: '6h',  // Short retention for local queries, rely on Mimir for history
      remoteWrite: [...]
    }
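
For the first point, a writeRelabelConfigs block that does real work might look like the following (the dropped metric names and the environment label are illustrative, not taken from the repo):

    writeRelabelConfigs: [
      // Drop a couple of high-churn histogram families before they reach Mimir.
      {
        sourceLabels: ['__name__'],
        regex: 'apiserver_request_duration_seconds_bucket|etcd_request_duration_seconds_bucket',
        action: 'drop',
      },
      // Stamp every forwarded series with its originating environment.
      {
        targetLabel: 'environment',
        replacement: 'mop-edge',  // 'mop-cloud' in the cloud environment
        action: 'replace',
      },
    ],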

Best Practices & Maintainability

  1. Code Duplication

    • Identical configuration blocks in both kps.jsonnet files
    • Recommendation: Extract to shared config in common.libsonnet:
    // In common.libsonnet
    defaults:: {
      kps:: {
        values:: {
          // ... existing config ...
          prometheus+: {
            prometheusSpec+: {
              remoteWriteConfig(environment):: [
                {
                  url: $.central.mimir_push_endpoint,
                  headers: { 'X-Scope-OrgID': environment },
                  queueConfig: {
                    capacity: 10000,
                    maxShards: 50,
                  },
                },
              ],
            },
          },
        },
      },
    },
  2. Missing Remote Write Tuning

    • Default queue config may not be optimal for edge environments with intermittent connectivity
    • Recommendation: Add queue configuration:
    remoteWrite: [
      {
        url: '...',
        queueConfig: {
          capacity: 10000,        // Buffer capacity
          maxShards: 50,          // Parallelism
          minShards: 1,
          maxSamplesPerSend: 5000,
          batchSendDeadline: '5s',
          minBackoff: '30ms',
          maxBackoff: '5s',
        },
      }
    ]
  3. Missing Remote Write Metrics Monitoring

    • No mention of monitoring remote write health
    • Recommendation: Add ServiceMonitor or alert rules for:
      • prometheus_remote_storage_samples_failed_total
      • prometheus_remote_storage_samples_dropped_total
      • prometheus_remote_storage_queue_highest_sent_timestamp_seconds
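
For the monitoring point, one option is an alert on the failure counter via the chart's additionalPrometheusRulesMap value (rule name and threshold are illustrative):

    values+: {
      additionalPrometheusRulesMap+: {
        'remote-write-health': {
          groups: [
            {
              name: 'remote-write',
              rules: [
                {
                  alert: 'MimirRemoteWriteFailing',
                  expr: 'rate(prometheus_remote_storage_samples_failed_total[5m]) > 0',
                  'for': '10m',
                  labels: { severity: 'warning' },
                  annotations: { summary: 'Prometheus remote write to Mimir is failing samples' },
                },
              ],
            },
          ],
        },
      },
    },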

Testing Gaps

  1. Incomplete Test Plan

    • Test plan shows unchecked items for actual verification
    • Recommendation: Before merging, verify:
      • Metrics actually arrive in Mimir (check Mimir ingester logs)
      • Query metrics from Mimir via Grafana
      • Verify tenant isolation (if multitenancy enabled)
      • Test remote write failure scenarios (network issues, Mimir down)
  2. No Validation Script

    • Manual verification is error-prone
    • Recommendation: Add a validation script:
    #!/bin/bash
    # scripts/validate-remote-write.sh
    # Adjust the kube context / namespace per environment: mop-edge runs Prometheus
    # in the `mop` namespace, mop-cloud in `monitoring`.
    set -euo pipefail
    for env in edge cloud; do
      echo "Checking mop-$env remote write..."
      kubectl exec -n mop prometheus-kps-prometheus-0 -- \
        wget -qO- localhost:9090/api/v1/status/config | \
        jq -r '.data.yaml' | grep -A 5 remote_write
    done

📊 Performance Considerations

  • Network Bandwidth: Remote write will consume network bandwidth proportional to metric cardinality
  • Mimir Capacity: Ensure Mimir ingester can handle the additional load from two environments
  • Backpressure: Without proper queue config, Prometheus may drop metrics under load

🧪 Test Coverage

  • ❌ No automated tests for remote write configuration
  • ❌ No integration tests to verify metrics flow
  • ❌ No validation of Mimir receiving metrics
  • Recommendation: Add Jsonnet unit tests using jsonnetunit or validate manifests in CI
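
A minimal jsonnetunit sketch for the remote write URL (the import path and field access are illustrative and depend on how kps.jsonnet exposes its values):

    local test = import 'jsonnetunit/test.libsonnet';
    local kps = import 'environments/mop-edge/kps.jsonnet';

    test.suite({
      testRemoteWriteUrl: {
        actual: kps.values.prometheus.prometheusSpec.remoteWrite[0].url,
        expect: 'http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push',
      },
    })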

📝 Documentation Recommendations

  1. Add comments in kps.jsonnet explaining the remote write strategy
  2. Document the multi-tenant setup (or lack thereof) in architecture docs
  3. Add troubleshooting guide for remote write issues
  4. Document expected metric retention policy across environments

✅ Suggested Changes Summary

High Priority:

  • Add tenant headers to remote write or document single-tenant design decision
  • Reduce local Prometheus retention to leverage centralized storage
  • Extract Mimir endpoint to shared config

Medium Priority:

  • Add authentication to remote write
  • Add remote write queue tuning
  • Complete test plan verification
  • Extract duplicated config to common.libsonnet

Low Priority:

  • Remove redundant writeRelabelConfigs or make them meaningful
  • Add remote write monitoring alerts
  • Add validation script
  • Consider TLS for production

📋 Conclusion

This PR implements a solid foundation for centralized metrics with Mimir, but needs refinement around multi-tenancy, configuration deduplication, and retention policy. The architecture is sound, but the implementation should be hardened before production use.

Recommendation: Request changes to address tenant identification and retention policy before merging.
