Monitoring and Health

The current infra-ansible monitoring baseline is metrics-first.

Typical stack

  • Grafana Alloy for scraping and remote_write
  • node_exporter for host metrics
  • cAdvisor for container metrics
  • SynckHub/Common app metrics endpoints

Design constraints

  • metrics endpoints stay loopback-only
  • outbound push model (no public inbound scrape)
  • stable relabeling for instance and job
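
The loopback-only constraint can be checked mechanically before a config is rolled out. The following is a minimal sketch, not part of infra-ansible: the function name is_loopback_bind and the "host:port" input format are assumptions made for illustration.

```python
import ipaddress


def is_loopback_bind(listen_address: str) -> bool:
    """Return True if a 'host:port' listen address binds only to loopback.

    Hypothetical helper: the 'host:port' format is an assumption,
    not something infra-ansible defines.
    """
    host, _, _port = listen_address.rpartition(":")
    host = host.strip("[]")  # allow bracketed IPv6 such as [::1]:9100
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Empty host (":9100") or a non-IP hostname: not provably loopback
        return False
```

An address such as 127.0.0.1:9100 or [::1]:9100 passes, while 0.0.0.0:9100 or a bare :9100 is rejected because it would listen on every interface.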

Minimum alert set

Start with absence and infrastructure pressure alerts:

  • missing data for SynckHub/Common jobs
  • disk usage thresholds
  • host CPU and memory pressure
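
The logic behind the first two alert types can be sketched as follows. This is an illustration only: the job names, the 300-second maximum age, and the 80%/90% disk thresholds are assumed values, and a real deployment would express these as alert rules in the metrics backend rather than in Python.

```python
import shutil
import time


def missing_jobs(last_seen, expected_jobs, max_age_seconds=300.0, now=None):
    """Return expected jobs with no sample newer than max_age_seconds.

    last_seen maps a job name to the Unix timestamp of its latest sample.
    The 300 s default is an assumed value, not a documented one.
    """
    now = time.time() if now is None else now
    return sorted(
        job
        for job in expected_jobs
        if job not in last_seen or now - last_seen[job] > max_age_seconds
    )


def disk_usage_level(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction (0.0 to 1.0) to an alert level.

    The warn/crit defaults are assumed thresholds for illustration.
    """
    if used_fraction >= crit:
        return "critical"
    if used_fraction >= warn:
        return "warning"
    return "ok"


def used_fraction(path="/"):
    """Return the fraction of the filesystem at path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

Absence alerts matter most in a push model: a host that stops pushing produces no failing scrape, so the only signal is that its series stop arriving.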

Health objective

You should be able to answer these questions quickly:

  • Is the service up?
  • Is it overloaded?
  • Is the authz/provision sync stale?
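
Each of the three questions should reduce to a single yes/no answer derived from metrics. A minimal sketch, assuming a hypothetical Snapshot of already-collected values; the field names, the 0.90 overload threshold, and the 900-second sync age limit are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of one service's metrics."""
    up: bool
    cpu_fraction: float             # 0.0 to 1.0
    memory_fraction: float          # 0.0 to 1.0
    seconds_since_last_sync: float  # age of the last successful sync


def health_summary(s, overload_threshold=0.90, max_sync_age_seconds=900.0):
    """Answer the three health questions for one snapshot.

    Both thresholds are assumed values chosen for illustration.
    """
    return {
        "up": s.up,
        "overloaded": max(s.cpu_fraction, s.memory_fraction) >= overload_threshold,
        "sync_stale": s.seconds_since_last_sync > max_sync_age_seconds,
    }
```

If any of the three answers cannot be computed from the metrics already being pushed, that gap is a signal that an endpoint or exporter is missing from the baseline.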