Monitoring and Health
The current infra-ansible monitoring baseline is metrics-first.
Typical stack
- Grafana Alloy for scrape + remote_write
- node_exporter for host metrics
- cAdvisor for container metrics
- SynckHub/Common app metrics endpoints
Design constraints
- metrics endpoints stay loopback-only
- outbound push model (no public inbound scrape)
- stable relabeling for instance and job
Minimum alert set
Start with absence and infrastructure pressure alerts:
- missing data for SynckHub/Common jobs
- disk usage thresholds
- host CPU and memory pressure
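As a rough sketch, this minimum set could be written as Prometheus-style alerting rules, assuming the remote_write backend evaluates standard rule files. The job label values (synckhub, common), the thresholds, the durations, and the severity labels are illustrative assumptions, not values taken from infra-ansible. The node_* metric names are the ones node_exporter exposes on Linux.

```yaml
groups:
  - name: minimum-alerts
    rules:
      # Absence: fires when no `up` series exists at all for the job.
      # The job label values are assumed; match them to your relabeling.
      - alert: SynckHubMetricsMissing
        expr: absent(up{job="synckhub"})
        for: 10m
        labels:
          severity: critical
      - alert: CommonMetricsMissing
        expr: absent(up{job="common"})
        for: 10m
        labels:
          severity: critical

      # Disk usage: used fraction above an assumed 85% threshold.
      - alert: DiskUsageHigh
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85
        for: 15m
        labels:
          severity: warning

      # CPU pressure: busy fraction per host above an assumed 90%.
      - alert: HostCpuPressure
        expr: |
          (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.90
        for: 15m
        labels:
          severity: warning

      # Memory pressure: less than an assumed 10% of memory available.
      - alert: HostMemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 15m
        labels:
          severity: warning
```

Absence rules matter more than usual in a push model: if the agent on a host stops pushing, the backend receives no samples at all rather than an `up == 0` sample, so a threshold alert alone would stay silent.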
Health objective
You should be able to answer quickly:
- Is the service up?
- Is it overloaded?
- Is the authz/provision sync stale?
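One way to make these three questions cheap to answer is to precompute a series for each, sketched here as Prometheus-style recording rules. The job label values and the recorded series names are assumptions. The sync timestamp metric in particular is hypothetical: this document does not name the metric the apps expose for authz/provision sync, so substitute the real one.

```yaml
groups:
  - name: health-questions
    rules:
      # Is the service up? 1 when the last scrape of the job succeeded.
      # Job label values are assumed.
      - record: health:service_up
        expr: up{job=~"synckhub|common"}

      # Is it overloaded? CPU busy fraction per host, from node_exporter.
      - record: health:cpu_busy_ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Is the authz/provision sync stale? Seconds since the last success.
      # HYPOTHETICAL metric name; replace with the app's actual metric.
      - record: health:authz_sync_age_seconds
        expr: time() - synckhub_authz_sync_last_success_timestamp_seconds
```

Each recorded series then maps to a single dashboard panel or a single threshold alert, so the answer does not depend on writing queries during an incident.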