Monitoring and Health

The current infra-ansible monitoring baseline is metrics-first.

Typical stack

Grafana Alloy for scraping and remote_write
node_exporter for host metrics
cAdvisor for container metrics
SynckHub/Common app metrics endpoints

Design constraints

metrics endpoints stay loopback-only
outbound push model (no public inbound scrape)
stable relabeling for instance and job
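
The loopback-only constraint can be checked mechanically. The following is a minimal sketch, not part of infra-ansible: the helper name is_loopback_bind and the example addresses are assumptions for illustration, and it uses only the Python standard library.

```python
import ipaddress


def is_loopback_bind(listen_addr: str) -> bool:
    """Return True if a host:port listen address binds only to loopback.

    Hypothetical helper: the name and the accepted input format are
    assumptions, not something defined by infra-ansible.
    """
    # Split off the port; rpartition keeps IPv6 colons inside the host part.
    host, _, _port = listen_addr.rpartition(":")
    host = host.strip("[]")  # allow bracketed IPv6 such as [::1]:9100
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Empty host (":9100") or an unparseable name is treated as unsafe.
        return False
```

A check like this could run against rendered service configs before deploy, so that an exporter bound to 0.0.0.0 is caught early rather than discovered later from outside the host.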

Minimum alert set

Start with absence and infrastructure pressure alerts:

missing data for SynckHub/Common jobs
disk usage thresholds
host CPU and memory pressure
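
As an illustration of the two alert families above, here is a minimal sketch of an absence check and a disk-usage threshold check. The threshold values, the maximum gap, and the function names are assumptions chosen for the example; real alerts would normally live in the alerting system rather than in Python.

```python
import shutil

# Hypothetical thresholds; tune them per host class.
DISK_WARN_PERCENT = 80.0
DISK_CRIT_PERCENT = 90.0
MAX_SAMPLE_GAP_SECONDS = 300.0


def is_absent(last_sample_ts: float, now_ts: float,
              max_gap: float = MAX_SAMPLE_GAP_SECONDS) -> bool:
    """Absence alert: no sample from a job within the allowed gap."""
    return (now_ts - last_sample_ts) > max_gap


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_severity(used_percent: float,
                  warn: float = DISK_WARN_PERCENT,
                  crit: float = DISK_CRIT_PERCENT) -> str:
    """Map a usage percentage to 'ok', 'warning', or 'critical'."""
    if used_percent >= crit:
        return "critical"
    if used_percent >= warn:
        return "warning"
    return "ok"
```

Absence is checked separately from thresholds because a job that stops reporting produces no values for a threshold rule to evaluate.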

Health objective

You should be able to answer these questions quickly:

Is the service up?
Is it overloaded?
Is the authz/provision sync stale?
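
The three questions can be read as three boolean answers derived from a small set of signals. The sketch below is one possible shape, under stated assumptions: the HealthSnapshot fields, the load-ratio definition, and the default limits are hypothetical and are not taken from the SynckHub/Common metrics.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Hypothetical inputs; field names are assumptions for illustration."""
    up: bool                   # did the most recent scrape of the service succeed?
    load_ratio: float          # e.g. 1-minute load average divided by CPU count
    seconds_since_sync: float  # age of the last successful authz/provision sync


def answer_health_questions(snap: HealthSnapshot,
                            max_load_ratio: float = 1.0,
                            max_sync_age_s: float = 900.0) -> dict:
    """Return one boolean per health question."""
    return {
        "up": snap.up,
        "overloaded": snap.load_ratio > max_load_ratio,
        "sync_stale": snap.seconds_since_sync > max_sync_age_s,
    }
```

For example, a snapshot with up=True, load_ratio=0.4, and seconds_since_sync=1200.0 reports the service as up and not overloaded, but with a stale sync under the assumed 900-second limit.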
