Observability bills are usage-based — per host, per GB ingested, per module — and they compound with every host, custom metric, and high-cardinality tag. Teams with engineering capacity move to open stacks (Prometheus + Grafana, Loki for logs, Tempo for traces; or Zabbix/Elastic) and cut spend sharply, trading SaaS polish for operational ownership.
Know what you’re replacing
Commercial suites bundle several products. Map each before starting: infrastructure metrics → Prometheus/OpenTelemetry; dashboards → Grafana; monitors/alerts → Prometheus rules + Alertmanager; logs → Loki or Elasticsearch/OpenSearch; APM/traces → Tempo/Jaeger; synthetics/RUM → separate tooling. The honest gap is cross-signal correlation and UX — achievable, but you own the integration.
Sizing & cost
Bills are largely per monitored host plus ingest. Self-hosting shifts cost to compute + storage + engineering time — usually far lower at scale, but not zero. Retention and cardinality are what made the incumbent expensive; they’ll size your Prometheus/Mimir and Loki storage too, so set them deliberately and use recording rules.
Migration flow
Inventory dashboards, monitors, retention, and paging integrations (export via API). Stand up the stack (the kube-prometheus-stack Helm chart is a common start) and roll out exporters/agents. Recreate the monitors that page humans first, rebuild top dashboards, then dual-run to compare coverage and false-positive rates. Cut over paging (Alertmanager → PagerDuty/Opsgenie) last.
The real shift: query language
PromQL is a different model from the incumbents’ query languages — rate calculations, histogram_quantile, label matching, recording rules. Budget time for on-call engineers to get fluent; alert quality depends on it.
Validation
Fire test alerts end-to-end (trigger → Alertmanager → pager → ack), do a dashboard-parity review, and run a retention/scale load test. The acceptance bar is “we get paged correctly for the incidents we care about.” Keep the incumbent running until that’s proven.
Open a source→target page for stack-specific steps and a per-host TCO model.