Datadog is a superb, all-in-one observability platform — and its consumption-based pricing (per host, plus per-GB ingest, plus per-module for APM, logs, RUM, and more) is one of the most unpredictable line items in modern infrastructure. Bill shock at scale is widely reported. The most common open-source replacement is a Prometheus + Grafana stack, often extended with Loki for logs and Tempo for traces. This guide is about doing that swap deliberately, not heroically.
Why teams leave Datadog
It’s almost never dissatisfaction with the product. It’s the cost trajectory: every new host, custom metric, high-cardinality tag, and module compounds the bill, and forecasting it is hard. Teams with the engineering capacity to run their own stack can cut spend dramatically — trading a managed SaaS for operational ownership.
What you’re actually replacing
Datadog bundles several products. Map each before you start:
- Infrastructure metrics → Prometheus (with
node_exporter,cAdvisor, and app/metricsendpoints) or the OpenTelemetry Collector. - Dashboards → Grafana.
- Monitors/alerts → Prometheus alerting rules + Alertmanager (routing, grouping, silences, paging).
- Logs → Loki (or Elasticsearch/OpenSearch).
- APM/traces → Tempo (or Jaeger), instrumented via OpenTelemetry.
- Synthetics/RUM → separate tooling (e.g., Blackbox exporter for uptime; RUM has fewer turnkey OSS options).
The honest gap: Datadog’s correlation across metrics/logs/traces and its polished UX take real effort to approximate. Grafana + Loki + Tempo get you most of the way, but you own the integration.
Sizing and cost model
Datadog bills largely per monitored host (plus ingest). Size your migration on the number of hosts/devices under monitoring and your metric/log volume. Self-hosting shifts cost to compute + storage + engineering time — usually far lower at scale, but not zero. Plan retention deliberately: long-retention, high-cardinality metrics are what made Datadog expensive, and they’ll size your Prometheus/Mimir and Loki storage too.
A safe migration flow
- Inventory what Datadog is doing for you: dashboards, monitors, integrations, retention, and paging/ticketing hooks. Export dashboards and monitors via the Datadog API.
- Stand up the stack. A common path is the
kube-prometheus-stackHelm chart (Prometheus + Alertmanager + Grafana) for Kubernetes, plus Loki/Tempo as needed. Deploynode_exporterand instrument apps with exporters or OpenTelemetry. - Recreate the essentials first. Translate your most important monitors into PromQL alerting rules and rebuild the top dashboards in Grafana (or import community equivalents). Don’t try to recreate everything on day one — prioritize what pages humans.
- Dual-run. Keep Datadog and the new stack running side by side; compare coverage, alert fidelity, and false-positive rates. This is where you find the gaps.
- Cut over paging. Move Alertmanager → PagerDuty/Opsgenie/email once you trust the alerts, then decommission Datadog agents.
PromQL is a real shift
Datadog’s query language and Prometheus’s PromQL are different models. Rate calculations, histogram_quantile, label matching, and recording rules all need learning. Budget time for your on-call engineers to get fluent — alert quality depends on it. Recording rules and sensible scrape intervals also keep cardinality (and cost) under control.
Validation before you trust it
Before switching paging off Datadog: fire test alerts end-to-end (trigger → Alertmanager → pager → acknowledgement), do a dashboard parity review against the metrics that matter, and run a retention/scale load test so you’re not surprised when storage fills. Treat “we get paged correctly for the incidents we care about” as the acceptance bar.
Bottom line
Datadog → Prometheus/Grafana is primarily a cost and ownership decision. The metrics and dashboards migrate well; alerting needs PromQL fluency; correlated logs/traces and polished UX take the most effort to match. Run both stacks in parallel, prove alert fidelity, then cut over paging last. Model your per-host savings in the calculator above — and validate the numbers against a real quote, since self-hosting trades license cost for engineering time.