Is migrating from Datadog to Prometheus worth it?

For most teams facing rising Datadog costs, yes. Prometheus has no license fee when self-hosted and typically lowers 3-year total cost of ownership, though the right answer depends on workload complexity and in-house skills. Use the calculator to model your own numbers.

How long does a Datadog to Prometheus migration take?

A typical mid-size estimate is around 18 weeks across six phases — discovery, design, pilot, waved production migration, validation, and decommission. Larger or more complex estates take longer.

What tools are used to migrate from Datadog to Prometheus?

Deploy node_exporter and other exporters alongside Prometheus and Alertmanager; rebuild Datadog monitors as Prometheus alerting rules; recreate dashboards in Grafana; and dual-run both stacks before cutover.

Migrating from Datadog to Prometheus: Cost, Plan & Tools (2026)

Datadog is a superb, all-in-one observability platform — and its consumption-based pricing (per host, plus per-GB ingest, plus per-module for APM, logs, RUM, and more) is one of the most unpredictable line items in modern infrastructure. Bill shock at scale is widely reported. The most common open-source replacement is a Prometheus + Grafana stack, often extended with Loki for logs and Tempo for traces. This guide is about doing that swap deliberately, not heroically.

Why teams leave Datadog

It’s almost never dissatisfaction with the product. It’s the cost trajectory: every new host, custom metric, high-cardinality tag, and module compounds the bill, and forecasting it is hard. Teams with the engineering capacity to run their own stack can cut spend dramatically — trading a managed SaaS for operational ownership.

What you’re actually replacing

Datadog bundles several products. Map each before you start:

Infrastructure metrics → Prometheus (with node_exporter, cAdvisor, and app /metrics endpoints) or the OpenTelemetry Collector.
Dashboards → Grafana.
Monitors/alerts → Prometheus alerting rules + Alertmanager (routing, grouping, silences, paging).
Logs → Loki (or Elasticsearch/OpenSearch).
APM/traces → Tempo (or Jaeger), instrumented via OpenTelemetry.
Synthetics/RUM → separate tooling (e.g., Blackbox exporter for uptime; RUM has fewer turnkey OSS options).

The honest gap: Datadog’s correlation across metrics/logs/traces and its polished UX take real effort to approximate. Grafana + Loki + Tempo get you most of the way, but you own the integration.

Sizing and cost model

Datadog bills largely per monitored host (plus ingest). Size your migration on the number of hosts/devices under monitoring and your metric/log volume. Self-hosting shifts cost to compute + storage + engineering time — usually far lower at scale, but not zero. Plan retention deliberately: long-retention, high-cardinality metrics are what made Datadog expensive, and they’ll size your Prometheus/Mimir and Loki storage too.
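The trade-off described above can be sketched as a small model. This is a rough illustration, not a pricing tool: every field name and every input is a hypothetical placeholder, and real Datadog bills include per-module charges that this sketch leaves out. Take the actual rates from your own invoice and infrastructure quotes.

```python
from dataclasses import dataclass


@dataclass
class SaasInputs:
    hosts: int
    price_per_host_month: float  # placeholder: read this off your own invoice
    ingest_gb_month: float
    price_per_gb: float          # placeholder: read this off your own invoice


@dataclass
class SelfHostedInputs:
    compute_month: float         # nodes running Prometheus, Grafana, Loki, etc.
    storage_gb: float            # retained metrics and logs
    price_per_gb_month: float
    engineer_hours_month: float  # upgrades, capacity work, on-call for the stack
    hourly_rate: float
    migration_one_off: float     # one-time migration effort


def saas_tco(i: SaasInputs, months: int = 36) -> float:
    """Per-host plus per-GB spend over the period (modules not modelled)."""
    monthly = i.hosts * i.price_per_host_month + i.ingest_gb_month * i.price_per_gb
    return monthly * months


def self_hosted_tco(i: SelfHostedInputs, months: int = 36) -> float:
    """Compute + storage + engineering time, plus the one-off migration cost."""
    monthly = (
        i.compute_month
        + i.storage_gb * i.price_per_gb_month
        + i.engineer_hours_month * i.hourly_rate
    )
    return i.migration_one_off + monthly * months
```

Including engineering hours as an explicit input is deliberate: it is the cost most often left out of self-hosting estimates, and it is where the "not zero" part of the comparison lives.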

A safe migration flow

Inventory what Datadog is doing for you: dashboards, monitors, integrations, retention, and paging/ticketing hooks. Export dashboards and monitors via the Datadog API.
Stand up the stack. A common path is the kube-prometheus-stack Helm chart (Prometheus + Alertmanager + Grafana) for Kubernetes, plus Loki/Tempo as needed. Deploy node_exporter and instrument apps with exporters or OpenTelemetry.
Recreate the essentials first. Translate your most important monitors into PromQL alerting rules and rebuild the top dashboards in Grafana (or import community equivalents). Don’t try to recreate everything on day one — prioritize what pages humans.
Dual-run. Keep Datadog and the new stack running side by side; compare coverage, alert fidelity, and false-positive rates. This is where you find the gaps.
Cut over paging. Route Alertmanager notifications to PagerDuty, Opsgenie, or email once you trust the alerts, then decommission the Datadog agents.
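One way to make the dual-run step measurable is to log every page from both stacks against a shared incident key and compare them at the end of the period. The sketch below assumes you maintain such a log yourself; the `Page` type, the incident keys, and the source names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    incident: str  # hypothetical shared incident key, e.g. "api-5xx-week3"
    source: str    # "datadog" or "prometheus"


def compare_dual_run(pages, real_incidents):
    """Summarize which real incidents each stack paged on during the dual-run."""
    by_source = {"datadog": set(), "prometheus": set()}
    for page in pages:
        by_source[page.source].add(page.incident)

    real = set(real_incidents)
    dd, prom = by_source["datadog"], by_source["prometheus"]
    return {
        # Real incidents Datadog caught but the new stack did not: coverage gaps.
        "missed_by_prometheus": sorted((dd & real) - prom),
        # Real incidents only the new stack caught.
        "prometheus_only": sorted((prom & real) - dd),
        # Pages that did not correspond to a real incident.
        "prometheus_false_positives": sorted(prom - real),
        "datadog_false_positives": sorted(dd - real),
    }
```

An empty `missed_by_prometheus` list over a representative period is a reasonable precondition for the paging cutover; a long `prometheus_false_positives` list usually points at thresholds or `for:` durations that need tuning.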

PromQL is a real shift

Datadog’s query language and Prometheus’s PromQL are different models. Rate calculations, histogram_quantile, label matching, and recording rules all need learning. Budget time for your on-call engineers to get fluent, because alert quality depends on it. Recording rules precompute expensive queries, while sensible scrape intervals and disciplined label use keep sample volume and cardinality (and therefore cost) under control.
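Part of the shift is that Prometheus counters are cumulative and a per-second value is derived at query time with rate(). The sketch below illustrates that idea only. It is a simplification, not Prometheus's implementation: the real rate() also extrapolates to the edges of the query window, which this hypothetical `simple_rate` helper does not.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a list of (timestamp, value) samples.

    A value lower than its predecessor is treated as a counter reset to zero,
    so the post-reset value counts as new increase. No window extrapolation.
    """
    if len(samples) < 2:
        return None

    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            increase += cur  # counter restarted from zero

    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    # Made-up scrape data: timestamps in seconds, cumulative request count.
    scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
    print(simple_rate(scrapes))
```

The reset handling is why alerting on a raw counter value rarely makes sense in PromQL: a process restart drops the counter to zero, while the derived rate stays meaningful.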

Validation before you trust it

Before switching paging off Datadog: fire test alerts end-to-end (trigger → Alertmanager → pager → acknowledgement), do a dashboard parity review against the metrics that matter, and run a retention/scale load test so you’re not surprised when storage fills. Treat “we get paged correctly for the incidents we care about” as the acceptance bar.
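A synthetic alert can be pushed straight into Alertmanager through its v2 HTTP API (`POST /api/v2/alerts`) to exercise the routing and paging half of that chain. The sketch below is one way to do it under assumptions: the label names `severity` and `synthetic`, the function names, and the URL you pass in are hypothetical, and your routing tree decides which labels actually reach the pager. It does not test the Prometheus rule itself, only what happens after an alert fires.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name, severity="page", minutes=5, now=None):
    """Build the JSON body for one synthetic alert that expires on its own."""
    now = now or datetime.now(timezone.utc)
    return [{
        # Label names here are assumptions; match them to your routing config.
        "labels": {"alertname": name, "severity": severity, "synthetic": "true"},
        "annotations": {"summary": "Synthetic end-to-end paging test"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_test_alert(alertmanager_url, alerts):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_test_alert` with your Alertmanager base URL and the output of `build_test_alert("SyntheticPagingTest")` should produce a page if routing is correct; the acknowledgement step still has to be confirmed by a human on the receiving end.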

Bottom line

Datadog → Prometheus/Grafana is primarily a cost and ownership decision. The metrics and dashboards migrate well; alerting needs PromQL fluency; correlated logs/traces and polished UX take the most effort to match. Run both stacks in parallel, prove alert fidelity, then cut over paging last. Model your per-host savings in the calculator above — and validate the numbers against a real quote, since self-hosting trades license cost for engineering time.
