
DNS (the Domain Name System) maps names to IP addresses, and when it fails, it can cause major outages. Kubernetes uses CoreDNS by default to provide service discovery and name resolution for the microservices in your cluster. When you run Kubernetes in production, a CoreDNS issue can potentially take your entire cluster down. You can use Prometheus and New Relic to monitor, troubleshoot, and fix issues related to CoreDNS.


CoreDNS exposes Prometheus metrics on port 9153 when the prometheus (metrics) plugin is enabled in its Corefile. If you're not familiar with Prometheus yet, check out How to monitor with Prometheus. Prometheus is a powerful open source monitoring tool, but it can be difficult to scale and to analyze your data. You can overcome those challenges by sending your Prometheus metrics to New Relic, and our CoreDNS quickstart makes it simple to monitor:

  • Overall system health
  • CoreDNS latency
  • CoreDNS error rates
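For reference, the metrics endpoint is enabled by the prometheus plugin in the CoreDNS Corefile. Here's a minimal sketch of a typical server block (the cluster domain, forwarder, and cache TTL shown are common defaults, not values taken from any particular cluster):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153        # expose Prometheus metrics on port 9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}
```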

Monitoring CoreDNS communication in Kubernetes clusters

Every time a pod or service is created in a Kubernetes cluster, CoreDNS adds a record to its database. When Kubernetes services communicate with each other, they first make a DNS query to CoreDNS. CoreDNS resolves the request and returns a virtual IP. If CoreDNS malfunctions or has degraded performance, your microservices won’t be able to communicate, leading to issues, including outages.

With the prometheus (metrics) plugin enabled, CoreDNS provides the following Prometheus metrics on port 9153 to help debug potential issues:

  • coredns_panics_total: total number of panics
  • coredns_dns_requests_total: total query count
  • coredns_dns_request_duration_seconds: duration to process each query
  • coredns_dns_request_size_bytes: size of the request in bytes
  • coredns_dns_response_size_bytes: response size in bytes
  • coredns_dns_responses_total: total responses per zone, rcode, and plugin
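If you want to eyeball these metrics before wiring up any tooling, you can scrape the endpoint directly and filter the Prometheus exposition-format output. Here's a minimal Python sketch; the sample text is a made-up capture, and in a real cluster you would fetch http://<coredns-service>:9153/metrics instead:

```python
def coredns_metrics(text, prefix="coredns_"):
    """Parse Prometheus exposition text into a {series: value} dict."""
    samples = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_labels, _, value = line.rpartition(" ")
        if name_labels.startswith(prefix):
            samples[name_labels] = float(value)
    return samples

# Hypothetical captured scrape (normally fetched from port 9153):
sample = """\
# HELP coredns_dns_requests_total Counter of DNS requests made per zone, protocol and family.
# TYPE coredns_dns_requests_total counter
coredns_dns_requests_total{proto="udp",type="A",zone="."} 1532
coredns_panics_total 0
"""
print(coredns_metrics(sample))
```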

Monitor the impact of CoreDNS on system health

Because CoreDNS is a key part of communication between pods, you can use its metrics to see what's happening inside your cluster. A simple request rate metric like coredns.request_count shows how often CoreDNS is called, and you can use other metrics to analyze resolved requests.

The next visualization shows the total number of CoreDNS requests sorted by type. You can see that the majority of requests are A and AAAA requests.

Here's the NRQL query for the visualization:

FROM Metric SELECT rate(sum(coredns_dns_requests_total), 1 second) WHERE instrumentation.provider = 'prometheus' FACET type TIMESERIES

The next visualization shows cache hits and misses. CoreDNS caches all records except zone transfers and metadata records for up to one hour. A cache miss occurs when a requested record isn't found in the cache, forcing CoreDNS to resolve the query again. By visualizing cache misses, you can adjust the size and configuration of the CoreDNS cache to reduce misses and increase hits.

Here is the NRQL query for the visualization:

SELECT rate(sum(coredns_cache_hits_total), 1 second), rate(sum(coredns_cache_misses_total), 1 second) FROM Metric TIMESERIES
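A useful signal derived from these two counters is the cache hit ratio. Here's a trivial Python sketch of the arithmetic; the counter values are made up for illustration:

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Fraction of DNS queries answered from the CoreDNS cache."""
    total = hits + misses
    return hits / total if total else 0.0

# Hypothetical values read from coredns_cache_hits_total and
# coredns_cache_misses_total:
print(cache_hit_ratio(9200, 800))  # 0.92
```

A ratio that drops over time suggests the cache TTL or size is too small for your query mix.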

Monitor CoreDNS latency

When CoreDNS query resolutions have increased latency, end users can experience degraded performance, even if your microservices are otherwise responding quickly. When DNS latency is the bottleneck, the coredns_dns_request_duration_seconds metric shown in the next visualization can show you tail latency (the 99th percentile) against the median via the histogrampercentile function.

Here's the NRQL query for the visualization:

SELECT histogrampercentile(coredns_dns_request_duration_seconds_bucket, 99, 50) FROM Metric SINCE 60 MINUTES AGO FACET server, zone LIMIT 100 TIMESERIES 5 minutes SLIDE BY 10 seconds
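Under the hood, percentile estimates like this come from the cumulative histogram buckets that Prometheus exposes. Here's a rough Python sketch of the standard interpolation; the bucket boundaries and counts are made-up sample data, not real CoreDNS output:

```python
def histogram_percentile(buckets, pct):
    """Estimate a percentile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, as in *_bucket Prometheus series.
    """
    total = buckets[-1][1]
    target = total * pct / 100.0
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # Linear interpolation inside the matching bucket.
            span = count - prev_count
            frac = (target - prev_count) / span if span else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical coredns_dns_request_duration_seconds_bucket data:
buckets = [(0.001, 700), (0.005, 950), (0.025, 990), (0.1, 1000)]
print(histogram_percentile(buckets, 50))  # median latency estimate
print(histogram_percentile(buckets, 99))  # tail latency estimate
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the latencies you care about.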

Monitor CoreDNS errors

CoreDNS has DNS-specific error codes called rcodes that give you context on incidents. The NXDomain and FormErr rcodes indicate issues with the incoming requests to CoreDNS, while a ServFail rcode indicates an issue with the CoreDNS server itself. The coredns_dns_responses_total metric includes a dimension for the rcode, which you can facet to create a visualization showing the DNS responses returning each rcode value. With this visualization, you can see how many errors of each type occurred during a given time interval.

Here's the NRQL query for the visualization:

SELECT (count(coredns_dns_responses_total) * cardinality(coredns_dns_responses_total)) FROM Metric SINCE 60 MINUTES AGO UNTIL NOW FACET rcode LIMIT 100 TIMESERIES
