DNS (the Domain Name System) maps names to IP addresses, and when it fails, it can cause major outages. Kubernetes v1.13+ uses CoreDNS by default to provide service discovery and name resolution for the microservices in your cluster. When you run Kubernetes in production, CoreDNS issues can bring down your entire cluster. You can use Prometheus and New Relic to monitor, troubleshoot, and fix issues related to CoreDNS.

CoreDNS exposes Prometheus metrics on port 9153 when the metrics plugin is enabled. If you're not familiar with Prometheus yet, check out How to monitor with Prometheus. Prometheus is a powerful open source monitoring tool, but it can be difficult to scale, and its data can be hard to analyze. You can overcome those challenges by sending your Prometheus metrics to New Relic, and our CoreDNS quickstart makes it simple to monitor:

  • Your overall system health
  • CoreDNS latency 
  • CoreDNS error rates

Monitoring CoreDNS communication in Kubernetes clusters

Every time a pod or service is created in a Kubernetes cluster, CoreDNS adds a record to its database. When Kubernetes services communicate with each other, they first make a DNS query to CoreDNS. CoreDNS resolves the request and returns a virtual IP. If CoreDNS malfunctions or its performance degrades, your microservices won’t be able to communicate, which can lead to failures and outages.
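To make the lookup step concrete, here is a minimal Python sketch that resolves a name and times how long the resolution takes. The service name in the usage note is hypothetical, and the injectable resolver argument is only there so the function can be exercised without a live cluster.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname and return (sorted unique addresses, elapsed seconds)."""
    start = time.perf_counter()
    results = resolver(hostname, None)
    elapsed = time.perf_counter() - start
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address.
    addresses = sorted({entry[4][0] for entry in results})
    return addresses, elapsed
```

Inside a pod, you could call something like time_lookup("my-service.default.svc.cluster.local") (a made-up service name) to see which virtual IP comes back and how long the query took.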

With the metrics plugin, CoreDNS provides the following Prometheus metrics on port 9153 to help debug potential issues:

  • coredns_panics_total: total number of panics
  • coredns_dns_requests_total: total query count
  • coredns_dns_request_duration_seconds: duration to process each query
  • coredns_dns_request_size_bytes: size of the request in bytes
  • coredns_dns_response_size_bytes: response size in bytes
  • coredns_dns_responses_total: response count per zone, rcode, and plugin
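
If you want to inspect these metrics before they reach a monitoring backend, the following Python sketch parses Prometheus text exposition output and sums one counter by a label. The sample text and its values are made up for illustration, and the parser is deliberately simplified: it assumes label values contain no commas or escaped quotes.

```python
from collections import defaultdict

# Illustrative sample only: these values are invented, not real CoreDNS output.
SAMPLE = """\
# TYPE coredns_dns_requests_total counter
coredns_dns_requests_total{family="1",proto="udp",server="dns://:53",type="A",zone="."} 120
coredns_dns_requests_total{family="1",proto="udp",server="dns://:53",type="AAAA",zone="."} 80
coredns_dns_requests_total{family="1",proto="tcp",server="dns://:53",type="A",zone="."} 5
coredns_panics_total 0
"""


def parse_labels(raw):
    """Parse 'k="v",k2="v2"' into a dict (simplified: no commas inside values)."""
    labels = {}
    for pair in raw.split(","):
        if pair:
            key, _, value = pair.partition("=")
            labels[key.strip()] = value.strip().strip('"')
    return labels


def sum_by_label(text, metric, label):
    """Sum every sample of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name, _, rest = name_and_labels.partition("{")
        if name != metric:
            continue
        labels = parse_labels(rest.rstrip("}"))
        totals[labels.get(label, "")] += float(value)
    return dict(totals)


print(sum_by_label(SAMPLE, "coredns_dns_requests_total", "type"))
```

To run this against a live instance, you could replace SAMPLE with the body returned by port 9153's /metrics path, for example through urllib.request.urlopen.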

Monitor the impact of CoreDNS on system health

Because CoreDNS is a key part of communication between pods, you can use its metrics to see what’s happening inside your cluster. A simple request rate based on coredns_dns_requests_total shows how often CoreDNS is called, and you can use other metrics to analyze resolved requests.

The next visualization shows the total number of CoreDNS requests sorted by type. You can see that the majority of requests are A and AAAA requests.

Here's the NRQL query for the visualization:

FROM Metric SELECT rate(sum(coredns_dns_requests_total), 1 second) facet type WHERE instrumentation.provider = 'prometheus' TIMESERIES
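
The idea behind a per-second rate on a counter can be sketched in a few lines of Python. This is a general illustration of turning counter samples into a rate, not a description of how NRQL's rate() is implemented; the sample tuples in the usage note are hypothetical.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the counter reset (for example, a pod restart),
            # so everything counted since the reset is new.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

For example, per_second_rate([(0, 100), (30, 160), (60, 220)]) covers an increase of 120 requests over 60 seconds, or 2 requests per second.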

The next visualization shows cache hits and misses. CoreDNS caches all records except zone transfers and metadata records for up to one hour. A cache miss occurs when the requested record isn't found in the cache. By visualizing cache misses, you can adjust the size and configuration of the CoreDNS cache to reduce misses and increase hits.

Here is the NRQL query for the visualization:

SELECT rate(sum(coredns_cache_hits_total), 1 SECONDS) FROM Metric SINCE 60 MINUTES AGO UNTIL NOW FACET type LIMIT 100 TIMESERIES 300000 SLIDE BY 10000
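
When you tune the cache, a single hit ratio is often easier to reason about than two separate counters. Here is a minimal Python sketch, assuming you already have hit and miss counts for the same time window; the numbers in the usage note are invented.

```python
def cache_hit_ratio(hits, misses):
    """Fraction of lookups served from cache, or None if there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total
```

For example, cache_hit_ratio(900, 100) gives 0.9, meaning nine out of ten lookups in that window were answered from the cache.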

Monitor CoreDNS latency

When CoreDNS query resolutions have increased latency, end users can experience degraded performance, even if your microservices are otherwise responding quickly. To find out whether DNS latency is the bottleneck, use the coredns_dns_request_duration_seconds metric. The next visualization uses the histogrampercentile function to compare the 99th percentile latency against the median (50th percentile) latency.

Here's the NRQL query for the visualization:

SELECT histogrampercentile(coredns_dns_request_duration_seconds_bucket, (100 * 0.99), (100 * 0.5)) FROM Metric SINCE 60 MINUTES AGO UNTIL NOW FACET tuple(server, zone) LIMIT 100 TIMESERIES 300000 SLIDE BY 10000
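
To see how a percentile can be estimated from cumulative histogram buckets, here is a Python sketch of the linear interpolation that Prometheus's histogram_quantile function uses. It is an illustration under that assumption; New Relic's histogrampercentile may compute its estimate differently, and the bucket bounds and counts in the usage note are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs,
    including a final (math.inf, total_count) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; fall back to the last finite bound.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets [(0.001, 10), (0.002, 50), (0.004, 90), (0.008, 99), (math.inf, 100)], the 0.75 quantile falls in the 0.002 to 0.004 second bucket and comes out at roughly 0.00325 seconds.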

Monitor CoreDNS errors

CoreDNS has DNS-specific error codes called rcodes that give you context on incidents. The visualization shows errors from NXDomain, FormErr, and ServFail. NXDomain and FormErr rcodes happen when there are issues with incoming requests to CoreDNS. A ServFail rcode happens when there is an issue with the CoreDNS server itself. The coredns_dns_responses_total metric includes an rcode dimension, which you can facet on to create a visualization of the DNS responses returning each rcode value. With this visualization, you can see how many errors of each type occurred during a given time interval.

Here's the NRQL query for the visualization:

SELECT (count(coredns_dns_responses_total) * cardinality(coredns_dns_responses_total)) FROM Metric SINCE 60 MINUTES AGO UNTIL NOW FACET rcode LIMIT 100 TIMESERIES
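
Once you have response counts per rcode, you can turn them into error shares to judge how serious a spike is. The Python sketch below assumes a plain dictionary of counts keyed by rcode; treating only NXDomain, FormErr, and ServFail as errors is an assumption that follows the three rcodes discussed above, and the counts in the usage note are invented.

```python
# Assumption: only the three rcodes discussed in this post count as errors.
ERROR_RCODES = {"NXDOMAIN", "FORMERR", "SERVFAIL"}


def error_share(responses_by_rcode):
    """Return each error rcode's share of all responses (0.0 to 1.0)."""
    total = sum(responses_by_rcode.values())
    if total == 0:
        return {}
    return {
        rcode: count / total
        for rcode, count in responses_by_rcode.items()
        if rcode.upper() in ERROR_RCODES
    }
```

For example, error_share({"NOERROR": 900, "NXDOMAIN": 80, "SERVFAIL": 20}) reports that 8% of responses were NXDomain and 2% were ServFail.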

Dandelion photo from Aaron Burden on Unsplash.