DNS (the Domain Name System) maps names to IP addresses; when it fails, it can cause major outages. Kubernetes uses CoreDNS as its default DNS server (it replaced kube-dns as the default in v1.13) to provide service discovery and name resolution for the microservices in your cluster. When you run Kubernetes in production, CoreDNS issues can take your entire cluster down. Mitigate risks and ensure stability by using Prometheus and New Relic for proactive CoreDNS monitoring, troubleshooting, and issue resolution.
Key takeaways:
- CoreDNS, the default DNS server in Kubernetes since v1.13, demands vigilant monitoring to prevent major outages.
- Prometheus metrics, integrated with New Relic, offer a comprehensive solution for CoreDNS performance insights.
- Visualizing CoreDNS metrics helps assess request rates, cache efficiency, latency, and DNS errors.
- New Relic provides seamless integration, simplifying Prometheus data management with free account offerings.
CoreDNS exposes Prometheus metrics on port 9153 when the metrics (prometheus) plugin is enabled. If you're not familiar with Prometheus yet, check out How to monitor with Prometheus. Prometheus is a powerful open-source monitoring tool, but it can be difficult to scale and to analyze the data it collects. You can overcome those challenges by sending your Prometheus metrics to New Relic, and our CoreDNS quickstart makes it simple to monitor:
- Your overall system health
- CoreDNS latency
- CoreDNS error rates
CoreDNS: What it is and how it works
CoreDNS is an open-source, cloud-native Domain Name System (DNS) server that translates domain names into IP addresses. Essentially, it acts as a directory service for the internet by helping users locate and access websites through their assigned domain names.
It works by receiving DNS requests from clients and forwarding them to other DNS servers in order to resolve the requested domain name. It uses a customizable plugin-based architecture, allowing easy customization and integration with various systems and services.
CoreDNS logs contain valuable information such as the source and destination IP addresses, requested domain names, time stamps, and response codes. By monitoring these logs, users can gain insights into their system's health and performance.
Main CoreDNS metrics
The main CoreDNS metrics to monitor include cache metrics, error metrics, Go metrics, performance metrics, scaling and resource metrics, and throughput metrics.
Cache metrics
Cache metrics reveal the number of cache hits and misses, giving an indication of how often CoreDNS is able to find a requested domain name in its local cache. These metrics may include:
- Cache hits, measuring the number of times a DNS query was resolved by fetching the response from the cache.
- Cache misses, which occur when a DNS query is not found in the cache, requiring CoreDNS to forward the query to upstream DNS servers.
- Cache evictions, which happen when the cache reaches its capacity. In this case, older or less frequently used entries may be evicted to make room for new entries.
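To make the relationship between these counters concrete, here's a minimal Python sketch (the helper is ours, not part of CoreDNS) that turns the two cache counters into a hit ratio you could alert on:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of DNS queries answered from the CoreDNS cache.

    `hits` and `misses` are cumulative counter values, e.g. scraped from
    coredns_cache_hits_total and coredns_cache_misses_total.
    """
    total = hits + misses
    if total == 0:
        return 0.0  # no traffic yet; avoid dividing by zero
    return hits / total

# Example: 900 hits and 100 misses -> 90% of queries served from cache.
print(cache_hit_ratio(900, 100))  # 0.9
```

A persistently low ratio is the signal to revisit the cache plugin's size and TTL settings.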
Error metrics
Errors can occur in any system; monitoring them is crucial to identify and fix issues quickly. Common error metrics in CoreDNS include:
- "NXDOMAIN" (Non-Existent Domain), which indicates that the requested domain does not exist. High NXDOMAIN error rates may suggest misconfigured DNS entries or issues with DNS resolution.
- “SERVFAIL,” which occurs when a DNS server fails to provide a valid response due to server-side issues.
- “REFUSED,” which means that the DNS server refuses to answer a query. This can happen due to DNS server access restrictions or misconfigurations. Counting refused errors helps detect unauthorized access attempts.
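These names correspond to the standard DNS response codes from RFC 1035, which CoreDNS reports as the rcode label on its response metrics. As a small illustrative sketch (the helper function is ours, not a CoreDNS API), you can separate server-side failures from request-side problems like this:

```python
# Standard DNS response codes (RFC 1035). CoreDNS exposes these as the
# `rcode` dimension on coredns_dns_responses_total.
RCODES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}

def is_server_side_error(rcode_name: str) -> bool:
    """SERVFAIL points at the DNS server itself; NXDOMAIN, FORMERR, and
    REFUSED usually indicate a problem with the request or with DNS
    configuration rather than with CoreDNS."""
    return rcode_name == "SERVFAIL"

print(RCODES[3], is_server_side_error(RCODES[3]))  # NXDOMAIN False
```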
We’ll talk more about CoreDNS errors later in this guide.
Go metrics
In CoreDNS within a Kubernetes cluster, "Go metrics" refer to metrics related to the Go programming language runtime environment. Go metrics provide insights into the internal workings of CoreDNS and may include:
- Go goroutine count, which tracks the number of goroutines (lightweight concurrently executing functions) in the CoreDNS process. An unusually high number of goroutines can indicate concurrency issues or resource contention.
- Go garbage collection statistics provide information like the frequency of garbage collection cycles, the duration of each cycle, and the amount of memory reclaimed.
- Go CPU usage, as it sounds, tells you about CPU utilization, including percentages and profiling information.
Performance metrics
Performance metrics measure how quickly CoreDNS is able to respond to DNS requests. This includes metrics such as:
- Query response time, or the time it takes for CoreDNS to respond to DNS queries. It can be broken down into average response time, 95th percentile response time, and maximum response time.
- Request rate, which tracks the rate at which DNS requests are received by CoreDNS. It helps identify patterns in request traffic and can be used for capacity planning.
- UDP vs. TCP queries: Tracking the ratio of UDP (User Datagram Protocol) to TCP (Transmission Control Protocol) DNS queries helps monitor the choice of transport protocol and identify situations where queries are falling back to TCP due to response size or other factors.
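Request rate is typically derived from a cumulative counter such as coredns_dns_requests_total. A minimal sketch of that calculation, assuming two scrapes of the counter and accounting for a counter reset after a pod restart:

```python
def counter_rate(prev: float, curr: float, interval_s: float) -> float:
    """Per-second rate between two samples of a cumulative Prometheus
    counter such as coredns_dns_requests_total.

    Counters only ever increase; if the current value is lower than the
    previous one, the process restarted and the counter reset to zero,
    so we treat `curr` as the growth since the reset.
    """
    delta = curr - prev if curr >= prev else curr
    return delta / interval_s

# Two scrapes 15 s apart: 12000 -> 12600 requests = 40 queries/second.
print(counter_rate(12000, 12600, 15))  # 40.0
```

This is the same idea behind PromQL's rate() and the rate(sum(...), 1 second) NRQL expressions used later in this guide.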
Scaling and resource metrics
As a user's system scales up to handle more traffic, it's important to monitor metrics related to how CoreDNS is handling the increased load:
- Pod scaling, which helps you determine whether you need to scale the number of CoreDNS pods running in your Kubernetes cluster based on the incoming DNS query load. (This may include query rate per pod and CPU/memory usage per pod.)
- CPU and memory utilization, which helps you ensure that CoreDNS has enough resources to handle all incoming requests, mitigating risk of performance issues or crashes.
- Pod restart count, or the number of times CoreDNS pods have been restarted, which can indicate issues with the application's stability or configuration.
Throughput metrics
Throughput metrics are used to measure the amount of data being processed by CoreDNS in a given time period. They could include:
- Query throughput, which measures the rate at which CoreDNS handles incoming DNS queries per second (QPS). This metric helps administrators understand the overall query load on the DNS server.
- Query rate, tracking the rate at which CoreDNS receives DNS queries over a specific time period.
- Query load distribution across different CoreDNS instances or pods, helping to identify any uneven distribution of query traffic.
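As a rough illustration of spotting uneven query load, the sketch below (with hypothetical pod names and per-pod QPS values) compares the busiest CoreDNS pod against the mean:

```python
def load_imbalance(qps_per_pod: dict[str, float]) -> float:
    """Ratio of the busiest pod's QPS to the mean QPS across pods.

    A value near 1.0 means DNS queries are spread evenly; a much larger
    value means one CoreDNS replica is absorbing most of the traffic.
    """
    mean = sum(qps_per_pod.values()) / len(qps_per_pod)
    return max(qps_per_pod.values()) / mean

# Hypothetical per-pod query rates (queries per second):
pods = {"coredns-abc": 120.0, "coredns-def": 40.0, "coredns-ghi": 80.0}
print(round(load_imbalance(pods), 2))  # 1.5
```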
Monitoring CoreDNS communication in Kubernetes clusters
Every time a pod or service is created in a Kubernetes cluster, CoreDNS adds a record to its database. When Kubernetes services communicate with each other, they first make a DNS query to CoreDNS. CoreDNS resolves the request and returns a virtual IP. If CoreDNS malfunctions or has degraded performance, your microservices won’t be able to communicate, leading to issues, including outages.
With the metrics plugin, CoreDNS provides the following Prometheus metrics on port 9153 to help debug potential issues:
- coredns_panics_total: total number of panics
- coredns_dns_requests_total: total query count
- coredns_dns_request_duration_seconds: duration to process each query
- coredns_dns_request_size_bytes: size of the request in bytes
- coredns_dns_response_size_bytes: response size in bytes
- coredns_dns_responses_total: responses per zone, rcode, and plugin
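If you want to eyeball these counters before wiring up Prometheus, you can fetch the metrics endpoint on port 9153 and parse the text exposition format yourself. The parser below is a deliberately minimal sketch run against an abridged sample response (it ignores timestamps and label escaping); in practice, a real Prometheus client library is the better choice:

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Minimal parser for the Prometheus text exposition format, good
    enough to pull CoreDNS counters out of a GET /metrics response."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE metadata
        # Each sample line is "<name>{<labels>} <value>".
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

# Abridged sample of what CoreDNS serves on :9153/metrics:
sample = """
# HELP coredns_panics_total A metric that counts the number of panics.
# TYPE coredns_panics_total counter
coredns_panics_total 0
coredns_dns_requests_total{proto="udp",type="A",zone="."} 1500
"""
metrics = parse_prometheus_text(sample)
print(metrics["coredns_panics_total"])  # 0.0
```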
System health impact of CoreDNS monitoring
Because CoreDNS is a key part of communication between pods, you can use its metrics to see what's happening inside your cluster. A simple request rate metric like coredns_dns_requests_total shows how often CoreDNS is called, and you can use other metrics to analyze resolved requests.
The next visualization shows the total number of CoreDNS requests sorted by type. You can see that the majority of requests are A and AAAA requests.
Here's the NRQL query for the visualization:
FROM Metric SELECT rate(sum(coredns_dns_requests_total), 1 second) facet type WHERE instrumentation.provider = 'prometheus' TIMESERIES
The next visualization shows cache hits and misses. CoreDNS caches all records except zone transfers and metadata records for up to one hour. A cache miss is when requested data isn't found in the cache memory. By visualizing cache misses, we can adjust the size and configuration of the CoreDNS cache to reduce cache misses and increase cache hits.
Here is the NRQL query for the visualization:
SELECT rate(sum(coredns_cache_hits_total), 1 SECONDS) FROM Metric SINCE 60 MINUTES AGO UNTIL NOW FACET type LIMIT 100 TIMESERIES 300000 SLIDE BY 10000
Monitor CoreDNS latency
When CoreDNS query resolutions have increased latency, end users can experience degraded performance, even if your microservices are otherwise responding quickly. When DNS latency is the bottleneck, the coredns_dns_request_duration_seconds metric shown in the next visualization can show you the 99th-percentile DNS latency against the median via the histogrampercentile operator.
Here's the NRQL query for the visualization:
SELECT histogrampercentile(coredns_dns_request_duration_seconds_bucket, (100 * 0.99), (100 * 0.5)) FROM Metric SINCE 60 MINUTES AGO UNTIL NOW FACET tuple(server, zone) LIMIT 100 TIMESERIES 300000 SLIDE BY 10000
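Under the hood, percentile estimates like this come from the cumulative coredns_dns_request_duration_seconds_bucket histogram. The sketch below, using hypothetical bucket data, shows the linear-interpolation idea that PromQL's histogram_quantile() uses for the same kind of estimate:

```python
def bucket_percentile(buckets, q):
    """Estimate the q-th quantile (0 < q < 1) from cumulative Prometheus
    histogram buckets, via linear interpolation within the bucket that
    contains the target rank. `buckets` is a sorted list of
    (upper_bound_seconds, cumulative_count) pairs ending with +Inf.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into +Inf
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

# Hypothetical coredns_dns_request_duration_seconds_bucket data:
buckets = [(0.001, 600), (0.004, 900), (0.016, 990), (float("inf"), 1000)]
print(bucket_percentile(buckets, 0.5))   # median, roughly 0.83 ms
print(bucket_percentile(buckets, 0.99))  # tail latency, ~16 ms
```

Note that bucketed percentiles are estimates: the accuracy depends on how finely the histogram's bucket boundaries match your actual latency distribution.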
Monitoring CoreDNS errors
As we've explained, CoreDNS has DNS-specific error codes called rcodes that give you context on incidents. The next visualization shows errors from NXDomain, FormErr, and ServFail. NXDomain and FormErr rcodes happen when there are issues with incoming requests to CoreDNS, while a ServFail rcode happens when there is an issue with the CoreDNS server itself. This metric includes a dimension for each rcode, which you can facet to create a visualization showing the DNS responses returning each rcode value. With this visualization, you can see how many errors of each type occurred during a given time interval.
Here's the NRQL query for the visualization:
SELECT (count(coredns_dns_responses_total) * cardinality(coredns_dns_responses_total)) FROM Metric SINCE 60 MINUTES AGO UNTIL NOW FACET rcode LIMIT 100 TIMESERIES
Why is monitoring CoreDNS crucial for Kubernetes clusters?
Monitoring is an essential aspect of maintaining the health and performance of any system, including Kubernetes clusters. But CoreDNS needs close attention.
This DNS server is responsible for resolving domain names into IP addresses within the cluster. Any issues with CoreDNS can lead to significant disruptions in application functionality, resulting in unhappy users and potential revenue loss.
To effectively monitor CoreDNS, you need a reliable tool that provides real-time insights into its performance. Fortunately, New Relic offers a powerful solution with its CoreDNS quickstart.
Demystify CoreDNS performance with Prometheus and New Relic
Install New Relic’s CoreDNS quickstart and start visualizing CoreDNS’s Prometheus data. Send us your Prometheus data in two ways:
- Are you tired of storing, maintaining, and scaling your Prometheus data? Try New Relic’s Prometheus OpenMetrics Integration, which automatically scrapes, stores, and scales your data.
- Already have a Prometheus server and want to send data to New Relic? Try New Relic’s Prometheus Remote Write integration.
If you don't already have a free New Relic account, sign up now. Let New Relic manage your Prometheus data while you focus on innovation. Your free account includes 100 GB/month of free data ingest, one full-platform user, and unlimited basic users.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.