Istio is an open-source service mesh that uses Kubernetes to help you connect, monitor, and secure microservices. With Istio, you can move a microservice to a different server or cluster without having to rewrite your application. However, this flexibility comes at a cost: Istio adds Envoy proxies alongside your services, which makes your network more challenging to manage and monitor. In this post, you’ll learn how to monitor the performance of your Istio service mesh with Prometheus and New Relic, making it easier to find and fix issues when something goes wrong.
Istio uses Prometheus to report data from your service mesh. If you're not familiar with Prometheus yet, check out How to monitor with Prometheus. Prometheus is a powerful open-source monitoring tool, but it can be difficult to scale, and its data can be difficult to analyze. You can overcome those challenges by sending your Prometheus metrics to New Relic. Our Istio quickstart makes it simple to monitor the performance of Istio using Prometheus, including:
- Communication between your Istio proxies and services.
- Istio Envoy upstream requests, including identifying services with high and low latencies and 5xx response codes.
- Istio Envoy downstream requests, including 5xx and 404 requests as well as the average performance of your microservices.
- Overall health of your service mesh, including total requests and the total amount of raw data being transmitted.
Monitoring Istio's Prometheus endpoints
Istio uses Envoy, a high-performance service proxy, to handle inbound and outbound traffic through a service mesh. Istio’s Envoy proxies automatically collect detailed metrics and report them via a Prometheus endpoint. Because every service proxy reports these metrics, they give you a high-level view of how your application behaves across the mesh.
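To make the endpoint's output more concrete, here is a minimal Python sketch that parses a few lines in the Prometheus text exposition format. The sample text, its label values and counts, and the helper name `parse_samples` are assumptions made up for illustration; they are not output captured from a real mesh. The parser is deliberately simplified and is not a replacement for a Prometheus client library.

```python
import re

# Illustrative sample of what a proxy's Prometheus endpoint might return.
# The workloads and counts are invented for this example.
SAMPLE = '''\
# TYPE istio_requests_total counter
istio_requests_total{reporter="destination",source_workload="frontend",destination_workload="api-gateway",response_code="200"} 1520
istio_requests_total{reporter="destination",source_workload="frontend",destination_workload="api-gateway",response_code="503"} 12
'''

# metric_name{label="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines such as "# TYPE" and "# HELP".
        if not line or line.startswith('#'):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


samples = parse_samples(SAMPLE)
for name, labels, value in samples:
    print(name, labels['response_code'], value)
```

Each sample is a metric name, a set of labels, and a value. Those labels are what show up as attributes when you query the data later.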
To query data from Istio’s Prometheus endpoints, you need to stand up a Prometheus server. However, as you scale your application, it becomes harder to maintain Prometheus infrastructure. With New Relic’s Prometheus OpenMetrics integration, you no longer need to maintain your own Prometheus servers. When you install the integration via New Relic’s Kubernetes agent, it automatically detects Istio’s native Prometheus endpoints and imports the data into New Relic.
If you already have a Prometheus server running, you can use our remote-write integration to forward your data to New Relic.
Monitoring service communication between proxies
Istio provides the following Prometheus metrics for monitoring communication between proxies.
- The counter metric istio_requests_total measures the total number of requests handled by an Istio proxy.
- The distribution metric istio_request_duration_milliseconds measures the time it takes for the Istio proxy to process HTTP requests.
- The distribution metric istio_response_bytes measures HTTP response body sizes. Its counterpart, istio_request_bytes, measures HTTP request body sizes.
These metrics provide information on behaviors such as the overall volume of traffic being handled by the service mesh, the error rates within the traffic, and the response times for requests. The overview page of the Istio quickstart summarizes the service mesh’s throughput, latency, traffic volume, and response codes across all requests.
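Because istio_requests_total is a counter, throughput and error rate come from the difference between two readings, not from a single value. The following Python sketch shows that arithmetic under some assumptions: the function names, the two scrape snapshots, and the 60-second interval are hypothetical, and a counter that goes down is treated as a proxy restart.

```python
def counter_increase(earlier, later):
    """Increase of a cumulative counter between two scrapes.

    If the later value is smaller, assume the counter was reset
    (for example, the proxy restarted) and count from zero.
    """
    return later - earlier if later >= earlier else later


def summarize_traffic(earlier, later, interval_seconds):
    """Compute request rate and 5xx ratio from two counter snapshots.

    `earlier` and `later` map a response code (as a string) to the
    cumulative istio_requests_total value seen at each scrape.
    """
    total = 0.0
    errors = 0.0
    for code, value in later.items():
        increase = counter_increase(earlier.get(code, 0.0), value)
        total += increase
        if code.startswith('5'):
            errors += increase
    return {
        'requests_per_second': total / interval_seconds,
        'error_ratio': errors / total if total else 0.0,
    }


# Hypothetical snapshots taken 60 seconds apart.
earlier = {'200': 1000.0, '503': 10.0}
later = {'200': 1540.0, '503': 70.0}
summary = summarize_traffic(earlier, later, interval_seconds=60)
print(summary)
```

In this made-up example, 600 requests arrive in 60 seconds and 60 of them return a 503, which works out to roughly 10 requests per second with a 10% error ratio.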
Using Data explorer in New Relic, you can view the attributes of istio_requests_total and other metrics from Istio’s Prometheus endpoint.
As you’ll see in the next section, attributes of istio_requests_total such as reporter can help you pinpoint issues.
Debugging the service mesh
When you’re trying to pinpoint communication errors in the service mesh, you have to look at the requests and responses of your microservices. As the image below shows, separating the two directions of communication lets you see the entire flow of traffic through the requests and responses of each microservice: the frontend, the API gateway, and the backend inventory service.
In this example, we see a request from an application frontend to an API gateway, which then makes another request to an inventory service. The inventory service responds with an error, which passes through the API gateway back to the frontend. To find the root of the frontend error (such as a 503 error), trace the downstream requests by looking at the source_workload attribute of istio_requests_total for each microservice that returns the error response. You can then trace the upstream requests to find the inputs to each service and perform root cause analysis.
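The tracing idea in this example can be sketched in a few lines of Python. This is only an illustration: the function name `trace_error_path`, the sample records, and their counts are hypothetical, and the attribute names simply mirror the source_workload, destination_workload, and response_code attributes of istio_requests_total. Starting at the service where the error surfaced, the sketch follows 5xx responses from caller to callee until it reaches a service that has no failing calls of its own.

```python
def trace_error_path(samples, start):
    """Follow 5xx responses from `start` through the services it calls.

    `samples` is a list of dicts with source_workload,
    destination_workload, response_code, and count keys.
    Returns the list of workloads along the failing path.
    """
    # For each caller, total the 5xx responses it received per callee.
    failing_calls = {}
    for sample in samples:
        if sample['response_code'].startswith('5') and sample['count'] > 0:
            callees = failing_calls.setdefault(sample['source_workload'], {})
            callee = sample['destination_workload']
            callees[callee] = callees.get(callee, 0) + sample['count']

    path = [start]
    seen = {start}
    current = start
    while current in failing_calls:
        # Follow the callee responsible for the most 5xx responses.
        callees = failing_calls[current]
        next_hop = max(callees, key=callees.get)
        if next_hop in seen:
            break  # Guard against cycles in the call graph.
        path.append(next_hop)
        seen.add(next_hop)
        current = next_hop
    return path


# Hypothetical request counts for the three services in the example.
samples = [
    {'source_workload': 'frontend', 'destination_workload': 'api-gateway',
     'response_code': '200', 'count': 500},
    {'source_workload': 'frontend', 'destination_workload': 'api-gateway',
     'response_code': '503', 'count': 12},
    {'source_workload': 'api-gateway', 'destination_workload': 'inventory',
     'response_code': '200', 'count': 300},
    {'source_workload': 'api-gateway', 'destination_workload': 'inventory',
     'response_code': '503', 'count': 12},
]
path = trace_error_path(samples, 'frontend')
print(' -> '.join(path))
```

The last workload in the returned path is the first place to look, because it returns errors without receiving any 5xx responses from services it calls.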
With this context, let’s look at which metrics from upstream and downstream requests will best help you perform root cause analysis and find areas for optimization.
Monitoring Istio Envoy upstream requests
You can measure all requests inbound to Istio Envoy proxies in the service mesh by filtering istio_requests_total by those with attribute reporter set to destination.
The histogram visualization in the next image of istio_request_duration_milliseconds is useful for troubleshooting issues and optimizing performance linked to particular services.
Here's the NRQL query for the histogram:
FROM Metric SELECT histogram(istio_request_duration_milliseconds_bucket, 1000) WHERE reporter = 'destination' FACET source_canonical_service
Unexpectedly low latencies (very fast responses) can indicate an issue with the source service, which could be returning error messages or failing to fetch the requested data even while it returns a 200 response code. You can also use the histogram to identify services with abnormally high latencies and optimize their performance.
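If you want a single number to compare services, you can estimate a latency percentile from the same bucket data that feeds the histogram. The Python sketch below uses linear interpolation within a bucket, the same general idea as the histogram_quantile function in Prometheus. The bucket bounds and counts are invented for illustration, the function here is a simplified stand-in, and nothing in it describes how New Relic computes its own visualizations.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound_ms, cumulative_count) pairs,
    sorted by upper bound, with a final bucket whose bound is +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge to interpolate toward.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for one service, in milliseconds.
buckets = [(5, 10), (10, 60), (25, 90), (50, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, buckets)
p95 = histogram_quantile(0.95, buckets)
print(p50, p95)
```

With these made-up buckets, the median lands inside the 5 to 10 ms bucket and the 95th percentile inside the 25 to 50 ms bucket. The estimate is only as precise as the bucket boundaries allow.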
The next image shows inbound errors faceted by service, so you can see which services are returning HTTP 5xx responses.
Here's the NRQL query that generates the visualization:
FROM Metric SELECT (filter(rate(sum(istio_requests_total), 1 minute), WHERE response_code LIKE '5%')) AS 'Errors' WHERE reporter = 'destination' FACET source_workload TIMESERIES
Monitoring Istio Envoy downstream requests
You can measure all requests outbound from Istio Envoy proxies in the service mesh by filtering istio_requests_total by those with attribute reporter set to source.
The next image shows a time series visualization of outbound requests faceted by destination service and response code. This visualization shows the most common responses from your microservices. If there is a spike in 5xx or 404 responses, you can quickly pinpoint which microservice is the culprit.
Here's the NRQL query that generates the previous visualization:
FROM Metric SELECT rate(sum(istio_requests_total), 1 SECOND) AS 'Req/Sec' WHERE reporter = 'source' FACET destination_canonical_service, response_code TIMESERIES
The next image is a visualization of the client request duration, which gives you a bird's-eye view of the average performance of all your microservices.
Here's the NRQL query for the previous histogram:
FROM Metric SELECT histogram(istio_request_duration_milliseconds_bucket, 1000) AS 'Requests' WHERE reporter = 'source'
Filtering by destination service or source service
If you want to analyze metrics for a particular service, use the filter bar at the top of the quickstart dashboard to specify the attribute destination_workload, as shown in the next image.
Monitoring the ingress gateway (service mesh)
With Istio, you manage your inbound traffic with a gateway: a set of Envoy proxies that act as load balancers for ingress traffic, which is traffic from an external public network to a private network. With the gateway as the central load balancer, you can use advanced routing features such as traffic splitting, redirects, and retry logic.
Istio gateways forward metrics like latency, throughput, and error rate just like sidecar proxies. On the Ingress Gateways tab of the Istio quickstart, you can visualize total requests.
With Istio, you get raw data directly from Envoy. By measuring envoy_cluster_upstream_cx_rx_bytes_total of the ingress gateway faceted by cluster and namespace, you can see the bytes received over upstream connections for each cluster and namespace. Here's the NRQL query:
FROM Metric SELECT average(envoy_cluster_upstream_cx_rx_bytes_total) WHERE label.app='istio-ingressgateway' FACET clusterName, namespaceName TIMESERIES AUTO
The next image shows a visualization of this query.
Demystify Istio performance with Prometheus and New Relic
Install New Relic’s Istio quickstart and start visualizing Istio’s Prometheus data. You can send us your Prometheus data in two ways:
- Are you tired of storing, maintaining, and scaling your Prometheus data? Try New Relic’s Prometheus OpenMetrics Integration, which automatically scrapes, stores, and scales your data.
- Already have a Prometheus server and want to send data to New Relic? Try New Relic’s Prometheus Remote Write integration.
If you don't already have a free New Relic account, sign up now. Let New Relic manage your Prometheus data while you focus on innovation. Your free account includes 100 GB/month of free data ingest, one full-platform user, and unlimited basic users.