When you deploy new code, there's always the potential for issues to come up. Many developers use Canary deployments to incrementally release new features to subgroups of users. If problems arise, only a small group of users is affected.
While Canary deployments are lower risk than deploying to all users at once, it's still important to monitor your deployments. In this post, you will learn how to use New Relic to drive your Canary releases, making your application deployment safer, faster, and easier to set up, and ensuring that only healthy versions of your application go into production.
This post assumes you are already familiar with Kubernetes (also known as K8s), Docker, service meshes, and canary deployments. You’ll get an overview of Argo Rollouts analysis powered by New Relic AIOps Proactive Detection. All examples presented are specific to this demo application, so you will need to create your own recipe for your Canary releases. You can follow along with this tutorial using this GitHub repository.
Requirements
- Kubernetes cluster or minikube
- Docker
- Istio
- Argo Rollouts
Installing Argo Rollouts
Argo Rollouts is described as “a Kubernetes controller and set of CRDs which provide advanced deployment capabilities such as blue-green, canary, canary analysis, experimentation, and progressive delivery features to Kubernetes.” Learn more about Argo Rollouts’ features.
To install Argo Rollouts, input the following commands:
$ kubectl create namespace argo-rollouts
$ kubectl apply -n argo-rollouts -f https://raw.githubusercontent.com/argoproj/argo-rollouts/stable/manifests/install.yaml
This creates a new K8s namespace named argo-rollouts where the Argo Rollouts controller will run. The kubectl plugin can be installed with Homebrew by running brew install argoproj/tap/kubectl-argo-rollouts in the terminal. More instructions on how to install Argo Rollouts can be found in the official documentation.
Setting up host-level traffic splitting for Canaries
This tutorial will use a host-level traffic splitting approach that splits the traffic between a Canary and a stable service. To use this approach, you will need to create the following Kubernetes resources:
- Istio Gateway
- Service (Canary)
- Service (stable)
- Istio VirtualService
- Rollout
The Istio ingress gateway receives our application's HTTP connections on port 80. For simplicity's sake, it is bound to all hosts (*):
$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/gateway.yaml
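Based on the gateway name shown later in this post, the applied manifest looks roughly like this sketch (field details may differ slightly from the actual file in the demo repository):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: nr-rollouts-demo-gateway
spec:
  selector:
    istio: ingressgateway   # use Istio's default ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"                   # bound to all hosts for simplicity
```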
Next, run the following command:
$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/services.yaml
This manifest creates two K8s Services, one for each version: nr-rollouts-demo-canary and nr-rollouts-demo-stable. The selector of these Services (app: nr-rollouts-demo) will be modified by the Rollout during an update to target the Canary and stable ReplicaSet pods.
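A sketch of what these two Services can look like (the port is taken from the `kubectl get svc` output later in this post; the target port name is an illustrative assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nr-rollouts-demo-canary
spec:
  ports:
  - port: 80
    targetPort: http   # assumed container port name
    protocol: TCP
    name: http
  selector:
    app: nr-rollouts-demo   # rewritten by the Rollout to target the Canary ReplicaSet
---
apiVersion: v1
kind: Service
metadata:
  name: nr-rollouts-demo-stable
spec:
  ports:
  - port: 80
    targetPort: http
    protocol: TCP
    name: http
  selector:
    app: nr-rollouts-demo   # rewritten by the Rollout to target the stable ReplicaSet
```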
Next, you need to create a VirtualService (nr-rollouts-demo-virtualservice) that defines the application's traffic routing rules. Argo Rollouts continuously modifies this virtual service, for example when you set the desired Canary weight. Initially, 100% of the traffic will be routed to the stable version. Run this command to create the virtual service:
$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/virtualservice.yaml
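Roughly, the virtual service wires the gateway to the two Services and starts with all traffic on stable (a sketch; the route name `primary` matches the route referenced by the Rollout's trafficRouting later in this post):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: nr-rollouts-demo-virtualservice
spec:
  gateways:
  - nr-rollouts-demo-gateway
  hosts:
  - "*"
  http:
  - name: primary              # route name referenced by the Rollout
    route:
    - destination:
        host: nr-rollouts-demo-stable
      weight: 100              # initially, all traffic goes to stable
    - destination:
        host: nr-rollouts-demo-canary
      weight: 0                # Argo Rollouts adjusts these weights during the rollout
```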
Argo Rollouts New Relic analysis requires a K8s Secret containing your account ID, personal API key, and region (us or eu) to run the analysis against your account's data. New Relic also needs another K8s Secret containing your New Relic license key, which is passed to the demo application through an environment variable.
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: newrelic-rollouts
  namespace: argo-rollouts
type: Opaque
stringData:
  personal-api-key: "<YOUR-PERSONAL-KEY>"
  region: "<YOUR-REGION>"
  account-id: "<YOUR-ACCOUNT-ID>"
---
apiVersion: v1
kind: Secret
metadata:
  name: newrelic
type: Opaque
stringData:
  license-key: "<YOUR-LICENSE-KEY>"
EOF
Setting up Argo Rollouts analysis
Argo Rollouts provides several ways to perform analysis and drive progressive delivery. This example focuses on New Relic's Proactive Detection and events reported by APM. Both data sources work out of the box.
The following command creates three AnalysisTemplates:
$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/newrelic-analysis.yaml
The newrelic-transaction-error-percentage-background template checks the percentage of HTTP 5xx responses returned by the Canary's pods during the last 30 seconds. This template is used as a fail-fast mechanism and runs every 30 seconds during the deployment.
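As a rough sketch of the shape of such a template — the NRQL query, success condition, and profile name below are illustrative assumptions, not the exact contents of newrelic-analysis.yaml — Argo Rollouts' New Relic metric provider runs an NRQL query and evaluates a success condition against the result:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: newrelic-transaction-error-percentage-background
spec:
  args:
  - name: app-name
  - name: canary-pod-hash
  metrics:
  - name: transaction-error-percentage
    interval: 30s
    # fail the Canary if more than 1% of responses were HTTP 5xx
    successCondition: result.errorRate <= 1
    provider:
      newRelic:
        profile: newrelic-rollouts   # assumed to reference the Secret created earlier
        query: |
          SELECT percentage(count(*), WHERE httpResponseCode >= '500') AS errorRate
          FROM Transaction WHERE appName = '{{ args.app-name }}'
          SINCE 30 seconds ago
```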
The newrelic-transaction-error-percentage template is similar to newrelic-transaction-error-percentage-background. However, this template does not run in the background, has no initial delay, and executes the NRQL query using the since argument instead of the fixed 30 seconds ago. It checks the overall response errors over a larger time window.
Finally, newrelic-golden-signals checks the New Relic Proactive Detection golden signals (throughput, response time, and errors) of the application. If New Relic detects an anomaly or an alert triggers during the deployment, the Canary is aborted.
If the Canary pods report no data to New Relic during the analysis window, Argo Rollouts returns an inconclusive result. You can also customize the failure and inconclusive tolerances using the failureLimit, consecutiveErrorLimit, and inconclusiveLimit properties.
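For example, these properties can be tuned per metric in an AnalysisTemplate (the values here are illustrative, not those used by the demo):

```yaml
metrics:
- name: transaction-error-percentage
  interval: 30s
  failureLimit: 1            # measurements that may fail before the analysis is marked Failed
  consecutiveErrorLimit: 2   # consecutive measurement errors allowed before the analysis errors out
  inconclusiveLimit: 3       # inconclusive measurements allowed before the rollout is paused
  # provider and successCondition omitted for brevity
```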
Configuring the application rollout
You can configure the application rollout by running the following command:
$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/rollout.yaml
The Rollout resource specification has a variety of properties to control how the deployment is executed. This example focuses on the Canary strategy.
Defining the Canary strategy
This example defines a specific strategy based on the demo application (not a real-world application). If you plan to do canary releases and aren’t sure how to define a good strategy for your application, this blog post is a good starting point and will help you find a good fit for your use case.
In this example, the Canary release analysis takes at least 11 minutes to be fully promoted. The plan is to gradually increase the canary's traffic every one or two minutes and run the analysis to detect problems. Here’s a summary of the strategy:
- If the Canary is completely broken, it should fail immediately. A background analysis checks the application Canary pods' HTTP 5XX responses every 30 seconds during the deployment.
- At first, only 5% of application traffic is redirected to the Canary. You will need to carefully define this amount based on your application. Values that are too small can lead to insufficient traffic, which makes it harder to detect problems. On the other hand, if the value is too large, a broken Canary can negatively affect customers.
- New Relic Proactive Detection monitors metric data and focuses on key golden signals: throughput, response time, and errors. If one of these golden signals behaves anomalously during the deployment, the Canary fails. To ensure New Relic has enough data points, this analysis starts running after 5 minutes.
- The Canary fails if any alert triggers for the demo application.
- Finally, the analysis checks the Canary's pods’ golden signals and HTTP responses from the previous 11 minutes—the duration of the Canary deployment.
Here's the rollout.yaml file:
...
  strategy:
    canary:
      stableService: nr-rollouts-demo-stable
      canaryService: nr-rollouts-demo-canary
      trafficRouting:
        istio:
          virtualService:
            name: nr-rollouts-demo-virtualservice
            routes:
            - primary
      # The following analysis will run in the background while the canary progresses through its
      # rollout steps. Every 30 seconds, the analysis checks whether the application has reported more
      # than 1% of HTTP 5XX responses to New Relic. If so, the Canary fails and the deployment is aborted.
      analysis:
        templates:
        - templateName: newrelic-transaction-error-percentage-background
        args:
        - name: app-name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: canary-pod-hash
          valueFrom:
            podTemplateHashValue: Latest
      steps:
      # First, only 5% of application traffic is redirected to the Canary. This amount is only an example
      # and should be carefully defined based on your application. Values that are too small can
      # lead to insufficient traffic to spot problems. Bigger values can affect customers if the Canary is
      # broken.
      - setWeight: 5
      - pause: { duration: 60s }
      - setWeight: 15
      - pause: { duration: 60s }
      ... # increases the traffic gradually
      - setWeight: 30
      - pause: { duration: 120s }
      # If the background analysis doesn't report a failure, New Relic checks the Canary's
      # golden signals since the deployment started.
      - analysis:
          templates:
          - templateName: newrelic-golden-signals
          args:
          - name: app-name
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: since
            value: "5 minutes ago"
      ... # Increase traffic gradually and run newrelic-golden-signals.
      - setWeight: 90
      - pause: { duration: 120s }
      # When the Canary is handling 90% of application traffic, both the golden signals and
      # the HTTP error percentage reported during the entire deployment process (11 minutes ago) are checked.
      - analysis:
          templates:
          - templateName: newrelic-transaction-error-percentage
          - templateName: newrelic-golden-signals
          args:
          - name: app-name
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: canary-pod-hash
            valueFrom:
              podTemplateHashValue: Latest
          - name: since
            value: "11 minutes ago"
      # If the Canary succeeds, it is automatically promoted to stable.
      # You can pause the Canary here and promote it manually by adding a pause: {} step with no duration.
Testing the Argo Rollouts with New Relic integration
This example uses a modified version of the rollouts-demo application, which sends metrics to New Relic using the Go agent. The next step is to verify that all resources have been created properly:
$ kubectl get ro
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE
nr-rollouts-demo 1 1 1 1
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nr-rollouts-demo-canary ClusterIP 10.110.165.97 <none> 80/TCP 1h
nr-rollouts-demo-stable ClusterIP 10.104.226.126 <none> 80/TCP 1h
$ kubectl get virtualservice
NAME GATEWAYS HOSTS AGE
nr-rollouts-demo-virtualservice ["nr-rollouts-demo-gateway"] ["*"] 19h
$ kubectl get gateway
NAME AGE
nr-rollouts-demo-gateway 19h
Once everything is verified, access the demo application's front end by exposing the Istio Gateway and opening http://localhost in your browser.
For Minikube, run this command: $ minikube tunnel
For Kubernetes port forwarding, run this command: $ kubectl port-forward svc/istio-ingressgateway 80:80 -n istio-system
Great! You can see that only one version (blue) is deployed and receiving traffic. Next, check the application rollout status by running the following command:
$ kubectl argo rollouts get rollout nr-rollouts-demo --watch
At this point, everything looks fine with the demo application and all metrics are being reported to New Relic.
Before testing the Canary strategy, here’s a quick list of a few useful Argo Rollouts commands:
$ kubectl argo rollouts promote nr-rollouts-demo # Manually promote a rollout to the next step.
$ kubectl argo rollouts abort nr-rollouts-demo # Abort the rollout.
$ kubectl argo rollouts promote --full nr-rollouts-demo # Skip all remaining steps and analysis.
Test 1: Healthy
The following command triggers a healthy (green) version of the demo application:
$ kubectl argo rollouts set image nr-rollouts-demo nr-rollouts-demo=edmocosta/nr-rollouts-demo:green
Test 2: HTTP 500
The bad-red image adds HTTP 500 errors to 15% of the API responses. This Canary version should fail because the maximum percentage allowed by the AnalysisTemplate is 1%.
$ kubectl argo rollouts set image nr-rollouts-demo nr-rollouts-demo=edmocosta/nr-rollouts-demo:bad-red
Test 3: Alerts
For this test, deploy the slow-yellow version. This image delays all API responses by 2 seconds, affecting the demo application's Apdex score. Because New Relic One has been configured to trigger an alert for Apdex values lower than 0.9, the rollout will fail.
$ kubectl argo rollouts set image nr-rollouts-demo nr-rollouts-demo=edmocosta/nr-rollouts-demo:slow-yellow
Test 4: Proactive Detection
This last experiment tests the Proactive Detection analysis. Deploy the purple version and set the demo application's error rate to 100%. This is definitely an abnormal error rate and should trigger an anomaly incident in New Relic One.
$ kubectl argo rollouts set image nr-rollouts-demo nr-rollouts-demo=edmocosta/nr-rollouts-demo:purple
Via the Anomalies tab on the Alerts & AI Overview page, New Relic One provides a list of all recently detected anomalies in your environment, with detailed analysis and valuable insights into the source of the problem.
What else can you do with Argo Rollouts?
Argo Rollouts supports different types of analysis. For example, a Kubernetes Job can be used to run analyses and experiments. These capabilities make it possible to include other kinds of health checks in your Canary pipeline, such as end-to-end tests and performance benchmarks. Argo Rollouts also integrates with Argo CD, making Rollout resource states easy to understand and allowing you to build automation that reacts to those states, such as actions to unpause and promote a rollout.
The Canary analysis presented in this post is only a starting point. Depending on your application’s characteristics, you can also include Logs, Metrics, Tracing, and your own set of Alerts in the Canary analysis. Having a good strategy that fits your application is key for your Canary releases.
Next steps
If you’re not already a New Relic customer, then request a demo or sign up for a free trial today.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.