When you deploy new code, there's always the potential for issues to come up. Many developers use Canary deployments to incrementally release new features to subgroups of users. If problems arise, only a small group of users is affected.

While Canary deployments are lower risk than deploying to all users at once, it's still important to monitor your deployments. In this post, you will learn how to use New Relic to drive your Canary releases, making your application deployment safer, faster, and easier to set up, and ensuring that only healthy versions of your application go into production.

This post assumes you are already using Kubernetes (also known as K8s), Docker, a service mesh, and Canary deployments. You’ll get an overview of Argo Rollouts Analysis powered by New Relic AIOps Proactive Detection. All examples presented are specific to this demo application, so you will need to create your own recipe for your Canary releases. You can follow along with this tutorial using this GitHub repository.

Requirements

Installing Argo Rollouts

Argo Rollouts is described as “a Kubernetes controller and set of CRDs which provide advanced deployment capabilities such as blue-green, canary, canary analysis, experimentation, and progressive delivery features to Kubernetes.” Learn more about Argo Rollouts’ features.

To install Argo Rollouts, run the following commands:

$ kubectl create namespace argo-rollouts
$ kubectl apply -n argo-rollouts -f https://raw.githubusercontent.com/argoproj/argo-rollouts/stable/manifests/install.yaml

This creates a new K8s namespace named argo-rollouts where the Argo Rollouts controller will run. You can install the kubectl plugin with Homebrew by running brew install argoproj/tap/kubectl-argo-rollouts in a terminal. More instructions on how to install Argo Rollouts can be found in the official documentation.

Setting up host-level traffic splitting for Canaries

This tutorial will use a host-level traffic splitting approach that splits the traffic between a Canary and a stable service. To use this approach, you will need to create the following Kubernetes resources:

  • Istio Gateway
  • Service (Canary)
  • Service (stable)
  • Istio VirtualService
  • Rollout

The Istio ingress gateway receives the application's HTTP connections on port 80. For simplicity's sake, it is bound to all hosts (*):

$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/gateway.yaml
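
For orientation, a Gateway of this kind might look roughly like the sketch below. It is not a copy of the repository's gateway.yaml: the resource name comes from the kubectl output later in this post, while the selector label and the port name are assumptions based on a default Istio installation.

```yaml
# Minimal sketch of an Istio Gateway bound to all hosts on port 80.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: nr-rollouts-demo-gateway
spec:
  selector:
    istio: ingressgateway   # assumed: label of the default Istio ingress gateway pods
  servers:
    - port:
        number: 80
        name: http          # assumed port name
        protocol: HTTP
      hosts:
        - "*"               # accept requests for any host
```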

Next, create the Canary and stable Services by running the following command:

$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/services.yaml

This manifest creates two K8s Services, one for each version: nr-rollouts-demo-canary and nr-rollouts-demo-stable. During an update, the Rollout modifies the selector of these Services (app: nr-rollouts-demo) so that each one targets the pods of the Canary or the stable ReplicaSet.
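
As a rough illustration, the two Services could be declared as follows. The names and the selector come from this post. The targetPort value is an assumption and may differ from the repository's services.yaml.

```yaml
# Sketch of the Canary and stable Services. Both start with the same selector.
apiVersion: v1
kind: Service
metadata:
  name: nr-rollouts-demo-canary
spec:
  selector:
    app: nr-rollouts-demo
  ports:
    - name: http
      port: 80
      targetPort: 8080   # assumed container port
      protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: nr-rollouts-demo-stable
spec:
  selector:
    app: nr-rollouts-demo
  ports:
    - name: http
      port: 80
      targetPort: 8080   # assumed container port
      protocol: TCP
```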

Next, you need to create a VirtualService (nr-rollouts-demo-virtualservice) that defines the application traffic routing rules. Argo Rollouts modifies this VirtualService during a rollout, for example when a step sets the desired Canary weight. Initially, 100% of the traffic will be routed to the stable version. Run this command to create the VirtualService:

$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/virtualservice.yaml
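
A VirtualService for host-level splitting might look like the sketch below. The resource, gateway, Service, and route names are taken from this post. The overall layout is an assumption and may not match the repository's virtualservice.yaml line for line.

```yaml
# Sketch of a VirtualService with one named route split between stable and Canary.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: nr-rollouts-demo-virtualservice
spec:
  gateways:
    - nr-rollouts-demo-gateway
  hosts:
    - "*"
  http:
    - name: primary              # the route name referenced by the Rollout
      route:
        - destination:
            host: nr-rollouts-demo-stable
          weight: 100            # all traffic goes to stable initially
        - destination:
            host: nr-rollouts-demo-canary
          weight: 0              # Argo Rollouts adjusts these weights at each setWeight step
```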

Argo Rollouts New Relic Analysis requires a K8s Secret containing your Account ID, Personal API Key, and Region (us or eu) to run the analysis against your account's data. The demo application also needs a second K8s Secret that holds your New Relic License Key, which is passed to the application through an environment variable. The following command creates both Secrets:

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: newrelic-rollouts
  namespace: argo-rollouts
type: Opaque
stringData:
  personal-api-key: "<YOUR-PERSONAL-KEY>"
  region: "<YOUR-REGION>"
  account-id: "<YOUR-ACCOUNT-ID>"
---
apiVersion: v1
kind: Secret
metadata:
  name: newrelic
type: Opaque
stringData:
  license-key: "<YOUR-LICENSE-KEY>"
EOF
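
To show how the license key could reach the application, here is a sketch of the relevant part of a pod template. The Secret name and key match the manifest above, and the container name matches the one used by the set image commands later in this post. The environment variable name and the image tag are assumptions.

```yaml
# Fragment of a pod template (spec.template.spec); not a complete manifest.
containers:
  - name: nr-rollouts-demo
    image: edmocosta/nr-rollouts-demo:blue   # assumed tag for the initial (blue) version
    env:
      - name: NEW_RELIC_LICENSE_KEY          # assumed variable name read by the demo application
        valueFrom:
          secretKeyRef:
            name: newrelic                   # the second Secret created above
            key: license-key
```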

Setting up Argo Rollouts analysis

Argo Rollouts provides several ways to perform analysis and drive progressive delivery. This example focuses on New Relic's Proactive Detection and events reported by APM. Both data sources work out of the box.

The following command creates three AnalysisTemplates:

$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/newrelic-analysis.yaml

The newrelic-transaction-error-percentage-background template checks the percentage of HTTP 5xx responses given by the Canary's pods during the last 30 seconds. This template is used as a fail-fast mechanism and runs every 30 seconds during the deployment.
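
A background template of this kind might be written along the following lines. This is a sketch under stated assumptions, not the contents of newrelic-analysis.yaml: the metric name, the initial delay value, the httpResponseCode attribute, and the host filter used to isolate the Canary pods are all illustrative guesses.

```yaml
# Sketch of a fail-fast AnalysisTemplate using the Argo Rollouts New Relic provider.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: newrelic-transaction-error-percentage-background
spec:
  args:
    - name: app-name
    - name: canary-pod-hash
  metrics:
    - name: error-percentage              # hypothetical metric name
      interval: 30s                       # measure every 30 seconds while the rollout progresses
      initialDelay: 30s                   # assumed delay so the Canary can report data first
      successCondition: result.errorPercentage <= 1
      failureCondition: result.errorPercentage > 1
      provider:
        newRelic:
          profile: newrelic-rollouts      # Secret with the account ID, API key, and region
          query: |
            FROM Transaction
            SELECT percentage(count(*), WHERE httpResponseCode LIKE '5%') AS errorPercentage
            WHERE appName = '{{args.app-name}}'
            AND host LIKE '%{{args.canary-pod-hash}}%'
            SINCE 30 seconds ago
```

The host filter assumes that the pod name, which contains the pod template hash, is reported as the transaction's host attribute.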

The newrelic-transaction-error-percentage template is similar to newrelic-transaction-error-percentage-background. However, this template does not run in the background, has no initial delay, and executes the NRQL query using the since argument instead of a fixed 30-second window. It checks the overall response errors over a larger time window.

Finally, newrelic-golden-signals checks the New Relic Proactive Detection golden signals (throughput, response time, and errors) of the application. If New Relic detects any anomalies or an alert triggers during the deployment, the Canary is aborted.

If the Canary pods report no data to New Relic during the analysis time, Argo Rollouts returns an inconclusive result. You can also customize how many failed, errored, and inconclusive measurements are tolerated using the failureLimit, consecutiveErrorLimit, and inconclusiveLimit properties.
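
As an illustration of those properties, the sketch below sets all three on a single metric. The template name, the limit values, and the query are hypothetical and are not taken from the demo repository.

```yaml
# Sketch of per-metric tolerance settings in an AnalysisTemplate.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: example-error-percentage-with-limits   # hypothetical name, not one of the demo templates
spec:
  args:
    - name: app-name
  metrics:
    - name: error-percentage
      interval: 30s
      count: 6                     # take six measurements, then finish
      failureLimit: 1              # tolerate one failed measurement; a second one fails the analysis
      inconclusiveLimit: 2         # tolerate two inconclusive measurements
      consecutiveErrorLimit: 3     # tolerate three consecutive measurement errors, such as failed API calls
      successCondition: result.errorPercentage <= 1
      failureCondition: result.errorPercentage > 1
      provider:
        newRelic:
          profile: newrelic-rollouts
          query: |
            FROM Transaction
            SELECT percentage(count(*), WHERE httpResponseCode LIKE '5%') AS errorPercentage
            WHERE appName = '{{args.app-name}}'
            SINCE 30 seconds ago
```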

Configuring the application rollout

You can configure the application rollout by running the following command:

$ kubectl apply -f https://raw.githubusercontent.com/edmocosta/newrelic-rollouts-demo/master/rollout.yaml

The Rollout resource specification has a variety of properties to control how the deployment is executed. This example focuses on the Canary strategy.

Defining the Canary strategy

This example defines a specific strategy based on the demo application (not a real-world application). If you plan to do canary releases and aren’t sure how to define a good strategy for your application, this blog post is a good starting point and will help you find a good fit for your use case.

In this example, the Canary release takes at least 11 minutes to be fully promoted. The plan is to gradually increase the Canary's traffic every one or two minutes and run the analysis to detect problems. Here’s a summary of the strategy:

  • If the Canary is completely broken, it should fail immediately. A background analysis checks the Canary pods' HTTP 5XX responses every 30 seconds during the deployment.
  • At first, only 5% of application traffic is redirected to the Canary. You will need to carefully define the amount based on your application. Values that are too small can lead to insufficient traffic, which makes it harder to detect problems. On the other hand, if the value is too large, a broken Canary can negatively affect customers.
  • New Relic Proactive Detection monitors metric data and focuses on key golden signals: throughput, response time, and errors. If one of these golden signals behaves anomalously during the deployment, the Canary fails. To ensure New Relic has enough data points, this analysis starts running after 5 minutes.
  • The Canary fails if any alert triggers for the demo application.
  • Finally, the analysis checks the Canary pods’ golden signals and HTTP responses from the previous 11 minutes, which is the duration of the Canary deployment.

Here's the rollout.yaml file:

...
  strategy:
    canary:
      stableService: nr-rollouts-demo-stable
      canaryService: nr-rollouts-demo-canary
      trafficRouting:
        istio:
          virtualService:
            name: nr-rollouts-demo-virtualservice
            routes:
              - primary
      # The following analysis will run in the background while the canary progresses through its 
      # rollout steps. Every 30 seconds, the analysis checks if the application has reported more than 1% of 
      # HTTP 5XX responses to New Relic. If so, the Canary fails and the deployment is aborted.
      analysis:
        templates:
          - templateName: newrelic-transaction-error-percentage-background
        args:
          - name: app-name
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: canary-pod-hash
            valueFrom:
              podTemplateHashValue: Latest
      steps:
        # First, only 5% of application traffic is redirected to the Canary. This amount is only an example
        # and should be carefully defined based on your application. Values that are too small can
        # lead to insufficient traffic to spot problems. Bigger values can affect customers if the Canary is
        # broken.
        - setWeight: 5
        - pause: { duration: 60s }
        - setWeight: 15
        - pause: { duration: 60s }
        ... # increases the traffic gradually
        - setWeight: 30
        - pause: { duration: 120s }
        # If the background analysis doesn’t report a failure, New Relic checks the Canary’s
        # golden-signals since the deployment started.
        - analysis:
            templates:
              - templateName: newrelic-golden-signals
            args:
              - name: app-name
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: since
                value: "5 minutes ago"
        ...  # Increase traffic gradually and run newrelic-golden-signals.
        - setWeight: 90
        - pause: { duration: 120s }
        # When the Canary is handling 90% of application traffic, both the golden signals and
        # the HTTP error percentage reported during the entire deployment process (the last 11 minutes) are checked.
        - analysis:
            templates:
              - templateName: newrelic-transaction-error-percentage
              - templateName: newrelic-golden-signals
            args:
              - name: app-name
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: canary-pod-hash
                valueFrom:
                  podTemplateHashValue: Latest
              - name: since
                value: "11 minutes ago"
        # If the Canary succeeds, it is automatically promoted to stable.
        # You can pause the Canary here and promote it manually by adding a pause: {} step with no duration.
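
If you prefer a manual gate before promotion, the end of the steps list could include an indefinite pause, as in the minimal sketch below. The placement of the pause is an assumption for illustration; the demo's rollout.yaml promotes automatically.

```yaml
# Fragment of spec.strategy.canary.steps (illustrative, not part of the demo's rollout.yaml).
steps:
  - setWeight: 90
  - pause: { duration: 120s }
  # A pause with no duration halts the rollout here until it is promoted manually.
  - pause: {}
```

With a step like this in place, the rollout would wait until you run kubectl argo rollouts promote nr-rollouts-demo.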

Testing the Argo Rollouts with New Relic integration

This example uses a modified version of the rollouts-demo application that sends metrics to New Relic using the New Relic Go agent. The next step is to verify that all resources have been properly created:

$ kubectl get ro
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE
nr-rollouts-demo   1         1         1            1

$ kubectl get svc
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
nr-rollouts-demo-canary   ClusterIP   10.110.165.97    <none>        80/TCP    1h
nr-rollouts-demo-stable   ClusterIP   10.104.226.126   <none>        80/TCP    1h

$ kubectl get virtualservice
NAME                              GATEWAYS                       HOSTS   AGE
nr-rollouts-demo-virtualservice   ["nr-rollouts-demo-gateway"]   ["*"]   19h

$ kubectl get gateway
NAME                       AGE
nr-rollouts-demo-gateway   19h

Once everything is verified, you can access the demo application’s front end by exposing the Istio Gateway and opening http://localhost in your browser.

For Minikube, run this command:

$ minikube tunnel

For Kubernetes port forwarding, run this command:

$ kubectl port-forward svc/istio-ingressgateway 80:80 -n istio-system

Great! You can see that only one version (blue) is deployed and receiving traffic. Next, check the application rollout status by running the following command:

$ kubectl argo rollouts get rollout nr-rollouts-demo --watch

At this point, everything looks fine with the demo application and all metrics are being reported to New Relic.

Before testing the Canary strategy, here’s a quick list of a few useful Argo Rollouts commands:

$ kubectl argo rollouts promote nr-rollouts-demo          # Manually promote a rollout to the next step.
$ kubectl argo rollouts abort nr-rollouts-demo            # Abort the rollout.
$ kubectl argo rollouts promote --full nr-rollouts-demo   # Skip all remaining steps and analysis.

Test 1: Healthy

The following command triggers the rollout of a healthy (green) version of the demo application:

$ kubectl argo rollouts set image nr-rollouts-demo nr-rollouts-demo=edmocosta/nr-rollouts-demo:green

Test 2: HTTP 500

The bad-red image returns HTTP 500 errors for 15% of the API responses. This Canary version should fail because the maximum percentage allowed by the AnalysisTemplate is 1%.

$ kubectl argo rollouts set image nr-rollouts-demo nr-rollouts-demo=edmocosta/nr-rollouts-demo:bad-red

Test 3: Alerts

For this test, deploy the slow-yellow version. This image delays all API responses by 2 seconds, affecting the demo application’s Apdex score. Because New Relic One has been configured to trigger an alert for Apdex values lower than 0.9, the rollout will fail.

$ kubectl argo rollouts set image nr-rollouts-demo nr-rollouts-demo=edmocosta/nr-rollouts-demo:slow-yellow

Test 4: Proactive Detection

This last experiment tests the proactive detection analysis. Deploy the purple version and set the demo application error rate to 100%. This is definitely an abnormal error rate and should trigger an anomaly incident in New Relic One.

$ kubectl argo rollouts set image nr-rollouts-demo nr-rollouts-demo=edmocosta/nr-rollouts-demo:purple

On the Anomalies tab of the Alerts & AI Overview page, New Relic One provides a list of all the recently detected anomalies in your environment, with a detailed analysis and valuable insights into the source of the problem.

What else can you do with Argo Rollouts?

Argo Rollouts supports different types of analysis. For example, a Kubernetes job can be used to run analysis and experiments. Those capabilities make it possible to include other types of health checks in your Canary pipeline, such as E2E tests and performance benchmarks. It also integrates with Argo CD, which makes the states of Argo Rollouts resources understandable and allows you to build automation that reacts to those states, such as actions to unpause and promote a rollout.

The Canary analysis presented in this post is only a starting point. Depending on your application’s characteristics, you can also include Logs, Metrics, Tracing, and your own set of Alerts in the Canary analysis. Having a good strategy that fits your application is key for your Canary releases.