A number of customer questions about multi-cloud applications and evaluating cloud platforms inspired us to explore the automation, configuration, and testing of a simple application we were building for multiple cloud platforms. The goal was to create an (almost) exact copy of the application and its service dependencies in Amazon Web Services, Microsoft Azure, and Google Cloud Platform. It was the first time we had run this specific type of application on three platforms at the same time, so we used monitoring data to confirm we had deployed it correctly and to learn more about the differences among services with similar features in each of the leading clouds. The idea was to better understand some of the issues that can crop up when working in a multi-cloud environment.

This post describes how we automated the setup of the application behind cloud-managed load balancers with autoscaling using Terraform, and details some surprises we found when analyzing metrics collected from the application and infrastructure.

Automating multi-cloud environments with Terraform

Managed cloud services like Amazon’s S3, Google Cloud Platform’s Container Engine, and Microsoft Azure Load Balancer allow development and operations teams to use storage, compute, and networking resources with less operational overhead. Software updates, security patches, and availability are managed by the provider.

This kind of dynamic infrastructure in the cloud promises to reduce costs and increase the speed and reliability of applications and services. A typical implementation, which we used for our simple app, consists of an internet-facing load balancer that routes traffic to a dynamic pool of identical hosts. As inbound traffic changes, a service automatically increases or decreases the size of the pool based on a pre-configured metric such as average CPU utilization. As of mid-2017, all major cloud providers support this type of autoscaling, under such names as Amazon EC2 Auto Scaling groups, Google Compute Engine managed instance groups, and Azure Virtual Machine Scale Sets.

In order to make our test load-balancing environment as similar as possible across all three clouds, we used Terraform to define the infrastructure resources as code. This made it easy to recreate and change the infrastructure across all three cloud providers at the same time and define similar machine types and identical images. The amount of effort to define all of the required resources, virtual machines, and images was not trivial—it tripled the amount of work since each cloud has a unique set of configuration options and resources (and associated documentation).

It’s possible to visualize the dependency graph of resources created in Terraform (the code is on GitHub). This includes all of the resources and variables needed to create the load balancers and pools of instances, virtual machines, or machines (each provider calls hosts by slightly different names). For our test it looked like this:

Infrastructure resources visualized using Terraform
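A rendering like this can be regenerated from the configuration itself: Terraform's built-in graph command emits DOT output, which Graphviz (assumed to be installed) can turn into an image. For example:

```
terraform graph | dot -Tsvg > graph.svg
```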

The simple Go web application we deployed to each host was baked into a cloud-specific image created using Packer and instrumented using New Relic’s Go agent. This ensured the same operating system (Ubuntu 16.04) with the same application version was running in each cloud, so our cloud-services comparison would be an apples-to-apples situation.

Load testing across Azure, AWS, and Google Cloud Platform

After creating the infrastructure and host images using Terraform and Packer, we had applications running in three clouds behind internet-facing load balancers. Using the open-source load testing project vegeta, we started sending a constant stream of HTTP requests to an application endpoint named /stress that waited one second before returning a 200 response. The handler for the endpoint, written in Go, looked like this (inspired by a Stack Overflow question):

import (
    "fmt"
    "net/http"
    "runtime"
    "time"
)

func stressHandler(w http.ResponseWriter, r *http.Request) {
    done := make(chan int)

    // Spin up one busy-looping goroutine per CPU core.
    for i := 0; i < runtime.NumCPU(); i++ {
        go func() {
            for {
                select {
                case <-done:
                    return
                default:
                    // burn CPU until the done channel is closed
                }
            }
        }()
    }

    // Hold the request open for one second, then stop the workers.
    time.Sleep(1 * time.Second)
    close(done)

    fmt.Fprintf(w, "done")
}
As expected, once we started sending traffic to the clouds, the time spent inside the /stress transaction increased in all three clouds:

load test graph
Amount of time spent inside web transactions after starting the load test across all three clouds.

By grouping the nine application hosts (three per cloud) together by provider type using Infrastructure custom attributes, we observed CPU utilization increases after testing began. We also saw an unexpected result—average CPU utilization on Azure was much lower than on Google and Amazon.

CPU utilization graph

We found the reason in throughput metrics for the Azure hosts. Due to Azure Load Balancer’s Hash-based distribution mode, requests were being routed to the same machine because the load-testing tool, by default, was reusing TCP connections:

load-testing configuration chart
Our load-testing configuration inadvertently caused all requests to be routed to the same host in Azure.

A quick change to the vegeta load-test options (adding a Connection: Close header) corrected this and balanced traffic evenly among all three Azure hosts:

load-test traffic in Azure chart
Forcing vegeta to use new TCP connections balanced load-test traffic in Azure.

While we observed different performance characteristics across the clouds when looking at throughput grouped by host, the biggest surprise turned up in the application transaction metrics.

Spotting an unusual issue in Azure

As mentioned earlier, the application endpoint we were testing was configured to return after exactly one second. We expected that some requests would take more than a second under heavy load. Looking at average transaction duration, everything looked as expected:

average transaction duration graph
Averages of transaction duration look as expected.

However, after writing a NRQL query in New Relic Insights that displayed the histogram of response times over the previous hour, we saw that some requests were somehow taking less than one second, but only on Azure:

response time histogram by cloud
We were surprised our 1-second timer finished early in some cases, but only on Azure.
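The query was along the following lines. NRQL's histogram() takes an attribute, a ceiling value, and a bucket count; the cloudProvider facet here is an illustrative stand-in for the custom attribute we used to tag each host with its provider:

```
SELECT histogram(duration, 2, 20) FROM Transaction FACET cloudProvider SINCE 1 hour ago
```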

Given that the application code, Go runtime, and operating system were identical across all three providers, this suggested that something related to Azure Virtual Machines was causing this behavior.

After some searching, we found an open bug report and GitHub issue describing clock skew on Linux machines running on Hyper-V, the hypervisor behind Azure Virtual Machines. This meant our application's one-second timer could fire early until we applied a software update.

Our original assumption that application behavior would be “more or less” consistent across virtualized instances running the same operating system and application was proven false. While this particular situation didn’t have any negative consequences, for time-sensitive applications it might be a significant (but hard to notice) issue.

Evaluating cloud providers with application and infrastructure metrics

Applications in the cloud increasingly interact with multiple cloud-managed services like load balancers. Setting up, testing, and comparing these services side by side, however, isn’t trivial—even with tools like Terraform and Packer that can automate the infrastructure-creation process.

While exploring the performance of different autoscaling scenarios across different clouds was interesting (we saw some differences in how quickly instances scaled up and down on each provider), metrics collected from the infrastructure and application layers of these systems were even more useful: they let us pinpoint configuration issues and provider-specific nuances during testing, and ultimately confirm that we had deployed the application correctly. That's of particular importance when trying to create the same environment in more than one cloud.

What we noticed in the application transaction response time, throughput, and CPU-utilization metrics led us to make multiple changes in our provider configurations and application code. The ability to explore and visualize this data in different ways at different times was also important to our cloud-services comparison because we didn’t anticipate many of the issues we found—we relied heavily on ad hoc queries and filtering. As described above, less than 30 minutes after we started our initial test, we found an Azure issue that was visible only when we looked at the distribution of response times grouped by cloud-provider type.

In general, data from the instrumented application often provided clues and context that helped us configure and troubleshoot services across multiple clouds. Integrating traditional cloud-monitoring metrics like CPU utilization with application metrics made setup, testing, and understanding Azure, Google Cloud, and AWS faster—and a complex multi-cloud deployment a little easier.

Additional New Relic resources