Guest author Anton Malinovskiy, principal software engineer at Ocado, discusses the architecture of Ocado’s metrics and monitoring system and the role that New Relic One and Prometheus play in conjunction with Micrometer.
Ocado Group is a British online grocery technology company. At Ocado Technology, the tech specialist wing, we design and build a huge amount of cutting-edge automation technology in-house. We use machine learning for demand forecasting, performing around 20 million demand forecasts a day, and use custom robots to pick items in our warehouses. If you haven’t seen our robots in action, check out our YouTube channel. While technology is at the heart of what we do today, Ocado Group also owns 50% of Ocado.com, a United Kingdom retailer run as a joint venture with M&S that delivers over 330,000 orders per week.
The Ocado business model has no physical stores; instead, we use huge, dedicated warehouses that we believe to be the largest and most sophisticated of their kind in the world. Since 2018 we have made the Ocado Smart Platform available to other retailers—eight retailers around the world, including Morrisons in the United Kingdom, Groupe Casino in France, Sobeys in Canada, Kroger in the United States, Coles in Australia, and Aeon in Japan.
Most of our software is JVM-based, including the software we use to control our robot swarms and grid. To monitor this software, we use both Prometheus (especially for on-premises installations) and New Relic One.
In addition, since much of our infrastructure runs in the public cloud on AWS, we use Amazon CloudWatch both for metric storage and for monitoring anything AWS-specific, such as our Lambda functions and auto-scaling groups.
In this blog post, based on my Nerd Days presentation, I’ll describe the architecture of Ocado’s metrics and monitoring system, the role that New Relic One and Prometheus play, and why and how we use Micrometer.
Monitoring architecture overview
We use Micrometer as part of our in-house metrics and monitoring system, which we refer to as Flux. Flux provides complex, business-oriented, aggregated metrics such as “order abandonment rate” and “the number of order calculations started but not completed within a given timeframe.”
Architecturally speaking, at a very high level Flux works by having our apps send their business events through Amazon Kinesis. We ingest these events, filter them, and send the filtered events to a single Kinesis stream where a custom event stream processor performs the necessary aggregation and calculates the metrics. These calculated metrics are then sent to CloudWatch for storage. If any alerts are generated, they are sent both to PagerDuty and out via email.
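To make the ingest–filter–aggregate shape concrete, here is a minimal, library-free Java sketch of that pipeline stage. The event fields and the `order.`-prefixed event types are hypothetical illustrations, not our actual event schema:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Library-free sketch of the Flux pipeline shape: ingest a batch of
// business events, filter them, then aggregate a metric from what remains.
// BusinessEvent and the event-type names are hypothetical examples.
public class FluxSketch {
    record BusinessEvent(String type, String orderId) {}

    public static Map<String, Long> aggregate(List<BusinessEvent> ingested) {
        return ingested.stream()
                // filtering step: keep only the event types we care about
                .filter(e -> e.type().startsWith("order."))
                // aggregation step: count the filtered events per type
                .collect(Collectors.groupingBy(BusinessEvent::type,
                        Collectors.counting()));
    }
}
```

In the real system the batch arrives from a Kinesis stream and the aggregated values go to CloudWatch, but the filter-then-aggregate structure is the same.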
Of course, the event stream processing system itself also needs to be monitored. For this, we track three main metrics:
- Rate: The event rate
- Loss: The error rate, defined as events that didn’t reach the endpoint
- Delay: The end-to-end journey time for a given event
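The three metrics above can be sketched as simple computations over a batch of tracked events. This is an illustrative, library-free sketch; the `TrackedEvent` shape (and using a negative receive timestamp to mark a lost event) is a hypothetical convention, not our actual wire format:

```java
import java.util.List;

// Sketch of the three pipeline-health metrics: rate, loss, and delay.
// TrackedEvent is a hypothetical illustration; receivedMillis < 0 marks
// an event that never reached the endpoint.
public class PipelineHealth {
    record TrackedEvent(long sentMillis, long receivedMillis) {}

    // Rate: events per second over the batch window
    static double rate(List<TrackedEvent> batch, long windowMillis) {
        return batch.size() * 1000.0 / windowMillis;
    }

    // Loss: fraction of events that did not reach the endpoint
    static double loss(List<TrackedEvent> batch) {
        long lost = batch.stream().filter(e -> e.receivedMillis() < 0).count();
        return (double) lost / batch.size();
    }

    // Delay: mean end-to-end journey time of delivered events, in millis
    static double delay(List<TrackedEvent> batch) {
        return batch.stream()
                .filter(e -> e.receivedMillis() >= 0)
                .mapToLong(e -> e.receivedMillis() - e.sentMillis())
                .average().orElse(0);
    }
}
```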
These metrics are created using Micrometer, sent through a Kinesis stream as before, and then fanned out to both Prometheus and New Relic One.
Here is a screenshot of one of our New Relic One dashboards:
To get a bit more of a feel for how this works, I’ve prepared a small demo application that is available via my GitHub repo. The application shows a simplified version of how we use Micrometer with Prometheus and New Relic registries. To keep the example simple, I’ve used Kafka Streams rather than Kinesis because it’s easier to run Kafka Streams on a local machine.
You might be wondering why we chose to use this fan-out approach of sending events from Micrometer to both Prometheus and New Relic, rather than making use of New Relic’s Prometheus integration. The reason is that the latter approach would have required us to scale and ensure the high availability of our Prometheus cluster.
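In Micrometer itself, this fan-out is what `CompositeMeterRegistry` provides: a metric published once is forwarded to every registry added to the composite, such as a Prometheus registry and a New Relic registry. The following library-free sketch models that idea; the `MetricSink` interface and its implementations are hypothetical stand-ins, not Micrometer classes:

```java
import java.util.ArrayList;
import java.util.List;

// Library-free model of the fan-out pattern: publish a metric once,
// forward it to every registered backend. MetricSink is a hypothetical
// stand-in for a Micrometer registry.
public class FanOut {
    interface MetricSink {
        void publish(String name, double value);
    }

    // Test double that records what it was asked to publish
    static class RecordingSink implements MetricSink {
        final List<String> seen = new ArrayList<>();
        public void publish(String name, double value) {
            seen.add(name + "=" + value);
        }
    }

    // Composite sink: forwards every publish to all registered sinks,
    // the same shape as Micrometer's CompositeMeterRegistry
    static class CompositeSink implements MetricSink {
        private final List<MetricSink> sinks = new ArrayList<>();
        void add(MetricSink sink) { sinks.add(sink); }
        public void publish(String name, double value) {
            sinks.forEach(s -> s.publish(name, value));
        }
    }
}
```

With this shape, the application code publishes each metric exactly once and stays unaware of how many backends are listening.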
Advantages of Micrometer
The use of Micrometer gives us several advantages. First, its close integration with the Spring framework means it works well in Spring applications and natively supports dependency injection. Second, the metrics are part of the application code, which means we can easily have automated test coverage for those metrics, incorporating both unit and integration testing. Micrometer is also vendor-neutral, so it provides us with some protection against vendor lock-in; you can think of it as analogous to SLF4J, but for metrics. To put this another way, we use New Relic because we find value in the product rather than because we are locked into it.
Micrometer supports four main metric types:
- Counters: Report a single metric, a count. The Counter interface only allows you to increment by a positive amount. One application for counters is calculating rates of change; for example, you can use a counter to track uptime and then spot when an application restarted.
- Gauges: A handle to get the current value, useful for monitoring something with natural upper bounds such as the size of a map or a collection, or the number of threads in a running state.
- Timers: Intended for measuring short-duration latencies such as the execution of some part of the code.
- Distribution summaries: Track the distribution of events, including the mean, standard deviation, and sample set for a particular metric.
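The real meter types live in `io.micrometer.core.instrument` (for example `Counter.builder(...)` and `DistributionSummary`), and they also carry tags. To keep this post self-contained, here is a library-free sketch of just the first and last types on the list, mirroring their shape:

```java
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

// Library-free sketch of two of the meter types above. These mirror the
// shape of Micrometer's Counter and DistributionSummary but are
// simplified illustrations, not the real implementations.
public class Meters {
    // Counter: can only be incremented by a positive amount
    static class Counter {
        private final DoubleAdder count = new DoubleAdder();
        void increment(double amount) {
            if (amount < 0) throw new IllegalArgumentException("counters only go up");
            count.add(amount);
        }
        double count() { return count.sum(); }
    }

    // Distribution summary: tracks the count and total of recorded
    // amounts, from which statistics like the mean are derived
    static class DistributionSummary {
        private final LongAdder count = new LongAdder();
        private final DoubleAdder total = new DoubleAdder();
        void record(double amount) { count.increment(); total.add(amount); }
        double mean() {
            long n = count.sum();
            return n == 0 ? 0 : total.sum() / n;
        }
    }
}
```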
Finally, Micrometer has a dimensional data model, which we find useful. Dimensional metrics have a metric value, which can be a scalar or a vector of double values, alongside tags, which are arbitrary sets of key-value pairs. Examples of tags might be your application name, the version of a particular library, and so on. This model allows you to aggregate and drill down into the various metrics by different tags, which gives you flexibility both when investigating issues and when creating alerts.
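A short sketch makes the aggregation-by-tag idea concrete. Each sample carries a value plus arbitrary key-value tags, and drilling down is just grouping by one tag dimension. The `Sample` type and the tag names used in the usage below (such as "app") are hypothetical illustrations:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the dimensional model: a value plus arbitrary key-value
// tags, with drill-down implemented as grouping by a tag. Sample is a
// hypothetical illustration of the shape, not a Micrometer type.
public class Dimensional {
    record Sample(double value, Map<String, String> tags) {}

    // Drill-down: sum sample values grouped by one tag dimension
    static Map<String, Double> sumBy(List<Sample> samples, String tag) {
        return samples.stream().collect(Collectors.groupingBy(
                s -> s.tags().getOrDefault(tag, "unknown"),
                Collectors.summingDouble(Sample::value)));
    }
}
```

The same list of samples can be re-aggregated along any tag dimension without changing the instrumentation, which is exactly the flexibility the dimensional model buys you.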
Check out Monitoring Spring Boot Applications using Micrometer Metrics in New Relic to learn more about how the New Relic Micrometer registry works and how to install and use it to send metrics to New Relic from an example application.