William Hill improves MTTR by 80%

Business Challenge

Every day, William Hill publishes 5.1 million price changes, all updated in real time. That’s 74% more than Amazon UK on its highest-ever trading day. Founded in the UK in 1934, William Hill is one of the world's leading betting and gaming companies and one of the most trusted brands in the industry.

Decreasing downtime and improving troubleshooting

Issues can occur rapidly, given the real-time nature and complexity of the tech stack at William Hill. “The odds change immediately and we have to be on the button. If people get better odds elsewhere, they'll go. If we lose a minute, we lose thousands of customers. We need to know what's going on within every stack and every application,” says Stephen Wild, engineering manager for observability and automation at William Hill. It is very difficult to work out in advance exactly how busy they are going to be.

William Hill used a number of different monitoring tools to see what was happening across its tech stack, but those tools failed repeatedly, often overnight, which meant a wake-up call for Stephen’s team. “We knew we needed to replace what we had. It just wasn't cutting the mustard. We needed something that was, first of all, easy to use. Something reliable, stable, and elastic,” says Stephen.

5.2 million online transactions every day
80% improvement in MTTR
25% improvement in resolving P1 incidents within 60 minutes

Stephen Wild, engineering manager for observability and automation at William Hill, discusses how New Relic helped them improve MTTR by 80%.

Real-time data puts a price on downtime

"When we have downtime, we need to know how much that downtime is costing us as a company. Every second counts. And the real-time nature of New Relic actually lets us work out those costs, exactly. So we can then integrate with a notification system that then writes back to New Relic as a dashboard, so that the whole company can see it. And it's very, very accurate. That lets us prioritize what we need to fix first, and what we need to work on next,” says Stephen.

To gain real-time insight into the revenue impact of technical outages, the Impact Listener application was built on top of New Relic capabilities to track priority one (P1) incidents. The tool can be mapped onto any business service and any metric in real time to provide context and insights into service-impacting incidents during the entire incident lifecycle. New Relic is the primary trigger to launch the Impact Listener workflow: alerts for critical incidents are sent to PagerDuty. At the same time, Impact Listener correlates the issue to the revenue being lost, and this data is shown in New Relic dashboards in real time. With an improved ability to correlate technical problems to business outcomes, teams have seen significant improvements in their troubleshooting efforts, including a 25% improvement in resolving P1 incidents within 60 minutes.
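The idea of correlating an incident to lost revenue can be illustrated with a small sketch. This is not the Impact Listener implementation, which the story does not describe in detail; it simply assumes that a baseline revenue rate and an observed revenue rate per minute are available for the affected business service. The names `Incident` and `estimate_revenue_lost` and all figures are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    """A hypothetical P1 incident on one business service."""
    service: str
    started_at: datetime
    resolved_at: datetime


def estimate_revenue_lost(incident: Incident,
                          baseline_per_minute: float,
                          observed_per_minute: float) -> float:
    """Estimate lost revenue as (baseline - observed) x incident minutes.

    Assumes both rates are constant for the duration of the incident,
    which is a simplification made for illustration only.
    """
    minutes = (incident.resolved_at - incident.started_at).total_seconds() / 60
    # Never report a negative loss if observed revenue exceeds the baseline.
    shortfall = max(baseline_per_minute - observed_per_minute, 0.0)
    return minutes * shortfall


# Illustrative figures only: a 30-minute incident with revenue
# running at 250 per minute against a baseline of 1,000 per minute.
incident = Incident(
    service="example-service",
    started_at=datetime(2024, 1, 1, 12, 0),
    resolved_at=datetime(2024, 1, 1, 12, 30),
)
lost = estimate_revenue_lost(incident, baseline_per_minute=1000.0,
                             observed_per_minute=250.0)
print(f"Estimated revenue lost: {lost:.2f}")
```

A figure like this, recomputed while the incident is still open, is the kind of value that could be written back to a dashboard so teams can rank incidents by business impact.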

Data-powered retrospectives

For incident retrospectives, William Hill leverages Impact Listener to create post-mortem reports for operational support teams, SREs, and development teams to evaluate how they can triage similar incidents in the future. This, together with real-time analytics, allows the teams to start to drive KPIs and continuous improvement. The KPIs are published, tracked, and made accessible to all employees via New Relic's dashboards for each business service. William Hill also uses dashboards for proactive alerting to spot trends and flag where teams need to improve.
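As a rough illustration of the kind of KPI a retrospective might track, the sketch below computes a mean time to resolve and the share of incidents resolved within 60 minutes from a list of incident durations. The function names and sample durations are hypothetical and are not taken from William Hill's data or tooling.

```python
from datetime import timedelta


def mean_time_to_resolve(durations: list[timedelta]) -> timedelta:
    """Average resolution time across a set of incidents."""
    if not durations:
        raise ValueError("at least one incident duration is required")
    return sum(durations, timedelta()) / len(durations)


def share_resolved_within(durations: list[timedelta],
                          limit: timedelta = timedelta(minutes=60)) -> float:
    """Fraction of incidents resolved within the given time limit."""
    if not durations:
        raise ValueError("at least one incident duration is required")
    return sum(1 for d in durations if d <= limit) / len(durations)


# Illustrative durations only, not real incident data.
p1_durations = [
    timedelta(minutes=20),
    timedelta(minutes=45),
    timedelta(minutes=90),
    timedelta(minutes=65),
]

mttr = mean_time_to_resolve(p1_durations)
within_hour = share_resolved_within(p1_durations)
print(f"MTTR: {mttr}, resolved within 60 minutes: {within_hour:.0%}")
```

Tracking both numbers per business service over time is one simple way to show whether troubleshooting is actually improving.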

A reliable platform allows teams to do their best work

“What I like the most about New Relic is that it’s reliable, it works. It does what it says on the tin. I like the people, I like the support that they give me. You could have a five-star product; if you haven't got the support, you may well not have a product at all,” says Stephen.

“In terms of reliability, it's 100%. There's been absolutely no downtime. We don't have a problem with it at all. It's become a bit of a cliche. We don't worry about it. Mean time to resolve, it's now much better at 80%. We were down at 50-60%. And it was just untenable what we had before. It's reliability alone that has allowed the teams to do whatever they need to do now. And they're not just concentrating on reviving dead product,” says Stephen.

“We have three big events that we need to seriously prepare for and it's a nightmare for observability. The Grand National to us is the equivalent of five Saturdays all rolled into one. And we thought that not a single monitoring platform would be able to cope with the Grand National. New Relic just did. In the last three Grand Nationals, we've not had to involve New Relic, because it's just worked. It's been stable, it's kept ingesting data. We've had no failures. What more could you ask?” says Stephen.