Like many startups, DAZN has a number of engineering teams, each with different products, features, and targets to achieve. One challenge for DAZN Head of SRE Craig Mclean is consolidating and standardizing the tech stack across those teams. The company’s “reverse pyramid model,” where each team can choose its own product and monitoring tools, adds another layer of complexity.

In a recent observability panel, now available to watch on demand, Craig shares some advice on testing, retrospectives, and getting to 5,000 live releases per day.

‘Anything that fails, we retrospect it’

Anyone who writes, operates, and deploys software knows that at one point or another, something is going to fail. But building that failure into the process, and learning from it, is key to innovation.

“If you just fail and keep failing—you're just failing,” Craig says. “If you fail and learn, then you are embracing Agile. You're basically saying there is a better way to do it. And retros are a key way to do that.”

DAZN has a standard documentation flow that blamelessly captures events and facts to identify what went well and what didn’t go so well. “It's good to reinforce good behaviors as it is to coerce away from bad and worse behaviors,” says Craig.

“We ticket up anything that needs to be fixed and we learn things. It's essentially the same way we deal with scrum—where at the end of every sprint you do a retrospective,” says Craig. “This is a generic day-by-day improvement, and this is how we’ve got to where we are today. We didn't all start out as experts at this stuff. We started out as reasonably good engineers … but we got better every time we did it.”

The journey to 5,000 live releases a day

“10s, 100s, 1000s, it's all the same problem,” Craig says. “I didn’t walk into DAZN and snap my fingers and suddenly all this stuff sprang into life. This has taken years.” 

“Like any big problem, the trick is to break it down into manageable chunks. This is where microservices help,” he says.

Craig’s advice for getting over the fear of instant release is to pick a small service that has limited impact but visibility across the business. “Don't worry if you break it. Because you have a test environment, I hope—or a staging environment or a dev environment at the very least—you can test it in there and break it in there and roll it forward through the lifecycle until you get to production, and then get comfortable with the idea of releasing like that.”

DAZN now measures metrics across its tech stack, embedding observability to surface performance issues at every stage from ideation to ticket delivery time, but it didn’t start like that.

“The journey of a million miles starts with a single step. You have to take that first step. You have to get used to it. You have to build policies and practices and ideas and get your developers in that mindset of knowing that, ‘well, this is a small thing, I can do a PR for this and it will automatically release itself once I tag it.’”

“If you want to build it, just try it. And then if you're going to observe it, don't worry about the epic tooling that's out there that does all this observability, worry about your golden signals. Worry about things like saturation, error rates, latency rate, that sort of thing. Just measure those things. Because you start there. You've got to start somewhere.” 
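
As a rough illustration of starting with the basics Craig lists, the sketch below summarizes one window of requests into error rate, latency, and saturation, plus traffic, which is usually counted as the fourth golden signal. The `Request` type, the `golden_signals` function, and the `in_flight` and `capacity` parameters are hypothetical names chosen for this example; nothing here describes DAZN's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request ended in an error


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one observation window into four basic signals.

    `in_flight` and `capacity` are assumed inputs: how many requests the
    service is handling right now versus how many it can handle at once.
    """
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    if total:
        # Nearest-rank 95th percentile, using integer math for the rank.
        rank = (95 * total + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "latency_p95_ms": p95,
        "saturation": in_flight / capacity,
    }


# Example window: 20 requests over 10 seconds, two of them failed.
window = [Request(duration_ms=10.0 * i, ok=(i not in (5, 15)))
          for i in range(1, 21)]
signals = golden_signals(window, window_seconds=10, in_flight=8, capacity=10)
```

Even a summary this small is enough to plot or alert on, which fits the advice to start somewhere before reaching for heavier tooling.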

The proof is in the testing

It’s also important to find out what your service will do under stress by generating that stress yourself, Craig says. “If it's not tested, it's worthless.”

“What if I threw random customer data at my login screen? What if I started throwing in special characters into logins? Or passwords? I've seen more than once passwords fail because they've had special characters in them, and some backend system is stripping those special characters out so the password that's stored in the database is not the password that you put in.”
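
The password failure Craig describes can be caught with a simple round-trip test: register each candidate password, then check that the same password logs in. The sketch below is a minimal illustration under stated assumptions. `buggy_register` is a hypothetical stand-in for a backend that strips special characters before storing, and the candidate list and iteration count are made up for the example.

```python
import hashlib
import hmac
import os
import re

# Kept low so the sketch runs quickly; not a production setting.
ITERATIONS = 10_000


def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def register(password: str, salt: bytes) -> bytes:
    """Store a hash of exactly what the user typed."""
    return hash_password(password, salt)


def buggy_register(password: str, salt: bytes) -> bytes:
    """Hypothetical bug: a backend step strips special characters first."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", password)
    return hash_password(cleaned, salt)


def login(password: str, salt: bytes, stored: bytes) -> bool:
    return hmac.compare_digest(hash_password(password, salt), stored)


def failing_passwords(register_fn, candidates):
    """Return every candidate that cannot log in after registering."""
    salt = os.urandom(16)
    return [p for p in candidates if not login(p, salt, register_fn(p, salt))]


CANDIDATES = [
    "plainPassword1",
    "p@ss w0rd!",
    "tab\there",
    "quote'\"<>&",
    "ünïcödé-密码",
    "semi;colon--",
]

good = failing_passwords(register, CANDIDATES)
bad = failing_passwords(buggy_register, CANDIDATES)
```

Here `good` comes back empty, while `bad` contains every candidate except the purely alphanumeric one. The same round-trip check could be driven by randomly generated inputs rather than a fixed list.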

“Sometimes it's just about the sheer volume of traffic. Do you know what's going to happen if you throw 10 million concurrent users at your website?”

DAZN has load-tested its components to ensure the site can take millions of concurrent users, and it knows what to do if it needs to scale to the next million too.

“When you combine stress testing in some form with observability, not only do you see the system break but you see exactly where it broke and you can go and attack that one problem.”
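
One way to picture stress testing combined with observability is a stepped load test that records the error rate at each level of concurrency and reports the first level at which the service breaks. The sketch below is only a toy under stated assumptions: `FakeService` is a hypothetical in-process stand-in that rejects requests beyond a fixed capacity, and the step sizes and threshold are invented. A real test would drive an actual endpoint with a proper load-testing tool.

```python
import asyncio


class FakeService:
    """Hypothetical stand-in for a real endpoint with limited capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.in_flight = 0

    async def handle(self) -> bool:
        if self.in_flight >= self.capacity:
            return False  # overloaded: reject the request
        self.in_flight += 1
        try:
            await asyncio.sleep(0.01)  # simulated work
            return True
        finally:
            self.in_flight -= 1


async def run_step(service: FakeService, concurrency: int) -> float:
    """Fire `concurrency` requests at once and return the error rate."""
    results = await asyncio.gather(*(service.handle() for _ in range(concurrency)))
    return results.count(False) / concurrency


async def find_breaking_point(service, steps, max_error_rate=0.01):
    """Increase load step by step; return the first step that breaks, or None."""
    for concurrency in steps:
        error_rate = await run_step(service, concurrency)
        print(f"concurrency={concurrency} error_rate={error_rate:.2f}")
        if error_rate > max_error_rate:
            return concurrency
    return None


breaking_point = asyncio.run(
    find_breaking_point(FakeService(capacity=50), [10, 25, 50, 100, 200])
)
```

The per-step output is the observability half of the exercise: it shows not just that the service failed but the load level at which errors first appeared.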