As a travel search aggregator, Skyscanner draws data from multiple vendors and services outside of its control. In the past, Skyscanner used runbooks to address specific failure modes. But with a distributed infrastructure, some errors were unknowns, which made runbooks less useful for team problem-solving. Teams needed a general approach to digging into the root cause of a problem rather than a pattern to follow. With an observability platform, they could quickly pull in datasets for actionable insights into what was really going on in any given scenario.
Skyscanner uses OpenTelemetry (OTel) and other open standards as part of its open source architecture approach. The OpenTelemetry Protocol (OTLP) allows Skyscanner to point the standard OTLP exporter in the Collector at New Relic, which accepts OTLP natively. This ecosystem provides a single open standard for instrumentation that future-proofs operations by avoiding vendor-specific packages. It also allows advanced data sampling, which can help reduce overall data ingest as well as the time it takes to get to the core of an incident or problem.
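As a rough illustration of how little vendor-specific setup this involves, here is a minimal sketch using the OpenTelemetry Python SDK. The service name, sampling ratio, and license key placeholder are illustrative, and the endpoint shown is New Relic's documented US OTLP endpoint; in Skyscanner's setup the export happens from the Collector rather than directly from the SDK, but the same standard OTLP exporter applies:

```python
# Minimal sketch: exporting spans over OTLP to New Relic's native OTLP endpoint.
# Service name, sampling ratio, and the license key placeholder are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "flight-search"}),
    # Simple probabilistic head sampling as an illustration; more advanced
    # (e.g. tail-based) sampling is typically done in the Collector.
    sampler=ParentBased(TraceIdRatioBased(0.25)),
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://otlp.nr-data.net:4317",       # New Relic US OTLP endpoint
            headers={"api-key": "YOUR_NEW_RELIC_LICENSE_KEY"},
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("search-flights"):
    pass  # instrumented application work goes here
```

Because the exporter speaks plain OTLP, swapping backends is a configuration change rather than a re-instrumentation effort.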
Observability game days at Skyscanner
Observability game days, which simulate real-world incidents, are a fun approach to training developers. Skyscanner uses the official OTel demo for this exercise, with New Relic as its observability platform. These game days give developers the autonomy to address incidents as they arise, rather than following a formula that might not fit the conditions they are experiencing.
Using a “wheel of misfortune,” teams spin to land on a specific incident at random. This keeps everyone on their toes by mimicking the chaotic, unexpected nature of a real incident, and is a fun icebreaker to start the day. Skyscanner’s game day facilitator, the game master (GM), divides participants into teams and tasks them with identifying and resolving the issue as quickly as possible.
Such simulations help surface which teams might need additional training on New Relic: teams less familiar with the platform tend to resolve an incident in a less efficient way. The GM can then show how New Relic could have been used to identify and resolve the issue faster, for example by introducing a tool or view the team hadn’t used. After each round, a new participant spins the wheel and the process is repeated for the next incident.
When assigning the GM role, it’s most effective to choose observability experts within the organization: people who can demonstrate the tooling and features that speed up understanding of the variables influencing an incident. Having a guide of “happy paths,” or approaches to solving each incident, in the game day catalog helps GMs point teams to the relevant features so they can resolve their observability challenges faster and more accurately.
While happy paths represent a preferred route, team members can come up with other approaches that are equally valid, and the feedback sessions allow these to be shared and learned by all teams. In some cases, teams will even find a better happy path than the observability expert. That’s the beauty of this approach to game days: sometimes teams work cohesively and come up with the best solutions, and at other times the exercise reveals an organizational gap in which New Relic features are being used.
Game day identifies new opportunities to use error tooling
On game days, teams were originally less inclined to use the errors inbox feature and the correlation between distributed tracing and logs available in New Relic. While teams relied on distributed tracing, they weren’t necessarily starting their debugging with it in a way that helped them track an error back to its root cause.
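For context, the trace-log correlation in question can come from very little code. The sketch below assumes the OpenTelemetry Python SDK, the opentelemetry-instrumentation-logging package, and a tracer provider configured as in the earlier sketch (it is not necessarily what Skyscanner runs): once instrumented, each log line carries the active trace and span IDs, which is what lets a backend such as New Relic link logs to distributed traces.

```python
# Minimal sketch of trace/log correlation with OpenTelemetry's Python logging
# instrumentation (assumes opentelemetry-instrumentation-logging is installed
# and a TracerProvider has already been configured).
import logging

from opentelemetry import trace
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Injects otelTraceID / otelSpanID / otelServiceName into the stdlib logging format.
LoggingInstrumentor().instrument(set_logging_format=True)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("get-product"):
    # This log line now carries the IDs of the span above, so the log record
    # can be joined to the distributed trace it belongs to.
    logging.getLogger(__name__).error("GetProduct failed for product id %s", "example-id")
```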
One incident on the wheel of misfortune required scanning all services to identify where errors were occurring, then digging into each issue to understand the root cause. The incident drew on data from a Product Catalog service error that had been observed in the past.
Teams first had to open all entities and choose the service with the highest error rate:
By clicking on the service, teams would then see the errors inbox:
This was a new approach for some team members, who were not used to treating the errors inbox as the first step in finding a solution. By drilling down into the error group, teams could filter to show only errors with an associated distributed trace:
From here, teams could look at all errors with distributed traces and then delve deeper into the HTTP POST flow for this error:
Using this entity map, teams identified that the error originated in the frontend and occurred in the GetProduct call, which in turn revealed the product ID involved.
By then, teams understood where the error initially occurred, the transaction/request involved (i.e., GetProduct in the Product Catalog service), and the specific product ID that led to the issue. However, teams were still unclear whether this was a generic issue with the service or linked to the specific product.
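That kind of drill-down only works because the failing call records the relevant details on its span. As a purely illustrative sketch (the attribute name, catalog stub, and product IDs are invented, not Skyscanner’s or the OTel demo’s actual code), a GetProduct implementation might attach the requested product ID and the exception to the active span, which is what surfaces both in the distributed trace and the errors inbox:

```python
# Illustrative only: how a GetProduct handler could expose the product ID and the
# failure through its span. Attribute name, product IDs, and catalog stub are invented.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("product-catalog")

_CATALOG = {"OK-123": {"id": "OK-123", "name": "placeholder product"}}

def get_product(product_id: str) -> dict:
    with tracer.start_as_current_span("GetProduct") as span:
        span.set_attribute("app.product.id", product_id)  # invented attribute name
        try:
            return _CATALOG[product_id]  # raises KeyError for any unknown product ID
        except KeyError as exc:
            # Recording the exception and error status is what turns this call into
            # an error group with a linked distributed trace in the backend.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, f"unknown product {product_id}"))
            raise
```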
The next stage involved teams clicking on the Product Catalog service link to see the service telemetry data:
Again using the errors inbox feature, teams saw the product ID involved and clicked on the error profile:
Finally, they uncovered that the error was attributable solely to this one product ID:
Root cause unlocked!
Implementing game days in your organization
Observability game days require some forward planning: having a number of incident examples and setting up dummy dashboards with the data is important. If you are using OTLP standards, it is often easy to pull data from previous incidents, especially those where a feature flag had been set up, into a dummy dataset and use it to build the examples for the wheel of misfortune. This also allows full tracing data to be imported so that teams can analyze regressions to solve issues. Where possible, look for incidents that involve both back-end and front-end use cases so that all engineering teams can learn during the day.
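If you want to fabricate an incident rather than replay one, a feature-flag-guarded fault is often enough. The sketch below is a generic illustration rather than the OTel demo’s actual flag mechanism or names: when the flag is on, one product ID starts failing, giving teams exactly the kind of traceable error described in the walkthrough above.

```python
# Generic game day fault-injection sketch; the flag name, environment-variable
# mechanism, and failing product ID are all invented for illustration.
import os

FAILING_PRODUCT_ID = "GAME-DAY-SKU-001"

def product_catalog_failure_enabled() -> bool:
    # Stand-in for a real feature-flag provider; flipping this starts the incident.
    return os.getenv("PRODUCT_CATALOG_FAILURE", "false").lower() == "true"

def get_product(product_id: str) -> dict:
    if product_catalog_failure_enabled() and product_id == FAILING_PRODUCT_ID:
        raise RuntimeError(f"GetProduct failed for product {product_id}")
    return {"id": product_id, "name": "placeholder product"}
```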
You don’t need to buy a full raffle spinning wheel; any sort of interactive random number generator will add color and flair to your sessions and create the game day culture you want to cultivate to make this fun and encourage engagement. Use this process as a bit of an icebreaker to get participants over the wariness of “yet another training session” vibes that might make people reluctant to engage.
Participants at Skyscanner overwhelmingly found the game day approach to learning exciting, and it increased their confidence in debugging their production systems. Over 90% wanted more opportunities to use game days as a way of improving their observability work. The fun and interactive format, the countdown-timer approach to team problem-solving, and the opportunity to work with colleagues on a clear problem helped all participants improve, regardless of their starting skill level.
New Relic is using Skyscanner’s feedback to build and improve the platform, with Skyscanner engineers in direct communication with engineers at New Relic.
Read Skyscanner engineer Jordi Bisbal Ansaldo’s account of how Skyscanner increased engineers' confidence in incident management using a game on GitHub.