Skyscanner pairs open standards with observability

Business Challenge: Tool Consolidation

Skyscanner started as a flight search engine in 2003.

Today, millions of travelers rely on its app and website each month to plan and book their trips. To support its continued growth, Skyscanner needed visibility into all its services and how they interacted with each other and with third-party services. The company had built a custom monitoring system, but it was becoming complex to manage: many engineers were tied up maintaining siloed tools and separate vendor relationships, and real user monitoring didn’t correlate with backend data. Skyscanner soon began a multi-year project to centralize and simplify its data, vendors, and infrastructure without vendor lock-in.

~15 minutes saved per merge request on mobile build pipelines by identifying and fixing bottlenecks

internal and external systems retired

Commitment to OTel and open standards

Investing in open standards and engineer development go hand in hand for Skyscanner, as a company heavily based on Cloud Native Computing Foundation (CNCF) projects. Virtually all Skyscanner workloads run on Kubernetes and Amazon Web Services (AWS). Implementing open standards helps Skyscanner reduce toil for teams, understand applications and systems, and detect and debug regressions.

Skyscanner continually weighed whether to buy or build a solution to monitor its open source tech stack. As the company and its data grew, it needed the analysis and correlation of an out-of-the-box solution, in addition to dashboards and alerting, storage and querying, and export and transport. The long-term savings in developer hours and budget convinced Skyscanner to partner with an experienced observability vendor and focus its own resources on key business goals. That vendor needed to be aligned with Skyscanner: committed to open source and OpenTelemetry (OTel), integrated with CNCF, and applicable to the entire tech stack.

Open standards and observability

By pairing open standards with observability, Skyscanner gets a single stream of correlated data. The instrumentation is completely open source, so there is no vendor lock-in. This meant Skyscanner could migrate to New Relic gradually with OTel, with minimal disruption to engineers. When data is sent to New Relic, tagging and semantic conventions mean alerts and dashboards can be spun up quickly. When something goes wrong, dashboards show what’s happening down to the code, and the relevant team is alerted immediately.

Data is distributed across teams through automated processes and Terraform definitions. Terraform enables infrastructure as code, so definitions can be written quickly for all services according to different fields.

“We wouldn’t be able to build something as powerful as Terraform internally,” says Michael Tweed, Principal Engineer at Skyscanner. “It’s something that has a lot of power that can be used with any platform, including sending the data to multiple places. Terraform was a big part of that smooth transition to New Relic.”

Once a concept or library is instrumented, Skyscanner doesn’t have to change code depending on where the data is sent: service owners can switch between telemetry backends without touching the instrumentation. As a result, telemetry libraries and export pipelines can be simplified.
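One way to picture backend-independent instrumentation is to keep the export destination in configuration rather than in code. The following is a minimal sketch, not Skyscanner's actual setup. It reads the standard OpenTelemetry environment variables `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS`; the `ExporterConfig` class, the `record_search` function, and the example URLs are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExporterConfig:
    """Where telemetry is sent. Hypothetical helper type for this sketch."""
    endpoint: str
    headers: dict


def exporter_config_from_env(env=None):
    """Build the export destination from standard OTLP environment variables."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
    headers = {}
    # Headers use the form "key1=value1,key2=value2".
    for pair in env.get("OTEL_EXPORTER_OTLP_HEADERS", "").split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            headers[key.strip()] = value.strip()
    return ExporterConfig(endpoint=endpoint, headers=headers)


def record_search(exporter, route):
    """Instrumented call site: it never names a specific backend."""
    return {
        "name": "flight-search",
        "attributes": {"route": route},
        "sent_to": exporter.endpoint,
    }


if __name__ == "__main__":
    # Switching backends is a configuration change only (URLs are made up).
    backend_a = exporter_config_from_env(
        {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://backend-a.example.com:4317"}
    )
    backend_b = exporter_config_from_env(
        {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://backend-b.example.com:4317"}
    )
    print(record_search(backend_a, "EDI-BCN")["sent_to"])
    print(record_search(backend_b, "EDI-BCN")["sent_to"])
```

The call site stays identical for both destinations; only the environment differs.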

“The integrations with OTel and Terraform have allowed us to move fast. New Relic also complemented areas in which OTel wasn’t so stable yet, like browser monitoring and mobile agents, while still integrating with open standards like trace context propagation from these user devices to backend services,” says Daniel Gomez Blanco, Principal Software Engineer at Skyscanner, and author of “Practical OpenTelemetry: Adopting Open Observability Standards Across your Organization.”
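The trace context propagation mentioned in the quote is commonly done with the W3C Trace Context `traceparent` header, which carries a trace ID from the device to backend services. The sketch below is an illustration only, assuming version `00` of the header format; the function names are hypothetical and this is not the code of any New Relic agent or OTel SDK.

```python
import re
import secrets

# traceparent: version "00", 32-hex trace ID, 16-hex parent ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace, for example on a mobile device."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header's fields, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid under the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Continue the caller's trace in a backend service with a new span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # No valid context: start a fresh trace.
    # Only the sampled flag is carried over in this sketch.
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

Because the trace ID survives every hop, a request that starts on a phone can be correlated with the backend spans it triggers.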

Simple pricing model

Skyscanner has since transformed how it handles data for reporting, tracking, and monitoring. Each team is responsible for the telemetry data it ingests. Dashboards, including a cost-monitoring dashboard custom-built by New Relic, visualize ingest and promote the use of good telemetry signals.

“The billing model of pay per gigabyte allows us to distribute those costs to teams within the organization. So everyone knows the cost of the telemetry data they are producing. They can make informed decisions about the return on investment, and use signals like distributed tracing to debug their services faster—and cheaper,” says Daniel.

With New Relic, Skyscanner can see costs broken down by team, system, and data. Teams can continuously review their ingest alongside product health and service costs. “Once the data quality is better, you get insights at a cheaper price,” says Daniel.
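The arithmetic behind a per-team cost breakdown under a pay-per-gigabyte model is simple to sketch. The example below is illustrative only: the record fields, team names, ingest volumes, and the price of 0.50 per GB are all made-up assumptions, not Skyscanner's or New Relic's figures.

```python
from collections import defaultdict


def ingest_cost_by_team(records, price_per_gb):
    """Sum ingested gigabytes per team and convert them to cost."""
    gigabytes = defaultdict(float)
    for record in records:
        gigabytes[record["team"]] += record["gigabytes"]
    return {team: round(gb * price_per_gb, 2) for team, gb in gigabytes.items()}


if __name__ == "__main__":
    # Hypothetical ingest records, tagged with the owning team.
    records = [
        {"team": "search", "data_type": "traces", "gigabytes": 120.0},
        {"team": "search", "data_type": "logs", "gigabytes": 30.0},
        {"team": "mobile", "data_type": "events", "gigabytes": 40.0},
    ]
    print(ingest_cost_by_team(records, price_per_gb=0.50))
```

Grouping by a different tag, such as system or data type, only requires changing the key used for the sum.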

Skyscanner was able to replace over 12 monitoring and point solutions with New Relic. Previously, 10 experienced engineers were dedicated to maintaining these monitoring solutions. Now they can focus on accelerating observability adoption rather than on tool upkeep, and the time and costs saved by reducing tools can be reinvested in Skyscanner’s technology.

Before New Relic, we didn't really enable engineers to produce data that is well-structured, meaningful, and cost-effective. With New Relic, we have established a default set of metrics, alerts, and dashboards, giving our team the observability experience from the get-go.

Leading mobile in the industry

“Prior to New Relic, we tracked errors and crashes, but we had a scattered approach,” says Michael. “We had 10 mobile teams working across different timezones and offices. But we had no consistent way of doing things. We weren’t able to answer questions about how features were performing in relation to each other, or the availability of different sections in the app.”

Skyscanner used custom code for mobile requests, and working with it required specialist knowledge: data scientists had to help query the data. Adding even a small metric took days, from merging code to building dashboards, because it relied on other internal pipelines. Troubleshooting then fell to a single central team.

When Skyscanner consolidated data and processes with New Relic, the New Relic mobile SDK took over mobile requests and mobile request error integration, making the custom code redundant. New Relic Query Language (NRQL) lets engineers query and make sense of large volumes of data, and quickly put together dashboards to look for patterns. New Relic also helped Skyscanner enforce conventions and patterns by using abstractions internally. Now, Skyscanner has the confidence to spread alerts and responsibilities across teams, down to the individual level. If a certain feature in an application is causing spikes, that alert is automatically routed to the relevant team.
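Routing an alert to the owning team usually depends on consistent tagging. The sketch below shows the idea in its simplest form; the `feature` tag, the ownership table, and the team names are hypothetical, and this is not how New Relic's alert routing is implemented.

```python
# Hypothetical ownership table: feature tag -> owning team.
OWNERS = {
    "flight-search": "search-squad",
    "booking": "booking-squad",
}
DEFAULT_TEAM = "mobile-platform"


def route_alert(alert, owners=OWNERS, default=DEFAULT_TEAM):
    """Pick the team to notify from the alert's feature tag."""
    feature = alert.get("tags", {}).get("feature")
    return owners.get(feature, default)


if __name__ == "__main__":
    alert = {"title": "Error rate spike", "tags": {"feature": "booking"}}
    print(route_alert(alert))
```

An alert without a recognized tag falls back to a default team, which is why enforced tagging conventions matter for this kind of routing.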

“We carried out surveys before and after the migration to New Relic. All mobile engineers said they felt more confident in our monitoring capabilities and features being in production,” says Michael.

Skyscanner now has full coverage and adoption of New Relic across all mobile squads.

Applying SLO and SLI principles to mobile

Now that Skyscanner no longer has to dedicate engineering resources to maintaining the monitoring stack, it has time to apply the site reliability engineering (SRE) principles of service level objectives (SLOs) and service level indicators (SLIs) to mobile, which is groundbreaking for the industry. “We’re one of the first companies doing this. A lot of people are interested in our approach,” says Michael.

Monitoring SLOs used to be a manual process. Skyscanner had an internal service-level management solution that allowed anyone to define SLOs and get alerted based on certain HTTP and gRPC metrics. These metrics came from backend services and helped monitor the health of API usage but didn’t focus on the traveler success metrics or user experience. To make matters more difficult, this set of pipelines was maintained by custom code that wasn’t integrated into different parts of the system, so it was hard to identify why an SLO was broken.

New Relic allows Skyscanner to create an SLO from any metric, event, or other telemetry data, not just simple HTTP or gRPC metrics. The result is a standardized SLO definition, regardless of whether the data comes from the frontend or the backend.

Service definitions can be integrated with Terraform to handle everything as code. Once SLO targets are defined as code, teams care. “Now, when SLOs are broken, product managers care about the user experience, and react, it helps them balance product health and feature delivery,” says Daniel. By presenting SLOs in a unified way, engineers see the breach, which instantly links back to events. “It's a standard format that makes it very easy for engineers to investigate, and see, where this is happening,” says Michael.

Previously, engineers could choose from only 10 or 11 thresholds for response time SLOs with the internal solution. With New Relic, using NRQL, Skyscanner set up latency SLOs that allow thresholds of any value. “It’s super flexible and something we’ve always wanted to implement, but we had technical challenges before,” says Daniel.
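A latency SLI with an arbitrary threshold comes down to the share of requests that finish within that threshold, compared against a target. The sketch below computes this in plain Python rather than NRQL; the sample durations, the 437 ms threshold, and the 75% target are made-up values for illustration.

```python
def latency_sli(durations_ms, threshold_ms):
    """Fraction of requests that completed within the threshold."""
    if not durations_ms:
        return None  # No traffic, so there is nothing to measure.
    good = sum(1 for duration in durations_ms if duration <= threshold_ms)
    return good / len(durations_ms)


def slo_met(sli, target):
    """True when the measured SLI reaches the SLO target."""
    return sli is not None and sli >= target


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    durations = [120, 340, 95, 810, 230, 415, 180, 275, 1290, 160]
    sli = latency_sli(durations, threshold_ms=437)  # Any threshold value works.
    print(sli, slo_met(sli, target=0.75))
```

Because the threshold is just a parameter, it is not limited to a fixed list of predefined buckets.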

My teams aren’t disconnected or buried in the complexity of our services, alerts, logs, or data. Our integration platform gives one clear view, so issues can be identified and fixed before it impacts our customers. That’s what truly matters.

From reactive to proactive

“The moment that you put data in front of teams they become proactive,” says Daniel.

Skyscanner squads use New Relic to visualize what’s happening, what services are doing, and how features are performing in a distributed environment. With New Relic, Skyscanner can define how to track performance and availability metrics, in addition to metrics that are interesting for the business, like the time it takes the user to see the first and last search result.

With all of this data available, Skyscanner has prevented a lot of near misses. “It’s made us proactive, instead of reactive,” says Michael. “Previously, we’d find out after something had happened and fix it.” By moving beyond simple monitoring, beyond app crashes, Skyscanner can see key user journey moments and see into the customer experience.

“What we’ve seen from incident calls and postmortems is that teams that make use of New Relic, and the correlation between services via distributed tracing, are able to find the root cause of a regression faster. Sometimes the most experienced engineer loses out to the new engineer that knows how to use New Relic—in terms of MTTR,” says Daniel.

New Relic offers no vendor lock-in and combines all telemetry data (metrics, events, logs, and traces) in one place, with the ability to partner with other organizations via infrastructure and AWS integrations, including Amazon CloudFront and Kinesis Data Firehose. By instrumenting with New Relic, Skyscanner standardized its tools and data, which resulted in cost savings in both engineering hours and tool provisioning.
