A version of this post previously ran on Diginomica.
Most companies adopt modern software development approaches such as microservices architectures for compelling business reasons: to accelerate time to market, gain greater agility to respond to emerging opportunities, and deliver differentiated experiences that grow the customer base, customer lifetime value, and market share. Modern software development is good for the bottom line.
But wait, there’s more to the business case than simply delivering innovative new applications using a microservices approach. The improvements to the bottom line are only sustainable if your company maintains and improves the software and the digital customer experience—something that becomes increasingly difficult to achieve in complex microservices environments.
That’s why companies need to think about the economics of observability and, specifically, the business case for a managed distributed tracing solution.
The impact of microservices on MTTR
The downside of modern environments and architectures is complexity, which makes it harder to quickly diagnose and resolve the performance issues and errors that affect customer experience. This increases mean time to resolution (MTTR), which means customers may be affected for longer.
In DORA’s annual State of DevOps report, elite performers had an average time to restore service of less than one hour, compared to high and medium performers that restored service in less than one day. Low performers reported an average time to restore of between one week and one month.
If moving to a microservices architecture increases your MTTR from one hour to four hours or longer, how does that potentially impact your business metrics like revenue or customer service costs? And more importantly, what can you do to improve your MTTR in a microservices environment?
Distributed tracing to the rescue
Distributed tracing enables teams to understand the flow of requests through a microservices environment and pinpoint where failures or performance issues are occurring and why. It’s table stakes for helping software teams shorten MTTR, minimize the impact of issues on customers, and understand the effect of code changes on the customer experience.
However, not all distributed tracing solutions deliver the same business value. Some cost far more to operate than many companies can afford, which means that teams have to make choices that limit the quality and quantity of data they can use for pinpointing issues and restrict access for teams to distributed tracing capabilities.
Managed versus unmanaged solutions
Distributed tracing solutions that your company deploys and manages on its own may seem cost-effective when you consider the pricing plan from the vendor, but in actuality, they require costly “heavy lifting” on your company’s part to configure, manage, and operate them on an ongoing basis.
For example, you’ll probably need the equivalent of a full-time IT operator to scale the distributed tracing software, optimize it, and load balance it. Unmanaged tools create an unnecessary operational burden because you have to staff and plan for operating gateways, proxies, and satellites, including how to handle usage spikes, resiliency, and scalability in the underlying infrastructure for the tracing software.
A managed distributed tracing solution, that is, a software-as-a-service (SaaS) solution, is more cost-effective because you don’t need to dedicate staff to managing and operating the software. The price point is almost always better than a do-it-yourself approach where you manage the software on your own. You’re also not diverting resources from your core business.
All relevant trace data versus a random sample
Distributed tracing generates a massive amount of telemetry data. It’s not unreasonable to assume a distributed tracing tool could ingest 25 million spans per minute (a span is an individual call within a request), which would add up to more than a trillion spans each month. If each span is 500 bytes, that is roughly 540 terabytes of data each month, which could result in a multimillion-dollar monthly bill just for data storage and network data transfer costs.
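To make the arithmetic behind those figures concrete, here is a minimal back-of-the-envelope sketch. The span rate and span size come from the paragraph above; the 30-day month and the function names are assumptions made for illustration. It only estimates volume and does not attempt to model any provider's pricing.

```python
# Back-of-the-envelope span volume, using the figures quoted in the text.
SPANS_PER_MINUTE = 25_000_000  # from the text
BYTES_PER_SPAN = 500           # from the text
DAYS_PER_MONTH = 30            # assumption: a 30-day month

MINUTES_PER_MONTH = 60 * 24 * DAYS_PER_MONTH


def monthly_spans(spans_per_minute: int = SPANS_PER_MINUTE) -> int:
    """Total spans ingested in one month at a constant rate."""
    return spans_per_minute * MINUTES_PER_MONTH


def monthly_bytes(spans_per_minute: int = SPANS_PER_MINUTE,
                  bytes_per_span: int = BYTES_PER_SPAN) -> int:
    """Total raw span data in one month, in bytes."""
    return monthly_spans(spans_per_minute) * bytes_per_span


if __name__ == "__main__":
    print(f"{monthly_spans():,} spans per month")
    # Decimal terabytes (1 TB = 10**12 bytes).
    print(f"{monthly_bytes() / 10**12:,.0f} TB per month")
```

Under these assumptions the total is about 1.08 trillion spans and about 540 TB of raw span data per month, before any compression, indexing, or replication.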
Because collecting and storing data for every span can quickly become cost prohibitive, companies that self-manage their distributed tracing software often choose to collect and store only a random sample of the data, an approach called head-based sampling. The major drawback of this approach in microservices-based environments is that your teams likely won’t get the data they need to reduce MTTR: the decision to sample is made before traces have completed, so traces with errors might be sampled out.
In contrast, tail-based sampling captures and analyzes 100% of traces and then visualizes the most actionable data. In a managed distributed tracing solution like New Relic Edge with Infinite Tracing, tail-based sampling lets you observe every span and save all the ones that contain errors, unusual latency, or anomalies because there’s unlimited, on-demand scalability.
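The difference between the two sampling strategies can be shown with a toy sketch. This is an illustration only, not how New Relic or any other product implements sampling; the `Trace` class, the function names, the synthetic data, and the 500 ms latency threshold are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: int
    duration_ms: float
    has_error: bool


def make_traces(n: int = 1000) -> list:
    """Synthetic data (hypothetical): 1% of traces fail, a few are slow."""
    traces = []
    for i in range(n):
        has_error = i % 100 == 0
        duration_ms = 900.0 if i % 250 == 1 else 50.0
        traces.append(Trace(i, duration_ms, has_error))
    return traces


def head_sample(traces, rate: float, rng: random.Random) -> list:
    """Head-based: decide up front, without looking at the outcome."""
    return [t for t in traces if rng.random() < rate]


def tail_sample(traces, latency_threshold_ms: float) -> list:
    """Tail-based: inspect every completed trace, keep errors and slow ones."""
    return [t for t in traces
            if t.has_error or t.duration_ms > latency_threshold_ms]


if __name__ == "__main__":
    traces = make_traces()
    total_errors = sum(t.has_error for t in traces)

    head = head_sample(traces, rate=0.1, rng=random.Random(0))
    tail = tail_sample(traces, latency_threshold_ms=500.0)

    print(f"errors in all traces: {total_errors}")
    print(f"errors kept by head-based sampling: {sum(t.has_error for t in head)}")
    print(f"errors kept by tail-based sampling: {sum(t.has_error for t in tail)}")
```

In this sketch the head-based sampler never reads `has_error` or `duration_ms`, so which failing traces it keeps is left to chance. The tail-based sampler sees every completed trace and keeps all of the failing and slow ones, at the cost of having to observe everything first.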
Democratized observability versus reduced access to tracing
Some solutions require DevOps or site reliability engineering (SRE) teams to manually handle many of the tasks associated with tail-based sampling and storage of the tracing data. This creates additional complexity and effort, which can discourage engineering teams from using distributed tracing in a meaningful way to monitor their microservices.
Instead, a solution that automates many of the tasks associated with distributed tracing makes it easy for teams to adopt and use the tool. A managed, automated solution lets your company democratize access to the telemetry data your teams need to pinpoint issues and reduce MTTR. (By the way, this white paper is an excellent resource on how to reduce MTTR using best practices for fast incident response.)
The business case for observability of microservices
Delivering new features faster using a microservices architecture is only half the battle. Your engineering and operations teams have to keep those features available and performing to customers’ expectations or your company risks losing the business benefits that your innovative new digital experiences can deliver.
A managed distributed tracing solution lets you observe every trace to find and resolve issues in complex systems without the operational burden of running on-premises tracing software yourself. It helps you protect your investment in your modern software and the business benefits you gain from it, resulting in economic value that you can measure in improved MTTR, better performance, and improved customer experience.
To learn more about distributed tracing, including how it works and when you should use it, read the ebook “A Quick Introduction to Distributed Tracing: Gain Visibility and Reduce MTTR in Complex Application Environments.”