The 10 Principles of Observability: Guideposts on the Path to Success with Modern Software

An edited version of the following post appeared initially on Diginomica.com on July 29, 2019.

Observability is an increasingly important concept for enterprise technology teams. That’s because even as new technologies and approaches—cloud, DevOps, microservices, containers, serverless, and more—increase velocity and reduce friction getting from code to production, these innovations also introduce complex new challenges. “It’s amazing how much stuff has to work perfectly for things to work in production,” New Relic Founder and CEO Lew Cirne told the audience at our FutureStack18 conference last September.

True observability—not merely tactical monitoring—is key to mastering this complexity and fully understanding what’s happening in your company’s software and systems. But what does that mean in the real world? How do we define a modern observability platform?

We think it’s a good idea to start with 10 core principles:

The 10 principles of observability

1. Curation vs. participation

A modern observability platform excels at curation: cutting complexity down to size, and selecting and presenting relevant insights for users. But such a platform should also support participation—for example, making it easy for users to work with custom metrics and data sources.

Curation and participation are equally important in a modern observability platform. Curation gives teams a critical productivity and efficiency edge: the smaller the haystack, the easier it gets to find the needle. (New Relic customers might recognize our distributed tracing anomaly detection or Kubernetes cluster explorer as examples of how curation helps to achieve observability.)

Participation, on the other hand, puts a premium on versatility—capturing and manipulating data in valuable ways, even when the platform doesn’t know how to shape or present that data. Participation also relies on programmability: giving users the tools, and especially the APIs, to help them help themselves.

2. Support power users

Power users are an important segment of any product’s user base. These are the users most likely to access—and to appreciate—the deeper capabilities that set a product apart from its competitors. And power users are often a product’s most respected and effective champions.

When it comes to application monitoring and observability, power users tend to have very tough and demanding jobs; many of them, for example, practically live in their integrated development environments (IDEs). These users want to automate everything, and they stand to benefit the most from a programmable and extensible observability platform. The New Relic platform, for example, addresses this goal via APIs that allow power users to consume data (such as creating custom metrics), in addition to injecting data for the New Relic platform to use.
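To make the idea of injecting and consuming data through an API more concrete, here is a minimal sketch in Python. It is an illustration only: `MetricStore`, `record`, `query`, and the metric name `checkout.latency_ms` are hypothetical names invented for this example, and they do not represent the New Relic API or any other vendor's interface.

```python
from collections import defaultdict
from statistics import mean


class MetricStore:
    """Hypothetical in-memory metric store (not a real vendor API)."""

    def __init__(self):
        # Maps a metric name to the list of values recorded for it.
        self._samples = defaultdict(list)

    def record(self, name, value):
        """Inject one data point for a user-defined metric."""
        self._samples[name].append(float(value))

    def query(self, name):
        """Consume the recorded data as a small summary, or None if empty."""
        values = self._samples.get(name, [])
        if not values:
            return None
        return {"count": len(values), "avg": mean(values), "max": max(values)}


# A power user scripts both sides: pushing values in and reading them back.
store = MetricStore()
for latency_ms in (120, 80, 100):
    store.record("checkout.latency_ms", latency_ms)

report = store.query("checkout.latency_ms")
missing = store.query("never.recorded")
```

The point of the sketch is the shape of the interaction, not the storage: the same programmatic surface serves both writing custom data and reading it back, which is what lets power users automate their own workflows.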

3. Applications rule

When we speak with New Relic customers, many of them deliver a similar message: “What matters to us is whether our application is healthy or not.” And when an application experiences problems, customers want to pinpoint the source of the issue as quickly and accurately as possible.

The lesson we learned from these customers is loud and clear: An observability platform is most valuable when it focuses on measuring application performance and on surfacing application-performance roadblocks.

4. Embracing change

The pace of change in the observability space is breathtaking, and observability solutions must make tough decisions about capabilities and priorities. The plans and features that made sense six months ago may no longer be relevant, and while product roadmaps remain important, observability solutions must adapt readily to the realities of fast-moving technology innovation.

5. Full transparency

Sometimes observability requires a comprehensive, high-level view of application performance. Other times, it’s all about drilling down into very granular details—with no surprises, and full context.

A good observability platform delivers both of these capabilities. It also provides a consistent, intuitive, and transparent path for moving between high-level and lower-level views.

For example, let’s say that you’re looking at a summary view of performance in a time-series chart. You notice a spike in errors, and you want to know more about what’s happening. You should be able to drill down from that summary view into the underlying data—to view unhandled exceptions, perhaps, or even to view the stack frame or lines of code that introduced the error.

Just as important, such a view should show the useful metrics you expect to see, along with the context required to understand what’s really going on. This type of transparency is especially important in high-stress, high-urgency situations where dev and ops teams want to focus on fixing the problem, not on finding it.
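The drill-down described above, from a summary chart to the data behind a spike, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `ErrorEvent` record, its fields, the one-minute buckets, and the sample file locations are all hypothetical, and a real platform would query stored telemetry rather than an in-memory list.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    timestamp: int   # seconds since some start time (assumed unit)
    exception: str   # exception type name
    location: str    # hypothetical "file:line" where the error was raised


def summarize(events, bucket_seconds=60):
    """High-level view: number of errors per time bucket."""
    counts = Counter()
    for event in events:
        bucket_start = event.timestamp - event.timestamp % bucket_seconds
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))


def drill_down(events, bucket_start, bucket_seconds=60):
    """Low-level view: code locations behind one bucket, grouped by exception."""
    grouped = defaultdict(list)
    for event in events:
        if bucket_start <= event.timestamp < bucket_start + bucket_seconds:
            grouped[event.exception].append(event.location)
    return dict(grouped)


events = [
    ErrorEvent(5, "TimeoutError", "checkout.py:42"),
    ErrorEvent(61, "KeyError", "cart.py:17"),
    ErrorEvent(70, "KeyError", "cart.py:17"),
    ErrorEvent(95, "TimeoutError", "checkout.py:42"),
]

summary = summarize(events)              # the time-series chart
spike = max(summary, key=summary.get)    # the bucket with the most errors
details = drill_down(events, spike)      # the events behind that spike
```

Both views are computed from the same events, which is what keeps the path between them consistent: the detail view can never disagree with the summary it was reached from.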

6. Nobody knows everything

Observability is not like a Hollywood movie: The days of monolithic applications that a single person could fully understand—from soup to nuts—are gone. There’s no heroic genius riding in on a white horse to save the day when you have hundreds or even thousands of variables to observe. In complex, modern environments, even the best on-call engineers may understand one slice of the full picture, but they’re unlikely to have a comprehensive view of everything they need to track.

Here at New Relic, for example, our engineering organization includes more than 60 development teams. In such an environment, it’s well-nigh impossible for anyone to have a truly up-to-date and complete understanding of what every team does and of how their projects are progressing. And the biggest enterprise development organizations are orders of magnitude larger than ours.

All of this demonstrates why a modern observability platform has to provide enough information for whoever is on call—not just some mythical support hero who knows all and sees all—to find and fix the problem.

7. Easy to start

Time to value is especially important in an observability platform—which teams rely on to solve their most urgent and expensive application problems. But quickly getting started out of the box isn’t always easy, especially as observability platforms increasingly take on more sources of data and cover more use cases.

This is why an observability platform should be updated constantly to make more elements—for example, new user agents and new metrics—trackable right out of the box. And the platform developer should strive to make its out-of-box experience as intuitive as possible—knowing that many customers, for better or for worse, will first experience the platform while actually using it to resolve an incident.

8. It’s all about the platform

A modern observability platform must take a full-stack, end-to-end approach. Sure, there are plenty of perfectly competent observability point solutions. And they’re fine for solving many types of problems—a frontend monitoring point solution, for example, can identify JavaScript issues that may create major performance bottlenecks.

Performance issues, however, aren’t always polite enough to stop where point solutions can find them. Many frontend problems, for example, originate deep in the application stack or even within the underlying infrastructure. And as applications and infrastructure continue to get more complex, the need for a full-stack observability platform will only grow.

9. “Fast” is a feature

For a modern observability platform, it’s supremely important to get the right information quickly to the people who need it most. Achieving this goal can make the difference between solving a problem before it affects customers and catching the problem too late, potentially losing thousands, or even millions, of dollars in revenue, not to mention possible damage to a company’s brand image and customer relationships.

But moving fast isn’t just about going fast; it’s also about precision, and reliability, and responsiveness.

Sure, it’s essential to minimize “time to glass”—the critical gap between the moment when an event happens and the point when a platform issues an alert. Within this process, however, there are a lot of moving parts involved—from detecting a problem, to alerting the right team members, to providing actionable information—all of which must come together and work right now.

This is why it’s especially important, yet often quite challenging, for an observability platform to deliver relevant and targeted alerts. It’s also important for vendors to respond promptly when customers have questions or concerns about these critical capabilities.
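One way to reason about “time to glass” as defined above is to break the gap between event and alert into its stages and see where the time goes. The following Python sketch assumes a simplified, hypothetical timeline with just three timestamps (`occurred_at`, `detected_at`, `alerted_at`); real pipelines have more stages, and the names here are illustrative rather than taken from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentTimeline:
    occurred_at: float   # when the event happened (seconds)
    detected_at: float   # when the platform noticed it
    alerted_at: float    # when the alert was issued

    def stages(self):
        """Duration of each stage between the event and the alert."""
        return {
            "detect": self.detected_at - self.occurred_at,
            "alert": self.alerted_at - self.detected_at,
        }

    def time_to_glass(self):
        """Total gap between the event and the alert."""
        return self.alerted_at - self.occurred_at

    def slowest_stage(self):
        """Name of the stage that contributed the most delay."""
        stages = self.stages()
        return max(stages, key=stages.get)


timeline = IncidentTimeline(occurred_at=100.0, detected_at=112.0, alerted_at=115.0)
total = timeline.time_to_glass()
bottleneck = timeline.slowest_stage()
```

Measuring each stage separately, rather than only the total, shows which of the moving parts is worth speeding up first.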

10. Open by design

Open systems and standards, such as the recently announced OpenTelemetry project, are becoming increasingly central as modern enterprises work to manage complexity, reduce friction, and avoid vendor lock-in. New Relic, for instance, is fully invested in bringing OpenTracing, OpenCensus, and OpenTelemetry support to our customers—enabling users to access and visualize all their correlated telemetry data, including custom metrics, through New Relic distributed tracing and the New Relic One platform.

New Relic’s goal is to allow customers to move more quickly and with greater agility, even as we learn more about our customers’ business needs and priorities. And we believe these are all worthwhile objectives for any modern observability platform.
