When New Relic founder Lew Cirne created application performance monitoring (APM), the key innovation was deep code visibility into monolithic applications running in a data center. He then made it available to every engineer in development and operations as a SaaS solution. Today, as new technologies and practices—cloud, microservices, containers, serverless, DevOps, site reliability engineering (SRE), and more—increase velocity and reduce the friction of getting software from code to production, they also introduce greater complexity.
As the company that pioneered and perfected application performance monitoring, we believe the challenges facing modern software teams require a new approach. The antidote to complexity and the way for engineers to keep modern systems available and delivering excellent customer experiences is observability.
Observability is gaining attention in the software world because of its effectiveness at enabling engineers to deliver excellent customer experiences with software despite the complexity of the modern digital enterprise.
But let’s get one thing out of the way: observability is not a fancy synonym for monitoring.
Monitoring gives software teams instrumentation, which collects data about their systems and allows them to quickly respond when errors and issues occur. Put another way, monitoring is building your systems to collect data, with the goal of knowing when something goes wrong and starting your response quickly.
Observability, on the other hand, is the practice of instrumenting those systems with tools to gather actionable data that provides not only the when of an error or issue, but—more importantly—the why. The latter is what teams need to respond quickly and resolve emergencies in modern software.
Observability helps modern software teams:
- Deliver high-quality software at scale
- Build a sustainable culture of innovation
- Optimize investments in cloud and modern tools
- See the real-time performance of their digital business
At New Relic, we believe that metrics, events, logs, and traces (or M.E.L.T. as we refer to them) are the essential data types of observability. But observability is about much more than just data.
How can you establish observability of your systems? And what results can you expect when you have observability? In our opinion, there are four key challenges driving the need for observability. And to meet these challenges head on, organizations need to adopt an observability practice based on three components: open instrumentation, connected data, and programmability. In this ebook, we’ll introduce you to those trends, challenges, and components.
Chapter 1: Modern Architectures Require a New Approach to Monitoring
The rate of technological innovation over the past five to ten years has been mind-boggling, and it has had a tremendous impact on software teams. Key trends include:
- Pressure to innovate fast: Software teams face enormous pressure to rapidly and frequently ship new features and experiences to market faster than the competition. The cloud has grown the competitive landscape by lowering the barrier to entry, demanding that software teams deliver and adapt faster than ever—often doing so with fewer resources. High performers deploy software between once per hour and once per day, with elite performers deploying on-demand multiple times per day.
- Higher customer expectations: Customers expect more and tolerate less. Slow, error-prone, or poorly designed user experiences are non-starters with customers. If they can’t do what they came to do, they won’t come back. According to mobile app developer Dot Com Infoway, 62% of people uninstall an app if they experience mobile crashes, freezes, or errors. Elite performers in software delivery performance restore service in the event of an incident or defect that impacts users in less than one hour, compared to low performers who take between one week and one month to restore service. [1]
- More technology options: Today, organizations build microservice architectures and distributed systems on any number of cloud providers and compute platforms. These services are easier than ever to adopt and use, and increasingly work together seamlessly. You can pick and choose various systems and services to support everything you need in a modern technology stack while abstracting away the management effort to configure and maintain the stack.
- The rise of DevOps and automation: Companies are organizing around autonomous teams responsible for the end-to-end design, delivery, and operation of services they own in production. They sometimes leverage common platforms and tooling that are provided as services by internal platform teams. Automation reduces repetitive, low-value work (or, toil) and improves reliability. In a cloud native architecture, everything in the stack is controlled by software; the entire surface area is programmable. And since all that automation is software, it can fail. Teams need to monitor their CI/CD and other automation tooling exactly as they would applications that directly serve their customers. Gathering data about every component within a system is the essence of observability.
These trends are creating four major challenges that drive the need for observability of modern systems:
- Greater complexity: While cloud native technologies have transformed the way applications are built, delivered, and operated, they’ve also created more complexity for the teams responsible for maintaining them. As monolithic applications are refactored into microservices, where the lifetime of a container may be measured in minutes or less, suddenly software teams have services that are constantly changing. Since each individual application is deconstructed into potentially dozens of microservices, operations teams face a complexity of scale: they’re now responsible for services they know little about yet must maintain.
- Higher risk: Frequent deployments and dynamic infrastructure mean introducing more risk more frequently. This increased risk makes instant detection and rollback much more important than in the days of infrequent deployments. And as companies adopt agile practices and continuous delivery to ship software faster, they’re adding yet another surface area of software (via delivery tools and pipelines) that must be monitored and maintained.
- Skills gaps: The explosion of microservices architectures has introduced new challenges as software teams must rethink how they design, build, and deploy applications. Each team member must also understand and be able to troubleshoot parts of an application they weren’t previously familiar with; today a database expert, for example, must know about networking and APIs as well. The downside is that the number of new and different technologies that teams must learn to use is too vast for any one person to master. Teams need better ways to understand those technologies in the context of the work they accomplish.
- Too many tools: Hybrid environments, thousands of containers in production, and multiple deployments per day result in huge volumes of operational telemetry data. Juggling multiple monitoring tools and the necessary context switching to find and correlate the data that matters most, or to find and resolve issues, takes up precious time that teams don’t have when their customers are impacted by a production problem.
Given these trends and challenges, as well as the overall rate of technological change, teams need a single solution that reduces complexity and risk, and that does so with low overhead. The solution must close the skills gap and be easy to use, understand, and navigate through when gathering essential context. The solution must allow any team within an organization to see all of their observability data in one place and get the context they need to quickly derive meaning and take the right action.
1. “Accelerate: State of DevOps 2019,” DORA, September 2019
Chapter 2: The Age of Observability
While monitoring in general goes back to at least the start of the Unix era (the first edition was released in 1971), the term application performance monitoring (APM) didn’t come into widespread use until the early 2000s. Since then, monitoring has evolved to deliver detailed metrics, tracing, and alerting on performance and user experience across the entire technology stack, including the cloud.
Now, as modern environments have become increasingly complex, observability is extremely important for the future success of software teams and their organizations. It gives teams the ability to see a connected view of all of their performance data in one place, in real time, so they can pinpoint issues faster, understand what caused them, and ultimately deliver excellent customer experiences.
Observability isn’t a new concept at all. It originally comes from engineering and control theory and was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems. A generally accepted definition of observability as applied in engineering and control theory is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
In the software lifecycle, observability encompasses the gathering, visualization, and analysis of metrics, events, logs, and traces in order to gain holistic understanding of a system’s operation. Observability lets you understand why something is wrong, compared to monitoring, which simply tells you when something is wrong.
Yuri Shkuro, author and software engineer at Uber Technologies, explains the difference this way: Monitoring is about measuring what you decide in advance is important while observability is the ability to ask questions that you don’t know upfront about your system.
As we said previously, at New Relic we believe that metrics, events, logs, and traces are the essential data types of observability, and that events are a critical (and often overlooked) telemetry type that must be part of any observability solution—and we’ll talk more about that shortly. Ultimately, when we instrument everything and use that telemetry data to form a fundamental, working knowledge of the relationships and dependencies within our system, as well as its performance and health, we’re practicing observability. However, at New Relic, our approach is even more nuanced than that, as we believe strongly in three core components of observability.
Chapter 3: The Three Core Components of Observability
So far, we’ve defined observability as the practice of instrumenting systems to gather actionable data that provides not only the when of an error or issue, but also the why. The ability to answer why is how teams truly resolve issues at the root cause and ensure system reliability. To achieve observability of your systems, we believe you need three core elements:
- Open instrumentation: We define open instrumentation as the collection of open source or vendor-specific telemetry data from an application, service, infrastructure host, container, cloud service, serverless function, mobile application, or any other entity that emits data. It provides visibility to the entire surface area of business-critical applications and infrastructure.
- Connected entities: All that telemetry data needs to be analyzed so the entities producing it can be identified and connected, and metadata needs to be incorporated to create correlation amongst the entities and their data. Those two actions create context and meaning from large volumes of data. From that context, curation can be delivered as visual models of the system or technology without any additional configuration. The last benefit of connected entities is that intelligence can be applied to derive even more meaning. Applied intelligence is the application of machine learning and data science to look for patterns or anomalies in data so humans can make decisions and take action.
- Programmability: Every business is unique, and no amount of automatic curation can meet all the different needs of a business or fit all of its use cases. Businesses need a way to create their own context and curation on top of all the telemetry data, mixing in critical business data and dimensions. New Relic is unique in the observability space in recognizing the importance of this need, giving customers the ability to build applications on top of all that telemetry data. One example: having the ability to clearly show the cost of errors and failures in a business process, attach real dollars in aggregate to those failures, and provide a path to drill into the data to find the reason why.
To learn more about how observability is evolving to support modern software, read The 10 Principles of Observability: Guideposts on the Path to Success with Modern Software.
Chapter 4: Open Instrumentation
When New Relic started in 2008, the best way to collect telemetry for observability was through agents. Software developers and operations teams deployed agents inside their applications and hosts, and these agents would collect metric, event, trace, and log data, package it up in proprietary ways, and send it for aggregation and display.
While that continues to be an effective route for collecting telemetry today, the industry has changed. Now there are many more sources of telemetry. Many open systems and frameworks for software development have built-in metrics, events, logs, and traces that they emit in common formats. For observability, you need to collect data from both open and proprietary sources and combine it in one place. You need to automatically apply instrumentation wherever it makes sense, and add instrumentation where you need visibility the most.
M.E.L.T.: A quick breakdown
In most cases, metrics are the starting point for observability. They are low overhead to collect, inexpensive to store, dimensional for quick analysis, and a great way to measure overall health. Because of that, many tools have emerged for metric collection, such as Prometheus, Telegraf, StatsD, Dropwizard, and Micrometer. Many companies have even built their own proprietary formats for metric collection on top of open, time-series-friendly datastores like Elasticsearch. An observability solution needs to be able to consume metrics from any of these sources that diverse teams have adopted in the modern digital enterprise.
Traces are valuable for showing the end-to-end latency of individual calls in a distributed architecture. These calls give specific insight into the myriad customer journeys through a system. Traces enable engineers to understand those journeys, find bottlenecks, and identify errors so they can be fixed and optimized. Similar to metrics, many tools have emerged (Jaeger, Zipkin, and AWS X-Ray, just to name a few) from custom solutions created by sophisticated organizations.
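To make "dimensional" concrete, here is a minimal sketch of a counter whose values are keyed by labels, so the same metric can be summed along any dimension. The class name `DimensionalCounter`, the metric name, and the label names are hypothetical illustrations; this is not the client library of Prometheus or any other tool named above, although the rendered lines loosely follow the `name{label="value"} value` shape used by the Prometheus text exposition format.

```python
from collections import defaultdict


class DimensionalCounter:
    """A toy counter that keeps one running total per label combination."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Sort the labels so that keyword order does not create a new series.
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def total(self, **match):
        """Sum every series whose labels include all of the given pairs."""
        return sum(
            value
            for key, value in self._values.items()
            if all(dict(key).get(k) == v for k, v in match.items())
        )

    def render(self):
        """Render one line per series in a Prometheus-like text form."""
        lines = []
        for key in sorted(self._values):
            labels = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{labels}}} {self._values[key]:g}")
        return "\n".join(lines)


# Hypothetical request counts for two services.
requests = DimensionalCounter("http_requests_total")
requests.inc(service="checkout", status="200")
requests.inc(service="checkout", status="200")
requests.inc(service="checkout", status="500")
requests.inc(service="search", status="200")

print(requests.render())
print(requests.total(service="checkout"))  # all checkout traffic, any status
```

Because each total is stored per label combination rather than per request, the data stays small to store while still answering questions such as "how many requests did checkout serve?" or "how many responses were 200s across all services?".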
W3C Trace Context will soon become the standard for propagating “trace context” across process boundaries. Trace context provides a standard way to track the flow of data through a system, tracking originating calls—parent spans and their children—across complex distributed systems. When developers use a standard for their trace context, spans from many different systems can be reliably stitched back together for display and search in an observability platform. Trace context also contains important tags and other metadata that make search and correlation more powerful.
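As an illustration of what propagating trace context involves, the sketch below parses a `traceparent` header in the version-00 layout described by W3C Trace Context (`version-traceid-parentid-flags`, lowercase hex) and builds the header a service would send on its own outgoing call. The function and class names are hypothetical, the sketch ignores the companion `tracestate` header, and it deliberately rejects anything that is not exactly the version-00 shape, so treat it as a simplified reading of the format rather than a complete implementation.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version-00 layout: version "-" trace-id "-" parent-id "-" trace-flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str   # shared by every span in the trace
    parent_id: str  # id of the span that made the call
    sampled: bool   # lowest bit of the trace-flags field


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not usable."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are treated as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def outgoing_traceparent(ctx: TraceContext) -> str:
    """Keep the trace id, but advertise a fresh span id as the new parent."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
outgoing = outgoing_traceparent(incoming)
print(incoming)
print(outgoing)
```

The trace id is what lets spans from different systems be stitched back together: every hop keeps it unchanged and only swaps in its own span id as the parent for the next call.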
The OpenTelemetry project, part of the Cloud Native Computing Foundation (CNCF), merges metric and trace collection in an open format. As more organizations adopt OpenTelemetry, we expect to see more standard and common built-in instrumentation that reduces the need to run agents for bytecode instrumentation at runtime. Given the breadth of tools like Kubernetes and Istio in the CNCF and their rapid adoption, OpenTelemetry is likely to become ubiquitous in modern software as a source of telemetry.
Logs are important when an engineer is in “deep” debugging mode, trying to understand a problem. Logs provide high-fidelity data and detailed context around an event, so engineers can recreate what happened millisecond by millisecond. Just as with metrics and traces, tools have emerged to reduce the toil and effort of collecting, filtering, and exporting logs. Common solutions include Fluentd, Fluent Bit, Logstash, and AWS CloudWatch, as well as many other emerging standards.
All of these projects for metrics, logs, and traces are building for a future in which instrumentation becomes easier for everyone through this “batteries included” approach.
Events are a critical (and often overlooked) telemetry type that must be part of any observability solution. Unfortunately, though, while events and logs share some similarities, the two are often mistakenly conflated. Events are discrete, detailed records of significant points of analysis. But they contain a higher level of abstraction than the level of detail provided by logs. Logs are comprehensive and discrete records of everything that happened within a system; events are records of selected significant things that happened with metadata attached to the record to sharpen its context. For example, when New Relic collects transaction events—individual instances of the execution of a method or a code block in a process—data is automatically added to show the number of database calls executed and the duration of those calls.
What are events?
Events are the most critical data type for observability. Events are distinct from logs. They are discrete, detailed records of significant points of analysis but provide a higher level of abstraction than the details provided by logs. Alerts are events. Deployments are events. So are transactions and errors. Events provide the ability to do fine-grained analysis in real time.
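The difference between logs and events can be sketched in a few lines. In the example below, each database call produces its own log line, while the transaction produces a single event record that carries the call count, the total database time, and some metadata. The class and field names (`TransactionEvent`, `database_call_count`, and so on) are assumptions made for illustration and are not New Relic's actual event schema.

```python
from dataclasses import dataclass, field


@dataclass
class TransactionEvent:
    """One record per transaction, with summary attributes attached."""
    name: str
    duration_ms: float = 0.0
    database_call_count: int = 0
    database_duration_ms: float = 0.0
    attributes: dict = field(default_factory=dict)


class Transaction:
    def __init__(self, name, **attributes):
        self.event = TransactionEvent(name=name, attributes=dict(attributes))
        self.log_lines = []

    def record_db_call(self, statement, duration_ms):
        # Log: a detailed line for everything that happened.
        self.log_lines.append(f"db call: {statement} took {duration_ms} ms")
        # Event: only the aggregated facts worth analyzing later.
        self.event.database_call_count += 1
        self.event.database_duration_ms += duration_ms

    def finish(self, duration_ms):
        self.log_lines.append(f"{self.event.name} finished in {duration_ms} ms")
        self.event.duration_ms = duration_ms
        return self.event


# Hypothetical checkout request with two database calls.
txn = Transaction("POST /checkout", service="checkout", region="us-east-1")
txn.record_db_call("SELECT cart", 12.0)
txn.record_db_call("INSERT order", 7.5)
event = txn.finish(48.0)

print(len(txn.log_lines), "log lines")
print(event)
```

The log lines let an engineer replay the request step by step; the single event is what you would query and aggregate across thousands of requests.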
While most open source tools that provide essential instrumentation also come with a discrete data store for collecting, storing, and making data available for analysis, this undermines the utility of observability: it forces engineers and teams to know and understand multiple tools. Without a unified datastore, when issues—or worse, emergencies—arise, engineers need to context switch through multiple tools to find the source of the problem. An open observability solution has interoperability of all this data, irrespective of the source. And it automatically creates the entities and connections between them, providing critical context.
Chapter 5: Connected and Curated Data
Getting telemetry data from virtually anywhere into one place is a good start, but it isn’t enough. Your data needs to be connected in a way that lets you understand relationships between entities, and it needs to be correlated with metadata so you can understand its relationship to your business. Such connections give your data context and meaning. Context, for example, leads to curated views that surface the most important information about your data, and model its specific environment. Additionally, when all of your telemetry data and connections are stored in one place, you can apply intelligence to those very large data sets, and surface patterns, anomalies, and correlations that are not easily identifiable by humans watching dashboards.
Essentially, you need a way to see how all entities in your system are related to each other at any moment in time. It’s simply not feasible to maintain a mental map of your system when it changes by the day, hour, or minute as teams add new services, refactor old ones, and spin up and shut down ephemeral application instances. Nor is it feasible to rely on configuration to manage those relationships. Still, entities, their connections, and their relationships are only one part of the essential context for observability.
Context is impossible without metadata and dimensions. Depending on your system, business, or application, the spectrum of valuable data is potentially enormous. For example, in the case of an e-commerce application, helpful context includes, but isn’t limited to:
- Details about the team that owns the application, runbook, and code repository
- Tags from Docker or the cloud provider where it’s deployed
- Its service type and function
- The regions where it has been deployed
- Its upstream and downstream dependencies
- Its deployment or change events
- Its alert status
- Any trace or log data associated with the transactions it performs
- Additional business data (e.g., cart value)
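One way to picture entities, their metadata, and their connections is as a small graph. The sketch below is a simplified illustration under assumed names: the `Entity` class, the tag keys, and the service names are all hypothetical, and a real platform would discover these relationships from telemetry rather than from hand-written definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    entity_type: str
    tags: dict = field(default_factory=dict)   # metadata and dimensions
    calls: list = field(default_factory=list)  # names of direct dependencies


def downstream(entities, start):
    """Every entity reachable from `start`, in breadth-first order."""
    seen, queue = [], list(entities[start].calls)
    while queue:
        name = queue.pop(0)
        if name not in seen:
            seen.append(name)
            queue.extend(entities[name].calls)
    return seen


def upstream(entities, target):
    """Entities that call `target` directly."""
    return sorted(e.name for e in entities.values() if target in e.calls)


# Hypothetical e-commerce system.
entities = {
    e.name: e
    for e in [
        Entity("checkout", "service",
               tags={"team": "payments-team", "region": "us-east-1"},
               calls=["payments", "inventory"]),
        Entity("payments", "service", calls=["orders-db"]),
        Entity("inventory", "service", calls=["orders-db"]),
        Entity("orders-db", "datastore", tags={"engine": "postgres"}),
    ]
}

print(downstream(entities, "checkout"))
print(upstream(entities, "orders-db"))
```

With the relationships stored alongside the tags, a question such as "which team owns the services affected by this datastore?" becomes a graph lookup instead of something an engineer has to remember.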
Curation of data visualizations is a powerful tool for surfacing connected, well-understood, and well-defined entities. We already know how best to represent a Java application process running in a container, or an AWS Lambda function that calls DynamoDB after a call from SQS, or a Kubernetes cluster running a dynamic deployment—we’ve solved these problems. And for a busy SRE or DevOps engineer, modeling those environments in a set of dashboards is a waste of valuable time. An observability platform must incorporate the best practices from industry leaders and surface the most important signals of health as well as provide interactive experiences that let engineers troubleshoot problems quickly. Manually creating visualizations and dashboards for specific and ubiquitous technology is toil, plain and simple.
Curation through context also helps with the challenge of the skills gap in a complex digital enterprise. It provides a way for everyone in the organization to visualize the flows and dependencies in their complex systems and to see everything that’s relevant to the entire environment. Because this curation models a variety of systems well, it makes understanding more accessible for people, even when they are not familiar with that specific technology or code.
Observability is nothing if you can’t quickly take action when your system isn’t working correctly. Through machine learning and predictive analytics, applied intelligence takes observability data and makes it meaningful and actionable. Sometimes called artificial intelligence for IT operations, or AIOps, a term used by industry analyst firm Gartner, applied intelligence finds the signal within the noise so you can take the right action.
Applied intelligence delivers clear guidance, even when data sets are large and complex. Machines are very good at identifying patterns, trends, and errors in data at a scale humans just can’t replicate. The right applied intelligence capabilities detect issues as early as possible from telemetry data, and correlate and prioritize events to reduce noise and alert fatigue. Applied intelligence can automatically enrich incident alerts with relevant context, guidance, and suggestions, including recommendations that can help you rapidly pinpoint the true root cause of a problem and how to resolve it.
Here’s an example of applied intelligence in action: Your team receives an alert about a response-time threshold violation for an application. Intelligence has automatically examined throughput, latency, errors, and transaction signals related to the application in the six hours before the alert. In this scenario, intelligence detects latency in the datastore that the application relies on, and it reveals a direct connection between the database issue and the slow response time of the application. The benefits here are two-fold:
- Because applied intelligence has already performed crucial troubleshooting analysis and reduced your mean-time-to-discovery (MTTD), your team can more quickly resolve the underlying issue and, in turn, reduce its mean-time-to-resolution (MTTR).
- Because applied intelligence becomes more useful when trained with more data, and can filter out noise from minor or false alarms, your team will greatly reduce its overall alert fatigue, allowing them to focus on shipping better software, faster.
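The scenario above can be approximated with a deliberately simple stand-in for applied intelligence: compare each related signal's recent values against its own baseline and rank the signals that deviate most. This sketch uses a plain z-score, which is an assumption chosen for brevity and far cruder than the machine learning a production system would apply; the signal names and sample values are invented for illustration.

```python
from statistics import mean, stdev


def anomaly_score(baseline, recent):
    """How many baseline standard deviations the recent mean has moved."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return 0.0 if mean(recent) == mu else float("inf")
    return abs(mean(recent) - mu) / sigma


def likely_causes(signals, threshold=3.0):
    """Rank the signals whose recent behavior departs from their baseline."""
    scores = {
        name: anomaly_score(baseline, recent)
        for name, (baseline, recent) in signals.items()
    }
    flagged = [name for name, score in scores.items() if score >= threshold]
    return sorted(flagged, key=lambda name: scores[name], reverse=True)


# Hypothetical signals related to the alerting application:
# (baseline samples, samples from just before the alert).
signals = {
    "app.throughput_rpm": ([1200, 1190, 1210, 1205, 1195], [1198, 1202]),
    "app.error_rate_pct": ([0.5, 0.4, 0.6, 0.5, 0.5], [0.5, 0.6]),
    "datastore.latency_ms": ([20, 22, 19, 21, 20], [95, 110]),
}

print(likely_causes(signals))
```

Here only the datastore latency stands out, which mirrors the example: throughput and error rate look normal, so the investigation can start at the database rather than at the application code.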
When you can visualize dependencies and drill down across telemetry types in real time, you can more quickly and easily understand system problems and troubleshoot issues to get to the reason “why” behind the data. When they effectively model the technical environment automatically, curated visualizations make it easier for everyone to find root causes. And applying intelligence to large datasets surfaces connections in the data, allowing people to do what they are best at: making nuanced decisions about what to do in a tough situation.
Chapter 6: Programmability
Connecting observability data to business outcomes is a critical step that organizations must take to become mature digital businesses. You need to start with critical business measures of success, and then identify the key performance indicators (KPIs) that directly contribute to success for those metrics. Metrics such as latency, errors, or page load are obvious choices to understand application performance, but they aren’t as helpful to understand an application’s impact on the customer experience and business KPIs.
That’s why it’s important to connect observability back to the business and provide teams with the insight they need to make data-driven decisions. The question is, how?
For most solutions, the answer has been to visualize KPIs in dashboards. Dashboards are a great tool for showing ad hoc views of data quickly. They’re flexible, powerful tools fundamental to any observability solution. But given your business’s particular technology environment and unique KPIs, it’s more important than ever to move beyond the dashboard and embrace building applications to bring in data about your digital business and combine it with your telemetry data. By connecting business data on top of your observability platform, an application delivers a curated experience that is interactive; it frequently has workflows built right into it; and it enables the combination of external data sets in real time. Dashboards can’t do that—but applications can.
To connect business and telemetry data in applications, your observability solution needs to be a platform, and you must be able to build on it. It needs to be programmable.
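As a rough illustration of combining business data with telemetry, the sketch below joins error events with cart values to estimate the revenue at risk for each step of a checkout process. The function name, the field names (`session_id`, `step`), and the figures are all hypothetical; an application on a real platform would query these from its data store rather than from in-memory lists.

```python
from collections import defaultdict


def cost_of_failures(error_events, cart_values):
    """Total cart value attached to errors, grouped by business step."""
    totals = defaultdict(float)
    for event in error_events:
        cart = cart_values.get(event["session_id"])
        if cart is not None:  # skip errors with no known business value
            totals[event["step"]] += cart
    # Largest dollar impact first, so it is clear where to drill in.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Telemetry: error events emitted by the application (invented values).
error_events = [
    {"session_id": "s1", "step": "payment", "error": "TimeoutError"},
    {"session_id": "s2", "step": "payment", "error": "TimeoutError"},
    {"session_id": "s3", "step": "shipping", "error": "AddressError"},
    {"session_id": "s4", "step": "payment", "error": "TimeoutError"},
]

# Business data: cart value per session, from a separate system.
cart_values = {"s1": 120.0, "s2": 80.0, "s3": 45.5}

print(cost_of_failures(error_events, cart_values))
```

The join is the important part: the telemetry says where errors happened, the business data says what they were worth, and only the combination shows which failure to fix first.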
When you have an observability platform on which you can build applications tailored to your unique needs, it opens up the ability to do things not previously possible in an observability tool, such as:
- Prioritize investments in software and measure the effectiveness of those investments in real time.
- Understand, with rich context, the relationships between your technology, business, and customers.
- Make data-driven decisions that have the biggest direct impact on specific KPIs.
- Share understanding through interactive visualizations built to model your unique business, not just your technology environment.
Finally, a programmable observability platform gives teams the ability to build applications that feature in their single system of record without needing to deploy another tool. This provides a number of benefits: it reduces context switching between tools in an emergency; it reduces the time and toil of provisioning, operating, maintaining, and monitoring another system; and it reduces the cost of buying, building, or learning yet another tool.
Chapter 7: Bringing It All Together
As software innovation progresses, the world will continue to move faster and get more complex. Just as the latest technologies and tech trends couldn’t have been anticipated just a few years ago, we don’t know what the next big things to come will be. What we do know is that this continuous innovation and complexity will keep ramping up the expectations on your teams to move faster, embrace more technologies, and deliver zero errors at lightning-fast speed. You’ll also have to automate more and keep pace with customer expectations that have been set by other companies—including your competitors—delivering cutting-edge customer experiences.
Given these challenges, you need a single observability platform that reduces complexity and risk, and that does so with low overhead. You need a platform that closes the skills gap by being easy to use, understand, and traverse to gather essential context, so it’s not a barrier to use for any team within an organization. You need one platform that allows your teams to see all of their telemetry and business data in one place, get the context they need to quickly derive meaning and take the right action, and work with the data in ways that are meaningful to you and your business.
An observability platform should:
- Collect and combine in one place telemetry data from both open and proprietary sources. This open instrumentation reduces tool proliferation and context switching when issues and emergencies arise—because it delivers interoperability of all the data, no matter the source.
- Form connections and relationships between entities and apply those connections to create context and meaning so you can understand the data. Context should be presented in curated views that surface the most important information.
- Give you the ability to build custom applications on top of it. Unlike dashboards, applications deliver interactive, curated experiences; they frequently have workflows built into them; and they enable the combination of external data sets in real time. Programmability redefines the possibilities of observability.
When you have an observability platform that is open, connected, and programmable, the benefits to your business are profound: faster innovation, speedier deploys, less toil, reduced costs, and better understanding of how to prioritize your finite time and attention. All of this leads to a much deeper, shared understanding of your data, your systems, and your customers. That understanding will improve your culture and lead to business growth as you gain real-time views into how your digital systems perform and how your customers engage with your software, which lets you focus on what matters most: the business outcomes you are tasked to deliver every day.