Traditional methods of monitoring don’t give you the full picture. Explore how to move to modern o11y (observability) and harness real-time insights in this article.

Key takeaways:

  • Modern observability offers proactive insights into system health, surpassing traditional monitoring methods.
  • Integration of old and new techniques is crucial for a balanced approach to monitoring.
  • Prioritizing user experience drives the shift towards observability, enhancing problem resolution.

The term "old-school" has two decidedly different meanings. On the one hand, it can mean classically trendy, something that never went (or will go) out of style. On the other, it connotes something that's outdated, outmoded, and fairly irrelevant.

I cut my teeth on ping and SNMP, so (while it pains me to say this) old-school monitoring is decidedly in the latter category. Back in the day, if the experience your company provided (I’ll call this a service) could be measured at all, it wasn’t easy to do. You were left trying to infer user experience from data about individual parts of the service. That meant measuring CPU, RAM, and database metrics and hoping those metrics reflected enough to tell whether your service was truly working for your customers. The “old-school” assumption was that enough tangential insight would add up to a full picture of a user's actual experience.

The problem with that approach is that it doesn’t provide a complete picture. In this blog post, I’ll show you how to reframe your thinking, so you can better understand how to move towards modern and more complete observability—or o11y, as we’ll call it.

Old-school monitoring vs. modern o11y

"That's how we've always done it" is never an excuse for not improving or adapting to what's possible with monitoring and o11y today.

As a reminder, one of the hallmarks of observability is the ability to understand the internal operation of a system from its external outputs. A system that's "observable" will tell you how it's doing without you needing to ask. This sums up one of the functional differences between old-school monitoring and modern o11y: 

Monitoring is the act of repeatedly asking a system about its current state.

Observability means the system outputs its current state as a part of its normal operation rather than as an interruption.
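To make that distinction concrete, here is a minimal Python sketch. The class names `PolledService` and `ObservableService`, the `emit` callback, and the event fields are all hypothetical, invented for this illustration; they are not taken from any particular monitoring product.

```python
import time
from typing import Callable, Dict, List


class PolledService:
    """Old-school model: state is only visible when someone asks for it."""

    def __init__(self) -> None:
        self._requests_handled = 0

    def handle_request(self) -> None:
        self._requests_handled += 1

    def get_status(self) -> Dict[str, int]:
        # A monitor has to call this repeatedly to learn anything.
        return {"requests_handled": self._requests_handled}


class ObservableService:
    """Observability model: the service emits telemetry as it works."""

    def __init__(self, emit: Callable[[Dict[str, object]], None]) -> None:
        self._emit = emit  # hypothetical sink: a queue, a log shipper, etc.

    def handle_request(self, path: str) -> None:
        start = time.perf_counter()
        # ... the real work of the request would happen here ...
        duration_ms = (time.perf_counter() - start) * 1000.0
        # Emitting is part of normal operation, not a reply to a poll.
        self._emit({"event": "request", "path": path, "duration_ms": duration_ms})


# Monitoring: we have to ask.
polled = PolledService()
polled.handle_request()
print(polled.get_status())

# Observability: the data arrives without a question being asked.
collected: List[Dict[str, object]] = []
observable = ObservableService(collected.append)
observable.handle_request("/browse/plans.jsp")
print(collected[0]["event"], collected[0]["path"])
```

In the first case nothing is known between polls; in the second, every request leaves a record behind as a side effect of being served.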

One major difference between monitoring and modern o11y is their focus. Monitoring primarily refers to the collection and tracking of specific metrics, such as CPU usage or response time, to provide a high-level overview of system health. On the other hand, modern o11y goes beyond just metrics and focuses on collecting and analyzing logs, traces, and events from across the entire system in real time. This allows for a more comprehensive understanding of the system's behavior and potential issues.

Better still, it's never been easier to instrument applications for o11y. Whether the output is metrics, events, logs, or traces, modern o11y solutions offer a variety of methods to snap into existing code, from APIs to agents, allowing you the freedom to incorporate and extend observability to best fit your needs.
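As a rough illustration of what snapping into existing code can look like, the sketch below wraps an existing function in a decorator that records a simple span (name, duration, and error type) for every call. It uses only the Python standard library. The names `traced`, `SPANS`, and `lookup_plan` are assumptions made for this example, and real agents and SDKs do considerably more than this.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory sink; a real setup would ship spans elsewhere.
SPANS: List[Dict[str, Any]] = []


def traced(name: str) -> Callable:
    """Wrap a function so each call records a span with timing and outcome."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error = type(exc).__name__
                raise  # instrumentation must not swallow the failure
            finally:
                SPANS.append({
                    "name": name,
                    "duration_ms": (time.perf_counter() - start) * 1000.0,
                    "error": error,
                })

        return wrapper

    return decorator


@traced("lookup_plan")
def lookup_plan(plan_id: int) -> str:
    """Stand-in for existing application code."""
    if plan_id < 0:
        raise ValueError("plan_id must be non-negative")
    return f"plan-{plan_id}"


print(lookup_plan(7))
print(SPANS[-1]["name"], SPANS[-1]["error"])
```

The existing function body is untouched; the only change to the application is the one-line decorator.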

The synergy between monitoring and o11y

Let’s be clear: I’m not saying that the so-called old-school metrics are no longer needed. I want to emphatically state that this is a both-and rather than an either-or situation. Those low-level data points are not only helpful, they're necessary.

If a drive within a SAN array begins throwing errors (a fault that could progress for weeks before the drive fails completely), traces and user experience monitoring will never reveal the root cause. Likewise, the "view from the top" might not differentiate between the SAN issue and a problem in a memory module of a network device, a corrupt driver on a server, or even a misconfiguration in an image file that's used by the container orchestration system. For this kind of insight, all those traditional monitoring techniques and technologies are still needed.

Benefits of modern o11y

Implementing modern observability methods has numerous benefits for businesses and organizations. Specifically, it enhances productivity, reduces costs, and mitigates risks, fostering overall operational excellence.

Improves productivity

By constantly monitoring key metrics such as response time, latency, and error rates, developers can quickly identify and address any issues or bottlenecks in the system. This allows them to make necessary improvements or fixes before they impact the end-user experience. As a result, teams can work more efficiently and effectively without being bogged down by unexpected errors or performance issues.

Saves money

Modern o11y has cost-saving benefits for businesses. By proactively monitoring system performance and identifying potential problems early on, companies can avoid costly downtime and disruptions to their operations. Addressing issues promptly also prevents future problems that could result in expensive repairs or lost revenue.

Reduces risks

Businesses can mitigate potential risks and prevent critical failures by having a comprehensive understanding of their systems' performance at all times. O11y enables teams to track changes made to the system and identify any anomalies or errors that may have occurred as a result. This allows for quick troubleshooting and resolution before they escalate into larger issues that could impact the business's operations or reputation.

Moreover, o11y also helps with risk management by providing valuable data for making informed decisions about future developments or updates to the system. With real-time insights into how changes will affect overall performance, teams can ensure that updates are rolled out smoothly without causing any negative impacts on productivity or customer experience.
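One simple way to picture tracking a change and spotting the anomalies that follow it is to compare the error rate before and after the change went live. The sketch below assumes each sample is a hypothetical `(timestamp, succeeded)` pair and that the time of the change is known; the sample data is invented purely for illustration.

```python
from typing import List, Tuple

Sample = Tuple[int, bool]  # (timestamp in seconds, request succeeded)


def error_rate(samples: List[Sample]) -> float:
    """Fraction of samples that failed; 0.0 when there is no data."""
    if not samples:
        return 0.0
    return sum(1 for _, ok in samples if not ok) / len(samples)


def compare_around_change(samples: List[Sample], change_ts: int) -> Tuple[float, float]:
    """Return (error rate before the change, error rate from the change onward)."""
    before = [s for s in samples if s[0] < change_ts]
    after = [s for s in samples if s[0] >= change_ts]
    return error_rate(before), error_rate(after)


# Invented data: a change goes live at t=4 and failures begin.
samples = [(0, True), (1, True), (2, True), (3, True),
           (4, False), (5, False), (6, True), (7, False)]
before_rate, after_rate = compare_around_change(samples, change_ts=4)
print(f"before: {before_rate:.2f}, after: {after_rate:.2f}")
```

A jump like this does not prove the change caused the errors, but it tells you where to look first.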

Examples in observability

What does that look like in real-world terms? Let’s take a look. 

Suppose you suspect the WebPortal application in this next example is experiencing an issue, perhaps because of an uptick in customer calls. You need to understand what’s really happening in production without running a fake routine every five minutes from a single location. The next logical step is to turn to our tools and see what might be happening:

While you can see that the load average may have “spiked” a little higher than normal, at 0.06 that’s not anywhere near critical. Meanwhile, all the other stats have remained flat. 

Even a few years ago, this is where problem-solving would start. Limited to just this data set, there’s really no telling what the problem may be. 

But now robust tools support more telemetry options than just metrics. Application tracing is the act of collecting information on how your code is running for real users in the environment as it is happening. And having that in your suite of tools allows you to see this:

With this assistance, it’s easy to see what happened, and when, and even get an idea of why. Those little gray dots? Those are “deployment markers,” and they show when code was changed and deployed to production. The telemetry from traces is granular, meaningful, and specific, which allows us to dig even further and look at the specific transactions:

From here, you can decide whether to investigate the browse/plans.jsp transaction, which takes an eye-watering 7 seconds, or the appropriately named oops.jsp with its 98.16% error rate.

But the point of showing you this isn’t to teach you the specifics of application performance monitoring (APM); it’s to illustrate the way that real-time performance metrics completely change the way you can identify and investigate problems in your applications.

That doesn’t mean that metrics go out the window. As stated before, the problem could just as easily have been a bad stick of RAM or a corrupt database table. It just means you allow the customer experience, as visualized by APM and traces, to take priority and frame the moments when you might need to dig deeper.
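To show how a per-transaction view like the one described above can be derived from raw traces, here is a small Python sketch. It assumes each trace is a hypothetical `(transaction name, duration in seconds, succeeded)` tuple. The transaction names echo the example, but the numbers are invented sample data, not the figures from the screenshots.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Trace = Tuple[str, float, bool]  # (transaction, duration in seconds, succeeded)


def summarize_by_transaction(traces: List[Trace]) -> Dict[str, Dict[str, float]]:
    """Group traces by transaction and compute count, average time, error rate."""
    grouped: Dict[str, List[Tuple[float, bool]]] = defaultdict(list)
    for name, duration, ok in traces:
        grouped[name].append((duration, ok))

    summary: Dict[str, Dict[str, float]] = {}
    for name, rows in grouped.items():
        summary[name] = {
            "count": len(rows),
            "avg_seconds": sum(d for d, _ in rows) / len(rows),
            "error_rate": sum(1 for _, ok in rows if not ok) / len(rows),
        }
    return summary


# Invented sample data for illustration only.
traces = [
    ("browse/plans.jsp", 6.5, True),
    ("browse/plans.jsp", 7.5, True),
    ("oops.jsp", 0.2, False),
    ("oops.jsp", 0.2, False),
    ("index.jsp", 0.1, True),
]
summary = summarize_by_transaction(traces)
slowest = max(summary, key=lambda name: summary[name]["avg_seconds"])
most_errors = max(summary, key=lambda name: summary[name]["error_rate"])
print("slowest:", slowest)
print("highest error rate:", most_errors)
```

Ranking by either column gives you the same two starting points as the example: the slow transaction and the failing one.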

Shifting to modern o11y

The question isn't if you need traditional monitoring or observability techniques, but when to use them.

At their heart, monitoring and observability are the consistent collection of telemetry from entities. Everything else (alerts, reports, dashboards, and the like) is simply a happy byproduct of gathering the telemetry in the first place.

So if you're still collecting all the data, what has to change in order to go from old-school monitoring to modern o11y? In a word, "perspective."

The leap to modern o11y doesn't require that we give up our old tools but instead give up our old perspective and way of thinking. Start by measuring what matters: the experience of the people using the service. A failure at that level is a "real" failure and requires immediate response.

That response can include automation, which might look like increasing the resources available or re-deploying a container with the latest code. If those automated responses don't resolve the issue, that's when people need to get directly involved.

Now, this is when lower-level information becomes essential. Because once all the standard, easily automated responses have been done, it's highly likely that the problem lies deeper in the stack. But collecting data after a problem has started is obviously ineffective—the data has to be collected all along.
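The response pattern described above (try the automated fixes first, then bring in people) could be sketched as a simple ladder. The function `respond`, the health check, and the two remediation steps below are hypothetical stand-ins; a real system would call its orchestration and paging tools at those points.

```python
from typing import Callable, Dict, List, Tuple

Remediation = Tuple[str, Callable[[], None]]  # (label, action to run)


def respond(is_healthy: Callable[[], bool], remediations: List[Remediation]) -> str:
    """Run automated remediations in order; escalate if none restores health."""
    for name, action in remediations:
        action()
        if is_healthy():
            return f"resolved by {name}"
    return "escalate to on-call"


# Simulated service state, standing in for a user-experience health check.
state: Dict[str, object] = {"replicas": 1, "healthy": False}


def scale_up() -> None:
    # Adding capacity does not fix this particular (simulated) fault.
    state["replicas"] = int(state["replicas"]) + 1


def redeploy() -> None:
    # Re-deploying the container clears the simulated fault.
    state["healthy"] = True


result = respond(lambda: bool(state["healthy"]),
                 [("scale_up", scale_up), ("redeploy", redeploy)])
print(result)
```

When the ladder returns the escalation result, the people who pick up the problem need the lower-level telemetry that was being collected the whole time.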

How to implement modern o11y

“OK, I’m convinced, but how do I get there? Do I have to throw out everything I have and build a modern o11y solution from the ground up?” I hear you asking. Thankfully, no.

Although you should add observability to your current monitoring solutions, having tools that do the same thing can waste time, focus, or money. I suggest you consider a process that has the following milestones:

  1. Add observability capabilities.
  2. Integrate your tools.
  3. Simplify your inputs.

1. Add observability capabilities

Identify the elements of modern observability that are missing from your environment—whether that means the capability is completely missing or possible but with an unacceptable cost (in time, effort, or actual money).

This is the point in the process to be rational about what you need but also thoughtful about what the future holds. It is much harder to bolt on an overlooked-but-critical capability later than it is to select a solution that has both what you need now and what you are likely to grow into later. 

For example, even if you aren’t specifically shopping for something that uses machine learning to identify probable root cause by combining metrics, logs, events, and traces, you’d have a hard time convincing me you’ll never need something like that.

Once you’ve selected a tool, get it installed and operational, and train your entire team on its capabilities, operation, and maintenance. This includes finding out where the new solution blends most naturally with your organization’s priorities. Finding those natural synergies is the best way to ensure some early wins and establish the value of the new tool.

2. Integrate your tools

After you’ve installed your solution, it’s time to stop playing “dashboard whack-a-mole” with a half dozen screens showing disparate data. Instead, integrate everything. I’m going to be really prescriptive here and tell you that trying to get your new observability solution’s data into your (old) existing monitoring tools is the wrong choice.

Robust, modern observability solutions have multiple ways to ingest data including tool- and language-specific agents, data connectors, custom integration, and APIs. No matter what monitoring tools you’re using, there’s very likely a way to display that data alongside your observability telemetry.

The benefit of doing this is that you can view both high and low levels of information side by side, in context with each other. This provides the next step in the path to starting with the view of customer experience and then moving organically to view systems and infrastructure data as needed.
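As a small illustration of viewing both levels in context, the sketch below lines up a user-level series (response time) and an infrastructure-level series (load average) on shared one-minute buckets. The data format, the function names, and the sample values are all assumptions made for this example; an observability platform would do this alignment for you.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[int, float]  # (unix timestamp in seconds, value)
Row = Tuple[int, Optional[float], Optional[float]]


def bucketize(points: List[Point], bucket_seconds: int) -> Dict[int, float]:
    """Map each point to the start of its time bucket; the last value wins."""
    buckets: Dict[int, float] = {}
    for ts, value in points:
        buckets[ts - ts % bucket_seconds] = value
    return buckets


def side_by_side(user_level: List[Point], infra_level: List[Point],
                 bucket_seconds: int = 60) -> List[Row]:
    """Join two series on time bucket; a missing value shows up as None."""
    top = bucketize(user_level, bucket_seconds)
    low = bucketize(infra_level, bucket_seconds)
    return [(t, top.get(t), low.get(t)) for t in sorted(set(top) | set(low))]


# Invented sample data for illustration only.
response_seconds = [(0, 0.3), (65, 7.1)]
load_average = [(10, 0.02), (70, 0.06), (130, 0.03)]

rows = side_by_side(response_seconds, load_average)
for bucket_start, response, load in rows:
    print(bucket_start, response, load)
```

Reading across a row, you can see at a glance whether a bad user-facing minute coincided with anything unusual lower in the stack.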

3. Simplify your inputs

On the final leg of your journey to peak observability, you’ll review your results and metrics and consider removing redundant inputs. Yes, that means it might be time to say goodbye to some cherished old tools. But the truth is that modern o11y solutions can usually cover the lower levels as well. 

At this stage, you and your team should be very familiar with your observability tool’s capabilities and which overlaps are sufficient for your needs. So now is the time to start migrating, rather than integrating, to use the new tool’s functions for best practice observability.

This is also a good time to step back and ask yourself which of the old-school data sets are still needed. Does your organization need network packet information now that you can see how customers interact with the application? Those metrics, and likely many others, may be ones you can bid a fond farewell to.
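One way to approach that review is to write down which signals each tool collects and check which tools are fully covered by your observability solution. The sketch below does this with plain sets. The tool names and signal lists are hypothetical, and the output is only a starting point for the conversation about what your organization still needs.

```python
from typing import Dict, List, Set


def fully_covered_tools(coverage: Dict[str, Set[str]], primary: str) -> List[str]:
    """Tools whose every signal is also collected by the primary tool."""
    primary_signals = coverage[primary]
    return sorted(
        tool for tool, signals in coverage.items()
        if tool != primary and signals <= primary_signals
    )


def uncovered_signals(coverage: Dict[str, Set[str]], primary: str) -> Set[str]:
    """Signals some other tool collects that the primary tool does not."""
    others: Set[str] = set().union(
        *(signals for tool, signals in coverage.items() if tool != primary)
    )
    return others - coverage[primary]


# Hypothetical inventory for illustration only.
coverage = {
    "observability_platform": {"cpu", "memory", "disk", "logs", "traces"},
    "legacy_host_monitor": {"cpu", "memory", "disk"},
    "packet_capture": {"network_packets"},
}

print(fully_covered_tools(coverage, "observability_platform"))
print(uncovered_signals(coverage, "observability_platform"))
```

Fully covered tools are candidates for migration. Uncovered signals are the ones to question: either you still need them and the old tool stays, or nobody does and it can go.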

Looking ahead without forgetting the past

Modern o11y means putting the user's experience where it belongs: first and foremost. That includes understanding and setting the correct expectations for what the user experience ought to be. If the experience falls short, that's when additional insight is brought to bear.

Modern o11y techniques and tools have finally allowed IT practitioners to attain the point of view of the application we always wanted. Now our job is to shift our perspective without losing sight of all the things we learned on the journey to the peak.