When CrowdStrike released a platform update on July 19, 2024, it caused many Windows-based machines to go dark. An incident like this drives home how much we rely on the digital world. Airlines, emergency call centers, hospitals, banks, and many other services that we take for granted went offline in an instant, leaving organizations grappling to understand the full extent of the impact and to identify where dependencies existed within their affected systems. This wasn’t the first outage with a widespread ripple effect, and it won’t be the last.
Observability is a critical tool when dealing with system outages
Observability can help during an outage because it provides real-time insights into your system’s performance and health. Observability tools like New Relic offer a clear view of the interdependencies within your IT ecosystem, calling attention to where failures are happening and their impact on other components.
In this instance, our customers who were actively monitoring their estate were notified immediately when their own systems, or the third-party dependencies they rely on, started failing. Here are just a few things you can do to quickly restore normal operations and ensure the resilience of your IT infrastructure:
- Identify impacted servers: Use monitored Windows system logs together with entity synthesis and relationship mapping to find out exactly which servers have been impacted.
- Investigate ownership: Once you know which systems are affected, find out who’s responsible and notify them about remediation steps.
- Continuous monitoring: After patching and validating the systems, keep monitoring to ensure everything is fully recovered and restored to normal operations.
The following is a query using the New Relic Query Language (NRQL) that allows you to see which Windows hosts have Falcon running, and quickly determine if the Windows platform version is affected.
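As a rough sketch of that kind of check (not the exact query), the Python below builds an NRQL string and then filters the returned rows. Everything specific here is an assumption for illustration: the `ProcessSample` event type, the `processDisplayName`, `hostname`, `windowsPlatform`, and `windowsVersion` attributes, the `%Falcon%` process pattern, the shape of the result rows, and the `affected_platforms` set. Verify each of them against the data in your own account before relying on the output.

```python
from typing import Iterable


def build_falcon_nrql(process_pattern: str = "%Falcon%",
                      since: str = "1 day ago") -> str:
    """Build an NRQL string listing hosts that report a matching process.

    The event type and attribute names are assumptions; check them against
    the data your own agents actually report.
    """
    return (
        "SELECT latest(windowsPlatform), latest(windowsVersion) "
        "FROM ProcessSample "
        f"WHERE processDisplayName LIKE '{process_pattern}' "
        f"FACET hostname SINCE {since} LIMIT MAX"
    )


def hosts_needing_review(rows: Iterable[dict],
                         affected_platforms: set) -> list:
    """Return the sorted, de-duplicated hostnames on an affected platform.

    `rows` is a hypothetical, already-flattened result set in which each row
    has a "hostname" key and a "platform" key. `affected_platforms` is
    supplied by the caller; this sketch does not know which platforms
    were actually affected.
    """
    return sorted(
        {row["hostname"] for row in rows
         if row.get("platform") in affected_platforms}
    )
```

Keeping the query builder separate from the filtering step means the list of affected platforms can be updated as vendor guidance changes, without touching the query itself.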
Having this type of visibility at your fingertips allows you to understand the full extent of the problem, prioritize resources, and return to normal operations.
Outages are only going to get more complex
Back in the 80s and early 90s, the worst outages hit telcos, wreaking havoc on people’s ability to communicate and to reach essential services like 911 and call center support. In the late 90s and early 2000s, the internet became all about e-commerce. Outages then mostly just paused online shopping, causing some inconvenience to individuals.
Fast forward to today, and over 5 billion people, almost two-thirds of the world’s population, depend on the internet every day. From ordering a coffee to grabbing an Uber, software is behind all these moments.
I have to give a big shout-out to the team at CrowdStrike for their amazing work recovering from the incident this week. They are accustomed to being catapulted into high-pressure, time-sensitive situations, and their response was top-notch. As a business with millions of agents in critical workloads, we know the effort it takes to keep things running smoothly.
Keeping software running smoothly is only going to become more important and complex over time for two main reasons:
- Continued digitalization: Many countries are still rapidly digitizing their economies. For instance, India still has over 50% of its population unconnected to the internet, and some parts of Africa have up to 80% of their population not connected yet.
- Increasing integration of AI: We are bringing more intelligence closer to people and intertwining AI into our daily lives, making us even more reliant on software for both work and personal activities. Digitalization is everywhere: 45% of TV viewing is via streaming, over 4 billion people shop online, and more than 70% of advertising has moved online.
Businesses can monitor everything and still see nothing
Our world is indeed powered by and intertwined with software, making the safeguarding of our digital experiences mission-critical.
Even if businesses think they are monitoring everything, they can still miss a lot without the right tools. Observability tools like New Relic can be game-changing for keeping digital businesses reliable. Think of it like having a superpower that lets you see everything happening in your digital world.
Our platform pulls together all your telemetry data—metrics, events, logs, traces, security vulnerabilities, and more—giving you a clear, unified, and fast path to resolution. At your fingertips you have a comprehensive entity and relationship dependency map, detailing the interactions between technologies including servers, processes, and applications across data centers and multi-cloud environments.
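To make the idea of a relationship dependency map concrete, here is a minimal sketch of how such a map can answer the question "if this host fails, what else is impacted?" It is a generic breadth-first traversal over a hypothetical `depends_on` dictionary, not New Relic's implementation, and the entity names in the usage example are invented for illustration.

```python
from collections import deque


def impacted_entities(depends_on: dict, failed: str) -> set:
    """Return every entity that directly or transitively depends on `failed`.

    `depends_on` maps an entity name to the entities it relies on, for
    example {"checkout-app": ["payments-svc"]}. This structure is a
    hypothetical stand-in for a real dependency map.
    """
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for entity, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(entity)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            # Skip entities already recorded; this also guards against cycles.
            if dependent != failed and dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


# Usage with invented entity names:
example_map = {
    "checkout-app": ["payments-svc"],
    "payments-svc": ["win-host-01"],
    "reporting": ["linux-host-02"],
}
blast_radius = impacted_entities(example_map, "win-host-01")
```

In this example `blast_radius` contains `payments-svc` and `checkout-app` but not `reporting`, which is the kind of scoping that helps decide whom to notify first.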
In times of outages, lean into observability. It’s not just about fixing what’s broken; it’s about gaining a deeper understanding of your systems. While this event and other disruptions in the future are unfortunate and unplanned, with observability, your response can be precise and swift, with the greatest opportunity to verify your operations and performance are restored in the shortest time possible.
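The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and are not part of the commercial solutions or support offered by New Relic. For questions and support related to this blog post, please join us exclusively at the Explorers Hub (discuss.newrelic.com). This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.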