
I love watching horror movies, but being in one? No thank you! The spooky stories in this blog are all based on real-world situations… If you doubt that your production environment could become a setting for a horror movie, just remember: something doesn’t have to be ghostly or supernatural to be scary. Luckily for you, you can make it out alive (unlike a minor character in a horror movie)!

I framed each of the following stories around a scary movie that I love, but don’t worry – you don’t have to have seen the movies to relate to the stories. (But you should definitely give them a watch after this–with the lights on.) 

The Orphanage: Do you know where your missing spans are? 

The Orphanage movie poster

PagerDuty screams: your checkout service is failing. You pull up a distributed trace, expecting a clear, end-to-end map of the request, from your frontend service, through product-catalog, recommendation, and so on, through to payment service (maybe). 

Instead, it looks like someone’s taken an axe and chopped it up. There are gaping holes in between calls, and as you reach the bottom of the trace waterfall, there they are, hauntingly out of place. 

The orphaned spans.

You stare, spooked, your eyes traveling warily across the fragmented trace. The span for checkout service is there, but the next step, the call to product-catalog, isn’t. The trace is broken, fragmented. Is the checkout service the cause of the failure, or a victim?

The ghastly truth (and the antidotes)

When spans are missing or have invalid parent span IDs, their child spans become what are known as orphaned spans, and the trace becomes fragmented. In New Relic, orphaned spans appear at the bottom of the trace, but won’t have lines that connect them to the rest of the trace. 

This can happen for a number of reasons, but you can save future spans from the span orphanage! 

  • If your application handles high throughput, it may exceed APM agent collection limits or API limits. One option is to disable some of that reporting so you stay under the limits.
  • Missing instrumentation is another likely cause: if a related service isn’t instrumented, or is instrumented incorrectly, trace context won’t be passed on (see the sketch after this list). Luckily, New Relic surfaces gaps in your instrumentation to help pinpoint where you need to add or fix it.
  • Spans might still be arriving and haven’t been indexed yet, causing temporary gaps until the entire trace has been reported.
  • Although I have never seen a trace of this magnitude myself, if a trace exceeds the 10,000-span UI display limit, you might see orphaned spans. (What is happening that you have more than 10,000 spans in a single trace?!) 
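
To make the missing-instrumentation case concrete: the parent’s trace context has to travel with the request (usually as a traceparent header), and the downstream service has to read it back in. Here’s a minimal sketch using OpenTelemetry’s Python SDK; the service names and URL are made up, and in practice most frameworks’ auto-instrumentation handles this for you.

```python
# Minimal sketch of W3C trace context propagation (OpenTelemetry Python SDK).
# "checkout-service", "product-catalog", and the URL below are hypothetical.
# If the caller never injects the context, or the callee never extracts it,
# the callee's spans have no valid parent and show up as orphans.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout-service")

def call_product_catalog():
    # Caller side: open a span and copy its context into the outgoing headers.
    with tracer.start_as_current_span("GET /products"):
        headers = {}
        inject(headers)  # adds the traceparent / tracestate headers
        return requests.get("http://product-catalog:8080/products", headers=headers)

def handle_request(incoming_headers):
    # Callee side: restore the caller's context so this span gets the right parent.
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("list-products", context=ctx):
        pass  # do the actual work here
```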

Paranormal Activity: Who – or what – is messing with my service?

Paranormal Activity movie poster

Another alert, another big problem. You groan. Not checkout service again! You’ve fixed all the instrumentation issues from before, so you’re confident there won’t be any orphaned spans. No spooky fragmented traces trying to piece themselves back together. 

You hop online and check out (ha!) the APM dashboard. The service’s error rate is spiking, but the transaction traces are… fine. You look at a trace for a failed request: every span is green. The database queries are fast (20ms), and the application code is fast (50ms). The app itself is reporting that everything is healthy.

Yet, the end-to-end request time is a whopping 25 seconds, in itself a pretty scary thing to see. You check other traces, but you see the same thing, over and over and over again. It’s like… you gasp. It’s as though something otherworldly is reaching its spindly arms into your service and causing paranormal activity.

The app swears it's fast, but users are timing out. You're staring at what you can only describe as… not normal. The logs show no errors, just a 25-second gap of nothing. Who–or what–is messing with your service?

The terrifying truth (and how to remedy it)

When you figure out the issue, you’re momentarily stunned, because how did you not think of this before? Hindsight is 20/20, my courageous friend. You were looking for the source of the paranormal activity inside the application code (a slow function, a bad query), but the house itself–in this case, the underlying infrastructure–is the problem.

In fact, your app was fine; the underlying host it was running on was the source all along. In a completely separate, disconnected monitoring tool, you eventually uncovered that the host's Disk I/O wait time was at 95%. A rogue logging process on the same VM was writing gigabytes of data, and your "fast" application was simply stuck in line, waiting for its turn to write a single transaction log line.

Because your application and infrastructure data were not connected, you had a critical blind spot. You couldn't see that the host machine was the source of the "paranormal" latency. 

New Relic automatically correlates your infrastructure and APM telemetry and surfaces those correlations in a single view, reachable from both the APM and infrastructure UIs. If you’re observing both your infrastructure and your applications in New Relic, you’ll immediately see how the two affect each other, and whether one is causing issues for the other. You’ll see the "Web transactions time" chart spike at the exact same time as the "Host CPU" or "Disk I/O" chart. No more paranormal activity on your watch! 
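
If you instrument with OpenTelemetry rather than a New Relic APM agent, that correlation still hinges on your application telemetry carrying the identity of the host it runs on. Here’s a minimal sketch, with a hypothetical service name; the agents record this kind of host metadata for you.

```python
# Minimal sketch: stamp application spans with the host they run on, so APM data
# can be lined up against infrastructure metrics from the same machine.
# "checkout-service" is a hypothetical name.
import socket
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "checkout-service",
    "host.name": socket.gethostname(),  # the join key between app spans and host telemetry
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```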

IT: It’s all (telemetry) bloat down here

It movie poster

Ever since you fixed the orphaned spans issue and connected your infrastructure and application monitoring tools, your team–and more importantly, your manager and their manager–has been very enthusiastic about the value of observability. In fact, they’ve gone a bit overboard with the idea of observing everything, and have added instrumentation everywhere. 

Your team has just deployed the new recommendation service. In the name of "total observability," logs were set to DEBUG, and custom metrics were created for every possible action, tagged with high-cardinality attributes like user_id and product_sku (a horror story for another time).

Having barely recovered from the previous two horror scenarios, you stumble to your laptop when PagerDuty goes off in the middle of the night: "High Latency." By now, you’re a pro at this. You go to load the APM dashboard, but it times out. You try to query logs, then traces, but the query spinner just continues to spin. 

What’s happening now? You’re trying to sift through a petabyte of noise to find a single, useful signal. You’re running blind, not from a lack of data, but because you’re drowning in it. 

The real jump-scare comes a few hours later, at 9AM sharp. This time, it’s from the CFO (cue scary sound effects). The cloud-spend alerts are blazing: your company's observability bill just 10x'd over the weekend. You’re paying a fortune to ingest and store terabytes of useless DEBUG logs and metrics that are too granular to be actionable. (By the way, this all happened before your team migrated over to New Relic, where your dashboards load just fine and your bills are reasonable.) 

Welcome to Pennywise’s sewer, where it’s all (telemetry) bloat down here. 

The real demons (and how to stop them)

Telemetry bloat happens when you collect too much low-value data, and it doesn’t just increase costs; it actively destroys your observability. You’re paying to ingest, store, and query data that provides no business value; you get alert fatigue from too many high-cardinality metrics, which masks the real issue; and your platform may be forced to drop data because you keep blowing past its limits. 

Of course, one thing to do is to strip out some of the unnecessary instrumentation. You can also filter your data, using drop rules, tail sampling, or some other data-filtering mechanism. 

  • In New Relic, you can set up drop rules. You can, for example, drop 99% of verbose DEBUG or INFO logs at ingest while keeping 100% of ERROR or WARN logs. You get the signal without the noise. 
  • You can also manage your cardinality. Instead of using unbounded values (like user_id or session_id) as metric labels, add them as attributes to logs or traces (there’s a sketch of this after the list). You can still query for them if needed, without creating millions of expensive, unique metric time series. 
  • Instead of sampling randomly (e.g., "keep 10% of all traces"), New Relic can use tail-based sampling to intelligently find and keep 100% of the interesting traces—like those with errors, high latency, or rare attributes—while discarding the common, "everything is fine" ones (a toy version of that decision follows below).
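
To sketch the cardinality point: record the metric without the unbounded label, and hang user_id on the span instead, where it stays queryable without multiplying time series. The metric and attribute names here are hypothetical; this uses OpenTelemetry’s Python API.

```python
# Minimal sketch: keep user_id off metric labels, put it on the span instead.
# "recommendations.served" and the attribute names are hypothetical.
from opentelemetry import metrics, trace

meter = metrics.get_meter("recommendation-service")
served = meter.create_counter("recommendations.served")

def serve_recommendations(user_id: str, count: int):
    # Low-cardinality metric: no per-user label, so one time series, not millions.
    served.add(count, {"recommendation.source": "collaborative-filtering"})
    # High-cardinality detail goes on the trace, where it's still queryable.
    trace.get_current_span().set_attribute("user.id", user_id)
```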
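
And here’s a toy version of the tail-sampling decision, written from scratch purely to show the idea (it is not New Relic’s actual sampler): once a trace is complete, keep it if anything interesting happened, and keep only a small random fraction of the rest. The thresholds are made up.

```python
# Toy tail-sampling decision: keep interesting traces, sample the boring ones.
# The Span shape and thresholds are illustrative only.
import random
from dataclasses import dataclass

@dataclass
class Span:
    duration_ms: float
    is_error: bool

def keep_trace(spans: list[Span],
               slow_threshold_ms: float = 2000.0,
               baseline_rate: float = 0.05) -> bool:
    if any(s.is_error for s in spans):
        return True                                   # always keep traces with errors
    if max((s.duration_ms for s in spans), default=0.0) > slow_threshold_ms:
        return True                                   # always keep slow traces
    return random.random() < baseline_rate            # sample the "everything is fine" ones
```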

This turns your telemetry bloat back into a clear, actionable stream. Take that, Pennywise!

You survived!

Congratulations–you made it out of the equivalent of three horror movies! Not only that, but you left armed with elixirs to exorcise future ghosts and demons that might be hiding in your production environment. For everything else, New Relic can help. 

Check out our best practices guide for more survival tips!
