The dreaded Java Virtual Machine (JVM) crash: it’s not very common, but if you work on Java applications, you probably know how difficult recovery can be. We certainly do.

The data ingest pipeline for the New Relic Browser application is a high-throughput, multi-JVM, distributed app for which we have set up plenty of alerting. But one day, all of the JVMs quit unexpectedly at the same time. When the alerts went off, I thought it had to be a mistake: graphs from our New Relic monitoring tools looked normal until they simply disappeared!

Luckily, no customer data was harmed in this incident. At New Relic we take data integrity very seriously, and all short-term data is backed up wherever possible. So we got the system back online as quickly as we could and began an immediate diagnosis of the root cause of the crash.

But how do you even start diagnosing problems like this? You need to know why the JVM crashed and what the JVM was doing at the time it crashed, so where do you look?

Don’t worry, this story has a happy ending.

Starting with the operating system logs

On Linux machines, when a JVM terminates abnormally, the operating system signal that killed the process is usually recorded in the OS logs. Since our app runs on Red Hat Linux servers, the logs were the first place I looked.

Before long, I found the following line:

abrt-hook-ccpp: Process 111915 (java) of user 0 killed by SIGSEGV - dumping core
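Lines like this are easy to pull out of a syslog-style file. A minimal sketch, assuming a RHEL-style log format; the sample file below stands in for /var/log/messages so the commands are self-contained, and the hostname and timestamp are made up:

```shell
# Create a sample log file standing in for /var/log/messages.
cat > /tmp/messages.sample <<'EOF'
Jun  1 12:00:00 host1 abrt-hook-ccpp: Process 111915 (java) of user 0 killed by SIGSEGV - dumping core
EOF

# SIGSEGV, SIGBUS, and SIGILL all indicate the process touched memory it
# should not have; -E enables the alternation syntax in the pattern.
grep -E 'killed by SIG(SEGV|BUS|ILL)' /tmp/messages.sample
```

On systemd hosts, filtering the journal by the tag that wrote the line (for example, `journalctl -t abrt-hook-ccpp`) gets you the same information.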

For most Java applications, memory access is confined to JVM-managed memory: the heap, the thread stacks, and the string table. The heap is a memory space that stores long-lived objects referenced by a Java application as it runs. Thread stacks hold local variables and parameters scoped to a single method call. The string table is a special region that stores static strings and interned strings.
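For reference, each of these regions maps onto a JVM command-line flag. A sketch with illustrative sizes only (not tuning advice), where ingest.jar is a hypothetical application:

```
# Illustrative JVM sizing flags; the values are examples, not recommendations.
#   -Xmx8g                            maximum heap size
#   -Xss1m                            per-thread stack size
#   -XX:+PrintStringTableStatistics   print string-table statistics at JVM exit
java -Xmx8g -Xss1m -XX:+PrintStringTableStatistics -jar ingest.jar
```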

So this line in the log indicated that a segmentation fault had killed the JVM process. In other words, the JVM was attempting to access memory that did not belong to it, most likely memory outside the heap.

To figure out exactly why the segmentation fault happened, I had to dig a little deeper.

The log line was generated by Red Hat’s Automatic Bug Reporting Tool (ABRT). ABRT writes files to /var/spool/abrt by default and is allowed to write files up to 1GB in size. I had ABRT installed, but it didn’t capture anything useful when the JVMs crashed because the data ingest application’s heap is far larger than 1GB.

However, that made me realize that I could prepare a better system to get more information if the crash happened again.

How about a core dump?

So I configured ABRT to write larger core dump files to a large disk array attached to our application servers. Core files capture much more detail about what a process was doing when it failed.
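The change amounts to two settings in /etc/abrt/abrt.conf. A sketch, assuming /storage/cores is the mount point of the disk array (the setting names are standard ABRT options, but check your version's documentation):

```
# /etc/abrt/abrt.conf (sketch; /storage/cores is a hypothetical mount)
DumpLocation = /storage/cores   # write crash dumps to the large disk array
MaxCrashReportsSize = 0         # 0 removes the size cap on stored crash data
```

The process's core-file resource limit (ulimit -c, or limits.conf) also has to be high enough, or the kernel will truncate the dump before ABRT ever sees it.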

And … the JVMs all crashed again.

This time I used GDB, the GNU debugger, to open the core files captured when the JVMs crashed. Using GDB's backtrace command (which summarizes how a program got into its current state, one frame at a time), I looked at the last few frames of the stack trace recorded at the time of the crash.
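The session looked roughly like this. A sketch, with hypothetical paths; the java binary you hand to GDB must be the exact one that produced the core:

```
$ gdb /opt/java/jdk1.8.0_74/bin/java /storage/cores/coredump.111915
(gdb) bt
```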

Frames #0 - #6 did not seem significant, just general signal handling. Frame #7 is where things got interesting:

#0 0x00007f1b6626c5f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#6 <signal handler called>
#7 0x00007f1b64c56cd7 in ciObjectFactory::create_new_metadata(Metadata*) () from /opt/java/jdk1.8.0_74/jre/lib/amd64/server/
#8 0x00007f1b64c57105 in ciObjectFactory::get_metadata(Metadata*) () from /opt/java/jdk1.8.0_74/jre/lib/amd64/server/
#9 0x00007f1b64c4edd7 in ciSpeculativeTrapData::translate_from(ProfileData const*) () from /opt/java/jdk1.8.0_74/jre/lib/amd64/server/
#10 0x00007f1b64c4f0e8 in ciMethodData::load_extra_data() () from /opt/java/jdk1.8.0_74/jre/lib/amd64/server/
#11 0x00007f1b64c51258 in ciMethodData::load_data() () from /opt/java/jdk1.8.0_74/jre/lib/amd64/server/
#12 0x00007f1b64c43fb4 in ciMethod::method_data() () from /opt/java/jdk1.8.0_74/jre/lib/amd64/server/
#13 0x00007f1b65168055 in Parse::Parse(JVMState*, ciMethod*, float) () from /opt/java/jdk1.8.0_74/jre/lib/amd64/server/
Frames #7 - #13 told me enough to understand the cause of the crash: the HotSpot compiler had failed while allocating metadata.

The HotSpot compiler is responsible for just-in-time compilation of interpreted code into native methods. When the compiler works correctly, it transparently increases performance. In this case, however, it read from unallocated memory and crashed the JVM.
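When a JIT bug is suspected, HotSpot has standard switches for observing or sidestepping the compiler. The flag names below are stock JDK 8 options; ingest.jar and the excluded method are hypothetical placeholders:

```
java -XX:+PrintCompilation -jar ingest.jar
    # log each method as the JIT compiles it
java -XX:CompileCommand=exclude,com/example/Ingest.process -jar ingest.jar
    # never JIT-compile one suspect method
java -Xint -jar ingest.jar
    # interpreter only: slow, but takes the JIT out of the picture entirely
```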

It helps when others have seen it, too

At this point, I strongly suspected the problem was a bug in the JVM rather than in our application. Looking through the Oracle Java bug database, I found an issue that closely matched the stack trace from the core file.

This was good news! The bug was indeed in the JVM and not in Browser’s data ingest pipeline. I tracked the progress of the fix, and when it hit the Oracle JDK release notes, I deployed that version of the JVM. I haven’t experienced any crashes since.

Anyone who works with Java applications knows JVM crashes can cause widespread outages and can be frustratingly tough to diagnose. Many such problems remain unsolved mysteries, but in this case some persistent troubleshooting was all it took to give the story its happy ending.