We use application performance monitoring (APM) extensively at New Relic to watch and tune our own services. APM gathers relevant metrics and traces to identify and diagnose service issues. But sadly, even for us, it's impossible to instrument everything we want to know ahead of time. While adding instrumentation for an investigation or after an incident is a good practice, what if you want to solve your problem right now with the data you already have?

Logs to the rescue!

What's great about logs?

Unstructured text written to a file may be the oldest form of telemetry. Programs have been spitting text out since the dawn of modern computing. That ubiquity is the first super-power of logs. Even when other information is lacking, odds are someone emitted a log that says what's up.

Logging has become so common that the standard libraries for many languages have logging support built right in. Those ecosystems without a standard logger almost always have robust and flexible third-party alternatives. Additionally, many tools exist to format, control, capture, and help interpret your logs across the stack.

What's less awesome about logs?

Remember that word "unstructured”? That's the sticking point about logs as a telemetry type. Because of their long textual heritage, logs can end up as a junk drawer for odds and ends, tossed together without consistency or planning. Compared with the clear meaning of a metric value or the well-known schema and relationships for trace spans, logs are messy.

Further complicating the lack of structure, you may not control the formatting of the logs you care about the most. Whether it's your web server or cloud provider, some of that critical output might be "take what you can get" when it comes to format and content.

Logs can also be massive. That ease of generation means every level of the stack may be unleashing an enormous torrent of text. If you're not careful, some debug logging can even cause local application problems when disks run out of space. Ingesting, storing, and querying that huge volume of data can be a challenge.

How can you use logs more effectively?

Given those strengths and weaknesses, how can you up your logging game and get the most from this venerable data type? We'll look at a couple different features of New Relic logging that highlight how logs can give you insights beyond the traditional instrumentation we all know and love.

Structured logging

I compared logs to a cluttered junk drawer. But there's nothing saying your drawer has to be a mess. If logs are coming from an application you control, you can take advantage of structured logging to tame the madness.

In the most common form, we bring clarity to our logs by emitting JSON data instead of raw text lines. This JSON can have standard attributes to help later processors. So instead of having a log like this:

Something happened!!1!

You can get all this richness when figuring out what happened after the fact:

{
  "appName": "My Application",
  "hostname": "myhost.example.com",
  "level": "ERROR",
  "message": "Something happened!!1!"
}

Often, such log formatting can integrate with your web framework so you don't even have to set additional values or craft the JSON directly yourself. Great examples of this in the Rails world are the lograge and semantic_logger gems.

Relate logs to other data

With the ability to add structure and meaning to our logs, it'll be easier now to work with them in a centralized system, be that an on-prem ELK (Elasticsearch, Logstash, and Kibana) stack or a software-as-a-service (SaaS) vendor like New Relic. For application developers, filtering to logs from their application is the first step in finding those needles in the haystack.

New Relic APM agents all have built-in log forwarding capabilities that automatically add metadata to connect not just the application, but also traces and spans (aka individual requests in your system). With this, you can see, for example, the logs that occurred alongside a specific error trace.

If the stack trace and other properties of the error weren't telling you the whole story, having those logs right at hand may provide the insights you need.

Not every organization wants to use APM agent forwarding; for example, you may have a log forwarding solution already in place, configured within your Kubernetes cluster or with a centralized syslog server. Even if you have something other than the APM agent sending your logs, you can still take advantage of this rich connection between logs and APM through the local decoration option. When enabled, this option modifies the written logs to have the necessary metadata so you get all the goodness of seeing your logs connected in the New Relic UI.

Parsing the logs

But what if your logs aren't formatted as JSON and they have useful numbers buried in the text instead? For example, some of our Java apps have internal "service" classes with their own lifecycle system that logs messages like this:

2023-10-04 23:10:49,612{UTC} [main] INFO  c.n.a.OrderedServiceManager - Started AccountClientAutoService [RUNNING] in 519 milliseconds

That timing information printed at the very end of the line would be a lovely metric or event, but we only have it buried in the log messages. However, we can still query what we're after.

In this case we start by finding the lines we're interested in via the logging UI, which makes it easy to search on keywords and basic attribute value pairs. Beside the "Query logs" button there's the "NRQL" button, and that's where the real power starts. Click that and we go to the Query Builder. Here you can see the New Relic Query Language (NRQL) query that was used to find the current set of logs:

SELECT `message`
FROM Log
WHERE `container_name` = 'log-parser-consumer'
  AND allColumnSearch('Started', insensitive: true)
  AND allColumnSearch('RUNNING', insensitive: true)

Since we're filtered to the messages we care about, you can pull apart the message text with one of NRQL's newer functions, aparse. I won't restate the docs, but for the message above we can use aparse like this to extract our timing information:

SELECT aparse(`message`, '% in * milli%') as startupDuration
FROM Log
WHERE `container_name` = 'log-parser-consumer'
  AND allColumnSearch('Started', insensitive: true)
  AND allColumnSearch('RUNNING', insensitive: true)

Excellent! This will now give us an attribute in our result named startupDuration. We'd like to get the name of the class too. Let's pull our aparse up to a WITH block to do that:

WITH aparse(`message`, '%Started * [% in * milli%')
  AS (service, startupDuration)
SELECT service, startupDuration
FROM Log
WHERE `container_name` = 'log-parser-consumer'
  AND allColumnSearch('Started', insensitive: true)
  AND allColumnSearch('RUNNING', insensitive: true)

Now we see a table with services and their timing. Let's try to do some math with it! We'll group by the service and average the startupDuration to see how each class is performing:

WITH aparse(`message`, '%Started * [% in * milli%')
  AS (service, startupDuration)
SELECT average(startupDuration)
FROM Log
WHERE `container_name` = 'log-parser-consumer'
  AND allColumnSearch('Started', insensitive: true)
  AND allColumnSearch('RUNNING', insensitive: true)
FACET service

Sadly this comes up with blanks on our average column. What? Well, remember we're in log-land—everything is text! A quick function call can fix that for us though.

WITH aparse(`message`, '%Started * [% in * milli%')
  AS (service, startupDuration)
SELECT average(numeric(startupDuration))
FROM Log
WHERE `container_name` = 'log-parser-consumer'
  AND allColumnSearch('Started', insensitive: true)
  AND allColumnSearch('RUNNING', insensitive: true)
FACET service

For long-term usage, it would be worth recording this data as a real metric—either in the application directly, or via the New Relic events-to-metrics feature. Alternatively, we could create a parsing rule that would extract the value to its own attribute when the data is ingested, making it easier to query later on.

Either way, we can do our job today with the logs we already have. And because those logs are just another bit of data in NRDB, you can use these results in other parts of New Relic such as alerts, dashboards, or defining service level objectives (SLOs).

Correlating logs across time

One last trick to show the power that arbitrary logs and NRQL buy you. Like before, let's say that we're interested in the startup time of our application. But this time we don't have a single line with nicely embedded times. Instead, we have two lines we'd like to get the time span between; one where the app booted, and then a later one where we started taking traffic.

To make this more concrete, let's take an example from the Logging team at New Relic itself! When our primary ingest service boots up, we get two different lines:

  • The first says starting log-parser-consumer and indicates the Java application has started.
  • The second calls out the class name for our Kafka consumer  (LogsToParseKafkaConsumerAutoService) and tells us it's running.
starting log-parser-consumer
... lots of other work done to start up...
2023-10-04 23:26:52,429{UTC} [main] INFO  c.n.a.OrderedServiceManager - Started LogsToParseKafkaConsumerAutoService [RUNNING] in 0 milliseconds

We'll start by building a query to find those respective lines and nothing else. This looks about right:

SELECT message
FROM Log
WHERE `container_name` = 'log-parser-consumer' AND
  (message LIKE '%starting log-parser-consumer%' OR
   message LIKE '%Started%LogsToParseKafkaConsumerAutoService%RUNNING%')
LIMIT MAX

Next up, we need a piece of unique data to facet on, something that's different for each application startup. New Relic uses structured logging, and we get a unique pod_name, which is perfect for this purpose. This next query aims to show the two messages we care about grouped together by pod. One column gets the starting message, and the other the ending message:

SELECT
  filter(latest(message), WHERE message LIKE '%log-parser%') AS firstLog,
  filter(latest(message), WHERE message NOT LIKE '%log-parser%') AS secondLog
FROM Log
WHERE `container_name` = 'log-parser-consumer' AND
  (message LIKE '%starting log-parser-consumer%' OR
   message LIKE '%Started%LogsToParseKafkaConsumerAutoService%RUNNING%')
FACET pod_name
LIMIT MAX

We can confirm this matches what we're after, giving us the two lines in one row for each pod_name:

You may not have seen the latest function before in NRQL, but it's an aggregate function that does just what you'd think; from the data set, it takes the latest value. Here we pluck the message out, but having confirmed it's working, what we really want is the latest(timestamp). With that, we can do the math on the start and end timestamps and, voila, we've got timing data between any two arbitrary points in our log stream!

SELECT
  filter(latest(timestamp), WHERE message NOT LIKE '%log-parser-consumer%') -
  filter(latest(timestamp), WHERE message LIKE '%log-parser-consumer%') as startupDuration
FROM Log
WHERE `container_name` = 'log-parser-consumer' AND
  (message LIKE '%starting log-parser-consumer%' OR
   message LIKE '%Started%LogsToParseKafkaConsumerAutoService%RUNNING%')
FACET pod_name
LIMIT MAX

Conclusion

If you're looking to measure something numeric, your best bet may be a metric. But if you're left with just logs, don't despair! On the Logging team at New Relic we've used the tricks outlined above multiple times to find new performance data we wanted to track, easily locate the source of errors, and drive alerting and monitoring improvements. Your logs can bring new visibility to your system that you didn't have before, so get logging!