When you're troubleshooting an issue in your app or host, you need to have complete log data that you can analyze. An application log provides granular information about events that occur in your application, including errors, user activity, and HTTP requests. Each line in a log includes a timestamp and information about the event, providing vital data that you can use to analyze how your application is performing—and troubleshoot errors when they arise.
However, logs are only useful if you’re managing them correctly in the context of the application, and that means handling each stage in the log management lifecycle, including:
- Log generation
- Log collection
- Log aggregation
- Log storage
- Analysis, querying, visualization, and alerts
- Log archival and deletion
In this article, you’ll learn the basics of handling each step in the log management process. To learn more about why logs and log management are important, see What is log management?
Let’s take a look at each stage in the process.
The first step is to generate logs for each service in your application. So how do you do that? There are several options:
- Some services and platforms automatically create logs for you.
- You can implement custom logging, often at the code level.
- You can instrument your services with an observability platform like New Relic. Instrumentation is the process of installing agents in your code that collect data and send it to a log management solution.
Let’s briefly go over each option.
First, many services and technologies automatically generate logs. For instance, if you use a web framework like Rails, Django, or ASP.NET, logs are automatically written to the terminal in development. Cloud platforms like AWS and Azure also provide logs for many of their services. You’ll need to check the documentation for each service to determine whether logs are provided by default, whether they need to be turned on, and whether you need to configure them further.
If you’re creating custom frontends and backends, you’ll need to take a closer look at the documentation for the frameworks you’re using and the programming languages themselves. For example, Ruby’s standard library has a `Logger` class and Python’s has a `logging` module, both of which you can use to add custom log messages. As a coding best practice, you should implement error handling within your applications, and you can add custom log messages whether or not the application successfully handles those errors.
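As a rough illustration of custom logging combined with error handling, here is a minimal sketch using Python’s standard `logging` module. The logger name `orders` and the `parse_quantity` function are hypothetical and exist only for this example.

```python
import logging

# Configure the root logger once, near application startup.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders")  # hypothetical service name


def parse_quantity(raw_value):
    """Convert user input to an integer, logging both success and failure."""
    try:
        quantity = int(raw_value)
    except ValueError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Could not parse quantity from %r", raw_value)
        return None
    logger.info("Parsed quantity %d", quantity)
    return quantity


parse_quantity("3")    # logs an INFO line
parse_quantity("abc")  # logs an ERROR line with a traceback, returns None
```

The error is handled (the function returns `None` instead of crashing), and the log line still records that something went wrong.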
Finally, if you’re using a log management solution like New Relic, you can automatically instrument your applications by installing agents in your code that send telemetry data to the platform you’re using. You can also use an open source solution like Prometheus to monitor services, although Prometheus collects metrics rather than logs, so it complements your log data instead of generating it.
If you’re starting from scratch, don’t reinvent the wheel. Use a solution that can instrument your services for you, and use the logs that your services already provide.
After your services are creating logs—either by default or because you’ve configured them to do so—you should collect them in one place where they can be analyzed, queried, and stored. It is extremely important to centralize your logs. Otherwise, you’ll have a difficult time troubleshooting issues and understanding how your services are interacting with each other. For example, a problem in an upstream service may cause an error in a downstream service—but if you are sending the logs for those services to different places, you won’t be able to correlate and analyze the data to find the root cause of the problem.
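To make the upstream and downstream example concrete, the sketch below assumes that logs from two services have already been collected in one place and that every entry carries a shared `request_id` field. The service names, field names, and sample entries are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical entries, already collected from two services into one place.
centralized_logs = [
    {"timestamp": "2024-01-01T10:00:02Z", "service": "checkout",
     "request_id": "r-1", "level": "ERROR", "message": "payment call failed"},
    {"timestamp": "2024-01-01T10:00:01Z", "service": "payments",
     "request_id": "r-1", "level": "ERROR", "message": "database timeout"},
    {"timestamp": "2024-01-01T10:00:03Z", "service": "checkout",
     "request_id": "r-2", "level": "INFO", "message": "order placed"},
]


def group_by_request(entries):
    """Return {request_id: [entries in time order]} across all services."""
    grouped = defaultdict(list)
    # ISO 8601 timestamps in one consistent format sort correctly as strings.
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        grouped[entry["request_id"]].append(entry)
    return dict(grouped)


def first_error(entries):
    """Return the earliest ERROR entry in a time-ordered list, or None."""
    for entry in entries:
        if entry["level"] == "ERROR":
            return entry
    return None


timeline = group_by_request(centralized_logs)
root_cause = first_error(timeline["r-1"])
print(root_cause["service"], root_cause["message"])
```

If the two services sent their logs to different places, there would be no single list to sort and group, which is the correlation problem described above.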
If you’re using a platform that provides instrumentation, agents automatically collect and send log data to the platform for you. For services that you don’t instrument, you need to set up log forwarding. This can be a real headache, especially if you need to forward logs from a lot of services. New Relic provides logs in context, which supports automatically forwarding logs without the need to install or maintain third-party software. Otherwise, you’ll need to go through the process of reviewing documentation for each service, setting up log forwarding, and ensuring that logs are properly forwarded to the centralized location you’ve specified.
There’s one other very important thing you need to consider: do any of your services produce data that absolutely shouldn’t be logged? An example is protected health information (PHI), which must be handled in compliance with HIPAA. You may need additional configuration to ensure that sensitive data isn’t logged.
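One way to keep sensitive values out of logs at the code level is a filter that masks them before any handler sees the record. The sketch below uses Python’s `logging.Filter`. The pattern (US Social Security numbers) and the logger name are assumptions chosen for illustration, and a real deployment would need patterns that match its own sensitive fields.

```python
import logging
import re

# Assumed pattern: US Social Security numbers in the form 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class RedactingFilter(logging.Filter):
    """Mask SSN-like values before a record reaches any handler."""

    def filter(self, record):
        # Render the final message first so values passed as args are covered.
        message = record.getMessage()
        record.msg = SSN_PATTERN.sub("[REDACTED]", message)
        record.args = ()  # args are already merged into msg
        return True  # keep the record, just with masked content


logger = logging.getLogger("patients")  # hypothetical logger name
logger.addFilter(RedactingFilter())
```

A filter attached to a logger only applies to records logged through that logger. Attaching the same filter to a handler instead covers every logger that writes to that handler.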
Technically, aggregation includes both collecting and consolidating logs, but this article treats consolidation as a separate step because it’s an important part of the process. It’s not enough to simply collect incoming data. That data also needs to be made consistent. Think of it as the difference between throwing papers in a box and meticulously filing those papers away. You can’t easily find what you’re looking for if it’s stacked somewhere in a random pile. If logs are properly organized and stored, they are easier and more efficient to query.
That means formatting your logs so they are consistent and provide the information you need. That includes:
- Standardizing logs, such as by converting them to JSON format or ensuring that each log has certain key-value pairs. As an example, different services might format timestamps differently, use different keys to store data, or not use key-value pairs at all.
- Adding context to logs, such as information about the service they originated from. For instance, if you’re collecting logs from many different servers, you’ll want to have additional fields in the log data to differentiate them. That might include fields like an ID, server name, and server location.
Once again, if you’re using an observability platform or another log management solution and you’re installing agents and instrumenting your services, your solution may do a lot of the consolidation for you. If you’re not using an external solution, you’ll need to create a custom code-level solution that standardizes and contextualizes data as it’s collected.
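If you do end up writing a code-level solution, a custom formatter is one possible starting point. The following sketch emits each log as a JSON object with a consistent UTC timestamp and adds context fields about the originating server. The `JsonFormatter` class, the field names, and the values `web-01` and `us-east` are all hypothetical.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object with shared context fields."""

    def __init__(self, context):
        super().__init__()
        self.context = context  # for example, server ID and location

    def format(self, record):
        entry = {
            # One timestamp format (ISO 8601, UTC) for every service.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        entry.update(self.context)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(
    JsonFormatter({"server_id": "web-01", "server_location": "us-east"})
)
logger = logging.getLogger("inventory")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Stock level updated")
```

Because every service would share the same keys and timestamp format, queries against the centralized store don’t need per-service special cases.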
After you’ve organized the data, you need to store it, whether that’s in a custom database or an external solution. If you’re using a log management solution or an observability platform to store data, make sure you fully understand how long each type of data is stored. For example, New Relic stores key application performance metrics forever, allowing you to visualize changes in your application over time. However, logs themselves are only stored for 30 days by default, with longer-term storage available with the Data Plus option.
Analysis, querying, visualizations, and alerts on log data
Your log data is only useful if you can make sense of it. That’s where automatic analysis and visualization can help you. If you are simply storing and collecting logs, you won’t be able to pick up patterns in your application. You can still use logs to troubleshoot reactively, but it’s challenging to proactively detect issues and fix them before they impact your users.
Data visualization gives you a high-level view of key metrics in your application. To visualize your data, you can use an observability platform or an open source tool like Grafana. The next image shows the APM dashboard in New Relic.
This dashboard includes visualizations of important metrics including transaction time, throughput, error rate, and Apdex score. By visualizing data, you can see how your application is performing and act quickly when metrics reach critical thresholds, such as when your error rate spikes.
With an observability platform, you can query your data to get more detailed visualizations or to examine specific transactions. New Relic includes NRQL, and you can use the query builder to query specific logs and build custom dashboards.
It’s also a good idea to set up alerts based on your incoming telemetry data. That way, you can notify your teams when a metric crosses a critical threshold.
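The logic behind a threshold alert can be sketched in a few lines. The example below computes an error rate from a window of structured log entries and checks it against a threshold. The 5% default and the `level` field are assumptions for illustration, and an observability platform would normally evaluate a condition like this for you.

```python
def error_rate(entries):
    """Return the fraction of entries logged at ERROR level."""
    if not entries:
        return 0.0
    errors = sum(1 for entry in entries if entry["level"] == "ERROR")
    return errors / len(entries)


def should_alert(entries, threshold=0.05):
    """Return True when the error rate exceeds the threshold."""
    return error_rate(entries) > threshold


# Hypothetical window of recent entries: 18 INFO and 2 ERROR.
recent = [{"level": "INFO"}] * 18 + [{"level": "ERROR"}] * 2

if should_alert(recent):
    print(f"Error rate {error_rate(recent):.0%} crossed the threshold")
```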
Log archival and deletion
Log data tends to be most useful for solving problems that are happening right now. While log data also provides a useful historical record of what happened in your application in the past, at some point, you may want to archive or delete old log data.
Archiving is the process of moving data from “hot” into “cold” storage. Data in “hot” storage is queryable and can be accessed immediately. However, it needs to be stored on fast storage, which is more expensive. Data in “cold” storage is cheaper but takes longer to access and query. Ultimately, how long you keep data in “hot” storage depends on your organization’s needs. A range of 30-90 days is typical, and New Relic stores log data for 30 days by default.
Deleting log data is the final part of the log management lifecycle. Once again, when (or even if) you delete your logs depends on your organizational needs. While long-term cold storage is cheaper than hot storage, storing log data indefinitely—especially as that log data grows year after year—comes with a cost. So the primary reason you might want to delete logs eventually is to save on storage costs.
Some types of logs do have minimum retention requirements. For instance, security incident logs and logs subject to HIPAA need to be stored for a minimum of six years.
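A retention policy like the one described in this section can be expressed as a small decision function. In the sketch below, the 30-day hot window comes from the default mentioned above, while the 365-day deletion cutoff is purely illustrative. Logs with a minimum retention requirement would need a larger cutoff.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30            # hot storage window mentioned in the text
DELETE_AFTER_DAYS = 365  # assumption: illustrative cutoff only


def retention_action(log_time, now, hot_days=HOT_DAYS,
                     delete_after_days=DELETE_AFTER_DAYS):
    """Return "keep", "archive", or "delete" based on the age of a log."""
    age = now - log_time
    if age > timedelta(days=delete_after_days):
        return "delete"
    if age > timedelta(days=hot_days):
        return "archive"  # move from hot storage to cold storage
    return "keep"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(retention_action(now - timedelta(days=45), now))
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the policy easy to test.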
Finally, you’ll want to ensure data privacy while managing your logs. Data collection always carries the risk of inadvertent sensitive data disclosure. Be sure to follow your organization's security guidelines by considering additional filtering, which is controlled by the configuration of the log forwarder you use.