Regex Parsing: Extract Data from Logs

Experience has taught me that regular expressions are the Swiss Army knife of the developer’s toolbox, and there's almost always a better regular expression for the job at hand. Developing a good regular expression tends to be iterative, and the quality and reliability increase the more you feed it new, interesting data that includes edge cases.

A regular expression that works is often good enough. If your data is highly predictable, then optimizing a regex may be an unnecessary endeavor. However, once you start using a regex as part of a wider system, at scale, or across unreliable data sets, you need to make sure it is reliable, resilient, and performant.

Regex can seem complicated at first, but the system is logical and predictable once you understand it. However, reverse-engineering a complex regular expression isn’t much fun.

In this blog post, you'll learn how to put together a regex for a common log management task: extracting name-value pairs from a log line. Logs are a good example of where you need strong regular expressions because logs are typically part of a wider system (ideally, you have logs for your entire stack), need to scale with your application, and are often inconsistent. So let’s take a look at some regexes. Along the way, you’ll hopefully learn to strengthen other regexes you work with.

Regex parsing for logs

This use case is based on a real-world requirement: the rule was originally written to help a customer parse their logs in New Relic. New Relic has a powerful data parsing mechanism that lets you ingest raw log data and parse it into individual, semantically meaningful columns.

Here are the requirements for the real-world use case:

The log data contains multiple name-value pairs as well as other data.
The pairs appear in the format: (attr=value).
The values can contain white space.
Not all name-value pairs need to be collected.
Some pairs might be present in all log lines, but some might not.
The pairs may appear in any order.

Here's an example log line:

my favourite pizza=ham and pineapple drink=lime and lemonade venue=london name=james buchanan

For this example data, let’s say you want to extract the pizza, drink, and name fields from the data. However, you don’t want to extract the venue data or any other data in the log line. To make things more complicated, what if you want to collect this data from many log lines, and the data isn’t always presented consistently? What regular expression will capture those values for you?

TL;DR, here's the regex parser

Maybe you arrived here via Google and just want to copy and paste the rule to see if it works for you. Here it is—a regular expression for extracting name-value pairs, separated by the = sign:

(?:^|\s+)(?=.*?attrname=(?<attrname>[^=]+?(?=(?:\s+\b\w+\b=|\s*?$))))?

And here’s the Grok log parsing version:

(?:^|\s+)(?=%{DATA}attrname=(?<attrname>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?

For these rules:

Not all of the key-value pairs have to be present. The rule captures the pairs that are present and doesn't fail when others are missing from a line.
The order of the key-value pairs does not matter.
White space is allowed within the value.
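As a rough illustration of how the generic rule might be used outside New Relic, the sketch below fills in attrname for several keys and runs the result with Python's re module. The helper names build_rule and extract are assumptions made up for this example. Python spells named groups (?P<name>...), and each optional lookahead is written here as an alternation with an empty branch, (?:(?=...)|), which is a conservative stand-in for the trailing ? in the rule above.

```python
import re

def build_rule(names):
    """Assemble one optional lookahead per attribute name (hypothetical helper)."""
    parts = [r"(?:^|\s+)"]
    for name in names:
        key = re.escape(name)
        # Try the lookahead first; fall back to the empty branch if the key is absent.
        parts.append(
            rf"(?:(?=.*?{key}=(?P<{name}>[^=]+?(?=\s+\b\w+\b=|\s*?$)))|)"
        )
    return re.compile("".join(parts))

def extract(rule, line):
    """Return only the attributes that were found on this line."""
    match = rule.search(line)
    found = match.groupdict() if match else {}
    return {key: value for key, value in found.items() if value is not None}

LINE = ("my favourite pizza=ham and pineapple drink=lime and lemonade "
        "venue=london name=james buchanan")
RULE = build_rule(["pizza", "drink", "name"])
print(extract(RULE, LINE))
```

Because venue is not passed to build_rule, it is ignored even though it appears in the line. The attribute names must be valid Python identifiers in this sketch, since they double as group names.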

To learn more about how the rule works, read on.

Parsing with Grok patterns

This discussion will focus on the Grok log parsing version of the rule because it's a little cleaner. Also, parsing rules in New Relic are written in Grok, which allows you to use existing named Grok patterns. Because Grok log parsing is based on regular expressions, any valid regular expression is also a valid Grok expression. If you’re not using Grok patterns, just use the standard regular expression version provided in the previous section.
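To make the relationship between the two versions of the rule concrete, here is a minimal sketch that expands the four Grok patterns used in this post into their commonly published regex equivalents. The expand_grok helper is hypothetical and handles only the plain %{NAME} form; real Grok implementations ship many more patterns and features.

```python
import re

# Regex equivalents of the Grok patterns used in this post.
GROK_PATTERNS = {
    "DATA": r".*?",
    "GREEDYDATA": r".*",
    "WORD": r"\b\w+\b",
    "SPACE": r"\s*",
}

def expand_grok(rule):
    """Replace each %{NAME} reference with its regex (hypothetical helper)."""
    return re.sub(r"%\{(\w+)\}", lambda m: GROK_PATTERNS[m.group(1)], rule)

GROK_RULE = r"(?:^|\s+)(?=%{DATA}attrname=(?<attrname>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?"
print(expand_grok(GROK_RULE))
```

Expanding the Grok rule this way yields the plain regular expression from the TL;DR section, character for character.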

Starting with a fragile regex parsing rule

Let’s start with some data to test the regex. I love both beer and pizza, and even have my own wood-fired oven, so here’s a pizza-themed data set:

1: my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan

2: my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple

3: my favourite name=james buchanan pizza=ham and pineapple drink=lime and lemonade

4: my favourite pizza=ham and pineapple drink=lime and lemonade

5: my favourite name=james buchanan pizza=ham and pineapple foo=bar drink=lime and lemonade

6: my favourite drink=lime and lemonade

You’ll see that this data set has the key-value pairs in different orders, various amounts of whitespace, and even different numbers of key-value pairs.

In this example data, the key and value in each pair are separated by an equals sign (=), as in drink=coke. Let's say you want to extract three values: pizza, drink, and name.

If the data always appears like line one, you could write a Grok parsing rule like this that extracts each of the values:

pizza=(?<pizza>%{DATA})drink=(?<drink>%{DATA})name=(?<name>%{GREEDYDATA})

This works, but the rule is fragile. It requires the values to always be in the same order. If any values are missing or there is any additional data, the entire rule fails. This is bad. You don’t want data to go missing because it doesn’t quite match. And even if you're pretty sure your data is consistent, can you ever be 100% sure?
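If you'd like to see the fragility for yourself outside New Relic, the following sketch approximates the rule with Python's re module, with DATA written as .*? and GREEDYDATA as .*. The parse_fragile helper is an assumption for this example, and Python needs the (?P<name>...) spelling for named groups.

```python
import re

# Python spelling of the fragile rule: DATA -> .*?, GREEDYDATA -> .*
FRAGILE = re.compile(r"pizza=(?P<pizza>.*?)drink=(?P<drink>.*?)name=(?P<name>.*)")

LINES = {
    1: "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan",
    2: "my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple",
    4: "my favourite pizza=ham and pineapple drink=lime and lemonade",
}

def parse_fragile(line):
    """Return the captured fields, or None when the line does not fit the rule."""
    match = FRAGILE.search(line)
    return match.groupdict() if match else None

for number, line in LINES.items():
    print(number, parse_fragile(line))
```

Only line one produces a result. Lines two and four return None because the keys are out of order or missing. Note that in this Python version the pizza and drink values also keep the trailing space that precedes the next key.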

If you want to try this out yourself with the built-in logs parsing test tool in New Relic, go to Logs > Parsing > Create parsing rule. You can paste in an example log line along with the rule to see the output. Alternatively, you can try the Grok rule out using this Grok log parsing tool.

Using a lookahead rule with regex parsing

So how can you make this parsing rule more robust? Using a lookahead comes to the rescue here. In order to target a single key-value pair, you need to know two things: when to start the match and when to end it. Let's work through this step-by-step.

Find the value pair

Take this pizza value pair as an example. It always starts like this: pizza=. Since the pattern is consistent, you can look ahead and capture the text like this:

(?=%{DATA}pizza=(?<pizza>.*))

This will return the following:

pizza: ham and pineapple drink=lime and lemonade name=james buchanan

DATA is equivalent to the expression .*?. See this useful list of Grok patterns. This lookahead rule finds anything after the string pizza= and captures it into a field called pizza. While this works, the drink and name values are captured, too. So the rule needs to be restricted to capture characters and whitespace up to the next name-value pair only.
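A small Python sketch of the same lookahead, assuming DATA is expanded to .*? and the named group is written in Python's (?P<name>...) form, shows that the overall match is empty while the capture holds everything after pizza=:

```python
import re

LOOKAHEAD = re.compile(r"(?=.*?pizza=(?P<pizza>.*))")
LINE_1 = "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan"

match = LOOKAHEAD.search(LINE_1)
print(repr(match.group(0)))   # the lookahead itself consumes no characters
print(match.group("pizza"))   # everything after "pizza=", including later pairs
```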

Capture just the attribute you need

To capture just the pizza value, you can use another lookahead. The following rule captures one or more characters that are not an equals sign ([^=]+). The ? appended to that pattern makes it non-greedy, so the capture stops at the first position where the inner lookahead succeeds. That inner lookahead requires whitespace character(s), a word, and then another equals sign. Here’s the rule:

(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=))))

This returns the following for #1: pizza:ham and pineapple ✅

However, it returns the following against #2: no match! ❌

Much better...but wait! Line two failed to match the pizza. Can you see why?

The pattern matches data followed by another name-value pair, but in this case, the rule has searched the entire line and there are no additional name-value pairs. The capture must end either just before another name-value pair or at the end of the line, which is signified by $. It’s also important to consider trailing white space, which the non-greedy %{SPACE}? (equivalent to \s*?) keeps out of the captured value.

Here’s the updated pattern:

(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))

Returns against #1: pizza:ham and pineapple ✅

Returns against #2: pizza:ham and pineapple ✅
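The two results above can be reproduced with a short Python sketch. It assumes the Grok patterns are expanded to plain regex (WORD as \b\w+\b, SPACE as \s*) and uses a hypothetical helper named pizza to compare the rule with and without the end-of-line alternative.

```python
import re

# Value must be followed by another key=value pair.
NEXT_PAIR_ONLY = re.compile(r"(?=.*?pizza=(?P<pizza>[^=]+?(?=(?:\s+\b\w+\b=))))")
# Value may also be followed by optional whitespace and the end of the line.
NEXT_PAIR_OR_END = re.compile(r"(?=.*?pizza=(?P<pizza>[^=]+?(?=(?:\s+\b\w+\b=|\s*?$))))")

LINE_1 = "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan"
LINE_2 = "my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple"

def pizza(rule, line):
    """Return the captured pizza value, or None if the rule does not match."""
    match = rule.search(line)
    return match.group("pizza") if match else None

for rule in (NEXT_PAIR_ONLY, NEXT_PAIR_OR_END):
    print(pizza(rule, LINE_1), "|", pizza(rule, LINE_2))
```

Adding a few spaces to the end of line two and calling pizza(NEXT_PAIR_OR_END, ...) again is a quick way to confirm that trailing white space is left out of the captured value.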

This is much better and more reliable. If you just want to capture one field, you’re finished. However, with logs, you’ll often need to capture multiple fields.

Capture multiple fields in logs

You can chain multiple expressions together to capture other values by repeating the same expression and changing the value names as needed:

(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))(?=%{DATA}drink=(?<drink>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))(?=%{DATA}name=(?<name>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))

This returns the following:

Line #1: pizza:ham and pineapple, name:james buchanan and drink:lime and lemonade ✅

Line #2: same as #1 ✅

Line #3: same as #1 ✅

Line #4: no match! ❌
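The results listed above can be checked with a sketch along these lines, again in Python with the Grok patterns expanded. The required helper is hypothetical; it builds one mandatory lookahead per key, and the three are simply concatenated.

```python
import re

def required(name):
    """One mandatory lookahead for a single key (hypothetical helper)."""
    return rf"(?=.*?{name}=(?P<{name}>[^=]+?(?=\s+\b\w+\b=|\s*?$)))"

CHAINED = re.compile(required("pizza") + required("drink") + required("name"))

LINES = {
    1: "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan",
    2: "my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple",
    3: "my favourite name=james buchanan pizza=ham and pineapple drink=lime and lemonade",
    4: "my favourite pizza=ham and pineapple drink=lime and lemonade",
}

def parse(line):
    """Return all three fields, or None if any lookahead fails."""
    match = CHAINED.search(line)
    return match.groupdict() if match else None

for number, line in LINES.items():
    print(number, parse(line))
```

Because every lookahead starts from the same position, the order of the keys in the line does not matter, but a single missing key makes the whole rule fail.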

This works for lines one through three of the sample data. The rule now returns matches regardless of the order of key-value pairs. Unfortunately, it fails for line four of the input:

4: my favourite pizza=ham and pineapple drink=lime and lemonade

You may have noticed that line four is missing the name key. The regex rule requires name to be present or the whole pattern fails. This is a common failure that often goes unnoticed when using regexes with data sets. As you can imagine, these kinds of problems can be very tricky to deal with because it looks like the rule is working correctly, but it isn't gathering critical information. You can fix this by making each pattern optional. To do so, add ? to the end of each expression.

This is the generalized pattern for each key-value pair:

(?=%{DATA}attrname=(?<attrname>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?

Let’s try this regex against the data. The following expression includes the pattern three times, one for each attribute that needs to be captured (name, pizza, and drink):

(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?(?=%{DATA}drink=(?<drink>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?(?=%{DATA}name=(?<name>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?

This returns:

Line #1: pizza:ham and pineapple, name:james buchanan and drink:lime and lemonade ✅

Line #2: same as #1 ✅

Line #3: same as #1 ✅

Line #4: pizza:ham and pineapple, drink:lime and lemonade ✅

Line #5: same as #1 ✅

Line #6: drink:lime and lemonade ✅

The rule correctly matches all test input data in any order and continues to work for missing fields.
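Here is one way to run all six sample lines through an equivalent rule in Python. This is a sketch under a few assumptions: the Grok patterns are expanded to plain regex, named groups use Python's (?P<name>...) spelling, and each optional lookahead is written as (?:(?=...)|), an alternation with an empty branch that behaves like the trailing ?. The optional helper is hypothetical.

```python
import re

def optional(name):
    """One optional lookahead for a single key (hypothetical helper)."""
    return rf"(?:(?=.*?{name}=(?P<{name}>[^=]+?(?=\s+\b\w+\b=|\s*?$)))|)"

RULE = re.compile(optional("pizza") + optional("drink") + optional("name"))

LINES = [
    "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan",
    "my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple",
    "my favourite name=james buchanan pizza=ham and pineapple drink=lime and lemonade",
    "my favourite pizza=ham and pineapple drink=lime and lemonade",
    "my favourite name=james buchanan pizza=ham and pineapple foo=bar drink=lime and lemonade",
    "my favourite drink=lime and lemonade",
]

for number, line in enumerate(LINES, start=1):
    # Every part is optional, so search() always finds a match at the start of the line.
    print(number, RULE.search(line).groupdict())
```

In this Python version, a key that is missing from a line shows up as None in the result rather than causing the match to fail.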

Regex lookaheads performance

Lookaheads do have additional performance overhead, so if your data is reliably consistent, you may be able to use a simpler, more performant rule that doesn’t have lookaheads. You can also make this rule much more performant by adding the prefix (?:^|\s+) at the beginning of your rule:

(?:^|\s+)(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?(?=%{DATA}drink=(?<drink>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?(?=%{DATA}name=(?<name>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?

This small change ensures that lookaheads happen only at the beginning of a line or when there is a space, not with every character. This stops the rule from using lookaheads where they aren’t needed.
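Performance depends on the regex engine and on your data, so it is worth measuring rather than assuming. The sketch below is one hedged way to compare the rule with and without the prefix using Python's timeit module. Python's re engine is not the engine New Relic uses, so treat any numbers as indicative only. The time_rule helper and the repeat count are assumptions for this example.

```python
import re
import timeit

PAIRS = "".join(
    rf"(?:(?=.*?{name}=(?P<{name}>[^=]+?(?=\s+\b\w+\b=|\s*?$)))|)"
    for name in ("pizza", "drink", "name")
)
PLAIN = re.compile(PAIRS)
PREFIXED = re.compile(r"(?:^|\s+)" + PAIRS)

LINE = "my favourite name=james buchanan pizza=ham and pineapple foo=bar drink=lime and lemonade"

def time_rule(rule, repeats=2_000):
    """Seconds taken to run rule.search(LINE) `repeats` times."""
    return timeit.timeit(lambda: rule.search(LINE), number=repeats)

print("plain   ", time_rule(PLAIN))
print("prefixed", time_rule(PREFIXED))
```

Both rules return the same captures for this line, so any difference you measure comes down to how much work the engine does to get there.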

Best practices for using regex to parse log data

Using regular expressions (regex) to extract data from logs can be an effective way to distill invaluable information. Here are some best practices to ensure accuracy, performance, and maintainability:

Start simple: Before diving into complex patterns, begin with a simple regex to capture the most straightforward and common log entries. This can help in understanding the structure of your logs.
Use specific patterns: Instead of using broad patterns like .* which matches almost anything, try to be as specific as possible. For example, if you know an IP address will appear, use a pattern like \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}.
Non-capturing groups: If you're grouping just for logical sequences but don't need the actual data, use non-capturing groups with (?:...).
Avoid greedy matches: By default, regex quantifiers are greedy, meaning they match as much as possible. This can be problematic in logs with repetitive patterns. Append ? to a quantifier to make it non-greedy. For instance, use .*? instead of .*.
Optimize for performance: Complex regexes can slow down log processing. Test your regex patterns for efficiency, especially if applied to large log files or streams.
Use named groups: Instead of relying on the order of capture groups, use named groups like (?<name>...), as the rules in this post do (Python's re module spells this (?P<name>...)). This makes your regex more readable and allows for easier extraction based on field names.
Be mindful of multiline entries: If your logs can span multiple lines for a single entry, ensure that your regex accounts for this by using the appropriate multi-line flags or patterns.
Test extensively: Before deploying a regex pattern in a production environment, test it on a sample set of log data to ensure it captures everything accurately without false positives.
Comment your regex: Regex patterns can become complex and hard to decipher over time. If your regex tool/language supports it, add comments explaining challenging parts.
Stay updated: Log formats can change over time, especially if you upgrade systems or software. Regularly review and adjust your regex patterns to accommodate these changes.

These best practices can help ensure that your regular expressions are not just efficient but effective in extracting the data needed from your logs.
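Several of these practices (named groups, non-capturing groups, non-greedy matching, and comments) can be combined in one pattern. The following is a minimal sketch using Python's re.VERBOSE flag, which ignores white space in the pattern and allows # comments. It is a simplified single-key variant written for illustration, not the lookahead rule developed earlier in this post.

```python
import re

PIZZA = re.compile(
    r"""
    (?:^|\s)            # non-capturing: start of line or whitespace before the key
    pizza=              # literal key and separator
    (?P<pizza>[^=]+?)   # named group: shortest run of non-equals characters
    (?=                 # stop where one of these follows:
        \s+\w+=         #   whitespace and the next key=
      | \s*$            #   or optional whitespace and the end of the line
    )
    """,
    re.VERBOSE,
)

LINE_1 = "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan"
match = PIZZA.search(LINE_1)
print(match.group("pizza") if match else None)
```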

Conclusion

Hopefully, you have a better understanding of how this rule works and have a good sense of how you can iteratively improve a rule to make it more reliable. There is always a better regular expression out there if you put enough thought into it. Good luck finding one that’s even more effective for your use case!

Next steps

Interested in learning more about regex parsing and managing your logs in New Relic? Check out the logs documentation.

Just getting started with logs? Learn more about log management.

You can start accessing your logs in just a few minutes with a free New Relic account. Your account includes 100 GB/month of free data ingest, one free full-access user, and unlimited free basic users.

Regex parsing FAQs

1. What is regex and why should I use it for log parsing?

Regex, short for regular expression, is a powerful tool for pattern matching and extracting specific information from text. It's particularly useful for log parsing because logs often follow specific patterns. Regex allows you to define these patterns and extract meaningful data from log entries.

2. What are some common regex patterns for log parsing?

Common regex patterns for log parsing include:

Extracting IP addresses: \b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b
Parsing dates: \b\d{4}-\d{2}-\d{2}\b
Extracting URLs: (https?|ftp):\/\/[^\s/$.?#].[^\s]*
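As a quick illustration, the three patterns above can be tried in Python like this. The log line and the first_match helper are invented for this example.

```python
import re

IP_ADDRESS = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
URL = re.compile(r"(https?|ftp):\/\/[^\s/$.?#].[^\s]*")

# Hypothetical log line, invented for this example.
LOG_LINE = "2024-01-15 GET https://example.com/index.html from 192.168.1.10"

def first_match(pattern, text):
    """Return the first full match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None

for pattern in (ISO_DATE, URL, IP_ADDRESS):
    print(first_match(pattern, LOG_LINE))
```

These patterns check shape rather than validity. The IP address pattern, for instance, also accepts 999.999.999.999, so add range checks in code if that matters for your data.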

3. How can I optimize my regex parsing for performance?

Use specific patterns instead of generic ones to avoid unnecessary backtracking.
Utilize non-capturing groups (?:...) when you don't need to extract the matched content.
Be mindful of greedy vs. lazy quantifiers (e.g., * vs. *?) to avoid excessive matching.
Test your regex with a variety of input data to ensure it performs well under different conditions.

4. How do I handle multiline logs with regex?

Use the (?s) modifier at the start of your regex pattern, or the equivalent flag in your language (re.DOTALL in Python), to make . match newline characters. Alternatively, you can use \n to explicitly match newline characters in your regex pattern.
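In Python, for example, the difference looks like this. The two-line log entry is made up for the illustration.

```python
import re

# Hypothetical two-line log entry, invented for this example.
ENTRY = "ERROR request failed\n  caused by: timeout"

# Without the flag, "." stops at the newline, so the pattern cannot span both lines.
single_line = re.search(r"ERROR.*timeout", ENTRY)
# With re.DOTALL, or the inline (?s) modifier, "." also matches the newline.
with_flag = re.search(r"ERROR.*timeout", ENTRY, re.DOTALL)
with_inline = re.search(r"(?s)ERROR.*timeout", ENTRY)

print(single_line, with_flag is not None, with_inline is not None)
```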

5. How can I debug complex regex patterns?

Break down your regex pattern into smaller parts and test each part individually. Use comments within your pattern to annotate each section, making it easier to understand. Additionally, regex visualizers can help you visualize how your pattern matches input data.

6. Are there any common regex pitfalls I should be aware of?

Greediness: Greedy quantifiers can match more than intended. Use lazy quantifiers (*?, +?) when appropriate.
Overlooking special characters: Special characters like . or * need to be escaped (\. or \*) if you want to match them literally.
Not handling edge cases: Consider edge cases in your data, like empty fields or special characters, and adjust your regex pattern accordingly.
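The first two pitfalls are easy to demonstrate in a few lines of Python. The input strings here are invented for the example.

```python
import re

# Hypothetical input, invented for this example.
TEXT = "[pizza] and [drink]"

# Greedy: .* runs to the last closing bracket on the line.
greedy = re.search(r"\[(.*)\]", TEXT).group(1)
# Lazy: .*? stops at the first closing bracket.
lazy = re.search(r"\[(.*?)\]", TEXT).group(1)

# An unescaped "." matches any character, so "1.5" also matches "125".
unescaped = re.search(r"1.5", "version 125")
escaped = re.search(r"1\.5", "version 125")

print(repr(greedy), repr(lazy), unescaped is not None, escaped is not None)
```

When the literal text comes from a variable, Python's re.escape can add the backslashes for you.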

By James Buchanan, Senior Solutions Architect

James Buchanan is a Senior Solutions Architect at New Relic.

The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.
