Subquery JOINs, lookups, and GeoIP allow you to resolve business-critical incidents with your data in one simple query. In this blog, we'll take a deep dive on subquery JOINs, including joining related datasets, making calculations across datasets, and tips on writing great subquery JOIN statements. But first, a brief explanation of why this powerful query capability is so important to helping you better understand your data.
Your telemetry data is vast, so knowing how to query your data effectively is crucial to answering the questions you have about the performance of your systems. Often, combining various data can illuminate correlations and trends, giving you a better understanding of the health and performance of your apps and systems. One way to combine data is using subqueries, which allow you to use the result of one query as the basis of another. Subqueries are limited, however, because subqueries can only return a single result or list of results.
Which brings us to today’s topic: Subquery JOINs, a more powerful solution to solve these needs. With subquery JOINs, you can join two result tables together, allowing for analysis and enrichment across datasets. In this blog you’ll learn just how easy it is to do. Check out the video or continue reading below to learn more.
Whether you want to combine application performance monitoring (APM) event data and metrics, your logs and infrastructure, your synthetics and
NrAuditEvent, or your custom events, subquery JOINs can help. Consider this next example of joining infrastructure data with transaction data:
FROM ProcessSample SELECT average(cpuPercent) FACET containerId
FROM Transaction SELECT average(duration) FACET containerId
In this example there are two datasets: average CPU percentage of infrastructure by container
(ProcessSample) and average duration of the transaction on the app by container. Often, data from different sources is correlated. In this case you can determine if a container's higher CPU usage is causing slower transaction by using the following subquery JOIN:
JOIN (FROM ProcessSample SELECT average(cpuPercent) AS cpu
FACET containerId LIMIT MAX) ON containerId
SELECT average(duration)/latest(cpu) FACET containerId, containerName
This is amazing! With this subquery JOIN, you can see the containers that have a higher transaction duration (on average) compared with their average CPU percentage (on average). With this data you can then research the topmost containers to ascertain if there's a bug to fix or optimizations to be made.
Next let’s take a deeper look at the syntax.
Below, you’ll learn how to get started with the syntax. To begin with, understand that a subquery JOIN requires three components:
- Dataset 1
- Dataset 2
- Primary key that links the two datasets together
FROM Event [INNER|LEFT] JOIN (subquery) ON [key =] key SELECT ...
There are a few simple rules to the structure:
- The JOIN clause must always follow immediately after the FROM clause.
- The JOIN can be prefixed with the join type. INNER or LEFT is optional, and defaults to INNER when omitted.
- Parentheses containing a subquery must immediately follow JOIN.
- The ON clause must immediately follow the subquery and has two forms (more details below).
Read the subquery JOINs documentation to learn the nuances of the syntax.
Let's look at an example where subquery JOINs help enrich your data. In the next example, there's APM data in two places: PageView provides information on a page visited by an end user, while
PageAction provides the actions taken on the page. These two events both have session IDs that identify the session of an end user, but some data is contained in one event and not the other.
In this example,
city is an attribute in
PageView but not
currentUrl is an attribute in
PageAction but not
FROM PageView SELECT count(*) FACET session, city
FROM PageAction SELECT count(*) FACET session, currentUrl
With subquery JOINs, you can enrich the
PageAction dataset with the missing data! By combining the data using the session ID, you can see not only the URL with the most clicks, but also where these actions came from:
JOIN (FROM PageView SELECT count(*) FACET session, city
LIMIT MAX) ON session
SELECT count(*) FACET city, currentUrl
Tip: Make sure to include LIMIT MAX in the inner subquery to maximize the number of results joined to the outer query. The current max result limit is 2,000 rows.
In this last example, you’ll learn how to parse two log datasets to match one another, then run calculations to find the error percentage of the logs.
Logs can often have data buried within the log message. Using enhanced string parsing functions like
aparse(), you can extract key values. In this case, the primary key application ID is within the log message of both
WITH numeric(aparse(message, '%appId: * %')) AS app_Id
FROM Log SELECT count(*) FACET app_name, app_Id SINCE 1 day ago
FROM Log_Error SELECT count(*) AS errCnt
FACET numeric(aparse(message,'%applicationId: * %')) AS application_Id
SINCE 1 day ago
To find the error percentage by application, a LEFT JOIN is required. This is because not all of the applications have errors, and an INNER JOIN would exclude these apps.
WITH numeric(aparse(message, '%applicationId: * %')) AS application_Id
LEFT JOIN (FROM Log_Error SELECT count(*) AS errCnt
FACET numeric(aparse(message, '%appId: * %')) AS app_Id
SINCE 1 day ago LIMIT MAX) ON application_Id=app_Id
SELECT (latest(errorCount) OR 0)/count(*) AS errorPercentage
FACET app_name, application_Id SINCE 1 day ago
Even though there was data stored in two different log partitions and the application ID was hidden in the log messages, you were able to parse out the data you needed and calculate the error percentage. Using
latest() was also helpful to ensure the aggregations of the inner query were calculated correctly with the outer query.
- Subquery JOINs have a limit of 2,000 results.
- Alerting is not supported. JOIN queries have the same limitations as standard subqueries, due to architectural limitations of streaming in our system today.
- SELECT * is not supported in the inner subquery of a subquery JOIN.
- Queries that generate very high-cardinality relationships are currently restricted to a maximum of 1:100, which means a single primary key cannot map to more than 100 rows of the joined query.
- Try not to alias a column within a subquery using an existing column in the outer query, as this can cause errors.
- Make sure the primary keys match their attribute type (for example, string, numeric, etc.). Mismatches in attribute types cause the query to return no results. For example,
ON account=account_numberwill not work if
accountis a string and
account_numberis a number.
LIMIT MAXto the inner query will maximize the number of results, and therefore maximize the number of JOINs.
When joining against an aggregated dataset, be sure to use
latest(), which will preserve the aggregated value in the outer query.
With subquery JOINs, combining two datasets is easy. Best of all, your team can analyze the combined data to gain insights about the health and performance of your application that otherwise may not be obvious or easy to correlate. There are too many possibilities to count how data can be joined, but what’s certain is that you can do it with subquery JOINs.
- Read about subquery JOINs in the NRQL reference page.
- Take a deep dive into the NRQL subquery JOINs documentation.
- Watch the instructional Data Bytes video on JOINs, GeoIP, and Lookups.
- If you’re not already using New Relic, get started with New Relic for free. Your free account includes 100 GB/month of free data ingest, one free full-access user, and unlimited free basic users.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.