Like many Ruby shops, we use the DelayedJob job runner for many of our background tasks. The New Relic agent has had some basic DelayedJob instrumentation for a while, but until recently you were limited to a few metrics that could only be viewed with the somewhat limited and unwieldy custom charts. But no longer! The latest release of the Ruby Agent sends much richer, more useful metrics for DelayedJob workers, and the recently released Custom Dashboards feature lets you build a truly useful view for monitoring and troubleshooting your jobs.
Here's what it looks like:
This is a Custom Dashboard we built for our DelayedJob workers running in production. Below, I'll show you the steps to create this in your own New Relic account. But first, let me explain what we are looking at.
There are four graphs with data reported by the agent running in the DelayedJob worker. Job Queue Length shows the number of jobs waiting to be executed at any given time, based on one-minute samples. We break out these numbers by priority since we run separate workers at three different priority levels. In the graph above, you can see that our highest-priority worker maintains a pretty healthy queue, while our low-priority worker gets a few spikes every now and then because the jobs in the low-priority queue tend to run a little longer.
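To make the per-priority breakdown concrete, here is a minimal sketch of how queue lengths could be tallied under the metric names used later in this post. It is an illustration only, not the agent's actual code: the `Job` struct is a hypothetical stand-in for a row in the jobs table, and we assume a job counts as "waiting" when it has neither failed nor been locked by a worker.

```ruby
# Hypothetical stand-in for a row in the DelayedJob table.
Job = Struct.new(:priority, :failed_at, :locked_at)

# Count waiting jobs overall and per priority, keyed by metric name.
# Assumption: "waiting" means not failed and not currently locked.
def queue_length_metrics(jobs)
  waiting = jobs.reject { |job| job.failed_at || job.locked_at }
  metrics = { 'Workers/DelayedJob/queue_length/all' => waiting.size }
  waiting.group_by(&:priority).each do |priority, group|
    metrics["Workers/DelayedJob/queue_length/priority/#{priority}"] = group.size
  end
  metrics
end

sample = [
  Job.new(0, nil, nil),       # waiting, priority 0
  Job.new(0, nil, nil),       # waiting, priority 0
  Job.new(5, nil, nil),       # waiting, priority 5
  Job.new(5, Time.now, nil),  # failed, not counted
  Job.new(10, nil, Time.now)  # locked (running), not counted
]

queue_length_metrics(sample).each { |name, count| puts "#{name}: #{count}" }
```

Each distinct priority becomes its own series, which is what lets the chart stack them.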
In our configuration, we destroy jobs when they've run successfully and leave the failed ones. The agent polls the failed jobs and reports the count which we graph in the Failed Jobs chart. We pay particular attention to this chart during a deploy to look for any jump in failing jobs caused by a regression.
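For reference, the setting that keeps failed jobs around looks roughly like this in a Rails app. `Delayed::Worker.destroy_failed_jobs` is DelayedJob's own option; the initializer path in the comment is just a common convention and an assumption here, so adapt it to your setup. DelayedJob already deletes jobs that finish successfully, so that half needs no extra configuration.

```ruby
# config/initializers/delayed_job.rb (path is an assumption, not a requirement)

# Keep failed jobs in the table instead of deleting them, so they can be
# counted and inspected later.
Delayed::Worker.destroy_failed_jobs = false
```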
Locked Jobs is particularly interesting. Jobs are locked while executing, so the number of jobs locked at any given time should not exceed the number of workers. With idle workers, the count will be low. With workers fully utilized, the number will approach the total number of workers. Be careful, though: some jobs occasionally fail without clearing their lock, and you have to clear those out manually. With that caveat, you can use the Locked Jobs graph as a gauge of your overall worker utilization.
The Average Job Execution Time graph simply charts the average response time of the DelayedJob transactions.
Creating a Custom Dashboard for DelayedJob
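The arithmetic behind that gauge can be sketched in a few lines. This is an assumption-laden illustration rather than anything the agent does: `LockedJob`, `worker_utilization`, and `stale_locks` are hypothetical names, and the one-hour threshold is an arbitrary example value you would replace with your own maximum expected run time.

```ruby
# Hypothetical stand-in for a locked row in the DelayedJob table.
LockedJob = Struct.new(:id, :locked_at)

# Ratio of locked jobs to workers. A value above 1.0 suggests that some
# locks were left behind by jobs that died without releasing them.
def worker_utilization(locked_count, worker_count)
  return 0.0 if worker_count.zero?
  locked_count.to_f / worker_count
end

# Locks held longer than max_run_seconds are candidates for manual cleanup.
def stale_locks(locked_jobs, max_run_seconds, now = Time.now)
  locked_jobs.select { |job| now - job.locked_at > max_run_seconds }
end

now = Time.at(10_000)
locked = [
  LockedJob.new(1, Time.at(0)),     # locked for 10,000 seconds
  LockedJob.new(2, Time.at(9_000))  # locked for 1,000 seconds
]

puts worker_utilization(locked.size, 4)
puts stale_locks(locked, 3_600, now).map(&:id).inspect
```

Subtracting the stale locks before computing the ratio gives a more honest utilization figure when cleanup has been neglected.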
Let's walk through the steps to get this set up in a typical account.
The first thing you need is the current version of the Ruby Agent gem. You must be running at least version 3.4.2. Then follow these steps in the New Relic application:
1) Select the application where DelayedJob is running. In our case it's "Background Jobs".
2) From the Custom Dashboards menu choose Create custom dashboard. Enter the title and "Grid" chart type.
3) Choose Add chart to add the first chart to your dashboard. Fill in the basic properties, including the titles and chart type, which is "Stacked Area."
4) Select the DelayedJob queue lengths for the graph. Start typing in Workers/. Initially it will say the metric doesn't exist, but when you enter the slash it will display the list of available metrics in the Workers namespace.
Keep typing Workers/DelayedJob/queue_length/priority/*. The metrics will auto-complete as you enter each segment. The asterisk at the end indicates that you want to break down the graph by priority. If you don't use priorities you can chart a single line for the queue length using the metric name Workers/DelayedJob/queue_length/all.
5) That's generally good enough to display the chart, but there are a couple of other properties you can customize. In our case, we removed the limit on the number of different priorities it will show, added "jobs" to the y-axis labels, and fixed the application to "RPM Background", the application the DelayedJob agent reports to. Here's what it looks like:
6) Add the Failed Jobs chart. Repeat steps 3-5 but enter the metric for tracking failed jobs: Workers/DelayedJob/failed_jobs.
7) Add the Locked Jobs chart. Repeat steps 3-5 but enter the metric for tracking locked jobs: Workers/DelayedJob/locked_jobs.
8) Add the Job Execution Time chart. Repeat steps 3-5 but enter the metric for the response time of delayed jobs: OtherTransaction/DelayedJob/all. We show execution time in milliseconds, so we chose To Milliseconds for the Number Format and "ms" as the y-axis unit label. We also hid the legend.
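As a quick recap of the steps above, the four charts and their metric names can be written down as a small lookup table. This is only a summary in Ruby form, not a New Relic API; `DASHBOARD_CHARTS` and `broken_down?` are names made up for this sketch.

```ruby
# Chart title => metric name, as entered in the steps above.
DASHBOARD_CHARTS = {
  'Job Queue Length'           => 'Workers/DelayedJob/queue_length/priority/*',
  'Failed Jobs'                => 'Workers/DelayedJob/failed_jobs',
  'Locked Jobs'                => 'Workers/DelayedJob/locked_jobs',
  'Average Job Execution Time' => 'OtherTransaction/DelayedJob/all'
}.freeze

# A trailing asterisk asks for one series per value of the last segment.
def broken_down?(metric)
  metric.end_with?('/*')
end

DASHBOARD_CHARTS.each do |title, metric|
  note = broken_down?(metric) ? ' (one series per priority)' : ''
  puts "#{title}: #{metric}#{note}"
end
```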
That's enough to get you started with a good dashboard for monitoring your DelayedJob workers. But you don't have to stop there. You can add a table showing you the summary metrics for each job or a breakdown chart of all the jobs. In addition to showing the response time of jobs, you could chart the throughput. We combine our staging and production servers onto one dashboard by duplicating the charts and picking a different application. Custom Dashboards now makes it easy to configure the charts for the exact data you want to display.
If you haven't already, try New Relic today to see how you can improve the performance of your Ruby applications. And as always, we welcome your feedback on this post. Let us know what you think of Custom Dashboards and how you're using them to monitor your applications.