BlackLine is a cloud-based enterprise software company that automates and controls the financial close process. If software goes down during the close, books can’t close and companies can face stiff penalties. As a company, we want to make this process as seamless as possible, and a big component of that is the performance of our product. BlackLine has a range of services to monitor, including monolithic applications, microservices, and third-party applications. In addition, we cover a wide variety of applications, technologies, and regions, with all the complications that come along with that.
The decision to reduce monitoring tools
We were using nine or so tools to monitor our stack, including AppDynamics, Graylog, SCOM, Foglight, Elastic, LogicMonitor, and SquaredUp. This meant that we had a fragmented view of our estate. We were also hitting 5,000-10,000 alerts per day. With tens of millions of transactions per hour, we needed a solution that could handle that amount of data and provide capabilities like application performance management (APM) and service maps. We also had additional security challenges around the sensitive data we handle for customers.
During an incident, multiple tools put a high cognitive load on engineers. Not only do engineers have to understand the system, they have to correlate what each tool is telling them about the interactions and fill in the gaps. They have to look at multiple screens to deduce simple information. And most incidents don't take place when you're at your best; they happen at 3 a.m. At that time, you don’t want to pull up 10 different screens to correlate data. We wanted systems that could do that for us, so we could spend our time figuring out the problem, not figuring out how to figure out the problem.
Any new monitoring solution needed to be deployed quickly, with immediate return on investment.
Changing the culture around incidents
One of our biggest challenges was mean time to detection (MTTD). We wanted to detect problems before our customers did. When you have many different monitoring tools, problems can get lost in the white noise. Before, when handling incidents, we spent a lot of time parsing files and logs to do that correlation and identify what was important. Those minutes add up. New Relic lets us skip those steps. It provides that signal and shows us where we should spend our time.
The difference between New Relic and some more traditional monitoring tools is how the information is presented to you and how you can leverage it. New Relic helps us understand context and correlation from the start. You don’t need to rely on historians to tell you how and why applications were built. That time is important to us, but more importantly, monitoring is no longer an area we have to teach everyone how to do. When someone joins the company and logs into New Relic, they understand how to use it from day one. Across the board, users see the exact same information presented in the same way. That shared viewpoint helps build a global team that doesn't require everyone to be awake at all times. I like to run a thin team, and I like them to have a work-life balance and the tools that make it possible.
Shifting to proactive monitoring
APM is one of the most useful functionalities for an SRE—the ability to dig into an application and see how it's behaving over time, and how specific calls in a distributed ecosystem execute against one another. That functionality reduces the cognitive load on engineers because the APM system tells us how everything is interacting. APM shows us when there's an issue, when there's about to be an issue, or, more importantly, when a use case has evolved beyond how it was originally designed.
New Relic helps us detect issues before our clients experience them. It gives us the ability to write better policies and processes for how it surfaces information and alerts us. It also helps us understand when we're about to see a degradation. Not every incident happens as a spike; sometimes it's a gradual progression to failure: we start to use more resources, see higher error rates, or see longer application response times. If we can pick those signals out of all the noise, alerts, and logs, then we can attack the problem before it becomes an incident. The more data you give engineers, the better. They understand the code they're writing and how it should behave, not just the feature set, but how it can impact the day-to-day lives of the people we value most: our customers.
BlackLine detected 13 issues before they made it into production just by deploying New Relic infrastructure agents—before alerts were even set up. Correlation is amazing because, out of the box, you’re able to see APM and how it ties into logs, all the way through to SQL. We can see real user monitoring through to the data layer just by installing an agent. It lets you put metrics and alert conditions around SRE concerns, like SLAs. That allowed us to alleviate pressure before it was ever felt by our clients.
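As an illustration of an alert condition around an SLA, an NRQL query like the following could drive an error-rate alert. This is a generic sketch using New Relic's standard Transaction event, not BlackLine's actual configuration:

```sql
-- Hypothetical alert query: error rate per application over the last hour.
-- Transaction and appName are standard New Relic APM event/attribute names.
SELECT percentage(count(*), WHERE error IS true) AS 'Error rate'
FROM Transaction
FACET appName
SINCE 1 hour ago
```

Attached to an alert condition with a threshold (say, error rate above 1% for five minutes), a query like this turns a gradual degradation into a signal before it becomes an incident.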
Return on investment from monitoring tools
Budget is very important for us. Before any project is initiated, we need to make sure the return on investment is high. What stood out to us about New Relic is that we didn’t have to pay per host. We have hosts scattered all over the world, across multiple cloud models and on-premises systems. Ingestion is the best model for us because our data can come from any number of places. When you charge per host, especially if you have tens of thousands of hosts scattered across the globe, budgets can get out of control very fast. New Relic gave us a very easy model to follow. We know exactly what we're going to get charged for, and we have built-in dashboards that show us our trend and our projected future spend.
Last year, with New Relic, some 244 issues took just 5 hours to fix. We estimate savings of $16 million per year through proactive monitoring. If we can prevent incidents, it leads to better customer satisfaction and a better customer experience. With New Relic, all data can be ingested to show correlation via real user monitoring, synthetics, logs in context, and distributed tracing.
New ways to build applications, write code, and create dashboards
New Relic has helped us evolve, both as leaders and as individual contributors. It shows how code needs to evolve. It provides real-time feedback on how the product is being used. That forces you to think more about how clients will use your products, and therefore your services, while you’re building them. That feedback loop makes you a better engineer. The next time you build something and deploy it on a platform like New Relic, you look for new problems and new ways to build applications, rather than repeating history and making the same mistakes.
If you know basic SQL, you can use New Relic Query Language (NRQL) to create dashboards and alerts as code. Because NRQL lets you build as code, actions are easily repeatable and stable; on many other platforms, you can't repeat something exactly the next time. New Relic takes away that cognitive load. New Relic provides the performance, metrics, and monitoring that we need. It gives our customers the confidence that we are monitoring the service they're paying for, and leveraging the tools appropriately.
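To show how close NRQL is to ordinary SQL, here is a hypothetical dashboard query for application response time. The Transaction event and duration attribute are standard New Relic names; the thresholds and time window are illustrative, not BlackLine's real dashboards:

```sql
-- Hypothetical dashboard widget: average and 95th-percentile
-- response time, charted as a timeseries over the past day.
SELECT average(duration) AS 'Avg (s)',
       percentile(duration, 95) AS 'p95 (s)'
FROM Transaction
TIMESERIES 5 minutes
SINCE 1 day ago
```

Because the query itself is just text, it can be checked into version control alongside the application, which is what makes dashboards and alerts repeatable as code.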
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.