Whenever a team at Riskified launches a new service, developers instrument it with New Relic so they have a holistic view of the application, including the AWS CloudFormation and container usage needed to get optimal performance from the platform. This approach has paid off in many instances, including a particularly impactful event when Haim Ashkenazi, Riskified’s head of DevOps, received an alert about a low Apdex score for a server cluster. Calls to the service had increased from around 10 million per day to more than 1 billion in a couple of hours, and performance had plummeted.
The sudden 100x increase taxed Riskified’s normal autoscaling policies and impacted many parts of the system. Using New Relic APM and customized New Relic Insights dashboards, Riskified was able to retune its scaling policies and database metrics and restore performance within roughly 20 minutes.
Without New Relic, the process would have involved a substantial amount of trial and error to reach an adequate number of servers, at significant cost. Instead, Ashkenazi was able to adjust the scaling policy to handle the increased volume, buying time (and saving the company an unnecessary expense) while he figured out what had happened. “It took less than 20 minutes to stabilize and resolve the problem,” recalls Ashkenazi. “And I could analyze exactly what part of the application was impacted.” In the end, it turned out that one of Riskified’s biggest customers had released an incorrectly configured version of its mobile app, resulting in overuse of the API.
Transaction response times reduced by 75%
As Riskified continues to scale, Feldman’s ongoing quest is to reduce transaction time—the amount of time it takes to send merchants that yes-or-no decision. He says, “It’s no longer just a question of monitoring software performance, it’s also making sure that we’re quickly providing decisions on transactions and meeting our customers’ SLAs. New Relic helps us get there.”
New Relic’s APM dashboards pointed Riskified in the right direction, and its engineers then added method-level instrumentation to the areas of the application code that were taking too long to execute.
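As a rough illustration of what method-level instrumentation involves, the sketch below times individual function calls with a plain Python decorator. It is not New Relic's agent API and not Riskified's code; the names `timed`, `TIMINGS`, `score_order`, and `slowest` are hypothetical and exist only for this example.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: function name -> list of call durations (seconds).
TIMINGS = defaultdict(list)


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are timed too.
            TIMINGS[func.__qualname__].append(time.perf_counter() - start)
    return wrapper


@timed
def score_order(amount):
    # Placeholder for one step of a decision pipeline (illustrative only).
    return amount < 1000


def slowest(timings):
    """Return (name, mean duration) pairs, slowest first."""
    means = {name: sum(d) / len(d) for name, d in timings.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)
```

Ranking functions by mean duration, as `slowest` does, mirrors the general idea of drilling down to the methods that consume the most time before optimizing them.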
“Moving from 1-second decisions to 200-millisecond decisions is very difficult, since it’s all in the very low level of the code structure and efficiency,” says Feldman. “New Relic’s base capabilities were extremely helpful, and we were able to quickly and easily make custom additions to those capabilities to meet our demanding standards.” Feldman and his team created a variety of dashboards with New Relic Insights to monitor very specific elements of their operation that were critical to their success.
Using these dashboards, Riskified’s teams were able to combine the analytics of Insights and APM to drill down to the class and method level or the message level, systematically identifying areas that could be improved and fixing them. As a result, over the course of four months Riskified was able to slash its average transaction response time by 75%—from 800 milliseconds to 200 milliseconds or less.
Monitoring and scaling on Black Friday
For Black Friday (the most important shopping day of the year) and throughout the entire holiday shopping season, Ashkenazi notes, “You don’t take chances. You go as large as you can.”
To ensure its platform was ready to meet the expected uptick in request volume, Riskified used New Relic Infrastructure to perform stress tests, pushing its system to find the largest load the machines could handle without failing. Analyzing the data from New Relic, Ashkenazi and his team determined that cluster latency was not the best parameter on which to base the autoscaling group's policy. “New Relic helped us identify the right network for this type of server and create an effective scaling policy using the amount of network bytes transmitted per minute instead,” explains Ashkenazi.
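To make the idea of scaling on network throughput concrete, here is a minimal sketch of a proportional sizing rule driven by bytes transmitted per minute. It is an assumption-laden illustration, not Riskified's policy or the AWS autoscaling algorithm: the function name, the per-instance target, and the size limits are all hypothetical values chosen for the example.

```python
import math


def desired_instances(total_bytes_per_min, target_bytes_per_instance,
                      min_size=1, max_size=100):
    """Size a group so each instance sends about target_bytes_per_instance.

    All parameters are illustrative; real limits and targets would come
    from stress-test data like that described above.
    """
    if target_bytes_per_instance <= 0:
        raise ValueError("target_bytes_per_instance must be positive")
    if total_bytes_per_min < 0:
        raise ValueError("total_bytes_per_min must not be negative")
    # Round up so the per-instance load never exceeds the target.
    needed = math.ceil(total_bytes_per_min / target_bytes_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, needed))
```

For example, with a hypothetical target of 100 MB per minute per instance, a cluster transmitting 950 MB per minute would be sized at 10 instances under this rule.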
The team monitored the platform for the entirety of Black Friday, and all systems performed superbly under the new policy. Ashkenazi can now sleep soundly. “I don’t wake up at night. We have it handled. If we design the system correctly with scaling groups and policies, taking messages and responding accordingly, most events handle automatically,” he says. Significantly, the new policy has resulted in cost savings, as it enables Riskified to define its infrastructure and manage its spend with AWS effectively.
Continually advancing a DevOps culture
At Riskified, DevOps isn’t just the name of a team. “We’re trying to adopt DevOps as a culture,” Feldman says. “Everybody is responsible for the reliability of the system and performance.”
Feldman continues, “New Relic provides us not just efficiency but also ownership. Because every team can manage much of the infrastructure requirements themselves, they’re constantly thinking about improvements and can ensure that nothing falls through the cracks.”