Greeting cards are a very seasonal business. Valentine's Day, Mother’s Day, and Father’s Day are really big days for us at Thortful, when thousands of concurrent users are on our site and the number of searches per minute multiplies by 30. 

As an e-marketplace in the B2C space, with small cart sizes, customer loyalty is key to our success. The website needs to be a great experience so customers come back. Getting to issues and resolving them fast is key—understanding how customers are impacted, where, how, and why. We do everything we can on the technology side to make that possible.

Here are a few things we do to ensure platform performance and deliver a seamless customer experience during busy periods.

Monitoring key metrics for customer experience

I don’t want customers reporting issues. I want to be the first to know. To do that, we preemptively scale up everything for busier periods. We proactively look for anything we can track: conversion rate and response times from our web app to our APIs. Those metrics tell me whether we’re delivering on our promise or not. We also religiously track the number of checkouts per second in the run-up to big days like Valentine’s Day.

One advantage of being cloud-based with Amazon Web Services (AWS) is how much easier it is to scale up with clusters and autoscaling models, so when traffic ramps up, everything scales up with it. On the digital side, we can grow and shrink as much as we want. Anything used by users who land on the site needs to respond below the 100-millisecond mark. We track to the millisecond, and every millisecond we shave off helps conversions. Technically, we’re only as good as our last checkout.
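To make the 100-millisecond target concrete, here is a minimal Python sketch of checking a batch of response times against a latency budget. It is an illustration only, not our production setup: the function names, the choice of the 95th percentile, and the nearest-rank method are all assumptions.

```python
import math

# The 100 ms target comes from the text; everything else here is hypothetical.
LATENCY_BUDGET_MS = 100.0


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms, pct=95, budget_ms=LATENCY_BUDGET_MS):
    """True when the chosen percentile of response times is below the budget."""
    return percentile(samples_ms, pct) < budget_ms
```

Checking a high percentile rather than the average is one reasonable design choice, because a handful of slow requests can hide behind a healthy mean.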

Observability has been an important tenet in scaling growth in our go-to-market strategy. Five years ago, we were too small to do any load or performance testing. Every year we learned on the job. That made us very cautious on the product side of things because we never knew what was going to break. New Relic is really good at helping us fix issues in flight. We can pinpoint the issue in the system with distributed tracing, which gives our team an easy way to capture, visualize, and analyze traces through different services that are a part of our architecture.

Trying new mediums and using data to understand ROI

The challenge with marketing campaigns, specifically when you start with a brand new channel or when you do TV or radio, is that within a few seconds you have an influx of traffic. That puts pressure on your stack. From a funnel tracking perspective, we use tools like Google Analytics to track conversion from page to page. Everything we do in terms of on-site experimentation and ad testing goes through that kind of tracking tool.

On the pure technology side, a key metric we monitor is our search function. Typically every person who lands on our site will do at least one search. It's crucial for conversion and a successful basket. Checkouts are also important. We go from one checkout per second to 30 checkouts per second on busy days. Even if our basket value is small, multiply that by 20 every second, 1,200 every minute, and it all adds up. We need enough granularity to know that every single component does its job and performs the best that it can. These data insights were critical to us growing to scale, and every year we had record growth. We monitor all of these metrics on New Relic.

Automating alerts with data

I'm a very big proponent of automation. As engineers, we have to be as lazy as possible when it comes to repetitive tasks. If you do something more than twice, you have to look at automation. We’ve automated alerts and text push notifications to let us know if something deviates from our baselines, such as response time. A first contact engineer gets the alert first and then, depending on the complexity of the incident, it moves from frontend to backend to principal engineers if necessary. Once alerted, we focus on the incident and drill down into the response time to see which service is affected. Logs in context let us see the exact log line where the issue lies and correlate it with other relevant data points. There is no more jumping into a different screen or tool to look at the logs for an issue.
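The alerting flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our actual alert configuration: the `Baseline` class, the tolerance value, and the tier labels are hypothetical names chosen to mirror the escalation order in the text.

```python
from dataclasses import dataclass

# Escalation order described in the text; the tier names are illustrative labels.
ESCALATION_CHAIN = ["first_contact", "frontend", "backend", "principal"]


@dataclass
class Baseline:
    metric: str       # e.g. "response_time_ms"
    expected: float   # normal value for the metric
    tolerance: float  # allowed relative deviation, e.g. 0.25 for 25%


def breaches_baseline(baseline, observed):
    """True when the observed value exceeds the baseline by more than the tolerance."""
    return observed > baseline.expected * (1 + baseline.tolerance)


def next_responder(current=None):
    """Return who to notify next, or None when the chain is exhausted."""
    if current is None:
        return ESCALATION_CHAIN[0]
    idx = ESCALATION_CHAIN.index(current)
    return ESCALATION_CHAIN[idx + 1] if idx + 1 < len(ESCALATION_CHAIN) else None
```

For example, with a baseline of 80 ms and a 25% tolerance, an observed response time of 120 ms would trigger an alert to the first contact engineer, and `next_responder("first_contact")` would name the frontend tier if the incident needs to move on.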

As soon as we’re green for one metric, we put an alert on it. We use alerts to monitor everything from database clusters to web servers. We track all baselines, including Core Web Vitals, with New Relic dashboards. We use them as our single source of truth to visualize trends and keep our finger on the pulse. We give our team an actual view of how their systems work. It’s the best thing we could do. 

We've reached a point now where we know what to scale. With microservices, we start scaling our search capabilities or order capabilities at different points and levels. If we lose a minute of uptime on the website, that's huge. We want to limit that risk as much as possible.

Improving database response times

We had issues with our databases not returning query results within the expected response time. We focused on the query, and New Relic APM for Java helped our engineers optimize it.

Now, we can dig into what’s happening and see the endpoint with distributed tracing. If we have an issue in our stack, it’s really useful to have everything in one system and see those end-to-end service dependencies and then see if that can affect our customers. We have our logs flowing back with ingenuity, as a root cause analysis tool, so I can literally look at everything. We sometimes end up finding that the issue is somewhere else, or that we need to do something differently. Other times, we see that maybe where we log errors isn’t optimal. 

We’re also using New Relic to do technical testing. We have a set of key endpoints on our critical path. If one of them becomes unavailable, we know something bad is happening. We run two or three different synthetic scripts that mimic the end user and how they would interact with the site and our APIs. It’s important that we see these technical things from a user standpoint. We can even see security vulnerabilities within the context of our entire stack, right next to system performance, in one location. New Relic helps us focus on the right metrics to deliver on customer experience.
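The idea behind a synthetic check on critical-path endpoints can be sketched in Python as follows. This is a generic illustration, not a New Relic script: the endpoint paths are hypothetical, and `fetch_status` is an assumed callable that you would back with whatever HTTP client you use.

```python
from typing import Callable, Dict, List

# Hypothetical critical-path endpoints; the real list is not given in the text.
CRITICAL_ENDPOINTS = ["/search", "/basket", "/checkout"]


def run_synthetic_check(
    fetch_status: Callable[[str], int], endpoints: List[str]
) -> Dict[str, bool]:
    """Probe each endpoint and record whether it answered with a 2xx status."""
    results = {}
    for path in endpoints:
        try:
            status = fetch_status(path)
            results[path] = 200 <= status < 300
        except Exception:
            # A failed request (timeout, connection error) counts as unavailable.
            results[path] = False
    return results


def unavailable(results: Dict[str, bool]) -> List[str]:
    """List the endpoints that failed the check."""
    return [path for path, ok in results.items() if not ok]
```

Passing the fetcher in as a parameter keeps the check easy to test with a fake, and any non-empty result from `unavailable` is the signal that something on the critical path needs attention.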