Founded in 1999, CafePress calls itself “the world’s best gift shop,” but that label doesn’t begin to describe the breadth or scope of what the e-commerce company actually does. Offering on-demand printing from its Louisville, Kentucky, manufacturing facility, CafePress is a pioneer in customization, its website serving as a hub for designers and consumers alike.

Indeed, with artists and writers making their designs and slogans available to be printed on any of the site’s vast array of merchandise, consumers coming to the site to create individualized wares, and major entertainment companies offering up their own products for personalization, CafePress has become the go-to place for customized products.

Today that means offering more than one billion items from a global community of more than two million designers through the e-commerce website. To power it all, CafePress now relies on a hybrid cloud infrastructure, primarily employing Amazon Web Services (AWS) to help deliver its IT infrastructure and New Relic to support scalable, reliable application delivery on top of that infrastructure.

Quickly resolving hidden performance issues

Like many businesses today, CafePress is a technology company at heart, depending on a host of home-grown and third-party applications to keep its website, manufacturing and fulfillment facilities, and back office humming.

In the early days, however, getting a true picture of application performance was elusive. And when applications did experience hiccups, they were slow to be detected and even slower to be diagnosed and fixed. Describing that time, CafePress Vice President of Engineering Bryan Downs says, “In those days, when an issue arose, every engineer on the team would offer a theory as to the root cause, but it was all just guesswork. Meanwhile, our manufacturing plants would be experiencing issues that lasted for weeks.”

And even when things seemed to be going well on the e-commerce site, the team didn’t have much to go on besides their own gut feelings. Says Downs, “We’d browse the site without problems and figure that performance must be the same for the other 10,000 concurrent users, which of course wasn’t necessarily true.”

What Downs and team needed were the metrics that would prove their theories correct—or incorrect—and the visibility to spot trouble before real issues arose. That’s why CafePress became one of New Relic’s earliest customers, deploying its SaaS-based application performance monitoring solution, New Relic APM.

Achieving deep visibility

Nearly a decade later, much has changed within CafePress’ IT environment, which is now largely cloud-based through AWS, but one constant has been the use of New Relic monitoring. CafePress Manager of Business Technology Cody Martinho, whose team manages the network operations center (including site scope and site availability), can’t imagine a time when this wasn’t the case.

“Today, New Relic monitoring is fully integrated within our websites and throughout the majority of our internal applications,” says Martinho. “My team is responsible for configuration and incident response within New Relic, so we work with the other development teams to define the application metrics that will establish the thresholds for alerting.”

The resulting deep visibility—across a hybrid cloud environment that includes .NET Core for new applications and .NET Framework for older applications—enables Martinho and team to not only quickly address any issues that arise but also to identify (and correct) the trends that could lead to performance problems down the road. “New Relic’s traces provide a level of detail that we aren’t able to get with any of our other tools,” he says. “As a result, we’re able to really get in and troubleshoot an issue to completion.”

Downs agrees. “We have a lot of services talking to one another,” he says. “With New Relic we’re able to see the dependencies, both internal and external. So, for instance, if website performance is suffering because our external search service has slowed, we’re able to jump into New Relic and see the reason for this, whether it’s that our databases have slowed down, or that our Elasticsearch is underperforming, or whatever else is causing the problem.”

What’s more, by taking advantage of custom dashboards created using New Relic Insights, the CafePress IT team has been able to move from reactive to proactive mode; for example, unearthing data that enables them to predict how a release will impact website performance, and then get feedback on the true impact once the release is in production. This capability has been key in ensuring the company’s smooth migration to the cloud.

Moving to the cloud faster with real-time data

With just one remaining data center (in Las Vegas), CafePress is nearing completion of its transition from an on-premises environment to one that resides exclusively in the cloud. For a retailer like CafePress, which experiences an enormous spike in business during November and December, moving the data center to the cloud is pretty much a no-brainer from a financial perspective. According to Downs, by closing its data center and “going 100% cloud,” CafePress expects to cut its annual infrastructure costs (hardware and maintenance) by 50% once the migration is complete. What’s not as easy is re-architecting applications that have been optimized to run on dedicated hardware on a hyper-fast network or, as Downs says, “rebuilding our applications so that they don’t have to sit on an eight-core big machine with lots and lots of RAM.” New Relic has been instrumental in that process.

Explains Martinho, “The New Relic platform has been of great use to us as we migrate applications from an on-prem environment into the cloud because it allows us to see exactly what’s going on and where errors might occur. What’s more, the information we derive from it helps us identify things like data that could be cached to speed database performance or how large an application is likely to be once it’s migrated to AWS.”

Another big plus of New Relic monitoring is that when an issue does arise, as happened recently with CafePress’ imaging system, the team can determine whether the problem originates in its own code or within AWS. If the latter is the case (as it was with the imaging system issue), the team simply opens a support ticket with AWS and waits for that team’s input before proceeding with diagnostics and repair.

Transitioning to a DevOps deployment model

Concurrent with its move to the cloud, CafePress is also gradually transforming its IT infrastructure and software engineering teams into DevOps organizations, a process that, like the migration to the cloud, has been greatly aided by New Relic. Explains Martinho, “Whenever somebody goes to GitHub [development platform], we do a pull request for the feature branch, and the development team will check to make sure that all of the unit tests and integration tests work. If so, an automated build will occur and [after confirmation] go live. So we’re definitely doing continuous integration. And though we’re not at the point of continuous deployments yet, we are up to almost 200 per month, which is huge for us.”

As part of this process, the team uses New Relic to monitor what it calls “canary deployments”—where a new or updated application is pushed out to just one server. “We then watch the performance of that individual server in New Relic APM, comparing its performance with the other servers that we are monitoring,” says Martinho. “If CPU utilization and response times are the same or better, we’ll upgrade that canary to the entire farm, and as that rolls out, we’ll continue to watch New Relic very closely during release.”
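The canary check Martinho describes can be sketched in a few lines of Python. This is only an illustration of the comparison logic, not CafePress's tooling or a New Relic API: the ServerMetrics type, the canary_is_healthy function, and the tolerance parameter are all hypothetical names, and the metric values are assumed to have been collected separately from the monitoring system.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ServerMetrics:
    """Averaged metrics for one server over the comparison window (hypothetical type)."""
    name: str
    cpu_percent: float        # average CPU utilization, 0-100
    response_time_ms: float   # average response time in milliseconds


def canary_is_healthy(canary, baseline, tolerance=0.0):
    """Return True when the canary is "the same or better" than the rest of the farm.

    The canary is compared against the mean CPU utilization and mean response
    time of the baseline servers. `tolerance` is an assumed fractional slack
    (0.05 would allow the canary to be up to 5% worse) and is not from the source.
    """
    if not baseline:
        raise ValueError("need at least one baseline server to compare against")
    baseline_cpu = mean(s.cpu_percent for s in baseline)
    baseline_rt = mean(s.response_time_ms for s in baseline)
    return (
        canary.cpu_percent <= baseline_cpu * (1 + tolerance)
        and canary.response_time_ms <= baseline_rt * (1 + tolerance)
    )


# Example with made-up numbers: one canary and two servers on the old release.
farm = [
    ServerMetrics("web-01", cpu_percent=42.0, response_time_ms=210.0),
    ServerMetrics("web-02", cpu_percent=44.0, response_time_ms=190.0),
]
good_canary = ServerMetrics("web-canary", cpu_percent=40.0, response_time_ms=200.0)
bad_canary = ServerMetrics("web-canary", cpu_percent=60.0, response_time_ms=200.0)
```

With these sample values, canary_is_healthy(good_canary, farm) passes and the release would roll out to the rest of the farm, while bad_canary fails on CPU utilization and the rollout would stop at one server.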

Reduced costs, a system-wide view, and adoption beyond IT

Almost a decade into its use of New Relic, CafePress continues to reap the benefits of the solution, chief among them the cost savings from outages the company has avoided by detecting and correcting problems early.

Says Martinho, “With downtime of our e-commerce site costing CafePress up to approximately 5.5% of its daily revenue per hour, New Relic has paid for itself many times over—most recently by helping us unearth a problem that was causing our newly launched site to go down for an extended period at the same time each night. By using New Relic service maps to visualize our entire environment, we were able to trace the problem to the load balancer (at the very top of the funnel), which had just stopped sending traffic to everything. Without New Relic, I don’t know how long it would have taken us to get to the root of that problem.”
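To put the downtime figure in concrete terms, the arithmetic can be sketched as below. The 5.5%-per-hour rate comes from Martinho's quote, where it is an approximate upper bound. The function name and the daily revenue figure are hypothetical, since the source does not state CafePress's actual revenue.

```python
def downtime_cost_upper_bound(daily_revenue, hours_down, hourly_fraction=0.055):
    """Estimate the maximum revenue lost to an outage.

    hourly_fraction is the share of daily revenue lost per hour of downtime;
    0.055 reflects the "up to approximately 5.5%" figure quoted above.
    """
    if daily_revenue < 0 or hours_down < 0:
        raise ValueError("daily_revenue and hours_down must be non-negative")
    return daily_revenue * hourly_fraction * hours_down


# Illustration only: a made-up daily revenue of 100,000 and a two-hour outage.
example_loss = downtime_cost_upper_bound(100_000, 2)
```

Under that assumed revenue, a two-hour outage would cost up to roughly 11,000, or 11% of the day's sales, which is why a recurring nightly outage adds up quickly.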

Other teams across CafePress are looking to capitalize on the deep visibility provided by New Relic’s customizable dashboards. “We’re still in the early stages of doing this, but by wedding New Relic Insights data with the performance metrics gleaned from New Relic APM, we should be able to allow our customer experience, marketing, and customer service teams to better manage their areas as well, allowing them to do things like compare the metrics of what our customers are searching for versus what they’re actually buying,” says Martinho. “And we just completed a very successful trial of New Relic Browser and New Relic Synthetics, which we hope to begin using in the future as well.”