Why New Relic
New Relic provides end-to-end visibility with detailed insight into performance, stability, availability, and customer experience, easily shared across multiple disciplines and teams.
- Enables DevOps to continuously monitor health and performance of infrastructure and applications, and identify and resolve issues before they impact customers
- Reduces the time engineers spend looking up server-level statistics, thanks to consolidated, end-to-end observability across AWS containers and microservices
- Proactively alerts the DevOps team if resource utilization exceeds thresholds to enable rapid investigation and remediation
- Empowers multiple teams with infrastructure and application data that’s specifically relevant to them
India has a population of nearly 1.4 billion people, yet only 30-40 million individuals hold credit cards. One reason is the large cross-section of the population with low income and poor or non-existent credit histories, which makes it difficult for them to qualify for credit through traditional channels. ZestMoney is on a quest to change this by helping more people get credit to participate in the consumer economy, and ultimately grow the digital footprint in India.
An innovative FinTech, ZestMoney built a platform that integrates mobile technology, digital banking, and artificial intelligence, enabling people to apply for and get a decision on credit within minutes. ZestMoney is catching on quickly across India: to date, about 6 million customers are using their ZestMoney credit to pay for merchandise, travel, education, healthcare, and much more. The target is to grow the number of customers fivefold within the next year, as well as expand merchant partnerships from 3,000 today to more than 8,000.
The challenge is how to achieve this scale while delivering on performance and customer experience expectations.
Accelerating service delivery with AWS containers
ZestMoney needed an elastic, cost-effective infrastructure to meet its scalability, DevOps, and service delivery goals. That’s why the company chose Amazon Web Services (AWS) as its cloud platform. ZestMoney uses Amazon Elastic Compute Cloud (EC2) to host a few core services, and Amazon Elastic Container Service (ECS) for most of its containerized services, including its decision engine.
In addition, ZestMoney uses Amazon Relational Database Service (RDS) for its database, AWS CodeDeploy for all deployments, Amazon CloudWatch to monitor logs, and AWS CloudTrail to monitor events. The company also relies on Amazon Route 53 for DNS, Amazon CloudFront for content delivery, Amazon Simple Queue Service (SQS) for messaging, and Amazon Simple Notification Service (SNS) for notifications.
The company chose to deploy Docker containers on AWS for optimal scalability, stability, and performance in critical operations, such as running the ZestMoney decision engine and returning pre-approval notices and repayment schedules to customers within minutes of their applications.
Ganesh Muralidhar, director of DevOps at ZestMoney, says, “In the past, it took two or three minutes for a new server to come up with all services running on just EC2. Post-migration to containers, it now takes only a few seconds for services to launch or scale up.”
Amazon ECS also helps ZestMoney improve the availability of services. “If a container goes down, ECS automatically brings up a new one and the service stays healthy,” Ganesh notes. “We were running at around 97.5% to 98% uptime last year. This year we are 99.86%.”
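To put those uptime figures in perspective, the short sketch below converts an uptime percentage into the downtime it implies over a year. It assumes a 365-day year (8,760 hours), and the function name is illustrative rather than anything taken from ZestMoney or New Relic.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


# The three uptime figures quoted above.
for uptime in (97.5, 98.0, 99.86):
    hours = annual_downtime_hours(uptime)
    print(f"{uptime}% uptime -> about {hours:.1f} hours of downtime per year")
```

Under that assumption, moving from 97.5% to 99.86% uptime corresponds to a drop from roughly 219 hours of downtime a year to about 12.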
Gaining end-to-end observability
On Amazon ECS, ZestMoney currently runs more than 80 microservices, each with a specific rule for processing customer applications and ultimately determining whether an applicant is eligible for credit. Additional business cases include disbursements and processing of repayments. It’s critical that this end-to-end process is error-free and performing optimally.
“Customers hate a poorly performing experience,” says Ganesh. “They may drop off and not bother completing the application.”
For this reason, ZestMoney relies heavily on New Relic to instrument and monitor the full breadth of its cloud, container, and application environment. According to Ganesh, “New Relic provides extensive information in terms of stack trace and error analysis, which is easy to consume and share with multiple teams. This flexibility and the depth of information New Relic provides were the main reasons we opted for New Relic.”
With so many services supporting its business, ZestMoney takes advantage of New Relic Infrastructure to monitor metrics such as CPU utilization for each service across the entire infrastructure, all through a single interface.
Ganesh points out, “With New Relic, you don’t have to click on every server to check its utilization. This reduces the amount of time our engineers spend looking up CPU levels and other server-level statistics. Adding to this, setting up alerts for infrastructure-level anomalies is also pretty easy. We have autoscaling on AWS but want a heads up if that’s getting triggered so we can look into why utilization is so high.”
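The idea behind a utilization alert can be sketched in a few lines. The example below is a generic illustration, not New Relic's alerting API or ZestMoney's configuration: the 80% threshold, the service names, and the sample values are all assumptions made up for the example.

```python
from statistics import mean

CPU_ALERT_THRESHOLD = 80.0  # percent; assumed value for illustration


def services_over_threshold(samples, threshold=CPU_ALERT_THRESHOLD):
    """Return {service: average CPU %} for services whose average exceeds the threshold."""
    # Services with no samples are skipped rather than treated as healthy or unhealthy.
    averages = {service: mean(values) for service, values in samples.items() if values}
    return {service: avg for service, avg in averages.items() if avg > threshold}


# Hypothetical recent CPU utilization samples (percent) per service.
recent_cpu = {
    "decision-engine": [91.0, 88.5, 93.0],
    "repayments": [42.0, 47.5, 39.0],
}

for service, avg in services_over_threshold(recent_cpu).items():
    print(f"ALERT: {service} average CPU {avg:.1f}% exceeds {CPU_ALERT_THRESHOLD}%")
```

Averaging over a short window, rather than alerting on a single sample, is one common way to avoid paging the team for a momentary spike.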
Monitoring service performance
ZestMoney relies on New Relic to help meet performance objectives for its applications and operating environment. For instance, one merchant partner required processing performance of at least 50 transactions per second (TPS) at any given time for consumers paying with their ZestMoney account. By using New Relic during initial performance testing, Ganesh and his team identified several hot spots within the ecosystem that could be improved to achieve higher TPS with optimal performance.
“New Relic helped us understand the bottlenecks and identify what needed to be fixed,” Ganesh says. “After resolving the issues, we showed that our services could scale up to 65 TPS for the e-commerce merchant.”
ZestMoney uses New Relic Synthetics for ongoing monitoring of service performance and health to proactively identify issues before they affect customers. “We use New Relic Synthetics across different domains, geographic locations, and endpoints to perform health checks,” Ganesh explains. “If there’s any kind of failure at a specific location, we get an alert even before the customer reports a problem.”
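A health check of this kind boils down to probing a set of endpoints and flagging the ones that fail. The sketch below is a simplified, hypothetical stand-in and not New Relic Synthetics itself: the endpoint URLs and location labels are invented, and the probes run from a single machine rather than from distributed locations.

```python
from typing import Callable, Dict, List, Tuple
from urllib.request import urlopen


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status before the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError, and socket timeouts derive from OSError;
        # ValueError covers malformed URLs.
        return False


def failed_checks(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = http_probe,
) -> List[Tuple[str, str]]:
    """Probe every endpoint and return the (location, url) pairs that failed."""
    return [(location, url) for location, url in endpoints.items() if not probe(url)]


# Hypothetical endpoints keyed by a monitoring-location label.
endpoints = {
    "mumbai": "https://example.com/health",
    "singapore": "https://example.com/status",
}


def stub_probe(url: str) -> bool:
    """Offline stand-in for http_probe: pretends the /status endpoint is down."""
    return not url.endswith("/status")


for location, url in failed_checks(endpoints, probe=stub_probe):
    print(f"ALERT: health check failed from {location}: {url}")
```

Passing the probe in as a parameter keeps the example runnable offline; swapping in http_probe would make real HTTP requests.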
Sharing data across the company
Another advantage that Ganesh sees from using New Relic is the ability to build individual dashboards for different teams across the company. This empowers each team with observability into the infrastructure and application data that’s specifically relevant to them. For example:
- The business team can observe the overall health of services to ensure a positive customer experience.
- The company’s DevOps team relies on New Relic to ensure applications in development perform properly and aren’t degraded by inefficiencies such as issuing unnecessary database calls or calls to other external services.
- The data science team gains visibility into the DevOps process so they can determine with the developers where to embed queries in the application that return valuable data for analytics and business intelligence.
Assuring service health and performance to support ongoing business growth
As ZestMoney continues to evolve and refine its infrastructure to drive innovation and further elevate the customer experience, the company is looking to move from ECS to Amazon Elastic Kubernetes Service (EKS). New Relic can help the ZestMoney team gain visibility into the new EKS environment.
Ganesh says, “We will look to New Relic Kubernetes Cluster Explorer to give us a clear view of how many pods are running, how the cluster is behaving, and the stability of the system. Most important, it will provide detailed stats at the host level so we can see metrics for a specific pod on a particular host. This will immediately tell us if the service levels have any problems, so we can address them.”
Ultimately, Ganesh measures the success of his organization by how well he can deliver system uptime and performance at scale. “I want our systems to be up and running without any anomalies, no services failing,” he says. “During the most recent sale season, our systems handled up to 280 TPS and remained super stable. I want to increase that to 500 TPS with 100% uptime. For me, that is success.”
Ganesh sees New Relic playing a central role in helping him achieve his goals: “I need information on not just the number of services down, not just the error rate or resource-level utilization, but also if just one service is breaking, I want to know the details of why and the impact level in case of any cascading effect. New Relic has already solved this. With the distributed tracing feature, we can look at the complete path of the service, understand where the hotspot is, and why this particular service is taking longer, so we can promptly resolve the issue.”
He concludes, “Any level of slowness in our systems can drive away customers. Our responsibility is to make sure our systems are error-free and performing their best to deliver a quality customer experience. That’s what New Relic enables us to do.”