Rapid incident management is critical to the smooth running of Gett’s international operations. Understanding and fixing performance problems affecting drivers and riders is complicated by a dynamic microservices cloud architecture. New Relic helps this talented team meet the challenge and deliver a superior digital experience.
Meeting 99% SLAs
One of Gett’s major challenges is keeping its technology reliable and available to drivers and riders more than 99% of the time, especially when the business experiences unexpected spikes in traffic. In these scenarios, it is crucial that the research and development team, which includes tech support and incident management, works closely with customer care on how technology is being developed, deployed, and monitored, and on the impact that has on drivers and riders. Getting a comprehensive and swift understanding of how everyone is experiencing the service in real time is paramount.
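To put the 99% figure in concrete terms, here is a minimal sketch of the downtime such a target allows. The 30-day window and the function name are assumptions made for illustration; the article does not say how Gett measures its SLA period.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# Assumed 30-day window: 43,200 minutes in total, 1% of which may be downtime.
print(round(downtime_budget_minutes(99.0, 30)))  # 432 minutes, i.e. 7.2 hours
```

Because the stated goal is more than 99%, this budget is an upper bound rather than a target.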
Five years ago, Gett didn’t have a proper tech support team, nor a precise incident management process with the right monitoring tools in place. Dani Konstantinovski, global tech support manager at Gett, recalls: "When there was a problem, the first we used to hear of it was from the field. Drivers used to call our customer care team who then called us. It simply wasn't the best way to deal with putting out the fires." Lior Avni, global incident manager, adds, "In my book, that’s a failure." Lior works closely with Dani whenever there’s a critical incident that needs to be escalated.
Gett has since invested significant time and resources to deliver a superb customer experience. "We had many challenges before, which we dealt with over the years: organization, mapping of services, missing alerts, things like that. One by one we took care of everything," explains Lior. "So right now, the only two challenges are shortening the mean time to understand, and mean time to detect."
The size of the production environment presents real challenges. "We are working with a microservices architecture with close to 200 microservices in our system. When something goes down, there’s usually a butterfly effect and chain reaction, and we need to find the source quickly to put out what we at Gett term as 'fires,'" says Dani. "The breadth, length and width of our production system keeps on growing," adds Lior. "The challenge is to monitor so many services and machines and get the work done in an organized way."
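The chain reaction Dani describes can be pictured as a dependency graph in which a failure in one service makes its callers look unhealthy too. The sketch below is only an illustration of that idea, not Gett's or New Relic's method. The service names and the `DEPENDS_ON` map are hypothetical. It flags as likely sources those unhealthy services that have no unhealthy dependency of their own.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "rider-app": ["orders"],
    "orders": ["pricing", "payments"],
    "pricing": ["geo"],
    "payments": [],
    "geo": [],
}


def likely_sources(depends_on: Dict[str, List[str]], unhealthy: Set[str]) -> Set[str]:
    """Return unhealthy services whose own dependencies are all healthy."""
    return {
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(service, []))
    }


# Three services are alerting, but only one has no failing dependency.
print(likely_sources(DEPENDS_ON, {"rider-app", "orders", "pricing"}))  # {'pricing'}
```

In a real system with close to 200 microservices the graph would come from tracing or service-map data rather than a hand-written dictionary.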
Identifying issues across microservices
As a major Amazon Web Services (AWS) user, Gett was using multiple monitoring tools, but those tools were falling short of what was needed. The need for full, real-time observability over its microservices drove Gett to choose New Relic, and then to expand its use, to improve incident management and deliver strong digital customer experiences.
"New Relic makes our lives much, much easier. We can precisely identify the problem by jumping into New Relic to understand exactly what service is affected, what’s the reason, and what we need to do. With microservices, when one service is going down, you need to understand exactly what and where it is impacting. New Relic gives us this observability—and without it, this job would be very, very hard to do," says Dani.
Using New Relic to consolidate monitoring tools streamlines how the team understands issues and saves on costs. "We were using logging tools like ELK that we were finding complex to use. So, adding logs into New Relic was a wow moment, because for the first time, we have everything in the same system, making it so much easier to understand and to identify problems," says Dani.
"Not needing to switch between tools saves us valuable minutes when we’re managing an incident, which helps us reduce our mean time to understand," adds Lior. "New Relic is now my No. 1 tool for incident management, both in how it helps manage every service and creates the notification channels to be directed to the actual engineer who owns the service. Fine-tuning alerts minimize mean time to detect from five to under two minutes, due to those specific alerts from New Relic."
Making teams the owners of their own services
Managing a service used by millions daily has to take into account sudden changes in customer demand driven by unexpected events. The team must be extremely responsive to how technology is performing under pressure. "While we can prepare for major events like we did for the World Cup in Russia, there’s a second kind of spike like an extreme storm or one of our competitors having technical problems that lead more customers to unexpectedly choose to use our app. New Relic helps us to see exactly where these huge increases are building up and where we must add machines and capacity," explains Dani.
The speed and precision of observability change the dynamics of how these incidents are managed, allowing Gett to be more proactive in its response. "Clearer, earlier visibility of problems means that when drivers call our customer care team, they are already prepared for this call and can confidently tell them we are dealing with it. For me, a huge advantage of using New Relic is how it helps us manage the resolution of end-to-end incidents and problems together," explains Dani.
Having a comprehensive single source of truth with New Relic enables the incident management and development teams to collaborate much more closely on their prime objective of delivering an excellent digital experience. As Lena Katz, head of R&D, says: "At Gett, the developers in R&D are the owners of their services. We want them to be happy developers and don’t want to have pagers going off when it's not necessary. Because with New Relic we can see so clearly across all of our 200 or more microservices, we know exactly what’s going on and which microservice may have started the fire. So, we are able to alert the right development team."
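The ownership model Lena describes, where an alert goes only to the team that owns the affected service, can be sketched as a simple lookup. This is a minimal illustration under stated assumptions: the `OWNERS` map, the team names, and the fallback recipient are hypothetical, and the code does not use or represent any New Relic API.

```python
from typing import Dict

# Hypothetical ownership map: service name -> owning team's channel.
OWNERS: Dict[str, str] = {
    "pricing": "team-pricing",
    "orders": "team-orders",
    "payments": "team-payments",
}


def route_alert(service: str, owners: Dict[str, str], fallback: str = "incident-manager") -> str:
    """Return the channel to notify for a service, or the fallback if unowned."""
    return owners.get(service, fallback)


print(route_alert("pricing", OWNERS))  # team-pricing
print(route_alert("unknown-service", OWNERS))  # incident-manager
```

Routing unowned services to a fallback is a design choice in this sketch: it keeps an alert from being dropped while making gaps in the ownership map visible.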
This collaboration matters because the members of the tech support team are not technology specialists, so the clear guidance New Relic gives them is extremely important. "New Relic helps us to identify the problem exactly. Not only which microservice has a problem, but also which specific error caused this service to have this issue. So, when we contact the developers, we send a link to the specific error so they can fix the problem much faster," says Dani.
"I always tell my R&D engineers, new features are nice, but if your legacy doesn't work, who cares about new features?'’ Lior says. "Customer experience needs to be first grade. This is the only thing that matters. It's not the 'everything,' it's the 'only thing.' New Relic dashboards help tremendously. When I show them, it is an 'aha' moment for them, because they have a complete picture and, with a click of a button, can see the most troublesome transaction to fix."
Reducing MTTR by 50%
"We are committed to serve our customers with the best SLA of 99%, so we need to be able to detect our issues as soon as possible and to resolve them on the spot. That's why we have invested in New Relic to observe our applications because it gives us the ability to understand what's going on at all times with our services," says Lena.
Gett now has a strong, collaborative team with well-thought-out processes for incident management, and New Relic is central to how everyone works together. MTTR has been reduced by 50%.
"You open up New Relic, and instantly you can understand where you have a problem, what's the business impact based on the microservices that have some issues. You can understand what is going to be an impact on our customer care and our clients. That helps us all work together on managing the incident smoothly," says Dani.