At D24, we are a worldwide payment service provider (PSP), one of the biggest in Latin America. Through our different providers, we have integrations with hundreds of local banks, wallets, card acquirers, and more via one single application programming interface (API), without any operational burdens. We process millions of transactions every day from users seeking to (i) deposit or withdraw funds to or from our merchants' websites or (ii) buy our merchants' products or services online. This is why our services must be available 24/7/365, which requires a holistic understanding of what's happening under the hood at all times. That's why we partnered with New Relic, so we can better understand our tech stack, discover emerging trends, and address issues before they affect our customers. We are proud of keeping an SLA above 99.99% thanks to the ability of New Relic to combine operational metrics with infrastructure and APM. We know not only how our systems are performing, but also how they are being used and what's not behaving as it should.
Receiving an HTTP 200 in under 300 milliseconds after creating a cash-in request doesn't necessarily mean that everything is working well. Seeing end-users being able to pay in a frictionless way does. To do that, we need to be able to log and track every payment attempt on every payment method we offer, dig into any anomalies before they become issues, and implement a solution before customers even notice an impact on our operations. Because of the huge traffic that our systems have, we need to be able to answer a lot of questions in seconds, not hours: Are payments failing because a local bank's website is down, or are our servers overloaded? Are database queries executing as usual or is one of our canary deployments executing queries without proper indexing? Was a payment rejected because of an expected anti-fraud rule or a misconfiguration?
We rely on New Relic's observability features to create a round-the-globe, round-the-clock payments observability system. In particular, we leverage the following key features enabled by New Relic's tooling: dashboards, custom events, canary deployment metrics, integrated alert systems, logs, and infinite tracing.
Custom events help us investigate when things go wrong
Every time a merchant creates a new transaction (a new cash-in or cash-out request), our microservices send a real-time custom event to New Relic with details of the transaction. A merchant created a cash-in request? We receive an event of type "CREATE". A user paid the cash-in request? We receive another event of type "COMPLETE". Are we receiving a healthy number of completed events compared to the number of payments created over the last few minutes? Everything is working well! Are we below a predefined threshold? We need to investigate what's wrong. Are users creating more cash-in requests per minute than usual? This most likely means they are unable to pay and are trying multiple times. Do all metrics remain the same besides the conversion rate? This most likely means that there are delays in releasing the transactions on the bank's end.
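The CREATE/COMPLETE logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event type name `PaymentEvent`, the attribute names, and the threshold value are hypothetical, since the actual schema is not described here. The only New Relic convention relied on is that a custom event is a flat set of attributes that includes an `eventType`.

```python
import json
from collections import Counter

# Hypothetical event type and attribute names; the real schema is not public.
EVENT_TYPE = "PaymentEvent"


def build_event(action, payment_method, merchant):
    """Return one custom event as a flat dict carrying an eventType attribute."""
    if action not in ("CREATE", "COMPLETE"):
        raise ValueError(f"unknown action: {action}")
    return {
        "eventType": EVENT_TYPE,
        "action": action,
        "paymentMethod": payment_method,
        "merchant": merchant,
    }


def conversion_rate(events):
    """Completed payments divided by created payments; None if nothing was created."""
    counts = Counter(event["action"] for event in events)
    if counts["CREATE"] == 0:
        return None
    return counts["COMPLETE"] / counts["CREATE"]


def needs_investigation(events, threshold):
    """True when the conversion rate falls below the predefined threshold."""
    rate = conversion_rate(events)
    return rate is not None and rate < threshold


if __name__ == "__main__":
    # Four cash-in requests created, only one of them paid.
    window = [build_event("CREATE", "PIX", "merchant-a") for _ in range(4)]
    window.append(build_event("COMPLETE", "PIX", "merchant-a"))
    print(json.dumps(window[0]))
    print(conversion_rate(window), needs_investigation(window, threshold=0.5))
```

In production the same ratio would be computed by New Relic over the ingested events rather than in application code; the sketch only makes the reasoning explicit.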
Let's take Pix as an example. Pix is a payment method designed by the Central Bank of Brazil (CBB) for bank transfers, which has a completion time of 10 seconds from the moment the user pays. It is free for end users, and all they need to use it is a bank account at a Brazilian financial institution. The CBB requires all local banks to be integrated with Pix and to offer certain service-level agreements (SLAs). That said, end users rely on dozens of major banks, and any of them may face issues at any point in time. So, how do we know when one of these banks has issues that are outside of our control? Even if we are unable to fix an issue because it's caused by the local institution, we need to be aware of it so that we can inform our merchants in real time and they can take the appropriate actions.
That's why we have created dashboards that let us see the performance of each bank in real time. In case of anomalies, we can investigate further by observing the metrics of the users of that particular financial institution, such as conversion rate, average approval time, and number of deposits per user. Analyzing these metrics allows us to understand where the issue is. Is the issue with the bank receiving the money? Would changing the receiving bank (the one that users are asked to send the money to) help? Or is the issue with the sending bank (the one from which we receive the funds)? This ability, which New Relic provides, goes beyond monitoring as purely technical observability: it drives greater transparency around business performance overall.
Dashboards for conversion rates
Before we started using New Relic some years ago, we used to rely on the metrics from our cloud provider to understand the status of our infrastructure, and on our business intelligence platforms to track business KPIs. But with a growing, global customer base and market reach, we understood that we had to transform all the information we had into actionable insights.
One of our core areas to monitor is the conversion rate of all of our payment methods. As a PSP, we are integrated with hundreds of third-party providers such as online wallets, local banks, and card acquirers. So when dealing with any of our merchants, if a third-party provider is having issues it is as if we were the ones having the issues.
That's why we built New Relic dashboards that allow us to review the performance of each payment method separately in tiles. Using NRQL, each tile summarizes in real time the flow of payments being created and the payments being completed and, therefore, the conversion rate. The conversion rate of our payment methods is a key metric for our business: it shows whether a payment method is offering a smooth user experience or not. If a conversion rate goes below our acceptable thresholds, our monitoring team receives alerts that allow us to take immediate action, like enabling a backup solution while we investigate with the provider what's happening. We also compare, on the same tiles, the real-time results with the results from the same moment of the previous week, enabling us to automatically detect seasonal anomalies.
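As a rough illustration of what sits behind such a tile, the Python helper below assembles an NRQL query that divides completed payments by created payments and compares the result with the previous week. The event type `PaymentEvent` and the attributes `action` and `paymentMethod` are assumptions made for this sketch, not our actual schema; `filter()`, `COMPARE WITH`, and `TIMESERIES` are standard NRQL. The second function shows one possible way to flag a week-over-week drop, with a hypothetical tolerance.

```python
import re

# Hypothetical event type and attribute names, used only for illustration.
EVENT_TYPE = "PaymentEvent"


def conversion_rate_nrql(payment_method, window_minutes=30):
    """Build an NRQL query for one payment method's conversion-rate tile."""
    # Accept only simple identifiers so the value is safe to embed in the query.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", payment_method):
        raise ValueError("payment_method may contain only letters, digits, _ and -")
    return (
        "SELECT filter(count(*), WHERE action = 'COMPLETE') "
        "/ filter(count(*), WHERE action = 'CREATE') AS 'Conversion rate' "
        f"FROM {EVENT_TYPE} WHERE paymentMethod = '{payment_method}' "
        f"SINCE {int(window_minutes)} minutes ago "
        "COMPARE WITH 1 week ago TIMESERIES"
    )


def dropped_vs_last_week(current, last_week, tolerance=0.10):
    """True when the rate fell by more than `tolerance` (relative) week over week."""
    if last_week <= 0:
        return False
    return (last_week - current) / last_week > tolerance


if __name__ == "__main__":
    print(conversion_rate_nrql("PIX"))
    print(dropped_vs_last_week(current=0.60, last_week=0.80))
```

Comparing against the same moment of the previous week, rather than against a fixed number, keeps normal weekly seasonality from being reported as an anomaly.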
These dashboards have tables with the names and metrics for each of our merchants. So, if we suspect that the issues are happening only with specific merchants, we can click on the relevant name, and all the tiles from the page are automatically filtered by that merchant's metrics. And if we want to see further metrics for a particular payment method and merchant, we just click on the payment method and it sends us to another page with further metrics. It's astonishing how fast we can go from detecting an anomaly to actually knowing why it's occurring.
As good as having all the metrics that affect our business is, with the number of services, servers, databases, payment methods, and merchants that we have, manually monitoring from a dashboard simply wouldn't scale. That's why each of the metrics that matter to us has its own alert. New Relic allows us to set simple alerts capable of sending us notifications for all our merchants, payment methods, and infrastructure metrics without any complications.
All our alerts are sent via Slack. We have several Slack groups, each dedicated to a particular kind of monitoring. We receive alerts whenever we see a symptom of poor performance: for example, when conversion rates are lower than expected, when infrastructure metrics are approaching certain values, or when merchants stop sending traffic, so that we can offer them assistance in case they are having internal issues. We enjoy using our expertise and the observability practices we have developed to make sure that our merchants' operations are also performing as expected.
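The three symptoms mentioned above can be expressed as simple threshold checks. The sketch below is illustrative only: in practice these conditions are configured as New Relic alerts rather than written in application code, and the Slack group names, snapshot fields, and threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    group: str    # hypothetical Slack group that should receive the alert
    message: str


def evaluate(snapshot, min_conversion=0.60, max_cpu=0.85, max_silence_minutes=15):
    """Return one alert per symptom found in a metrics snapshot."""
    alerts = []

    # Symptom 1: conversion rate lower than expected.
    rate = snapshot["conversion_rate"]
    if rate < min_conversion:
        alerts.append(Alert("payments-alerts", f"Conversion rate is {rate:.0%}"))

    # Symptom 2: an infrastructure metric approaching its limit.
    cpu = snapshot["cpu_utilization"]
    if cpu >= max_cpu:
        alerts.append(Alert("infra-alerts", f"CPU utilization is {cpu:.0%}"))

    # Symptom 3: a merchant that has stopped sending traffic.
    for merchant, minutes in snapshot["minutes_since_last_request"].items():
        if minutes > max_silence_minutes:
            alerts.append(
                Alert("merchant-alerts", f"{merchant}: no traffic for {minutes} min")
            )
    return alerts


if __name__ == "__main__":
    snapshot = {
        "conversion_rate": 0.42,
        "cpu_utilization": 0.50,
        "minutes_since_last_request": {"merchant-a": 2, "merchant-b": 40},
    }
    for alert in evaluate(snapshot):
        print(alert.group, "-", alert.message)
```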
Canary deployment metrics and infinite tracing
Last year, we worked hard to migrate our microservices to Kubernetes to help us scale quickly and automatically. As our company started growing rapidly, our systems had to be capable of growing at the same pace. So, during the migration, we took a slow and safe approach to make sure that no user's payment process would be affected. At the start, the new Kubernetes services received a very small percentage of the traffic so that we could make sure that all the APM metrics were the same or even better. Then, we progressively increased traffic until we were able to fully shut down the legacy services. We managed to measure and optimize our ability to dynamically scale based on the traffic that our platform receives.
We are now able to handle thousands of transactions per second (TPS) and our infrastructure bill has remained steady over time.
As we continue growing, we continue adapting to the needs of our merchants, which means having the ability to implement changes quickly and without any negative impacts. We needed to be able to make sure that deploying a new version of an application would be risk-free, but naturally, more deployments mean a higher risk of introducing unwanted behaviors.
Now, even after a new version has passed all previous validations regarding code quality and tests, we deploy it using the canary deployment methodology and send New Relic a notification about the deployment of that specific application, which lets us monitor in real time any variations in the application's key metrics.
With an open source tool stack such as Argo CD and Argo Rollouts, together with GitOps methodologies, any new deployment starts by receiving 1% of traffic, or even less for the most critical apps. We then set up NRQL queries for metrics such as the error rate. When our monitoring shows that our infrastructure is robust, with the error rate and other queries within expected values, Argo automatically increases the traffic to the new deployment and then measures again. That continues until the new deployment handles all of the traffic. As a result of integrating with CI/CD tools like Argo, we gain more control over deployments and are well on the way toward continuous delivery.
These observability and organizational methodologies have led us to a 90% improvement in our mean time to recovery (MTTR): it now takes us less than 5 minutes to identify a problem. When we detect an issue, it is no longer because merchants have reached out to us after noticing that payments are not going through; it's because of our proactive approach. We use this to notify our merchants about the impact of an ongoing issue so they can take any actions they believe will help their end users have the best possible experience. We have built an observability system that scales and has given us an uptime of more than 99.99% without adding friction to our continuous innovation practices.
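The promote-and-measure loop described above can be simulated in a few lines. The sketch below is not Argo configuration and not our actual pipeline: the traffic steps, the error-rate limit, and the `error_rate_at` callback (standing in for an NRQL query against the canary) are assumptions chosen for illustration.

```python
def run_canary(steps, error_rate_at, max_error_rate=0.01):
    """Walk through traffic weights, measuring after each shift.

    steps          -- increasing traffic percentages, e.g. [1, 5, 25, 50, 100]
    error_rate_at  -- callable returning the canary error rate at a given weight
                      (hypothetical stand-in for an NRQL-backed check)
    Returns (final_weight, promoted).
    """
    current = 0
    for weight in steps:
        # Shift traffic to the new version, then measure its error rate.
        rate = error_rate_at(weight)
        if rate > max_error_rate:
            # Metrics left the expected range: send all traffic back to stable.
            return 0, False
        current = weight
    # Promoted only if the last step handed over all of the traffic.
    return current, current == 100


if __name__ == "__main__":
    healthy = lambda weight: 0.001
    degrades_under_load = lambda weight: 0.05 if weight >= 25 else 0.001

    print(run_canary([1, 5, 25, 50, 100], healthy))
    print(run_canary([1, 5, 25, 50, 100], degrades_under_load))
```

Starting at a very small weight limits how many payments are exposed to a faulty version, while measuring after every increase catches problems that only appear under higher load.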