
I have been working in the Observability and Customer Experience areas for over 25 years. In that time I have helped customers (and in some cases the companies that employed me) improve the performance of their applications and provide better experiences for their own customers, delivering a significant ROI on their spend. Along the way it became obvious to me that some key points about observability apply to everyone, so I am writing about them in these blog posts in the hope that many more of you can benefit.

Monitor everything

The first key lesson is the importance of monitoring everything that is involved in delivering the application or service, and not presuming that something doesn’t need to be monitored because it’s not considered an important or significant component. Let me provide a real-world example from a number of years ago…

A true story

On a trip to South Africa to provide enablement on database monitoring and performance tuning for a customer in the financial services sector, I was asked if I could first help resolve a performance issue that had been occurring intermittently. The request came in just as I was being picked up from the airport by a colleague after an overnight flight without much sleep, and the problem had just started happening again so it was urgent.

The application was a portal where customers could log in and retrieve copies of documents related to their policies, and sometimes the portal was taking a very long time to retrieve the documents. At the customer office I met with the Database Administrator who explained that the application team were pointing the finger at the database as the root cause, saying that it was not sending back the documents being requested by the customers fast enough. When he was viewing the data about the database activity he couldn’t see anything that indicated an issue, and he thought he was missing something in the data.

I had a look at the data and agreed with the database administrator that all requests were being executed very quickly, including those specific to the document requests. From the data that we had, the time appeared to be spent within the application.

One of the members of the app team joined us so we could go through the data for the app and database together, and I could explain what I was seeing. While we discussed how all of the monitored components worked together to retrieve and return the document, I eventually found out that the database requests were returning the file path for the specified document and not the document itself. The app then used this path to retrieve the file, which was located on a file server that was not being monitored.

I suggested that someone have a look at the file server and see what was happening on it, because it was likely to be the problem. They had a look, and found that the installed anti-virus software was scanning all the files on the disk, using a lot of CPU while doing so in addition to the high volume of disk activity. If they had been monitoring the file server with a host agent, it would have been evident very quickly that the increased disk and CPU activity caused by the anti-virus software correlated with the poor response times for the document requests. But nobody had thought that a simple file server was worth monitoring, believing that it wouldn’t impact the application.
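
To make that kind of correlation concrete, here is a minimal sketch in Python. It is not what the customer ran, and the numbers are made up purely for illustration: `cpu_percent` stands in for per-minute CPU readings a host agent might report from the file server, and `response_seconds` for the document retrieval times seen by the portal over the same minutes. The `pearson` helper is a hypothetical name for a plain Pearson correlation calculation.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length, with at least 2 samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean, and the squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Made-up per-minute samples, for illustration only (not data from the story).
cpu_percent = [12, 15, 11, 88, 93, 90, 14, 13]
response_seconds = [0.4, 0.5, 0.4, 6.2, 7.9, 7.1, 0.5, 0.4]

r = pearson(cpu_percent, response_seconds)
print(f"correlation between file server CPU and response time: {r:.2f}")
```

A value close to 1 means the two series rise and fall together. That doesn't prove the file server is the cause, but it tells you where to look first, and you can only run this comparison if the file server's metrics are being collected at all.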

The goal of observability

The number one goal of observability is to give you insight into what is happening within the code of your applications, and within the various services and systems that are involved in running them. If an error or a performance degradation occurs, you want to know that you have the data to answer the questions that will arise, so that you can improve the performance or resolve the problem as quickly as possible. If you are not monitoring everything, you are leaving a blind spot that could be the cause of your next problem.