Applications and services will always be in one of two states: doing something, or waiting for something. This is similar to driving a car on a trip: you are either moving or stopped, whether waiting at traffic lights or stuck in a traffic jam.

Ideally we want our applications to be doing things quickly and efficiently, keeping waiting to a minimum (in much the same way we want to drive to our destination without any delays). Whether we are troubleshooting errors, resolving performance issues, or undertaking performance optimization, we need to determine whether the application was doing something or waiting for something. If our application was waiting, we need to determine what it was waiting for; and if it was waiting for another application or service, we then need to investigate what that service was doing or waiting for.

What are we ‘Doing’?

When an application is doing something, it is using CPU. It is executing the code and processing the data held in memory: sorting data, working through cryptographic algorithms, parsing data to prepare the response that has been requested, and so on.

This work is held back only by how efficiently the code uses the CPU and by the speed of the CPU itself: the fewer instructions required to complete the process, the sooner it completes, and the faster the CPU can execute each instruction, the sooner the process finishes. On our driving trip, this would be finding a shorter alternative route, or getting a faster car.

Why are we ‘Waiting’?

Our applications are not always able to continue processing, and instead of using CPU they need to pause and wait. There are two different types of waits that our applications experience:

  • Direct waits: These are requests that our code makes of other systems in order to perform the next steps, e.g. reading a file from storage, making an HTTP request to another service for some data, or waiting for user input. The CPU is available for our application to use, but it needs data from elsewhere in order to continue. This would be an open road with no traffic, but we need to stop and fill the car with fuel before we can continue.
  • Indirect waits: This is where our application is ready to use the CPU, but the CPU is not available because something else is already using it. That could be another process running on the same host, leaving our application waiting for the CPU to become free. It could also be something in the application's environment, such as garbage collection within the runtime, which our code must wait for before it can continue. This is the dreaded traffic jam that prevents us from driving at the normal speed for the road.
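The difference between doing and waiting shows up in the gap between wall-clock time and CPU time. A minimal Java sketch, where the method names are illustrative and `Thread.sleep` stands in for a blocking call such as a disk read or an HTTP request:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DoingVsWaiting {

    // "Doing": pure CPU work, here a simple summation loop.
    static long busySum(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) sum += i;
        return sum;
    }

    // A direct wait: the thread blocks and uses (almost) no CPU.
    // Thread.sleep stands in for reading a file or calling another service.
    static void directWait(long millis) throws InterruptedException {
        Thread.sleep(millis);
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = threads.getCurrentThreadCpuTime();

        long sum = busySum(50_000_000L); // doing
        directWait(200);                 // waiting

        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;

        // Wall time minus CPU time is (roughly) the time spent waiting.
        System.out.println("sum=" + sum + " wall=" + wallMs
                + "ms cpu=" + cpuMs + "ms waiting=" + (wallMs - cpuMs) + "ms");
    }
}
```

From the thread's own point of view an indirect wait looks the same: wall time grows while CPU time does not, even though the thread was ready to run the whole time.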

A representation of the sequence of events within a simple transaction showing time spent Doing and Waiting

In the APM Summary view in New Relic, the instrumentation breaks the direct waits out into segments, such as MySQL, Web External, etc. If time is attributed to the language runtime itself, it could be either the application doing something, or indirect waits such as other processes blocking access to the CPU or garbage collection occurring within the runtime.

The image above shows direct wait time within MySQL being responsible for the majority of the response time of this application, and a very small amount of time within the Java runtime.

You can use the language runtime view to see if CPU usage and garbage collection correlate with the increase in response time.

The JVM view allows us to see the amount of CPU being used by our application, and the amount of that CPU time that is spent in garbage collection instead of executing our code.
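The GC time that the JVM view surfaces is also available programmatically through the JVM's own management beans. A small sketch of reading it, where the class and method names are illustrative:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTime {

    // Total milliseconds the JVM's collectors report having spent in GC
    // since the JVM started; this is time our code could not run.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if this collector doesn't report it
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Time spent in GC so far: " + totalGcMillis() + " ms");
    }
}
```

Sampling this value at intervals and taking the difference gives the GC time within each window, which can then be compared against response times over the same period.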

In my previous blog post (Everything, Everywhere, All At Once) I gave an example of an issue a company had where users were experiencing increased response times when trying to retrieve their documents from a web portal:

  • The code in the web browser processed the user input and then issued a request to the backend application and waited on it (Direct wait).
  • The backend application processed the request from the browser and then made a request to the database which responded very quickly with a file location (Direct wait).
  • The file location was processed by the backend application which then requested the file from the file server, for which it then had to wait a lot longer than normal (Direct wait).
  • The file server was having to wait for another process (antivirus) that was using CPU and disk I/O before it could return the requested file (Indirect wait).

Reducing waits

Observability allows you to determine what your applications are waiting on, and if you have enough visibility (you are monitoring everything, right?) it will allow you to determine where the time is being spent. From this you can determine the changes needed to reduce the time spent doing and waiting. In the example above, the performance issue was caused by indirect waits, and the resolution was to modify the configuration of the antivirus software to prevent it from blocking the file server.

A real world example

In the days when most applications were monolithic, they typically ran queries against relational databases, and much of the waiting was seen to be in the database. When looking into why the queries were taking so long, you would find that they were waiting for blocks to be read from disk into the buffer cache memory of the database server. If a query required more blocks than the cache could hold, reading one block would evict another, which would then have to be read back in from disk if it was needed again later.
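The eviction pressure described above can be sketched with a tiny least-recently-used cache simulation. The class names, block size, and counts are all illustrative, not any real database's implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BufferCache {
    private final int capacity;
    private int diskReads = 0;
    private final LinkedHashMap<Integer, byte[]> cache;

    BufferCache(int capacity) {
        this.capacity = capacity;
        // An access-ordered LinkedHashMap that evicts the least-recently-used
        // block once the cache is full, like a simple database buffer cache.
        this.cache = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > BufferCache.this.capacity;
            }
        };
    }

    // Read a block: served from memory if cached, otherwise a (simulated)
    // disk read that the query would have to wait for.
    byte[] readBlock(int blockId) {
        byte[] block = cache.get(blockId);
        if (block == null) {
            diskReads++;
            block = new byte[8192]; // simulated 8 KB block fetched from disk
            cache.put(blockId, block);
        }
        return block;
    }

    int diskReads() { return diskReads; }

    public static void main(String[] args) {
        // A query that scans 20 distinct blocks against a 10-block cache:
        // on the second pass, every block it needs has already been evicted.
        BufferCache cache = new BufferCache(10);
        for (int pass = 0; pass < 2; pass++)
            for (int block = 0; block < 20; block++)
                cache.readBlock(block);
        System.out.println("disk reads: " + cache.diskReads());
    }
}
```

With a cache larger than the working set, the second pass would be served entirely from memory; with the smaller cache, every read misses and goes back to disk, which is exactly the repeated waiting the monitoring tools observed.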

In these circumstances, some database monitoring tools would recommend adding more memory to the server and increasing the size of the buffer cache. This would allow more blocks to be held in memory and reduce the chance of needing to read them from disk again. Making this change would work, and the queries would complete in less time.

With the removal of the frequent waits to read blocks from disk, the queries no longer needed to pause. This would lead to the queries, and the database server as a whole, using more CPU as they processed the blocks in memory. The tools that recommended the extra memory would now recommend adding more CPUs to the server. In addition to the hardware cost of the extra CPUs (on top of the earlier memory), this could also increase software licensing costs.

When I investigated these queries I often found that the SQL was inefficient. Viewing the explain plans, and the data the query actually returned as its final result set, it was clear that blocks for tables and indexes that were not needed were being read from disk, or processed in the buffer cache, unnecessarily.

I would optimize the query to read fewer blocks but still return the same result set. Where the buffer cache had not been increased in size there would be fewer blocks needing to be read from disk and stored in the cache, reducing the wait for those reads. Where the blocks were already in memory, this would then also lead to the query not using as much CPU as there would be fewer blocks to be processed. These changes reduced both the doing and the waiting, improving the performance of the application, and reducing costs for the business.
