
Don’t take this the wrong way, but moving to microservices will break your monitoring strategy and increase your mean time to resolution (MTTR). That’s because traditional monitoring alone can’t connect dependencies and performance across the entire customer journey through your distributed architecture.

Microservices are a good thing, right?        

Pretty much everyone agrees that a distributed systems approach is ideal for many modern applications. That’s why a large majority of organizations—84% according to a survey from API platform company Kong—have adopted a microservices architecture.  

The Kong survey reports that organizations are adopting microservices for security enhancements, quicker development, faster integration of new technologies, improved infrastructure flexibility, and stronger collaboration across teams.   


The value of microservices

In its "State of Microservices 2020" report, The Software House reports that organizations cite solving scalability and performance issues as the two most important reasons to choose a microservices architecture for applications. So really, what’s not to like?    

Here’s the trade-off 

The downside of all the microservices goodness is greater complexity for the software and operations teams that support them.  

Once the number of microservices grows beyond a handful, managing and maintaining them becomes increasingly difficult. The Kong survey reported that organizations run 184 microservices on average. Hence the hit on your MTTR.

Challenges with microservices monitoring

As monolithic applications become distributed across microservices and software teams face pressure to rapidly and frequently ship new features and experiences, several things make it extremely challenging to understand performance and pinpoint issues and bottlenecks:  

  • Microservices are constantly changing, which introduces more risk 
  • The lifetime of a container in which a microservice is running may be measured in minutes or less
  • The complexity of scale increases with each new microservice or change that’s introduced
  • Teams might be responsible for services they didn’t develop

For all these reasons, it’s critical to find a way to cut through the complexity and ease the effort for your software teams.  

How to create a microservice monitoring strategy

Monitoring monolithic applications has traditionally focused on collecting data about the health and performance of the application to know when something goes wrong so that operators and engineers can respond quickly. Over the years, monitoring has evolved to deliver detailed metrics and alerting on performance and user experience across the entire technology stack, including the cloud.   

Monitoring is great, but modern application architectures require a new approach called observability. Observability lets you understand why something is wrong, compared to monitoring, which simply tells you when something is wrong.   

Step 1: Understand the difference between monitoring and observability

Observability is critical because it gives you the ability to see a connected view of all your performance data in one place, in real time. That way, you can pinpoint issues faster, understand what caused the issue, and ultimately deliver excellent customer experiences. 

Observability does this by letting you know how your distributed applications work together and how dependencies and individual services impact the health and performance of the entire application.    

Step 2: Use telemetry data when monitoring

Observability delivers actionable insight by combining four essential types of observability data: metrics, events, logs, and traces. While all the telemetry data is important, the last data type—distributed traces—is essential for software teams using microservices.  

That’s because distributed tracing is the best way to quickly understand what happens to requests as they transit through the microservices that make up your distributed applications. Trace data helps you comprehend the flow of requests and pinpoint where failures or performance issues are occurring and why.  
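
To make this concrete, here’s a minimal sketch of manual trace instrumentation using the OpenTelemetry Python SDK, one common way to generate trace data. The service name, span names, and console exporter are illustrative assumptions rather than details from this article:

```python
# Minimal sketch: manual span creation with the OpenTelemetry Python SDK.
# The service name, span names, and attributes below are hypothetical.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Identify this microservice so its spans can be stitched into the full trace.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_checkout(order_id: str) -> None:
    # Each unit of work becomes a span; spans emitted by downstream services
    # share the same trace ID, so the whole request path can be reconstructed.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_payment"):
            pass  # call the payment service here
        with tracer.start_as_current_span("reserve_inventory"):
            pass  # call the inventory service here

if __name__ == "__main__":
    handle_checkout("order-1234")
```

In a real deployment you would replace the console exporter with an exporter that sends spans to your telemetry backend, so traces from every service end up in one place.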

Relying solely on monitoring individual services can’t give you that insight. It may even lead you to believe that everything is fine with performance despite customers reporting latency issues.    

Instead, distributed tracing gives everybody involved—DevOps, operations, software, and site reliability engineers—quick answers to specific questions in distributed software environments, including:  

  • What is the health of the services that make up a distributed system?  
  • What is the root cause of errors and defects within a distributed system? 
  • Where are performance bottlenecks that could impact the customer experience?   
  • Which services have problematic or inefficient code that should be prioritized for optimization?  

Step 3: Connect the dots to reduce MTTR

When you move to microservices, distributed tracing is the secret weapon that can help you keep your MTTR where it should be—low. Not only does tracing data help you identify and resolve issues faster, but it also lets you and your team measure overall system health, understand the effect of changes, and prioritize high-value areas for improvement. You’ll be able to ask the right questions and get the right answers no matter how many microservices you deploy.

Microservices monitoring best practices

Here are some best practices for an effective microservices monitoring strategy: 

  1. Define clear objectives:
  • Define key metrics: Determine which metrics are essential for your application (e.g., response time, error rate, throughput) and set specific goals for each.
  2. Use distributed tracing:
  • Instrumentation: Implement distributed tracing to track requests as they flow through your microservices. Tools like Jaeger, Zipkin, or OpenTelemetry can help.
  3. Collect comprehensive metrics:
  • System metrics: Monitor CPU usage, memory, disk I/O, and network activity for each service instance.
  • Application metrics: Track application-specific metrics, like the number of requests, response times, and error rates (see the metrics sketch after this list).
  • Database metrics: If your microservices interact with databases, monitor query performance, connection pools, and cache hits/misses.
  4. Set up centralized logging:
  • Structured logs: Use structured logging formats for better parsing and analysis (see the logging sketch after this list).
  • Log aggregation: Centralize logs using tools like New Relic log management for easy searching and analysis.
  5. Implement health checks:
  • Liveness probes: Use liveness probes to detect and restart unhealthy containers in containerized environments like Kubernetes.
  • Readiness probes: Implement readiness probes to indicate when a service is ready to accept traffic (see the health-check sketch after this list).
  6. Implement alerts and notifications:
  • Thresholds: Set up alerts based on predefined thresholds for metrics like response time or error rate.
  • Alert escalation: Implement alert escalation policies to ensure critical alerts are addressed promptly.
  7. Security monitoring:
  • Access logs: Monitor access logs to identify suspicious activities or potential security breaches.
  • Runtime security: Implement runtime security measures to detect and respond to security threats in real time.
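
For the application metrics in item 3, here’s a minimal sketch using the Prometheus Python client as a stand-in for whichever metrics library or agent you actually use; the metric names, labels, and port are illustrative assumptions:

```python
# Minimal sketch: request, error, and latency metrics with prometheus_client.
# Metric names, labels, and the listen port are hypothetical choices.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["endpoint"])
ERRORS = Counter("http_request_errors_total", "Total failed HTTP requests", ["endpoint"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["endpoint"])

def handle_request(endpoint: str) -> None:
    REQUESTS.labels(endpoint=endpoint).inc()
    # The histogram context manager records how long the block takes.
    with LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        if random.random() < 0.05:             # simulate an occasional failure
            ERRORS.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```

The same request, error, and latency signals map directly back to the key metrics you defined in item 1.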
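
For the structured logs in item 4, here’s a minimal sketch of JSON-formatted logging in plain Python; the service name and extra fields are hypothetical, and any aggregator that parses JSON can index the output:

```python
# Minimal sketch: one JSON object per log line for easy parsing downstream.
# The service name and extra fields (trace_id, order_id, duration_ms) are hypothetical.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout-service",
            "message": record.getMessage(),
        }
        # Copy over structured fields passed through the `extra` argument.
        for key in ("trace_id", "order_id", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order processed", extra={"trace_id": "abc123", "order_id": "order-1234", "duration_ms": 42})
```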
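
For the health checks in item 5, here’s a minimal sketch of liveness and readiness endpoints built with Flask; the /healthz and /readyz paths and the dependency check are illustrative conventions that a Kubernetes probe could point at, not requirements:

```python
# Minimal sketch: separate liveness and readiness endpoints for an orchestrator to probe.
# The paths and the dependency check are hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

def dependencies_ready() -> bool:
    # Placeholder: check database connections, caches, downstream services, etc.
    return True

@app.route("/healthz")
def liveness():
    # Liveness: the process is up; a failing probe tells the orchestrator to restart it.
    return jsonify(status="alive"), 200

@app.route("/readyz")
def readiness():
    # Readiness: only advertise the service as ready to take traffic
    # once its dependencies are reachable.
    if dependencies_ready():
        return jsonify(status="ready"), 200
    return jsonify(status="not ready"), 503

if __name__ == "__main__":
    app.run(port=8080)
```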

Want to see how New Relic helps with monitoring? See how we worked with ZenHub.