Managing, provisioning, and deploying applications across hybrid infrastructure is typically the responsibility of operations teams, and developers rarely spend much time on the specific details of hybrid cloud architecture. However, as critical application dependencies (including supporting services and storage) move out of a single data center, service owners need to make sure that the health of their applications can be verified in a simple, straightforward way. The first step to simplifying? Make it easy to quickly find the service with an issue.
Is it the service’s problem or something else?
Every service owner should plan for services to fail. The idea behind simple HTTP-based health checks is to provide a way for individuals or tools (such as a monitoring solution like New Relic) to answer a simple question: Is the service behaving as expected? In complex infrastructures, however, answering that question is increasingly difficult. Services running in different data centers are affected by network issues, operated by different teams, and have many dependencies. Make sure that your health checks can distinguish between a service that is failing on its own and one that is failing because of an issue in a dependent system, like a database. As infrastructure changes and new features are added to services, remember to update health checks as well.
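One way to make that distinction visible is to report the cause of a failure alongside the status. The following is a minimal sketch, not a prescribed design: the function name evaluate_health, the "cause" and "failed_dependencies" fields, and the choice of 503 for failures are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple


def evaluate_health(
    self_check: Callable[[], bool],
    dependency_checks: Dict[str, Callable[[], bool]],
) -> Tuple[int, dict]:
    """Return (HTTP status, body), separating the service's own failures
    from failures in its dependencies. All field names are hypothetical."""

    def safe(check: Callable[[], bool]) -> bool:
        # A check that raises is treated the same as a check that fails.
        try:
            return bool(check())
        except Exception:
            return False

    failed = sorted(
        name for name, check in dependency_checks.items() if not safe(check)
    )

    if not safe(self_check):
        # The service itself is unhealthy, whatever its dependencies report.
        return 503, {"status": "fail", "cause": "service", "failed_dependencies": failed}
    if failed:
        # The service is fine, but something it depends on is not.
        return 503, {"status": "fail", "cause": "dependency", "failed_dependencies": failed}
    return 200, {"status": "ok", "cause": None, "failed_dependencies": []}
```

With a body like this, someone responding to an alert can tell at a glance whether to look at the service or at, say, its database.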
Standardize health checks across services
Writing a health check is more than returning status code 200. Health checks should follow the same pattern across different services: the same URI and the same response body format. The goal is to give all teams a standard, easy-to-understand endpoint that returns critical information about the health of the service. This consistency is especially important when troubleshooting error conditions and alerts. Consider including standard data in the health check response, such as the service name, the currently deployed version, and details of the infrastructure it’s running on (like node name). Lastly, follow HTTP status code standards. As seen in orchestration tools like Kubernetes and Marathon, any health check that returns a status code outside the 200-399 range is considered a failure.
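As a rough illustration of what a shared format could look like, the sketch below builds a response body with the fields mentioned above and applies the 200-399 rule. The field names and the helper names build_health_body and is_passing are assumptions, not an established standard.

```python
import json
import socket
from typing import Optional


def build_health_body(
    service: str, version: str, healthy: bool, node: Optional[str] = None
) -> str:
    """Serialize a health check body in one shared shape (hypothetical fields)."""
    return json.dumps(
        {
            "service": service,
            "version": version,
            # Fall back to the local host name when no node name is supplied.
            "node": node or socket.gethostname(),
            "status": "ok" if healthy else "fail",
        },
        sort_keys=True,
    )


def is_passing(status_code: int) -> bool:
    """Treat any status code from 200 through 399 as a passing check."""
    return 200 <= status_code < 400
```

If every service returns the same keys from the same URI, a dashboard or an on-call engineer can read any service's health check without first learning that service's conventions.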
Document all SLAs that impact a single service
As dependencies between services become more complex, so does the concept of a healthy service. That’s why understanding and documenting all service-level agreements (SLAs) that impact a service is critical. An SLA is more than an external legal agreement with customers; it’s also an internal agreement between different services that has implications for overall system availability. From an internal service perspective, a slight increase in response time might be perfectly acceptable under heavy load but have serious repercussions for dependent services. Carefully tune the timeout value of health check responses depending on the overall system’s requirements. This value can be discovered and refined in game day testing.
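To make the role of the timeout concrete, here is a minimal sketch of a probe that treats a slow response the same as a failed one. It uses only the Python standard library; the names probe and ProbeResult are hypothetical, and the timeout value you pass in is the one you would tune against your own SLAs.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    passing: bool
    status_code: Optional[int]  # None when no HTTP response arrived in time
    elapsed_s: float


def probe(url: str, timeout_s: float) -> ProbeResult:
    """Request a health check URL and fail the check on errors or timeouts."""
    start = time.monotonic()
    try:
        # The timeout applies to each blocking socket operation.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            code = resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; keep the code.
        code = err.code
    except OSError:
        # Covers URLError, refused connections, and socket timeouts.
        code = None
    elapsed = time.monotonic() - start
    passing = code is not None and 200 <= code < 400
    return ProbeResult(passing, code, elapsed)
```

Recording the elapsed time alongside the result makes it easier to compare observed response times against the timeout during game day testing.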
With complex infrastructure, a single service often must meet multiple SLAs. It’s critical that these are made explicit, and that everyone understands how a single service (and all of its dependencies) fits into the overall architecture. This information should be easily discoverable and shared between teams.
Remember that health checks are also for team health
In addition to application monitoring, well-designed and standardized service health checks help teams better understand the large systems they are responsible for building and maintaining. As services with many dependencies span different networks and infrastructure, health checks are increasingly required when running any production-level service.