Alerting is essential for responding quickly to potential service interruptions. However, paging an incident responder to take manual action is still time-consuming. That is why New Relic automates as much as possible, with a focus on self-healing systems.
Auto-Scaling Reduces Engineering Toil
New Relic has invested significantly in auto-scaling algorithms that can rapidly scale services up or down. These algorithms use metrics such as CPU and memory usage to decide when to scale. This has significantly reduced interruptions and team pager notifications. For example, it was not unusual for our Logging Team to be paged 2-4 times per week to help scale a service. After implementing auto-scaling, the team receives significantly fewer pages.
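To make the idea concrete, here is a minimal sketch of a metric-driven scaling decision. It is not New Relic's actual algorithm. It assumes a simple proportional rule, similar in spirit to the one used by the Kubernetes Horizontal Pod Autoscaler, and every name and threshold in it (`ScalingPolicy`, `desired_replicas`, the target values) is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All values are illustrative assumptions, not New Relic settings.
    target_cpu: float = 0.60      # desired average CPU utilization (0 to 1)
    target_memory: float = 0.70   # desired average memory utilization (0 to 1)
    min_replicas: int = 2
    max_replicas: int = 50
    tolerance: float = 0.10       # skip scaling when within 10% of target


def desired_replicas(current: int, cpu: float, memory: float,
                     policy: ScalingPolicy = ScalingPolicy()) -> int:
    """Return a replica count that brings the busiest resource back to target."""
    # A ratio above 1 means the resource is over target; below 1, under target.
    ratio = max(cpu / policy.target_cpu, memory / policy.target_memory)
    if abs(ratio - 1.0) <= policy.tolerance:
        return current  # close enough to target; avoid flapping
    wanted = math.ceil(current * ratio)
    # Clamp to the configured bounds so a bad metric cannot scale without limit.
    return max(policy.min_replicas, min(policy.max_replicas, wanted))


print(desired_replicas(current=10, cpu=0.85, memory=0.50))  # scale up
print(desired_replicas(current=10, cpu=0.25, memory=0.30))  # scale down
```

Taking the larger of the two ratios means the service scales for whichever resource is under the most pressure, and the tolerance band keeps small fluctuations from triggering constant resizing.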
Auto-Rollback For Reliability
While New Relic services go through a series of checks before being deployed to production, sometimes bugs do reach production. In these cases, New Relic employs automatic service rollback. When a change is deployed through New Relic’s continuous deployment pipeline, a workflow starts that monitors the health of the deployed entity. If the service becomes unhealthy, the workflow triggers the continuous deployment pipeline to roll back the unhealthy instances.
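The workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not New Relic's implementation: it polls a health check rather than listening for events, it rolls back after a hypothetical threshold of consecutive failures, and `watch_deployment`, `check_health`, and `roll_back` are made-up names standing in for calls to a real monitoring system and deployment pipeline.

```python
from typing import Callable


def watch_deployment(
    check_health: Callable[[], bool],
    roll_back: Callable[[], None],
    checks: int = 10,
    failures_to_roll_back: int = 3,   # assumed threshold, purely illustrative
    wait: Callable[[], None] = lambda: None,
) -> str:
    """Observe a fresh deployment and roll back after consecutive failed checks."""
    consecutive_failures = 0
    for _ in range(checks):
        if check_health():
            consecutive_failures = 0  # a healthy check resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_to_roll_back:
                roll_back()           # hand control back to the pipeline
                return "rolled_back"
        wait()                        # e.g. sleep between checks in real use
    return "healthy"


# Usage with canned health results in place of a real health signal.
samples = iter([True, False, False, False])
events = []
result = watch_deployment(lambda: next(samples), lambda: events.append("rollback"))
print(result, events)
```

Requiring several consecutive failures before rolling back is one way to keep a single transient blip from reverting a healthy release. A real workflow would pass something like `lambda: time.sleep(30)` as `wait`.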