Implement auto-remediation using New Relic and Amazon EventBridge

Auto-remediation is the holy grail of observability because it removes the need for human intervention and minimizes mean time to resolve (MTTR). This is a practical guide to implementing auto-remediation using New Relic and Amazon EventBridge.

In this example, an Amazon Elastic Compute Cloud (EC2) instance experiences high memory utilization, and we demonstrate how to remediate the issue automatically by restarting the instance. The EC2 instance is monitored by the New Relic infrastructure agent, which collects metrics such as memory utilization and sends them to New Relic. When memory utilization exceeds a predefined threshold, an alert condition is triggered in New Relic. The triggered alert sends a notification to EventBridge, which in turn executes a rule to restart the EC2 instance.

Installing the infrastructure agent and setting up New Relic alerts are covered in New Relic’s official documentation, so this guide focuses on the configuration needed for the New Relic to EventBridge integration.

Add EC2 instance ID to New Relic alert notification

In order for EventBridge to know which EC2 instance to restart, we must specify an instance ID. This ID could be hardcoded in the EventBridge rule if you always want to restart the same instance. However, a more flexible approach is to supply the instance ID dynamically from the New Relic alert notification. The instance ID can be added to the alert notification with the steps below.

In your alert condition query, facet by entityKey instead of entityName. Because entityKey contains the EC2 instance ID, this ensures the ID is included in the issue payload when an alert is triggered. Example New Relic Query Language (NRQL):

SELECT average(memoryUsedPercent) FROM SystemSample FACET entityKey

Place the alert condition in an existing policy or create a new one.

Create a workflow, a channel, and a destination to get the outbound notification to EventBridge:

1. Give the workflow a name.

2. Use the issue filter to associate the workflow with the alert policy above.

3. Add a channel of type AWS EventBridge. The default name of the channel is New Channel, but you can rename it as desired.

4. A channel needs to be associated with a destination. If you’ve created an EventBridge destination before, you can reuse it. If not, create a new destination within the channel editor screen. The destination name is up to you, but you’ll need to provide a valid AWS region and account ID.

5. Specify an event source, which is the name of the event source as it will appear in your AWS account. This name is also up to you (for example, MyEventSource). Type the desired name into the Event source search box and an option will appear to Create a new event source….

dialogue box prompting creating a new event source

6. Follow the on-screen instructions to associate the new event source to an event bus in AWS.

7. Specify an event template, which tells New Relic what to include in the notification to EventBridge. A default template is auto-generated for you, but you have to modify it to include the entityKey tag, the attribute that your alert condition’s NRQL facets by. Add the following JSON attribute:

"instanceId": {{ json accumulations.tag.entityKey }}

See line 17 in the image below. This adds a new variable “instanceId” and populates it with entityKey from the issue payload.

instanceID with populated entityKey added to line 17 of this payload.
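To make the effect of that template line concrete, here is a minimal Python sketch of the substitution. It assumes the json helper simply serializes the value as JSON and that entityKey arrives as a list with one entry per faceted entity. The function name render_instance_id_field and the instance ID are hypothetical, and this is not New Relic's actual template engine.

```python
import json


def render_instance_id_field(entity_keys):
    """Mimic what the `json` template helper is assumed to do: emit the value as JSON."""
    return '"instanceId": ' + json.dumps(entity_keys)


# Hypothetical fragment of an issue payload with one faceted entityKey value.
accumulations = {"tag": {"entityKey": ["i-0123456789abcdef0"]}}

line = render_instance_id_field(accumulations["tag"]["entityKey"])
print(line)

# Wrapped in braces, the rendered line parses as JSON and instanceId is a list.
detail = json.loads("{" + line + "}")
print(type(detail["instanceId"]).__name__)
```

Because the rendered value is a JSON array rather than a quoted string, it can later be passed to the runbook without extra quoting.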

Create SSM Automation Role in AWS IAM

Go to AWS IAM and create a new role:

Set the Trusted Entity Type to AWS service.
Set the Service or use case to Systems Manager.
Set the Use case to Systems Manager.

On the next page, select the AmazonSSMAutomationRole policy. The role name and description are up to you. Create the role.

Edit the role to create an inline policy for rebooting EC2 instances.

In the policy editor, switch to JSON mode then enter the following JSON:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ec2:RebootInstances",
            "Resource": "*"
        }
    ]
}

The inline policy name is up to you.

The Amazon resource name (ARN) of this SSM automation role will be needed in the next section.
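If you prefer to script these console steps, the following Python sketch shows one way to do it. It assumes you pass in a boto3 IAM client (boto3.client("iam")). The function name create_automation_role, the trust policy for ssm.amazonaws.com, and the managed policy ARN are assumptions to verify in your own account; only the inline reboot policy is taken from this guide.

```python
import json

# Assumed trust policy: lets Systems Manager assume the role.
SSM_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ssm.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Inline policy from this guide: allows rebooting EC2 instances.
REBOOT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": "ec2:RebootInstances",
        "Resource": "*",
    }],
}

# Assumed ARN of the AWS managed policy selected in the console.
MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole"


def create_automation_role(iam, role_name, inline_policy_name):
    """Create the SSM automation role with the given IAM client and return its ARN."""
    response = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(SSM_TRUST_POLICY),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=MANAGED_POLICY_ARN)
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=inline_policy_name,
        PolicyDocument=json.dumps(REBOOT_POLICY),
    )
    return response["Role"]["Arn"]
```

Calling create_automation_role(boto3.client("iam"), "my-ssm-automation-role", "reboot-ec2") would return the ARN needed in the next section. Both names are placeholders.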

Create a rule in EventBridge

Create a rule in EventBridge that will intercept the notification from New Relic, parse it, and restart the correct EC2 instance based on it.

In the EventBridge Rules UI, select the event bus corresponding to the event source you created earlier (for example, aws.partner/newrelic.com/3214389/MyEventSource), then create a new rule.

The rule name and description are up to you.

Set rule type to Rule with an event pattern so that the rule is fired when a specific alert notification from New Relic is received.

Set event source to Other. You don’t need to specify a sample event.

Set the creation method to Custom pattern (JSON editor). Then specify the following JSON in the event pattern text area, replacing “MyEventSource” with the actual name of your event source.

{
  "source": [{
    "suffix": "MyEventSource"
  }]
}
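To see why a suffix match is enough, the sketch below imitates the matching logic in plain Python. It is a simplified stand-in for EventBridge's own matcher, covering only exact values and the suffix operator on the source field. The function name matches_suffix_pattern is hypothetical, and the sample event assumes the source field carries the full partner event source name.

```python
def matches_suffix_pattern(event, pattern):
    """Return True if event["source"] satisfies any matcher listed under pattern["source"]."""
    source = event.get("source", "")
    for matcher in pattern.get("source", []):
        if isinstance(matcher, dict) and "suffix" in matcher:
            # Suffix matcher: the source must end with the given string.
            if source.endswith(matcher["suffix"]):
                return True
        elif matcher == source:
            # Plain string matcher: the source must be an exact match.
            return True
    return False


pattern = {"source": [{"suffix": "MyEventSource"}]}
new_relic_event = {"source": "aws.partner/newrelic.com/3214389/MyEventSource", "detail": {}}
unrelated_event = {"source": "aws.ec2", "detail": {}}

print(matches_suffix_pattern(new_relic_event, pattern))
print(matches_suffix_pattern(unrelated_event, pattern))
```

Matching on the suffix means the pattern does not need to contain the account-specific prefix of the partner event source.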

Specify a rule target. Remember the goal here is to restart an EC2 instance:

Set Target types to AWS service.
Select a target. Choose Systems Manager Automation. The drop-down also lists an option called “EC2 RebootInstances API call”. That sounds like what you need, but it doesn’t let you extract the instance ID dynamically from an inbound event. It’s fine if you want to hardcode the instance ID, but to populate the instance ID dynamically from an alert notification, select Systems Manager Automation instead.
Set Document to AWS-RestartEC2Instance. Many other runbooks can be selected here, so alternative auto-remediation steps are available depending on your requirements. Whichever runbook you choose, refer to its documentation to determine what input it needs. The documentation for AWS-RestartEC2Instance tells you to specify AutomationAssumeRole and InstanceId, which you’ll do in the next step. If you choose a different runbook, the input requirements may differ and the next step may need to be adjusted accordingly.

Under Configure automation parameter(s), select Input Transformer so it will parse out the required instance ID from the alert notification:

Set the Input Path to:

{"instanceId": "$.detail.instanceId"}

This works because the detail attribute of the event contains the JSON object specified in the channel event template in New Relic. Recall that the event template includes "instanceId": {{ json accumulations.tag.entityKey }}, so detail.instanceId contains an array of instance IDs, which is what the Input Path extracts.

Set the Template to:

{"InstanceId":<instanceId>,"AutomationAssumeRole":["arn:aws:iam::123456789123:role/my-ssm-automation-role"]}

Replace the placeholder AutomationAssumeRole ARN above with the ARN of the SSM automation role you created earlier.
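The Input Path and Template work together as a two-stage substitution, which the following Python sketch imitates. It is a simplified model of the input transformer rather than EventBridge's implementation: it handles only dotted paths such as $.detail.instanceId and unquoted placeholders. The function name apply_input_transformer and the sample event are hypothetical.

```python
import json


def apply_input_transformer(event, input_paths, template):
    """Extract values from the event by path, then substitute them into the template."""
    rendered = template
    for name, path in input_paths.items():
        # Walk a simple "$.a.b" style path through the event.
        keys = path[2:].split(".") if path.startswith("$.") else path.split(".")
        node = event
        for key in keys:
            node = node[key]
        # An unquoted placeholder receives the value serialized as JSON.
        rendered = rendered.replace("<" + name + ">", json.dumps(node))
    return json.loads(rendered)


# Hypothetical inbound event; detail mirrors the New Relic event template.
event = {
    "source": "aws.partner/newrelic.com/3214389/MyEventSource",
    "detail": {"instanceId": ["i-0123456789abcdef0"]},
}
input_paths = {"instanceId": "$.detail.instanceId"}
template = (
    '{"InstanceId":<instanceId>,'
    '"AutomationAssumeRole":["arn:aws:iam::123456789123:role/my-ssm-automation-role"]}'
)

parameters = apply_input_transformer(event, input_paths, template)
print(json.dumps(parameters, indent=2))
```

Because detail.instanceId is already an array, the placeholder is left unquoted in the template so that InstanceId receives a list, matching the list form used for AutomationAssumeRole.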

Set the execution role. Creating a new role is the simpler option since AWS will take into account your above settings. However, if you prefer to reuse an existing role instead, make sure it’s updated with your event source and SSM automation role.

The rest of the settings are optional. You may want to specify a dead-letter queue so that any unprocessed events are sent there. You can also add another target for the rule. A CloudWatch log group makes a great second target because the events get logged there, which is useful for troubleshooting: you can see exactly what the event JSON looks like.

Testing and validation

To test the setup, you need to trigger an alert notification in New Relic. For testing purposes, you may want to lower the alert condition threshold so that even low memory utilization triggers it. For convenience, there’s also a Send test notification button on the page where you configured the channel event template. To get there, edit the workflow you created, click the … next to the AWS EventBridge channel, and then click Edit.

Then click the Send test notification button.

In AWS, validate that your EC2 instance has been stopped and then started again. The EC2 instance state will transition from “running” to “stopping” to “stopped” to “pending” to “running”.

You can also go to the Automation page in AWS Systems Manager to check whether the AWS-RestartEC2Instance runbook executed successfully. It shows two steps: stopInstances and startInstances. Note that the status of the startInstances step remains “In Progress” until the EC2 instance has completed initialization, that is, until its Status Check field has changed from “Initializing” to “2/2 checks passed”.

Execution details for triggered event in AWS
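The same check can be scripted. The sketch below assumes you pass in a boto3 Systems Manager client (boto3.client("ssm")) and uses its describe_automation_executions call to find the most recent run of the runbook. The function name latest_restart_status is hypothetical, and the returned status strings are whatever Systems Manager reports.

```python
def latest_restart_status(ssm, document_name="AWS-RestartEC2Instance"):
    """Return (execution_id, status) for the most recent run of the runbook, or None."""
    response = ssm.describe_automation_executions(
        Filters=[{"Key": "DocumentNamePrefix", "Values": [document_name]}],
        MaxResults=10,
    )
    executions = response.get("AutomationExecutionMetadataList", [])
    if not executions:
        # The rule has not triggered the runbook yet.
        return None
    # Pick the execution that started last.
    latest = max(executions, key=lambda e: e["ExecutionStartTime"])
    return latest["AutomationExecutionId"], latest["AutomationExecutionStatus"]
```

A result of None after sending a test notification suggests the event never reached the rule target, in which case the event pattern and the rule's execution role are good places to start looking.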

Next steps

Get started today with New Relic EventBridge Integration
If you have a free New Relic account, you already have access to New Relic EventBridge Integration. To learn more, contact your New Relic account representative and get started.

Don’t have a New Relic account yet? Sign up for free today. Your free account includes 100 GB/month of data ingest, one full user, and access to New Relic EventBridge Integration.

By Patrick Rodjito, Senior Solutions Architect

The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.
