Achieving operational excellence is about more than just having alerts; it's about trusting them. A common problem as organisations scale their observability environments is keeping alert conditions reliable even when data stops reporting.
Many teams find themselves managing hundreds of NRQL alert conditions that lack two essential reliability settings: Signal Loss and Gap Filling. The absence of these settings creates a dangerous blind spot known as a "false silence".
While the User Interface is excellent for granular tuning and investigation, making changes across hundreds of conditions demands a more streamlined, programmatic approach. To ensure consistency at scale, we can leverage automation to identify, collect, and modify existing alerts en masse. The solution involves using NerdGraph, the New Relic GraphQL API.
The "Why": Understanding Signal Loss and Gap Filling
Before diving into the automation, it is crucial to understand why these two settings transform a "noisy" or "blind" alert into a reliable signal.
1. The Danger of "False Silences" (Signal Loss)
Standard NRQL alerts evaluate the data coming in. If a service crashes completely, the host dies, or the network is severed, the data stream stops. Without Signal Loss configured, the alert condition perceives the silence as normal because no thresholds are actively being breached.
This creates a "False Silence", a dangerous state where your dashboard stays green while your customers are experiencing a total blackout. This poses a massive business risk: the mean time to detect (MTTD) increases indefinitely because the system is unaware that an issue even exists. By configuring Signal Loss, you are notified immediately when telemetry data stops, ensuring that a dead service is treated with the same urgency as a failing one.
- Without Signal Loss: The service dies -> Telemetry data stops -> Alert stays Green (False Silence).
- With Signal Loss: The service dies -> Telemetry data stops -> New Relic triggers a Signal Loss violation -> You are notified immediately.
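In NerdGraph terms, this safeguard lives in the expiration block of a static NRQL condition's input. A minimal fragment (the 90-second value here is illustrative, matching the settings used later in this post) looks like this:

```graphql
# Fragment of an NRQL condition input: open an incident when the
# signal has been silent for longer than expirationDuration seconds.
expiration: {
  expirationDuration: 90
  openViolationOnExpiration: true
}
```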
2. The Instability of Sparse Data (Gap Filling)
While Signal Loss addresses total outages, Gap Filling addresses the "noise" and instability found in modern, distributed environments. Data streams aren’t always perfect, continuous lines. Sometimes data arrives with slight delays or gaps.
Without a gap-filling strategy, these minor reporting hiccups can cause alerts to rapidly trigger and clear because the engine is evaluating null values instead of a continuous trend. This leads to alert fatigue, where engineers begin to ignore notifications because the system feels twitchy or unreliable. Gap Filling tells the alert engine to substitute missing data points with a static value (such as 0) or the last known value, keeping the evaluation logic stable. This prevents the business from wasting expensive engineering hours chasing "ghost" incidents, allowing teams to focus on genuine performance issues.
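In a NerdGraph condition definition this corresponds to the signal block. The static value of 0 below is illustrative; LAST_VALUE is the alternative fill option when carrying the previous reading forward makes more sense for your metric:

```graphql
# Fragment of an NRQL condition input: replace missing data points
# with a static 0 so the evaluation window stays continuous.
signal: {
  fillOption: STATIC
  fillValue: 0
}
```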
The Solution: A 2-Step Automation Workflow
If you are not currently using these capabilities in your existing alerts, the following scripts can quickly remedy that. To adopt them at scale, we can deploy a two-phase workflow using Bash scripts that interact with NerdGraph. This process lets you edit hundreds of alerts without having to go through each one manually.
Phase 1: Collecting Condition IDs
The first step is collection. We need to find the alerts that require updates without manually clicking through the UI. We utilise the nrqlConditionsSearch query within NerdGraph.
The script searches by a name pattern (using the nameLike filter), pages through every match, and exports the results to a simple CSV file containing the ID and name of each matching condition.
#!/bin/bash
#
# SCRIPT 1: COLLECT CONDITIONS
# Finds ALL NRQL alert conditions using a LIKE pattern match on the name.
#
# Remember to include SQL-style wildcards (%) in your search pattern.
#
# Prerequisites: curl, jq
#
# --- USAGE EXAMPLE ---
# ./collect_conditions.sh --condition-name "%"
# ./collect_conditions.sh --condition-name "%prod%"
#
# --- Configuration ---
API_KEY="YOUR_NEW_RELIC_API_KEY" # IMPORTANT: This is a User Key, not a License key.
ACCOUNT_ID="YOUR_ACCOUNT_ID"
OUTPUT_FILE="conditions.csv"
# --- Pre-run Checks ---
if ! command -v jq &> /dev/null; then echo "Error: 'jq' not found. Please install it." && exit 1; fi
if [ "$API_KEY" == "YOUR_NEW_RELIC_API_KEY" ]; then echo "Error: Please edit this script and set your API_KEY." && exit 1; fi
if [ $# -ne 2 ] || [ "$1" != "--condition-name" ]; then
echo "Usage: $0 --condition-name <name_pattern>" && exit 1
fi
# --- Get Search Pattern from Arguments ---
SEARCH_PATTERN="$2"
# --- GraphQL Query Definition ---
read -r -d '' GQL_QUERY <<EOF
query(\$accountId: Int!, \$conditionNameLike: String!, \$cursor: String) {
actor {
account(id: \$accountId) {
alerts {
nrqlConditionsSearch(searchCriteria: {nameLike: \$conditionNameLike}, cursor: \$cursor) {
nextCursor
nrqlConditions {
id
name
}
}
}
}
}
}
EOF
# --- Initialize for Loop ---
CURSOR="" # Start with an empty cursor for the first page
echo "Searching for conditions where name is LIKE: \"${SEARCH_PATTERN}\"..."
echo "id,name" > "$OUTPUT_FILE" # Create file with header
# --- Pagination Loop ---
# This loop will continue as long as the API provides a 'nextCursor'.
while true; do
if [ -z "$CURSOR" ]; then
echo "Fetching first page..."
else
echo "Fetching next page..."
fi
# Conditionally set cursor to null in JSON if the shell variable is empty
JSON_PAYLOAD=$(jq -n \
--arg q "$GQL_QUERY" \
--arg id "$ACCOUNT_ID" \
--arg val "$SEARCH_PATTERN" \
--arg cursor "$CURSOR" \
'{query: $q, variables: {accountId: $id | tonumber, conditionNameLike: $val, cursor: ($cursor | if . == "" then null else . end)}}')
# Execute API Call
RESPONSE=$(curl -s -X POST https://api.newrelic.com/graphql \
-H "Content-Type: application/json" \
-H "API-Key: ${API_KEY}" \
-d "${JSON_PAYLOAD}")
# Check for API errors
if echo "$RESPONSE" | jq -e '.errors' > /dev/null; then
echo "Error: The API returned an error:"
echo "$RESPONSE" | jq .
exit 1
fi
# Append the results from the current page to the CSV file
echo "$RESPONSE" | jq -r '.data.actor.account.alerts.nrqlConditionsSearch.nrqlConditions[] | [.id, .name] | @csv' >> "$OUTPUT_FILE"
# Get the cursor for the NEXT page
CURSOR=$(echo "$RESPONSE" | jq -r '.data.actor.account.alerts.nrqlConditionsSearch.nextCursor')
# If the cursor is null or empty, there are no more pages, so we exit the loop
if [ -z "$CURSOR" ] || [ "$CURSOR" == "null" ]; then
break
fi
done
# --- Final Summary ---
COUNT=$(($(wc -l < "$OUTPUT_FILE") - 1))
echo "✅ Done. Found a total of ${COUNT} condition(s). See '${OUTPUT_FILE}' for the complete list."
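Before moving on to the update phase, it is worth reviewing the collected CSV and, if needed, narrowing it down. A small sketch of that workflow, using hypothetical sample data in place of a real collect_conditions.sh run, keeping only conditions whose name contains "prod":

```shell
# Hypothetical sketch: inspect and narrow the collected CSV before Phase 2.
# The sample data below stands in for real collect_conditions.sh output.
cat > conditions.csv <<'CSV'
id,name
101,"prod-api-latency"
102,"staging-db-errors"
103,"prod-checkout-throughput"
CSV

head -n 1 conditions.csv > conditions_prod.csv          # keep the CSV header
tail -n +2 conditions.csv | grep 'prod' >> conditions_prod.csv

cat conditions_prod.csv
```

You can then pass the filtered file (here conditions_prod.csv) to the update script instead of the full export.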
Phase 2: Updating All Collected Conditions
Once we have our list of IDs, the second script iterates through the CSV to apply the updated configurations. We use the alertsNrqlConditionStaticUpdate mutation to enforce the new standard.
We apply the following logic within the condition block:
- Signal Loss: We set expirationDuration to 90 seconds and set openViolationOnExpiration to true. This guarantees that if the data stream is silent for 90 seconds, an incident opens immediately.
- Gap Filling: We set fillOption to STATIC with a fillValue of 0 (or a value appropriate for your metric). This bridges data gaps to keep the evaluation logic intact.
#!/bin/bash
#
# SCRIPT 2: UPDATE CONDITIONS
# Reads a CSV, applies updates, and provides an accurate final summary.
#
# Prerequisites: curl, jq
#
# USAGE:
# ./update_conditions.sh conditions.csv
#
# --- Configuration ---
API_KEY="YOUR_NEW_RELIC_API_KEY" # IMPORTANT: This is a User KEY, not a License key.
ACCOUNT_ID="YOUR_ACCOUNT_ID"
# --- Parameters ---
EXPIRATION_DURATION=90 # Duration (in seconds) to wait after data stops arriving before declaring a signal loss.
FILL_VALUE=0 # The static value to use when filling gaps in data streams.
# --- Pre-run Checks ---
if ! command -v jq &> /dev/null; then echo "Error: 'jq' not found. Please install it." && exit 1; fi
if [ "$API_KEY" == "YOUR_NEW_RELIC_API_KEY" ]; then echo "Error: Please edit this script and set your API_KEY." && exit 1; fi
INPUT_FILE="$1"
if [ -z "$INPUT_FILE" ]; then echo "Usage: $0 <path_to_csv_file>" && exit 1; fi
if [ ! -f "$INPUT_FILE" ]; then echo "Error: File not found at '$INPUT_FILE'" && exit 1; fi
# --- GraphQL Mutation Definition ---
read -r -d '' GQL_MUTATION <<EOF
mutation(\$accountId: Int!, \$conditionId: ID!) {
alertsNrqlConditionStaticUpdate(
accountId: \$accountId,
id: \$conditionId,
condition: {
expiration: {
expirationDuration: ${EXPIRATION_DURATION},
openViolationOnExpiration: true
},
signal: {
fillOption: STATIC,
fillValue: ${FILL_VALUE}
}
}
) {
id
name
}
}
EOF
# --- Initialize Counters and Log File ---
SUCCESS_COUNT=0
FAILURE_COUNT=0
FAILED_LOG_FILE="failed_updates.log"
TOTAL_LINES=$(($(wc -l < "$INPUT_FILE") - 1))
CURRENT_LINE=0
# Clear log file from any previous runs
> "$FAILED_LOG_FILE"
echo "Starting bulk update... Any failures will be logged to ${FAILED_LOG_FILE}"
echo "Applying updates: Expiration=${EXPIRATION_DURATION}s, Fill Value=${FILL_VALUE}"
echo ""
# --- Main Loop ---
while IFS=, read -r id name; do
((CURRENT_LINE++))
# Strip quotes from both id and name read from the CSV
id=$(echo "$id" | tr -d '"')
clean_name=$(echo "$name" | tr -d '"')
echo "Updating (${CURRENT_LINE}/${TOTAL_LINES}): \"${clean_name}\" (ID: ${id})"
# Construct the JSON payload with variables for this specific condition ID.
JSON_PAYLOAD=$(jq -n \
--arg q "$GQL_MUTATION" \
--arg accountId "$ACCOUNT_ID" \
--arg conditionId "$id" \
'{query: $q, variables: {accountId: $accountId | tonumber, conditionId: $conditionId | tonumber}}')
RESPONSE=$(curl -s -X POST https://api.newrelic.com/graphql \
-H "Content-Type: application/json" \
-H "API-Key: ${API_KEY}" \
-d "${JSON_PAYLOAD}")
# Check for errors and increment counters
if echo "$RESPONSE" | jq -e '.errors' > /dev/null; then
echo " -> ❌ Error updating condition ${id}. See log for details."
echo "Failed to update ID: ${id}, Name: \"${clean_name}\"" >> "$FAILED_LOG_FILE"
echo "API Response: $(echo "$RESPONSE" | jq .errors)" >> "$FAILED_LOG_FILE"
echo "--------------------------------------------------" >> "$FAILED_LOG_FILE"
((FAILURE_COUNT++))
else
echo " -> ✅ Successfully updated."
((SUCCESS_COUNT++))
fi
done < <(tail -n +2 "$INPUT_FILE")
# --- Final Summary Block ---
echo ""
echo "--------------------"
echo "Bulk Update Summary"
echo "--------------------"
echo "Total conditions to process: ${TOTAL_LINES}"
echo "✅ Successful updates: ${SUCCESS_COUNT}"
echo "❌ Failed updates: ${FAILURE_COUNT}"
echo ""
if [ "$FAILURE_COUNT" -gt 0 ]; then
echo "Details for all failed updates have been saved to: ${FAILED_LOG_FILE}"
fi
echo "Process complete."
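As a spot check after the run, you can re-query one of the updated conditions and confirm the new settings took effect. A sketch of such a NerdGraph query (the account and condition IDs are placeholders):

```graphql
{
  actor {
    account(id: 1234567) {
      alerts {
        nrqlCondition(id: "890") {
          name
          ... on AlertsNrqlStaticCondition {
            expiration { expirationDuration openViolationOnExpiration }
            signal { fillOption fillValue }
          }
        }
      }
    }
  }
}
```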
Conclusion: From Manual Effort to Scalable Reliability
This approach solves the urgent problem of standardising configurations across hundreds of alerts, but it also demonstrates the power of NerdGraph to manage configuration programmatically. By implementing this automation, you can transform a manual, time-consuming effort into a simple, repeatable process.
This provides clear business benefits: engineering teams gain efficiency by eliminating alert fatigue and focusing on feature development, while executive teams see a reduction in incident MTTD. This secures customer experience and reduces financial risk, resulting in a robust alerting system without “false silences”.
Next steps
You can adapt these scripts to your own environment. Modify the nameLike queries to fit your naming conventions, and adjust the expiration timers and fill conditions to match your specific requirements.
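For instance, rather than editing the scripts each time, you could let the parameters be overridden from the environment using bash's default-value expansion. A small sketch (the variable names match the ones used in the update script):

```shell
#!/bin/bash
# Sketch: read tuning parameters from the environment, falling back to the
# defaults used in this post, e.g.:
#   EXPIRATION_DURATION=120 FILL_VALUE=1 ./update_conditions.sh conditions.csv
EXPIRATION_DURATION="${EXPIRATION_DURATION:-90}"  # seconds of silence before signal loss
FILL_VALUE="${FILL_VALUE:-0}"                     # static value used to fill gaps
echo "Applying updates: Expiration=${EXPIRATION_DURATION}s, Fill Value=${FILL_VALUE}"
```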
Ready to audit your own alerts? Don't let a "false silence" be the way you find out your system is down. Explore NerdGraph Documentation.
Happy automating!
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.