At New Relic, we understand the value of testing our systems to ensure their efficiency and resiliency against hazardous conditions. Whether it’s through chaos engineering or adversarial gamedays, our DevOps teams employ reliability best practices to learn more about our systems and develop new ways to improve them.

Chaos engineering involves carefully injecting harm into our systems to test the systems’ response to it. This allows us to prepare and practice for outages, and to minimize the effects of downtime before it occurs.

The operative word here is carefully. We’re not trying to break our systems—we’re trying to make them stronger and more resilient. And despite the name, chaos engineering is not actually chaotic. Instead, chaos engineering involves thoughtful, planned, and controlled experiments designed to demonstrate how our systems behave in the face of failure.

There is no shortage of chaos engineering resources, but recently the Unified API team leveraged our implementation of the GraphQL API as an entry point for our internal chaos engineering practices. Sure, Netflix has its Chaos Monkey, but New Relic has a Chaos Panda!

The problem statement

Unlike typical REST APIs that often require loading from multiple endpoints, GraphQL provides a single endpoint that can manage complex queries, so you can get the data your app needs from many services—all in one request. We’ve achieved this with “schema stitching,” which creates a single schema from several underlying GraphQL APIs and enables us to deliver a unified experience across all of our APIs, via our GraphQL server.

To make a request for data with New Relic’s GraphQL API, consumers query for various fields in a data structure; for example, a query on the accounts field returns account information associated with the user who ran the query. The fields are resolved by making requests to various downstream services. If a service is down or running slowly, GraphQL may send back a partial response: the request won’t fully fail, but the response may be missing data. That differs from a REST API, where a response usually indicates a simple success or failure.

It’s these partial responses that we were interested in. Particularly, how could we test against latency and field errors, and in turn find opportunities to add more resilient code in our services that would respond proactively to partial request failures?
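
To make the idea of handling partial responses concrete, here is a minimal client-side sketch in Python. It assumes only the standard GraphQL response shape (a "data" object plus an "errors" list whose entries carry a "path"); the helper names split_partial_response and get_field, and the sample response, are hypothetical and not part of New Relic's API.

```python
from typing import Any, Dict, Tuple


def split_partial_response(response: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Separate usable data from per-field errors in a GraphQL response."""
    data = response.get("data") or {}
    failed: Dict[str, str] = {}
    for error in response.get("errors", []):
        # "path" lists the keys leading to the field that failed to resolve.
        path = ".".join(str(part) for part in error.get("path", []))
        failed[path or "<request>"] = error.get("message", "unknown error")
    return data, failed


def get_field(data: Dict[str, Any], path: str, default: Any = None) -> Any:
    """Walk a dotted path, returning a default for missing or null fields."""
    node: Any = data
    for key in path.split("."):
        if not isinstance(node, dict) or node.get(key) is None:
            return default
        node = node[key]
    return node


# Hypothetical partial response: one field resolved, one failed.
response = {
    "data": {"currentUser": {"accounts": [{"id": 1}], "entitySearch": None}},
    "errors": [
        {
            "message": "entitySearch failed to resolve.",
            "path": ["currentUser", "entitySearch"],
        }
    ],
}

data, failed = split_partial_response(response)
accounts = get_field(data, "currentUser.accounts", default=[])
entities = get_field(data, "currentUser.entitySearch", default=[])
```

A consumer written this way can still render the accounts it received while reporting or retrying only the fields listed in failed, rather than treating the whole request as a failure.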

Enter the Chaos Panda

With our new Chaos Panda testing tool, internal New Relic teams can configure GraphQL in pre-production testing to do things like add latency to its responses or cause certain fields to fail at a specific failure rate. It’s as simple as running a GraphQL mutation (a GraphQL operation type that modifies data and can also return it).

Here’s an example of a chaosStart mutation that kicks off a chaos session for the entitySearch and accounts fields and slows responses down by 5,000 ms:

mutation {
  chaosStart(configuration: {
    fieldErrors: [
      {name: "entitySearch", probability: 1.0},
      {name: "accounts", probability: 0.70}
    ],
    latency: 5000
  })
}

In this case, we’ve configured the entitySearch field with a probability of 1.0, so that field will return errors from GraphQL 100% of the time; we’ve given the accounts field a probability of 0.70, so it will return errors 70% of the time. We’ve also configured latency at 5,000 ms, so the GraphQL query response will be delayed by 5,000 ms.
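
For readers curious how probability-based field failures and added latency can work in principle, here is a rough Python sketch. It is not Chaos Panda's implementation, which this post doesn't show; ChaosConfig, execute_with_chaos, and the sample resolvers are all hypothetical names, and the sketch assumes latency is applied once per response and that each field fails independently.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ChaosConfig:
    """Hypothetical mirror of the chaosStart configuration argument."""
    field_errors: Dict[str, float] = field(default_factory=dict)  # name -> failure probability
    latency_ms: int = 0  # delay added to the whole response


def execute_with_chaos(
    resolvers: Dict[str, Callable[[], Any]],
    config: ChaosConfig,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Resolve each field, injecting latency and per-field failures."""
    if config.latency_ms > 0:
        sleep(config.latency_ms / 1000)  # delay the response once, in seconds
    data: Dict[str, Any] = {}
    errors = []
    for name, resolve in resolvers.items():
        # rng.random() is in [0.0, 1.0), so 1.0 always fails and 0.0 never does.
        if rng.random() < config.field_errors.get(name, 0.0):
            data[name] = None
            errors.append({"message": f'"{name}" failed to resolve.', "path": [name]})
        else:
            data[name] = resolve()
    return {"data": data, "errors": errors}


config = ChaosConfig(field_errors={"entitySearch": 1.0, "accounts": 0.70}, latency_ms=5000)
resolvers = {"entitySearch": lambda: ["entity-1"], "accounts": lambda: [{"id": 1}]}
delays = []  # record the delay instead of really sleeping for five seconds
result = execute_with_chaos(resolvers, config, random.Random(), sleep=delays.append)
```

Passing the random number generator and the sleep function as parameters keeps the sketch easy to test: a seeded generator makes runs repeatable, and a stand-in for sleep avoids waiting out the injected delay.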

Here’s the response with Chaos Panda running:

{
  "data": {
    "currentUser": {
      "accounts": null,
      "entitySearch": null
    }
  },
  "errors": [
    {
      "locations": [
        {
          "column": 0,
          "line": 6
        }
      ],
      "message": "Chaos Panda strikes again: \"accounts\" failed to resolve.",
      "path": [
        "currentUser",
        "accounts"
      ]
    },
    {
      "locations": [
        {
          "column": 0,
          "line": 3
        }
      ],
      "message": "Chaos Panda strikes again: \"entitySearch\" failed to resolve.",
      "path": [
        "currentUser",
        "entitySearch"
      ]
    }
  ],
...

A chaos configuration is scoped to the user who issues the mutation, so other users will not be affected. Any additional chaosStart mutations will overwrite the currently active chaos configuration for that user. A chaos session automatically expires after one hour, and order is restored.
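
The session rules above (per-user scope, overwrite on restart, one-hour expiry) can be sketched as a small in-memory store. This is an illustration under assumptions, not how Chaos Panda stores its state: the ChaosSessions class, its method names, and the user IDs are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

SESSION_TTL_SECONDS = 60 * 60  # sessions expire after one hour


class ChaosSessions:
    """Hypothetical per-user store for active chaos configurations."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sessions: Dict[str, Tuple[float, Dict[str, Any]]] = {}

    def start(self, user_id: str, configuration: Dict[str, Any]) -> None:
        # A new start replaces any configuration the user already had.
        expires_at = self._clock() + SESSION_TTL_SECONDS
        self._sessions[user_id] = (expires_at, configuration)

    def stop(self, user_id: str) -> None:
        self._sessions.pop(user_id, None)

    def active_configuration(self, user_id: str) -> Optional[Dict[str, Any]]:
        entry = self._sessions.get(user_id)
        if entry is None:
            return None
        expires_at, configuration = entry
        if self._clock() >= expires_at:
            # Expired: drop the session so requests behave normally again.
            del self._sessions[user_id]
            return None
        return configuration


# Use a controllable clock so expiry can be shown without waiting an hour.
now = [0.0]
sessions = ChaosSessions(clock=lambda: now[0])
sessions.start("user-a", {"latency": 5000})
sessions.start("user-a", {"latency": 100})  # overwrites the first configuration
```

Because the store is keyed by user, a lookup for any other user returns nothing, which is what keeps one person's experiment from affecting anyone else.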

To manually disable a test, though, users can run the simple chaosStop mutation:

mutation {
  chaosStop
}


Early days in our (chaotic) journey

GraphQL dramatically reduces the overhead of working with New Relic data. The current iteration of Chaos Panda very much represents the start of our journey using the GraphQL API for chaos engineering. By building the API first, we’re making our developers' lives as easy as possible, and that applies to all aspects of the development process—from development to efficiency and resiliency testing.