The past five years have seen rapid growth at Seismic. We’ve expanded into new markets and acquired several companies. As we’ve grown, our platform has grown with us. Entering new markets means meeting new regulatory requirements—like geographic residency and cloud provider residency—for storing personal data in specific cloud environments. New acquisitions, while allowing us to deliver the best product for our customers, have also meant integrating new tech stacks into our operational infrastructure. Through all of this, our customers need to experience a unified, world-class Seismic Enablement Cloud™. As a global leader in enablement, we need consistency in how we monitor all features on our platform.

Two key practices have helped us grow while staying focused on customer needs and keeping our technical complexity in check.

Standardizing with Terraform

At Seismic, we don’t have any on-premises servers; we operate our infrastructure in multiple geo-locations across multiple major cloud providers. We take a cloud-agnostic approach, with a few exceptions where we leverage our cloud providers’ managed services.

We want our developer teams to have a consistent experience across the different cloud providers—and that means focusing on standards. Our cloud and infrastructure engineers need to be able to work across all the platforms and tooling we use. They need to abstract away complexities and implement frameworks that enforce standards. This minimizes the cloud-specific expertise an application development team needs to build, deploy, and support its services. We can launch a new region in one of our existing cloud providers in under one month, and we can onboard an entirely new cloud provider in six to nine months. None of this would be possible without our commitment to engineering standards and practices such as infrastructure as code.
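To make that concrete, here is a minimal sketch of what a new-region launch can look like when everything is expressed as code. The module name, source path, and variable values are illustrative only, not our actual layout, and provider configuration is omitted:

```hcl
# Illustrative only: standing up a new region becomes a declarative change
# against a shared module rather than a manual build-out.
module "region_eu_west" {
  source = "./modules/cloud-region"   # hypothetical shared module

  cloud_provider = "azure"            # same interface regardless of provider
  region         = "westeurope"
  environment    = "production"
}
```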

When Seismic makes a strategic acquisition, our first task is to reduce the duplication of monitoring and logging solutions. Fortunately, moving everything to New Relic has given us that “single pane of glass” observability across our multicloud environments. It has also benefited our finance operations teams: by reducing the number of vendors we rely on for monitoring, we have cut our tooling and infrastructure costs over time.

After recently completing our final observability tool consolidation project, our core SRE team refocused on creating and codifying observability standards as Terraform modules. We started with the golden signals; that’s the foundation we build on. Then we address more specific needs with individual teams. We use naming conventions and standards to ensure consistency across the different services, regions and clouds.
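A simplified sketch of what such a module interface can look like is below; the variable names, defaults, and naming convention are illustrative rather than our production code:

```hcl
# Illustrative module interface: every service declares the same inputs,
# and the naming convention is computed in one place.
variable "service_name"   { type = string }
variable "cloud_provider" { type = string }
variable "region"         { type = string }
variable "environment"    { type = string }

# Golden-signal thresholds with conservative defaults that teams tune later.
variable "latency_p95_ms_critical" { default = 1000 }
variable "error_rate_pct_critical" { default = 5 }

locals {
  # One convention everywhere: <service>.<cloud>.<region>.<environment>
  resource_prefix = "${var.service_name}.${var.cloud_provider}.${var.region}.${var.environment}"
}
```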

For example, when an engineering team creates a new service or needs to deploy to a new cloud region or cloud provider, the modules ensure consistency and eliminate manual configuration. The Terraform module sets up the baseline metrics and alerts—integrated into Slack—and defines some starting thresholds. Then, an SRE team member collaborates with the engineering team to implement these templates and tune the thresholds to suit each service’s needs. Once that is done, teams can promote critical alerts to PagerDuty so on-call engineers can respond quickly as issues arise. This reduces cognitive load for New Relic users, because they don’t have to deal with differences in how each service, region or environment is described in the system. Once you know how it works in one cloud environment, you know how it is described everywhere else in our infrastructure.
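Continuing the illustrative sketch above, a baseline alert inside the module might look like the following, using the New Relic Terraform provider’s alert policy and NRQL condition resources. The policy name, query, and thresholds are placeholders, not our real configuration:

```hcl
# Illustrative baseline alert: the threshold starts at the module default and
# is tuned per service together with SRE.
resource "newrelic_alert_policy" "service" {
  name = "${local.resource_prefix}-golden-signals"
}

resource "newrelic_nrql_alert_condition" "error_rate" {
  policy_id = newrelic_alert_policy.service.id
  type      = "static"
  name      = "${local.resource_prefix}-error-rate"

  nrql {
    query = "SELECT percentage(count(*), WHERE error IS true) FROM Transaction WHERE appName = '${local.resource_prefix}'"
  }

  critical {
    operator              = "above"
    threshold             = var.error_rate_pct_critical   # starting point, tuned with each team
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}
```

In a setup like this, the Slack integration and the promotion of critical alerts to PagerDuty described above would typically be wired up through New Relic notification destinations and workflows defined alongside the policy, so routing changes stay in code as well.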

In the past, we had a fragmented approach, and teams had inconsistent coverage levels, which limited overall visibility into our systems. Developer self-service carried a high cognitive load because it relied primarily on documented standards and practices. Our recent journey has moved us from general documented standards and guidelines to templated modules that are easier to adopt, ensure consistency across environments, and require less cognitive load to get what you need.

Weekly cross-functional team meetings

In recent months, we have established weekly cross-functional meetings with our various product areas. Representatives from product, development, design, quality, support and SRE discuss the state of new projects and the overall health of our existing production system.

As one of the representatives in these meetings, SRE focuses on sharing incident response data and SLO trends from New Relic. We also look at how often on-call engineers are being paged and assess whether an alert is actionable, whether thresholds need to be tuned, or whether there is an unaddressed systemic problem. New Relic helps us highlight these anomalies and guides us to the area that requires attention: Was an issue caught by an observability alert or reported by a human? If an issue came through Seismic customer support, we use New Relic to understand where we could have been more proactive in responding. How long did it take to escalate the issue to the right resolver? We have been able to set up alerts and thresholds so that when issues are identified, we can more efficiently determine which services or infrastructure are involved.
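As a hypothetical example, a small review dashboard defined in Terraform like everything else can make paging volume easy to scan in these meetings; the dashboard name and NRQL below are illustrative, not what we actually run:

```hcl
# Illustrative weekly review dashboard: incidents opened per alert condition.
resource "newrelic_one_dashboard" "oncall_review" {
  name = "Weekly on-call review"

  page {
    name = "Paging volume"

    widget_table {
      title  = "Incidents opened per alert condition (last 7 days)"
      row    = 1
      column = 1
      width  = 6
      height = 3

      nrql_query {
        query = "SELECT count(*) FROM NrAiIncident WHERE event = 'open' FACET conditionName SINCE 1 week ago"
      }
    }
  }
}
```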

Our templated New Relic dashboards and alerts are a good foundation. However, each product area may still encounter difficulties or gaps in its system’s observability. These concerns are brought to our global SRE team, and we discuss how to act on them. We do our best to standardize and provide solutions that are reusable and extendable across all our product areas.
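As a hypothetical example of what that reuse looks like in practice, a product team adopts the shared module and overrides only what is specific to its service:

```hcl
# Illustrative adoption by a product team: the shared baseline is reused,
# and only the service-specific threshold is overridden.
module "search_api_observability" {
  source = "./modules/observability-baseline"   # hypothetical shared module

  service_name   = "search-api"
  cloud_provider = "aws"
  region         = "eu-west-1"
  environment    = "production"

  # Tuned with SRE after reviewing the service's real traffic patterns.
  error_rate_pct_critical = 2
}
```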

We’ve made a huge amount of progress over the past few years driving standards in our tooling and technology. Reducing our technology sprawl was a key enabler for Seismic to scale across multiple geo-locations and major cloud providers while minimizing the cognitive load on our engineers. New Relic enables us to support the continued growth of both the features we offer and the customer base using the Seismic Enablement Cloud.