Platform engineering is about building platforms that help your teams quickly and reliably develop and deploy software. Specifically, it’s about providing and standardizing tools and workflows—the platform—to make the day-to-day of building applications easier. These platform tools include self-service pipelines, infrastructure provisioning, container orchestration, identity management, and monitoring and observability tools like New Relic.
In this post, you’ll learn what platform engineering is, why it’s important for delivering monitoring and observability tools, and steps you can take to implement observability using platform engineering.
What is platform engineering?
Platform engineering is about building a platform that supports your teams.
Platform engineers, unlike site reliability engineers, do not have operational responsibility for production systems. Instead, they provide preconfigured solutions to development and DevOps teams to help build and deploy code quickly, with high consistency, and with adherence to organizational standards. Platform engineers provide solutions for common and mission-critical components such as:
- Kubernetes cluster configuration and deployment
- Database provisioning
- Authentication and access control
- Security and vulnerability management
Platform engineers focus on building what are known as “paved roads.” If you’re driving a car, it’s much easier to reach your destination on a paved road that has signs telling you where you need to go. In software development, a paved road provides teams with a clear, standardized path for building and deploying software. There’s no need to reinvent a process, provision new tools, or wonder if you're deploying your software in a consistent way. Instead, the platform engineering team builds that road for you.
This road often consists of an internal developer platform that enables DevOps teams to browse the catalog of available solutions and integrate them into their CI/CD pipeline. Though some implementations have a UI, access via APIs is often preferred because it can be more easily automated. In short, platform engineering teams aim to reduce cognitive load without taking away autonomy from the teams they support.
Observability is an essential part of platform engineering
The main DevOps goals are to improve release quality, increase release frequency, and improve operational efficiency. To achieve that, DevOps teams need timely feedback on performance, errors, and user engagement. That information has to come from multiple environments and often spans multiple versions of the same product. You need tools like distributed tracing, especially if you’re using microservices. All of that information has to be correlated with releases to help understand the impact of changes and the root causes of errors and outages.
The goal of a platform engineer is to support DevOps teams in doing that work. That means provisioning and configuring observability tools—providing the paved road that allows DevOps teams to get to their destination without any mishaps.
In practice, platform engineers need to build observability into every environment from day one. Once provisioned, observability data needs to be easily accessible to every stakeholder. Data from different sources, such as logs and events, needs to be integrated and correlated to facilitate fast troubleshooting. Without that paved road, your teams will likely end up with a patchwork of solutions and siloed data, or worse, no solutions at all, making it extremely challenging to collaboratively fix problems when they arise.
Above all, platform engineers need to implement solutions that minimize cognitive load and tool sprawl. Where possible, the tools should be well-documented, consistent to deploy, and easy to use and implement across your environments.
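To make the idea of correlating telemetry with releases more concrete, here is a minimal Python sketch. It assumes you can export deployment markers and error timestamps from your observability tool as plain lists; the function name `errors_per_release` and the `before-first-deploy` label are illustrative and not part of any product API.

```python
from bisect import bisect_right
from collections import Counter
from datetime import datetime


def errors_per_release(deployments, error_times):
    """Attribute each error to the release that was live when it occurred.

    deployments: list of (deployed_at, version) tuples.
    error_times: list of datetimes, one per error event.
    """
    deployments = sorted(deployments)
    deploy_times = [deployed_at for deployed_at, _ in deployments]
    counts = Counter()
    for ts in error_times:
        # Index of the most recent deployment at or before this error.
        idx = bisect_right(deploy_times, ts) - 1
        version = deployments[idx][1] if idx >= 0 else "before-first-deploy"
        counts[version] += 1
    return dict(counts)


deployments = [
    (datetime(2024, 1, 1, 12), "v1.0"),
    (datetime(2024, 1, 2, 12), "v1.1"),
]
errors = [
    datetime(2024, 1, 1, 11),
    datetime(2024, 1, 1, 13),
    datetime(2024, 1, 2, 13),
    datetime(2024, 1, 2, 14),
]
print(errors_per_release(deployments, errors))
```

A jump in the count for one version is a hint, not proof, that the release introduced a regression. In a real setup your observability platform would typically do this correlation for you once deployments are recorded as events.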
Observability criteria for platform engineers
From a platform engineering perspective, an observability solution should be:
- Fast and easy to implement: Additional implementations may be necessary across different teams, and this process should be standardized.
- Easily customizable: Your observability solution should be customizable across teams, environments, products, and services, and should also include options to automate where possible.
- Scalable: The solution should remain resilient as your applications scale. This makes on-premises solutions less desirable, and it is also a consideration for open source and bespoke implementations, which may not perform as well at scale.
You’ll also want to consider the features that an observability solution offers. Where possible, platform engineers should reduce tool sprawl and prioritize unified solutions that are easier to standardize across teams. The exact features you’ll need depend on your use case, but these are highly recommended:
- Alert notifications: The solution should push issue notifications to the right people and do so intelligently—without creating new toil from alert storms. Data from different sources needs to be correlated and easy to navigate. Ideally, the solution should be able to provide alerts in a wide variety of tools, such as Slack, email, and PagerDuty, so that it can be seamlessly integrated into existing workflows.
- Log management: Ideally, the solution should have centralized log access to enable searches across multiple services and environments. This removes the need for all developers to have direct access to all production environments. Logs are often the first port of call for debugging.
- Application performance monitoring (APM): APM provides real-time data on application metrics. This usually includes golden signals such as throughput, latency, and errors at multiple levels, including the browser, application logic, infrastructure, and network.
- Distributed traces: Traces help you understand end-to-end performance, even when a request passes through many services. This is essential both for pinpointing the cause of errors and finding bottlenecks in your systems.
- Kubernetes monitoring: If you’re orchestrating containers with Kubernetes, you’ll need to monitor the status of clusters and pods within clusters linked to the state of your underlying hosts and the applications running on them.
- Real user and synthetic monitoring: Web and mobile developers need to understand how they can improve user experience. A good observability solution can provide web developers with data on page popularity and Google Web Vitals. Mobile developers will want to understand the causes of crashes and how users interact with mobile apps. Both web and mobile developers need visibility into the many factors that affect performance such as device, OS, network, and geographical location. You can use real user monitoring (RUM) to observe how real users are navigating your application. Meanwhile, with synthetic monitoring, you can use a headless browser to test the performance and reliability of your user-facing services.
- Infrastructure and network monitoring: Your observability solution should also be able to monitor both your underlying infrastructure and network, allowing operations and network engineers to better collaborate and identify issues.
With a clear understanding of requirements, platform engineers can start building an internal developer platform. They'll want to enforce some standards that relate to security and to compliance with the organization's policies and external regulations. They'll also need to enforce some standards around naming and tagging for the benefit of the SRE team. Apart from that, there's really no need to be very prescriptive. DevOps teams will benefit from having preconfigured alerts and dashboards, but will also need the freedom to adapt and tweak them for their needs. Developers might also want to create custom events or enrich their transactions with custom attributes. The implementation scripts, policies, and guidance should all allow for that kind of customization of the observability solution.
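As a rough illustration of enforcing naming and tagging standards without being overly prescriptive, the Python sketch below checks a small set of required tags and a naming pattern while letting teams add any extra tags they like. The required tag keys, the allowed environments, and the kebab-case rule are all hypothetical; substitute your organization's own policy.

```python
import re

# Hypothetical organization policy: adjust to your own standards.
REQUIRED_TAGS = {"team", "environment", "service"}
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def validate_entity(name, tags):
    """Return a list of policy violations; an empty list means the entity passes."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"name '{name}' is not lowercase kebab-case")
    missing = sorted(REQUIRED_TAGS - set(tags))
    if missing:
        problems.append("missing required tags: " + ", ".join(missing))
    env = tags.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment '{env}'")
    # Extra tags are deliberately allowed so teams can add their own attributes.
    return problems


print(validate_entity(
    "checkout-api",
    {"team": "payments", "environment": "production",
     "service": "checkout", "cost-center": "42"},
))
print(validate_entity("Checkout_API", {"team": "payments"}))
```

A check like this could run in a CI/CD pipeline before a service is provisioned, so violations are caught early rather than discovered during an incident.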
Steps to implementing observability in your platform
Aim for 100% observability
To give your systems complete observability, platform engineers should ideally incorporate monitoring at the moment new services, applications, and pipelines are created. Here are some examples:
- Consider data security before gathering monitoring data. Could there be sensitive data in logs or database query strings? Should everyone have access to the same data?
- When setting up a database, include host monitoring for essential metrics such as CPU and memory utilization. The database service can be monitored for essential stats such as the status of table spaces, buffers and caches, number of connections and locks, and more. The details will vary depending on the database.
- When setting up a Kubernetes cluster, automatically capture logs along with data on the status of nodes, pods, and containers. If you want to know more about using New Relic to monitor Kubernetes, read What is Kubernetes and how should you monitor it? or watch Bringing developers Kubernetes context in New Relic APM.
- When deploying applications, build APM into the application server. Gather application logs and application server logs. You can learn more about application performance monitoring by reading How to monitor application performance with APM and APM vs. observability.
- Include other services that applications rely on such as RabbitMQ, Kafka, and web servers such as NGINX.
- To better understand how to measure and monitor the impact of observability, see What are SLOs, SLIs, and SLAs?
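One way to track progress toward full coverage is to compare an inventory of provisioned resources against the telemetry you expect for each kind of resource. The Python sketch below assumes such an inventory is available; the resource kinds, telemetry names, and the `Resource` class are hypothetical and chosen only to mirror the examples above.

```python
from dataclasses import dataclass, field

# Hypothetical expectations, loosely based on the examples in the list above.
EXPECTED_TELEMETRY = {
    "database": {"host_metrics", "db_metrics"},
    "kubernetes_cluster": {"cluster_state", "logs"},
    "application": {"apm", "logs"},
    "message_broker": {"service_metrics"},
}


@dataclass
class Resource:
    name: str
    kind: str
    telemetry: set = field(default_factory=set)


def coverage_gaps(resources):
    """Map each under-monitored resource name to its missing telemetry types."""
    gaps = {}
    for resource in resources:
        expected = EXPECTED_TELEMETRY.get(resource.kind, set())
        missing = expected - resource.telemetry
        if missing:
            gaps[resource.name] = sorted(missing)
    return gaps


def coverage_percent(resources):
    """Percentage of resources that report all expected telemetry."""
    if not resources:
        return 100.0
    covered = len(resources) - len(coverage_gaps(resources))
    return 100.0 * covered / len(resources)


inventory = [
    Resource("orders-db", "database", {"host_metrics", "db_metrics"}),
    Resource("orders-api", "application", {"apm"}),
]
print(coverage_gaps(inventory))
print(coverage_percent(inventory))
```

Note that this sketch treats resource kinds it does not recognize as fully covered; a stricter policy might flag them instead.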
After you’ve added observability to your CI/CD chain, the next step is to automate the configuration of alerts and dashboards. DevOps teams should get started with some default alerts based on golden signals, and then customize them. In addition, some generic dashboards that use tags or naming conventions to gather metrics based on common KPIs will provide a quick start for an application or team. Make these alerts and dashboards deployable via REST API or GraphQL.
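The sketch below shows one possible way to generate default golden-signal alerts that teams can then override. It only builds plain dictionaries and serializes them to JSON; it does not call any real alerting API. The signal names, thresholds, and field names are assumptions for illustration, so map them to whatever schema your tool's REST or GraphQL API expects.

```python
import json

# Hypothetical default thresholds for golden signals; tune these per organization.
GOLDEN_SIGNAL_DEFAULTS = {
    "latency_p95_ms": {"operator": "above", "threshold": 500},
    "error_rate_percent": {"operator": "above", "threshold": 5},
    "throughput_rpm": {"operator": "below", "threshold": 1},
}


def default_alerts(service, team, environment, overrides=None):
    """Build alert definitions from the defaults, applying per-team overrides."""
    overrides = overrides or {}
    alerts = []
    for signal, default in GOLDEN_SIGNAL_DEFAULTS.items():
        # Team overrides win over platform defaults, field by field.
        condition = {**default, **overrides.get(signal, {})}
        alerts.append({
            "name": f"{service} {environment} {signal}",
            "signal": signal,
            "operator": condition["operator"],
            "threshold": condition["threshold"],
            "tags": {"service": service, "team": team, "environment": environment},
        })
    return alerts


alerts = default_alerts(
    "checkout-api", "payments", "production",
    overrides={"latency_p95_ms": {"threshold": 800}},
)
# The JSON payload would be sent by a pipeline step to your tool's API.
print(json.dumps(alerts, indent=2))
```

Keeping defaults and overrides separate gives teams a working starting point while preserving their freedom to tune thresholds.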
Observability is an essential part of platform engineering. You can get started with APM and observability in just a few minutes with a free New Relic account. Your account includes 100 GB/month of free data ingest, one free full-access user, and unlimited free basic users.
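The views expressed on this blog are those of the author and do not necessarily reflect the official views of New Relic K.K. This blog may also contain links to external sites. New Relic makes no guarantees regarding the content of those linked sites.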