In the dynamic world of online education, Thinkific is a beacon for thousands of creator educators. With innovative and easy-to-use tools and features, Thinkific provides creators with the opportunity to create, market, and sell courses, communities, and other digital products to millions of learners worldwide. Our journey from a small, nimble startup to hundreds of employees has been transformative. While our rapid growth is exciting, it has also presented challenges when it comes to understanding our systems. In response, we aligned our service level objectives (SLOs) directly with our business metrics, paving the way for a revolution in organizational observability and turning potential chaos into a strategic advantage. 

Here's how we did it:

1. Consolidate monitoring tools and strategy

Our first task was tooling. Reducing the number of monitoring tools can benefit any organization, because multiple tools make it hard to paint a global picture of your infrastructure, services, and client experiences. As our development teams grew rapidly to meet our surge in customers, they were reviewing logs in one tool and analyzing service performance in another. Each service had its own approach to observability, looking at different metrics to decide which issues to address as a priority. There was no single approach to collating MELT (metrics, events, logs, and traces) data for a comprehensive overview of our systems.

Costs were also increasing. The multiple tools in use by each team were adding to our costs, and because of the fragmentation, we weren't getting the full value from what we were spending on observability.

Our first goal was to roll out observability comprehensively across multiple services. With New Relic, we could establish a consistent level of quality, optimize customer-facing software, and introduce common instrumentation. We wanted to instill observability as a mindset rather than rely on ad hoc logging and monitoring, and having one tool in use across all functions helped us build that culture.

2. Implement KPIs for all services

With the foundations laid, our focus turned to the heart of our product ecosystem—our services. Our product is made up of multiple services with different functionalities, such as student dashboards, learning communities, branded mobile apps, and a suite of e-commerce tools—including a proprietary payments platform. Reliability, performance, uptime, and latency are critical to us.

Recognizing this, we set out to define and implement the key performance indicators (KPIs) that mattered most. Using New Relic's service level objective (SLO) capabilities, we began crafting a performance blueprint that all services could aspire to, grounding these objectives in measurable metrics like performance, uptime, and latency.

From here, we enhanced our service level agreements (SLAs) with clients, which we expose publicly through our website. For our own internal SLOs, we wanted to set stricter levels so that we can be much more aggressive about the quality and reliability we want to achieve. Some SLOs have clear revenue impacts, such as those for Thinkific Payments. We watch the payments SLOs very carefully because latency and uptime directly affect revenue. Our customers can’t make money if they can’t sell: if Thinkific Payments is down, it interrupts the flow of business opportunities for our customers.
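
To make the relationship between a public SLA and a stricter internal SLO concrete, here is a minimal Python sketch. It is an illustration only, not how New Relic or Thinkific computes these values: the targets, the request counts, and the names `SloWindow`, `compliance`, and `error_budget_remaining` are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one service over one reporting window."""
    good_events: int   # requests that met the success/latency criteria
    total_events: int  # all valid requests in the window


def compliance(window: SloWindow) -> float:
    """Fraction of events that were good; 1.0 when there was no traffic."""
    if window.total_events == 0:
        return 1.0
    return window.good_events / window.total_events


def error_budget_remaining(window: SloWindow, target: float) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    allowed_bad = (1.0 - target) * window.total_events
    actual_bad = window.total_events - window.good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical targets: the internal SLO is stricter than the public SLA.
PUBLIC_SLA = 0.995
INTERNAL_SLO = 0.999

# Hypothetical monthly counts for a payments service.
payments = SloWindow(good_events=997_000, total_events=1_000_000)
meets_sla = compliance(payments) >= PUBLIC_SLA    # SLA is still honored
meets_slo = compliance(payments) >= INTERNAL_SLO  # internal bar is missed
```

With these made-up numbers the service still honors the public SLA but misses the internal SLO, which is the early warning a stricter internal target is meant to provide.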

Each product development team monitors its own SLOs, but we also want to aggregate those SLOs across all our services. This lets us report to decision-makers in a way that doesn’t require them to look at SLOs from many sources. Having clear SLOs that are measured consistently and can be aggregated allows us to report on them and encourage decisions that weigh how much to invest in improving reliability versus adding new features.
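
One simple way to roll consistently measured SLOs up into a single number is to weight each service by its traffic. The Python sketch below assumes every service reports good and total event counts for the same window; the service names, the counts, and the `aggregate_compliance` helper are hypothetical, and other aggregation schemes are equally valid.

```python
def aggregate_compliance(services: dict[str, tuple[int, int]]) -> float:
    """Traffic-weighted compliance across services.

    `services` maps a service name to (good_events, total_events)
    measured over the same reporting window.
    """
    good = sum(g for g, _ in services.values())
    total = sum(t for _, t in services.values())
    return good / total if total else 1.0


# Hypothetical monthly counts per service.
monthly = {
    "payments": (9_990, 10_000),
    "student_dashboard": (49_500, 50_000),
    "communities": (39_900, 40_000),
}
overall = aggregate_compliance(monthly)
```

Weighting by traffic keeps the roll-up close to what users experience overall, but it can hide a struggling low-traffic service, so a report built this way would usually still list per-service figures next to the aggregate.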

3. Report baselines and KPIs: Support decision-making through observability insights

At Thinkific, we consider precision in communication and decision-making critical. Observability isn't just about collecting data; it is about translating that data into actionable insights, particularly for those steering the ship: our senior engineering and product leaders. With common observability tooling and standardized KPIs, we now pull together monthly reports.

Our reports provide a snapshot of our reliability overall and highlight any critical problems. When reporting metrics, we adopt a traffic light system—green, yellow, and red—to draw immediate attention to areas of concern and equip even our non-technical leaders with the understanding to make informed decisions. This also helps us make recommendations and share our rationale behind them in our report.
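
A traffic light status can be derived mechanically from an SLO's measured compliance and its target. The sketch below is one possible rule, not the one used in our reports: the `traffic_light` function and the `warn_margin` threshold are assumptions chosen for illustration.

```python
def traffic_light(compliance: float, target: float,
                  warn_margin: float = 0.001) -> str:
    """Classify an SLO result for a report.

    green  : the target is met
    yellow : the target is missed by no more than `warn_margin`
    red    : the target is missed by more than `warn_margin`
    """
    if compliance >= target:
        return "green"
    if compliance >= target - warn_margin:
        return "yellow"
    return "red"


# Hypothetical results measured against a 99.9% target.
statuses = {
    "communities": traffic_light(0.9995, target=0.999),
    "student_dashboard": traffic_light(0.9985, target=0.999),
    "payments": traffic_light(0.9900, target=0.999),
}
```

Keeping the rule this simple means a non-technical reader can see exactly why a service is yellow or red, and the margin can be tuned per service if some SLOs deserve an earlier warning.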

Our reports don't just explain the metric: 'This is below the threshold, therefore you should work on it'. They weave in the ‘why’ behind each figure, connecting the dots between technical performance and its impact on our users and business goals. This approach empowers our leadership to prioritize effectively, balancing the scales between innovation and optimization with a keen eye on customer value. They can better answer whether we should allocate resources in upcoming sprints to resolve scalability, performance, or latency issues, or whether we should continue building new features for our customers.

New Relic helps us collect this data straight from our systems, so we don’t have to collect and collate it ourselves. When we share SLO metrics in monthly reports, we can link those observability metrics to their business impact. That lets us have conversations about prioritizing technical work according to the data, and our leadership team then makes decisions based on our recommendations. At quarterly planning meetings, we can show what we did, how the system has improved, and how those improvements have enhanced the customer experience and business metrics overall.

4. Map SLO user journeys

Using our quarterly planning sessions to identify our priority target segments for the quarter ahead, we can now examine the flow of services our customers access and use when they are in our product and how our systems behave. From there, we can start mapping their user journeys through our product and tracking the SLO metrics for each service in their critical path. By doing that, we can focus our engineering time on supporting our priority target groups through their use of our product.
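
As a rough illustration of tracking SLOs along a critical path, the Python sketch below treats a user journey as an ordered list of services and estimates journey-level compliance as the product of the per-service figures. That product assumes service failures are independent, which is a simplification. The journey, the service names, the compliance values, and both helper functions are hypothetical.

```python
from math import prod


def journey_compliance(journey: list[str],
                       compliance_by_service: dict[str, float]) -> float:
    """Estimate the chance that every step of a journey succeeds.

    Assumes failures in different services are independent.
    """
    return prod(compliance_by_service[name] for name in journey)


def weakest_link(journey: list[str],
                 compliance_by_service: dict[str, float]) -> str:
    """Return the service on the path with the lowest compliance."""
    return min(journey, key=lambda name: compliance_by_service[name])


# Hypothetical per-service compliance for the reporting window.
measured = {
    "storefront": 0.999,
    "payments": 0.998,
    "student_dashboard": 0.9995,
}

# Hypothetical critical path for a learner buying and opening a course.
purchase_journey = ["storefront", "payments", "student_dashboard"]

journey_score = journey_compliance(purchase_journey, measured)
bottleneck = weakest_link(purchase_journey, measured)
```

The journey score comes out lower than any single service's compliance, which is why looking at the whole path, and at its weakest link, can point engineering time at a different place than per-service dashboards would.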

We want all teams, including product managers, to be able to dig into our metrics and understand how things are going from a performance perspective. We see reliability as a key feature of our product, and New Relic helps us connect product managers and technical teams to work on issues together. Our observability culture encourages everyone to jump in, dig into issues, and think about how best to support the customer journey.

5. A culture of collective ownership

At the heart of Thinkific's product development organization lies a fundamental shift toward a culture of shared responsibility. It's a culture where every team member, from designers to product managers to engineers, is empowered to contribute to our collective success. Through this shared commitment, we're building a platform and crafting experiences that educate, inspire, and transform lives.

This journey of integrating SLOs into our strategic fabric showcases the transformative power of observability. As we move forward, we remain committed to this path of continuous improvement, driven by data, and united in our mission to empower creator educators and their audiences around the globe.