Hyperscale Log Management


The New Relic global engineering organization responsible for log management makes extensive use of the company's own products to serve internal and external customers. It operates at significant scale, handling tens of petabytes of logs and billions of log-focused queries each month. The organization works in a true continuous deployment mode, often deploying dozens of times per day, and uses New Relic for observability. Because changes ship this frequently, reliable validation of each deployment is paramount, and New Relic provides the necessary insight.

By running New Relic on New Relic, including its logging product, the organization achieves significant results:

Quality of Service: The primary value is a consistently high quality of service for customers, because using the logging product internally helps ensure that it works effectively.
Proactive Issue Identification: Using New Relic daily allows teams to proactively identify and address issues before they reach production, minimizing customer impact.
Safer Releases: The ability to identify and address issues early enables safer releases of new features and updates.
Faster Incident Response: As part of the New Relic Emergency Response Force (NERF), the teams rely on the New Relic logging product for effective incident response. PagerDuty alerts linked to New Relic alerts provide charts of key metrics and runbook links for quick diagnosis and resolution, eliminating context switching.

While many metrics are observed, here are some of the most useful to maintain high reliability:

Service Level Indicators (SLIs): Top-level SLIs are regularly reviewed for key experiences, such as endpoint latency for log ingestion and compliance across various integrations (for example, AWS Kinesis Firehose, TCP, syslog).
Service Level Objectives (SLOs): The New Relic platform has a high availability target for its customers. This metric reflects New Relic's commitment to data integrity and reliability.
JavaScript Errors: Monitored by environment, browser, user, and product component to track user experience and identify potential issues.
Data Lag: Monitoring increases and decreases in data lag is crucial, particularly for incident response, because New Relic customers depend on high availability of the platform.

The log management engineering organization uses many New Relic capabilities, including:

Service Levels, APM, Infrastructure Observability, and Logs: These core platform capabilities and insights are used to ensure that key services operate within designated error budgets, and to proactively troubleshoot and resolve issues.
Proactive Alerting: On-call engineers rely on alerting as a crucial component of their incident response, in particular when getting paged for potentially high-severity issues. These alerts link directly to New Relic alerts that provide charts for immediate diagnosis. This integrated alerting process, coupled with established runbooks, significantly cuts down on their response time and helps them proactively identify and address issues.
Comprehensive Integrations: Integrations with major cloud providers' services and open source tooling, coupled with New Relic agents, allow data to be ingested and logs to be correlated across tools, powering comprehensive observability.
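
As a rough illustration of the error-budget idea mentioned above, the following Python sketch computes how much of a budget remains for a given SLO target. The function name, the 99.9% target, and the event counts are hypothetical and are not taken from the New Relic platform or its actual targets.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target   -- desired fraction of good events, e.g. 0.999 (assumed value)
    total_events -- all events observed in the SLO window
    bad_events   -- events that violated the SLI (errors, slow requests, ...)
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    # The budget is the number of bad events the SLO tolerates in the window.
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 ingest requests, 250 of them failed.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

A service whose result drops toward zero has little room left for risky changes in the current window, and a negative result means the budget is already overspent.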
