Icarus

Performance Engineering · 2019

Built a real user monitoring (RUM) platform that processes 180B+ events daily, with anomaly detection and alerting

Overview

A Real User Monitoring (RUM) solution that captures and analyzes performance metrics from Netflix's 200M+ subscribers, with a web UI for analysis, alerting, and anomaly detection

Problem

Understanding real user performance at Netflix scale required processing billions of events daily and identifying anomalies that impact user experience

Constraints

  • Must handle 180B+ events per day
  • Must detect anomalies in near real-time
  • Must correlate performance with user behavior
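
For a sense of scale, the first constraint can be turned into a sustained ingest rate: 180B events per day averages out to roughly 2.1 million events per second. The sketch below does that arithmetic. The peak-to-average ratio is a hypothetical parameter for illustration, not a figure from the project.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day: float) -> float:
    """Sustained average ingest rate implied by a daily event volume."""
    return events_per_day / SECONDS_PER_DAY


def provisioned_events_per_second(events_per_day: float, peak_to_average: float) -> float:
    """Rate to provision for, given an assumed peak-to-average traffic ratio."""
    return average_events_per_second(events_per_day) * peak_to_average


if __name__ == "__main__":
    daily = 180e9
    print(f"average: {average_events_per_second(daily):,.0f} events/s")
    # peak_to_average=2.0 is a hypothetical value chosen only for illustration.
    print(f"assumed 2x peak: {provisioned_events_per_second(daily, 2.0):,.0f} events/s")
```

Traffic is rarely flat across the day, so the average is a lower bound on what the ingestion tier has to absorb.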

Approach

Built a distributed event processing pipeline with real-time anomaly detection, integrated with Netflix's observability and alerting infrastructure

Key Decisions

Process events in real-time rather than batch

Reasoning:

Real-time processing enables faster detection and response to performance issues affecting users
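
The difference from batch can be sketched with a tumbling-window aggregation that emits a summary as soon as each window closes, instead of waiting for a full day of data. This is a plain-Python illustration of the idea, not the Spark job itself. The `Event` fields, the p95 metric, and the window size are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    timestamp_s: float   # event time in seconds (hypothetical field)
    load_time_ms: float  # page load time (hypothetical metric)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def tumbling_p95(events: Iterable[Event], window_s: float) -> Iterator[Tuple[float, float]]:
    """Yield (window_start, p95 load time) as each window closes.

    Assumes events arrive in timestamp order; a real stream would need
    watermarking to handle late or out-of-order events.
    """
    current_start: Optional[float] = None
    bucket: List[float] = []
    for ev in events:
        start = math.floor(ev.timestamp_s / window_s) * window_s
        if current_start is None:
            current_start = start
        if start != current_start:
            # The previous window is complete: emit its summary immediately.
            yield current_start, percentile(bucket, 95)
            current_start, bucket = start, []
        bucket.append(ev.load_time_ms)
    if bucket:
        # Flush the final, possibly partial, window at end of input.
        yield current_start, percentile(bucket, 95)
```

With a 60-second window, a regression becomes visible about a minute after it starts, which is the property the decision above is after.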

Build custom anomaly detection

Reasoning:

Generic anomaly detectors don't work well with Netflix's traffic patterns; custom models reduce false positives
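
The source does not describe the models themselves, so the following is only a hypothetical illustration of the general point: when a metric follows a strong daily cycle, a detector that compares each value against a baseline for the same hour of day behaves differently from one using a single global baseline. The per-hour z-score technique, the function names, and the threshold are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Sample = Tuple[int, float]            # (hour_of_day 0-23, metric value)
Baseline = Dict[int, Tuple[float, float]]  # hour -> (mean, std)


def fit_hourly_baseline(history: List[Sample]) -> Baseline:
    """Learn a (mean, std) baseline for each hour of day from historical samples."""
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vs), pstdev(vs)) for hour, vs in by_hour.items()}


def is_anomalous(baseline: Baseline, hour: int, value: float, threshold: float = 3.0) -> bool:
    """Flag a value whose z-score against its own hour's baseline exceeds the threshold."""
    if hour not in baseline:
        return False  # no history for this hour; a real system would need a fallback
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold
```

A reading that is routine during the evening peak can be far outside the norm at 3 a.m.; a single global threshold cannot tell the two cases apart, which is one way generic detectors end up with false positives or missed incidents.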

Tech Stack

  • Java
  • Kafka
  • Spark
  • Elasticsearch
  • AWS

Result & Impact

  • Daily Events: 180B+
  • User Coverage: 200M+ subscribers

Enabled correlation of performance metrics with user retention and identification of platform-specific issues

Learnings

  • Real-time processing at scale requires careful capacity planning
  • Custom anomaly models tuned to Netflix's traffic patterns produce fewer false positives than generic detectors
  • Correlation with user behavior drives prioritization

Architecture

Icarus processes events through:

  1. Client-side beacon collection
  2. Kafka-based event ingestion
  3. Spark streaming for real-time processing
  4. Elasticsearch for storage and querying
  5. Custom anomaly detection models
  6. Web UI for analysis and alerting
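
The six stages above can be wired together in a small in-process sketch to show how data flows from one to the next. Everything here is a stand-in: `InMemoryLog` plays the role of a Kafka topic, `aggregate` the streaming job, `InMemoryIndex` the Elasticsearch index, and `detect` is a fixed-limit placeholder rather than a real model. All names, fields, and device labels are hypothetical.

```python
import json
from collections import deque
from statistics import mean
from typing import Deque, Dict, List


def collect_beacon(device: str, load_time_ms: float) -> str:
    """Stage 1: a client serializes one measurement as a JSON beacon."""
    return json.dumps({"device": device, "load_time_ms": load_time_ms})


class InMemoryLog:
    """Stage 2: stand-in for a Kafka topic (append-only, read in order)."""

    def __init__(self) -> None:
        self._records: Deque[str] = deque()

    def produce(self, record: str) -> None:
        self._records.append(record)

    def drain(self) -> List[str]:
        records = list(self._records)
        self._records.clear()
        return records


def aggregate(records: List[str]) -> Dict[str, float]:
    """Stage 3: stand-in for the streaming job; mean load time per device."""
    by_device: Dict[str, List[float]] = {}
    for raw in records:
        event = json.loads(raw)
        by_device.setdefault(event["device"], []).append(event["load_time_ms"])
    return {device: mean(values) for device, values in by_device.items()}


class InMemoryIndex:
    """Stage 4: stand-in for the search index that stores aggregates."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, float]] = []

    def index(self, doc: Dict[str, float]) -> None:
        self.docs.append(doc)


def detect(summary: Dict[str, float], limit_ms: float) -> List[str]:
    """Stage 5: placeholder detector using a fixed limit, not a real model."""
    return [device for device, value in summary.items() if value > limit_ms]


def run_once(log: InMemoryLog, index: InMemoryIndex, limit_ms: float) -> List[str]:
    """Run stages 3-5 over the log's contents; stage 6 would display the alerts."""
    summary = aggregate(log.drain())
    index.index(summary)
    return [f"ALERT {d}: mean load time above {limit_ms} ms" for d in detect(summary, limit_ms)]
```

Keeping the log between collection and processing means clients never wait on the analytics path, and the processing stage can be scaled or replayed independently of ingestion.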