Icarus
Built real user performance monitoring processing 180B+ events daily with anomaly detection and alerting
Overview
Real User Monitoring (RUM) solution capturing and analyzing performance metrics from Netflix's 200M+ subscribers, with GUI for analysis, alerting, and anomaly detection
Problem
Understanding real user performance at Netflix scale required processing billions of events daily and identifying anomalies that impact user experience
Constraints
- Must handle 180B+ events per day
- Must detect anomalies in near real-time
- Must correlate performance with user behavior
Approach
Built a distributed event processing pipeline with real-time anomaly detection, integrated with Netflix's observability and alerting infrastructure
Key Decisions
Process events in real-time rather than batch
Real-time processing enables faster detection and response to performance issues affecting users
Build custom anomaly detection
Generic anomaly detectors don't work well with Netflix's traffic patterns; custom models reduce false positives
Tech Stack
- Java
- Kafka
- Spark
- Elasticsearch
- AWS
Result & Impact
- 180B+ eventsDaily Events
- 200M+ subscribersUser Coverage
Enabled correlation of performance metrics with user retention and identification of platform-specific issues
Learnings
- Real-time processing at scale requires careful capacity planning
- Custom anomaly models outperform generic solutions
- Correlation with user behavior drives prioritization
Architecture
Icarus processes events through:
- Client-side beacon collection
- Kafka-based event ingestion
- Spark streaming for real-time processing
- Elasticsearch for storage and querying
- Custom anomaly detection models
- Web UI for analysis and alerting