FlameCloud
Built a continuous profiling solution collecting thousands of profiles and millions of stacks daily at Netflix
Overview
Cloud-based continuous profiling platform that collects, stores, and analyzes CPU, memory, and heap dump profiles from production systems
Problem
Production performance issues are difficult to reproduce in dev environments; engineers need visibility into production behavior at scale
Constraints
- Must handle thousands of concurrent profile uploads
- Must store and index millions of stack traces
- Must integrate with Netflix cloud infrastructure
Approach
Built a cloud-native platform with distributed profile collection, centralized storage with indexing, and integration with existing Netflix tools
Key Decisions
Build in-house rather than use commercial solutions
Commercial continuous profilers were expensive at Netflix scale and lacked integration with existing tooling
Use time-series storage for profiles
Time-series indexing enables efficient querying of profiles by time range for trend analysis
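To illustrate why a time-series layout makes range queries cheap, here is a minimal in-memory sketch. It is not FlameCloud's actual schema: the `ProfileIndex` class, the `(app, profile_type, day)` bucket key, and the profile IDs are all hypothetical, chosen to resemble how a time-bucketed partition key might be laid out in a store such as Cassandra.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class ProfileIndex:
    """Hypothetical in-memory stand-in for a time-bucketed profile index."""

    def __init__(self):
        # (app, profile_type, day) -> list of (timestamp, profile_id), kept sorted
        self._buckets = defaultdict(list)

    def add(self, app, profile_type, ts, profile_id):
        # One bucket per day keeps each partition bounded in size.
        insort(self._buckets[(app, profile_type, ts.date())], (ts, profile_id))

    def query(self, app, profile_type, start, end):
        """Return profile IDs with start <= timestamp < end, in time order."""
        results = []
        day = start.date()
        # Only the day buckets overlapping the range are touched.
        while day <= end.date():
            bucket = self._buckets.get((app, profile_type, day), [])
            lo = bisect_left(bucket, (start,))
            hi = bisect_left(bucket, (end,))
            results.extend(pid for _, pid in bucket[lo:hi])
            day += timedelta(days=1)
        return results


if __name__ == "__main__":
    idx = ProfileIndex()
    t0 = datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)
    idx.add("api", "cpu", t0, "p1")
    idx.add("api", "cpu", t0 + timedelta(hours=1), "p2")
    print(idx.query("api", "cpu", t0, t0 + timedelta(hours=2)))
```

Because a query only visits the buckets that overlap the requested window, its cost scales with the size of the window rather than with the total number of stored profiles, which is the property trend analysis depends on.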
Tech Stack
- Java
- Python
- AWS
- Cassandra
Result & Impact
- Daily profiles: thousands
- Daily stacks: millions
Enabled proactive performance optimization and rapid debugging of production issues at Netflix scale
Learnings
- Continuous profiling reveals issues before they become incidents
- Time-series indexing is essential for trend analysis
- Integration with existing tools drives adoption
Architecture
FlameCloud consists of:
- Profile agents deployed on all production instances
- Centralized storage with time-series indexing
- Integration with Netflix observability stack
- Web UI for profile analysis
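As a rough illustration of what the analysis side of such a pipeline might do with collected stacks, the sketch below folds raw stack samples into counts and builds the nested tree a flame-graph view typically renders. This is an assumption about the general technique, not a description of FlameCloud's internals; `fold_stacks`, `build_tree`, and the sample frames are hypothetical.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples (root-to-leaf frame lists) into 'a;b;c' -> count."""
    folded = Counter()
    for frames in samples:
        folded[";".join(frames)] += 1
    return folded


def build_tree(folded):
    """Build a nested tree where each node's value is the samples passing through it."""
    root = {"name": "root", "value": 0, "children": {}}
    for stack, count in folded.items():
        root["value"] += count
        node = root
        for frame in stack.split(";"):
            child = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}}
            )
            child["value"] += count
            node = child
    return root


if __name__ == "__main__":
    # Hypothetical samples: each list is one captured stack, outermost frame first.
    samples = [
        ["main", "handle", "parse"],
        ["main", "handle", "parse"],
        ["main", "handle", "render"],
        ["main", "gc"],
    ]
    tree = build_tree(fold_stacks(samples))
    print(tree["children"]["main"]["children"]["handle"]["value"])
```

Folding identical stacks before building the tree keeps the stored data proportional to the number of distinct stacks rather than the number of samples, which matters when millions of stacks arrive each day.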