FlameCloud

Performance Engineering · 2017 · 1 min read

Built a continuous profiling solution collecting thousands of profiles and millions of stacks daily at Netflix

Overview

Cloud-based continuous profiling platform that collects, stores, and analyzes CPU, memory, and heap dump profiles from production systems

Problem

Production performance issues are difficult to reproduce in dev environments; engineers need visibility into production behavior at scale

Constraints

  • Must handle thousands of concurrent profile uploads
  • Must store and index millions of stack traces
  • Must integrate with Netflix cloud infrastructure

Approach

Built a cloud-native platform with distributed profile collection, centralized storage with indexing, and integration with existing Netflix tools

Key Decisions

Build in-house rather than use commercial solutions

Reasoning:

Commercial continuous profilers were expensive at Netflix scale and lacked integration with existing tooling

Use time-series storage for profiles

Reasoning:

Time-series indexing enables efficient querying of profiles by time range for trend analysis
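As a rough illustration of why time-based indexing makes range queries cheap, the sketch below groups profiles into fixed-width time buckets per application and scans only the buckets that overlap the requested range. It is an in-memory stand-in, not FlameCloud's actual schema: the class name, the one-hour bucket width, and the idea that an (application, bucket) pair could map to a Cassandra partition with rows ordered by timestamp are all assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

BUCKET_SECONDS = 3600  # hypothetical bucket width: one hour


class ProfileIndex:
    """In-memory stand-in for a time-series index keyed by (app, time bucket)."""

    def __init__(self):
        # (app, bucket) -> list of (timestamp, profile_id), kept sorted
        self._buckets = defaultdict(list)

    def add(self, app, timestamp, profile_id):
        bucket = timestamp // BUCKET_SECONDS
        entries = self._buckets[(app, bucket)]
        entry = (timestamp, profile_id)
        # Insert in sorted position so range scans can use binary search.
        entries.insert(bisect_left(entries, entry), entry)

    def query(self, app, start, end):
        """Return profile ids with start <= timestamp < end, oldest first."""
        result = []
        if end <= start:
            return result
        first = start // BUCKET_SECONDS
        last = (end - 1) // BUCKET_SECONDS
        # Only buckets overlapping [start, end) are touched.
        for bucket in range(first, last + 1):
            entries = self._buckets.get((app, bucket), [])
            lo = bisect_left(entries, (start,))
            hi = bisect_left(entries, (end,))
            result.extend(pid for _, pid in entries[lo:hi])
        return result
```

A query for one day of profiles touches at most 25 hourly buckets for that application, regardless of how many profiles exist outside that window.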

Tech Stack

  • Java
  • Python
  • AWS
  • Cassandra

Result & Impact

  • Daily profiles: thousands
  • Daily stacks: millions

Enabled proactive performance optimization and rapid debugging of production issues at Netflix scale

Learnings

  • Continuous profiling reveals issues before they become incidents
  • Time-series indexing is essential for trend analysis
  • Integration with existing tools drives adoption
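One way to picture the trend analysis mentioned above is to compare how much of the sampled time a function accounts for in two time windows. The sketch below assumes each window has already been aggregated into a mapping from a semicolon-joined stack ("root;...;leaf") to a sample count. The function names and the 5-point threshold are hypothetical, not taken from FlameCloud.

```python
def self_share(folded, function):
    """Fraction of all samples whose leaf (innermost) frame is `function`."""
    total = sum(folded.values())
    if total == 0:
        return 0.0
    hits = sum(
        count
        for stack, count in folded.items()
        if stack.split(";")[-1] == function
    )
    return hits / total


def regressions(baseline, current, threshold=0.05):
    """Leaf functions whose sample share grew by more than `threshold`.

    Both arguments map 'root;...;leaf' stacks to sample counts.
    Returns {function: absolute increase in share}.
    """
    leaves = {stack.split(";")[-1] for stack in current}
    grown = {}
    for function in leaves:
        delta = self_share(current, function) - self_share(baseline, function)
        if delta > threshold:
            grown[function] = delta
    return grown
```

Running a comparison like this on a schedule, for example this week against last week, is one way a gradual regression could surface before it turns into an incident.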

Architecture

FlameCloud consists of:

  • Profile agents deployed on all production instances
  • Centralized storage with time-series indexing
  • Integration with Netflix observability stack
  • Web UI for profile analysis
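To make the agent-to-storage path more concrete, the sketch below shows one plausible way an agent could collapse raw stack samples into counts before upload, using the semicolon-separated "folded" text format commonly fed to flame graph tools. Whether FlameCloud agents do this is an assumption. The function names are hypothetical.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse raw stack samples into a 'root;...;leaf' -> count mapping.

    `samples` is an iterable of frame lists ordered from root to leaf.
    """
    folded = Counter()
    for frames in samples:
        folded[";".join(frames)] += 1
    return folded


def to_folded_text(folded):
    """Render one 'stack count' line per unique stack, sorted by stack."""
    return "\n".join(
        f"{stack} {count}" for stack, count in sorted(folded.items())
    )


if __name__ == "__main__":
    samples = [
        ["main", "handle", "parse"],
        ["main", "handle", "parse"],
        ["main", "gc"],
    ]
    print(to_folded_text(fold_stacks(samples)))
```

Aggregating identical stacks before upload means the stored volume grows with the number of unique stacks rather than the number of raw samples, which matters when millions of stacks arrive each day.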