Vector

Performance Engineering · 2015 · 1 min read

Open-sourced an on-host performance monitoring framework that engineers use to diagnose production issues in real time

Overview

An on-host performance monitoring framework that exposes high-resolution system metrics through a web interface, so engineers can diagnose issues on any instance in real time

Problem

Engineers needed to diagnose production issues on specific instances, but SSH access was restricted and existing tools didn't provide sufficient granularity

Constraints

  • Must work on every Netflix instance without special access
  • Must provide high-resolution metrics (1-second intervals)
  • Must be accessible through a web browser

Approach

Built a lightweight agent, deployed on all instances, that exposes metrics through a web UI and is integrated with Netflix's service discovery
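To make the agent idea concrete, here is a minimal sketch of an on-host process that serves a metrics snapshot over HTTP for a browser-based UI to poll. It is an illustration only, not Vector's actual implementation: the `/metrics` path, the `collect_metrics` and `serve` names, the port, and the choice of load average as the metric are all assumptions, and it uses nothing beyond the Python standard library.

```python
import json
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics():
    """Return one snapshot of host metrics as a plain dict (hypothetical schema)."""
    try:
        load1, load5, load15 = os.getloadavg()  # available on Unix-like hosts
    except (AttributeError, OSError):
        load1 = load5 = load15 = None  # platform does not report load average
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_avg": {"1m": load1, "5m": load5, "15m": load15},
    }


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the latest snapshot as JSON on the assumed /metrics path."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = json.dumps(collect_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Block forever, answering metric requests (host and port are assumptions)."""
    HTTPServer((host, port), MetricsHandler).serve_forever()

# To try it locally: call serve() and open http://127.0.0.1:8080/metrics
```

A web UI polling this endpoint once per second would get the 1-second resolution described under Key Decisions. Because each request collects a fresh snapshot, the agent keeps no history and stays lightweight.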

Key Decisions

Open source the project

Reasoning:

The broader community could benefit, and open sourcing would drive contributions and adoption

Alternatives considered:
  • Keep internal as Netflix-only tool

Use high-resolution metrics (1-second intervals)

Reasoning:

Production issues often show up as bursts lasting only a few seconds, which minute-level aggregation averages away
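A small worked example shows why the sampling interval matters. The numbers are invented for illustration: a CPU that idles at 10% and saturates for three seconds in the middle of a minute.

```python
from statistics import mean

# Hypothetical CPU utilisation (%) sampled once per second for one minute:
# a steady 10% baseline with a 3-second burst at 100%.
samples = [10.0] * 60
samples[30:33] = [100.0, 100.0, 100.0]

minute_average = mean(samples)  # what a minute-level dashboard would show
peak = max(samples)             # what a per-second view would show
seconds_saturated = sum(1 for s in samples if s >= 95.0)

print(f"1-minute average: {minute_average:.1f}%")    # 1-minute average: 14.5%
print(f"per-second peak: {peak:.1f}%")               # per-second peak: 100.0%
print(f"seconds at >= 95%: {seconds_saturated}")     # seconds at >= 95%: 3
```

The minute average of 14.5% looks healthy, while the per-second series shows three seconds of full saturation, which is the kind of burst that can cause latency spikes.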

Tech Stack

  • JavaScript
  • D3.js
  • Python
  • AWS

Result & Impact

  • GitHub Stars: 3,573
  • Deployment: All Netflix production instances

Featured on the Netflix Tech Blog and adopted by multiple companies for production debugging

Learnings

  • High-resolution metrics reveal issues invisible to aggregated data
  • Web-based access removes barriers to debugging
  • Integration with service discovery is essential at scale

Features

Vector provides:

  • Real-time CPU, memory, disk, and network metrics
  • Per-process breakdown
  • JVM metrics integration
  • Customizable dashboards
  • Historical data viewing