Vector
Open-sourced an on-host performance monitoring framework that engineers use to diagnose production issues in real time
Overview
An on-host performance monitoring framework that exposes high-resolution system metrics through a web interface, enabling engineers to diagnose issues on any instance in real time
Problem
Engineers needed to diagnose production issues on specific instances, but SSH access was restricted and existing tools reported metrics at too coarse a granularity
Constraints
- Must work on every Netflix instance without special access
- Must provide high-resolution metrics (one-second granularity or finer)
- Must be accessible through a web browser
Approach
Built a lightweight agent deployed on all instances that exposes metrics through a web UI, integrated with Netflix's service discovery
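As a rough illustration of the agent idea, the sketch below serves a snapshot of host metrics as JSON over HTTP using only the Python standard library. It is not Vector's actual code: the `/metrics` path, the `sample_metrics` and `make_server` names, and the choice of metrics (load averages and CPU count) are all assumptions made for the example, and `os.getloadavg()` is only available on Unix-like systems.

```python
import json
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def sample_metrics() -> dict:
    """Take one snapshot of host metrics (hypothetical, minimal metric set)."""
    load_1m, load_5m, load_15m = os.getloadavg()  # Unix-only
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_1m": load_1m,
        "load_5m": load_5m,
        "load_15m": load_15m,
    }


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with a fresh JSON snapshot; 404 for anything else."""

    def do_GET(self):
        if self.path != "/metrics":  # hypothetical endpoint name
            self.send_error(404)
            return
        body = json.dumps(sample_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging so the agent stays quiet.
        pass


def make_server(port: int = 0) -> HTTPServer:
    """Bind the agent to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Calling `make_server(8080).serve_forever()` (the port is arbitrary) would let a browser-based UI poll `http://127.0.0.1:8080/metrics` once per second. A real deployment would also need authentication and a way to find instances, which this sketch leaves out.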
Key Decisions
Open source the project
The broader community could benefit, and open sourcing would drive contributions and adoption
- Alternative considered: keep it internal as a Netflix-only tool
Use high-resolution metrics (1-second intervals)
Production issues often show up as short-lived patterns that minute-level aggregation averages away
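A small worked example makes the aggregation point concrete. The numbers are invented for illustration: a host sits at 20% CPU for a minute, except for a three-second stall at 100%. The one-minute mean barely moves, while the one-second samples show the stall clearly.

```python
from statistics import mean

# Hypothetical data: 60 one-second CPU utilization samples (percent).
# The host idles at 20% except for a 3-second stall at 100%.
samples = [20.0] * 60
samples[30:33] = [100.0, 100.0, 100.0]

# What a minute-level dashboard would report for this window.
minute_average = mean(samples)  # (57 * 20 + 3 * 100) / 60 = 24.0

# What one-second resolution reveals.
peak = max(samples)  # 100.0
spike_seconds = [i for i, value in enumerate(samples) if value > 90.0]

print(f"1-minute average: {minute_average:.1f}%")  # 24.0%
print(f"peak 1-second sample: {peak:.1f}%")        # 100.0%
print(f"seconds above 90%: {spike_seconds}")       # [30, 31, 32]
```

A 24% average looks healthy, yet for three seconds the host was saturated. That is the kind of event that per-second sampling surfaces and per-minute aggregation hides.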
Tech Stack
- JavaScript
- D3.js
- Python
- AWS
Result & Impact
- GitHub stars: 3573
- Deployment: all Netflix production instances
Featured on the Netflix Tech Blog and adopted by multiple companies for production debugging
Learnings
- High-resolution metrics reveal issues invisible to aggregated data
- Web-based access removes barriers to debugging
- Integration with service discovery is essential at scale
Features
Vector provides:
- Real-time CPU, memory, disk, and network metrics
- Per-process breakdown
- JVM metrics integration
- Customizable dashboards
- Historical data viewing