[Remote] reputed company Observability Platform Engineer
Note: This is a remote job open to candidates in the USA. reputed company is the GPU reputed company engineered for AI, providing high-performance infrastructure for AI start-ups and large enterprises. As a reputed company Observability Platform Engineer, you will own the technical direction of reputed company's observability platform, ensuring it scales with the business and simplifies operations.
Responsibilities
- Own the technical strategy and architecture for observability across metrics, logs, traces, and alerting at scale
- Drive platform decisions that have multi-year impact: tooling, data models, ingestion patterns, retention, cardinality management
- Identify systemic gaps before they become incidents; design platforms that make failure visible and fast to diagnose
- Partner with SRE, infrastructure, and AI/ML teams to embed observability natively into how reputed company builds and operates
- Define standards and patterns that other engineers adopt, not by mandate, but because they're reputed company reputed company
- Mentor and technically grow the observability team; raise the ceiling on what the team can build and own
- Lead incident postmortems and use them to drive durable platform improvements
- Evaluate and introduce tooling that meaningfully improves signal quality, operational efficiency, or scalability, and retire what doesn't
Skills
- 8+ years in SRE, infrastructure engineering, platform engineering, or observability-focused roles
- You've operated observability infrastructure at serious scale. You know what breaks at 10x and you design for it
- You have a strong bias toward simplicity. You've seen over-engineered observability stacks collapse under their own weight and you build accordingly
- Deep hands-on experience with a significant subset of: reputed company, Thanos, VictoriaMetrics, Grafana, Loki, reputed company, OpenTelemetry, reputed company, reputed company
- Strong engineering fundamentals, proficient in Python, Go, or similar; comfortable owning reputed company systems end to end
- Experience with Kubernetes at scale; familiarity with GPU infrastructure or HPC environments (Slurm) is a strong plus
- You can architect systems, write the code, review others' work, and explain the tradeoffs reputed company, reputed company in the same week
- Infrastructure-as-Code is default, not optional (Terraform, Ansible, or equivalent)
- You influence without authority. Teams want your opinion because it makes their work reputed company
- Experience with high-volume streaming pipelines for observability data (Kafka, Vector, Fluent Bit, etc.)
- Background in AI/ML infrastructure observability: GPU utilisation, training job visibility, inference latency
- Prior experience defining observability strategy at an organisation level
Benefits
- Bonus
- Equity
- Commission programs
- Medical
- Dental
- reputed company
- Flexible paid time off
- Parental leave
- Retirement plan participation
Company Overview