[Remote] Manager, Site Reliability Engineering
Note: This is a remote position open to candidates in the USA. Paradigm is a software company transforming the residential, construction & building product industries. They are seeking a Manager of Site Reliability Engineering to lead a high-performing team, promote modern SRE practices, and enhance reliability across their Azure-based platform.
Responsibilities
- Lead and grow a team of site reliability engineers. Provide guidance, mentorship, and career development
- Contribute to and mature SRE practices across production services: SLOs, SLIs, error budgets, toil reduction, and blameless post-mortems that turn incidents into lasting improvements
- Oversee the incident management lifecycle end-to-end, including detection, response, resolution, post-incident review, and systemic improvement
- Design on-call rotations, runbooks, and escalation procedures that balance service reliability with engineer well-being and sustainable work practices
- Drive measurable reductions in MTTR and MTTD through improved observability, intelligent automation, and predictive monitoring
- Build automation to eliminate manual operational work including provisioning, deployment, scaling, self-healing, and reporting
- Implement chaos engineering practices to validate system resilience and surface weaknesses before they cause outages
- Partner with engineering and product teams to embed reliability requirements into the development lifecycle, from design through deployment
- Collaborate with the observability team to ensure comprehensive instrumentation, smart alerting, and actionable dashboards across all critical services
- Measure, report on, and advocate for reliability improvements with both technical and executive stakeholders, using data to drive investment decisions
Skills
- Bachelor's degree in Engineering or a related field, or equivalent experience
- 7+ years in site reliability engineering, DevOps, or infrastructure engineering, with at least 1 year in people management (or demonstrated tech lead experience with direct influence over team processes and career growth)
- Hands-on experience running production systems on Azure (including proficiency with key services such as AKS, App Services, Service Bus, Event Grid, and Azure Monitor) or comparable cloud platforms
- Proven track record implementing SRE practices with measurable reliability improvements and familiarity with modern observability platforms (Datadog, Prometheus/Grafana, or equivalent)
- Experience leading incident response for high-severity production issues and running effective post-mortems
- Strong background in automation, infrastructure as code (Terraform, Bicep, or similar), and systematically eliminating manual operational work
- Production-grade operational experience with Kubernetes container orchestration
- Ability to automate workflows and build scripts using Python, Bash, PowerShell, or Go
- Strong communication skills, with the ability to make complex technical issues clear to both engineers and executives
- Data-driven approach. You use metrics and telemetry to guide decisions, not gut feel
- You are collaborative cross-functionally and build trust and alignment naturally
- AI-enhanced observability experience is preferred
- Experience with AI coding assistants and CI/CD systems (GitHub Actions, Azure DevOps, ArgoCD) with automation capabilities is preferred
- Knowledge of distributed systems patterns is preferred
- Exposure to AIOps platforms or using LLMs for operational automation is preferred