[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. i4DM provides Federal agencies with access to skilled professionals for complex mission challenges. The company is seeking a Senior Site Reliability Engineer to strengthen site reliability engineering, cloud operations, and resilient service delivery for VA enterprise healthcare platforms.
Responsibilities
- Partner with the Technical Director to implement and mature Site Reliability Engineering (SRE) practices across platform services and hosted applications
- Improve the full service lifecycle from design and deployment through operation and continuous refinement, with a focus on availability, latency, performance, efficiency, and capacity
- Define, track, and report service level indicators (SLIs), service level objectives (SLOs), and error budgets to guide engineering decisions and service improvements
- Build, enhance, and maintain CI/CD pipelines that enable secure, automated, and repeatable application and infrastructure delivery
- Develop and support Infrastructure as Code (IaC) and configuration automation using tools such as Terraform and Ansible to improve consistency, speed, and auditability
- Integrate automated testing, validation, and security checks into delivery workflows to improve release quality and reduce change-related risk
- Design and improve monitoring, logging, tracing, alerting, and dashboards to strengthen observability and accelerate issue detection and response
- Analyze system behavior and performance trends to improve reliability, scalability, and operational efficiency across distributed and cloud-native environments
- Reduce operational toil by automating repetitive tasks, improving runbooks, and engineering sustainable solutions for recurring operational issues
- Support cloud infrastructure and platform services in AWS and containerized environments such as Kubernetes, ensuring systems are resilient, scalable, and secure
- Contribute to platform modernization efforts by improving deployment patterns, environment consistency, and operational readiness for cloud-native services
- Assist with capacity planning, reliability reviews, and architectural improvements to support growth, resilience, and mission continuity
- Implement reliability engineering practices that align with Federal security requirements, including secure configuration, least privilege, vulnerability remediation, and policy-based controls
- Partner with cybersecurity and engineering teams to support secure-by-design infrastructure and application delivery practices
- Help ensure operational processes and automation align with compliance expectations for Federal and VA environments
- Collaborate with development, platform, operations, monitoring, incident management, and architecture teams to improve service reliability and deployment outcomes
- Work closely with the Technical Director and team leads to translate technical direction into actionable engineering improvements and operational standards
- Support Agile and SAFe delivery practices by helping teams adopt reliable release processes, operational readiness checks, and continuous improvement measures
- Participate in incident response, service restoration, root cause analysis, and post-incident reviews for critical systems and services
- Identify recurring issues, reliability gaps, and failure patterns, and drive corrective actions through automation, architectural improvements, and process refinement
- Contribute to on-call readiness, operational documentation, and blameless continuous improvement practices that strengthen resilience and reduce mean time to recovery
Skills
- Bachelor's degree in Computer Science, Engineering, Information Technology, or a related technical field, or equivalent practical experience
- 5+ years of experience in Site Reliability Engineering, DevOps, platform engineering, cloud operations, or related roles supporting enterprise or mission-critical environments
- Hands-on experience supporting cloud platforms (AWS preferred), Linux-based environments, and distributed systems at scale
- Strong experience with Infrastructure as Code and automation tools such as Terraform, Ansible, or comparable technologies
- Experience with containers and orchestration platforms such as Kubernetes, EKS, ECS, or Docker in production environments
- Experience building or maintaining CI/CD pipelines and deployment automation in support of secure, reliable software delivery
- Strong understanding of monitoring, observability, incident response, root cause analysis, and performance optimization principles
- Proficiency with one or more scripting or programming languages such as Python, Go, Bash, or PowerShell
- Demonstrated ability to troubleshoot complex systems, automate operational tasks, and collaborate effectively across engineering and operations teams
- Candidates must be eligible to obtain and maintain a Public Trust clearance
- Experience supporting VA, Federal Government, or other regulated environments with strong security and compliance requirements
- Experience defining and operationalizing SLIs, SLOs, error budgets, and service health metrics for production systems
- Familiarity with observability platforms and tools such as Prometheus, Grafana, CloudWatch, ELK, Splunk, or OpenTelemetry
- Experience with FedRAMP, NIST, Zero Trust, or other Federal security frameworks relevant to cloud and platform operations
- Experience supporting healthcare platforms, high-availability enterprise services, or large-scale modernization initiatives
- Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), HashiCorp Terraform Associate, or SRE/DevOps certifications
Company Overview