[Remote] Senior Site Reliability Engineer

Work from home, full-time role

Note: This is a remote position open to candidates in the USA. i4DM provides Federal agencies with access to skilled professionals for complex mission challenges. The company is seeking a Senior Site Reliability Engineer to strengthen site reliability engineering, cloud operations, and resilient service delivery for VA enterprise healthcare platforms.

Responsibilities

Partner with the Technical Director to implement and mature Site Reliability Engineering (SRE) practices across platform services and hosted applications
Improve the full service lifecycle from design and deployment through operation and continuous refinement, with a focus on availability, latency, performance, efficiency, and capacity
Define, track, and report service level indicators (SLIs), service level objectives (SLOs), and error budgets to guide engineering decisions and service improvements
Build, enhance, and maintain CI/CD pipelines that enable secure, automated, and repeatable application and infrastructure delivery
Develop and support Infrastructure as Code (IaC) and configuration automation using tools such as Terraform and Ansible to improve consistency, speed, and auditability
Integrate automated testing, validation, and security checks into delivery workflows to improve release quality and reduce change-related risk
Design and improve monitoring, logging, tracing, alerting, and dashboards to strengthen observability and accelerate issue detection and response
Analyze system behavior and performance trends to improve reliability, scalability, and operational efficiency across distributed and cloud-native environments
Reduce operational toil by automating repetitive tasks, improving runbooks, and engineering sustainable solutions for recurring operational issues
Support cloud infrastructure and platform services in AWS and containerized environments such as Kubernetes, ensuring systems are resilient, scalable, and secure
Contribute to platform modernization efforts by improving deployment patterns, environment consistency, and operational readiness for cloud-native services
Assist with capacity planning, reliability reviews, and architectural improvements to support growth, resilience, and mission continuity
Implement reliability engineering practices that align with Federal security requirements, including secure configuration, least privilege, vulnerability remediation, and policy-based controls
Partner with cybersecurity and engineering teams to support secure-by-design infrastructure and application delivery practices
Help ensure operational processes and automation align with compliance expectations for Federal and VA environments
Collaborate with development, platform, operations, monitoring, incident management, and architecture teams to improve service reliability and deployment outcomes
Work closely with the Technical Director and team leads to translate technical direction into actionable engineering improvements and operational standards
Support Agile and SAFe delivery practices by helping teams adopt reliable release processes, operational readiness checks, and continuous improvement measures
Participate in incident response, service restoration, root cause analysis, and post-incident reviews for critical systems and services
Identify recurring issues, reliability gaps, and failure patterns, and drive corrective actions through automation, architectural improvements, and process refinement
Contribute to on-call readiness, operational documentation, and blameless continuous improvement practices that strengthen resilience and reduce mean time to recovery

Skills

Bachelor's degree in Computer Science, Engineering, Information Technology, or a related technical field, or equivalent practical experience
5+ years of experience in Site Reliability Engineering, DevOps, platform engineering, cloud operations, or related roles supporting enterprise or mission-critical environments
Hands-on experience supporting cloud platforms (AWS preferred), Linux-based environments, and distributed systems at scale
Strong experience with Infrastructure as Code and automation tools such as Terraform, Ansible, or comparable technologies
Experience with containers and orchestration platforms such as Kubernetes, EKS, ECS, or Docker in production environments
Experience building or maintaining CI/CD pipelines and deployment automation in support of secure, reliable software delivery
Strong understanding of monitoring, observability, incident response, root cause analysis, and performance optimization principles
Proficiency with one or more scripting or programming languages such as Python, Go, Bash, or PowerShell
Demonstrated ability to troubleshoot complex systems, automate operational tasks, and collaborate effectively across engineering and operations teams
Candidates must be eligible to obtain and maintain a Public Trust clearance
Experience supporting VA, Federal Government, or other regulated environments with strong security and compliance requirements
Experience defining and operationalizing SLIs, SLOs, error budgets, and service health metrics for production systems
Familiarity with observability platforms and tools such as Prometheus, Grafana, CloudWatch, ELK, Splunk, or OpenTelemetry
Experience with FedRAMP, NIST, Zero Trust, or other Federal security frameworks relevant to cloud and platform operations
Experience supporting healthcare platforms, high-availability enterprise services, or large-scale modernization initiatives
Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), HashiCorp Terraform Associate, or SRE/DevOps certifications

Company Overview

i4DM provides a full range of information technology consulting services to government and commercial clients. Founded in 2002, the company is headquartered in Millersville, Maryland, USA, and has a workforce of 51-200 employees. Its website is https://www.i4dm.com.
