[Remote] Forward Deployed Engineer: AI + HPC
Note: This is a remote position open to candidates in the USA. Cedana is a company focused on maximizing AI and HPC cluster utilization and reliability. As a Forward Deployed Engineer, you will lead technical engagements with customers, deploying Cedana's solutions in various environments and optimizing platform performance.
Responsibilities
- Engineer solutions at client sites: Lead customer integrations. Install, configure, and deploy Cedana into SLURM, Kubernetes, and Dynamo environments
- Drive product innovation from the field: Identify technical gaps while embedded with clients, then turn them into product feedback that shapes new capabilities and core product features
- Measure and optimize platform performance: Measure reliability, throughput, and performance using our internal tools. Design and implement policy-based migration automations to improve all three
- Own critical deployments: Ensure our platform performs reliably for clients' critical operations, debugging issues across the full stack. Debug install issues against unfamiliar customer infrastructure, and escalate to engineering when necessary
- Improve scalability: Build and own the internal installation playbook so that the second customer in each segment is onboarded faster than the first
- Respect our customers: Understand how to make their lives easier and minimize their time and overhead
Skills
- Team management experience. Requires strong project and time management skills, delivering milestones on time, and effective
- 3-10 years of software engineering experience with a track record of configuring and managing SLURM deployments
- A multi-month enterprise or research deployment you led end-to-end, from scoping through signoff. You write effective status updates that keep your team informed and on schedule
- Production experience standing up SLURM in a customer or research environment. You've configured slurmctld, slurmdbd, accounting, cgroup integration, and GPU resource selection
- Strong Linux fundamentals: systemd, cgroups v2, namespaces, networking, filesystems, firewalls, kernel module loading, and PAM session modules. You can read strace and dmesg output and form a hypothesis
- Experience with Kubernetes operations including operators, CRDs, CNIs, device plugins, and node-level debugging. You've debugged a controller in production even if you haven't written one from scratch
- Experience in an HPC integrator field team
- Client-facing technical experience working directly with customers
- Background in national lab user services or university research computing
- You've developed SLURM plugins and understand their architecture and how they fit into the overall platform
- Familiarity with CRIU, container runtimes, GPU driver internals, and distributed training stacks
- Hands-on with NVIDIA Dynamo, Determined, Ray, Kueue, KServe, or comparable AI orchestration
- Contributed to open-source schedulers or job systems (SLURM, Flux, Torque, PBS)
- A passion for debugging a weird cgroup issue at 11pm just as much as writing a clean install playbook the next morning
Benefits
- 100% covered medical, dental, and vision insurance for employees and families
- Unlimited PTO policy
- 401(k) plan
Company Overview