[Remote] Principal Network Architect - AI Infrastructure
Note: This is a remote position open to candidates in the USA. Nscale is a GPU cloud company designed for AI, providing high-performance infrastructure for AI startups and enterprises. They are seeking a Principal Network Architect to lead the development and operational excellence of their global AI networking infrastructure, focusing on RDMA and InfiniBand technologies to enhance AI training outcomes.
Responsibilities
- Own the technical direction and operational lifecycle management of Nscale’s high-performance RDMA network fabrics
- Define long-term architecture, reliability strategy, and operational standards for AI interconnect networks
- Lead availability and performance improvement initiatives across globally distributed GPU clusters
- Act as a technical authority (SME) across networking, influencing platform-wide decisions
- Support the design, build, and evolution of large-scale InfiniBand and RoCE fabrics
- Drive deep debugging and resolution of complex cross-layer issues (hardware, firmware, kernel, distributed workloads)
- Lead incident response and postmortems, ensuring systemic fixes and long-term improvements
- Define and enforce standards across congestion control and traffic engineering; routing (BGP, ECMP, fabric-level routing strategies); firmware lifecycle and change management; and network observability and telemetry
- Develop and scale automation frameworks for network provisioning, validation, and operations
- Build tooling to support high-reliability, low-touch network operations at scale
- Improve operational efficiency across hundreds of thousands of endpoints and high-throughput links
- Lead complex technical initiatives across Network, SRE, Compute, and Platform teams
- Serve as technical lead on critical programs, coordinating engineers and stakeholders
- Influence product and infrastructure roadmaps based on operational insights and customer needs
- Mentor senior engineers and raise the bar for technical rigor and execution
Skills
- 10+ years of experience in network engineering in hyperscale, AI, or HPC environments
- Deep expertise in RDMA, InfiniBand, and/or large-scale RoCE fabrics
- Strong understanding of RDMA internals and performance tuning
- Strong understanding of congestion control and fabric failure modes
- Strong understanding of distributed system communication patterns
- Expert-level knowledge of data center networking protocols (BGP, OSPF, ECMP)
- Proven ability to debug multi-layer issues across network, system, and application layers
- Strong programming/scripting skills for automation (Python, Go, etc.)
- Experience designing high-scale, highly available network systems
- Demonstrated ability to lead complex technical programs without direct authority
- Experience acting as a senior escalation point for critical production issues
- Strong ability to drive cross-team alignment and execution
- Systems-level thinking balancing performance, reliability, scalability, and cost
- Experience with NVIDIA / Mellanox networking platforms
- Familiarity with distributed AI training frameworks and GPU communication patterns
- Experience building network observability systems at scale
- Background influencing infrastructure strategy in high-growth environments
Benefits
- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup: this is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.
Company Overview