[Remote] Principal Network Architect - AI Infrastructure

Work from home | Full-time role | Hiring

Note: This is a remote role open to candidates in the USA. Nscale is a GPU cloud company designed for AI, providing high-performance infrastructure for AI startups and enterprises. The company is seeking a Principal Network Architect to lead the development and operational excellence of its global AI networking infrastructure, focusing on RDMA and InfiniBand technologies to improve AI training outcomes.

Responsibilities

Own the technical direction and operational lifecycle management of Nscale’s high-performance RDMA network fabrics
Define long-term architecture, reliability strategy, and operational standards for AI interconnect networks
Lead availability and performance improvement initiatives across globally distributed GPU clusters
Act as a technical authority (SME) across networking, influencing platform-wide decisions
Support the design, build, and evolution of large-scale InfiniBand and RoCE fabrics
Drive deep debugging and resolution of complex cross-layer issues (hardware, firmware, kernel, distributed workloads)
Lead incident response and postmortems, ensuring systemic fixes and long-term improvements
Define and enforce standards across congestion control and traffic engineering, routing (BGP, ECMP, fabric-level routing strategies), firmware lifecycle and change management, and network observability and telemetry
Develop and scale automation frameworks for network provisioning, validation, and operations
Build tooling to support high-reliability, low-touch network operations at scale
Improve operational efficiency across hundreds of thousands of endpoints and high-throughput links
Lead complex technical initiatives across Network, SRE, Compute, and Platform teams
Serve as technical lead on critical programs, coordinating engineers and stakeholders
Influence product and infrastructure roadmaps based on operational insights and customer needs
Mentor senior engineers and raise the bar for technical rigor and execution

Skills

10+ years of experience in network engineering in hyperscale, AI, or HPC environments
Deep expertise in RDMA, InfiniBand, and/or large-scale RoCE fabrics
Strong understanding of RDMA internals and performance tuning
Strong understanding of congestion control and fabric failure modes
Strong understanding of distributed system communication patterns
Expert-level knowledge of data center networking protocols (BGP, OSPF, ECMP)
Proven ability to debug multi-layer issues across network, system, and application layers
Strong programming/scripting skills for automation (Python, Go, etc.)
Experience designing high-scale, highly available network systems
Demonstrated ability to lead complex technical programs without direct authority
Experience acting as a senior escalation point for critical production issues
Strong ability to drive cross-team alignment and execution
Systems-level thinking balancing performance, reliability, scalability, and cost
Experience with NVIDIA / Mellanox networking platforms
Familiarity with distributed AI training frameworks and GPU communication patterns
Experience building network observability systems at scale
Background influencing infrastructure strategy in high-growth environments

Benefits

Highly competitive package (base + equity) with reviews every 12 months.
Join the fastest-growing tech startup: this is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

Company Overview

Nscale builds AI data centers and provides GPU cloud infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, with a workforce of 201-500 employees. Its website is https://www.nscale.com.
