Scale PyTorch DDP Multi-Node on AWS EC2: Infra-First Guide

Multi-node DDP demands identical environments, shared data access, and security groups that allow traffic between EC2 instances; use the torchrun launcher with a DDPManager wrapper for minimal code changes and reliable gradient sync via NCCL.

Replicate Environments and Data for Multi-Node Reliability

Multi-node DDP treats processes across independent EC2 instances as interchangeable, so each node needs matching Python/PyTorch/CUDA versions, identical code from version control, and access to the same dataset. Use a shared EFS volume mounted on all instances (e.g., DATASET_DIR=/efs/andrea/dataset) to avoid copying data; local copies or remote streaming also work but add latency. A homogeneous cluster, such as two g6e.xlarge instances in the same availability zone, minimizes variance. Without this uniformity, expect cryptic errors or silent failures, since DDP assumes all ranks behave identically.
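
To sanity-check that uniformity before launching, each node can print a short environment report; a minimal sketch, assuming the EFS mount above (run it on every node and diff the output):

import os
import socket
import torch

# Per-node environment report; compare the output across all nodes.
dataset_dir = os.environ.get("DATASET_DIR", "/efs/andrea/dataset")
print(f"host={socket.gethostname()}")
print(f"torch={torch.__version__} cuda={torch.version.cuda}")
print(f"gpus={torch.cuda.device_count()} cuda_available={torch.cuda.is_available()}")
print(f"dataset_dir={dataset_dir} exists={os.path.isdir(dataset_dir)}")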

Run one process per GPU (world size = total GPU count; e.g., world size 2 for two nodes with one GPU each), with rank 0 acting as the master for logging and checkpointing. NCCL handles intra-node (NVLink/PCIe) and inter-node (TCP) gradient all-reduce; network misconfiguration causes silent hangs.
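
torchrun injects this rank bookkeeping into every process it launches as environment variables; a minimal sketch of reading them (the variable names are the standard torchrun ones):

import os

# Standard variables set by torchrun for each spawned process.
rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

is_master = rank == 0  # rank 0 handles logging and checkpointing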

Secure AWS Networking and Launch torchrun

Launch identical instance types, note the master's private IP (e.g., 10.x.xxx.203), and edit the security group's inbound rules: Type=All traffic, Source=the same security group ID (e.g., sg-xxx). This enables rendezvous and NCCL communication; the default rules block inter-node traffic and cause indefinite hangs with no error output.
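
Before starting training, a worker can confirm it actually reaches the master's rendezvous port; a minimal sketch, assuming MASTER_ADDR/MASTER_PORT are exported as in the .env below (the port only answers once torchrun is already running on the master):

import os
import socket

# Plain TCP reachability check from a worker to the master's rendezvous port.
addr = os.environ["MASTER_ADDR"]
port = int(os.environ.get("MASTER_PORT", "30000"))
try:
    with socket.create_connection((addr, port), timeout=5):
        print(f"reached {addr}:{port}")
except OSError as exc:
    print(f"cannot reach {addr}:{port}: {exc} (check security group inbound rules)")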

Set .env per node:

  • Master: NUMBER_OF_NODES=2, NODE_RANK=0, NUMBER_OF_GPUS=1, MASTER_ADDR=<master's private IP>, MASTER_PORT=30000, DDP_TIMEOUT_SECONDS=180
  • Worker: same values but NODE_RANK=1; leave OUTPUT_DIR empty (only the master writes outputs).

Run in tmux on each node: uv run torchrun --nnodes=2 --node_rank=$NODE_RANK --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=30000 train.py. The effective batch size scales linearly with world size (e.g., per-rank batch_size=10 yields an effective batch of 20), so adjust the learning rate accordingly.
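
One common way to account for the larger effective batch is linear learning-rate scaling; a minimal sketch (base_lr and per_rank_batch_size are hypothetical values, and other scaling rules exist):

import os

per_rank_batch_size = 10
base_lr = 1e-4  # hypothetical learning rate tuned for a single GPU

world_size = int(os.environ.get("WORLD_SIZE", "1"))
effective_batch_size = per_rank_batch_size * world_size  # 20 with two ranks
scaled_lr = base_lr * world_size  # linear scaling heuristic

print(f"effective_batch={effective_batch_size} lr={scaled_lr}")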

Integrate DDPManager and DistributedSampler in Code

Encapsulate DDP setup and teardown in a DDPManager class:

import os
from datetime import timedelta
import torch
import torch.distributed as dist

class DDPManager:
    def __init__(self, backend="nccl", timeout_s=180):
        self.backend = backend
        self.timeout_s = timeout_s

    def setup(self) -> bool:
        """Initialize the process group; returns False when not launched via torchrun."""
        if dist.is_initialized():
            return True
        if "RANK" not in os.environ:  # plain single-process run
            return False
        torch.cuda.set_device(self.get_local_rank())
        dist.init_process_group(backend=self.backend, timeout=timedelta(seconds=self.timeout_s))
        return True

    def get_local_rank(self) -> int:
        return int(os.environ.get("LOCAL_RANK", "0"))

    def is_main_process(self) -> bool:
        return int(os.environ.get("RANK", "0")) == 0

    def barrier(self):
        if dist.is_initialized():
            dist.barrier()

    def cleanup(self):
        if dist.is_initialized():
            dist.destroy_process_group()

Setup: ddp = DDPManager(); use_ddp = ddp.setup(); device = torch.device(f"cuda:{ddp.get_local_rank()}") if use_ddp else "cuda:0". Wrap the model: model = DDP(model, device_ids=[local_rank], output_device=local_rank, find_unused_parameters=False); access the underlying module via model.module.
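
Putting the pieces together, a minimal sketch of the setup path (MyModel is a hypothetical placeholder for the actual network):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp = DDPManager()
use_ddp = ddp.setup()
local_rank = ddp.get_local_rank()
device = torch.device(f"cuda:{local_rank}") if use_ddp else torch.device("cuda:0")

model = MyModel().to(device)  # hypothetical model class
if use_ddp:
    model = DDP(model, device_ids=[local_rank], output_device=local_rank,
                find_unused_parameters=False)
raw_model = model.module if use_ddp else model  # unwrap for state_dict access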

Use DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True) to partition data across ranks; call train_sampler.set_epoch(epoch) at the start of each epoch so shuffling varies. Add a barrier after master-only tasks (validation/saving): if use_ddp: ddp.barrier(). The master handles checkpoints: torch.save({"step": step, "model": model.module.state_dict()}, f"{ckpt_dir}/model-{step}.pth").
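
A minimal sketch of the resulting loop skeleton (dataset, num_epochs, train_step, and ckpt_dir are hypothetical placeholders, and checkpoints are saved once per epoch here for brevity):

import os
import torch
from torch.utils.data import DataLoader, DistributedSampler

world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))

sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle partitions differently each epoch
    for batch in loader:
        train_step(model, batch)  # hypothetical forward/backward/optimizer step
    if ddp.is_main_process():
        state = model.module.state_dict() if use_ddp else model.state_dict()
        torch.save({"epoch": epoch, "model": state}, f"{ckpt_dir}/model-{epoch}.pth")
    if use_ddp:
        ddp.barrier()  # keep workers in sync after master-only checkpointing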

Debug Timeouts and Failures Proactively

Silent hangs usually signal network issues; ping between the instances first. A missing node triggers an init timeout (180 s default). A master crash kills the whole job; there is no fault tolerance. Deadlocks (e.g., a rank stalled at a barrier) eventually hit the timeout. Restrict visible GPUs with export CUDA_VISIBLE_DEVICES=0. Scale the batch size with the number of ranks for stable training; effective batch = per-rank batch * world_size.
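
For more visibility when a hang does occur, NCCL and torch.distributed expose debug logging through environment variables; a minimal sketch that sets them at the top of train.py before the process group is initialized (exporting them in the shell before torchrun works equally well):

import os

# Standard debug variables; must be set before dist.init_process_group().
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # NCCL transport/connection logs
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra collective consistency checks
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")          # restrict to one GPU per the note above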

