Can anyone help me with a boilerplate, or the changes I need to make, to run the Trainer API for data parallel training on a multi-node setup?
Hey! If you’re looking to run the Trainer API for data parallel on a multi-node setup, here are some key things to check and set up:
- Enable `torchrun` or `deepspeed` – You’ll need to launch your training script using `torchrun` (PyTorch) or DeepSpeed for multi-node training (see the sketch after this list).
- Set distributed training parameters – In your training arguments, set `ddp_find_unused_parameters=False` and make sure `torch.distributed.launch` or `torchrun` is configured correctly.
- Check environment variables – Each node should have correct settings for `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`.
- Ensure all nodes communicate – Make sure SSH is set up and all nodes can see each other. You might need to set up the NCCL backend settings properly.
- Modify the training script if needed – If you’re not using Hugging Face’s `Trainer`, ensure your script correctly initializes `torch.distributed.init_process_group()`.
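Here’s a rough sketch of how those pieces fit together with the Trainer. The model, dataset, node count, and port below are placeholders/example values, not something taken from your setup:

```python
# train.py – minimal multi-node data-parallel sketch (placeholders, not a drop-in script)
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    ddp_find_unused_parameters=False,  # skip DDP's unused-parameter search
)

# my_model / my_dataset are placeholders for whatever you already train with
trainer = Trainer(model=my_model, args=training_args, train_dataset=my_dataset)
trainer.train()

# If you are NOT using Trainer, initialize the process group yourself:
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT from env
#
# Launch the same command on every node (example: 2 nodes, 1 GPU each):
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=<MASTER_ADDR> --master_port=29500 train.py
```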
If you’re running into issues, sharing your setup details (framework, error messages) would help troubleshoot!
Some follow-up questions:
Can I do multi-node training using accelerate instead of DeepSpeed or torchrun?
Also, I’m using a system that allocates nodes at run time, so I don’t have the master IP in hand when I schedule my job request. Can you suggest what I should do in that scenario to set up my master IP?
Yes. You can use accelerate for multi-node training instead of DeepSpeed or torchrun. Hugging Face’s accelerate library simplifies distributed training and can handle multi-node setups efficiently. You’ll need to configure your accelerate config settings properly to enable multi-node execution.
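As a rough sketch (made-up node/GPU counts, port, and script name), every node runs the same `accelerate launch` command, changing only `--machine_rank`:

```python
# Sketch only – the numbers here assume 2 nodes with 2 GPUs each.
#
# On every node, run:
#   accelerate launch \
#       --num_machines 2 \
#       --num_processes 4 \             # total processes per the accelerate docs (2 nodes x 2 GPUs)
#       --machine_rank 0 \              # 0 on the main node, 1 on the other node
#       --main_process_ip $MASTER_ADDR \
#       --main_process_port 29500 \
#       train.py
#
# train.py can stay a plain Trainer script; a quick sanity check inside it:
import os

print(
    f"rank={os.environ.get('RANK')} "
    f"local_rank={os.environ.get('LOCAL_RANK')} "
    f"world_size={os.environ.get('WORLD_SIZE')}"
)
```

Alternatively, you can run `accelerate config` on each node and answer the multi-node prompts there instead of passing the flags on the command line.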
For your second question—since you don’t have the master IP beforehand due to dynamic node allocation, you can use one of these approaches:
- Use a shared storage system – Some clusters provide a shared filesystem where the first node can write its IP to a file, and others can read it.
- Service-based discovery – If your cluster supports job schedulers like SLURM, you can use `scontrol show hostname` to get node addresses dynamically (see the sketch after this list).
- Auto-discovery with environment variables – Some cloud platforms provide a way to fetch the master node dynamically using metadata services.
- Manual assignment with retries – If no automated method works, you might need to implement a retry mechanism where worker nodes wait and poll for the master node’s IP before joining the training process.
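For the SLURM case, here’s a minimal sketch of that discovery step (the port is an arbitrary example and `resolve_master_addr` is just an illustrative helper, not part of any library):

```python
# Derive MASTER_ADDR at job start on a SLURM cluster, so you don't need the
# IP when you submit the job. Assumes SLURM_JOB_NODELIST is set by the scheduler.
import os
import subprocess

def resolve_master_addr() -> str:
    """Use the first host in the SLURM node list as the rendezvous master."""
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    # scontrol expands compressed lists like "node[01-02]" into one hostname per line
    hosts = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return hosts[0]

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", resolve_master_addr())
    os.environ.setdefault("MASTER_PORT", "29500")  # any free port agreed on by all nodes
    print(os.environ["MASTER_ADDR"])
```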
If you’re using accelerate, it can sometimes handle discovery automatically—make sure to explore its multi-node configuration options! Let me know if you need specific guidance based on your setup.
Thanks for your help.
I was able to figure out the issues on my end.
Firstly, many of the discussion forums said to set num_processes to the total number of processes, i.e. the GPUs of both nodes collectively. Because of that I was not able to spin up both nodes. In my setup, setting num_processes to the per-node GPU count worked.
Second, my code was saving checkpoints and then using them later, so I needed to save them on both nodes using the save_on_each_node argument of TrainingArguments.
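For anyone landing here later, a minimal sketch of that last point (the output directory and save interval are placeholder values):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    save_strategy="steps",
    save_steps=500,
    save_on_each_node=True,  # write checkpoints on every node, not only the main node
)
```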