Running distributed data training on a high-performance computing cluster
Intro
There are two options in distributed training: distributed data training and distributed model training. Distributed model training is used when the model does not fit into a single GPU (e.g. the model requires too much memory), so we split the model across multiple GPUs. Distributed data training is used when we have a large amount of data and training on one GPU would take too long.
For our experiment we would like to use the LUNA dataset for lung nodule analysis. We also plan to reuse already-written code for training a classification model. The code can be found in this repository. The code is not prepared for distributed training on multiple nodes.
Task
To run our training on the HPC cluster, as with any other job there, we have to use Slurm. Slurm manages the cluster and the task queue. To interact with Slurm we use a batch file with special syntax. In our experiment we will acquire 4 nodes with GPUs and look for a way to utilize these 4 nodes correctly in model training.
We should find a way to get correct information about the cluster and use it with PyTorch for distributed training. The second task, which can be tackled if we have spare capacity, is preparation of the dataset for distributed training.
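If we get to that second task, PyTorch's DistributedSampler is the standard building block for sharding data across processes. A minimal sketch, assuming a generic dataset (the function name and batch size are our placeholders):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, rank, world_size, batch_size=32):
    # Each of the world_size processes gets its own disjoint shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    # Shuffling is handled by the sampler, so the DataLoader itself must not shuffle.
    return sampler, DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# In the training loop, call sampler.set_epoch(epoch) at the start of each epoch
# so that each epoch produces a different shuffling across processes.
```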
To run distributed parallel training, PyTorch requires a world_size argument: the total number of processes taking part in the job (often one per GPU, so it can be the number of GPUs on one node, or the number of nodes when each node runs a single process). PyTorch also requires the master node's URL. In code it looks roughly like this:
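A minimal sketch of the initialization call; the function name, master address, and port below are our placeholders, not the original code:

```python
import torch.distributed as dist

def init_distributed(rank, world_size, master_addr, master_port=23456):
    # Every process must agree on world_size and on where rank 0 (the master) lives.
    dist.init_process_group(
        backend="nccl",  # NCCL backend for multi-GPU training
        init_method=f"tcp://{master_addr}:{master_port}",
        world_size=world_size,  # total number of processes in the job
        rank=rank,              # unique id of this process, 0..world_size-1
    )
```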
Slurm file
In the Slurm file we try to determine the cluster's master address, which is needed to start distributed parallel training. However, this tactic does not work correctly on our HPC, because we receive non-routable host names.
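A sketch of what such a batch script can look like; the job name, resource counts, port, script name, and flags are our placeholders, not the original file:

```bash
#!/bin/bash
#SBATCH --job-name=luna-ddp
#SBATCH --nodes=4                  # the 4 nodes we plan to acquire
#SBATCH --ntasks-per-node=1        # one training process per node
#SBATCH --gres=gpu:1               # one GPU per node
#SBATCH --time=04:00:00

# First attempt: take the first hostname in the allocation as the master.
# On our cluster this yields a non-routable host name, which is exactly
# the problem described above.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=23456

srun python train.py --world_size "$SLURM_NTASKS" --master_addr "$MASTER_ADDR"
```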
Found resources
- https://github.com/PrincetonUniversity/install_pytorch
- https://gist.github.com/TengdaHan/1dd10d335c7ca6f13810fff41e809904
- https://github.com/ShigekiKarita/pytorch-distributed-slurm-example/blob/master/main_distributed.py
Summary
We found the following working solution for a Slurm-managed cluster:
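A minimal sketch of the idea, assuming all nodes share a filesystem; the file path, port, and function name are our placeholders:

```python
import os
import socket
import time

import torch.distributed as dist

SHARED_FILE = "/shared/master_addr.txt"  # placeholder: any path visible to all nodes
MASTER_PORT = 23456                      # placeholder port

def init_via_shared_file(rank, world_size):
    if rank == 0:
        # Master node: publish its routable IP address for the other nodes.
        master_ip = socket.gethostbyname(socket.gethostname())
        tmp = SHARED_FILE + ".tmp"
        with open(tmp, "w") as f:
            f.write(master_ip)
        os.replace(tmp, SHARED_FILE)  # atomic rename: readers never see a partial file
    else:
        # Worker nodes: wait until the master has published its address, then read it.
        while not os.path.exists(SHARED_FILE):
            time.sleep(1)
        with open(SHARED_FILE) as f:
            master_ip = f.read().strip()

    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_ip}:{MASTER_PORT}",
        world_size=world_size,
        rank=rank,
    )
```

Under Slurm, the rank and world_size for each process can be taken from the SLURM_PROCID and SLURM_NTASKS environment variables.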
The code block from the gist above can be added to the main training logic.
The code checks whether it is running on the master node; if so, it creates a file containing the master's IP address. All other cluster nodes wait for the file to be created and, in this way, learn the master's routable address.