Run distributed data training on high performance computation cluster

Intro

There are two options in distributed training: the first is distributed data training and the second is distributed model training. Distributed model training is used when the model won't fit on one GPU (e.g. the model requires too much memory), so we split the model across multiple GPUs. Distributed data training is used when we have a large amount of data and training on one GPU would take too long. For our experiment we would like to use the LUNA dataset for lung nodule analysis. We also plan to reuse already written code for classification model training; the code can be found in this repository. The code is not prepared for distributed training on multiple nodes.

Task

To run our training on HPC, as with other tasks, we have to use Slurm. Slurm manages the cluster and the job queue. To interact with Slurm we write a file with a special syntax. In our experiment we will acquire 4 nodes with GPUs and will look for a way to correctly utilize these 4 nodes...
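To give a feel for that "file with a special syntax", here is a minimal sketch of a Slurm batch script that asks for 4 GPU nodes and launches one data-parallel worker per node. The job name, resource numbers, port, and the `training.py` entry point are placeholders for illustration, not taken from the repository; the exact partition and GPU options depend on the cluster.

```bash
#!/bin/bash
#SBATCH --job-name=luna-ddp        # arbitrary job name (placeholder)
#SBATCH --nodes=4                  # the 4 GPU nodes we want to acquire
#SBATCH --ntasks-per-node=1        # one launcher task per node
#SBATCH --gres=gpu:1               # one GPU per node (cluster-specific)
#SBATCH --cpus-per-task=8          # CPU cores for the data loaders
#SBATCH --time=04:00:00            # wall-clock limit
#SBATCH --output=%x-%j.out         # log file: jobname-jobid.out

# Use the first node of the allocation as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one torchrun launcher per node; torchrun then spawns one
# worker per GPU and sets RANK / WORLD_SIZE / LOCAL_RANK, which PyTorch's
# DistributedDataParallel reads during initialization.
srun torchrun \
    --nnodes=4 \
    --nproc_per_node=1 \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${head_node}:29500" \
    training.py --epochs 10   # hypothetical entry point of the training code
```

The script is submitted with `sbatch`, after which Slurm places the job in the queue and runs it once the 4 nodes become available.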