Commit bed3fa19 authored by Yuxin Wu

Support horovod distributed training (#422)

parent 9c2e2226
@@ -15,7 +15,7 @@ which as a result makes people think TensorFlow is slow.
 Tensorpack uses TensorFlow efficiently, and hides these details under its APIs.
 You no longer need to learn about
-multi-GPU model replication, variables synchronization, queues, tf.data -- anything that's unrelated to the model itself.
+multi-GPU model replication, device placement, variables synchronization, queues -- anything that's unrelated to the model itself.
 You still need to learn to write models with TF, but everything else is taken care of by tensorpack, in an efficient way.
 A High Level Glance
...
@@ -212,9 +212,28 @@ class DistributedTrainerReplicated(SingleCostTrainer):
 class HorovodTrainer(SingleCostTrainer):
     """
-    Horovod trainer, currently support multi-GPU training.
-    It will use the first k GPUs in CUDA_VISIBLE_DEVICES.
+    Horovod trainer, supports multi-GPU and distributed training.
+
+    To use for multi-GPU training:
+
+        CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py
+
+    To use for distributed training:
+
+        /path/to/mpirun -np 8 -H server1:4,server2:4 \
+            -bind-to none -map-by slot \
+            --output-filename mylog -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 \
+            python train.py
+
+    Note:
+        1. If using all GPUs, you can always skip the `CUDA_VISIBLE_DEVICES` option.
+        2. Performance-wise, horovod is expected to be slightly slower than
+           native tensorflow on multi-GPU training, but faster in distributed training.
+        3. Due to the use of MPI, training is less informative (no progress bar).
+           It's recommended to use other multi-GPU trainers for single-node
+           experiments, and scale to multiple nodes with horovod.
     """
     def __init__(self):
         hvd.init()
...
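For context, a minimal train.py that the mpirun commands in the docstring above could launch might look like the sketch below. This is not part of the commit: MyModel and get_dataflow are hypothetical stand-ins for a user's ModelDesc subclass and input pipeline, while TrainConfig, launch_train_with_config, and HorovodTrainer are tensorpack's own names, assuming the trainer API of this era.

    # Hedged sketch of a train.py launched via mpirun (one process per GPU).
    # MyModel and get_dataflow are hypothetical user code, not tensorpack APIs.
    from tensorpack import TrainConfig, launch_train_with_config
    from tensorpack.train import HorovodTrainer

    from mycode import MyModel, get_dataflow   # hypothetical user module

    if __name__ == '__main__':
        config = TrainConfig(
            model=MyModel(),           # a ModelDesc subclass defining the graph
            dataflow=get_dataflow(),   # per-process input pipeline
            max_epoch=100,
        )
        # mpirun starts one such process per GPU; HorovodTrainer calls
        # hvd.init() in its constructor and averages gradients across
        # all processes during training.
        launch_train_with_config(config, HorovodTrainer())

Each MPI process runs the same script end to end, which is why the input pipeline must be constructed per process rather than shared, matching the single-script launch style shown in the docstring.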