Commit bed3fa19, authored Nov 23, 2017 by Yuxin Wu
Support horovod distributed training (#422)
Parent: 9c2e2226
Showing 2 changed files with 22 additions and 3 deletions:

  docs/tutorial/intro.rst        +1   -1
  tensorpack/train/trainers.py   +21  -2
docs/tutorial/intro.rst

@@ -15,7 +15,7 @@ which as a result makes people think TensorFlow is slow.
 Tensorpack uses TensorFlow efficiently, and hides these details under its APIs.
 You no longer need to learn about
-multi-GPU model replication, variables synchronization, queues, tf.data -- anything that's unrelated to the model itself.
+multi-GPU model replication, device placement, variables synchronization, queues -- anything that's unrelated to the model itself.
 You still need to learn to write models with TF, but everything else is taken care of by tensorpack, in the efficient way.

 A High Level Glance
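
The doc change above is easier to see with a concrete example. The following is a minimal sketch (not part of this commit) of how the same model code is reused across trainers; MyModel and my_dataflow are hypothetical placeholders for a user-defined ModelDesc and DataFlow, and it assumes the launch_train_with_config entry point of the trainer API touched by this commit.

    # Sketch only: MyModel and my_dataflow are hypothetical placeholders
    # for a user-defined ModelDesc and DataFlow.
    from tensorpack import TrainConfig, launch_train_with_config
    from tensorpack.train import SimpleTrainer, SyncMultiGPUTrainerReplicated

    config = TrainConfig(model=MyModel(), dataflow=my_dataflow, max_epoch=100)

    # Single-GPU training:
    launch_train_with_config(config, SimpleTrainer())

    # Data-parallel training on 4 GPUs: same model and config, different trainer.
    # Replication, device placement and variable synchronization are handled internally.
    # launch_train_with_config(config, SyncMultiGPUTrainerReplicated(4))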
tensorpack/train/trainers.py

@@ -212,9 +212,28 @@ class DistributedTrainerReplicated(SingleCostTrainer):

 class HorovodTrainer(SingleCostTrainer):
     """
-    Horovod trainer, currently support multi-GPU training.
+    Horovod trainer, support multi-GPU and distributed training.
+    It will use the first k GPUs in CUDA_VISIBLE_DEVICES.
+
+    To use for multi-GPU training:
+
+        CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py
+
+    To use for distributed training:
+
+        /path/to/mpirun -np 8 -H server1:4,server2:4  \
+            -bind-to none -map-by slot \
+            --output-filename mylog -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 \
+            python train.py
+
+    Note:
+        1. If using all GPUs, you can always skip the `CUDA_VISIBLE_DEVICES` option.
+        2. About performance, horovod is expected to be slightly
+           slower than native tensorflow on multi-GPU training, but faster in distributed training.
+        3. Due to the use of MPI, training is less informative (no progress bar).
+           It's recommended to use other multi-GPU trainers for single-node
+           experiments, and scale to multi nodes by horovod.
     """

     def __init__(self):
         hvd.init()
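
For context, the train.py referenced by the mpirun commands in the new docstring could look roughly like the sketch below; it is not part of this commit. MyModel and my_dataflow are hypothetical placeholders, and the script assumes the launch_train_with_config API from this same trainer module.

    # train.py: sketch only; MyModel and my_dataflow are hypothetical placeholders.
    from tensorpack import TrainConfig, launch_train_with_config
    from tensorpack.train import HorovodTrainer

    if __name__ == '__main__':
        config = TrainConfig(model=MyModel(), dataflow=my_dataflow, max_epoch=100)
        # HorovodTrainer calls hvd.init() in its constructor; each MPI process
        # then trains on one of the first k GPUs in CUDA_VISIBLE_DEVICES.
        launch_train_with_config(config, HorovodTrainer())

Such a script would be launched with the multi-GPU command from the docstring, e.g. CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py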