update docs

Commit 58529de1 authored May 16, 2018 by Yuxin Wu
6 changed files
--- a/README.md
+++ b/README.md
@@ -15,17 +15,17 @@ It's Yet Another TF high-level API, with __speed__, __readability__ and __flexib
 	+	Speed comes for free with tensorpack -- it uses TensorFlow in the __efficient way__ with no extra overhead.
 	  On different CNNs, it runs training [1.2~5x faster](https://github.com/tensorpack/benchmarks/tree/master/other-wrappers) than the equivalent Keras code.

-	+ Data-parallel multi-GPU training is off-the-shelf to use. It scales as well as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks).
+	+ Data-parallel multi-GPU/distributed training is off-the-shelf to use with
+      one line of code. It scales as well as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks).

-	+ Distributed data-parallel training is also supported and scales well. See [tensorpack/benchmarks](https://github.com/tensorpack/benchmarks) for more benchmark scripts.
+	+ See [tensorpack/benchmarks](https://github.com/tensorpack/benchmarks) for more benchmark scripts.

 2. Focus on __large datasets__.
 	+ It's unnecessary to read/preprocess data with a new language called TF.
 		Tensorpack helps you load large datasets (e.g. ImageNet) in __pure Python__ with autoparallelization.

 3. It's not a model wrapper.
-	+ There are too many symbolic function wrappers in the world.
-		Tensorpack includes only a few common models.
+	+ There are too many symbolic function wrappers in the world. Tensorpack includes only a few common models.
 	  But you can use any symbolic function library inside tensorpack, including tf.layers/Keras/slim/tflearn/tensorlayer/....

 See [tutorials](http://tensorpack.readthedocs.io/tutorial/index.html#user-tutorials) to know more about these features.

--- a/docs/tutorial/trainer.md
+++ b/docs/tutorial/trainer.md
@@ -53,7 +53,7 @@ implement different distribution strategies.
 They take care of device placement, gradient averaging and synchronization
 in the efficient way and all reach the same performance as the
 [official TF benchmarks](https://www.tensorflow.org/performance/benchmarks).
-It takes only one line of code change to use them.
+It takes only one line of code change to use them, i.e. `trainer=SyncMultiGPUTrainerReplicated()`.

 Note some __common problems__ when using these trainers:

@@ -66,3 +66,13 @@ Note some __common problems__ when using these trainers:

 2. The tower function (your model code) will get called multiple times.
 	As a result, you'll need to be careful when modifying global states in those functions, e.g. adding ops to TF collections.
+
+### Distributed Trainers
+
+Distributed training needs the [horovod](https://github.com/uber/horovod) library, which offers a high-performance allreduce implementation.
+To run distributed training, first install horovod properly, then refer to the
+documentation of [HorovodTrainer](../modules/train.html#tensorpack.train.HorovodTrainer).
+
+Tensorpack has implemented some other distributed trainers using TF's native API,
+but TF's native support for distributed training isn't very high-performance even today.
+Therefore those trainers are not actively maintained and are not recommended for use.
--- a/examples/DoReFa-Net/README.md
+++ b/examples/DoReFa-Net/README.md
@@ -12,13 +12,13 @@ These quantization techniques achieves the following ImageNet performance in thi

 | Model              | W,A,G       | Top 1 Error |
 |:-------------------|-------------|------------:|
-| Full Precision     | 32,32,32    |      40.9%  |
-| TTQ                | t,32,32     |      41.5%  |
-| BWN                | 1,32,32     |      43.7%  |
-| BNN                | 1,1,32      |      53.4%  |
-| DoReFa             | 1,2,32      |      47.2%  |
-| DoReFa             | 1,2,6       |      47.2%  |
-| DoReFa             | 1,2,4       |      60.9%  |
+| Full Precision     | 32,32,32    |      40.3%  |
+| TTQ                | t,32,32     |      42.0%  |
+| BWN                | 1,32,32     |      44.6%  |
+| BNN                | 1,1,32      |      51.9%  |
+| DoReFa             | 1,2,32      |      46.6%  |
+| DoReFa             | 1,2,6       |      46.8%  |
+| DoReFa             | 1,2,4       |      54.0%  |

 These numbers were obtained by training on 8 GPUs with a total batch size of 256.
 The DoReFa-Net models reach slightly better performance than our paper, due to

--- a/examples/DoReFa-Net/alexnet-dorefa.py
+++ b/examples/DoReFa-Net/alexnet-dorefa.py
@@ -236,4 +236,4 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = SaverRestore(args.load)
-    launch_train_with_config(config, SyncMultiGPUTrainer(nr_tower))
+    launch_train_with_config(config, SyncMultiGPUTrainerReplicated(nr_tower))
--- a/tensorpack/graph_builder/distributed.py
+++ b/tensorpack/graph_builder/distributed.py
@@ -72,8 +72,9 @@ class DistributedParameterServerBuilder(DataParallelBuilder, DistributedBuilderB
    `tensorflow/benchmarks <https://github.com/tensorflow/benchmarks>`_.
    However this implementation hasn't been well tested.
    It probably still has issues in model saving, etc.
-    Check `ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
-    for fast and correct distributed examples.
+    Check :class:`HorovodTrainer` and
+    `ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
+    for faster distributed examples.

    Note:
        1. Gradients are not averaged across workers, but applied to PS variables
@@ -143,8 +144,9 @@ class DistributedReplicatedBuilder(DataParallelBuilder, DistributedBuilderBase):
    It is an equivalent of ``--variable_update=distributed_replicated`` in
    `tensorflow/benchmarks <https://github.com/tensorflow/benchmarks>`_.
    Note that the performance of this trainer is still not satisfactory.
-    Check `ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
-    for fast and correct distributed examples.
+    Check :class:`HorovodTrainer` and
+    `ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
+    for faster distributed examples.

    Note:
        1. Gradients are not averaged across workers, but applied to PS variables

--- a/tensorpack/train/trainers.py
+++ b/tensorpack/train/trainers.py
@@ -281,27 +281,58 @@ class DistributedTrainerReplicated(DistributedTrainerBase):

 class HorovodTrainer(SingleCostTrainer):
    """
-    Horovod trainer, support multi-GPU and distributed training.
+    Horovod trainer, supports both multi-GPU and distributed training.

    To use for multi-GPU training:

+    .. code-block:: bash
+
+        # change trainer to HorovodTrainer(), then
        CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py

    To use for distributed training:

-        /path/to/mpirun -np 8 -H server1:4,server2:4  \
-            -bind-to none -map-by slot \
-            --output-filename mylog  -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 \
-            python train.py
+    .. code-block:: bash

-        (Add other environment variables you need by -x, e.g. PYTHONPATH, PATH)
+        # change trainer to HorovodTrainer(), then
+        /path/to/mpirun -np 8 -H server1:4,server2:4  \\
+            -bind-to none -map-by slot \\
+            --output-filename mylog  -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 \\
+            python train.py
+        # (Add other environment variables you need by -x, e.g. PYTHONPATH, PATH)

    Note:
        1. If using all GPUs, you can always skip the `CUDA_VISIBLE_DEVICES` option.

        2. Due to the use of MPI, training is less informative (no progress bar).

-        3. MPI often fails to kill all processes. Be sure to check it.
+        3. Due to a TF bug, you must not initialize CUDA context before training.
+           Therefore TF functions like `is_gpu_available()` or `list_local_devices()`
+           must be avoided.
+
+        4. MPI does not like fork(). If your dataflow contains multiprocessing, it may cause problems.
+
+        5. MPI sometimes fails to kill all processes. Be sure to check it.
+
+        6. Keep in mind that there is one process per GPU, therefore:
+
+           + If your data processing is heavy, doing it in a separate dedicated process might be
+             a better choice than doing it repeatedly in each process.
+
+           + You need to set the log directory carefully to avoid conflicts.
+             For example, you can set it only for the chief process.
+
+           + Callbacks have an option to be run only on the chief process, or on all processes.
+             See :meth:`callback.set_chief_only()`. Most callbacks have a reasonable
+             default already, but certain callbacks may not behave properly by default. Report an issue if you find any.
+
+           + You can use Horovod API such as `hvd.rank()` to know which process you are.
+             The chief process has rank 0.
+
+        7. Due to these caveats, see
+           `ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
+           for a full example which has handled these common issues.
+           The example can train ImageNet in roughly an hour following the paper's setup.
    """
    def __init__(self, average=True):
        """