Commit 58529de1 authored by Yuxin Wu

update docs

parent 812329fa
......@@ -15,17 +15,17 @@ It's Yet Another TF high-level API, with __speed__, __readability__ and __flexib
+ Speed comes for free with tensorpack -- it uses TensorFlow in the __efficient way__ with no extra overhead.
On different CNNs, it runs training [1.2~5x faster](https://github.com/tensorpack/benchmarks/tree/master/other-wrappers) than the equivalent Keras code.
+ Data-parallel multi-GPU training is off-the-shelf to use. It scales as well as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks).
+ Data-parallel multi-GPU/distributed training is off-the-shelf to use with
one line of code. It scales as well as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks).
+ Distributed data-parallel training is also supported and scales well. See [tensorpack/benchmarks](https://github.com/tensorpack/benchmarks) for more benchmark scripts.
+ See [tensorpack/benchmarks](https://github.com/tensorpack/benchmarks) for more benchmark scripts.
2. Focus on __large datasets__.
+ It's unnecessary to read/preprocess data with a new language called TF.
Tensorpack helps you load large datasets (e.g. ImageNet) in __pure Python__ with autoparallelization.
3. It's not a model wrapper.
+ There are too many symbolic function wrappers in the world.
Tensorpack includes only a few common models.
+ There are too many symbolic function wrappers in the world. Tensorpack includes only a few common models.
But you can use any symbolic function library inside tensorpack, including tf.layers/Keras/slim/tflearn/tensorlayer/....
See [tutorials](http://tensorpack.readthedocs.io/tutorial/index.html#user-tutorials) to learn more about these features.
......
......@@ -53,7 +53,7 @@ implement different distribution strategies.
They take care of device placement, gradient averaging and synchronization
in an efficient way, and all reach the same performance as the
[official TF benchmarks](https://www.tensorflow.org/performance/benchmarks).
It takes only one line of code change to use them.
It takes only one line of code change to use them, i.e. `trainer=SyncMultiGPUTrainerReplicated()`.
Note some __common problems__ when using these trainers:
......@@ -66,3 +66,13 @@ Note some __common problems__ when using these trainers:
2. The tower function (your model code) will get called multiple times.
As a result, you'll need to be careful when modifying global states in those functions, e.g. adding ops to TF collections.
### Distributed Trainers
Distributed training needs the [horovod](https://github.com/uber/horovod) library, which offers a high-performance allreduce implementation.
To run distributed training, first install horovod properly, then refer to the
documentation of [HorovodTrainer](../modules/train.html#tensorpack.train.HorovodTrainer).
Tensorpack has implemented some other distributed trainers using TF's native API,
but TF's native support for distributed training isn't very high-performance even today.
Therefore those trainers are not actively maintained and are not recommended for use.
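
The `HorovodTrainer` docstring changed in this commit launches training through `mpirun`, with one process per GPU. As a rough sketch only, the command line could be assembled from a host-to-GPU-count mapping as below. The helper name `build_mpirun_command` and its parameters are hypothetical and not part of tensorpack or Horovod; only the `mpirun` flags are taken from that docstring.

```python
import shlex


def build_mpirun_command(hosts, script="train.py", log_prefix="mylog",
                         forwarded_env=("LD_LIBRARY_PATH",), mpirun="mpirun"):
    """Build an mpirun argument list (hypothetical helper, for illustration).

    hosts: dict mapping hostname -> number of processes (one per GPU).
    forwarded_env: environment variables passed to workers via -x.
    """
    total = sum(hosts.values())
    host_spec = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    cmd = [mpirun, "-np", str(total), "-H", host_spec,
           "-bind-to", "none", "-map-by", "slot",
           "--output-filename", log_prefix]
    for var in forwarded_env:
        cmd += ["-x", var]
    cmd += ["python", script]
    return cmd


# Two servers with four GPUs each, as in the docstring example.
cmd = build_mpirun_command({"server1": 4, "server2": 4})
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting; add further variables such as `PYTHONPATH` or `PATH` to `forwarded_env` as needed.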
......@@ -12,13 +12,13 @@ These quantization techniques achieves the following ImageNet performance in thi
| Model | W,A,G | Top 1 Error |
|:-------------------|-------------|------------:|
| Full Precision | 32,32,32 | 40.9% |
| TTQ | t,32,32 | 41.5% |
| BWN | 1,32,32 | 43.7% |
| BNN | 1,1,32 | 53.4% |
| DoReFa | 1,2,32 | 47.2% |
| DoReFa | 1,2,6 | 47.2% |
| DoReFa | 1,2,4 | 60.9% |
| Full Precision | 32,32,32 | 40.3% |
| TTQ | t,32,32 | 42.0% |
| BWN | 1,32,32 | 44.6% |
| BNN | 1,1,32 | 51.9% |
| DoReFa | 1,2,32 | 46.6% |
| DoReFa | 1,2,6 | 46.8% |
| DoReFa | 1,2,4 | 54.0% |
These numbers were obtained by training on 8 GPUs with a total batch size of 256.
The DoReFa-Net models reach slightly better performance than our paper, due to
......
......@@ -236,4 +236,4 @@ if __name__ == '__main__':
config = get_config()
if args.load:
config.session_init = SaverRestore(args.load)
launch_train_with_config(config, SyncMultiGPUTrainer(nr_tower))
launch_train_with_config(config, SyncMultiGPUTrainerReplicated(nr_tower))
......@@ -72,8 +72,9 @@ class DistributedParameterServerBuilder(DataParallelBuilder, DistributedBuilderB
`tensorflow/benchmarks <https://github.com/tensorflow/benchmarks>`_.
However this implementation hasn't been well tested.
It probably still has issues in model saving, etc.
Check `ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
for fast and correct distributed examples.
Check :class:`HorovodTrainer` and
`ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
for faster distributed examples.
Note:
1. Gradients are not averaged across workers, but applied to PS variables
......@@ -143,8 +144,9 @@ class DistributedReplicatedBuilder(DataParallelBuilder, DistributedBuilderBase):
It is an equivalent of ``--variable_update=distributed_replicated`` in
`tensorflow/benchmarks <https://github.com/tensorflow/benchmarks>`_.
Note that the performance of this trainer is still not satisfactory.
Check `ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
for fast and correct distributed examples.
Check :class:`HorovodTrainer` and
`ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
for faster distributed examples.
Note:
1. Gradients are not averaged across workers, but applied to PS variables
......
......@@ -281,27 +281,58 @@ class DistributedTrainerReplicated(DistributedTrainerBase):
class HorovodTrainer(SingleCostTrainer):
"""
Horovod trainer, support multi-GPU and distributed training.
Horovod trainer, supports both multi-GPU and distributed training.
To use for multi-GPU training:
.. code-block:: bash
# change trainer to HorovodTrainer(), then
CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py
To use for distributed training:
/path/to/mpirun -np 8 -H server1:4,server2:4 \
-bind-to none -map-by slot \
--output-filename mylog -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 \
python train.py
.. code-block:: bash
(Add other environment variables you need by -x, e.g. PYTHONPATH, PATH)
# change trainer to HorovodTrainer(), then
/path/to/mpirun -np 8 -H server1:4,server2:4 \\
-bind-to none -map-by slot \\
--output-filename mylog -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 \\
python train.py
# (Add other environment variables you need by -x, e.g. PYTHONPATH, PATH)
Note:
1. If using all GPUs, you can always skip the `CUDA_VISIBLE_DEVICES` option.
2. Due to the use of MPI, training is less informative (no progress bar).
3. MPI often fails to kill all processes. Be sure to check it.
3. Due to a TF bug, you must not initialize CUDA context before training.
Therefore TF functions like `is_gpu_available()` or `list_local_devices()`
must be avoided.
4. MPI does not like fork(). If your dataflow contains multiprocessing, it may cause problems.
3. MPI sometimes fails to kill all processes. Be sure to check it.
5. Keep in mind that there is one process per GPU, therefore:
+ If your data processing is heavy, doing it in a separate dedicated process might be
a better choice than doing it repeatedly in each process.
+ You need to set the log directory carefully to avoid conflicts.
For example you can set it only for the chief process.
+ Callbacks have an option to be run only on the chief process, or on all processes.
See :meth:`callback.set_chief_only()`. Most callbacks have a reasonable
default already, but certain callbacks may not behave properly by default. Report an issue if you find any.
+ You can use Horovod APIs such as `hvd.rank()` to find out which process you are in.
The chief process has rank 0.
6. Due to these caveats, see
`ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
for a full example which has handled these common issues.
The example can train ImageNet in roughly an hour following the paper's setup.
"""
def __init__(self, average=True):
"""
......
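
The docstring notes above say that there is one process per GPU, that the log directory can be set only for the chief process, and that the chief has rank 0. A minimal sketch of that rank check follows. The helper names `is_chief` and `logger_dir_for_rank` are hypothetical; the commented usage assumes Horovod and tensorpack are installed, and it is not taken from this commit.

```python
def is_chief(rank):
    """The chief process is the one with rank 0."""
    return rank == 0


def logger_dir_for_rank(rank, base_dir):
    """Return the log directory for the chief, and None for other processes.

    Hypothetical helper: only the chief writes logs, so the processes
    started by mpirun do not conflict over the same directory.
    """
    return base_dir if is_chief(rank) else None


# Assumed usage inside a training script (requires horovod and tensorpack):
#   import horovod.tensorflow as hvd
#   from tensorpack.utils import logger
#   hvd.init()
#   log_dir = logger_dir_for_rank(hvd.rank(), "train_log/demo")
#   if log_dir is not None:
#       logger.set_logger_dir(log_dir)

# Stand-alone demonstration with four simulated ranks.
for rank in range(4):
    print(rank, logger_dir_for_rank(rank, "train_log/demo"))
```

The ResNet-Horovod example linked in the docstring remains the reference for a complete setup; this sketch only isolates the rank check.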