It's Yet Another TF wrapper, but different in:
1. Focus on __training speed__.
+ Speed comes for free with tensorpack -- it uses TensorFlow in the __efficient way__ with no extra overhead.
On different CNNs, it runs [1.1~3.5x faster](https://github.com/tensorpack/benchmarks/tree/master/other-wrappers) than the equivalent Keras code.
+ Data-parallel multi-GPU training works off the shelf. It scales as well as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks) (see the sketch after this list).
+ See [tensorpack/benchmarks](https://github.com/tensorpack/benchmarks) for the benchmark scripts.
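
To make the "off-the-shelf" claim concrete, here is a minimal sketch of data-parallel training. It assumes the `ModelDesc` / `TrainConfig` / `launch_train_with_config` API of recent tensorpack releases (older releases use slightly different names); `PlaceholderModel`, the `FakeData` input and the choice of 4 GPUs are placeholders for illustration, not part of any real example:

```python
import tensorflow as tf
from tensorpack import (ModelDesc, TrainConfig,
                        SyncMultiGPUTrainerReplicated, launch_train_with_config)
from tensorpack.dataflow import FakeData


class PlaceholderModel(ModelDesc):
    """A throwaway model, only to make the sketch self-contained."""

    def inputs(self):
        return [tf.TensorSpec([None, 224, 224, 3], tf.float32, 'input'),
                tf.TensorSpec([None], tf.int32, 'label')]

    def build_graph(self, image, label):
        logits = tf.layers.dense(tf.layers.flatten(image), 1000)
        return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=label))

    def optimizer(self):
        return tf.train.GradientDescentOptimizer(0.01)


config = TrainConfig(
    model=PlaceholderModel(),
    # random batches of the right shapes, standing in for a real dataflow
    dataflow=FakeData([[64, 224, 224, 3], [64]], 1000,
                      dtype=['float32', 'int32']),
    max_epoch=2,
)
# data-parallel training on 4 GPUs
launch_train_with_config(config, SyncMultiGPUTrainerReplicated(4))
```

Switching to a different trainer (e.g. back to single-GPU training) only changes the last line.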
You'll start to observe a slowdown after adding more pre-processing (such as the steps in the [ResNet example](../examples/ImageNetModels/imagenet_utils.py)).
Now it's time to add threads or processes:
```eval_rst
.. code-block:: python
...
...
```
Let's summarize what the above dataflow does (a code sketch follows this list):
1. One thread iterates over a shuffled list of (filename, label) pairs, and puts them into a queue of size 1000.
2. 25 worker threads take the pairs and turn them into (preprocessed image, label) pairs.
3. Both 1 and 2 happen together in a separate process, and the results are sent back to the main process through ZeroMQ.
4. The main process makes batches, and other tensorpack modules will then take care of how they go into the graph.
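
In code, such a pipeline could be assembled roughly like the sketch below. It is not the exact (elided) snippet from above: the names `MultiThreadMapData`, `PrefetchDataZMQ`, `nr_thread` and `nr_proc` follow older tensorpack releases (newer ones rename some of them), and `decode` stands in for whatever pre-processing you actually use:

```python
import cv2
from tensorpack.dataflow import (dataset, MultiThreadMapData,
                                 PrefetchDataZMQ, BatchData)


def decode(dp):
    # dp is a (filename, label) pair; a real setup would also apply augmentors here
    return [cv2.imread(dp[0], cv2.IMREAD_COLOR), dp[1]]


# 1. one thread iterates over a shuffled list of (filename, label) pairs
ds = dataset.ILSVRC12Files('/path/to/ILSVRC12', 'train', shuffle=True)
# 2. 25 worker threads turn them into (preprocessed image, label) pairs,
#    fed from a buffer of 1000 pairs
ds = MultiThreadMapData(ds, nr_thread=25, map_func=decode, buffer_size=1000)
# 3. run steps 1 and 2 in one separate process; results come back over ZeroMQ
ds = PrefetchDataZMQ(ds, nr_proc=1)
# 4. the main process makes batches
ds = BatchData(ds, 256)
```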
Note that in an actual training setup, I used the above multiprocess version for the training set since
...
...
Then we add necessary transformations:
```python
ds = BatchData(ds, 256)
```
1. `LMDBDataPoint` deserializes the datapoints (from raw bytes to [jpeg bytes, label] -- what we dumped in `RawILSVRC12`)
2. Use OpenCV to decode the first component (jpeg bytes) into an ndarray
3. Apply augmentations to the ndarray
Both imdecode and the augmentors can be quite slow. We can parallelize them like this:
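
(The exact snippet is not reproduced here; the sketch below assumes the older dataflow class names `LMDBDataPoint` and `PrefetchDataZMQ`, which newer tensorpack releases rename, and an illustrative augmentor list.)

```python
import cv2
from tensorpack.dataflow import (LMDBDataPoint, MapDataComponent,
                                 AugmentImageComponent, PrefetchDataZMQ,
                                 BatchData, imgaug)

# illustrative augmentors; the real training augmentations are defined elsewhere
augmentors = [imgaug.ResizeShortestEdge(256), imgaug.CenterCrop((224, 224))]

ds = LMDBDataPoint('/path/to/ILSVRC12-train.lmdb', shuffle=False)
# decode the jpeg bytes (component 0) into an ndarray
ds = MapDataComponent(ds, lambda x: cv2.imdecode(x, cv2.IMREAD_COLOR), 0)
# apply the augmentations to the ndarray
ds = AugmentImageComponent(ds, augmentors)
# fork the whole pipeline above into 25 processes, connected over ZeroMQ
ds = PrefetchDataZMQ(ds, 25)
ds = BatchData(ds, 256)
```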
Use Keras to define a model and train it with efficient tensorpack trainers.
Keras alone has various overheads. In particular, it is not efficient when working with large models.
The article [Towards Efficient Multi-GPU Training in Keras with TensorFlow](https://medium.com/rossum/towards-efficient-multi-gpu-training-in-keras-with-tensorflow-8a0091074fb2)
mentions some of them.
Even on a single GPU, tensorpack can run [1.1~2x faster](https://github.com/tensorpack/benchmarks/tree/master/other-wrappers)
than the equivalent Keras code. The gap becomes larger when you scale.
Tensorpack and [horovod](https://github.com/uber/horovod/blob/master/examples/keras_imagenet_resnet50.py)
are the only two tools I know that can scale the training of a large Keras model.