Commit f130c10f authored by Yuxin Wu

update docs

parent 70d95c17
......@@ -13,9 +13,9 @@ It's Yet Another TF wrapper, but different in:
1. Focus on __training speed__.
+ Speed comes for free with tensorpack -- it uses TensorFlow in the __efficient way__ with no extra overhead.
On various CNNs, it runs 1.1~2x faster than the equivalent Keras code.
On different CNNs, it runs [1.1~3.5x faster](https://github.com/tensorpack/benchmarks/tree/master/other-wrappers) than the equivalent Keras code.
+ Data-parallel multi-GPU training is off-the-shelf to use. It runs as fast as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks).
+ Data-parallel multi-GPU training is off-the-shelf to use. It scales as well as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks).
+ See [tensorpack/benchmarks](https://github.com/tensorpack/benchmarks) for the benchmark scripts.
......
......@@ -66,7 +66,7 @@ We will now add the cheapest pre-processing now to get an ndarray in the end ins
ds = AugmentImageComponent(ds, [imgaug.Resize(224)])
ds = BatchData(ds, 256)
```
You'll start to observe slow down after adding more pre-processing (such as those in the [ResNet example](../examples/ResNet/imagenet_utils.py)).
You'll start to observe slow down after adding more pre-processing (such as those in the [ResNet example](../examples/ImageNetModels/imagenet_utils.py)).
Now it's time to add threads or processes:
```eval_rst
.. code-block:: python
......@@ -127,7 +127,7 @@ If you identify this as a bottleneck, you can also use:
Let's summarize what the above dataflow does:
1. One thread iterates over a shuffled list of (filename, label) pairs, and puts them into a queue of size 1000.
2. 25 worker threads take pairs and make them into (preprocessed image, label) pairs.
3. Both 1 and 2 happen in one separate process, and the results are sent back to main process through ZeroMQ.
3. Both 1 and 2 happen together in a separate process, and the results are sent back to the main process through ZeroMQ.
4. The main process makes batches, and other tensorpack modules will then take care of how they should go into the graph (a code sketch follows).
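Put into code, the pipeline just summarized could look roughly like the sketch below. This is a reconstruction from the summary above rather than the tutorial's exact code (which sits outside this hunk); the dataset path is a placeholder.
```python
import cv2
from tensorpack.dataflow import dataset, MultiThreadMapData, PrefetchDataZMQ, BatchData

# 1. one thread iterates over shuffled (filename, label) pairs
ds = dataset.ILSVRC12Files('/path/to/ILSVRC12', 'train', shuffle=True)
# 2. 25 worker threads turn them into (image, label) pairs, buffered in a queue of size 1000
ds = MultiThreadMapData(
    ds, nr_thread=25,
    map_func=lambda dp: [cv2.imread(dp[0], cv2.IMREAD_COLOR), dp[1]],
    buffer_size=1000)
# 3. run steps 1-2 in one separate process; results come back to the main process over ZeroMQ
ds = PrefetchDataZMQ(ds, nr_proc=1)
# 4. the main process makes batches
ds = BatchData(ds, 256)
```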
Note that in an actual training setup, I used the above multiprocess version for training set since
......@@ -195,8 +195,8 @@ Then we add necessary transformations:
ds = BatchData(ds, 256)
```
1. `LMDBDataPoint` deserialize the datapoints (from raw bytes to [jpeg_string, label] -- what we dumped in `RawILSVRC12`)
2. Use OpenCV to decode the first component into ndarray
1. `LMDBDataPoint` deserializes the datapoints (from raw bytes to [jpeg bytes, label] -- what we dumped in `RawILSVRC12`)
2. Use OpenCV to decode the first component (jpeg bytes) into an ndarray
3. Apply augmentations to the ndarray
Both imdecode and the augmentors can be quite slow. We can parallelize them like this:
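(The actual code falls just outside this hunk. A hedged sketch of such a parallelized pipeline might look as follows; the LMDB path and the single `Resize` augmentor are placeholders for the real ones.)
```python
import cv2
import numpy as np
from tensorpack.dataflow import (LMDBData, LocallyShuffleData, LMDBDataPoint,
                                 MapDataComponent, AugmentImageComponent,
                                 PrefetchDataZMQ, BatchData, imgaug)

ds = LMDBData('/path/to/ILSVRC12-train.lmdb', shuffle=False)  # sequential raw read
ds = LocallyShuffleData(ds, 50000)       # shuffle within a 50k buffer
ds = LMDBDataPoint(ds)                   # deserialize into [jpeg bytes, label]
# decode component 0 (jpeg bytes) into an ndarray
ds = MapDataComponent(
    ds, lambda x: cv2.imdecode(np.frombuffer(x, np.uint8), cv2.IMREAD_COLOR), 0)
ds = AugmentImageComponent(ds, [imgaug.Resize(224)])
ds = PrefetchDataZMQ(ds, 25)             # run decoding + augmentation in 25 processes
ds = BatchData(ds, 256)
```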
......
......@@ -6,22 +6,22 @@ Performance is different across machines and tasks,
so you need to figure out most parts on your own.
Here's a list of things you can do when your training is slow.
If you ask for help understanding and improving the speed, PLEASE do them and include your findings.
If you ask for help to understand and improve the speed, PLEASE do them and include your findings.
## Figure out the bottleneck
1. If you use feed-based input (not recommended) and datapoints are large, data is likely to become the bottleneck.
2. If you use queue-based input + dataflow, you can look for the queue size statistics in
training log. Ideally the input queue should be near-full (default size is 50).
If the size is near-zero, data is the bottleneck.
3. If GPU utilization is low, it may be because of slow data, or some ops are inefficient. Also make sure GPUs are not locked in P8 state.
2. If you use queue-based input + DataFlow, always pay attention to the queue size statistics in
training log. Ideally the input queue should be nearly full (default size is 50).
__If the queue size is close to zero, data is the bottleneck. Otherwise, it's not.__
3. If GPU utilization is low but the queue is full, the graph is the bottleneck:
either there is some communication inefficiency, or some ops you use are inefficient (e.g. CPU ops). Also make sure GPUs are not locked in the P8 state.
## Benchmark the components
1. (usually not needed) Use `data=DummyConstantInput(shapes)` for training,
so that the iterations only take data from a constant tensor.
This will benchmark the graph without the overhead of data.
2. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow by a constant DataFlow.
This is almost the same as (1).
1. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow with a constant DataFlow.
This will benchmark the graph without the possible overhead of DataFlow.
2. (usually not needed) Use `data=DummyConstantInput(shapes)` for training, so that the iterations only take data from a constant tensor.
No DataFlow is involved in this case.
3. If you're using a TF-based input pipeline you wrote, you can simply run it in a loop and test its speed.
4. Use `TestDataSpeed(mydf).start()` to benchmark your DataFlow (a sketch follows this list).
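For example, (1) and (4) can be combined into a tiny runnable check; the shapes below are placeholders for an ImageNet-like (image, label) batch.
```python
from tensorpack.dataflow import FakeData, TestDataSpeed

# (4) benchmark a DataFlow directly -- FakeData here is a stand-in for your own DataFlow
df = FakeData([[64, 224, 224, 3], [64]], size=1000, random=False)
TestDataSpeed(df, size=1000).start()

# (1) to benchmark the graph without DataFlow overhead, pass such a FakeData
#     instance as `dataflow=` in your TrainConfig instead of the real DataFlow
```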
......@@ -31,15 +31,18 @@ Note that you should only look at iteration speed after about 50 iterations, sin
## Investigate DataFlow
Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial, so you know what your DataFlow is doing.
Then, make modifications and benchmark to understand which part is the bottleneck.
Use [TestDataSpeed](../modules/dataflow.html#tensorpack.dataflow.TestDataSpeed).
Do __NOT__ look at training speed when you benchmark a DataFlow.
Benchmark your DataFlow with modifications to understand which part is the bottleneck. Some examples include:
Some example things to try:
1. Benchmark only raw reader (and perhaps add some parallelism).
1. Benchmark only the raw reader (and perhaps add some parallelism).
2. Gradually add some pre-processing and see how the performance changes.
3. Change the number of parallel processes or threads (a sketch of these steps follows this list).
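For instance, steps 1-3 could be carried out along these lines; the dataset path is a placeholder and the classes to wrap depend on your own pipeline.
```python
from tensorpack.dataflow import (dataset, AugmentImageComponent,
                                 PrefetchDataZMQ, TestDataSpeed, imgaug)

# 1. benchmark only the raw reader
ds = dataset.ILSVRC12('/path/to/ILSVRC12', 'train', shuffle=True)
TestDataSpeed(ds, size=2000).start()

# 2. gradually add pre-processing and measure again
ds = AugmentImageComponent(ds, [imgaug.Resize(224)])
TestDataSpeed(ds, size=2000).start()

# 3. change the amount of parallelism and measure again
ds = PrefetchDataZMQ(ds, 25)
TestDataSpeed(ds, size=2000).start()
```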
A DataFlow could be blocked by CPU/disk/network/IPC bandwidth. Only by benchmarking will you
know the reason and improve it accordingly, e.g.:
A DataFlow could be blocked by CPU/disk/network/IPC bandwidth.
Only by benchmarking will you know the reason and improve it accordingly, e.g.:
1. Use a single-file database to avoid random reads on the hard disk.
2. Use less pre-processing, or write faster pre-processing steps with whatever tools you have.
......@@ -57,7 +60,7 @@ Or you can use `GraphProfiler` callback to benchmark the graph. It will
dump runtime tracing information (to either TensorBoard or chrome) to help diagnose the issue.
Remember not to use the first several iterations.
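A possible way to enable it is sketched below; the keyword arguments are assumptions from memory, so check the `GraphProfiler` documentation for the exact options.
```python
from tensorpack.callbacks import GraphProfiler

# add this object to the `callbacks=[...]` list of your TrainConfig
profiler = GraphProfiler(dump_tracing=True,   # dump chrome://tracing files (assumed kwarg)
                         dump_event=True)     # also write to TensorBoard events (assumed kwarg)
```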
### Slow with single-GPU
### Slow on single-GPU
This is literally saying TF ops are slow. Usually there isn't much you can do, except to optimize the kernels.
But there may be something cheap you can try:
......
......@@ -7,7 +7,10 @@ Converted models can also be found at [tensorpack model zoo](http://models.tenso
Download: https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet
Convert: `python -m tensorpack.utils.loadcaffe PATH/TO/CAFFE/{deploy.prototxt,bvlc_alexnet.caffemodel} alexnet.npz`
Convert:
```
python -m tensorpack.utils.loadcaffe PATH/TO/CAFFE/{deploy.prototxt,bvlc_alexnet.caffemodel} alexnet.npz
```
Run: `./load-alexnet.py --load alexnet.npz --input cat.png`
......@@ -36,7 +39,10 @@ wget http://pearl.vasc.ri.cmu.edu/caffe_model_github/model/_trained_MPI/pose_ite
wget https://github.com/shihenw/convolutional-pose-machines-release/raw/master/model/_trained_MPI/pose_deploy_resize.prototxt
```
Convert: `python -m tensorpack.utils.loadcaffe pose_deploy_resize.prototxt pose_iter_320000.caffemodel CPM-original.npz`
Convert:
```
python -m tensorpack.utils.loadcaffe pose_deploy_resize.prototxt pose_iter_320000.caffemodel CPM-original.npz
```
Run: `python load-cpm.py --load CPM-original.npz --input test.jpg`
......
......@@ -270,7 +270,7 @@ def get_train_dataflow(add_mask=False):
return ret
ds = MapData(ds, preprocess)
ds = PrefetchDataZMQ(ds, 1)
ds = PrefetchDataZMQ(ds, 3)
return ds
......
......@@ -28,7 +28,7 @@ Evaluate the [pretrained model](http://models.tensorpack.com/ShuffleNet/):
This Inception-BN script reaches 27% single-crop error after 300k steps with 6 GPUs.
This VGG16 script reaches 28.8% single-crop error after 100 epochs.
This VGG16 script reaches 28.8% single-crop error after 100 epochs (30h with 8 P100s). It gets 1% better if BN is enabled.
### ResNet, DoReFa-Net
......
......@@ -7,6 +7,9 @@ Use Keras to define a model a train it with efficient tensorpack trainers.
Keras alone has various overheads. In particular, it is not efficient when working on large models.
The article [Towards Efficient Multi-GPU Training in Keras with TensorFlow](https://medium.com/rossum/towards-efficient-multi-gpu-training-in-keras-with-tensorflow-8a0091074fb2)
mentions some of them.
Even on a single GPU, tensorpack can run [1.1~2x faster](https://github.com/tensorpack/benchmarks/tree/master/other-wrappers)
than the equivalent Keras code. The gap becomes larger when you scale.
Tensorpack and [horovod](https://github.com/uber/horovod/blob/master/examples/keras_imagenet_resnet50.py)
are the only two tools I know that can scale the training of a large Keras model.
......
......@@ -61,7 +61,12 @@ class HDF5Data(RNGDataFlow):
class LMDBData(RNGDataFlow):
""" Read a LMDB database and produce (k,v) raw string pairs.
"""
Read a LMDB database and produce (k,v) raw bytes pairs.
The raw bytes are usually not what you're interested in.
You might want to use
:class:`LMDBDataDecoder`, :class:`LMDBDataPoint`, or apply a
mapper function after :class:`LMDBData`.
"""
def __init__(self, lmdb_path, shuffle=True, keys=None):
"""
......@@ -161,29 +166,35 @@ class LMDBDataDecoder(MapData):
class LMDBDataPoint(MapData):
"""
Read a LMDB file and produce deserialized datapoints.
It only accepts the database produced by
It **only** accepts the database produced by
:func:`tensorpack.dataflow.dftools.dump_dataflow_to_lmdb`,
which uses :func:`tensorpack.utils.serialize.dumps` for serialization.
Example:
.. code-block:: python
ds = LMDBDataPoint("/data/ImageNet.lmdb", shuffle=False)
ds = LMDBDataPoint("/data/ImageNet.lmdb", shuffle=False) # read and decode
# alternatively:
ds = LMDBData("/data/ImageNet.lmdb", shuffle=False)
ds = LocallyShuffleData(ds, 50000)
ds = LMDBDataPoint(ds)
# The above is equivalent to:
ds = LMDBData("/data/ImageNet.lmdb", shuffle=False) # read
ds = LMDBDataPoint(ds) # decode
# Sometimes it makes sense to separate reading and decoding
# to be able to make decoding parallel.
"""
def __init__(self, *args, **kwargs):
"""
Args:
args, kwargs: Same as in :class:`LMDBData`.
In addition, args[0] can be a :class:`LMDBData` instance.
In this case args[0] has to be the only argument.
"""
    if isinstance(args[0], DataFlow):
        ds = args[0]
        assert len(args) == 1 and len(kwargs) == 0, \
            "No more arguments are allowed if LMDBDataPoint is called with a LMDBData instance!"
    else:
        ds = LMDBData(*args, **kwargs)
......