Commit a0247332 authored by Yuxin Wu

docs update

parent a62e68a3
...@@ -41,7 +41,7 @@ It's Yet Another TF wrapper, but different in:
+ Data-Parallel Multi-GPU training is off-the-shelf to use. It is as fast as Google's [benchmark code](https://github.com/tensorflow/benchmarks).
3. Focus on large datasets.
+ It's painful to read/preprocess data from TF. Use __DataFlow__ to efficiently process large datasets such as ImageNet in __pure Python__.
+ DataFlow has a unified interface, so you can compose and reuse them to perform complex preprocessing.
4. Interface of extensible __Callbacks__.
...
...@@ -227,6 +227,8 @@ html_show_copyright = True
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# avoid li fonts being larger
# TODO but li indices fonts are still larger
html_compact_lists = False
# Language to be used for generating the HTML full-text search index.
...
...@@ -26,7 +26,7 @@ tensorpack.tfutils.gradproc module ...@@ -26,7 +26,7 @@ tensorpack.tfutils.gradproc module
:show-inheritance: :show-inheritance:
tensorpack.tfutils.model_utils module tensorpack.tfutils.model_utils module
------------------------------------ --------------------------------------
.. automodule:: tensorpack.tfutils.model_utils .. automodule:: tensorpack.tfutils.model_utils
:members: :members:
...@@ -34,7 +34,7 @@ tensorpack.tfutils.model_utils module ...@@ -34,7 +34,7 @@ tensorpack.tfutils.model_utils module
:show-inheritance: :show-inheritance:
tensorpack.tfutils.scope_utils module tensorpack.tfutils.scope_utils module
------------------------------------ --------------------------------------
.. automodule:: tensorpack.tfutils.scope_utils .. automodule:: tensorpack.tfutils.scope_utils
:members: :members:
......
...@@ -5,11 +5,3 @@ tensorpack.train package
:members:
:undoc-members:
:show-inheritance:
tensorpack.train.monitor module
------------------------------------
.. automodule:: tensorpack.train.monitor
:members:
:undoc-members:
:show-inheritance:
...@@ -59,13 +59,3 @@ generator = df.get_data()
for dp in generator:
    # dp is now a list. do whatever
```
### Efficiency
DataFlow is pure Python -- a convenient but slow language (compared to C++). But faster data loading doesn't always mean faster
training: we only need data to be __fast enough__.
DataFlow is fast enough for problems up to the scale of multi-GPU ImageNet training.
See the [efficient dataflow tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html)
for details.
Therefore, for most use cases, writing format conversion/preprocessing code with TensorFlow operators doesn't help you at all.
# Input Pipeline

This tutorial covers some general basics of the possible methods to send data from external sources to the TensorFlow graph,
and how tensorpack supports these methods.
You don't have to read it because these are details under the tensorpack interface,
but knowing it could help you understand the efficiency and choose the best input pipeline for your task.

## Prepare Data in Parallel

![prefetch](https://cloud.githubusercontent.com/assets/1381301/26525192/36e5de48-4304-11e7-88ab-3b790bd0e028.png)

No matter what framework you use, the common-sense rule is:
start preparing the next (batch of) data while you're training!

The reasons are:

1. Data preparation often consumes non-trivial time (depending on the actual problem).
2. Data preparation often uses completely different resources from training --
   doing them together doesn't slow you down. In fact you can further parallelize different stages in
   the preparation, because they also use different resources (as shown in the figure).
3. Data preparation often doesn't depend on the result of the previous training step.
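
For example, in tensorpack the preparation itself can be moved into background processes at the DataFlow level. The following is a minimal sketch; `MyDataFlow` and the process count are placeholders, and the exact import path may vary across tensorpack versions:

```python
from tensorpack.dataflow import PrefetchDataZMQ

df = MyDataFlow()                      # placeholder: your (possibly slow) DataFlow
df = PrefetchDataZMQ(df, nr_proc=4)    # prepare data in 4 background processes, sent over ZMQ pipes
```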
Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
4 P100 GPUs can train ResNet50 at 852 images/sec, and the size of those images is 852\*224\*224\*3\*4 bytes = 489MB.
Assuming you have 5GB/s `memcpy` bandwidth, simply copying the data once would take 0.1s -- slowing
down your training by 10%. Think about how many more copies are made during your preprocessing.
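
For a back-of-the-envelope check of these numbers (plain Python, not tensorpack code; the 5GB/s figure is the bandwidth assumed above):

```python
# Rough check of the numbers quoted above.
images_per_sec = 852                    # ResNet50 throughput on 4 P100s (tensorflow/benchmarks)
bytes_per_image = 224 * 224 * 3 * 4     # one 224x224 RGB image in float32
data_per_sec = images_per_sec * bytes_per_image
print(data_per_sec / 2**20)             # ~489 MB of input consumed per second of training

memcpy_bandwidth = 5 * 2**30            # assumed 5GB/s memcpy bandwidth
print(data_per_sec / memcpy_bandwidth)  # ~0.1s spent per extra copy, i.e. ~10% slowdown
```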
Failure to hide the data preparation latency is the major reason why people
cannot see good GPU utilization. Always choose a framework that allows latency hiding.
## Python or C++ ?

The above discussion is valid regardless of what you use to load/preprocess data: Python code or TensorFlow operators (written in C++).

The benefits of using TensorFlow ops are:
* Faster preprocessing.
* No "Copy to TF" (i.e. `feed_dict`) stage.

Python, on the other hand, is much easier to write and has many more libraries to use.

Though C++ ops are potentially faster, they're usually __not necessary__.
As long as data preparation runs faster than training, it makes no difference at all.
And for most types of problems, up to the scale of multi-GPU ImageNet training,
Python can offer enough speed if written properly (e.g. use `tensorpack.dataflow`).
See the [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html) tutorial.

When you use Python to load/preprocess data, TF `QueueBase` can help hide the "Copy to TF" latency,
and TF `StagingArea` can help hide the "Copy to GPU" latency.
They are used by most examples in tensorpack;
most other TensorFlow wrappers, however, are `feed_dict` based -- no latency hiding at all.
This is the major reason why tensorpack is [faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6).
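
The following is a minimal sketch of the idea behind hiding the "Copy to TF" latency with a queue, written in raw TensorFlow rather than the tensorpack API; `get_some_data` stands in for your own Python loading/preprocessing code:

```python
import threading
import tensorflow as tf

# A small queue sits between the Python side and the training graph.
x_ph = tf.placeholder(tf.float32, [None, 224, 224, 3])
y_ph = tf.placeholder(tf.int32, [None])
queue = tf.FIFOQueue(capacity=50, dtypes=[tf.float32, tf.int32])
enqueue_op = queue.enqueue([x_ph, y_ph])
x, y = queue.dequeue()   # build the model and train_op from these dequeued tensors

def feed_loop(sess):
    """Runs in a background thread: keeps the queue filled while training runs."""
    while True:
        xv, yv = get_some_data()   # placeholder for your Python loading/preprocessing
        sess.run(enqueue_op, feed_dict={x_ph: xv, y_ph: yv})

# In the main thread (after creating `sess` and `train_op`):
#   threading.Thread(target=feed_loop, args=(sess,), daemon=True).start()
#   while True:
#       sess.run(train_op)   # no feed_dict in the training loop
```

Within tensorpack you don't write this loop yourself; the trainers and `InputSource` implementations set up the queue and the feeding for you.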
## InputSource

`InputSource` is an abstract interface in tensorpack that describes where the input comes from and how it enters the graph.
For example, the input may:

1. Come from a DataFlow and get fed to the graph.
2. Come from a DataFlow and get prefetched on CPU by a TF queue.
3. Come from a DataFlow, prefetched on CPU by a TF queue, then prefetched on GPU by a TF StagingArea.
4. Come from some TF native reading pipeline.
5. Come from some ZMQ pipe, where the load/preprocessing may happen on a different machine.

You can use the `TrainConfig(data=)` option to use a customized `InputSource`.
Usually you don't need this API, and only have to specify `TrainConfig(dataflow=)`, because
tensorpack trainers automatically add proper prefetching for you.
In cases where you want to use TF ops rather than DataFlow, you can use `TensorInput` as the `InputSource`
(see the [PTB example](https://github.com/ppwwyyxx/tensorpack/tree/master/examples/PennTreebank)).
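
As a rough sketch of the two common configurations (`my_dataflow` and `my_model` are placeholders, and `QueueInput` is assumed to be the queue-based `InputSource`; import paths may vary across versions):

```python
from tensorpack import TrainConfig, QueueInput

# Most common: give the trainer a DataFlow and let it add prefetching for you.
config = TrainConfig(
    dataflow=my_dataflow,   # placeholder: your DataFlow instance
    model=my_model,         # placeholder: your ModelDesc
    callbacks=[...],
)

# Explicit InputSource, e.g. when you want to customize it:
config = TrainConfig(
    data=QueueInput(my_dataflow),
    model=my_model,
    callbacks=[...],
)
```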
Thread 1 & 2 runs in parallel and the faster one will block to wait for the slower one. ## Figure out the Bottleneck
So the overall throughput will appear to be the slower one.
There is no way to accurately benchmark the two dependent threads while they are running, Training and data preparation run in parallel and the faster one will block to wait for the slower one.
without introducing overhead. However, are ways to understand which one is the bottleneck: So the overall throughput will be dominated by the slower one.
1. Use the average occupancy (size) of the queue. This information is summarized by default. There is no way to accurately benchmark two threads waiting on queues,
If the queue is nearly empty (default size 50), then the input source is the bottleneck. without introducing overhead. However, there are ways to understand which one is the bottleneck:
2. Benchmark them separately. You can use `TestDataSpeed` to benchmark a DataFlow, and 1. Use the average occupancy (size) of the queue. This information is summarized in tensorpack by default.
use `FakeData` as a fast replacement in a dry run, to benchmark the training iterations. If the queue is nearly empty (default size is 50), then the input source is the bottleneck.
If you found your input is the bottleneck, then you'll need to think about how to speed up your data. 2. Benchmark them separately. Use `TestDataSpeed` to benchmark a DataFlow.
You may either change `InputSource`, or look at [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html). Use `FakeData(..., random=False)` as a fast DataFlow, to benchmark the training iterations plus the copies.
Or use `DummyConstantInput` as a fast InputSource, to benchmark the training iterations only.
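
For instance, a DataFlow can be benchmarked on its own roughly like this (a sketch; `MyDataFlow` is a placeholder, and the exact method name may differ across tensorpack versions):

```python
from tensorpack.dataflow import TestDataSpeed

df = MyDataFlow()                         # placeholder: the DataFlow you want to benchmark
TestDataSpeed(df, size=1000).start_test() # iterate 1000 datapoints and report the speed
```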
...@@ -28,7 +28,7 @@ config = TrainConfig(
    callbacks=[...]
)
# start training (with a slow trainer. See 'tutorials - Input Pipeline' for details):
# SimpleTrainer(config).train()
# start training with queue prefetch:
...
...@@ -118,7 +118,7 @@ class ModelDesc(object):
``self.cost``. You can override :meth:`_get_cost()` if needed.
This function also applies the collection
``tf.GraphKeys.REGULARIZATION_LOSSES`` to the cost automatically.
Because slim users would expect the regularizer being automatically applied once used in slim layers.
"""
cost = self._get_cost()
...
...@@ -88,8 +88,7 @@ class SimpleFeedfreeTrainer(SingleCostFeedfreeTrainer):
def __init__(self, config):
"""
Args:
    config (TrainConfig): ``config.data`` must exist and is a :class:`FeedfreeInput`.
"""
self._input_source = config.data
assert isinstance(self._input_source, FeedfreeInput), self._input_source
...
...@@ -7,4 +7,5 @@ exclude = .git,
snippet,
docs,
examples,
_test.py,
docs/conf.py