Commit 6c185f3b authored by Yuxin Wu

update docs

parent d5ef7e8b
@@ -31,7 +31,7 @@ A common pipeline usually would
__read from disk (or other sources),
apply transformations,
group into batches, prefetch data__, etc., and all __run in parallel__.
A simple DataFlow pipeline is like the following:
````python
# a DataFlow you implement to produce [tensor1, tensor2, ..] lists from whatever sources:
df = MyDataFlow(dir='/my/data', shuffle=True)
# apply transformations to your data:
df = MapDataComponent(df, lambda t: transform(t), 0)
# group into batches of size 128:
df = BatchData(df, 128)
# start 3 processes to run the dataflow in parallel:
df = MultiProcessRunnerZMQ(df, 3)
````
A list of built-in DataFlow to compose with can be found in the [API docs](../modules/dataflow.html).
You can also find complicated real-life DataFlow pipelines in the [ImageNet training script](../examples/ImageNetModels/imagenet_utils.py)
or other tensorpack examples.
### Parallelize the Pipeline
@@ -13,7 +13,7 @@ The average resolution is about 400x350 <sup>[[1]]</sup>.
Following the [ResNet example](../examples/ResNet), we need images in their original resolution,
so we will read the original dataset (instead of a down-sampled version), and
then apply complicated preprocessing to it.
We hope to reach a speed of **1k~5k images per second** to keep GPUs busy.
Some things to know before reading:
1. For smaller datasets (e.g. several GBs of images with lightweight preprocessing), a simple reader plus some multiprocess runner should usually work well enough.
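   For a concrete picture, here is a minimal sketch of such a simple setup, using the built-in `ILSVRC12` reader (the dataset path and the number of processes are just placeholders):

   ```python
   from tensorpack.dataflow import MultiProcessRunnerZMQ
   from tensorpack.dataflow.dataset import ILSVRC12

   # read [image, label] datapoints from the raw ImageNet directory structure:
   df = ILSVRC12('/path/to/ILSVRC12', 'train', shuffle=True)
   # run the reader in 8 parallel processes, connected by ZMQ pipes:
   df = MultiProcessRunnerZMQ(df, num_proc=8)
   ```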
@@ -39,49 +39,7 @@ This is one of the reasons why tensorpack is [faster](https://github.com/tensorp
The above discussion is valid regardless of what you use to load/preprocess data:
Python code, TensorFlow operators, or a mix of the two.
Both are supported in tensorpack, but we recommend using Python.
See more discussions at [Why DataFlow?](/tutorial/philosophy/dataflow.html)
## InputSource
@@ -104,7 +62,9 @@ Some choices are:
Come from some ZeroMQ pipe, where the reading/preprocessing may happen in a different process or even a different machine.
Typically, we recommend using `DataFlow + QueueInput` as it's good for most use cases.
`QueueInput` and `StagingInput` can help you hide the copy latency to TF and then to GPU.
If your data has to come from a separate process for whatever reasons, use `ZMQInput`.
If you need to use TF reading ops directly, either define a `tf.data.Dataset`
and use `TFDatasetInput`, or use `TensorInput`.
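
As a rough sketch of the recommended setup (the model and the DataFlow here are hypothetical, and the trainer choice is just an example; this assumes a recent tensorpack version):

```python
from tensorpack import (QueueInput, StagingInput, TrainConfig,
                        SimpleTrainer, launch_train_with_config)

df = MyDataFlow(...)      # hypothetical DataFlow producing training datapoints
inp = QueueInput(df)      # hides the Python -> TF copy latency behind a queue
inp = StagingInput(inp)   # additionally hides the CPU -> GPU copy latency

config = TrainConfig(model=MyModel(), data=inp, steps_per_epoch=1000)
launch_train_with_config(config, SimpleTrainer())
```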
@@ -5,16 +5,15 @@ The first thing to note: __you never have to write a layer__.
Tensorpack layers are nothing but wrappers of symbolic functions.
In tensorpack, you can use __any__ symbolic function you have written or seen elsewhere, with or without tensorpack layers.
If you would like, you can make a symbolic function become a "layer" by following some simple rules, and then gain benefits from tensorpack.
Take a look at the [ShuffleNet example](../../examples/ImageNetModels/shufflenet.py#L22)
to see an example of how to define a custom layer:
```python
@layer_register(log_shape=True)
def DepthConv(x, out_channel, kernel_shape, padding='SAME', stride=1,
              W_init=None, activation=tf.identity):
```
Basically, a tensorpack layer is just a symbolic function, but with the following rules:
* It is decorated by `@layer_register`.
* Its first argument is the input tensor (or a list of tensors).
* It returns either a tensor or a list of tensors as its output.
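
As an illustrative sketch (the `MyScale` layer and its hyperparameter are made up for this example), a function following these rules becomes a layer that is called with a name as its first argument:

```python
import tensorflow as tf
from tensorpack import layer_register

@layer_register(log_shape=True)
def MyScale(x, init_value=1.0):
    # rule: the first argument is the input tensor; the rest are hyperparameters
    scale = tf.get_variable('scale', [],
                            initializer=tf.constant_initializer(init_value))
    return x * scale  # rule: return the output tensor(s)

# a registered layer is then called with a variable-scope name as its first argument:
# output = MyScale('scale1', input_tensor, init_value=0.5)
```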
@@ -22,6 +22,7 @@ Basic Tutorials
summary
inference
faq
philosophy/dataflow
Advanced Tutorials
==================
# Why DataFlow?
There are many other data loading solutions for deep learning.
Here we explain why you may want to use Tensorpack DataFlow for your own good:
it's easy and fast (enough).
Note that this article may contain subjective opinions and we're happy to hear different voices.
### How Fast Do You Actually Need?
Your data pipeline **only has to be fast enough**.
In practice, you should always make sure your data pipeline runs
asynchronously with your training.
The method to do so is different in each training framework,
and in tensorpack this is automatically done by the [InputSource](/tutorial/extend/input-source.html)
interface.
Once you make sure the data pipeline runs async with your training,
the data pipeline only needs to be as fast as the training.
**Making it faster brings no gain** to overall throughput;
it only has to be fast enough.
If you have used other data loading libraries, you may doubt
how easy it is to make a data pipeline fast enough with pure Python.
In fact, it is usually not hard with DataFlow.
For example: if you train a ResNet-50 on ImageNet,
DataFlow is fast enough for you unless you use
8 V100s with both FP16 and XLA enabled, which most people don't.
For tasks that are less data-hungry (e.g., object detection, or most NLP tasks),
DataFlow is already overkill.
See the [Efficient DataFlow](/tutorial/efficient-dataflow.html) tutorial on how
to build a fast Python loader with DataFlow.
There is no reason to try a more complicated solution
when you don't know whether a simple one is fast enough.
As for us, we could optimize DataFlow even further, but we just haven't found a reason to do so.
### Which Data Format?
Certain libraries advocate for a new binary data format (e.g., TFRecords, RecordIO).
Do you need to use them?
We think you usually do not. Not after you try DataFlow.
1. **Not Easy**: To use a new binary format, you need to write a script to convert
   your data from its original format to this new format, and then read it back in
   the training workers. It's a waste of your effort: the intermediate format does
   not have to exist.
1. **Still Not Easy**: There are cases when having an intermediate format is useful
   for performance reasons: for example, to apply some one-time expensive
   preprocessing to your dataset, or to merge small files into large ones to reduce
   filesystem burden.
   However, those binary data formats are not necessarily good choices for such cases.
   Why use a single dedicated binary format when you could use something else?
   A different format may bring you:
   * Simpler code for data loading.
   * Easier visualization.
   * Interoperability with other libraries.
   * More functionalities.
   After all, why merge all the images into one binary file on disk,
   when you know that saving them separately is fast enough for your task?
1. **Not Necessarily Fast**:
   Formats like TFRecords and RecordIO are only as fast as your disk, and other
   libraries can of course reach the same speed.
   Decades of engineering in data systems have produced
   many other competitive formats, like LMDB and HDF5, that are:
   * Equally fast (if not faster)
   * More generic (not tied to your training framework)
   * Richer in features (e.g. random access)
   The only unique benefit a format like TFRecords or RecordIO may give you
   is native integration with the training framework, which may bring a
   small speed gain.
On the other hand, DataFlow is:
1. **Easy**: Any Python function that produces data can be made a DataFlow and
   used for training. No need for an intermediate format when you don't need one.
1. **Flexible**: Since it is pure Python, you can still switch to a different data
   format when you need one. We also provide tools to easily
   [serialize a DataFlow](../../modules/dataflow.html#tensorpack.dataflow.LMDBSerializer)
   to a single-file binary format when that helps, as sketched below.
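
For instance, a minimal sketch of that serialization round-trip (the source DataFlow and the paths are hypothetical):

```python
from tensorpack.dataflow import LMDBSerializer

df = MyDataFlow(...)  # hypothetical DataFlow to serialize
# one-time conversion of the whole dataset into a single LMDB file:
LMDBSerializer.save(df, '/path/to/data.lmdb')
# fast sequential reads from the LMDB file at training time:
df = LMDBSerializer.load('/path/to/data.lmdb', shuffle=False)
```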
### Alternative Data Loading Solutions
Some frameworks have also provided good framework-specific solutions for data loading.
Besides the fact that DataFlow is framework-agnostic, there are other reasons you
might prefer it over the alternatives:
#### tf.data or other TF operations
The huge disadvantage of loading data in a computation graph is obvious:
__it's extremely inflexible__.
Why would you ever want to do anything in a computation graph? Here are the possible reasons:
1. Automatic differentiation
2. Run the computation on different devices
3. Serialize the description of your computation
4. Automatic performance optimization
They all make sense for training neural networks, but **not much for data loading**.
Unlike running a neural network model, data processing is a complicated and poorly-structured task.
You need to handle different formats, corner cases, noisy data, and combinations of data.
Doing so requires conditionals, loops, data structures, and sometimes even exception handling.
These operations are __naturally not the right task for a symbolic graph__,
and they are hard to debug because the graph is not Python.
Let's take a look at what users are asking for `tf.data`:
* Different ways to [pad data](https://github.com/tensorflow/tensorflow/issues/13969), [shuffle data](https://github.com/tensorflow/tensorflow/issues/14518)
* [Handle none values in data](https://github.com/tensorflow/tensorflow/issues/13865)
* [Handle dataset that's not a multiple of batch size](https://github.com/tensorflow/tensorflow/issues/13745)
* [Different levels of determinism](https://github.com/tensorflow/tensorflow/issues/13932)
* [Sort/skip some data](https://github.com/tensorflow/tensorflow/issues/14250)
* [Write data to files](https://github.com/tensorflow/tensorflow/issues/15014)
To support all these features, each of which could've been done with __3 lines of code in Python__, you need either a new TF
API, or you ask [Dataset.from_generator](https://www.tensorflow.org/versions/r1.4/api_docs/python/tf/contrib/data/Dataset#from_generator)
(i.e., Python again) to come to the rescue.
It only makes sense to use TF to read data if your data is originally very clean and well-formatted.
If not, you may feel like writing a Python script to reformat your data, but then you're
almost writing a DataFlow already (a DataFlow can be made from a Python iterator)!
As for speed: when TF happens to support the operators you need,
it does offer similar or higher speed (though it takes effort to tune).
But how do you make sure you won't run into one of the unsupported situations listed above?
#### torch.utils.data.{Dataset,DataLoader}
By design, `torch.utils.data.Dataset` is simply a Python container/iterator, similar to DataFlow.
However, it makes some **bad assumptions**:
it assumes your dataset has a `__len__` and supports `__getitem__`,
which does not work when you have a dynamic/unreliable data source,
or when you need to filter your data on the fly.
`torch.utils.data.DataLoader` is quite good, even though it also makes some
**bad assumptions on batching** and is not always efficient:
1. It assumes you always train on batches, always use a constant batch size, and
   that batch grouping can be determined purely by indices.
   None of these is necessarily true.
2. Its multiprocessing implementation is efficient for `torch.Tensor`,
   but inefficient for generic data types or numpy arrays.
On the other hand, DataFlow:
1. Is a pure iterator that does not necessarily have a length, which is more generic.
2. Parallelization and batching are disentangled concepts.
You do not need to use batches, and can implement different batching logic easily.
3. Is optimized for generic data types and numpy arrays.
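
To make the first two points concrete, here is a small sketch (the generator internals are hypothetical): any Python generator becomes a DataFlow, and batching is just another optional stage on top:

```python
from tensorpack.dataflow import DataFromGenerator, BatchData

def gen():
    # a dynamic source: no __len__, and items can be filtered on the fly
    for item in read_from_somewhere():   # hypothetical reader
        if is_valid(item):               # hypothetical on-the-fly filter
            yield [item]

df = DataFromGenerator(gen)
df = BatchData(df, 64, remainder=True)   # batching is a separate, optional step
```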