Commit 6c482180 authored by Yuxin Wu

update docs

parent ad8aa0a5
tensorpack.dataflow package
===========================
Relevant tutorials: :doc:`../tutorial/dataflow`, :doc:`../tutorial/philosophy/dataflow`.
.. container:: custom-index
# DataFlow
DataFlow is a pure-Python library to create iterators for efficient data loading.
It is originally part of tensorpack, and now also available as a [separate library](https://github.com/tensorpack/dataflow).
### What is DataFlow
A datapoint is a **list or dict** of Python objects, each of which is called a component of the datapoint.
For example, to train on MNIST, you can use a DataFlow with an `__iter__()` method
that yields datapoints (lists) of two components:
a numpy array of shape (64, 28, 28), and an array of shape (64,).
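For concreteness, here is a minimal sketch of such a DataFlow (the class name and the random arrays are our stand-ins for a real MNIST reader, not part of tensorpack):

```python
import numpy as np
from tensorpack.dataflow import DataFlow

class FakeMNISTBatches(DataFlow):
    """Hypothetical DataFlow yielding datapoints of two components:
    a (64, 28, 28) image batch and a (64,) label batch."""
    def __iter__(self):
        for _ in range(100):
            images = np.random.rand(64, 28, 28).astype("float32")
            labels = np.random.randint(10, size=(64,)).astype("int32")
            yield [images, labels]

    def __len__(self):
        # optional: the number of datapoints per epoch
        return 100
```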
As you saw,
DataFlow is independent of the training frameworks, since it produces arbitrary Python objects
(usually numpy arrays).
To `import tensorpack.dataflow`, you don't even have to install TensorFlow.
You can simply use DataFlow as a data processing pipeline and plug it into your own training code.
### Load Raw Data
We do not make any assumptions about your data format.
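For example, one common way to load custom data is to wrap your own reader with the built-in `DataFromGenerator`. A minimal sketch (`read_my_data` is a hypothetical placeholder for your reader):

```python
from tensorpack.dataflow import DataFromGenerator

def read_my_data():
    # hypothetical reader over your own storage format;
    # each yielded list is one datapoint
    for record in range(10):
        yield [record]

df = DataFromGenerator(read_my_data)
```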
DataFlows can then be composed into a pipeline:

```python
df = MyDataFlow(dir='/my/data', shuffle=True)
df = BatchData(df, 128)
df = MultiProcessRunnerZMQ(df, 3)
```
A list of built-in DataFlow to use can be found at [API docs](../modules/dataflow.html).
You can also find complicated real-life DataFlow pipelines in the [ImageNet training script](../examples/ImageNetModels/imagenet_utils.py)
or other tensorpack examples.
### Parallelize the Pipeline
DataFlow includes carefully optimized parallel runners and parallel mappers: `Multi{Thread,Process}{Runner,MapData}`.
Runners execute multiple clones of a dataflow in parallel.
Mappers execute a mapping function in parallel on top of an existing dataflow.
You can find details in the [API docs](../modules/dataflow.html) under the
"parallel" and "parallel_map" section.
The [Efficient DataFlow](efficient-dataflow.html) tutorial gives a deeper dive
on how to use them to optimize your data pipeline.
### Run the DataFlow
When training with tensorpack, typically it is the `InputSource` interface that runs the DataFlow.
When using DataFlow alone without other tensorpack components,
you need to call `reset_state()` first to initialize it,
and then use the generator however you like:
```python
df = SomeDataFlow()
df.reset_state()
for dp in df:
    # dp is now a list/dict. do whatever with it
```
### Why DataFlow?
It's **easy and fast**. For more discussions, see [Why DataFlow?](/tutorial/philosophy/dataflow.html)
Nevertheless, using DataFlow is not required in tensorpack.
Tensorpack supports data loading with native TF operators / TF datasets as well.
Read the [API documentation](../../modules/dataflow.html)
to see API details of DataFlow and a complete list of built-in DataFlow.
`torch.utils.data.DataLoader` is quite good, although it also makes some
**bad assumptions on batching** and is not always efficient:
1. `torch.utils.data.DataLoader` assumes that:
   1. You do batch training.
   1. You use a constant batch size.
   1. Indices are sufficient to determine the samples to batch together.

   None of these are necessarily true.
2. Its multiprocessing implementation is efficient on `torch.Tensor`,
   but inefficient for generic data types or numpy arrays.
On the other hand, DataFlow:
1. Is a pure iterator: it does not necessarily have a length or support indexing. This is more generic.
2. Does not assume batches, and allows you to implement different batching logic easily.
3. Is optimized for generic data types and numpy arrays.
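As an illustration of point 2, here is a hedged sketch of custom batching logic written as plain Python on top of the iterator (`BucketBatch` is our hypothetical class, not a built-in; `ProxyDataFlow` is tensorpack's base class for wrapping another dataflow):

```python
from tensorpack.dataflow import ProxyDataFlow

class BucketBatch(ProxyDataFlow):
    """Hypothetical batching policy: group datapoints whose first
    component has the same length, e.g. variable-length sequences."""
    def __init__(self, ds, batch_size):
        super().__init__(ds)
        self.batch_size = batch_size

    def __iter__(self):
        buckets = {}  # length -> list of datapoints
        for dp in self.ds:
            bucket = buckets.setdefault(len(dp[0]), [])
            bucket.append(dp)
            if len(bucket) == self.batch_size:
                yield bucket
                buckets[len(dp[0])] = []
        for bucket in buckets.values():
            if bucket:  # flush leftovers as smaller, variable-size batches
                yield bucket
```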
```eval_rst
.. note:: Why is an iterator interface more generic than ``__getitem__``?

    DataFlow's iterator interface can perfectly simulate the behavior of an indexing interface like this:

    .. code-block:: python

        # A dataflow which produces indices, like [0], [1], [2], ...
        # The indices can be either sequential, or more fancy, akin to torch.utils.data.Sampler.
        df = SomeIndexGenerator()
        # Map the indices to datapoints by ``__getitem__``.
        df = MapData(df, lambda idx: dataset[idx[0]])
```
```python
from tensorpack.utils import *
from tensorpack.dataflow import *
# dataflow can be used alone without installing tensorflow

# https://github.com/celery/kombu/blob/7d13f9b95d0b50c94393b962e6def928511bfda6/kombu/__init__.py#L34-L36
STATICA_HACK = True
```