Commit 5969bf53 authored by Yuxin Wu

update docs

parent 26664c3f
......@@ -16,7 +16,7 @@ feel free to delete everything in this template.
(2) **If you're using examples, have you made any changes to the examples? Paste `git status; git diff` here:**
(3) **If not using examples, tell us what you did:**
(3) **If not using examples, help us reproduce your issue:**
It's always better to copy-paste what you did than to describe it.
......
......@@ -16,23 +16,21 @@ then apply complicated preprocessing to it.
We hope to reach a speed of **1k~5k images per second**, to keep GPUs busy.
Some things to know before reading:
1. We recommend reading the [Parallel DataFlow Tutorial](parallel-dataflow.html) first.
1. You only need the data loader to be **fast enough, but not faster**.
See [How Fast Do You Actually Need](philosophy/dataflow.html#how-fast-do-you-actually-need) for details.
For smaller datasets (e.g. several GBs of images with lightweight preprocessing),
a simple reader plus some multiprocess runner is usually fast enough.
Therefore you don't have to understand this tutorial in depth, unless you really find your data loader to be the bottleneck.
**Premature optimization is the root of all evil.** Always benchmark and make sure optimization is needed before optimizing.
2. Having a fast Python generator **alone** may or may not improve your overall training speed.
1. Having a fast Python generator **alone** may or may not improve your overall training speed.
You need mechanisms to hide the latency of **all** preprocessing stages, as mentioned in the
[InputSource tutorial](extend/input-source.html).
3. Reading training set and validation set are different.
1. Reading the training set and reading the validation set are different.
In training it's OK to reorder, regroup, or even duplicate some datapoints, as long as the
data distribution stays the same.
But in validation we often need the exact set of data, to be able to compute a correct and comparable score.
This will affect how we build the DataFlow.
4. The actual performance would depend on not only the disk, but also memory (for caching) and CPU (for data processing).
1. The actual performance depends not only on the disk, but also on memory (for caching) and CPU (for data processing).
You may need to tune the parameters (#processes, #threads, size of buffer, etc.)
or change the pipeline for new tasks and new machines to achieve the best performance.
The solutions in this tutorial may not help you.
......@@ -43,7 +41,8 @@ Some things to know before reading:
The benchmark code for this tutorial can be found in [tensorpack/benchmarks](https://github.com/tensorpack/benchmarks/tree/master/ImageNet),
including a comparison with a similar pipeline built with `tf.data`.
This tutorial could be a bit complicated for people new to system architectures, but you do need these to be able to run fast enough on ImageNet-scale dataset.
This tutorial could be a bit complicated for people new to system architectures,
but you do need these to be able to run fast enough on an ImageNet-scale dataset.
## Random Read
......@@ -58,18 +57,19 @@ TestDataSpeed(ds1).start()
Here `ds0` reads original images from the filesystem. It is implemented simply by:
```python
for filename, label in filelist:
np.random.shuffle(filelist)   # note: np.random.shuffle works in place and returns None
for filename, label in filelist:
    yield [cv2.imread(filename), label]
```
And `ds1` batch the datapoints from `ds0`, so that we can measure the speed of this DataFlow in terms of "batch per second".
By default, `BatchData`
will stack the datapoints into an `numpy.ndarray`, but since original images are of different shapes, we use
`use_list=True` so that it just produces lists.
And `ds1` batches the datapoints from `ds0`, so that we can measure the speed of this DataFlow in terms of "batches per second".
By default, `BatchData` stacks the datapoints into a `numpy.ndarray`,
but since the original ImageNet images are of different shapes, we use
`use_list=True` so that it just produces lists for now.
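
For reference, a minimal sketch of the two lines described above (assuming `ds0` is the DataFlow built in the elided benchmark code at the start of this section):

```python
ds1 = BatchData(ds0, 256, use_list=True)   # keep lists; image shapes differ
TestDataSpeed(ds1).start()                 # prints the iteration speed
```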
Here we're measuring the time to (1) read images from the filesystem and (2) decode them.
On a good filesystem you probably can already observe good speed here (e.g. 5 it/s, that is 1280 images/s), but on an HDD the speed may be just 1 it/s,
because we are doing heavy random reads on the filesystem (regardless of whether `shuffle` is True).
Image decoding in `cv2.imread` could also be a bottleneck at this early stage.
Image decoding in `cv2.imread` may also be a bottleneck at this early stage, since it requires a fast CPU.
### Parallel Runner
......@@ -100,7 +100,7 @@ You can also apply parallel runner after batching, of course.
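A minimal sketch of that idea, i.e. wrapping the batched DataFlow in a runner (the `num_proc` value here is only illustrative):

```python
ds = BatchData(ds0, 256, use_list=True)
ds = MultiProcessRunnerZMQ(ds, num_proc=25)   # fork 25 copies of the batched pipeline
```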
### Parallel Map
The above DataFlow might be fast, but since it forks the ImageNet reader (`ds0`),
it's **not a good idea to use it for validation** (for reasons mentioned at the top.
More details at the [documentation](../modules/dataflow.html#tensorpack.dataflow.MultiProcessRunnerZMQ)).
More details at the [Parallel DataFlow Tutorial](parallel-dataflow) and the [documentation](../modules/dataflow.html#tensorpack.dataflow.MultiProcessRunnerZMQ)).
Alternatively, you can use a parallel mapper like this:
```eval_rst
......@@ -155,7 +155,7 @@ And, of course, there is also a `MultiProcessMapData` as well for you to use.
### Save and Load a Single-File DataFlow
Random read may not be a good idea when the data is not on an SSD.
We can also dump the dataset into one single LMDB file and read it sequentially.
In this case, we can also dump the dataset into one single LMDB file and read it sequentially.
```python
from tensorpack.dataflow import *
......@@ -177,7 +177,7 @@ from several forks of `ds0`, then neither the content nor the order of `ds1` wil
See [documentation](../modules/dataflow.html#tensorpack.dataflow.MultiProcessRunnerZMQ)
about caveats of `MultiProcessRunnerZMQ`.
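
For reference, the dump itself is a single call; a sketch, assuming `ds1` is the raw-bytes DataFlow built in the elided code above:

```python
LMDBSerializer.save(ds1, '/path/to/ILSVRC-train.lmdb')
```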
It will generate a database file of 140G. We load the DataFlow back by reading this LMDB file sequentially:
This will generate a database file of 140G. We load the DataFlow back by reading this LMDB file sequentially:
```
ds = LMDBSerializer.load('/path/to/ILSVRC-train.lmdb', shuffle=False)
ds = BatchData(ds, 256, use_list=True)
......@@ -201,7 +201,7 @@ the added line above maintains a buffer of datapoints and shuffle them once a wh
It will not affect the model very much as long as the buffer is large enough,
but it can be memory-consuming if the buffer is too large.
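
For reference, a sketch of the shuffling line being discussed (the buffer size is illustrative):

```python
ds = LMDBSerializer.load('/path/to/ILSVRC-train.lmdb', shuffle=False)
ds = LocallyShuffleData(ds, 50000)   # shuffle within a buffer of 50k datapoints
```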
### Augmentations & Parallel Runner
### Augmentations & Parallelism
Then we add necessary transformations:
```eval_rst
......@@ -222,37 +222,29 @@ Then we add necessary transformations:
Both imdecode and the augmentors can be quite slow. We can parallelize them like this:
```eval_rst
.. code-block:: python
:emphasize-lines: 3,7
:emphasize-lines: 4,5,6
ds = LMDBSerializer.load(db, shuffle=False)
ds = LocallyShuffleData(ds, 50000)
ds = MultiProcessRunner(ds, 5000, 1)
ds = MapDataComponent(ds, lambda x: cv2.imdecode(x, cv2.IMREAD_COLOR), 0)
ds = AugmentImageComponent(ds, lots_of_augmentors)
ds = MultiProcessRunnerZMQ(ds, 25)

def f(jpeg_str, label):
    return lots_of_augmentors.augment(cv2.imdecode(jpeg_str, cv2.IMREAD_COLOR)), label
ds = MultiProcessMapDataZMQ(ds, 25, f)
ds = BatchData(ds, 256)
```
Since we are reading the database sequentially, having multiple forked instances of the
base LMDB reader will result in biased data distribution. Therefore we use `MultiProcessRunner` to
launch the base DataFlow in only **one process**, and only parallelize the transformations
with another `MultiProcessRunnerZMQ`
(Nesting two `MultiProcessRunnerZMQ`, however, is not allowed.
These differences are explained in the API documentation in more details.).
Similar to what we did earlier, you can use `MultiThreadMapData` to parallelize as well.
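
For example, a thread-based sketch of the same mapping step (the thread count is illustrative):

```python
ds = MultiThreadMapData(ds, 25, f)   # 25 threads apply f to each datapoint
```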
Let me summarize what this DataFlow does:
1. One process reads LMDB file, shuffle them in a buffer and put them into a `multiprocessing.Queue` (used by `MultiProcessRunner`).
1. One process reads the LMDB file, shuffles datapoints in a buffer, and puts them into a ZMQ pipe (used by `MultiProcessMapDataZMQ`).
2. 25 processes take items from the pipe, decode and process them into [image, label] pairs, and
send them through ZMQ IPC pipe.
send them through ZMQ IPC pipe to the main process.
3. The main process takes data from the pipe and makes batches.
The two DataFlows mentioned in this tutorial (both random read and sequential read) can run at a speed of 1k~5k images per second,
depending on your CPUs, RAM, disks, and the amount of augmentation.
As a reference, tensorpack can train ResNet-18 at 1.2k images/s on 4 old TitanX.
8 V100s can train ResNet-50 at 2.8k images/s according to [tensorpack benchmark](https://github.com/tensorpack/benchmarks/tree/master/ResNet-MultiGPU).
So DataFlow will not be a serious bottleneck if configured properly.
So DataFlow will not be a bottleneck if configured properly.
## Distributed DataFlow
......@@ -287,8 +279,8 @@ TestDataSpeed(df).start()
`MultiProcessRunnerZMQ` or `MultiProcessMapData` (which is an alias of `MultiProcessMapDataZMQ`).
2. Windows needs to pickle your dataflow to run it in multiple processes.
As a result you cannot use lambda functions for mappings, like the examples above.
You need to write a new function in global scope that does the mapping.
This issue also exist on Linux if you do not use the 'fork' start method.
You need to create a function in global scope, or a function-like object, to perform the mapping.
This issue also exists on Linux when you do not use the 'fork' start method.
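
For illustration, such a function-like object could look like the following (a hypothetical sketch; the class name is not part of tensorpack):

```python
import cv2

class JpegDecodeAndAugment:
    """A picklable replacement for a lambda/closure mapper."""
    def __init__(self, augmentors):
        self.augmentors = augmentors

    def __call__(self, jpeg_str, label):
        img = cv2.imdecode(jpeg_str, cv2.IMREAD_COLOR)
        return self.augmentors.augment(img), label

ds = MultiProcessMapData(ds, 25, JpegDecodeAndAugment(lots_of_augmentors))
```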
[1]: #ref
......
......@@ -31,6 +31,7 @@ DataFlow Tutorials
dataflow
philosophy/dataflow
extend/dataflow
parallel-dataflow
efficient-dataflow
Advanced Tutorials
......
# Parallel DataFlow
This tutorial explains the parallel building blocks
inside DataFlow, since most of the time they are the only thing
needed to build an efficient dataflow.
## Concepts: how to make things parallel
Code does not automatically utilize multiple CPUs.
You need to specify how to split the tasks across CPUs.
A tensorpack DataFlow can be parallelized across CPUs in the following two ways:
### Run Multiple Identical DataFlows
In this pattern, multiple identical DataFlows run on multiple CPUs,
and put results in a queue.
The master worker receives the output from the queue.
To use this pattern with multi-processing, you can do:
```
d1 = MyDataFlow() # some dataflow written by the user
d2 = MultiProcessRunnerZMQ(d1, num_proc=20)
```
The second line starts 20 processes running `d1`, and merges the results.
You can then obtain the results from `d2`.
Note that all the workers run independently in this pattern.
This means you need to have sufficient randomness in `d1`.
If `d1` produces the same sequence in each worker,
then `d2` will produce repeated datapoints.
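
One common way to get such randomness is to derive the dataflow from `RNGDataFlow` and draw all random numbers from `self.rng`, which tensorpack re-seeds in each worker when the dataflow is reset there. A rough sketch:

```python
from tensorpack.dataflow import RNGDataFlow

class MyDataFlow(RNGDataFlow):
    """self.rng is re-seeded per worker process, so workers emit different sequences."""
    def __init__(self, items):
        self.items = items

    def __iter__(self):
        while True:
            idx = self.rng.randint(len(self.items))   # per-worker randomness
            yield [self.items[idx]]
```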
There are some other similar issues you need to take care of when using this pattern.
You can find them at the
[API documentation](../modules/dataflow.html#tensorpack.dataflow.MultiProcessRunnerZMQ).
### Distribute Tasks to Multiple Workers
In this pattern, the master worker sends datapoints (the tasks)
to multiple workers.
The workers are responsible for executing a (possibly expensive) mapping
function on the datapoints, and sending the results back to the master.
An example with multi-processing is like this:
```
d1 = MyDataFlow()  # a dataflow that produces [image file name, label]

def f(file_name, label):
    img = cv2.imread(file_name)   # read image
    # run heavy pre-processing / augmentation on the image
    return img, label

d2 = MultiProcessMapData(d1, 20, f)
```
The main differences between this pattern and the first are:
1. `d1` is not executed in parallel. Only `f` runs in parallel.
Therefore you don't have to worry about randomness or data distribution shift.
Also, you need to make `d1` very efficient (e.g., produce just small metadata), since it runs serially.
2. More communication is required to send data to workers.
See its [API documentation](../modules/dataflow.html#tensorpack.dataflow.MultiProcessMapData)
to learn more details.
## Threads & Processes
Both of the above patterns can be used with either multi-threading or
multi-processing, with the following built-in DataFlows:
* [MultiProcessRunnerZMQ](../modules/dataflow.html#tensorpack.dataflow.MultiProcessRunnerZMQ)
or [MultiProcessRunner](../modules/dataflow.html#tensorpack.dataflow.MultiProcessRunner)
* [MultiThreadRunner](../modules/dataflow.html#tensorpack.dataflow.MultiThreadRunner)
* [MultiProcessMapDataZMQ](../modules/dataflow.html#tensorpack.dataflow.MultiProcessMapDataZMQ)
* [MultiThreadMapData](../modules/dataflow.html#tensorpack.dataflow.MultiThreadMapData)
(ZMQ means [ZeroMQ](http://zeromq.org/), a communication library that is more
efficient than Python's multiprocessing module)
Threads and processes each have their pros and cons:
1. Threads in Python are limited by the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock).
Threads in one process cannot execute Python code in parallel.
As a result, multi-threading may not scale well if the workers spend a
significant amount of time in the Python interpreter.
2. Processes need to pay the overhead of communication with each other.
The best choice of the above parallel utilities varies across machines and tasks.
You can even combine threads and processes sometimes.
For a new task, you often need to do a quick benchmark to choose the best pattern.
See [Performance Tuning Tutorial](performance-tuning.html)
on how to effectively understand the performance of a DataFlow.
See also [Efficient DataFlow](efficient-dataflow.html)
for real examples using the above DataFlows.
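
For such a quick benchmark, a minimal sketch could look like this (worker counts and sizes are illustrative; `MyDataFlow` stands for your own dataflow):

```python
from tensorpack.dataflow import MultiProcessRunnerZMQ, MultiThreadRunner, TestDataSpeed

# process-based: fork 10 copies of the dataflow
d_proc = MultiProcessRunnerZMQ(MyDataFlow(), num_proc=10)
TestDataSpeed(d_proc, size=5000).start()

# thread-based: 10 threads, each iterating its own instance of the dataflow
d_thread = MultiThreadRunner(lambda: MyDataFlow(), num_prefetch=200, num_thread=10)
TestDataSpeed(d_thread, size=5000).start()
```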
......@@ -153,7 +153,7 @@ or when you need to filter your data on the fly.
1. `torch.utils.data.DataLoader` assumes that:
1. You do batch training
1. You use a constant batch size
1. Indices are sufficient to determine the samples to batch together
1. Indices are sufficient to determine which samples to batch together
None of these are necessarily true.
......@@ -161,16 +161,16 @@ or when you need to filter your data on the fly.
but inefficient for generic data type or numpy arrays.
Also, its implementation [does not always clean up the subprocesses correctly](https://github.com/pytorch/pytorch/issues/16608).
Pytorch starts to improve on these bad assumptions (e.g., with [IterableDataset](https://github.com/pytorch/pytorch/pull/19228)).
PyTorch starts to improve on these bad assumptions (e.g., with [IterableDataset](https://github.com/pytorch/pytorch/pull/19228)).
On the other hand, DataFlow:
1. Is a pure iterator, not necessarily has a length or can be indexed. This is more generic.
1. Is an iterator, and does not necessarily have a length or support indexing. This is more generic.
2. Does not assume batches, and allows you to implement different batching logic easily.
3. Is optimized for generic data types and numpy arrays.
```eval_rst
.. note:: Why is an iterator interface more generic than ``__getitem__``?
.. note:: An iterator interface is more generic than ``__getitem__``.
DataFlow's iterator interface can perfectly simulate the behavior of an indexing interface like this:
......
......@@ -298,6 +298,19 @@ class TransformList(BaseTransform):
        repr_each_tfm = ",\n".join([" " + repr(x) for x in self.tfms])
        return "imgaug.TransformList([\n{}])".format(repr_each_tfm)

    def __add__(self, other):
        other = other.tfms if isinstance(other, TransformList) else [other]
        return TransformList(self.tfms + other)

    def __iadd__(self, other):
        other = other.tfms if isinstance(other, TransformList) else [other]
        self.tfms.extend(other)
        return self

    def __radd__(self, other):
        other = other.tfms if isinstance(other, TransformList) else [other]
        return TransformList(other + self.tfms)

    __repr__ = __str__
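
    # Hypothetical usage sketch of the operators added above (not part of this
    # commit; t0..t4 stand for arbitrary BaseTransform instances):
    #
    #   tl = TransformList([t1, t2])
    #   tl = tl + t3                   # __add__: new TransformList([t1, t2, t3])
    #   tl += TransformList([t4])      # __iadd__: extends tl in place
    #   tl2 = t0 + tl                  # __radd__: prepends a single transform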
......