Commit 4eab3650 authored by Yuxin Wu

update docs

parent ea720903
@@ -218,30 +218,23 @@ launch the base DataFlow in only **one process**, and only parallelize the trans
with another `PrefetchDataZMQ`
(Nesting two `PrefetchDataZMQ`, however, will result in a different behavior.
These differences are explained in the API documentation in more detail.)
Similar to what we did earlier, you can use `ThreadedMapData` to parallelize as well.

Let me summarize what this DataFlow does:
1. One process reads the LMDB file, shuffles the datapoints in a buffer, and puts them into a `multiprocessing.Queue` (used by `PrefetchData`).
2. 25 processes take items from the queue, decode and process them into [image, label] pairs, and
   send them through a ZMQ IPC pipe.
3. The main process takes data from the pipe and makes batches.
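Put together, a minimal sketch of such a pipeline could look like the following. The LMDB path, buffer sizes, and batch size are placeholders, and it assumes the LMDB values are serialized datapoints (hence `LMDBDataPoint`); the real decoding and augmentation code is shown earlier in this tutorial.

```python
import cv2
from tensorpack.dataflow import (
    LMDBData, LMDBDataPoint, LocallyShuffleData, PrefetchData,
    MapDataComponent, PrefetchDataZMQ, BatchData)

# step 1: one process reads the LMDB file, shuffles in a buffer,
# and feeds a multiprocessing.Queue
ds = LMDBData('/path/to/ILSVRC-train.lmdb', shuffle=False)  # placeholder path
ds = LocallyShuffleData(ds, 50000)
ds = PrefetchData(ds, 5000, 1)
# step 2: 25 processes decode/process into [image, label] pairs
# and send them through a ZMQ IPC pipe
ds = LMDBDataPoint(ds)
ds = MapDataComponent(ds, lambda x: cv2.imdecode(x, cv2.IMREAD_COLOR), 0)
ds = PrefetchDataZMQ(ds, 25)
# step 3: the main process makes batches
ds = BatchData(ds, 256)
```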
The two DataFlows mentioned in this tutorial (both random read and sequential read) can run at a speed of 1k ~ 2k images per second if you have good CPUs, RAM, and disks.
As a reference, tensorpack can train ResNet-18 at 1.2k images/s on 4 old TitanX.
A DGX-1 (8 P100) can train ResNet-50 at 1.7k images/s according to the [official benchmark](https://www.tensorflow.org/performance/benchmarks).
So DataFlow will not be a serious bottleneck if configured properly.
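If you want to check the throughput on your own hardware, you can simply time the DataFlow's iterator. This sketch uses the plain `reset_state()`/`get_data()` interface of the DataFlow base class, with `df` standing in for whatever pipeline you built above:

```python
import time

df.reset_state()                  # must be called once before reading
start, cnt = time.time(), 0
for dp in df.get_data():          # each dp here is one batch
    cnt += 1
    if cnt == 100:
        print('%.1f batches/s' % (cnt / (time.time() - start)))
        break
```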
## Distributed DataFlow

To further scale your DataFlow, you can run it on multiple machines and collect the data on the
training machine. E.g.:
```python
# Data Machine #1, process 1-20:
...
```
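The snippet above is truncated in this diff. As a rough sketch of what each data machine would run (the DataFlow class and the address are placeholders), using `send_dataflow_zmq`, whose signature appears in the diff below:

```python
from tensorpack.dataflow.remote import send_dataflow_zmq  # import path may differ

df = MyLargeData()   # placeholder: the expensive DataFlow you want to scale out
# push each datapoint to the training machine over a PUSH socket;
# this call loops over df forever and never returns
send_dataflow_zmq(df, 'tcp://training-machine:8877')
```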
@@ -39,9 +39,11 @@ A DataFlow could be blocked by CPU/hard disk/network/IPC bandwidth. Only by benc
know the reason and improve it accordingly, e.g.:
1. Use single-file database to avoid random reads on hard disk.
2. Write faster pre-processing with whatever tools you have.
3. Move certain pre-processing (e.g. mean/std normalization) to the graph, if TF has a fast implementation of it.
4. Compress your data (e.g. use uint8 images, or JPEG-compressed images) before sending them through
anything (network, ZMQ pipe, Python-TF copy etc.)
5. Use distributed data preprocessing, with `send_dataflow_zmq` and `RemoteDataZMQ`.
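As an illustration of points 3 and 4 combined, you can transfer compact uint8 images across the Python-TF boundary and do the mean/std normalization inside the graph. A sketch; the shapes and per-channel statistics are made up:

```python
import numpy as np
import tensorflow as tf

# transfer uint8 (4x smaller than float32) across the Python-TF copy
images_u8 = tf.placeholder(tf.uint8, [None, 224, 224, 3], name='images')
# hypothetical per-channel statistics
mean = tf.constant(np.array([123.7, 116.8, 103.9], dtype=np.float32))
std = tf.constant(np.array([58.4, 57.1, 57.4], dtype=np.float32))
# normalize inside the graph, where TF has a fast implementation
images = (tf.cast(images_u8, tf.float32) - mean) / std
```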
## Improve TensorFlow
...
@@ -22,6 +22,7 @@ def send_dataflow_zmq(df, addr, hwm=50, print_interval=100, format=None):
"""
Run DataFlow and send data to a ZMQ socket addr.
It will dump and send each datapoint to this addr with a PUSH socket.
This function never returns unless an error is encountered.
Args:
df (DataFlow): Will infinitely loop over the DataFlow.
@@ -59,6 +60,7 @@ def send_dataflow_zmq(df, addr, hwm=50, print_interval=100, format=None):
class RemoteDataZMQ(DataFlow):
"""
Produce data from ZMQ PULL socket(s).
See http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html#distributed-dataflow
Attributes:
cnt1, cnt2 (int): number of data points received from addr1 and addr2
...
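For completeness, a sketch of the receiving side on the training machine (the addresses are placeholders; `cnt1`/`cnt2` are the attributes documented above):

```python
from tensorpack.dataflow.remote import RemoteDataZMQ  # import path may differ

# pull datapoints from a local IPC socket and a TCP socket at the same time
df = RemoteDataZMQ('ipc:///tmp/ipc-socket', 'tcp://0.0.0.0:8877')
df.reset_state()
for i, dp in enumerate(df.get_data()):
    if i == 1000:          # stop after a while, just for this demo
        break
print(df.cnt1, df.cnt2)    # datapoints received from each address
```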