Commit 4eab3650 authored by Yuxin Wu

update docs

parent ea720903
@@ -218,30 +218,23 @@ launch the base DataFlow in only **one process**, and only parallelize the trans
with another `PrefetchDataZMQ`
(Nesting two `PrefetchDataZMQ`, however, will result in a different behavior.
These differences are explained in the API documentation in more detail.)
Similar to what we did earlier, you can use `ThreadedMapData` to parallelize as well.

Let me summarize what this DataFlow does:
1. One process reads the LMDB file, shuffles the datapoints in a buffer, and puts them into a `multiprocessing.Queue` (used by `PrefetchData`).
2. 25 processes take items from the queue, decode and process them into [image, label] pairs, and
   send them through a ZMQ IPC pipe.
3. The main process takes data from the pipe and makes batches.
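Put together, a minimal sketch of such a pipeline could look like the following. The LMDB path, buffer sizes, and batch size are placeholders, and it assumes the LMDB values are serialized datapoints (hence `LMDBDataPoint`); the real decoding and augmentation code is shown earlier in this tutorial.

```python
import cv2
from tensorpack.dataflow import (
    LMDBData, LMDBDataPoint, LocallyShuffleData, PrefetchData,
    MapDataComponent, PrefetchDataZMQ, BatchData)

# step 1: one process reads the LMDB file, shuffles in a buffer,
# and feeds a multiprocessing.Queue
ds = LMDBData('/path/to/ILSVRC-train.lmdb', shuffle=False)  # placeholder path
ds = LocallyShuffleData(ds, 50000)
ds = PrefetchData(ds, 5000, 1)
# step 2: 25 processes decode/process into [image, label] pairs
# and send them through a ZMQ IPC pipe
ds = LMDBDataPoint(ds)
ds = MapDataComponent(ds, lambda x: cv2.imdecode(x, cv2.IMREAD_COLOR), 0)
ds = PrefetchDataZMQ(ds, 25)
# step 3: the main process makes batches
ds = BatchData(ds, 256)
```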
The two DataFlows mentioned in this tutorial (both random read and sequential read) can run at a speed of 1k ~ 2k images per second if you have good CPUs, RAM, and disks.
As a reference, tensorpack can train ResNet-18 at 1.2k images/s on 4 old TitanX.
A DGX-1 (8 P100) can train ResNet-50 at 1.7k images/s according to the [official benchmark](https://www.tensorflow.org/performance/benchmarks).
So DataFlow will not be a serious bottleneck if configured properly.
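If you want to check the throughput on your own hardware, you can simply time the DataFlow's iterator. This sketch uses the plain `reset_state()`/`get_data()` interface of the DataFlow base class, with `df` standing in for whatever pipeline you built above:

```python
import time

df.reset_state()                  # must be called once before reading
start, cnt = time.time(), 0
for dp in df.get_data():          # each dp here is one batch
    cnt += 1
    if cnt == 100:
        print('%.1f batches/s' % (cnt / (time.time() - start)))
        break
```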
## Distributed DataFlow

To further scale your DataFlow, you can run it on multiple machines and collect the data on the
training machine. E.g.:
```python
# Data Machine #1, process 1-20:
...
```
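The snippet above is truncated in this diff. As a rough sketch of what each data machine would run (the DataFlow class and the address are placeholders), using `send_dataflow_zmq`, whose signature appears in the diff below:

```python
from tensorpack.dataflow.remote import send_dataflow_zmq  # import path may differ

df = MyLargeData()   # placeholder: the expensive DataFlow you want to scale out
# push each datapoint to the training machine over a PUSH socket;
# this call loops over df forever and never returns
send_dataflow_zmq(df, 'tcp://training-machine:8877')
```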
@@ -39,9 +39,11 @@ A DataFlow could be blocked by CPU/hard disk/network/IPC bandwidth. Only by benc
know the reason and improve it accordingly, e.g.:
1. Use single-file database to avoid random reads on hard disk.
2. Write faster pre-processing with whatever tools you have.
3. Move certain pre-processing (e.g. mean/std normalization) to the graph, if TF has a fast implementation of it.
4. Compress your data (e.g. use uint8 images, or JPEG-compressed images) before sending them through
anything (network, ZMQ pipe, Python-TF copy etc.)
5. Use distributed data preprocessing, with `send_dataflow_zmq` and `RemoteDataZMQ`.
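As an illustration of points 3 and 4 combined, you can transfer compact uint8 images across the Python-TF boundary and do the mean/std normalization inside the graph. A sketch; the shapes and per-channel statistics are made up:

```python
import numpy as np
import tensorflow as tf

# transfer uint8 (4x smaller than float32) across the Python-TF copy
images_u8 = tf.placeholder(tf.uint8, [None, 224, 224, 3], name='images')
# hypothetical per-channel statistics
mean = tf.constant(np.array([123.7, 116.8, 103.9], dtype=np.float32))
std = tf.constant(np.array([58.4, 57.1, 57.4], dtype=np.float32))
# normalize inside the graph, where TF has a fast implementation
images = (tf.cast(images_u8, tf.float32) - mean) / std
```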
## Improve TensorFlow
...
@@ -22,6 +22,7 @@ def send_dataflow_zmq(df, addr, hwm=50, print_interval=100, format=None):
"""
Run DataFlow and send data to a ZMQ socket addr.
It will dump and send each datapoint to this addr with a PUSH socket.
This function never returns unless an error is encountered.
Args:
df (DataFlow): Will infinitely loop over the DataFlow.
@@ -59,6 +60,7 @@ def send_dataflow_zmq(df, addr, hwm=50, print_interval=100, format=None):
class RemoteDataZMQ(DataFlow):
"""
Produce data from ZMQ PULL socket(s).
See http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html#distributed-dataflow
Attributes:
cnt1, cnt2 (int): number of data points received from addr1 and addr2
...
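For completeness, a sketch of the receiving side on the training machine (the addresses are placeholders; `cnt1`/`cnt2` are the attributes documented above):

```python
from tensorpack.dataflow.remote import RemoteDataZMQ  # import path may differ

# pull datapoints from a local IPC socket and a TCP socket at the same time
df = RemoteDataZMQ('ipc:///tmp/ipc-socket', 'tcp://0.0.0.0:8877')
df.reset_state()
for i, dp in enumerate(df.get_data()):
    if i == 1000:          # stop after a while, just for this demo
        break
print(df.cnt1, df.cnt2)    # datapoints received from each address
```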