Commit 784e2b7b authored by Yuxin Wu

update docs

parent 5d0d6d16
# DataFlow
### What is DataFlow
DataFlow is a library to build Python iterators for efficient data loading.
A DataFlow has a `get_data()` generator method,
which yields `datapoints`.
A datapoint is a **list** of Python objects, which are called the `components` of the datapoint.
For example, to train on MNIST dataset, you can build a DataFlow with a `get_data()` method
that yields datapoints (lists) of two components:
a numpy array of shape (64, 28, 28), and an array of shape (64,).
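For instance, a minimal sketch of such a DataFlow could look like the following (the class name and the random data are purely illustrative):

```python
import numpy as np
from tensorpack.dataflow import DataFlow

class FakeMNISTBatches(DataFlow):
    """Illustrative only: yield random batches shaped like MNIST data."""
    def get_data(self):
        for _ in range(100):
            images = np.random.rand(64, 28, 28).astype('float32')
            labels = np.random.randint(0, 10, size=(64,))
            # a datapoint: a list of two components
            yield [images, labels]
```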
### Composition of DataFlow
One good thing about having a standard interface is that it enables great code reusability.
There are a lot of existing DataFlow utilities in tensorpack, which you can use to compose
complex DataFlow with a long pre-processing pipeline. A common pipeline usually
would __read from disk (or other sources), apply augmentations, group into batches,
prefetch data__, etc. A simple example is as follows:
````python
# a DataFlow you implement to produce [tensor1, tensor2, ..] lists from whatever sources:
df = MyDataFlow(shuffle=True)
df = MyDataFlow(dir='/my/data', shuffle=True)
# resize the image component of each datapoint
df = AugmentImageComponent(df, [imgaug.Resize((225, 225))])
# group data into batches of size 128
df = BatchData(df, 128)
# start 3 processes to run the dataflow in parallel, and communicate with ZeroMQ
df = PrefetchDataZMQ(df, 3)
````
You can find more complicated DataFlow in the [ResNet training script](../examples/ResNet/imagenet-resnet.py)
with all the data preprocessing.
All these modules are written in Python,
so you can easily implement whatever operations/transformations you need,
without worrying about adding operators to TensorFlow.
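For example, an arbitrary Python transformation can be applied to one component with `MapDataComponent` (the normalization below is just a placeholder for whatever operation you need):

```python
from tensorpack.dataflow import MapDataComponent

# apply a plain Python function to the image component (index 0) of every datapoint
df = MapDataComponent(df, lambda img: img / 255.0, index=0)
```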
Unless you are working with standard data types (image folders, LMDB, etc),
you would usually want to write the base DataFlow (`MyDataFlow` in the above example) for your data format.
See [another tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/extend/dataflow.html)
for details on writing a DataFlow.
### Why DataFlow
1. It's easy: write everything in pure Python, and reuse existing utilities. In contrast,
writing data loaders in TF operators is painful.
2. It's fast (enough): see [Input Pipeline tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/input-source.html)
on how tensorpack handles data loading.
<!--
- TODO mention RL, distributed data, and zmq operator in the future.
-->
Nevertheless, tensorpack supports data loading with native TF operators as well.
### Use DataFlow outside Tensorpack
DataFlow is __independent__ of both tensorpack and TensorFlow.
You can simply use it as a data processing pipeline and plug it into any other framework.
To use a DataFlow independently, you will need to call `reset_state()` first to initialize it,
and then iterate over it with `get_data()`.
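A rough sketch of independent usage (reusing the hypothetical `MyDataFlow` from the earlier example):

```python
df = MyDataFlow(dir='/my/data', shuffle=True)
df.reset_state()                    # initialize the DataFlow (e.g. its RNG) before use
for datapoint in df.get_data():
    images, labels = datapoint      # each datapoint is a list of components
    # feed the components into any framework you like
```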
......
```python
ds1 = BatchData(ds0, 256, use_list=True)
TestDataSpeed(ds1).start()
```
Here `ds0` reads original images from the filesystem. It is implemented simply by:
```python
for filename, label in filelist:
    yield [cv2.imread(filename), label]
```
......
# Input Pipeline
This tutorial covers some general basics of the possible methods to send data from external sources to a TensorFlow graph,
and how tensorpack supports these methods.
You don't have to read it, because these are details under the tensorpack interface,
but knowing them could help you understand the efficiency and choose the best input pipeline for your task.
## Prepare Data in Parallel
![prefetch](https://cloud.githubusercontent.com/assets/1381301/26525192/36e5de48-4304-11e7-88ab-3b790bd0e028.png)
It is common sense, no matter what framework you use:
start to prepare the next (batch of) data while you're training!
The reasons are:
1. Data preparation often consumes non-trivial time (depending on the actual problem).
2. Data preparation often uses completely different resources from training (see figure above) --
doing them together doesn't slow you down. In fact you can further parallelize different stages in
the preparation, because they also use different resources.
3. Data preparation often doesn't depend on the result of the previous training step.
Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
......
Assuming you have 5GB/s `memcpy` bandwidth, simply copying the data once would already slow
down your training by 10%. Think about how many more copies are made during your preprocessing.
Failure to hide the data preparation latency is the major reason why people
cannot see good GPU utilization. __Always choose a framework that allows latency hiding.__
However most other TensorFlow wrappers are designed to be `feed_dict` based -- no latency hiding at all.
This is the major reason why tensorpack is [faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6).
## Python or C++ ?
The above discussion is valid regardless of what you use to load/preprocess, Python code or TensorFlow operators (written in C++).
The benefits of using TensorFlow ops are:
* Faster preprocessing.
  * Potentially true, but not necessarily. With Python code you can call a variety of other fast libraries
    (e.g. lmdb), which you have no access to in TF ops (see the sketch after this list).
  * Python may be just fast enough.
    As long as data preparation runs faster than training, it makes no difference at all.
    And for most types of problems, up to the scale of multi-GPU ImageNet training,
    Python can offer enough speed if you use a fast library (e.g. `tensorpack.dataflow`).
    See the [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html) tutorial.
* No "Copy to TF" (i.e. `feed_dict`) stage.
  * True. But as mentioned above, the latency can usually be hidden.
    In tensorpack, TF queues are used to hide the "Copy to TF" latency,
    and TF `StagingArea` can help hide the "Copy to GPU" latency.
    They are used by most examples in tensorpack.
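As a hedged illustration of the point above about fast Python libraries, a DataFlow could read straight from an LMDB database with the plain `lmdb` bindings (the class name and path handling here are hypothetical):

```python
import lmdb
from tensorpack.dataflow import DataFlow

class RawLMDBReader(DataFlow):
    """Illustrative only: yield raw (key, value) pairs from an LMDB database."""
    def __init__(self, lmdb_path):
        self._lmdb_path = lmdb_path

    def get_data(self):
        env = lmdb.open(self._lmdb_path, readonly=True)
        with env.begin() as txn:
            for key, value in txn.cursor():
                yield [key, value]   # a datapoint with two components
```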
## InputSource
......
For example,
4. Come from some TF native reading pipeline.
5. Come from some ZMQ pipe, where the load/preprocessing may happen on a different machine.
When you set `TrainConfig(dataflow=)`, tensorpack trainers automatically add proper prefetching for you.
You can also use the `TrainConfig(data=)` option to use a customized `InputSource`.
In case you want to use TF ops rather than a DataFlow, you can use `TensorInput` as the `InputSource`
(See the [PTB example](https://github.com/ppwwyyxx/tensorpack/tree/master/examples/PennTreebank)).
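A rough sketch of the two options (the model and the other `TrainConfig` arguments are hypothetical and omitted for brevity):

```python
from tensorpack import TrainConfig, QueueInput

# common case: hand the trainer a DataFlow; proper prefetching is added for you
config = TrainConfig(dataflow=df, model=MyModel())

# or pick an InputSource explicitly, e.g. a queue-based one to hide the "Copy to TF" latency
config = TrainConfig(data=QueueInput(df), model=MyModel())
```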
## Figure out the Bottleneck
......