Commit d5ef7e8b authored by Yuxin Wu's avatar Yuxin Wu

update docs

parent dd2d9ffa
......@@ -17,6 +17,10 @@ $(function (){
if (fullname.startsWith('tensorpack.'))
fullname = fullname.substr(11);
if (fullname == "tensorpack.dataflow.MultiProcessMapData") {
groupName = "parallel_map";
}
var n = $(e).children('.descname').clone();
n[0].innerText = fullname;
......
......@@ -19,51 +19,51 @@ DataFlow is __independent of TensorFlow__ since it produces any python objects
To `import tensorpack.dataflow`, you don't even have to install TensorFlow.
You can simply use DataFlow as a data processing pipeline and plug it into any other frameworks.
### Load Raw Data
We do not make any assumptions about your data format.
You would usually want to write the source DataFlow (`MyDataFlow` in the example below) for your own data format.
See [another tutorial](extend/dataflow.html) for simple instructions on writing a DataFlow.
### Composition of DataFlow
There are a lot of existing DataFlow utilities in tensorpack, which you can use to compose
a DataFlow into a complex data pipeline. A common pipeline usually
would __read from disk (or other sources), apply transformations (possibly in parallel), group into batches,
prefetch data__, etc., and all __run in parallel__. A simple example is the following:
### Assemble the Pipeline
There are a lot of existing DataFlow utilities in tensorpack, which you can use to assemble
the source DataFlow into a complex data pipeline.
A common pipeline usually would
__read from disk (or other sources),
apply transformations,
group into batches, prefetch data__, etc, and all __run in parallel__.
A simple DataFlow pipeline looks like the following:
```python
# a DataFlow you implement to produce [tensor1, tensor2, ..] lists from whatever sources:
df = MyDataFlow(dir='/my/data', shuffle=True)
# resize the image component of each datapoint
df = AugmentImageComponent(df, [imgaug.Resize((225, 225))])
# apply transformation to your data
df = MapDataComponent(df, lambda t: transform(t), 0)
# group data into batches of size 128
df = BatchData(df, 128)
# start 3 processes to run the dataflow in parallel
df = MultiProcessRunnerZMQ(df, 3)
```
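The `BatchData` step above can be illustrated with a simplified, pure-Python stand-in (not tensorpack's actual implementation, which additionally stacks numpy-array components of the datapoints):

```python
def batch_data(stream, batch_size, remainder=False):
    # Simplified stand-in for BatchData: group consecutive datapoints
    # into lists of batch_size. The real BatchData also stacks
    # numpy-array components along a new batch axis.
    buf = []
    for dp in stream:
        buf.append(dp)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if remainder and buf:
        # keep the final partial batch only when asked to
        yield buf

batches = list(batch_data(range(7), batch_size=3))
```

By default a trailing partial batch is dropped, matching the common training-time behavior; pass `remainder=True` to keep it for evaluation-style runs.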
You can find more complicated DataFlow in the [ImageNet training script](../examples/ImageNetModels/imagenet_utils.py)
with all the data preprocessing.
### Work with Your Data
We do not make any assumptions about your data format.
You would usually want to write the source DataFlow (`MyDataFlow` in the above example) for your own data format.
See [another tutorial](extend/dataflow.html) for simple instructions on writing a DataFlow.
Once you have the source reader, all the [built-in
DataFlows](../modules/dataflow.html) are ready for you to assemble the rest of the data pipeline.
A list of built-in DataFlow to compose with can be found at [API docs](../modules/dataflow.html).
You can also find more complicated DataFlow in the [ImageNet training script](../examples/ImageNetModels/imagenet_utils.py)
with all the data preprocessing, or other tensorpack examples.
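To make the shape of a source DataFlow concrete, here is a minimal pure-Python sketch. The `DataFlow` base class below is a tiny stand-in for `tensorpack.dataflow.DataFlow` so the snippet runs without tensorpack installed; `MyDataFlow` and its in-memory samples are hypothetical:

```python
import random

class DataFlow:
    # Tiny stand-in for tensorpack.dataflow.DataFlow, just enough to show
    # the interface: __iter__ yields datapoints (lists of components),
    # __len__ is optional, and reset_state() initializes the DataFlow.
    def reset_state(self):
        pass

class MyDataFlow(DataFlow):
    """A hypothetical source DataFlow over an in-memory dataset."""
    def __init__(self, samples, shuffle=False):
        self.samples = list(samples)
        self.shuffle = shuffle

    def __len__(self):
        return len(self.samples)

    def __iter__(self):
        order = list(self.samples)
        if self.shuffle:
            random.shuffle(order)
        for image, label in order:
            yield [image, label]  # a datapoint is a list of components

df = MyDataFlow([("img0", 0), ("img1", 1)])
df.reset_state()
datapoints = list(df)
```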
### Why DataFlow
### Parallelize the Pipeline
1. It's easy: write everything in pure Python, and reuse existing utilities.
On the contrary, writing data loaders in TF operators is usually painful, and performance is hard to tune.
See more discussions in [Python Reader or TF Reader](extend/input-source.html#python-reader-or-tf-reader).
2. It's fast: see [Efficient DataFlow](efficient-dataflow.html)
on how to build a fast DataFlow with parallelism.
If you're using DataFlow with tensorpack, also see [Input Pipeline tutorial](extend/input-source.html)
on how tensorpack further accelerates data loading in the graph.
DataFlow includes optimized parallel runner and parallel mapper.
You can find them in the [API docs](../modules/dataflow.html) under the
"parallel" and "parallel_map" sections.
Nevertheless, tensorpack supports data loading with native TF operators / TF datasets as well.
The [Efficient DataFlow](efficient-dataflow.html) tutorial gives a deeper dive
on how to use them to optimize your data pipeline.
### Use DataFlow in Your Own Code
### Run the DataFlow
When training with tensorpack, typically it is the `InputSource` interface that runs the DataFlow.
However, DataFlow can be used without other tensorpack components.
To run a DataFlow by yourself, call `reset_state()` first to initialize it,
and then use the generator however you like:
```python
df = SomeDataFlow()
df.reset_state()
for dp in df:
    # dp is now a list. do whatever
```
Read the [API documentation](../../modules/dataflow.html#tensorpack.dataflow.DataFlow)
to see API details of DataFlow.
### Why DataFlow
1. It's easy: write everything in pure Python, and reuse existing utilities.
On the contrary, writing data loaders in TF operators is usually painful, and performance is hard to tune.
See more discussions in [Python Reader or TF Reader](extend/input-source.html#python-reader-or-tf-reader).
2. It's fast: see [Efficient DataFlow](efficient-dataflow.html)
on how to build a fast DataFlow with parallelism.
If you're using DataFlow with tensorpack, also see [Input Pipeline tutorial](extend/input-source.html)
on how tensorpack further accelerates data loading in the graph.
Nevertheless, tensorpack supports data loading with native TF operators / TF datasets as well.
Read the [API documentation](../../modules/dataflow.html)
to see API details of DataFlow and a complete list of built-in DataFlow.
......@@ -87,13 +87,14 @@ Now it's time to add threads or processes:
```python
ds = MultiProcessRunnerZMQ(ds1, num_proc=25)
ds = BatchData(ds, 256)
```
Here we fork 25 processes to run `ds1` and collect their output through the ZMQ IPC protocol,
which is faster than `multiprocessing.Queue`. You can also apply the parallel runner after batching, of course.
Here we fork 25 processes to run `ds1` and collect their output through the ZMQ IPC protocol.
You can also apply the parallel runner after batching, of course.
### Parallel Map
The above DataFlow might be fast, but since it forks the ImageNet reader (`ds0`),
it's **not a good idea to use it for validation** (for reasons mentioned at the top; more details in the [documentation](../modules/dataflow.html#tensorpack.dataflow.MultiProcessRunnerZMQ)).
Alternatively, you can use multi-threaded preprocessing like this:
it's **not a good idea to use it for validation** (for reasons mentioned at the top;
more details in the [documentation](../modules/dataflow.html#tensorpack.dataflow.MultiProcessRunnerZMQ)).
Alternatively, you can use a parallel mapper like this:
```eval_rst
.. code-block:: python
......@@ -141,7 +142,7 @@ Let's summarize what the above dataflow does:
3. Both 1 and 2 happen together in a separate process, and the results are sent back to main process through ZeroMQ.
4. Main process makes batches, and other tensorpack modules will then take care of how they should go into the graph.
There is also `MultiProcessMapData` for you to use.
And, of course, there is also `MultiProcessMapData` for you to use.
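The idea behind the parallel mappers can be sketched in pure Python with a thread pool. This is a simplified stand-in, not tensorpack's implementation: it preserves order and has no bounded internal buffer, while `MultiThreadMapData` buffers results and may reorder datapoints for speed:

```python
from concurrent.futures import ThreadPoolExecutor

def thread_map(stream, map_func, num_thread=4):
    # Simplified stand-in for MultiThreadMapData: apply map_func to each
    # datapoint using a pool of threads. Unlike the real one, this keeps
    # the original order and buffers the whole input.
    with ThreadPoolExecutor(max_workers=num_thread) as pool:
        yield from pool.map(map_func, stream)

squares = list(thread_map(range(5), lambda x: x * x, num_thread=2))
```

Threads are a good fit when `map_func` releases the GIL (e.g. cv2 or numpy image decoding); otherwise the process-based mappers parallelize better.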
## Sequential Read
......@@ -190,8 +191,8 @@ As a reference, on Samsung SSD 850, the uncached speed is about 16it/s.
```
Instead of shuffling all the training data in every epoch (which would require random read),
the added line above maintains a buffer of datapoints and shuffles them once in a while.
It will not affect the model as long as the buffer is large enough,
but it can also consume a lot of memory if the buffer is too large.
It will not affect the model very much as long as the buffer is large enough,
but it can be memory-consuming if the buffer is too large.
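The buffered shuffling described above can be sketched like this (a simplified stand-in for the buffered-shuffle DataFlow, `LocallyShuffleData` in tensorpack, with a fixed RNG seed so the example is deterministic):

```python
import random

def locally_shuffle(stream, buffer_size, seed=0):
    # Simplified stand-in for LocallyShuffleData: keep up to buffer_size
    # datapoints in a buffer; once full, emit a random element for each
    # new one that arrives. A larger buffer shuffles better but uses
    # more memory.
    rng = random.Random(seed)
    buf = []
    for dp in stream:
        buf.append(dp)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)  # drain the remainder in random order
    yield from buf

shuffled = list(locally_shuffle(range(10), buffer_size=4))
```

Every datapoint still comes out exactly once; only the order is randomized within a sliding window of `buffer_size`.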
### Augmentations & Parallel Runner
......@@ -229,7 +230,7 @@ Since we are reading the database sequentially, having multiple forked instances
base LMDB reader will result in biased data distribution. Therefore we use `MultiProcessRunner` to
launch the base DataFlow in only **one process**, and only parallelize the transformations
with another `MultiProcessRunnerZMQ`
(Nesting two `MultiProcessRunnerZMQ`, however, will result in a different behavior.
(Nesting two `MultiProcessRunnerZMQ`, however, is not allowed.
These differences are explained in the API documentation in more detail.).
Similar to what we did earlier, you can use `MultiThreadMapData` to parallelize as well.
......@@ -240,10 +241,10 @@ Let me summarize what this DataFlow does:
send them through ZMQ IPC pipe.
3. The main process takes data from the pipe, makes batches.
The two DataFlow mentioned in this tutorial (both random read and sequential read) can run at a speed of 1k ~ 2.5k images per second if you have good CPUs, RAM, disks.
With fewer augmentations, it can reach 5k images/s.
The two DataFlow mentioned in this tutorial (both random read and sequential read) can run at a speed of 1k ~ 5k images per second,
depending on your CPUs, RAM, disks, and the amount of augmentation.
As a reference, tensorpack can train ResNet-18 at 1.2k images/s on 4 old TitanX.
8 P100s can train ResNet-50 at 1.7k images/s according to the [official benchmark](https://www.tensorflow.org/performance/benchmarks).
8 V100s can train ResNet-50 at 2.8k images/s according to [tensorpack benchmark](https://github.com/tensorpack/benchmarks/tree/master/ResNet-MultiGPU).
So DataFlow will not be a serious bottleneck if configured properly.
## Distributed DataFlow
......
......@@ -10,7 +10,7 @@ There are several existing DataFlow, e.g. [ImageFromFile](../../modules/dataflow
[DataFromList](../../modules/dataflow.html#tensorpack.dataflow.DataFromList),
which you can use if your data format is simple.
In general, you probably need to write a source DataFlow to produce data for your task,
and then compose it with existing modules (e.g. mapping, batching, prefetching, ...).
and then compose it with other DataFlow (e.g. mapping, batching, prefetching, ...).
The easiest way to create a DataFlow to load custom data is to wrap a custom generator, e.g.:
```python
......@@ -47,7 +47,7 @@ for the semantics.
DataFlow implementations for several well-known datasets are provided in the
[dataflow.dataset](../../modules/dataflow.dataset.html)
module; you can take them as a reference.
module. You can take them as examples.
#### More Data Processing
......