This tutorial gives an overview of how to build an efficient DataFlow, using the ImageNet
dataset as an example.
Our goal in the end is to have
a __Python generator__ which yields preprocessed ImageNet images and labels as fast as possible.
Since it is simply a generator interface, you can use the DataFlow in other Python-based frameworks (e.g. Keras)
or your own code as well.
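Concretely, "generator interface" just means plain Python iteration. A minimal sketch of the idea (with a hypothetical `load_image` stand-in, not a tensorpack API):

```python
def load_image(fname):
    # Hypothetical stand-in for reading & JPEG-decoding one file from disk.
    return "decoded:" + fname

def dataflow(filenames, labels):
    # A DataFlow is, at its core, just a generator of datapoints.
    for fname, label in zip(filenames, labels):
        yield [load_image(fname), label]

# Any Python code (Keras, a custom training loop, ...) consumes it by iterating:
for img, label in dataflow(["a.jpg", "b.jpg"], [0, 1]):
    print(img, label)
```

Everything below is about making such a generator produce datapoints as fast as possible.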
We use the ILSVRC12 training set, which contains 1.28 million images.
The original images (JPEG compressed) are 140G in total.
The average resolution is about 400x350 <sup>[[1]]</sup>.
Following the [ResNet example](../examples/ResNet), we need images in their original resolution,
so we will read the original dataset (instead of a down-sampled version), and
then apply complicated preprocessing to it.
We will need to reach a speed of roughly 1k ~ 2k images per second, to keep GPUs busy.
Note that the actual performance would depend on not only the disk, but also
memory (for caching) and CPU (for data processing).
You may need to tune the parameters (#processes, #threads, size of buffer, etc.)
or change the pipeline for new tasks and new machines to achieve the best performance.
This tutorial is quite complicated, because you do need such knowledge of hardware & system to run fast on an ImageNet-sized dataset.
However, for __smaller datasets__ (e.g. several GBs of space, or lightweight preprocessing), a simple reader plus some prefetch should work well enough.
## Random Read
...

We start from a simple DataFlow:

...

Here `ds0` simply reads original images from the filesystem. It is implemented by:
...

By default, `BatchData`
will stack the datapoints into a `numpy.ndarray`, but since original images are of different shapes, we use
`use_list=True` so that it just produces lists.
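Roughly, the batching logic amounts to the following pure-Python sketch of the `use_list=True` behavior (an illustration, not tensorpack's actual implementation):

```python
def batch_data(dataflow, batch_size):
    # Group consecutive datapoints into batches. With use_list=True semantics,
    # each component is returned as a plain list rather than stacked into an
    # ndarray, so images of different shapes can coexist in one batch.
    batch = []
    for dp in dataflow:
        batch.append(dp)
        if len(batch) == batch_size:
            # Transpose: list of [img, label] -> [list_of_imgs, list_of_labels].
            yield [list(component) for component in zip(*batch)]
            batch = []
```

For example, four `[img, label]` datapoints with `batch_size=2` yield two batches, each of the form `[imgs, labels]`.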
On an SSD you probably can already observe good speed here (e.g. 5 it/s, that is 1280 images/s), but on HDD the speed may be just 1 it/s,
because we are doing heavy random read on the filesystem (regardless of whether `shuffle` is True).
We will now add the cheapest pre-processing to get an ndarray in the end instead of a list
(because training will need an ndarray eventually):

```eval_rst
.. code-block:: python
    :emphasize-lines: 2,3

    ...
```
...

Now it's time to add threads or processes:

```python
ds = PrefetchDataZMQ(ds1, nr_proc=25)
ds = BatchData(ds, 256)
```
Here we start 25 processes to run `ds1`, and collect their output through the ZMQ IPC protocol.
Using ZMQ to transfer data is faster than `multiprocessing.Queue`, but data copy (even
within one process) can still be quite expensive when you're dealing with large data.
For example, to reduce copy overhead, the ResNet example deliberately moves certain pre-processing (the mean/std normalization) from DataFlow to the graph.
This way the DataFlow only transfers uint8 images, as opposed to float32 which takes 4x more memory.
Alternatively, you can use multi-threading like this:
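The idea of thread-parallel mapping can be sketched with only the standard library (a hypothetical `threaded_map` helper, not tensorpack's API):

```python
from concurrent.futures import ThreadPoolExecutor

def threaded_map(dataflow, map_func, nr_thread=25):
    # Apply map_func to each datapoint using a pool of threads.
    # Caveats: this submits the whole (finite) stream up front, and in CPython
    # threads only help if map_func releases the GIL -- which heavy OpenCV and
    # numpy operations generally do.
    with ThreadPoolExecutor(max_workers=nr_thread) as pool:
        yield from pool.map(map_func, dataflow)
```

Threads avoid the serialization/copy cost of processes, at the price of contending for the GIL in pure-Python code.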
Depending on whether the OS has cached the file for you (and how large the RAM is), the above script
can run at a speed of 10~130 it/s, roughly corresponding to 250MB~3.5GB/s bandwidth. You can test
...

Then we add necessary transformations:
```python
ds = BatchData(ds, 256)
```
1. `LMDBDataPoint` deserializes the datapoints (from raw bytes to [jpeg_string, label] -- what we dumped in `RawILSVRC12`)
2. Use OpenCV to decode the first component into an ndarray
3. Apply augmentations to the ndarray
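These steps are just per-component maps chained over the generator. Schematically (with dummy stand-ins for the real decode/augment functions, and a `map_component` helper that only sketches the idea behind tensorpack's component-mapping dataflows):

```python
def map_component(dataflow, func, index=0):
    # Apply `func` to one component of every datapoint.
    for dp in dataflow:
        dp = list(dp)
        dp[index] = func(dp[index])
        yield dp

# Dummy stand-ins for the real stages -- imagine cv2.imdecode and the augmentors:
decode = lambda jpeg_bytes: ("img", jpeg_bytes)
augment = lambda img: ("aug",) + img

ds = iter([[b"jpeg1", 0], [b"jpeg2", 1]])  # [jpeg_string, label], as from LMDB
ds = map_component(ds, decode, index=0)    # step 2
ds = map_component(ds, augment, index=0)   # step 3
```

Because each stage is itself a generator, nothing runs until the final consumer iterates.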
...

Both imdecode and the augmentors can be quite slow. We can parallelize them like this:
Since we are reading the database sequentially, having multiple identical instances of the
underlying DataFlow will result in biased data distribution. Therefore we use `PrefetchData` to
launch the underlying DataFlow in one independent process, and only parallelize the transformations.
(`PrefetchDataZMQ` is faster but not fork-safe, so the first prefetch has to be `PrefetchData`. This
is supposed to get fixed in the future.)
Let me summarize what the above DataFlow does:
1. One process reads the LMDB file, shuffles the datapoints in a buffer, and puts them into a `multiprocessing.Queue` (used by `PrefetchData`).
2. 25 processes take items from the queue, decode and process them into [image, label] pairs, and
   send them through a ZMQ IPC pipe.
3. The main process takes data from the pipe, makes batches and feeds them into the graph,
   according to what [InputSource](http://tensorpack.readthedocs.io/en/latest/tutorial/input-source.html) is used.
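The topology of these three steps can be sketched as follows, with threads standing in for the processes and plain bounded queues for the ZMQ pipe (an illustration only, not tensorpack code):

```python
import queue
import threading

def run_pipeline(datapoints, process_one, nr_worker=4, batch_size=4):
    # 1. one reader thread feeds a bounded queue;
    # 2. nr_worker threads decode/process items from it;
    # 3. the main thread collects results and makes batches.
    raw_q, out_q = queue.Queue(32), queue.Queue(32)
    STOP = object()

    def reader():
        for dp in datapoints:
            raw_q.put(dp)
        for _ in range(nr_worker):       # one stop marker per worker
            raw_q.put(STOP)

    def worker():
        while True:
            dp = raw_q.get()
            if dp is STOP:
                out_q.put(STOP)
                return
            out_q.put(process_one(dp))   # stand-in for decode + augment

    threading.Thread(target=reader, daemon=True).start()
    for _ in range(nr_worker):
        threading.Thread(target=worker, daemon=True).start()

    batches, batch, finished = [], [], 0
    while finished < nr_worker:
        item = out_q.get()
        if item is STOP:
            finished += 1
            continue
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    return batches
```

In the real pipeline the reader and workers are separate processes and the pipe is ZMQ, which sidesteps the GIL entirely; note also that, as in the real pipeline, the output order across workers is not deterministic.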
The above DataFlow can run at a speed of 1k ~ 2k images per second if you have good CPUs, RAM, disks and augmentors.
As a reference, tensorpack can train ResNet-18 (a shallow ResNet) at 4.5 batches (1.2k images) per second on 4 old TitanX.
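A quick arithmetic check that this training speed sits inside the 1k ~ 2k images/s range the DataFlow provides:

```python
batches_per_sec = 4.5
batch_size = 256
images_per_sec = batches_per_sec * batch_size
print(images_per_sec)  # 1152.0, i.e. ~1.2k images/s
```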
A DGX-1 (8 P100) can train ResNet-50 at 1.7k images/s according to the [official benchmark](https://www.tensorflow.org/performance/benchmarks).
So DataFlow will not be a serious bottleneck if configured properly.