Our goal in the end is to have a __Python generator__ which yields preprocessed ImageNet images and labels as fast as possible.
Since it is simply a generator interface, you can use the DataFlow in other Python-based frameworks (e.g. Keras)
or your own code as well.
**What we are going to do**: We'll use ILSVRC12 training set, which contains 1.28 million images.
The original images (JPEG compressed) are 140G in total.
The average resolution is about 400x350 <sup>[[1]]</sup>.
Following the [ResNet example](../examples/ResNet), we need images in their original resolution,
...
...
so we will read the original dataset (instead of a down-sampled version), and
then apply complicated preprocessing to it.
We will need to reach a speed of roughly 1k~2k images per second to keep GPUs busy.
Some things to know before reading:
1. Having a fast Python generator **alone** may or may not help with your overall training speed.
You need mechanisms to hide the latency of all preprocessing stages, as mentioned in the
previous tutorial.
2. The requirements for reading the training set and the validation set are different.
In training it's OK to reorder, regroup, or even duplicate some datapoints, as long as the
distribution roughly stays the same.
But in validation we often need the exact set of data, to be able to compute the correct error.
This will affect how we build the DataFlow.
3. The actual performance depends not only on the disk, but also on memory (for caching) and CPU (for data processing).
You may need to tune the parameters (#processes, #threads, size of buffer, etc.)
or change the pipeline for new tasks and new machines to achieve the best performance.
4. This tutorial may seem complicated if you are new to system architectures, but you do need this knowledge to run fast enough on an ImageNet-sized dataset.
However, for smaller datasets (e.g. several GBs of images with lightweight preprocessing), a simple reader plus some prefetch should work well enough.
Figure out the bottleneck first, before trying to optimize any piece in the whole system.
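Before optimizing anything, it helps to measure. Below is a minimal sketch of the plain reading pipeline, using tensorpack's `dataset.ILSVRC12` reader and `TestDataSpeed` (the dataset path is a placeholder):

```python
from tensorpack.dataflow import dataset, BatchData, TestDataSpeed

# Minimal baseline pipeline; '/path/to/ILSVRC12' is a placeholder.
ds0 = dataset.ILSVRC12('/path/to/ILSVRC12', 'train', shuffle=True)
ds1 = BatchData(ds0, 256, use_list=True)  # lists: images have different shapes
TestDataSpeed(ds1).start()  # prints iterations/sec, so you can locate the bottleneck
```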
By default, `BatchData` will stack the datapoints into a `numpy.ndarray`, but since original images are of different shapes, we use
`use_list=True` so that it just produces lists.
On a good filesystem you probably can already observe good speed here (e.g. 5 it/s, that is 1280 images/s), but on HDD the speed may be just 1 it/s,
because we are doing heavy random read on the filesystem (regardless of whether `shuffle` is True).
Image decoding in `cv2.imread` could also be a bottleneck at this early stage.
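To tell the two costs apart, a quick (hypothetical) micro-benchmark can time raw file reads and `cv2.imread` decoding separately on a few sample files:

```python
import time
import cv2

# Hypothetical sample paths; substitute a few real image files.
files = ['/path/to/ILSVRC12/train/some_image.JPEG']

start = time.time()
for f in files:
    with open(f, 'rb') as fp:
        fp.read()                    # raw disk read only
print('read only: %.3fs' % (time.time() - start))

start = time.time()
for f in files:
    cv2.imread(f)                    # disk read + JPEG decoding
print('read + decode: %.3fs' % (time.time() - start))
```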
We will now add the cheapest pre-processing to get an ndarray in the end instead of a list
(because training will need ndarrays eventually):
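For example, here is a sketch using tensorpack's `AugmentImageComponent` with `imgaug.Resize`; the actual augmentors depend on your task:

```python
from tensorpack.dataflow import AugmentImageComponent, BatchData, imgaug

# `ds0` is the ILSVRC12 reader from above. Resizing is about the cheapest
# op that gives all images the same shape, so BatchData can stack them.
ds1 = AugmentImageComponent(ds0, [imgaug.Resize(224)])
ds = BatchData(ds1, 256)
```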
...
...
Now it's time to add threads or processes:
```python
# `ds1` is the DataFlow built in the previous steps
ds = PrefetchDataZMQ(ds1, nr_proc=25)
ds = BatchData(ds, 256)
```
Here we start 25 processes to run `ds1`, and collect their output through the ZMQ IPC protocol.
Using ZMQ to transfer data is faster than `multiprocessing.Queue`, but data copy (even
within one process) can still be quite expensive when you're dealing with large data.
For example, to reduce copy overhead, the ResNet example deliberately moves certain pre-processing (the mean/std normalization) from DataFlow to the graph.
This way the DataFlow only transfers uint8 images, as opposed to float32, which takes 4x more memory.
You can also apply prefetch after batch, of course.
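As a sketch, "prefetch after batch" just reorders the two stages (same components as above):

```python
# Batch first, then run the whole batched pipeline in one extra process;
# only large, already-batched arrays cross the process boundary.
ds = BatchData(ds1, 256)
ds = PrefetchDataZMQ(ds, nr_proc=1)
```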
The above DataFlow might be fast, but since it forks the ImageNet reader (`ds0`),
it's **not a good idea to use it for validation** (for reasons mentioned at top).
Alternatively, you can use multi-threaded preprocessing like this:
```eval_rst
.. code-block:: python
:emphasize-lines: 3-6
...
...