2. The requirements for reading the training set and the validation set are different.
In training it's OK to reorder, regroup, or even duplicate some datapoints, as long as the
data distribution stays roughly the same.
But in validation we often need the exact set of data, to be able to compute a correct and comparable score.
This will affect how we build the DataFlow.
3. Actual performance depends not only on the disk, but also on memory (for caching) and CPU (for data processing).
You may need to tune the parameters (#processes, #threads, size of buffer, etc.)
or change the pipeline for new tasks and new machines to achieve the best performance.
4. This tutorial could be a bit complicated for people new to system architectures, but you do need these techniques to run fast enough on an ImageNet-sized dataset.
However, for smaller datasets (e.g. several GBs of images with lightweight preprocessing), a simple reader plus some prefetch should work well enough.
Figure out the bottleneck first, before trying to optimize any piece in the whole system.
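
To make point 2 concrete, here is a minimal sketch in plain Python (it does not use the DataFlow API). The names `training_stream` and `validation_stream` are hypothetical and chosen only for illustration. The training reader reshuffles on every pass and never ends, while the validation reader yields each datapoint exactly once in a fixed order.

```python
import random
from itertools import islice


def training_stream(datapoints, seed=None):
    """Endless stream over the data, reshuffled on every pass.

    Order is not preserved, which is acceptable for training as long
    as the overall data distribution stays roughly the same.
    """
    rng = random.Random(seed)
    indices = list(range(len(datapoints)))
    while True:
        rng.shuffle(indices)
        for i in indices:
            yield datapoints[i]


def validation_stream(datapoints):
    """Single pass in the original order, each datapoint exactly once."""
    for dp in datapoints:
        yield dp


if __name__ == "__main__":
    data = list(range(10))
    # Training: take as many datapoints as needed; order varies per pass.
    print(list(islice(training_stream(data, seed=0), 15)))
    # Validation: always the exact same set, so scores stay comparable.
    print(list(validation_stream(data)))
```

The same split usually carries over to a real pipeline: shuffling, buffering, and parallel readers that may reorder data belong on the training side only.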
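
One simple way to look for the bottleneck mentioned in point 4 is to measure how many datapoints per second each stage delivers on its own. The sketch below is plain Python under stated assumptions: `measure_throughput` and `fake_reader` are hypothetical helpers, and the `time.sleep` call only simulates per-datapoint cost such as disk reads or decoding.

```python
import time
from itertools import islice


def measure_throughput(datapoints, num=1000, warmup=10):
    """Return the number of datapoints per second drawn from an iterable."""
    it = iter(datapoints)
    # Discard a few datapoints so one-time startup cost is not counted.
    for _ in islice(it, warmup):
        pass
    start = time.perf_counter()
    count = sum(1 for _ in islice(it, num))
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def fake_reader(n, delay=0.0):
    """Hypothetical stand-in for a real reader; `delay` simulates per-item cost."""
    for i in range(n):
        if delay:
            time.sleep(delay)
        yield i


if __name__ == "__main__":
    fast = measure_throughput(fake_reader(5000), num=1000)
    slow = measure_throughput(fake_reader(200, delay=0.001), num=100)
    print(f"no simulated cost: {fast:.0f} datapoints/s")
    print(f"1 ms per datapoint: {slow:.0f} datapoints/s")
```

Measuring the raw reader first, then the reader plus preprocessing, and comparing both against the speed the training loop consumes data shows which stage is worth tuning.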