
# Input Pipeline

This tutorial discusses, in general terms, how to read data efficiently for TensorFlow,
and how tensorpack supports these methods.
You don't have to read it, because these are details under the tensorpack interface,
but knowing them can help you understand the efficiency and choose the best input pipeline for your task.

## Prepare Data in Parallel

![prefetch](https://cloud.githubusercontent.com/assets/1381301/26525192/36e5de48-4304-11e7-88ab-3b790bd0e028.png)

A piece of common sense that holds no matter what framework you use:
<center>
Prepare data in parallel with the training!
</center>

The reasons are:
1. Data preparation often consumes non-trivial time (depending on the actual problem).
2. Data preparation often uses completely different resources from training (see figure above) --
	doing them together doesn't slow you down. In fact you can further parallelize different stages in
	the preparation since they also use different resources.
3. Data preparation often doesn't depend on the result of the previous training step.
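
The idea can be sketched in plain Python with a background thread and a bounded queue. This is only a minimal illustration and not how tensorpack implements it: `prefetch`, `slow_reader`, `train_step` and `run` are hypothetical names, and the `time.sleep` calls stand in for real reading and training work.

```python
import queue
import threading
import time


def prefetch(generator, buffer_size=4):
    """Run `generator` in a background thread and yield its items from a bounded queue."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # unique marker for the end of the stream

    def worker():
        for item in generator:
            q.put(item)  # blocks while the buffer is full
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # blocks until the producer has an item ready
        if item is sentinel:
            return
        yield item


def slow_reader(n, delay=0.01):
    """Hypothetical data source: each item takes `delay` seconds to prepare."""
    for i in range(n):
        time.sleep(delay)  # stands in for disk reads and decoding
        yield i


def train_step(batch, delay=0.01):
    """Hypothetical training step that takes `delay` seconds."""
    time.sleep(delay)
    return batch


def run(data):
    """Consume `data` with `train_step`; return the results and the elapsed seconds."""
    start = time.perf_counter()
    results = [train_step(b) for b in data]
    return results, time.perf_counter() - start
```

Calling `run(slow_reader(100))` pays for reading and training one after the other, while `run(prefetch(slow_reader(100)))` lets the two overlap. A thread is enough here because `time.sleep` releases the GIL; CPU-heavy Python preprocessing generally needs multiple processes to get the same overlap.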

Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
4 P100 GPUs can train ResNet50 at 852 images/sec, and the size of those images is 852\*224\*224\*3\*4bytes = 489MB.
Assuming you have 5GB/s `memcpy` bandwidth, simply copying the data once would take 0.1s for every second of training,
slowing down your training by 10%. Think about how many more copies are made during your preprocessing.
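
The arithmetic can be checked with a few lines of Python. The variable names are only illustrative; the 5GB/s bandwidth is the assumption made in the text, and MB is taken to mean 2^20 bytes, which is what reproduces the 489MB figure.

```python
images_per_sec = 852                 # ResNet50 throughput on 4 P100 GPUs
bytes_per_image = 224 * 224 * 3 * 4  # height * width * channels * 4 bytes per value

bytes_per_sec = images_per_sec * bytes_per_image
mib_per_sec = bytes_per_sec / 2**20  # about 489

memcpy_bandwidth = 5e9               # bytes/s, assumed in the text
copy_time = bytes_per_sec / memcpy_bandwidth  # seconds spent copying per second of training

print(f"{mib_per_sec:.0f} MB per second of training, {copy_time:.2f}s per copy")
```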

Failure to hide the data preparation latency is the major reason why people
do not see good GPU utilization. __Always choose a framework that allows latency hiding.__
However, most other TensorFlow wrappers are designed to be `feed_dict` based.
This is the major reason why tensorpack is [faster](https://github.com/tensorpack/benchmarks).

## Python Reader or TF Reader?

The above discussion is valid regardless of what you use to load/preprocess data,
either Python code or TensorFlow operators (written in C++).

The benefits of using TensorFlow ops are:
* Faster read/preprocessing.

	* Potentially true, but not necessarily. With Python code you can call a variety of other fast libraries, which
		you have no access to in TF ops. For example, LMDB could be faster than TFRecords.
	* Python may be just fast enough.

		As long as data preparation runs faster than training, and the latency of all four blocks in the
		above figure is hidden, it makes no difference at all.
		For most types of problems, up to the scale of multi-GPU ImageNet training,
		Python can offer enough speed if you use a fast library (e.g. `tensorpack.dataflow`).
		See the [Efficient DataFlow](efficient-dataflow.html) tutorial
		on how to build a fast Python reader with DataFlow.

* No "Copy to TF" (i.e. `feed_dict`) stage.

	* True. But as mentioned above, the latency can usually be hidden.

		In tensorpack, TF queues are used to hide the "Copy to TF" latency,
		and TF `StagingArea` can help hide the "Copy to GPU" latency.
		They are used by most examples in tensorpack.

The benefit of using a Python reader is obvious:
it's much easier to write Python to read different data formats,
handle corner cases in noisy data, preprocess, etc.
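
As a small illustration of that point, the sketch below parses a hypothetical `path,label` list file and skips records it cannot use. The file format and the `read_samples` name are assumptions made for this example, not part of tensorpack.

```python
def read_samples(lines):
    """Yield (path, label) pairs from "path,label" lines, skipping malformed records."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        try:
            label = int(parts[1])
        except ValueError:
            continue  # label is not an integer
        yield parts[0].strip(), label
```

Each corner case takes one or two lines here.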

## InputSource

`InputSource` is an abstract interface in tensorpack that describes where the inputs come from and how they enter the graph.
For example,

1. [FeedInput](../modules/input_source.html#tensorpack.input_source.FeedInput):
	Data comes from a DataFlow and is fed to the graph.
2. [QueueInput](../modules/input_source.html#tensorpack.input_source.QueueInput):
	Data comes from a DataFlow and is prefetched on CPU by a TF queue.
3. [StagingInput](../modules/input_source.html#tensorpack.input_source.StagingInput):
	Data comes from some `InputSource`, then is prefetched on GPU by a TF StagingArea.
4. Data comes from a DataFlow, and is further processed by `tf.data.Dataset`.
5. [TensorInput](../modules/input_source.html#tensorpack.input_source.TensorInput):
	Data comes from some TF reading ops. (See the [PTB example](../examples/PennTreebank))
6. Data comes from some ZMQ pipe, where the load/preprocessing may happen on a different machine.

