It's Yet Another TF wrapper, but different in:
Tensorpack includes only a few common models, and helpful tools such as `LinearWrap` to simplify large models.
But you can use any other wrappers within tensorpack, such as sonnet/Keras/slim/tflearn/tensorlayer/....
2. Focus on __training speed__.
+ Tensorpack trainer is almost always faster than `feed_dict` based wrappers.
Even on a tiny CNN example, the training runs [2x faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6) than the equivalent Keras code.
+ Data-Parallel Multi-GPU training is off-the-shelf to use. It is as fast as Google's [benchmark code](https://github.com/tensorflow/benchmarks).
3. Focus on large datasets.
+ __DataFlow__ allows you to process large datasets such as ImageNet in pure Python without blocking the training.
+ DataFlow has a unified interface, so you can compose and reuse DataFlows to perform complex preprocessing (see the sketch after this list).
4. Interface of extensible __Callbacks__.
Write a callback to implement everything you want to do apart from the training iterations, and
...
...
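To make the composition point concrete, here is a minimal sketch of chaining DataFlows (class and argument names follow the `tensorpack.dataflow` API of this era and may differ slightly between versions):

```python
# A minimal sketch of composing DataFlows; names follow tensorpack.dataflow
# and may differ slightly across versions.
from tensorpack.dataflow import dataset, BatchData, PrefetchDataZMQ

df = dataset.Mnist('train')          # a DataFlow producing (image, label) datapoints
df = BatchData(df, 128)              # compose: group datapoints into batches of 128
df = PrefetchDataZMQ(df, nr_proc=4)  # compose: prefetch batches in 4 parallel processes

df.reset_state()                     # required before iterating outside a trainer
for dp in df.get_data():             # dp is a list: [batched images, batched labels]
    pass                             # consume the datapoints, e.g. feed them to training
```

Each wrapper is itself a DataFlow, so the same pieces can be reused in front of any trainer.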
This tutorial covers how data goes from DataFlow or other sources to the TensorFlow graph.
You don't have to know these details, but they may help with efficiency.
## Use TensorFlow queues
`InputSource` is an abstract interface in tensorpack, describing where the input comes from and how it enters the graph.
For example,
1. Come from a DataFlow and be fed to the graph.
2. Come from a DataFlow and be prefetched on CPU by a TF queue.
3. Come from a DataFlow, be prefetched on CPU by a TF queue, then prefetched on GPU by a TF StagingArea.
4. Come from some TF native reading pipeline.
5. Come from some ZMQ pipe.
For most tasks, DataFlow with some prefetch is fast enough. You can use the `TrainConfig(data=)` option
to customize your `InputSource`.
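As an illustration, a minimal sketch of picking an `InputSource` explicitly might look like the following (the `QueueInput` wrapper and `TrainConfig` arguments follow the tensorpack API of this era; `MyModel` and `df` are placeholders for your own model and DataFlow):

```python
# A sketch only: wrap a DataFlow in a CPU-side TF queue and hand it to the trainer.
# `MyModel` (a ModelDesc subclass) and the DataFlow `df` are placeholders.
from tensorpack import TrainConfig, QueueInput

config = TrainConfig(
    model=MyModel(),
    data=QueueInput(df),   # InputSource: DataFlow -> CPU prefetch queue
    callbacks=[],
    max_epoch=100,
)
```

If you pass only `dataflow=` instead of `data=`, trainers of this era usually wrap it in a default `InputSource` (feed- or queue-based, depending on the trainer) for you.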
## Use Prefetch
In general, `feed_dict` is slow and should never appear in your critical loop.
i.e., when you use TensorFlow without any wrappers, you should avoid loops like this:
```python
while True:
    X, y = get_some_data()
    minimize_op.run(feed_dict={'X': X, 'y': y})
```
However, when you need to load data from the Python side, this is the only available interface in frameworks such as Keras or tflearn.
This is part of the reason why [tensorpack is faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6) than examples from other frameworks.
You should use something like this instead, to prefetch data into the graph in one thread and hide the copy latency:
```python
# Thread 1:
while True:
    X, y = get_some_data()
    enqueue_op.run(feed_dict={'X': X, 'y': y})  # feed data into a TF queue (enqueue_op is illustrative)

# Thread 2:
while True:
    minimize_op.run()  # minimize_op was built from dequeued tensors
```
This is automatically handled by tensorpack trainers; see [Trainer](trainer.md) for details.
TensorFlow StagingArea can further hide H2D (CPU->GPU) copy latency.
It is also automatically included in tensorpack when you use Synchronous MultiGPU training.
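For intuition, here is a rough TF 1.x sketch of what a StagingArea does (this is not tensorpack's internal code; `cpu_images` and `cpu_labels` stand for tensors already dequeued on the CPU side, and the StagingArea is assumed to live on the GPU device):

```python
# A rough sketch of hiding the H2D copy with a StagingArea (TF 1.x contrib API).
# `cpu_images` / `cpu_labels` are placeholders for tensors dequeued on the CPU side.
import tensorflow as tf

stage = tf.contrib.staging.StagingArea(dtypes=[tf.float32, tf.int32])
copy_op = stage.put([cpu_images, cpu_labels])   # starts the CPU->GPU copy for the next step
gpu_images, gpu_labels = stage.get()            # tensors for the current step
# Running `copy_op` together with the train op each step overlaps step N's
# compute with step N+1's data transfer.
```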
You can also avoid `feed_dict` by using TensorFlow native operators to read data, which is also supported in tensorpack.
It probably allows you to reach the best performance,
but at the cost of implementing the reading / preprocessing ops in C++ if there isn't one for your task.
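For reference, a TF-native reading pipeline of this era (TF 1.x queue runners) looks roughly like the sketch below; the file name, feature spec, and image shape are placeholders for your own data:

```python
# A rough TF 1.x sketch of reading data with native ops instead of feed_dict.
# The file name, feature spec, and image shape are placeholders.
import tensorflow as tf

filename_queue = tf.train.string_input_producer(['train-00000.tfrecord'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    'image': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.decode_raw(features['image'], tf.uint8)
image = tf.reshape(image, [224, 224, 3])        # placeholder shape
images, labels = tf.train.shuffle_batch(
    [image, features['label']], batch_size=64,
    capacity=1000, min_after_dequeue=500)
# `images` / `labels` can then be consumed directly by the graph, without feed_dict.
```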
## Figure out the bottleneck
For training, we only worry about throughput, not latency.
Thread 1 & 2 run in parallel, and the faster one will block to wait for the slower one.
So the overall throughput is determined by the slower one.
There is no way to accurately benchmark the two dependent threads while they are running,
without introducing overhead. However, there are ways to understand which one is the bottleneck:
1. Use the average occupancy (size) of the queue. This information is summarized by default.
If the queue is nearly empty (default size 50), then the input source is the bottleneck.
2. Benchmark them separately. You can use `TestDataSpeed` to benchmark a DataFlow, and
use `FakeData` as a fast replacement in a dry run to benchmark the training iterations, as sketched below.
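A minimal sketch of the two measurements (class names follow `tensorpack.dataflow`; the shapes below are placeholders for an ImageNet-like input):

```python
# A sketch of benchmarking the two sides separately.
from tensorpack.dataflow import TestDataSpeed, FakeData

# 1. How fast can the DataFlow `df` produce data on its own?
TestDataSpeed(df, size=5000).start()

# 2. How fast can the training iterations run when input is essentially free?
#    Use FakeData (random data of the given shapes) in place of the real DataFlow
#    for a dry run. Shapes here are placeholders.
fake_df = FakeData([[64, 224, 224, 3], [64]], size=1000)
```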
If you find that your input is the bottleneck, you'll need to think about how to speed up your data.
You may either change `InputSource`, or look at [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html).