Commit a0247332 authored by Yuxin Wu

docs update

parent a62e68a3
...@@ -41,7 +41,7 @@ It's Yet Another TF wrapper, but different in:
+ Data-Parallel Multi-GPU training is off-the-shelf to use. It is as fast as Google's [benchmark code](https://github.com/tensorflow/benchmarks).
3. Focus on large datasets.
+ It's painful to read/preprocess data from TF. Use __DataFlow__ to efficiently process large datasets such as ImageNet in __pure Python__.
+ DataFlow has a unified interface, so you can compose and reuse them to perform complex preprocessing.
4. Interface of extensible __Callbacks__.
...
...@@ -227,6 +227,8 @@ html_show_copyright = True
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# avoid li fonts being larger
# TODO but li indices fonts are still larger
html_compact_lists = False
# Language to be used for generating the HTML full-text search index.
...
...@@ -26,7 +26,7 @@ tensorpack.tfutils.gradproc module ...@@ -26,7 +26,7 @@ tensorpack.tfutils.gradproc module
:show-inheritance: :show-inheritance:
tensorpack.tfutils.model_utils module tensorpack.tfutils.model_utils module
------------------------------------ --------------------------------------
.. automodule:: tensorpack.tfutils.model_utils .. automodule:: tensorpack.tfutils.model_utils
:members: :members:
...@@ -34,7 +34,7 @@ tensorpack.tfutils.model_utils module ...@@ -34,7 +34,7 @@ tensorpack.tfutils.model_utils module
:show-inheritance: :show-inheritance:
tensorpack.tfutils.scope_utils module tensorpack.tfutils.scope_utils module
------------------------------------ --------------------------------------
.. automodule:: tensorpack.tfutils.scope_utils .. automodule:: tensorpack.tfutils.scope_utils
:members: :members:
......
...@@ -5,11 +5,3 @@ tensorpack.train package
:members:
:undoc-members:
:show-inheritance:
tensorpack.train.monitor module
------------------------------------
.. automodule:: tensorpack.train.monitor
:members:
:undoc-members:
:show-inheritance:
...@@ -59,13 +59,3 @@ generator = df.get_data()
for dp in generator:
    # dp is now a list. do whatever
```
### Efficiency
DataFlow is pure Python -- a convenient but slow language (compared to C++). But faster data loading doesn't always mean faster
training: we only need data to be __fast enough__.
DataFlow is fast enough for problems up to the scale of multi-GPU ImageNet training.
See the [efficient dataflow tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html)
for details.
Therefore, for most use cases, writing format conversion/preprocessing code with TensorFlow operators doesn't help you at all.
# Input Pipeline

This tutorial covers some general basics of the possible methods to send data from external sources to the TensorFlow graph,
and how tensorpack supports these methods.
You don't have to read it because these are details under the tensorpack interface,
but knowing it could help you understand the efficiency and choose the best input pipeline for your task.

## Prepare Data in Parallel

![prefetch](https://cloud.githubusercontent.com/assets/1381301/26525192/36e5de48-4304-11e7-88ab-3b790bd0e028.png)

No matter what framework you use, the common-sense rule is:
start preparing the next (batch of) data while you're training!

The reasons are:

1. Data preparation often consumes non-trivial time (depending on the actual problem).
2. Data preparation often uses completely different resources from training --
   doing them together doesn't slow you down. In fact you can further parallelize different stages in
   the preparation, because they also use different resources (as shown in the figure).
3. Data preparation often doesn't depend on the result of the previous training step.
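
For example, in tensorpack the preparation itself can be moved into background processes at the DataFlow level. The following is a minimal sketch; `MyDataFlow` and the process count are placeholders, and the exact import path may vary across tensorpack versions:

```python
from tensorpack.dataflow import PrefetchDataZMQ

df = MyDataFlow()                      # placeholder: your (possibly slow) DataFlow
df = PrefetchDataZMQ(df, nr_proc=4)    # prepare data in 4 background processes, sent over ZMQ pipes
```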
Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
4 P100 GPUs can train ResNet50 at 852 images/sec, and the size of those images is 852\*224\*224\*3\*4 bytes = 489MB.
Assuming you have 5GB/s `memcpy` bandwidth, simply copying the data once would take 0.1s -- slowing
down your training by 10%. Think about how many more copies are made during your preprocessing.
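
For a back-of-the-envelope check of these numbers (plain Python, not tensorpack code; the 5GB/s figure is the bandwidth assumed above):

```python
# Rough check of the numbers quoted above.
images_per_sec = 852                    # ResNet50 throughput on 4 P100s (tensorflow/benchmarks)
bytes_per_image = 224 * 224 * 3 * 4     # one 224x224 RGB image in float32
data_per_sec = images_per_sec * bytes_per_image
print(data_per_sec / 2**20)             # ~489 MB of input consumed per second of training

memcpy_bandwidth = 5 * 2**30            # assumed 5GB/s memcpy bandwidth
print(data_per_sec / memcpy_bandwidth)  # ~0.1s spent per extra copy, i.e. ~10% slowdown
```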
Failure to hide the data preparation latency is the major reason why people
cannot see good GPU utilization. Always choose a framework that allows latency hiding.
## Python or C++ ?

The above discussion is valid regardless of what you use to load/preprocess data: Python code or TensorFlow operators (written in C++).

The benefits of using TensorFlow ops are:
* Faster preprocessing.
* No "Copy to TF" (i.e. `feed_dict`) stage.

Python, on the other hand, is much easier to write and has many more libraries to use.

Though C++ ops are potentially faster, they're usually __not necessary__.
As long as data preparation runs faster than training, it makes no difference at all.
And for most types of problems, up to the scale of multi-GPU ImageNet training,
Python can offer enough speed if written properly (e.g. use `tensorpack.dataflow`).
See the [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html) tutorial.

When you use Python to load/preprocess data, TF `QueueBase` can help hide the "Copy to TF" latency,
and TF `StagingArea` can help hide the "Copy to GPU" latency.
They are used by most examples in tensorpack;
most other TensorFlow wrappers, however, are `feed_dict` based -- no latency hiding at all.
This is the major reason why tensorpack is [faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6).
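
The following is a minimal sketch of the idea behind hiding the "Copy to TF" latency with a queue, written in raw TensorFlow rather than the tensorpack API; `get_some_data` stands in for your own Python loading/preprocessing code:

```python
import threading
import tensorflow as tf

# A small queue sits between the Python side and the training graph.
x_ph = tf.placeholder(tf.float32, [None, 224, 224, 3])
y_ph = tf.placeholder(tf.int32, [None])
queue = tf.FIFOQueue(capacity=50, dtypes=[tf.float32, tf.int32])
enqueue_op = queue.enqueue([x_ph, y_ph])
x, y = queue.dequeue()   # build the model and train_op from these dequeued tensors

def feed_loop(sess):
    """Runs in a background thread: keeps the queue filled while training runs."""
    while True:
        xv, yv = get_some_data()   # placeholder for your Python loading/preprocessing
        sess.run(enqueue_op, feed_dict={x_ph: xv, y_ph: yv})

# In the main thread (after creating `sess` and `train_op`):
#   threading.Thread(target=feed_loop, args=(sess,), daemon=True).start()
#   while True:
#       sess.run(train_op)   # no feed_dict in the training loop
```

Within tensorpack you don't write this loop yourself; the trainers and `InputSource` implementations set up the queue and the feeding for you.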
## InputSource

`InputSource` is an abstract interface in tensorpack that describes where the input comes from and how it enters the graph.
For example, the input may:

1. Come from a DataFlow and get fed to the graph.
2. Come from a DataFlow and get prefetched on CPU by a TF queue.
3. Come from a DataFlow, prefetched on CPU by a TF queue, then prefetched on GPU by a TF StagingArea.
4. Come from some TF native reading pipeline.
5. Come from some ZMQ pipe, where the load/preprocessing may happen on a different machine.

You can use the `TrainConfig(data=)` option to use a customized `InputSource`.
Usually you don't need this API, and only have to specify `TrainConfig(dataflow=)`, because
tensorpack trainers automatically add proper prefetching for you.
In cases where you want to use TF ops rather than DataFlow, you can use `TensorInput` as the `InputSource`
(see the [PTB example](https://github.com/ppwwyyxx/tensorpack/tree/master/examples/PennTreebank)).
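
As a rough sketch of the two common configurations (`my_dataflow` and `my_model` are placeholders, and `QueueInput` is assumed to be the queue-based `InputSource`; import paths may vary across versions):

```python
from tensorpack import TrainConfig, QueueInput

# Most common: give the trainer a DataFlow and let it add prefetching for you.
config = TrainConfig(
    dataflow=my_dataflow,   # placeholder: your DataFlow instance
    model=my_model,         # placeholder: your ModelDesc
    callbacks=[...],
)

# Explicit InputSource, e.g. when you want to customize it:
config = TrainConfig(
    data=QueueInput(my_dataflow),
    model=my_model,
    callbacks=[...],
)
```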
Thread 1 & 2 runs in parallel and the faster one will block to wait for the slower one. ## Figure out the Bottleneck
So the overall throughput will appear to be the slower one.
There is no way to accurately benchmark the two dependent threads while they are running, Training and data preparation run in parallel and the faster one will block to wait for the slower one.
without introducing overhead. However, are ways to understand which one is the bottleneck: So the overall throughput will be dominated by the slower one.
1. Use the average occupancy (size) of the queue. This information is summarized by default. There is no way to accurately benchmark two threads waiting on queues,
If the queue is nearly empty (default size 50), then the input source is the bottleneck. without introducing overhead. However, there are ways to understand which one is the bottleneck:
2. Benchmark them separately. You can use `TestDataSpeed` to benchmark a DataFlow, and 1. Use the average occupancy (size) of the queue. This information is summarized in tensorpack by default.
use `FakeData` as a fast replacement in a dry run, to benchmark the training iterations. If the queue is nearly empty (default size is 50), then the input source is the bottleneck.
If you found your input is the bottleneck, then you'll need to think about how to speed up your data. 2. Benchmark them separately. Use `TestDataSpeed` to benchmark a DataFlow.
You may either change `InputSource`, or look at [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html). Use `FakeData(..., random=False)` as a fast DataFlow, to benchmark the training iterations plus the copies.
Or use `DummyConstantInput` as a fast InputSource, to benchmark the training iterations only.
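
For instance, a DataFlow can be benchmarked on its own roughly like this (a sketch; `MyDataFlow` is a placeholder, and the exact method name may differ across tensorpack versions):

```python
from tensorpack.dataflow import TestDataSpeed

df = MyDataFlow()                         # placeholder: the DataFlow you want to benchmark
TestDataSpeed(df, size=1000).start_test() # iterate 1000 datapoints and report the speed
```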
...@@ -28,7 +28,7 @@ config = TrainConfig(
    callbacks=[...]
)
# start training (with a slow trainer. See 'tutorials - Input Pipeline' for details):
# SimpleTrainer(config).train()
# start training with queue prefetch:
...
...@@ -118,7 +118,7 @@ class ModelDesc(object):
``self.cost``. You can override :meth:`_get_cost()` if needed.
This function also applies the collection
``tf.GraphKeys.REGULARIZATION_LOSSES`` to the cost automatically.
Because slim users would expect the regularizer being automatically applied once used in slim layers.
"""
cost = self._get_cost()
...
...@@ -88,8 +88,7 @@ class SimpleFeedfreeTrainer(SingleCostFeedfreeTrainer):
def __init__(self, config):
"""
Args:
    config (TrainConfig): ``config.data`` must exist and is a :class:`FeedfreeInput`.
"""
self._input_source = config.data
assert isinstance(self._input_source, FeedfreeInput), self._input_source
...
...@@ -7,4 +7,5 @@ exclude = .git,
snippet,
docs,
examples,
_test.py,
docs/conf.py