Commit ea720903 authored by Yuxin Wu's avatar Yuxin Wu

add performance tuning doc

parent f0ee73fc
...@@ -57,3 +57,7 @@ Unmatched variables on both sides will be printed as a warning.
Note that the above methods only prevent variables from being updated by SGD.
Some variables may be updated by other means,
e.g., BatchNorm statistics are updated through the `UPDATE_OPS` collection and the [RunUpdateOps](../modules/callbacks.html#tensorpack.callbacks.RunUpdateOps) callback.
## My training is slow!
Check out the [Performance Tuning tutorial](performance-tuning.html).
...@@ -37,6 +37,7 @@ User Tutorials
callback
summary
faq
performance-tuning
Extend Tensorpack
=================
...
...@@ -79,18 +79,3 @@ When you set `TrainConfig(dataflow=)`, tensorpack trainers automatically adds pr
You can also use the `TrainConfig(data=)` option to use a customized `InputSource`.
In case you want to use TF ops rather than a DataFlow, you can use `TensorInput` as the `InputSource`
(see the [PTB example](../../tensorpack/tree/master/examples/PennTreebank)).
## Figure out the Bottleneck
Training and data preparation run in parallel, and the faster one blocks to wait for the slower one,
so the overall throughput is determined by the slower of the two.
There is no way to accurately benchmark two threads waiting on queues
without introducing overhead. However, there are ways to understand which one is the bottleneck:
1. Look at the average occupancy (size) of the queue. This information is summarized by tensorpack by default.
   If the queue is nearly empty (default size is 50), then the input source is the bottleneck.
2. Benchmark them separately. Use `TestDataSpeed` to benchmark a DataFlow.
Use `FakeData(..., random=False)` as a fast DataFlow, to benchmark the training iterations plus the copies.
Or use `DummyConstantInput` as a fast InputSource, to benchmark the training iterations only.
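To make the "benchmark them separately" idea concrete, here is a pure-Python sketch of timing a data iterator in isolation — a stand-in for what `TestDataSpeed` does for a real DataFlow. `benchmark_iterator` and `fake_data` are hypothetical helpers for illustration, not tensorpack APIs:

```python
import time

def benchmark_iterator(it, num=1000):
    """Time `num` datapoints from an iterator and return the rate
    (datapoints per second). Hypothetical helper, in the spirit of
    TestDataSpeed."""
    start = time.monotonic()
    count = 0
    for _ in it:
        count += 1
        if count >= num:
            break
    elapsed = max(time.monotonic() - start, 1e-9)  # guard against zero
    return count / elapsed

def fake_data():
    # Stand-in for a constant DataFlow, like FakeData(..., random=False):
    # yields the same datapoint forever, with no per-item generation cost.
    datapoint = [0] * 224
    while True:
        yield datapoint

rate = benchmark_iterator(fake_data(), num=10000)
print("{:.0f} datapoints/s".format(rate))
```

If the rate measured this way is far above your training throughput, the input pipeline is not the bottleneck.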
# Performance Tuning
Here's a list of things you can do when your training is slow:
## Figure out the bottleneck
1. If you use feed-based input (not recommended) and your datapoints are large, data loading is likely to be the
   bottleneck.
2. If you use queue-based input + dataflow, look for the queue size statistics in the
   training log. Ideally the queue should be near-full (default size is 50).
   If the size is near-zero, data is the bottleneck.
3. If the GPU utilization is low, data is likely to be the bottleneck. Also make sure the GPUs are not locked in the P8 power state.
## Benchmark the components
1. Use `data=DummyConstantInput(shapes)` in `TrainConfig`,
   so that the iterations don't take any data from the Python side but train on a constant tensor.
   This will help you find the slow operations in your graph.
2. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow with a constant one.
   Compared to using `DummyConstantInput`, this also includes the extra Python-TF overhead, which is supposed to be negligible.
3. If you're using a TF-based input pipeline you wrote, you can simply run it in a loop and test its speed.
4. Use `TestDataSpeed(mydf).start()` to benchmark your DataFlow.
A benchmark will give you more precise information about which part you should improve.
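Concretely, the first two substitutions are small changes to `TrainConfig` (a sketch only; `shapes` is a hypothetical set of input shapes, and the rest of the config is elided):

```python
from tensorpack import TrainConfig, FakeData, DummyConstantInput

# Hypothetical input shapes: a batch of images and their labels.
shapes = [[64, 224, 224, 3], [64]]

# 1. Train on a constant tensor: measures graph iteration speed only.
config = TrainConfig(
    data=DummyConstantInput(shapes),
    # model=..., callbacks=..., etc. unchanged
)

# 2. Train on a constant DataFlow: additionally includes the Python-TF copy.
config = TrainConfig(
    dataflow=FakeData(shapes, random=False),
    # model=..., callbacks=..., etc. unchanged
)
```

Comparing the iteration speed of these two runs against your real run tells you how much time the graph, the copy, and the DataFlow each account for.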
## Improve DataFlow
Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial,
so that you have an idea of what your DataFlow is doing.
Benchmark your DataFlow with modifications to understand which part makes it slow. Some examples
include:
1. Remove everything except for the raw reader (and perhaps add some prefetching).
2. Remove some suspicious pre-processing.
3. Change the number of parallel processes or threads.
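The ablation idea above can be sketched without tensorpack at all: time the raw reader, then the reader plus a suspect preprocessing stage, and compare. All names below are illustrative stand-ins:

```python
import time

def timed(it, num):
    """Return the wall time to pull `num` datapoints from an iterator."""
    start = time.monotonic()
    for i, _ in enumerate(it):
        if i + 1 >= num:
            break
    return time.monotonic() - start

def reader():
    # Stand-in for a raw data reader.
    for i in range(10 ** 6):
        yield [float(i)] * 64

def expensive_preproc(dp):
    # Stand-in for a suspicious preprocessing step.
    return [x ** 0.5 for x in dp] * 2

N = 20000
t_raw = timed(reader(), N)
t_full = timed((expensive_preproc(dp) for dp in reader()), N)
print("raw: %.3fs, with preproc: %.3fs" % (t_raw, t_full))
```

A large gap between the two timings points at the preprocessing step; a small gap means the reader itself is the problem.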
A DataFlow could be blocked by CPU/hard disk/network/IPC bandwidth. Only by benchmarking will you
know the reason and improve it accordingly, e.g.:
1. Use a single-file database to avoid random reads on the hard disk.
2. Write faster pre-processing, or use distributed data preprocessing to reduce CPU burden.
3. Compress your data (e.g. use uint8 images, or JPEG-compressed images) before sending them through
anything (network, ZMQ pipe, Python-TF copy etc.)
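To see why compression matters, compare per-image sizes (simple arithmetic with illustrative numbers; actual JPEG ratios vary with content and quality):

```python
# Size of one 224x224x3 image under different encodings.
h, w, c = 224, 224, 3
float32_bytes = h * w * c * 4   # raw float32
uint8_bytes = h * w * c * 1     # raw uint8: 4x smaller
# JPEG often compresses natural images by roughly 10x vs raw uint8
# (a rough rule of thumb, not a guarantee).
jpeg_bytes_estimate = uint8_bytes // 10

print(float32_bytes, uint8_bytes, jpeg_bytes_estimate)
```

Every byte saved here is saved again at each hop: disk, network, ZMQ pipe, and the Python-TF copy.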
## Improve TensorFlow
You can add a `GraphProfiler` callback when benchmarking the graph. It will
dump TF tracing information (to either TensorBoard or chrome) to help diagnose the issue.
Usually there isn't much you can do if a TF op is slow, except to optimize the kernels.
But there may be something cheap you can try:
1. The device placement of ops can affect speed;
   sometimes it helps to change the placement to avoid unnecessary copies.
2. Sometimes there are several mathematically equivalent ways of writing the same model
with different speed.
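A sketch of enabling the profiler described here, assuming the `TrainConfig`-based setup used in the rest of this document:

```python
from tensorpack.callbacks import GraphProfiler

# Add to your TrainConfig's callback list while benchmarking.
# dump_tracing=True writes chrome tracing files (open via chrome://tracing);
# dump_event=True writes event files viewable in TensorBoard.
callbacks = [
    GraphProfiler(dump_tracing=True, dump_event=True),
]
```

Remove the callback again for real training runs, since profiling itself adds overhead.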
...@@ -117,7 +117,8 @@ class GraphProfiler(Callback):
    Args:
        dump_metadata(bool): Dump :class:`tf.RunMetadata` to be used with tfprof.
        dump_tracing(bool): Dump chrome tracing files.
        dump_event(bool): Dump to an event processed by FileWriter and
            will be shown in TensorBoard.
    """
    self._dir = logger.LOG_DIR
    self._dump_meta = bool(dump_metadata)
...