Commit ea720903 authored by Yuxin Wu's avatar Yuxin Wu

add performance tuning doc

parent f0ee73fc
...@@ -57,3 +57,7 @@ Unmatched variables on both sides will be printed as a warning.
Note that the above methods only prevent variables from being updated by SGD.
Some variables may be updated by other means,
e.g., BatchNorm statistics are updated through the `UPDATE_OPS` collection and the [RunUpdateOps](../modules/callbacks.html#tensorpack.callbacks.RunUpdateOps) callback.
## My training is slow!
Check out the [Performance Tuning tutorial](performance-tuning.html).
...@@ -37,6 +37,7 @@ User Tutorials
callback
summary
faq
performance-tuning
Extend Tensorpack
=================
...
...@@ -79,18 +79,3 @@ When you set `TrainConfig(dataflow=)`, tensorpack trainers automatically adds pr
You can also use the `TrainConfig(data=)` option to use a customized `InputSource`.
In case you want to use TF ops rather than a DataFlow, you can use `TensorInput` as the `InputSource`
(see the [PTB example](../../tensorpack/tree/master/examples/PennTreebank)).
## Figure out the Bottleneck
Training and data preparation run in parallel, and the faster one blocks to wait for the slower one,
so the overall throughput is determined by the slower of the two.
There is no way to accurately benchmark two threads waiting on queues
without introducing overhead. However, there are ways to understand which one is the bottleneck:
1. Look at the average occupancy (size) of the queue. This information is summarized by tensorpack by default.
   If the queue is nearly empty (default size is 50), then the input source is the bottleneck.
2. Benchmark them separately. Use `TestDataSpeed` to benchmark a DataFlow.
Use `FakeData(..., random=False)` as a fast DataFlow, to benchmark the training iterations plus the copies.
Or use `DummyConstantInput` as a fast InputSource, to benchmark the training iterations only.
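To make the "benchmark them separately" idea concrete, here is a pure-Python sketch of timing a data iterator in isolation — a stand-in for what `TestDataSpeed` does for a real DataFlow. `benchmark_iterator` and `fake_data` are hypothetical helpers for illustration, not tensorpack APIs:

```python
import time

def benchmark_iterator(it, num=1000):
    """Time `num` datapoints from an iterator and return the rate
    (datapoints per second). Hypothetical helper, in the spirit of
    TestDataSpeed."""
    start = time.monotonic()
    count = 0
    for _ in it:
        count += 1
        if count >= num:
            break
    elapsed = max(time.monotonic() - start, 1e-9)  # guard against zero
    return count / elapsed

def fake_data():
    # Stand-in for a constant DataFlow, like FakeData(..., random=False):
    # yields the same datapoint forever, with no per-item generation cost.
    datapoint = [0] * 224
    while True:
        yield datapoint

rate = benchmark_iterator(fake_data(), num=10000)
print("{:.0f} datapoints/s".format(rate))
```

If the rate measured this way is far above your training throughput, the input pipeline is not the bottleneck.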
# Performance Tuning
Here's a list of things you can do when your training is slow:
## Figure out the bottleneck
1. If you use feed-based input (not recommended) and your datapoints are large, data loading is likely to be the
   bottleneck.
2. If you use queue-based input + dataflow, look for the queue size statistics in the
   training log. Ideally the queue should be near-full (default size is 50).
   If the size is near-zero, data is the bottleneck.
3. If the GPU utilization is low, data is likely to be the bottleneck. Also make sure the GPUs are not locked in the P8 power state.
## Benchmark the components
1. Use `data=DummyConstantInput(shapes)` in `TrainConfig`,
   so that the iterations don't take any data from the Python side but train on a constant tensor.
   This will help you find the slow operations in your graph.
2. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow with a constant one.
   Compared to using `DummyConstantInput`, this also includes the extra Python-TF overhead, which is supposed to be negligible.
3. If you're using a TF-based input pipeline you wrote, you can simply run it in a loop and test its speed.
4. Use `TestDataSpeed(mydf).start()` to benchmark your DataFlow.
A benchmark will give you more precise information about which part you should improve.
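Concretely, the first two substitutions are small changes to `TrainConfig` (a sketch only; `shapes` is a hypothetical set of input shapes, and the rest of the config is elided):

```python
from tensorpack import TrainConfig, FakeData, DummyConstantInput

# Hypothetical input shapes: a batch of images and their labels.
shapes = [[64, 224, 224, 3], [64]]

# 1. Train on a constant tensor: measures graph iteration speed only.
config = TrainConfig(
    data=DummyConstantInput(shapes),
    # model=..., callbacks=..., etc. unchanged
)

# 2. Train on a constant DataFlow: additionally includes the Python-TF copy.
config = TrainConfig(
    dataflow=FakeData(shapes, random=False),
    # model=..., callbacks=..., etc. unchanged
)
```

Comparing the iteration speed of these two runs against your real run tells you how much time the graph, the copy, and the DataFlow each account for.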
## Improve DataFlow
Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial,
so that you have an idea of what your DataFlow is doing.
Benchmark your DataFlow with modifications to understand which part makes it slow. Some examples
include:
1. Remove everything except for the raw reader (and perhaps add some prefetching).
2. Remove some suspicious pre-processing.
3. Change the number of parallel processes or threads.
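The ablation idea above can be sketched without tensorpack at all: time the raw reader, then the reader plus a suspect preprocessing stage, and compare. All names below are illustrative stand-ins:

```python
import time

def timed(it, num):
    """Return the wall time to pull `num` datapoints from an iterator."""
    start = time.monotonic()
    for i, _ in enumerate(it):
        if i + 1 >= num:
            break
    return time.monotonic() - start

def reader():
    # Stand-in for a raw data reader.
    for i in range(10 ** 6):
        yield [float(i)] * 64

def expensive_preproc(dp):
    # Stand-in for a suspicious preprocessing step.
    return [x ** 0.5 for x in dp] * 2

N = 20000
t_raw = timed(reader(), N)
t_full = timed((expensive_preproc(dp) for dp in reader()), N)
print("raw: %.3fs, with preproc: %.3fs" % (t_raw, t_full))
```

A large gap between the two timings points at the preprocessing step; a small gap means the reader itself is the problem.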
A DataFlow could be blocked by CPU/hard disk/network/IPC bandwidth. Only by benchmarking will you
know the reason and improve it accordingly, e.g.:
1. Use a single-file database to avoid random reads on the hard disk.
2. Write faster pre-processing, or use distributed data preprocessing to reduce CPU burden.
3. Compress your data (e.g. use uint8 images, or JPEG-compressed images) before sending them through
anything (network, ZMQ pipe, Python-TF copy etc.)
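To see why compression matters, compare per-image sizes (simple arithmetic with illustrative numbers; actual JPEG ratios vary with content and quality):

```python
# Size of one 224x224x3 image under different encodings.
h, w, c = 224, 224, 3
float32_bytes = h * w * c * 4   # raw float32
uint8_bytes = h * w * c * 1     # raw uint8: 4x smaller
# JPEG often compresses natural images by roughly 10x vs raw uint8
# (a rough rule of thumb, not a guarantee).
jpeg_bytes_estimate = uint8_bytes // 10

print(float32_bytes, uint8_bytes, jpeg_bytes_estimate)
```

Every byte saved here is saved again at each hop: disk, network, ZMQ pipe, and the Python-TF copy.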
## Improve TensorFlow
You can add a `GraphProfiler` callback when benchmarking the graph. It will
dump TF tracing information (to either TensorBoard or chrome) to help diagnose the issue.
Usually there isn't much you can do if a TF op is slow, except to optimize the kernels.
But there may be something cheap you can try:
1. The device placement of ops can affect speed;
   sometimes it helps to change the placement to avoid unnecessary copies.
2. Sometimes there are several mathematically equivalent ways of writing the same model
with different speed.
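A sketch of enabling the profiler described here, assuming the `TrainConfig`-based setup used in the rest of this document:

```python
from tensorpack.callbacks import GraphProfiler

# Add to your TrainConfig's callback list while benchmarking.
# dump_tracing=True writes chrome tracing files (open via chrome://tracing);
# dump_event=True writes event files viewable in TensorBoard.
callbacks = [
    GraphProfiler(dump_tracing=True, dump_event=True),
]
```

Remove the callback again for real training runs, since profiling itself adds overhead.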
...@@ -117,7 +117,8 @@ class GraphProfiler(Callback):
    Args:
        dump_metadata(bool): Dump :class:`tf.RunMetadata` to be used with tfprof.
        dump_tracing(bool): Dump chrome tracing files.
        dump_event(bool): Dump to an event processed by FileWriter and
            will be shown in TensorBoard.
    """
    self._dir = logger.LOG_DIR
    self._dump_meta = bool(dump_metadata)
...