Commit 96f8f96e authored by Yuxin Wu

update docs

parent 14964cc7
@@ -5,11 +5,13 @@ __We do not know why your training is slow__ (and most of the times it's not a t
Tensorpack is designed to be high-performance, as can be seen in the [benchmarks](https://github.com/tensorpack/benchmarks).
But performance is different across machines and tasks,
so it's not easy to understand what goes wrong without doing some investigation on your own.
Tensorpack has some tools to make it easier to understand the performance.
Here is a list of things you can do to understand why your training is slow.
If you ask for help to understand and improve the speed, PLEASE do the
investigations below, and post your hardware information together with your findings:
what changes you've made and what performance numbers you've seen.
## Figure out the bottleneck
@@ -40,18 +42,29 @@ A benchmark will give you more precise information about which part you should i
## Investigate DataFlow
Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial, so you know what your DataFlow is doing.
Then, make modifications and benchmark to understand what in the data pipeline is your bottleneck.
Do __NOT__ look at training speed when you benchmark a DataFlow; only use the output of
[TestDataSpeed](../modules/dataflow.html#tensorpack.dataflow.TestDataSpeed).
Some example things to try:
1. Benchmark only the raw reader (and perhaps add some parallelism); see the sketch after this list.
2. Gradually add some pre-processing and see how the performance changes.
3. Change the number of parallel processes or threads.
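
For example, a minimal sketch of steps 1 and 3 (the `DataFromList` source and the `nr_proc` value below are placeholders for your own reader and your own tuning):

```python
from tensorpack.dataflow import DataFromList, PrefetchDataZMQ, TestDataSpeed

# Stand-in for your raw reader: replace DataFromList with your own DataFlow.
df = DataFromList([[0.0]] * 10000, shuffle=False)
# Optional parallelism over the raw reader; tune nr_proc and re-benchmark.
df = PrefetchDataZMQ(df, nr_proc=4)
# Benchmark the dataflow alone; do NOT judge it by training speed.
TestDataSpeed(df, size=5000).start()
```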
A DataFlow could be blocked by CPU/disk/network/IPC bandwidth.
Do __NOT__ optimize the DataFlow before knowing what it is blocked on.
By benchmarking with modifications to your dataflow, you can see which
component is the bottleneck. For example, with a simple
dataflow, you can usually do the following:
1. If your dataflow becomes fast enough after removing some pre-processing (e.g.
   augmentations), then the pre-processing is the bottleneck.
1. Without pre-processing, your dataflow is just reading + parallelism, which
   includes both the reading cost and the multiprocess communication cost.
   You can now let your reader produce only a single float after reading a large
   amount of data, so that the pipeline keeps the parallel reading but has negligible
   communication cost (see the sketch after this list).
   If this becomes fast enough, it means that communication is the bottleneck.
   If pure parallel reading is still not fast enough, your raw reader is the bottleneck.
1. In practice the dataflow can be more complicated and you'll need to design
   your own strategies to understand its performance.
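
As a sketch of the second check above (`my_reader` is a placeholder for your own raw-reader DataFlow):

```python
from tensorpack.dataflow import MapData, PrefetchDataZMQ, TestDataSpeed

# Reduce every datapoint to a single float immediately after reading, so the
# worker processes still pay the full reading cost but send almost nothing over IPC.
df = MapData(my_reader, lambda dp: [0.0])
df = PrefetchDataZMQ(df, nr_proc=4)
TestDataSpeed(df, size=5000).start()
# Fast now? Communication was the bottleneck. Still slow? The raw reader is.
```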
Once you understand what the bottleneck is, you can try some improvements, such as:
1. Use a single-file database to avoid random reads on hard disk (see the sketch below).
2. Use fewer pre-processing steps, or write faster ones with whatever tools you have.
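
For the first point, one possible approach is tensorpack's `LMDBSerializer` (a sketch; `df` stands for your pure-reader DataFlow and the path is a placeholder):

```python
from tensorpack.dataflow import LMDBSerializer

# One-time dump of the dataflow into a single LMDB file:
LMDBSerializer.save(df, '/path/to/data.lmdb')
# Afterwards, read it back sequentially instead of doing random reads:
df = LMDBSerializer.load('/path/to/data.lmdb', shuffle=False)
```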
@@ -50,6 +50,9 @@ l = func(l, *args, **kwargs)
l = FullyConnected('fc1', l, 10, activation=tf.identity)
```
If you need to access the output of some layer and use it with some other
operations, then simply don't use `LinearWrap`, because the graph is no longer linear.
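
For instance, a branching graph can be written with plain layer calls (a sketch; `image` and the layer names are illustrative):

```python
import tensorflow as tf
from tensorpack import Conv2D, FullyConnected

# `image` is a placeholder for your input tensor.
l = Conv2D('conv1', image, 32, 3)
branch = Conv2D('conv_side', l, 32, 3)  # a side branch reusing the output of conv1
l = Conv2D('conv2', l, 32, 3)
l = tf.concat([l, branch], axis=-1)     # merging the two paths: the graph is not linear
l = FullyConnected('fc1', l, 10, activation=tf.identity)
```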
### Access Relevant Tensors
The variables inside the layer will be named `name/W`, `name/b`, etc.
@@ -60,7 +63,7 @@ l = Conv2D('conv1', l, 32, 3)
print(l.variables.W)
print(l.variables.b)
```
But note that this is a __hacky__ way and may not work with future versions of TensorFlow.
Also this method doesn't work with `LinearWrap`, and cannot access the variables created by an activation function.
The output of a layer is usually named `name/output` unless documented differently in the API.
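
So, as a sketch (assuming a layer created as `Conv2D('conv1', ...)` in the default graph, TF 1.x API), its output tensor can be fetched by name:

```python
import tensorflow as tf

# Fetch the output of the layer named 'conv1' by its conventional tensor name:
out = tf.get_default_graph().get_tensor_by_name('conv1/output:0')
```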
@@ -51,6 +51,8 @@ The tower function needs to follow some rules:
On the other hand, for a non-trainable variable, it may be desirable to not reuse it between towers.
In this case, `tf.Variable` can be used to ensure creation of new variables in each tower even when `reuse=True` (see the sketch after this list).
* Do not modify the reuse option (e.g., by `scope.reuse_variables()`) of a variable
scope that is not created by you. This affects others' code.
4. It cannot create scopes or variables containing the name 'tower', as it is
reserved for special use.
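
A minimal sketch of the `tf.Variable` rule above (the function and variable names are illustrative, TF 1.x API):

```python
import tensorflow as tf

def tower_func(image):
    # tf.get_variable would be shared across towers once reuse=True;
    # tf.Variable always creates a fresh, per-tower variable instead.
    local_step = tf.Variable(0, trainable=False, name='local_step')
    ...
```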
@@ -42,8 +42,8 @@ def proposal_metrics(iou):
@under_name_scope()
def sample_fast_rcnn_targets(boxes, gt_boxes, gt_labels):
    """
    Sample some boxes from all proposals for training.
    #fg is guaranteed to be > 0, because ground truth boxes will be added as proposals.
    Args:
        boxes: nx4 region proposals, floatbox
@@ -43,6 +43,10 @@ os.environ['TF_GPU_THREAD_COUNT'] = '2'
# overflow for certain input data range.
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '0'
# Available since TF 1.12. issue#15874
os.environ['TF_ENABLE_WHILE_V2'] = '1'
os.environ['TF_ENABLE_COND_V2'] = '1'
try:
    import tensorflow as tf  # noqa
    _version = tf.__version__.split('.')