Commit 4142b9e7 authored Feb 28, 2018 by Yuxin Wu

docs and deprecations

Parent: 9edc0ca5

Showing 8 changed files with 37 additions and 131 deletions (+37 -131).
docs/conf.py                                  +1   -1
docs/tutorial/performance-tuning.md           +16  -16
examples/ResNet/README.md                     +3   -1
tensorpack/callbacks/inference.py             +2   -19
tensorpack/callbacks/monitor.py               +12  -8
tensorpack/dataflow/common.py                 +1   -1
tensorpack/tfutils/sessinit.py                +2   -0
tensorpack/tfutils/symbolic_functions.py      +0   -85
docs/conf.py
@@ -376,7 +376,7 @@ def autodoc_skip_member(app, what, name, obj, skip, options):
             'PeriodicRunHooks',
             'apply_default_prefetch',
-            'guided_relu', 'saliency_map', 'get_scalar_var', 'psnr',
+            'saliency_map', 'get_scalar_var', 'psnr',
             'prediction_incorrect', 'huber_loss', 'SoftMax'
             ]:
         return True
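For context, this list is consulted by a Sphinx `autodoc-skip-member` handler: returning `True` hides the named (deprecated or removed) symbols from the generated API docs, which is why the now-deleted `guided_relu` no longer needs an entry. Below is a minimal sketch of how such a handler is typically wired up in a Sphinx `conf.py`; the abbreviated skip list and the `setup` wiring are illustrative, and only the handler signature and the names come from the diff:

```python
def autodoc_skip_member(app, what, name, obj, skip, options):
    # Hide deprecated or removed symbols from the generated API documentation.
    if name in ['saliency_map', 'get_scalar_var', 'psnr',
                'prediction_incorrect', 'huber_loss', 'SoftMax']:
        return True
    return skip


def setup(app):
    # Standard Sphinx hook: register the handler for the autodoc-skip-member event.
    app.connect('autodoc-skip-member', autodoc_skip_member)
```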
docs/tutorial/performance-tuning.md
@@ -2,41 +2,39 @@
 # Performance Tuning
 
 __We do not know why your training is slow__
 (and most of the times it's not a tensorpack problem).
-Performance is different across machines and tasks.
-So you need to figure out most parts by your own.
+Performance is different across machines and tasks,
+so you need to figure out most parts by your own.
 
 Here's a list of things you can do when your training is slow.
-If you need help improving the speed,
-PLEASE do them and include your findings.
+If you ask for help understanding and improving the speed, PLEASE do them and include your findings.
 
 ## Figure out the bottleneck
 
-1. If you use feed-based input (unrecommended) and datapoints are large, data is likely to become the
-   bottleneck.
+1. If you use feed-based input (unrecommended) and datapoints are large, data is likely to become the bottleneck.
 2. If you use queue-based input + dataflow, you can look for the queue size statistics in
-   training log. Ideally the queue should be near-full (default size is 50).
+   training log. Ideally the input queue should be near-full (default size is 50).
    If the size is near-zero, data is the bottleneck.
 3. If GPU utilization is low, it may be because of slow data, or some ops are inefficient. Also make sure GPUs are not locked in P8 state.
 
 ## Benchmark the components
 
-1. Use `DummyConstantInput(shapes)` as the `InputSource`.
+1. (usually not needed) Use `data=DummyConstantInput(shapes)` for training,
    so that the iterations only take data from a constant tensor.
-   This will help find out the slow operations you're using in the graph.
+   This will benchmark the graph without the overhead of data.
 2. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow by a constant DataFlow.
-   This is almost the same as (1), i.e., it removes the overhead of data.
+   This is almost the same as (1).
 3. If you're using a TF-based input pipeline you wrote, you can simply run it in a loop and test its speed.
 4. Use `TestDataSpeed(mydf).start()` to benchmark your DataFlow.
 
 A benchmark will give you more precise information about which part you should improve.
+Note that you should only look at iteration speed after about 50 iterations, since everything is slow at the beginning.
 
 ## Investigate DataFlow
 
 Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial, so you know what your DataFlow is doing.
 
-Benchmark your DataFlow with modifications to understand which part is the bottleneck. Some examples
-include:
+Benchmark your DataFlow with modifications to understand which part is the bottleneck. Some examples include:
 
-1. Benchmark only raw reader (and perhaps add some parallel prefetching).
+1. Benchmark only raw reader (and perhaps add some parallelism).
 2. Gradually add some pre-processing and see how the performance changes.
 3. Change the number of parallel processes or threads.
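As an aside for readers following the tutorial steps above, here is a minimal sketch of how the benchmarking utilities named there are typically invoked; the shapes and batch size are made-up placeholders, and only `FakeData(shapes, random=False)` and `TestDataSpeed(mydf).start()` come from the text:

```python
from tensorpack.dataflow import FakeData, TestDataSpeed

# Hypothetical datapoint shapes: an image batch and a label batch.
shapes = [[64, 224, 224, 3], [64]]

# (2) A constant DataFlow that removes the real data pipeline from the measurement.
fake_df = FakeData(shapes, random=False)

# (4) Benchmark a DataFlow by itself; substitute your real DataFlow for fake_df.
TestDataSpeed(fake_df).start()
```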
@@ -52,17 +50,19 @@ know the reason and improve it accordingly, e.g.:
 ## Investigate TensorFlow
 
-When you're sure that data is not a bottleneck (e.g. when queue is always full), you can start to
+When you're sure that data is not a bottleneck (e.g. when the logs show that queue is almost full), you can start to
 worry about the model.
 
-You can add a `GraphProfiler` callback when benchmarking the graph. It will
+A naive but effective way is to remove ops from your model to understand how much time they cost.
+Or you can use `GraphProfiler` callback to benchmark the graph. It will
 dump runtime tracing information (to either TensorBoard or chrome) to help diagnose the issue.
+Remember not to use the first several iterations.
 
 ### Slow with single-GPU
 
 This is literally saying TF ops are slow. Usually there isn't much you can do, except to optimize the kernels.
 But there may be something cheap you can try:
 
-1. You can visualize copies across devices in chrome.
+1. Visualize copies across devices in chrome.
    It may help to change device placement to avoid some CPU-GPU copies.
    It may help to replace some CPU-only ops with equivalent GPU ops to avoid copies.
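To illustrate the `GraphProfiler` advice above, here is a rough sketch of adding the callback to a training configuration; the model, dataflow, and the exact `TrainConfig` wiring are assumptions rather than part of this commit:

```python
from tensorpack import TrainConfig
from tensorpack.callbacks import GraphProfiler

config = TrainConfig(
    model=my_model,          # hypothetical ModelDesc
    dataflow=my_dataflow,    # hypothetical DataFlow
    callbacks=[
        GraphProfiler(),     # dumps runtime tracing info (TensorBoard/chrome) for later inspection
    ],
)
```

Remember to ignore the first several iterations when reading the profile, as the tutorial change above notes.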
examples/ResNet/README.md
@@ -26,7 +26,9 @@ To train, first decompress ImageNet data into [this structure](http://tensorpack
 ```
 You should be able to see good GPU utilization (95%~99%), if your data is fast enough.
-The default data pipeline is probably OK for most systems.
+It can finish training [within 20 hours](http://dawn.cs.stanford.edu/benchmark/ImageNet/train.html) on AWS p3.16xlarge.
+
+The default data pipeline is probably OK for most SSD systems.
 See the [tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html) on other options to speed up your data.
tensorpack/callbacks/inference.py
@@ -9,7 +9,6 @@ from six.moves import zip
 from .base import Callback
 from ..utils import logger
-from ..utils.utils import execute_only_once
 from ..utils.stats import RatioCounter, BinaryStatistics
 from ..tfutils.common import get_op_tensor_name
@@ -55,17 +54,9 @@ class Inferencer(Callback):
         """
         Return a list of tensor names (guaranteed not op name) this inferencer needs.
         """
-        try:
-            ret = self._get_fetches()
-        except NotImplementedError:
-            logger.warn("Inferencer._get_output_tensors was deprecated and renamed to _get_fetches")
-            ret = self._get_output_tensors()
+        ret = self._get_fetches()
         return [get_op_tensor_name(n)[1] for n in ret]
 
-    def _get_output_tensors(self):
-        pass
-
     def _get_fetches(self):
         raise NotImplementedError()
@@ -77,15 +68,7 @@ class Inferencer(Callback):
             results(list): list of results this inferencer fetched. Has the same
             length as ``self._get_fetches()``.
         """
-        try:
-            self._on_fetches(results)
-        except NotImplementedError:
-            if execute_only_once():
-                logger.warn("Inferencer._datapoint was deprecated and renamed to _on_fetches.")
-            self._datapoint(results)
-
-    def _datapoint(self, results):
-        pass
+        self._on_fetches(results)
 
     def _on_fetches(self, results):
         raise NotImplementedError()
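With the `_get_output_tensors`/`_datapoint` fallbacks removed above, custom inferencers have to implement the renamed hooks directly. A minimal sketch of what that looks like; the subclass, tensor name, and statistic are hypothetical, and the `_before_inference`/`_after_inference` hooks are assumed from the `Inferencer` base class rather than shown in this diff:

```python
import numpy as np
from tensorpack.callbacks import Inferencer


class MeanActivation(Inferencer):
    """Hypothetical inferencer that averages one output tensor over the validation set."""

    def _get_fetches(self):
        # Tensor names (not op names) to fetch for every datapoint.
        return ['some_output_tensor']          # hypothetical tensor name

    def _before_inference(self):
        self._values = []

    def _on_fetches(self, results):
        # `results` has the same length as the list returned by _get_fetches().
        self._values.append(results[0].mean())

    def _after_inference(self):
        # Scalars returned here are reported to the training monitors.
        return {'mean_activation': np.mean(self._values)}
```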
tensorpack/callbacks/monitor.py
@@ -19,7 +19,7 @@ from ..tfutils.summary import create_scalar_summary, create_image_summary
 from .base import Callback
 
 __all__ = ['TrainingMonitor', 'Monitors',
-           'TFSummaryWriter', 'TFEventWriter', 'JSONWriter',
+           'TFEventWriter', 'JSONWriter',
            'ScalarPrinter', 'SendMonitorData']
@@ -108,7 +108,7 @@ class Monitors(Callback):
     _chief_only = False
 
     def __init__(self, monitors):
-        self._scalar_history = ScalarHistory().set_chief_only(False)
+        self._scalar_history = ScalarHistory()
         self._monitors = monitors + [self._scalar_history]
         for m in self._monitors:
             assert isinstance(m, TrainingMonitor), m
@@ -172,7 +172,7 @@ class Monitors(Callback):
     def put_event(self, evt):
         """
-        Put an tf.Event.
+        Put an :class:`tf.Event`.
         `step` and `wall_time` fields of :class:`tf.Event` will be filled automatically.
 
         Args:
@@ -185,12 +185,18 @@ class Monitors(Callback):
     def get_latest(self, name):
         """
         Get latest scalar value of some data.
+
+        If you run multiprocess training, keep in mind that
+        the data is perhaps only available on chief process.
         """
         return self._scalar_history.get_latest(name)
 
     def get_history(self, name):
         """
         Get a history of the scalar value of some data.
+
+        If you run multiprocess training, keep in mind that
+        the data is perhaps only available on chief process.
         """
         return self._scalar_history.get_history(name)
@@ -240,11 +246,6 @@ class TFEventWriter(TrainingMonitor):
         self._writer.close()
 
 
-def TFSummaryWriter(*args, **kwargs):
-    logger.warn("TFSummaryWriter was renamed to TFEventWriter!")
-    return TFEventWriter(*args, **kwargs)
-
-
 class JSONWriter(TrainingMonitor):
     """
     Write all scalar data to a json file under ``logger.get_logger_dir()``, grouped by their global step.
@@ -397,6 +398,9 @@ class ScalarHistory(TrainingMonitor):
     """
     Only used by monitors internally.
     """
+
+    _chief_only = False
+
     def _setup_graph(self):
         self._dic = defaultdict(list)
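The new docstring caveat above ("the data is perhaps only available on chief process") matters when reading monitored scalars back. A rough sketch of calling `get_latest`/`get_history` from a callback, assuming the usual `self.trainer.monitors` handle; the callback class and the scalar name are made up:

```python
from tensorpack.callbacks import Callback


class PlateauLogger(Callback):
    """Hypothetical callback that reads a monitored scalar after each epoch."""

    def _trigger_epoch(self):
        # 'val-error' is a made-up scalar name; in multiprocess training this
        # history may only exist on the chief process.
        latest = self.trainer.monitors.get_latest('val-error')
        history = self.trainer.monitors.get_history('val-error')
        print('val-error: latest={}, {} records so far'.format(latest, len(history)))
```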
tensorpack/dataflow/common.py
@@ -688,7 +688,7 @@ class PrintData(ProxyDataFlow):
         self.num = num
         if label:
-            log_deprecated("PrintData(label, ...", "Use PrintData(name, ... instead.")
+            log_deprecated("PrintData(label, ...", "Use PrintData(name, ... instead.", "2018-05-01")
             self.name = label
         else:
             self.name = name
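Per the deprecation above, `PrintData` should now be constructed with `name` rather than `label`. A small self-contained sketch; the `FakeData` stand-in, shapes, and the name are placeholders:

```python
from tensorpack.dataflow import FakeData, PrintData

df = FakeData([[32, 28, 28, 3], [32]], random=False)   # stand-in for a real DataFlow
df = PrintData(df, num=2, name='after-augmentation')   # inspect the first 2 datapoints, tagged by `name`
```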
tensorpack/tfutils/sessinit.py
@@ -8,6 +8,7 @@ import tensorflow as tf
 import six
 from ..utils import logger
+from ..utils.develop import deprecated
 from .common import get_op_tensor_name
 from .varmanip import (SessionUpdate, get_savename_from_varname,
                        is_training_name, get_checkpoint_path)
@@ -261,6 +262,7 @@ def get_model_loader(filename):
     return SaverRestore(filename)
 
 
+@deprecated("Write the logic yourself!", "2018-06-01")
 def TryResumeTraining():
     """
     Try loading latest checkpoint from ``logger.get_logger_dir()``, only if there is one.
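Since `TryResumeTraining` is now marked deprecated ("Write the logic yourself!"), the plain loader defined in the same file is the direct alternative. A minimal sketch, with a placeholder checkpoint path:

```python
from tensorpack.tfutils.sessinit import get_model_loader

# Per the context above, get_model_loader() returns a SaverRestore for a TF checkpoint.
session_init = get_model_loader('/path/to/train_log/model-checkpoint')  # placeholder path
```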
tensorpack/tfutils/symbolic_functions.py
@@ -3,7 +3,6 @@
 import tensorflow as tf
-from contextlib import contextmanager
 import numpy as np
 
 from ..utils.develop import deprecated
@@ -17,19 +16,6 @@ def prediction_incorrect(logits, label, topk=1, name='incorrect_vector'):
                    tf.float32, name=name)
 
 
-@deprecated("Please implement it by yourself.", "2018-02-28")
-def accuracy(logits, label, topk=1, name='accuracy'):
-    """
-    Args:
-        logits: shape [B,C].
-        label: shape [B].
-        topk(int): topk
-    Returns:
-        a single scalar
-    """
-    return tf.reduce_mean(tf.cast(tf.nn.in_top_k(logits, label, topk), tf.float32),
-                          name=name)
-
-
 def flatten(x):
     """
     Flatten the tensor.
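The removed `accuracy` helper's deprecation message asks users to implement it themselves; its one-line body (visible in the removed hunk above) can simply be copied into user code:

```python
import tensorflow as tf

def accuracy(logits, label, topk=1, name='accuracy'):
    # Same computation as the helper removed above: fraction of datapoints whose
    # label falls within the top-k predictions.
    return tf.reduce_mean(tf.cast(tf.nn.in_top_k(logits, label, topk), tf.float32), name=name)
```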
@@ -47,54 +33,6 @@ def batch_flatten(x):
     return tf.reshape(x, tf.stack([tf.shape(x)[0], -1]))
 
 
-@deprecated("Please implement it by yourself.", "2018-02-28")
-def class_balanced_cross_entropy(pred, label, name='cross_entropy_loss'):
-    """
-    The class-balanced cross entropy loss,
-    as in `Holistically-Nested Edge Detection
-    <http://arxiv.org/abs/1504.06375>`_.
-
-    Args:
-        pred: of shape (b, ...). the predictions in [0,1].
-        label: of the same shape. the ground truth in {0,1}.
-    Returns:
-        class-balanced cross entropy loss.
-    """
-    with tf.name_scope('class_balanced_cross_entropy'):
-        z = batch_flatten(pred)
-        y = tf.cast(batch_flatten(label), tf.float32)
-
-        count_neg = tf.reduce_sum(1. - y)
-        count_pos = tf.reduce_sum(y)
-        beta = count_neg / (count_neg + count_pos)
-
-        eps = 1e-12
-        loss_pos = -beta * tf.reduce_mean(y * tf.log(z + eps))
-        loss_neg = (1. - beta) * tf.reduce_mean((1. - y) * tf.log(1. - z + eps))
-        cost = tf.subtract(loss_pos, loss_neg, name=name)
-    return cost
-
-
-@deprecated("Please implement it by yourself.", "2018-02-28")
-def class_balanced_sigmoid_cross_entropy(logits, label, name='cross_entropy_loss'):
-    """
-    This function accepts logits rather than predictions, and is more numerically stable than
-    :func:`class_balanced_cross_entropy`.
-    """
-    with tf.name_scope('class_balanced_sigmoid_cross_entropy'):
-        y = tf.cast(label, tf.float32)
-
-        count_neg = tf.reduce_sum(1. - y)
-        count_pos = tf.reduce_sum(y)
-        beta = count_neg / (count_neg + count_pos)
-
-        pos_weight = beta / (1 - beta)
-        cost = tf.nn.weighted_cross_entropy_with_logits(logits=logits, targets=y, pos_weight=pos_weight)
-        cost = tf.reduce_mean(cost * (1 - beta))
-
-        zero = tf.equal(count_pos, 0.0)
-    return tf.where(zero, 0.0, cost, name=name)
-
-
 def print_stat(x, message=None):
     """ A simple print Op that might be easier to use than :meth:`tf.Print`.
     Use it like: ``x = print_stat(x, message='This is x')``.
@@ -206,29 +144,6 @@ def psnr(prediction, ground_truth, maxp=None, name='psnr'):
     return psnr
 
 
-@contextmanager
-@deprecated("Please implement it by yourself.", "2018-02-28")
-def guided_relu():
-    """
-    Returns:
-        A context where the gradient of :meth:`tf.nn.relu` is replaced by
-        guided back-propagation, as described in the paper:
-        `Striving for Simplicity: The All Convolutional Net
-        <https://arxiv.org/abs/1412.6806>`_
-    """
-    from tensorflow.python.ops import gen_nn_ops   # noqa
-
-    @tf.RegisterGradient("GuidedReLU")
-    def GuidedReluGrad(op, grad):
-        return tf.where(0. < grad,
-                        gen_nn_ops._relu_grad(grad, op.outputs[0]),
-                        tf.zeros(grad.get_shape()))
-
-    g = tf.get_default_graph()
-    with g.gradient_override_map({'Relu': 'GuidedReLU'}):
-        yield
-
-
 @deprecated("Please implement it by yourself.", "2018-04-28")
 def saliency_map(output, input, name="saliency_map"):
     """