Commit f573d8b6 authored by Yuxin Wu

Add docs on summary

parent 2325f7ac
# tensorpack
A neural net training interface based on TensorFlow.

[![Build Status](https://travis-ci.org/ppwwyyxx/tensorpack.svg?branch=master)](https://travis-ci.org/ppwwyyxx/tensorpack)
[![badge](https://readthedocs.org/projects/pip/badge/?version=latest)](http://tensorpack.readthedocs.io/en/latest/index.html)
@@ -31,29 +31,27 @@ Examples are not only for demonstration of the framework -- you can train them a

It's Yet Another TF wrapper, but different in:

1. Not focused on models.
    + There are already too many symbolic function wrappers.
      Tensorpack includes only a few common models,
      but you can use any other wrappers within tensorpack, such as sonnet/Keras/slim/tflearn/tensorlayer/....
2. Focus on __training speed__.
    + Speed comes for free with tensorpack -- it uses TensorFlow in the correct way.
      Even on a tiny CNN example, the training runs [1.6x faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6) than the equivalent Keras code.
    + Data-parallel multi-GPU training is off-the-shelf to use. It is as fast as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks).
    + Data-parallel distributed training is off-the-shelf to use. It is as slow as Google's official benchmark.
3. Focus on __large datasets__.
    + It's painful to read/preprocess data through TF. tensorpack helps you load large datasets (e.g. ImageNet) in __pure Python__ with autoparallelization.
    + DataFlow has a unified interface, so you can compose and reuse DataFlows to perform complex preprocessing (a short sketch follows this list).
4. Interface of extensible __Callbacks__.
    Write a callback to implement everything you want to do apart from the training iterations, and
    enable it with one line of code. Common examples include:
    + Change hyperparameters during training
    + Print some tensors of interest
    + Monitor GPU utilization
    + Send error rate to your phone
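For example, here is a rough sketch of composing DataFlows (the dataset class and arguments are illustrative and may not match the current API exactly):

```python
from tensorpack.dataflow import dataset, BatchData, PrefetchDataZMQ

df = dataset.Mnist('train')          # a DataFlow yielding [image, label] datapoints
df = BatchData(df, 64)               # compose: group datapoints into batches of 64
df = PrefetchDataZMQ(df, nr_proc=4)  # run the pipeline in 4 parallel processes
df.reset_state()                     # required once before iterating
for dp in df.get_data():             # plain Python iteration, no TensorFlow involved
    pass                             # dp is a list: [batched images, batched labels]
```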
See [tutorials](http://tensorpack.readthedocs.io/en/latest/tutorial/index.html) to know more about these features.
@@ -68,4 +66,4 @@ Dependencies:
pip install -U git+https://github.com/ppwwyyxx/tensorpack.git
# or add `--user` to avoid system-wide installation.
```
Besides, if you only want to use `tensorpack.dataflow` alone as a data processing library, TensorFlow is also optional.
@@ -10,6 +10,8 @@ BUILDDIR = build
.PHONY: help Makefile docset

all: html

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@@ -20,5 +22,5 @@ docset: html
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
html: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@@ -5,7 +5,7 @@
The library tries to __support__ everything, but it could not really __include__ everything.

The interface attempts to be flexible enough so you can put any XYZ on it.
You can either implement them under the interface or simply wrap some existing Python code.
See [Extend Tensorpack](index.html#extend-tensorpack)
for more details.
# Build the Graph
This tutorial explains how a graph is built in tensorpack.
### ModelDesc
`ModelDesc` is an abstraction over the most common type of models people train.
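As a rough sketch (the method and class names below follow the API at the time and are assumptions; see the ModelDesc documentation for the exact interface):

```python
import tensorflow as tf
from tensorpack import ModelDesc, InputDesc

class MyModel(ModelDesc):
    def _get_inputs(self):
        # declare the type, shape and name of each input tensor
        return [InputDesc(tf.float32, (None, 28, 28), 'image'),
                InputDesc(tf.int32, (None,), 'label')]

    def _build_graph(self, inputs):
        image, label = inputs
        # build any symbolic graph here; define the objective as self.cost
        logits = tf.layers.dense(tf.reshape(image, [-1, 28 * 28]), 10)
        self.cost = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label),
            name='cost')
```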
@@ -35,6 +35,7 @@ User Tutorials
   symbolic
   trainer
   callback
   summary
   faq

Extend Tensorpack
# Summary and Logging
This tutorial will introduce the `Monitor` backend and
explain how tensorpack handles summaries and logging.
### Monitors
In tensorpack, everything besides the training iterations is done in callbacks, including all the logging.
When a callback gets something to log, it will write to the monitor backend through
`trainer.monitors`, by calling `put_{scalar,image,summary,...}`.
The call gets dispatched to multiple `TrainingMonitor` instances.
These monitors are a special type of callback which can process different types of log data,
and can be customized in `TrainConfig`.
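For example, a custom callback could log a scalar like this (a minimal sketch; the callback and the value being logged are made up):

```python
from tensorpack.callbacks import Callback

class LogSomething(Callback):
    def _trigger_epoch(self):
        value = 0.0  # compute whatever you want to log here
        # the call below is dispatched to every TrainingMonitor in TrainConfig
        self.trainer.monitors.put_scalar('custom/metric', value)
```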
### TensorFlow Summaries
Here is how TensorFlow summaries eventually get logged/saved/printed:
1. __What to Log__: When you call `tf.summary.xxx` in your graph code, TensorFlow adds an op to
the `tf.GraphKeys.SUMMARIES` collection (by default); a short sketch follows this list.
2. __When to Log__: A [MergeAllSummaries](../modules/callbacks.html#tensorpack.callbacks.MergeAllSummaries)
callback is enabled by default in `TrainConfig`.
It runs ops in the `SUMMARIES` collection (by default) every epoch (by default),
and writes results to the monitor backend.
3. __Where to Log__:
* A [TFEventWriter](../modules/callbacks.html#tensorpack.callbacks.TFEventWriter)
monitor is enabled by default in [TrainConfig](../modules/train.html#tensorpack.train.TrainConfig),
which writes things to an event file used by tensorboard.
* A [ScalarPrinter](../modules/callbacks.html#tensorpack.callbacks.ScalarPrinter)
monitor is enabled by default, which prints all scalars in your terminal.
* A [JSONWriter](../modules/callbacks.html#tensorpack.callbacks.JSONWriter)
monitor is enabled by default, which saves scalars to a file.
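For step 1 above, the graph-building code only needs something like the following sketch (`cost` here stands in for any scalar tensor in your graph):

```python
import tensorflow as tf

cost = tf.constant(0.0, name='cost')  # stands in for a real scalar in your graph
# adds a summary op to the tf.GraphKeys.SUMMARIES collection (the default);
# the MergeAllSummaries callback runs it and sends the result to the monitors
tf.summary.scalar('train/cost', cost)
```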
Since summaries are evaluated every epoch by default, if the content is data-dependent, the results
are likely to have too much variance. You can:
1. Change "When to Log": log more frequently, but note that some large summaries are expensive to
log. You may want to use a separate collection for frequent logging.
2. Change "What to Log": you can call
[tfutils.summary.add_moving_summary](../modules/tfutils.html#tensorpack.tfutils.summary.add_moving_summary)
on scalar tensors, which will summarize the moving average of those scalars instead of their instant values.
The moving averages are maintained by the
[MovingAverageSummary](../modules/callbacks.html#tensorpack.callbacks.MovingAverageSummary)
callback (enabled by default).
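For example, option 2 looks roughly like this (`cost` and `accuracy` stand in for scalar tensors you would define in your graph-building code):

```python
import tensorflow as tf
from tensorpack.tfutils.summary import add_moving_summary

cost = tf.identity(0.0, name='total_cost')    # stands in for your real loss tensor
accuracy = tf.identity(0.0, name='accuracy')  # stands in for your real metric
# summarize exponential moving averages instead of instant values;
# the MovingAverageSummary callback maintains the EMA ops during training
add_moving_summary(cost, accuracy)
```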
Besides TensorFlow summaries,
a callback is free to log any other types of data to the monitor backend,
anytime after the training has started.
@@ -7,7 +7,7 @@ such as conv/deconv, fc, bn, pooling layers.
Using the tensorpack implementations, you can also benefit from `argscope` and `LinearWrap` to
simplify the code.
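For instance, a rough sketch in the spirit of the MNIST example (layer arguments here are illustrative; exact signatures may differ across versions):

```python
import tensorflow as tf
from tensorpack import argscope, LinearWrap, Conv2D, MaxPooling, FullyConnected

image = tf.placeholder(tf.float32, [None, 28, 28, 1], name='input')

# argscope sets default arguments for the listed layers;
# LinearWrap chains layers without repeating intermediate variables
with argscope(Conv2D, kernel_shape=3, nl=tf.nn.relu, out_channel=32):
    logits = (LinearWrap(image)
              .Conv2D('conv0')
              .MaxPooling('pool0', 2)
              .Conv2D('conv1')
              .FullyConnected('fc0', out_dim=10, nl=tf.identity)())
```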
Note that these layers were written because there were no other alternatives back at that time.
In the future we may shift the implementation to `tf.layers` because they will be better maintained.

### argscope and LinearWrap
@@ -15,9 +15,9 @@ Tensorpack base trainer implements the logic of __running the iteration__.
Users or derived trainers should implement __what the iteration is__.

2. Trainer assumes the existence of __"epoch"__, i.e. that the iterations run in double for-loops.
   But an epoch doesn't need to be a full pass of your dataset; the size of an epoch can be any number you set,
   and it only affects the [schedule of callbacks](extend/callback.html).
   In other words, an "epoch" in tensorpack is the __default period to run callbacks__ (validation, summary, checkpoint, etc.).
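For example (a sketch with hypothetical names; `my_dataflow` and `MyModel` stand for a DataFlow and a ModelDesc defined elsewhere):

```python
from tensorpack import TrainConfig
from tensorpack.callbacks import ModelSaver

config = TrainConfig(
    dataflow=my_dataflow,        # any DataFlow; its size does not define the epoch
    model=MyModel(),             # a ModelDesc subclass
    callbacks=[ModelSaver()],    # callbacks run once per "epoch" by default
    steps_per_epoch=500,         # one "epoch" == 500 iterations == one callback period
    max_epoch=100,
)
```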
### Common Trainers
@@ -54,22 +54,26 @@ class TrainingMonitor(Callback):
        """ Override this method to setup the monitor."""
        pass

    def process_summary(self, summary):
        """
        Process a tf.Summary.
        """
        pass

    def process(self, name, val):
        """
        Process a key-value pair.
        """
        pass

    def process_scalar(self, name, val):
        """
        Args:
            val: a scalar
        """
        pass

    def process_image(self, name, val):
        """
        Args:
            val (np.ndarray): 4D (NHWC) numpy array of images in range [0,255].
@@ -77,27 +81,34 @@ class TrainingMonitor(Callback):
        """
        pass

    def process_event(self, evt):
        """
        Args:
            evt (tf.Event): the most basic format acceptable by tensorboard.
                It could include Summary, RunMetadata, LogMessage, and more.
        """
        pass

    # TODO process other types


class NoOpMonitor(TrainingMonitor):
    pass

class Monitors(Callback):
    """
    Merge monitors together for trainer to use.

    In training, each trainer will create a :class:`Monitors` instance,
    and you can access it through `trainer.monitors`.
    You should use `trainer.monitors` for logging and it will dispatch your
    logs to each sub-monitor.
    """
    def __init__(self, monitors):
        self._scalar_history = ScalarHistory()
        self._monitors = monitors + [self._scalar_history]
        for m in self._monitors:
            assert isinstance(m, TrainingMonitor), m

    def _setup_graph(self):
        self._scalar_history.setup_graph(self.trainer)
@@ -107,6 +118,9 @@ class Monitors(TrainingMonitor):
            func(m)

    def put_summary(self, summary):
        """
        Put a `tf.Summary`.
        """
        if isinstance(summary, six.binary_type):
            summary = tf.Summary.FromString(summary)
        assert isinstance(summary, tf.Summary), type(summary)
@@ -120,15 +134,19 @@ class Monitors(TrainingMonitor):
                    val.tag = val.tag[:-len(suffix)]
                self._dispatch(lambda m: m.put_scalar(val.tag, val.simple_value))
        self._dispatch(lambda m: m.process_summary(summary))

    def put_scalar(self, name, val):
        """
        Put a scalar.
        """
        self._dispatch(lambda m: m.process_scalar(name, val))
        s = create_scalar_summary(name, val)
        self._dispatch(lambda m: m.process_summary(s))

    def put_image(self, name, val):
        """
        Put an image.

        Args:
            name (str):
            val (np.ndarray): 2D, 3D (HWC) or 4D (NHWC) numpy array of images
@@ -136,21 +154,21 @@ class Monitors(TrainingMonitor):
        """
        assert isinstance(val, np.ndarray)
        arr = image_to_nhwc(val)
        self._dispatch(lambda m: m.process_image(name, arr))
        s = create_image_summary(name, arr)
        self._dispatch(lambda m: m.process_summary(s))

    def put_event(self, evt):
        """
        Put a tf.Event.
        `step` and `wall_time` fields of :class:`tf.Event` will be filled automatically.

        Args:
            evt (tf.Event):
        """
        evt.step = self.global_step
        evt.wall_time = time.time()
        self._dispatch(lambda m: m.process_event(evt))

    def get_latest(self, name):
        """
@@ -179,10 +197,10 @@ class TFEventWriter(TrainingMonitor):
    def _setup_graph(self):
        self._writer = tf.summary.FileWriter(logger.LOG_DIR, graph=tf.get_default_graph())

    def process_summary(self, summary):
        self._writer.add_summary(summary, self.global_step)

    def process_event(self, evt):
        self._writer.add_event(evt)

    def _trigger(self):  # flush every epoch
@@ -200,10 +218,14 @@ def TFSummaryWriter(*args, **kwargs):
class JSONWriter(TrainingMonitor):
    """
    Write all scalar data to a json file under ``logger.LOG_DIR``, grouped by their global step.
    This monitor also attempts to recover the epoch number during setup,
    if an existing json file is found at the same place.
    """

    FILENAME = 'stat.json'
    """
    The name of the json file.
    """

    def __new__(cls):
        if logger.LOG_DIR:
@@ -245,8 +267,8 @@ class JSONWriter(TrainingMonitor):
    def _trigger_epoch(self):
        self._push()

    def process_scalar(self, name, val):
        self._stat_now[name] = val

    def _push(self):
        """ Note that this method is idempotent"""
@@ -316,7 +338,7 @@ class ScalarPrinter(TrainingMonitor):
        if self._enable_epoch:
            self._print_stat()

    def process_scalar(self, name, val):
        self._dic[name] = float(val)

    def _print_stat(self):
@@ -341,7 +363,7 @@ class ScalarHistory(TrainingMonitor):
    def _setup_graph(self):
        self._dic = defaultdict(list)

    def process_scalar(self, name, val):
        self._dic[name].append(float(val))

    def get_latest(self, name):
@@ -385,7 +407,7 @@ class SendMonitorData(TrainingMonitor):
        self.names = names
        self.dic = {}

    def process_scalar(self, name, val):
        if name in self.names:
            self.dic[name] = val
@@ -403,5 +425,5 @@ class SendMonitorData(TrainingMonitor):
        cmd = self.command.format(**v)
        ret = os.system(cmd)
        if ret != 0:
            logger.error("Command '{}' failed with ret={}!".format(cmd, ret))
        self.dic = {}
@@ -194,11 +194,11 @@ def add_param_summary(*summary_lists, **kwargs):
def add_moving_summary(*args, **kwargs):
    """
    Add moving average summary for some tensors.
    This function is a no-op if not calling from main training tower.

    Args:
        args: tensors to summarize
        decay (float): the decay rate. Defaults to 0.95.
        collection (str or None): the name of the collection to add EMA-maintaining ops.
            The default will work together with the default