Commit f573d8b6 authored by Yuxin Wu

Add docs on summary

parent 2325f7ac
# tensorpack
A neural net training interface based on TensorFlow.

[![Build Status](https://travis-ci.org/ppwwyyxx/tensorpack.svg?branch=master)](https://travis-ci.org/ppwwyyxx/tensorpack)
[![badge](https://readthedocs.org/projects/pip/badge/?version=latest)](http://tensorpack.readthedocs.io/en/latest/index.html)
@@ -31,29 +31,27 @@ Examples are not only for demonstration of the framework -- you can train them a

It's Yet Another TF wrapper, but different in:

1. Not focused on models.
    + There are already too many symbolic function wrappers.
      Tensorpack includes only a few common models,
      but you can use any other wrappers within tensorpack, such as sonnet/Keras/slim/tflearn/tensorlayer/....
2. Focus on __training speed__.
    + Speed comes for free with tensorpack -- it uses TensorFlow in the correct way.
      Even on a tiny CNN example, the training runs [1.6x faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6) than the equivalent Keras code.
    + Data-parallel multi-GPU training is off-the-shelf to use. It is as fast as Google's [official benchmark](https://www.tensorflow.org/performance/benchmarks).
    + Data-parallel distributed training is off-the-shelf to use. It is as slow as Google's official benchmark.
3. Focus on __large datasets__.
    + It's painful to read/preprocess data through TF. tensorpack helps you load large datasets (e.g. ImageNet) in __pure Python__ with autoparallelization.
    + DataFlow has a unified interface, so you can compose and reuse DataFlows to perform complex preprocessing (a short sketch follows this list).
4. Interface of extensible __Callbacks__.
    Write a callback to implement everything you want to do apart from the training iterations, and
    enable it with one line of code. Common examples include:
    + Change hyperparameters during training
    + Print some tensors of interest
    + Monitor GPU utilization
    + Send error rate to your phone
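For example, here is a rough sketch of composing DataFlows (the dataset class and arguments are illustrative and may not match the current API exactly):

```python
from tensorpack.dataflow import dataset, BatchData, PrefetchDataZMQ

df = dataset.Mnist('train')          # a DataFlow yielding [image, label] datapoints
df = BatchData(df, 64)               # compose: group datapoints into batches of 64
df = PrefetchDataZMQ(df, nr_proc=4)  # run the pipeline in 4 parallel processes
df.reset_state()                     # required once before iterating
for dp in df.get_data():             # plain Python iteration, no TensorFlow involved
    pass                             # dp is a list: [batched images, batched labels]
```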
See [tutorials](http://tensorpack.readthedocs.io/en/latest/tutorial/index.html) to know more about these features.
@@ -68,4 +66,4 @@ Dependencies:
pip install -U git+https://github.com/ppwwyyxx/tensorpack.git
# or add `--user` to avoid system-wide installation.
```
Besides, if you only want to use `tensorpack.dataflow` alone as a data processing library, TensorFlow is also optional.
@@ -10,6 +10,8 @@ BUILDDIR = build
.PHONY: help Makefile docset

all: html

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@@ -20,5 +22,5 @@ docset: html
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
html: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@@ -5,7 +5,7 @@
The library tries to __support__ everything, but it could not really __include__ everything.

The interface attempts to be flexible enough so you can put any XYZ on it.
You can either implement them under the interface or simply wrap some existing Python code.
See [Extend Tensorpack](index.html#extend-tensorpack)
for more details.
# Build the Graph
This tutorial explains how a graph is built in tensorpack.
### ModelDesc
`ModelDesc` is an abstraction over the most common type of models people train.
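As a rough sketch (the method and class names below follow the API at the time and are assumptions; see the ModelDesc documentation for the exact interface):

```python
import tensorflow as tf
from tensorpack import ModelDesc, InputDesc

class MyModel(ModelDesc):
    def _get_inputs(self):
        # declare the type, shape and name of each input tensor
        return [InputDesc(tf.float32, (None, 28, 28), 'image'),
                InputDesc(tf.int32, (None,), 'label')]

    def _build_graph(self, inputs):
        image, label = inputs
        # build any symbolic graph here; define the objective as self.cost
        logits = tf.layers.dense(tf.reshape(image, [-1, 28 * 28]), 10)
        self.cost = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label),
            name='cost')
```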
@@ -35,6 +35,7 @@ User Tutorials
   symbolic
   trainer
   callback
   summary
   faq

Extend Tensorpack
# Summary and Logging
This tutorial will introduce the `Monitor` backend and
explain how tensorpack handles summaries and logging.
### Monitors
In tensorpack, everything besides the training iterations is done in callbacks, including all the logging.
When a callback gets something to log, it will write to the monitor backend through
`trainer.monitors`, by calling `put_{scalar,image,summary,...}`.
The call gets dispatched to multiple `TrainingMonitor` instances.
These monitors are a special type of callback which can process different types of log data,
and can be customized in `TrainConfig`.
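For example, a custom callback could log a scalar like this (a minimal sketch; the callback and the value being logged are made up):

```python
from tensorpack.callbacks import Callback

class LogSomething(Callback):
    def _trigger_epoch(self):
        value = 0.0  # compute whatever you want to log here
        # the call below is dispatched to every TrainingMonitor in TrainConfig
        self.trainer.monitors.put_scalar('custom/metric', value)
```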
### TensorFlow Summaries
Here is how TensorFlow summaries eventually get logged/saved/printed:
1. __What to Log__: When you call `tf.summary.xxx` in your graph code, TensorFlow adds an op to
the `tf.GraphKeys.SUMMARIES` collection (by default); a short sketch follows this list.
2. __When to Log__: A [MergeAllSummaries](../modules/callbacks.html#tensorpack.callbacks.MergeAllSummaries)
callback is enabled by default in `TrainConfig`.
It runs ops in the `SUMMARIES` collection (by default) every epoch (by default),
and writes results to the monitor backend.
3. __Where to Log__:
* A [TFEventWriter](../modules/callbacks.html#tensorpack.callbacks.TFEventWriter)
monitor is enabled by default in [TrainConfig](../modules/train.html#tensorpack.train.TrainConfig),
which writes things to an event file used by tensorboard.
* A [ScalarPrinter](../modules/callbacks.html#tensorpack.callbacks.ScalarPrinter)
monitor is enabled by default, which prints all scalars in your terminal.
* A [JSONWriter](../modules/callbacks.html#tensorpack.callbacks.JSONWriter)
monitor is enabled by default, which saves scalars to a file.
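For step 1 above, the graph-building code only needs something like the following sketch (`cost` here stands in for any scalar tensor in your graph):

```python
import tensorflow as tf

cost = tf.constant(0.0, name='cost')  # stands in for a real scalar in your graph
# adds a summary op to the tf.GraphKeys.SUMMARIES collection (the default);
# the MergeAllSummaries callback runs it and sends the result to the monitors
tf.summary.scalar('train/cost', cost)
```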
Since summaries are evaluated every epoch by default, if the content is data-dependent, the results
are likely to have too much variance. You can:
1. Change "When to Log": log more frequently, but note that some large summaries are expensive to
log. You may want to use a separate collection for frequent logging.
2. Change "What to Log": you can call
[tfutils.summary.add_moving_summary](../modules/tfutils.html#tensorpack.tfutils.summary.add_moving_summary)
on scalar tensors, which will summarize the moving average of those scalars instead of their instant values.
The moving averages are maintained by the
[MovingAverageSummary](../modules/callbacks.html#tensorpack.callbacks.MovingAverageSummary)
callback (enabled by default).
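For example, option 2 looks roughly like this (`cost` and `accuracy` stand in for scalar tensors you would define in your graph-building code):

```python
import tensorflow as tf
from tensorpack.tfutils.summary import add_moving_summary

cost = tf.identity(0.0, name='total_cost')    # stands in for your real loss tensor
accuracy = tf.identity(0.0, name='accuracy')  # stands in for your real metric
# summarize exponential moving averages instead of instant values;
# the MovingAverageSummary callback maintains the EMA ops during training
add_moving_summary(cost, accuracy)
```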
Besides TensorFlow summaries,
a callback is free to log any other types of data to the monitor backend,
anytime after the training has started.
@@ -7,7 +7,7 @@ such as conv/deconv, fc, bn, pooling layers.
Using the tensorpack implementations, you can also benefit from `argscope` and `LinearWrap` to
simplify the code.
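For instance, a rough sketch in the spirit of the MNIST example (layer arguments here are illustrative; exact signatures may differ across versions):

```python
import tensorflow as tf
from tensorpack import argscope, LinearWrap, Conv2D, MaxPooling, FullyConnected

image = tf.placeholder(tf.float32, [None, 28, 28, 1], name='input')

# argscope sets default arguments for the listed layers;
# LinearWrap chains layers without repeating intermediate variables
with argscope(Conv2D, kernel_shape=3, nl=tf.nn.relu, out_channel=32):
    logits = (LinearWrap(image)
              .Conv2D('conv0')
              .MaxPooling('pool0', 2)
              .Conv2D('conv1')
              .FullyConnected('fc0', out_dim=10, nl=tf.identity)())
```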
Note that these layers were written because there were no other alternatives back at that time.
In the future we may shift the implementation to `tf.layers` because they will be better maintained.

### argscope and LinearWrap
@@ -15,9 +15,9 @@ Tensorpack base trainer implements the logic of __running the iteration__.
Users or derived trainers should implement __what the iteration is__.

2. Trainer assumes the existence of __"epoch"__, i.e. that the iterations run in double for-loops.
   But an epoch doesn't need to be a full pass of your dataset; the size of an epoch can be any number you set,
   and it only affects the [schedule of callbacks](extend/callback.html).
   In other words, an "epoch" in tensorpack is the __default period to run callbacks__ (validation, summary, checkpoint, etc.).
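For example (a sketch with hypothetical names; `my_dataflow` and `MyModel` stand for a DataFlow and a ModelDesc defined elsewhere):

```python
from tensorpack import TrainConfig
from tensorpack.callbacks import ModelSaver

config = TrainConfig(
    dataflow=my_dataflow,        # any DataFlow; its size does not define the epoch
    model=MyModel(),             # a ModelDesc subclass
    callbacks=[ModelSaver()],    # callbacks run once per "epoch" by default
    steps_per_epoch=500,         # one "epoch" == 500 iterations == one callback period
    max_epoch=100,
)
```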
### Common Trainers
@@ -54,22 +54,26 @@ class TrainingMonitor(Callback):
        """ Override this method to setup the monitor."""
        pass

    def process_summary(self, summary):
        """
        Process a tf.Summary.
        """
        pass

    def process(self, name, val):
        """
        Process a key-value pair.
        """
        pass

    def process_scalar(self, name, val):
        """
        Args:
            val: a scalar
        """
        pass

    def process_image(self, name, val):
        """
        Args:
            val (np.ndarray): 4D (NHWC) numpy array of images in range [0,255].
@@ -77,27 +81,34 @@ class TrainingMonitor(Callback):
        """
        pass

    def process_event(self, evt):
        """
        Args:
            evt (tf.Event): the most basic format acceptable by tensorboard.
                It could include Summary, RunMetadata, LogMessage, and more.
        """
        pass

    # TODO process other types


class NoOpMonitor(TrainingMonitor):
    pass

class Monitors(Callback):
    """
    Merge monitors together for trainer to use.

    In training, each trainer will create a :class:`Monitors` instance,
    and you can access it through `trainer.monitors`.
    You should use `trainer.monitors` for logging and it will dispatch your
    logs to each sub-monitor.
    """
    def __init__(self, monitors):
        self._scalar_history = ScalarHistory()
        self._monitors = monitors + [self._scalar_history]
        for m in self._monitors:
            assert isinstance(m, TrainingMonitor), m

    def _setup_graph(self):
        self._scalar_history.setup_graph(self.trainer)
@@ -107,6 +118,9 @@ class Monitors(TrainingMonitor):
            func(m)

    def put_summary(self, summary):
        """
        Put a `tf.Summary`.
        """
        if isinstance(summary, six.binary_type):
            summary = tf.Summary.FromString(summary)
        assert isinstance(summary, tf.Summary), type(summary)
@@ -120,15 +134,19 @@ class Monitors(TrainingMonitor):
                    val.tag = val.tag[:-len(suffix)]
                self._dispatch(lambda m: m.put_scalar(val.tag, val.simple_value))
        self._dispatch(lambda m: m.process_summary(summary))

    def put_scalar(self, name, val):
        """
        Put a scalar.
        """
        self._dispatch(lambda m: m.process_scalar(name, val))
        s = create_scalar_summary(name, val)
        self._dispatch(lambda m: m.process_summary(s))

    def put_image(self, name, val):
        """
        Put an image.

        Args:
            name (str):
            val (np.ndarray): 2D, 3D (HWC) or 4D (NHWC) numpy array of images
@@ -136,21 +154,21 @@ class Monitors(TrainingMonitor):
        """
        assert isinstance(val, np.ndarray)
        arr = image_to_nhwc(val)
        self._dispatch(lambda m: m.process_image(name, arr))
        s = create_image_summary(name, arr)
        self._dispatch(lambda m: m.process_summary(s))

    def put_event(self, evt):
        """
        Put a tf.Event.
        `step` and `wall_time` fields of :class:`tf.Event` will be filled automatically.

        Args:
            evt (tf.Event):
        """
        evt.step = self.global_step
        evt.wall_time = time.time()
        self._dispatch(lambda m: m.process_event(evt))

    def get_latest(self, name):
        """
@@ -179,10 +197,10 @@ class TFEventWriter(TrainingMonitor):
    def _setup_graph(self):
        self._writer = tf.summary.FileWriter(logger.LOG_DIR, graph=tf.get_default_graph())

    def process_summary(self, summary):
        self._writer.add_summary(summary, self.global_step)

    def process_event(self, evt):
        self._writer.add_event(evt)

    def _trigger(self):  # flush every epoch
@@ -200,10 +218,14 @@ def TFSummaryWriter(*args, **kwargs):
class JSONWriter(TrainingMonitor):
    """
    Write all scalar data to a json file under ``logger.LOG_DIR``, grouped by their global step.
    This monitor also attempts to recover the epoch number during setup,
    if an existing json file is found at the same place.
    """

    FILENAME = 'stat.json'
    """
    The name of the json file.
    """

    def __new__(cls):
        if logger.LOG_DIR:
@@ -245,8 +267,8 @@ class JSONWriter(TrainingMonitor):
    def _trigger_epoch(self):
        self._push()

    def process_scalar(self, name, val):
        self._stat_now[name] = val

    def _push(self):
        """ Note that this method is idempotent"""
@@ -316,7 +338,7 @@ class ScalarPrinter(TrainingMonitor):
        if self._enable_epoch:
            self._print_stat()

    def process_scalar(self, name, val):
        self._dic[name] = float(val)

    def _print_stat(self):
@@ -341,7 +363,7 @@ class ScalarHistory(TrainingMonitor):
    def _setup_graph(self):
        self._dic = defaultdict(list)

    def process_scalar(self, name, val):
        self._dic[name].append(float(val))

    def get_latest(self, name):
@@ -385,7 +407,7 @@ class SendMonitorData(TrainingMonitor):
        self.names = names
        self.dic = {}

    def process_scalar(self, name, val):
        if name in self.names:
            self.dic[name] = val
@@ -403,5 +425,5 @@ class SendMonitorData(TrainingMonitor):
        cmd = self.command.format(**v)
        ret = os.system(cmd)
        if ret != 0:
            logger.error("Command '{}' failed with ret={}!".format(cmd, ret))
        self.dic = {}
@@ -194,11 +194,11 @@ def add_param_summary(*summary_lists, **kwargs):
def add_moving_summary(*args, **kwargs):
    """
    Add moving average summary for some tensors.
    This function is a no-op if not calling from main training tower.

    Args:
        args: tensors to summarize
        decay (float): the decay rate. Defaults to 0.95.
        collection (str or None): the name of the collection to add EMA-maintaining ops.
            The default will work together with the default