Augmentor uses register_at_fork; update docs (fix #1099)

Commit 2878aceb authored Mar 03, 2019 by Yuxin Wu
13 changed files
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ It's Yet Another TF high-level API, with __speed__, and __flexibility__ built to
    some benchmark scripts.
 2. Focus on __large datasets__.
-	+ [You don't usually need `tf.data`](http://tensorpack.readthedocs.io/tutorial/input-source.html#tensorflow-reader-cons).
+	+ [You don't usually need `tf.data`](http://tensorpack.readthedocs.io/tutorial/extend/input-source.html#tensorflow-reader-cons).
    Symbolic programming often makes data processing harder.
 	  Tensorpack helps you efficiently process large datasets (e.g. ImageNet) in __pure Python__ with autoparallelization.

--- a/docs/modules/dataflow.imgaug.rst
+++ b/docs/modules/dataflow.imgaug.rst
@@ -2,30 +2,8 @@ tensorpack.dataflow.imgaug package
 ==================================
 This package contains Tensorpack's augmentors.
-The imgaug module is designed to allow the following usage:
+Read the `tutorial <../tutorial/extend/augmentor.html>`_
+first for its design and general usage.
-1. Factor out randomness and determinism.
-   An augmentor may be randomized, but you can call
-   `augment_return_params <#tensorpack.dataflow.imgaug.Augmentor.augment_return_params>`_
-   to obtain the randomized parameters and then call
-   `augment_with_params <#tensorpack.dataflow.imgaug.Augmentor.augment_with_params>`_
-   on other data with the same randomized parameters.
-2. Because of (1), tensorpack's augmentor can augment multiple images together
-   easily. This is commonly used for augmenting an image together with its masks.
-3. An image augmentor (e.g. flip) may also augment a coordinate, with
-   `augment_coords <#tensorpack.dataflow.imgaug.ImageAugmentor.augment_coords>`_.
-   In this way, images can be augmented together with
-   boxes, polygons, keypoints, etc.
-   Coordinate augmentation enforces floating points coordinates
-   to avoid quantization error.
-4. Reset random seed. Random seed can be reset by
-   `reset_state <#tensorpack.dataflow.imgaug.Augmentor.reset_state>`_.
-   This is important for multi-process data loading, and
-   it is called automatically if you use tensorpack's
-   `image augmentation dataflow <dataflow.html#tensorpack.dataflow.AugmentImageComponent>`_.
 Note that other image augmentation libraries can be wrapped into Tensorpack's interface as well.
 For example, `imgaug.IAAugmentor <#tensorpack.dataflow.imgaug.IAAugmentor>`_

--- a/docs/modules/dataflow.rst
+++ b/docs/modules/dataflow.rst
 tensorpack.dataflow package
 ===========================
-Relevant tutorials: :doc:`../tutorial/dataflow`, :doc:`../tutorial/input-source`.
+Relevant tutorials: :doc:`../tutorial/dataflow`, :doc:`../tutorial/extend/input-source`.
 .. container:: custom-index

--- a/docs/modules/input_source.rst
+++ b/docs/modules/input_source.rst
 tensorpack.input_source package
 ================================
-Read the relevant tutorials first for an overview of InputSource: :doc:`../tutorial/input-source`.
+Read the relevant tutorials first for an overview of InputSource: :doc:`../tutorial/extend/input-source`.
 .. automodule:: tensorpack.input_source
    :members:

--- a/docs/tutorial/callback.md
+++ b/docs/tutorial/callback.md
@@ -13,7 +13,7 @@ There are several places where you might want to do something else:
 * Between epochs (e.g. save the model, run some validation)
 * After the training (e.g. send the model somewhere, send a message to your phone)
-We found people traditionally tend to write the training loop together with these extra features.
+People traditionally write the training loop together with these extra features.
 This makes the loop lengthy, and the code for the same feature probably get separated (imagine a
 feature which needs initialization in the beginning and then some actual work between iterations).
@@ -72,6 +72,9 @@ monitors=[        # monitors are a special kind of callbacks. these are also ena
 Notice that callbacks cover every detail of training, ranging from graph operations to the progress bar.
 This means you can customize every part of the training to your preference, e.g. display something
 different in the progress bar, evaluate part of the summaries at a different frequency, etc.
+Similar concepts also exist in other frameworks, such as Keras callbacks or
+`tf.train.SessionRunHook`. But tensorpack callbacks offer more functionality by
+design and can achieve much more, as you can see above.
 These features are not always necessary, but think about how messy the main loop would look like if you
 were to write these logic together with the loops, and how easy your life will be if you could enable

--- a/docs/tutorial/dataflow.md
+++ b/docs/tutorial/dataflow.md
@@ -45,25 +45,26 @@ with all the data preprocessing.
 Unless you are working with standard data types (image folders, LMDB, etc),
 you would usually want to write the source DataFlow (`MyDataFlow` in the above example) for your data format.
 See [another tutorial](extend/dataflow.html) for simple instructions on writing a DataFlow.
-Once you have the source reader, all the [existing DataFlows](../modules/dataflow.html) are ready for you to complete
+Once you have the source reader, all the [existing
-the rest of the data pipeline.
+DataFlows](../modules/dataflow.html) are ready for you to build up the rest of the data pipeline.
 ### Why DataFlow
 1. It's easy: write everything in pure Python, and reuse existing utilities.
 	 On the contrary, writing data loaders in TF operators is usually painful, and performance is hard to tune.
-	 See more discussions in [Python Reader or TF Reader](input-source.html#python-reader-or-tf-reader).
+	 See more discussions in [Python Reader or TF Reader](extend/input-source.html#python-reader-or-tf-reader).
 2. It's fast: see [Efficient DataFlow](efficient-dataflow.html)
 	on how to build a fast DataFlow with parallelism.
-	If you're using DataFlow with tensorpack, also see [Input Pipeline tutorial](input-source.html)
+	If you're using DataFlow with tensorpack, also see [Input Pipeline tutorial](extend/input-source.html)
 	on how tensorpack further accelerates data loading in the graph.
 Nevertheless, tensorpack supports data loading with native TF operators / TF datasets as well.
-### Use DataFlow outside Tensorpack
+### Use DataFlow in Your Own Code
-Normally, tensorpack `InputSource` interface links DataFlow to the graph for training.
+Normally, tensorpack `InputSource` interface runs the DataFlow during training.
-If you use DataFlow in other places such as your custom code, call `reset_state()` first to initialize it,
+However, DataFlow can also be used without other tensorpack components.
+If you need to run the DataFlow by yourself, call `reset_state()` first to initialize it,
 and then use the generator however you like:
 ```python
 df = SomeDataFlow()

--- a/docs/tutorial/efficient-dataflow.md
+++ b/docs/tutorial/efficient-dataflow.md
@@ -22,7 +22,7 @@ Some things to know before reading:
 	 This tutorial could be a bit complicated for people new to system architectures, but you do need these to be able to run fast enough on ImageNet-scale dataset.
 2. Having a fast Python generator **alone** may or may not improve your overall training speed.
 	 You need mechanisms to hide the latency of **all** preprocessing stages, as mentioned in the
-	 [previous tutorial](input-source.html).
+	 [InputSource tutorial](extend/input-source.html).
 3. Reading training set and validation set are different.
 	 In training it's OK to reorder, regroup, or even duplicate some datapoints, as long as the
 	 data distribution roughly stays the same.

--- a/docs/tutorial/extend/augmentor.md
+++ b/docs/tutorial/extend/augmentor.md
-#### Design of Tensorpack's imgaug Module
-The [imgaug module](../../modules/dataflow.imgaug.html) is designed to allow the following usage:
-1. Factor out randomness and determinism.
-   An augmentor may be randomized, but you can call
-   [augment_return_params](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.Augmentor.augment_return_params)
-   to obtain the randomized parameters and then call
-   [augment_with_params](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.Augmentor.augment_with_params)
-   on other data with the same randomized parameters.
-2. Because of (1), tensorpack's augmentor can augment multiple images together
-   easily. This is commonly used for augmenting an image together with its masks.
-3. An image augmentor (e.g. flip) may also augment a coordinate, with
-   [augment_coords](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.ImageAugmentor.augment_coords).
-   In this way, images can be augmented together with
-   boxes, polygons, keypoints, etc.
-   Coordinate augmentation enforces floating points coordinates
-   to avoid quantization error.
-4. Reset random seed. Random seed can be reset by
-   [reset_state](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.Augmentor.reset_state).
-   This is important for multi-process data loading, and
-   the reset method is called automatically if you use tensorpack's 
-   [image augmentation dataflow](../../modules/dataflow.html#tensorpack.dataflow.AugmentImageComponent).
 ### Write an Image Augmentor
@@ -31,7 +6,13 @@ The first thing to note: __you never have to write an augmentor__.
 An augmentor is a part of the DataFlow, so you can always
 [write a DataFlow](dataflow.html)
 to do whatever operations to your data, rather than writing an augmentor.
-Augmentors just sometimes make things easier.
+Augmentor makes things easier when what you want fits its design.
+But remember it is just an abstraction that may not always work for your use case.
+For example, if your data transformation depends on multiple dataflow components,
+or if you want to apply different transformations to different components,
+the abstraction is often not enough for you, and you need to write code on the
+DataFlow level instead.
 An image augmentor maps an image to an image.
 If you have such a mapping function `f` already, you can simply use
@@ -58,3 +39,36 @@ class MyAug(imgaug.ImageAugmentor):
    # coords is a Nx2 floating point array, each row is (x, y)
    return augmented_coords
 ```
+#### The Design of imgaug Module
+The [imgaug module](../../modules/dataflow.imgaug.html) is designed to allow the following usage:
+* Factor out randomness and determinism.
+  An augmentor may be randomized, but you can call
+  [augment_return_params](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.Augmentor.augment_return_params)
+  to obtain the randomized parameters and then call
+  [augment_with_params](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.Augmentor.augment_with_params)
+  on other data with the same randomized parameters.
+* Because of this, tensorpack's augmentors can augment multiple images together
+  easily. This is commonly used for augmenting an image together with its masks.
+* An image augmentor (e.g. flip) may also augment a coordinate, with
+  [augment_coords](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.ImageAugmentor.augment_coords).
+  In this way, images can be augmented together with
+  boxes, polygons, keypoints, etc.
+  Coordinate augmentation enforces floating-point coordinates
+  to avoid quantization error.
+* Reset random seed. Random seed can be reset by
+  [reset_state](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.Augmentor.reset_state).
+  This is important for multi-process data loading, to make sure different
+  processes get different seeds. 
+  The reset method is called automatically if you use tensorpack's 
+  [image augmentation dataflow](../../modules/dataflow.html#tensorpack.dataflow.AugmentImageComponent).
+  Otherwise, **you are responsible** for calling it yourself in subprocesses.
+  See the
+  [API documentation](../../modules/dataflow.imgaug.html#tensorpack.dataflow.imgaug.Augmentor.reset_state)
+  of this method for more details.
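
The "factor out randomness and determinism" design described in this hunk can be illustrated with a minimal standalone sketch. It does not use tensorpack: `RandomFlipSketch` and `get_params` are hypothetical names, and only the method names `augment_return_params`, `augment_with_params` and `reset_state` are borrowed from the documentation above. The real signatures may differ.

```python
import random


class RandomFlipSketch:
    """Hypothetical augmentor that separates random parameters from the transform."""

    def __init__(self, prob=0.5):
        self.prob = prob
        self.reset_state()

    def reset_state(self):
        # Re-create the RNG; call this in the process that uses the augmentor.
        self.rng = random.Random()

    def get_params(self):
        # The only random step: decide whether to flip.
        return self.rng.random() < self.prob

    def augment_with_params(self, rows, do_flip):
        # Deterministic given the parameters: mirror every row horizontally.
        if do_flip:
            return [list(row[::-1]) for row in rows]
        return [list(row) for row in rows]

    def augment_return_params(self, rows):
        params = self.get_params()
        return self.augment_with_params(rows, params), params


aug = RandomFlipSketch(prob=0.5)
image = [[1, 2, 3], [4, 5, 6]]
mask = [[0, 0, 1], [0, 1, 1]]

# Sample the parameters once on the image, then reuse them on the mask
# so that both receive exactly the same transformation.
aug_image, params = aug.augment_return_params(image)
aug_mask = aug.augment_with_params(mask, params)
```

Because the random decision is returned as a value, the image and its mask always stay aligned, whichever way the coin lands.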
--- a/docs/tutorial/index.rst
+++ b/docs/tutorial/index.rst
@@ -7,14 +7,13 @@ Introduction
 .. include:: intro.rst
-User Tutorials
+Basic Tutorials
 ========================
 .. toctree::
  :maxdepth: 1
  dataflow
-  input-source
  symbolic
  trainer
  training-interface
@@ -24,25 +23,24 @@ User Tutorials
  inference
  faq
+Advanced Tutorials
-Performance
-============
-.. toctree::
-  :maxdepth: 1
-  efficient-dataflow
-  performance-tuning
-Extend Tensorpack
 ==================
 .. toctree::
  :maxdepth: 1
  extend/dataflow
+  extend/input-source
  extend/augmentor
  extend/model
  extend/callback
  extend/trainer
+Performance
+============
+.. toctree::
+  :maxdepth: 1
+  efficient-dataflow
+  performance-tuning
--- a/docs/tutorial/input-source.md
+++ b/docs/tutorial/input-source.md
-# Input Pipeline
-This tutorial contains some general discussions on the topic of
-"how to read data efficiently to work with TensorFlow",
-and how tensorpack supports these methods.
-As a beginner you can skip this tutorial, because these are details under the tensorpack interface,
-but knowing it could help understand the efficiency and choose the best input pipeline for your task.
-## Prepare Data in Parallel
-![prefetch](https://cloud.githubusercontent.com/assets/1381301/26525192/36e5de48-4304-11e7-88ab-3b790bd0e028.png)
-A common sense no matter what framework you use:
-<center>
-Prepare data in parallel with the training!
-</center>
-The reasons are:
-1. Data preparation often consumes non-trivial time (depend on the actual problem).
-2. Data preparation often uses completely different resources from training (see figure above) --
-	doing them together doesn't slow you down. In fact you can further parallelize different stages in
-	the preparation since they also use different resources.
-3. Data preparation often doesn't depend on the result of the previous training step.
-Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
-4 P100 GPUs can train ResNet50 at 852 images/sec, and the size of those images are 852\*224\*224\*3\*4bytes = 489MB.
-Assuming you have 5GB/s `memcpy` bandwidth (roughly like this if you run single-thread copy), simply copying the data once would take 0.1s -- slowing
-down your training by 10%. Think about how many more copies are made during your preprocessing.
-Failure to hide the data preparation latency is the major reason why people
-cannot see good GPU utilization. You should __always choose a framework that enables latency hiding.__
-However most other TensorFlow wrappers are designed to be `feed_dict` based.
-Tensorpack has built-in mechanisms to hide latency of the above stages.
-This is one of the reasons why tensorpack is [faster](https://github.com/tensorpack/benchmarks).
-## Python Reader or TF Reader ?
-The above discussion is valid regardless of what you use to load/preprocess data,
-either Python code or TensorFlow operators, or a mix of two.
-Both are supported in tensorpack, while we recommend using Python.
-### TensorFlow Reader: Pros
-People often think they should use `tf.data` because it's fast.
-* Indeed it's often fast, but not necessarily. With Python you have access to many other fast libraries, which might be unsupported in TF.
-* Python may be just fast enough.
-    As long as data preparation keeps up with training, and the latency of all four blocks in the
-    above figure is hidden, __faster reader brings no gains to overall throughput__.
-    For most types of problems, up to the scale of multi-GPU ImageNet training,
-    Python can offer enough speed if you use a fast library (e.g. `tensorpack.dataflow`).
-    See the [Efficient DataFlow](efficient-dataflow.html) tutorial on how to build a fast Python reader with DataFlow.
-### TensorFlow Reader: Cons
-The disadvantage of TF reader is obvious and it's huge: it's __too complicated__.
-Unlike running a mathematical model, data processing is a complicated and poorly-structured task.
-You need to handle different formats, handle corner cases, noisy data, combination of data.
-Doing these requires condition operations, loops, data structures, sometimes even exception handling.
-These operations are __naturally not the right task for a symbolic graph__.
-Let's take a look at what users are asking for `tf.data`:
-* Different ways to [pad data](https://github.com/tensorflow/tensorflow/issues/13969), [shuffle data](https://github.com/tensorflow/tensorflow/issues/14518)
-* [Handle none values in data](https://github.com/tensorflow/tensorflow/issues/13865)
-* [Handle dataset that's not a multiple of batch size](https://github.com/tensorflow/tensorflow/issues/13745)
-* [Different levels of determinism](https://github.com/tensorflow/tensorflow/issues/13932)
-* [Sort/skip some data](https://github.com/tensorflow/tensorflow/issues/14250)
-* [Write data to files](https://github.com/tensorflow/tensorflow/issues/15014)
-To support all these features which could've been done with __3 lines of code in Python__, you need either a new TF
-API, or ask [Dataset.from_generator](https://www.tensorflow.org/versions/r1.4/api_docs/python/tf/contrib/data/Dataset#from_generator)
-(i.e. Python again) to the rescue.
-It only makes sense to use TF to read data, if your data is originally very clean and well-formated.
-If not, you may feel like writing a script to format your data, but then you're almost writing a Python loader already!
-Think about it: it's a waste of time to write a Python script to transform from some format to TF-friendly format,
-then a TF script to transform from this format to tensors.
-The intermediate format doesn't have to exist.
-You just need the right interface to connect Python to the graph directly, efficiently.
-`tensorpack.InputSource` is such an interface.
-## InputSource
-`InputSource` is an abstract interface used by tensorpack trainers, to describe where the inputs come from and how they enter the graph.
-Some choices are:
-1. [FeedInput](../modules/input_source.html#tensorpack.input_source.FeedInput):
-	Data come from a DataFlow and get fed to the graph (slow).
-2. [QueueInput](../modules/input_source.html#tensorpack.input_source.QueueInput):
-    Data come from a DataFlow and get buffered on CPU by a TF queue.
-3. [StagingInput](../modules/input_source.html#tensorpack.input_source.StagingInput):
-	Come from some other `InputSource`, then prefetched on GPU by a TF StagingArea.
-4. [TFDatasetInput](../modules/input_source.html#tensorpack.input_source.TFDatasetInput)
-	Come from a `tf.data.Dataset`.
-5. [dataflow_to_dataset](../modules/input_source.html#tensorpack.input_source.TFDatasetInput.dataflow_to_dataset)
-	Come from a DataFlow, and then lfurther processed by utilities in `tf.data.Dataset`.
-6. [TensorInput](../modules/input_source.html#tensorpack.input_source.TensorInput):
-	Come from some tensors you define (can be reading ops, for example).
-7. [ZMQInput](../modules/input_source.html#tensorpack.input_source.ZMQInput)
-	Come from some ZeroMQ pipe, where the reading/preprocessing may happen in a different process or even a different machine.
-Typically, we recommend using `DataFlow + QueueInput` as it's good for most use cases.
-If your data has to come from a separate process for whatever reasons, use `ZMQInput`.
-If you need to use TF reading ops directly, either define a `tf.data.Dataset`
-and use `TFDatasetInput`, or use `TensorInput`.
-Refer to the documentation of these `InputSource` for more details.
--- a/docs/tutorial/training-interface.md
+++ b/docs/tutorial/training-interface.md
@@ -27,14 +27,14 @@ class MyModel(ModelDesc):
    return tf.train.GradientDescentOptimizer(0.1)
 ```
-`inputs` should define the metainfo of all the inputs your graph will take to build.
+`inputs()` should define the metainfo of all the inputs your graph will take to build.
-`build_graph` takes inputs tensors that matches what you've defined in `inputs()`.
+`build_graph()` takes inputs tensors that matches what you've defined in `inputs()`.
 You can use any symbolic functions in `build_graph`, including TensorFlow core library
 functions and other symbolic libraries.
 `build_graph` will be the tower function, so you need to follow [some rules](trainer.md#tower-trainer).
-Because this interface is for single-cost training, you need to return the cost tensor.
+Because this interface is specialized for single-cost training, you need to return the cost tensor.
 After defining such a model, use it with `TrainConfig` and `launch_train_with_config`:
@@ -59,7 +59,7 @@ and
 for detailed functionalities.
 The function `launch_train_with_config(config, trainer)`
-uses the raw trainer interface and is almost equivalent to the following two lines of code:
+uses the raw trainer interface under the hood, and is almost equivalent to the following two lines of code:
 ```python
 trainer.setup_graph(
    my_model.get_inputs_desc(),
@@ -83,8 +83,8 @@ The function `launch_train_with_config` exists mainly for historical reasons.
 ### Keras Interface
-Some wrappers were made on top of tensorpack trainers, to create a Keras-like
+Some wrappers were made on top of tensorpack trainers, to create a Keras-like interface.
-interface. See [Tensorpack+Keras examples](../examples/keras) for details.
+See [Tensorpack+Keras examples](../examples/keras) for details.
 ### Raw Trainer Interface
@@ -93,11 +93,14 @@ To get a lower-level control, you can also access trainer methods directly:
 __Build the graph__:
 For single-cost trainers, build the graph by calling
 [SingleCostTrainer.setup_graph](../modules/train.html#tensorpack.train.SingleCostTrainer.setup_graph).
+For other types of tasks, you can build the graph by yourself.
 __Start training__: Call
-[Trainer.train()](../modules/train.html#tensorpack.train.Trainer.train),
+[Trainer.train()](../modules/train.html#tensorpack.train.Trainer.train) to start
-or
+training, or call
 [Trainer.train_with_defaults()](../modules/train.html#tensorpack.train.Trainer.train_with_defaults)
 which applies some defaults options for common use cases.
-Read their API documentation for detail usage.
+Read their API documentation and the
+[advanced trainer tutorial](extend/trainer.html)
+for more details.
--- a/tensorpack/dataflow/base.py
+++ b/tensorpack/dataflow/base.py
@@ -126,7 +126,7 @@ class DataFlow(object):
    def reset_state(self):
        """
-        * It's guaranteed that :meth:`reset_state` should be called **once and only once**
+        * The caller must guarantee that :meth:`reset_state` is called **once and only once**
          by the **process that uses the dataflow** before :meth:`__iter__` is called.
          The caller thread of this method should stay alive to keep this dataflow alive.
@@ -139,8 +139,13 @@ class DataFlow(object):
        * A dataflow is not fork-safe after :meth:`reset_state` is called (because this will violate the guarantee).
          A few number of dataflow is not fork-safe anytime, which will be mentioned in the docs.
-        * You should follow the above guarantee if you're using a dataflow yourself
+        * Tensorpack's built-in forking dataflows (:class:`MultiProcessPrefetchData`, :class:`MultiProcessMapData`, etc)
-          (either outside of tensorpack, or writing a wrapper dataflow)
+          and other components that use dataflows (:class:`InputSource`)
+          already take care of the responsibility of calling this method.
+        * You should take the responsibility and follow the above guarantee if you're the caller of a dataflow yourself
+          (either if you're using a dataflow outside of tensorpack,
+          or if you're writing a wrapper dataflow).
        """
        pass
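
The caller-side contract in this docstring (call `reset_state()` once in the consuming process, then iterate) can be sketched without tensorpack. `RangeFlowSketch` below is a hypothetical stand-in, not the real `DataFlow` class, and its `_ready` check exists only to make the contract visible.

```python
class RangeFlowSketch:
    """Hypothetical stand-in for a DataFlow; not the tensorpack class."""

    def __init__(self, size):
        self.size = size
        self._ready = False

    def reset_state(self):
        # Per-process initialization (RNG, file handles, ...) would go here.
        self._ready = True

    def __iter__(self):
        if not self._ready:
            raise RuntimeError("call reset_state() before iterating")
        for i in range(self.size):
            yield [i, i * i]  # one datapoint: a list of components


df = RangeFlowSketch(3)
df.reset_state()        # once, in the process that consumes the data
datapoints = list(df)   # then iterate however you like
```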

--- a/tensorpack/dataflow/imgaug/base.py
+++ b/tensorpack/dataflow/imgaug/base.py
 # -*- coding: utf-8 -*-
 # File: base.py
+import os
 import inspect
 import pprint
 from abc import ABCMeta, abstractmethod
 import six
 from six.moves import zip
+import weakref
 from ...utils.argtools import log_once
 from ...utils.utils import get_rng
@@ -15,6 +16,12 @@ from ..image import check_dtype
 __all__ = ['Augmentor', 'ImageAugmentor', 'AugmentorList']
+def _reset_augmentor_after_fork(aug_ref):
+    aug = aug_ref()
+    if aug is not None:
+        aug.reset_state()
 @six.add_metaclass(ABCMeta)
 class Augmentor(object):
    """ Base class for an augmentor"""
@@ -22,6 +29,11 @@ class Augmentor(object):
    def __init__(self):
        self.reset_state()
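+        # os.register_at_fork is only available on Unix with Python >= 3.7
+        if hasattr(os, 'register_at_fork'):
+            # Build the weak reference outside the lambda so that the hook does not
+            # capture `self` and keep the augmentor alive.
+            self_ref = weakref.ref(self)
+            os.register_at_fork(
+                after_in_child=lambda: _reset_augmentor_after_fork(self_ref))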
    def _init(self, params=None):
        if params:
            for k, v in params.items():
@@ -29,7 +41,19 @@ class Augmentor(object):
                    setattr(self, k, v)
    def reset_state(self):
-        """ reset rng and other state """
+        """
+        Reset rng and other state of the augmentor.
+        Similar to :meth:`DataFlow.reset_state`, the caller of Augmentor
+        is responsible for calling this method (one or more times) in the **process that uses the augmentor**
+        before using it.
+        If you use tensorpack's built-in augmentation dataflow (:class:`AugmentImageComponent`, etc),
+        this method will be called in the dataflow's own `reset_state` method.
+        If you use Python≥3.7 on Unix, this method will be automatically called after fork,
+        and you do not need to bother calling it.
+        """
        self.rng = get_rng(self)
    def augment(self, d):
@@ -199,5 +223,6 @@ class AugmentorList(ImageAugmentor):
    def reset_state(self):
        """ Will reset state of each augmentor """
+        super(AugmentorList, self).reset_state()
        for a in self.augmentors:
            a.reset_state()