Commit 3476cb43 authored by Yuxin Wu

update docs

parent 45d63caf
......@@ -8,5 +8,7 @@ about: Suggest an idea for Tensorpack
(See http://tensorpack.readthedocs.io/en/latest/tutorial/index.html#extend-tensorpack).
It does not have to be added to Tensorpack unless you have a good reason.
+ "Could you improve/implement an example/paper ?"
-- The answer is: we have no plans to do so. We don't take feature requests for
examples or implement a paper for you. If you don't know how to do it, you may ask a usage question.
-- The answer is: we have no plans to do so. We don't consider feature
requests for examples or implement a paper for you, unless it demonstrates
some Tensorpack features not yet demonstrated in the existing examples.
If you don't know how to do it, you may ask a usage question.
......@@ -14,7 +14,7 @@ that yields datapoints (lists) of two components:
a numpy array of shape (64, 28, 28), and an array of shape (64,).
As you saw,
DataFlow is __independent__ of TensorFlow since it produces any python objects
DataFlow is __independent of TensorFlow__ since it produces any python objects
(usually numpy arrays).
To `import tensorpack.dataflow`, you don't even have to install TensorFlow.
You can simply use DataFlow as a data processing pipeline and plug it into any other frameworks.
......@@ -24,7 +24,7 @@ You can simply use DataFlow as a data processing pipeline and plug it into any o
One good thing about having a standard interface is that it enables
the greatest code reusability.
There are a lot of existing DataFlow utilities in tensorpack, which you can use to compose
complex DataFlow with a long data pipeline. A common pipeline usually
DataFlow with a complex data pipeline. A common pipeline usually
would __read from disk (or other sources), apply transformations, group into batches,
prefetch data__, etc. A simple example is as follows:
......@@ -38,13 +38,12 @@ df = BatchData(df, 128)
# start 3 processes to run the dataflow in parallel
df = PrefetchDataZMQ(df, 3)
```
You can find more complicated DataFlow in the [ResNet training script](../examples/ResNet/imagenet_utils.py)
You can find more complicated DataFlow in the [ImageNet training script](../examples/ImageNetModels/imagenet_utils.py)
with all the data preprocessing.
Unless you are working with standard data types (image folders, LMDB, etc.),
you would usually want to write the source DataFlow (`MyDataFlow` in the above example) for your data format.
See [another tutorial](extend/dataflow.html)
for simple instructions on writing a DataFlow.
See [another tutorial](extend/dataflow.html) for simple instructions on writing a DataFlow.
Once you have the source reader, all the [existing DataFlows](../modules/dataflow.html) are ready for you to complete
the rest of the data pipeline.
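For reference, here is a minimal sketch of what a source DataFlow such as `MyDataFlow` above might look like (the exact method names, `get_data`/`size` vs. `__iter__`/`__len__`, depend on your tensorpack version):
```python
import numpy as np
from tensorpack.dataflow import DataFlow

class MyDataFlow(DataFlow):
    """A toy source DataFlow yielding random image/label datapoints."""
    def get_data(self):  # newer tensorpack versions use __iter__ instead
        for _ in range(self.size()):
            # each datapoint is a list of components, e.g. [image, label]
            yield [np.random.rand(28, 28).astype('float32'), np.random.randint(10)]

    def size(self):  # optional; newer tensorpack versions use __len__
        return 100
```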
......@@ -62,7 +61,7 @@ Nevertheless, tensorpack supports data loading with native TF operators / TF dat
### Use DataFlow (outside Tensorpack)
Normally, tensorpack `InputSource` interface links DataFlow to the graph for training.
If you use DataFlow in some custom code, call `reset_state()` first to initialize it,
If you use DataFlow in other places such as your custom code, call `reset_state()` first to initialize it,
and then use the generator however you like:
```python
df = SomeDataFlow()
......
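For illustration, the elided example above presumably continues along these lines (a sketch; `SomeDataFlow` and `process` are placeholders):
```python
df = SomeDataFlow()
df.reset_state()                 # must be called once before using the generator
for datapoint in df.get_data():  # newer tensorpack versions: `for datapoint in df:`
    # datapoint is a list of components, e.g. [image, label]
    process(datapoint)
```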
......@@ -11,7 +11,7 @@ There are two ways to do inference during training.
See [Write a Callback](extend/callback.html).
2. If your inference follows the paradigm of:
"fetch some tensors for each input, and aggregate the results".
"evaluate some tensors for each input, and aggregate the results in the end".
You can use the `InferenceRunner` interface with some `Inferencer`s (see the sketch below).
This will further support prefetch & data-parallel inference.
More details to come.
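A minimal sketch of that usage (assuming a validation DataFlow `dataset_val` and tensor names `cost` / `wrong-top1` defined by your own model):
```python
from tensorpack.callbacks import InferenceRunner, ScalarStats, ClassificationError

callbacks = [
    # run the Inferencers over dataset_val at the end of each epoch
    InferenceRunner(dataset_val,
                    [ScalarStats('cost'), ClassificationError('wrong-top1')]),
]
```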
......@@ -22,18 +22,19 @@ You can use this predicate to choose a different code path in inference mode.
## Inference After Training
Tensorpack is a training interface -- __it doesn't care what happened after training__.
You have everything needed for inference or model diagnosis after
You have everything you need for inference or model diagnosis after
training:
1. The trained weights: tensorpack saves them in standard TF checkpoint format.
2. The model: you've already written it yourself with TF symbolic functions.
Therefore, you can build the graph for inference, load the checkpoint, and then use whatever deployment methods TensorFlow supports.
Therefore, you can build the graph for inference, load the checkpoint, and apply
any processing or deployment TensorFlow supports.
And you'll need to read TF docs and __do it on your own__.
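As one example, a sketch of this workflow using tensorpack's own `OfflinePredictor` (the model class, checkpoint path, and tensor names below are placeholders):
```python
from tensorpack.predict import PredictConfig, OfflinePredictor
from tensorpack.tfutils.sessinit import get_model_loader

# MyModel, the checkpoint path and the tensor names are placeholders for your own.
pred = OfflinePredictor(PredictConfig(
    model=MyModel(),
    session_init=get_model_loader('/path/to/train_log/checkpoint'),
    input_names=['input'],
    output_names=['prob']))
prob = pred(one_image)[0]   # feed numpy arrays, get numpy outputs back
```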
### Don't Use Training Metagraph for Inference
Metagraph is the wrong abstraction for a "model".
It stores the entire graph which contains not only the model, but also all the
It stores the entire graph which contains not only the mathematical model, but also all the
training settings (queues, iterators, summaries, evaluations, multi-gpu replications).
Therefore it is usually wrong to import a training metagraph for inference.
......
......@@ -39,5 +39,5 @@ Variables that appear in only one side will be printed as warning.
## Transfer Learning
Therefore, transfer learning is trivial.
If you want to load some model, just use the same variable names.
If you want to load a pre-trained model, just use the same variable names.
If you want to re-train some layer, just rename it.
......@@ -2,7 +2,8 @@
# Symbolic Layers
Tensorpack contains a small collection of common model primitives,
such as conv/deconv, fc, bn, pooling layers.
such as conv/deconv, fc, bn, pooling layers. **You do not need to learn them.**
These layers were written only because there were no alternatives when
tensorpack was first developed.
Nowadays, these implementations actually call `tf.layers` directly.
......
......@@ -46,12 +46,13 @@ Model:
Speed:
1. The training will start very slow due to convolution warmup, until about 10k
steps to reach a maximum speed.
1. The training will start very slowly due to convolution warmup, until about
10k steps (or more if scale augmentation is used) to reach a maximum speed.
As a result, the ETA is also inaccurate at the beginning.
You can disable warmup by `export TF_CUDNN_USE_AUTOTUNE=0`, which makes the
training faster at the beginning, but perhaps not in the end.
1. After warmup the training speed will slowly decrease due to more accurate proposals.
1. After warmup, the training speed will slowly decrease due to more accurate proposals.
1. This implementation is about 10% slower than detectron,
probably due to the lack of specialized ops (e.g. AffineChannel, ROIAlign) in TensorFlow.
......@@ -62,7 +63,7 @@ Speed:
Possible Future Enhancements:
1. Define an interface to load custom dataset.
1. Define a better interface to load custom dataset.
1. Support batch>1 per GPU.
......
......@@ -24,7 +24,9 @@ class _COCOMeta(object):
'val2014': 'val2014',
'valminusminival2014': 'val2014',
'minival2014': 'val2014',
'test2014': 'test2014'
'test2014': 'test2014',
'train2017': 'train2017',
'val2017': 'val2017',
}
def valid(self):
......
......@@ -88,7 +88,9 @@ _C.BACKBONE.FREEZE_AT = 2 # options: 0, 1, 2
# See https://github.com/tensorflow/tensorflow/issues/18213
# In tensorpack model zoo, ResNet models with TF_PAD_MODE=False are marked with "-AlignPadding".
# All other models under `ResNet/` in the model zoo are trained with TF_PAD_MODE=True.
# All other models under `ResNet/` in the model zoo are using TF_PAD_MODE=True.
# Using either one should probably give the same performance.
# We use the "AlignPadding" one just to be consistent with caffe2.
_C.BACKBONE.TF_PAD_MODE = False
_C.BACKBONE.STRIDE_1X1 = False # True for MSRA models
......@@ -101,6 +103,7 @@ _C.TRAIN.STEPS_PER_EPOCH = 500
# LR_SCHEDULE means "steps" only when total batch size is 8.
# Otherwise the actual steps to decrease learning rate are computed from the schedule.
# Therefore, there is *no need* to modify the config if you only change the number of GPUs.
# LR_SCHEDULE = [120000, 160000, 180000] # "1x" schedule in detectron
_C.TRAIN.LR_SCHEDULE = [240000, 320000, 360000] # "2x" schedule in detectron
_C.TRAIN.NUM_EVALS = 20 # number of evaluations to run during training
......
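To make the rescaling described in the comments above concrete, here is a small sketch of the computation (NUM_GPUS and STEPS_PER_EPOCH are hypothetical values; the actual logic lives in `train.py`):
```python
# "2x" schedule, defined for a total batch size of 8 (8 GPUs x 1 image)
LR_SCHEDULE = [240000, 320000, 360000]
NUM_GPUS = 4                       # hypothetical: half the reference batch size
STEPS_PER_EPOCH = 500

factor = 8. / NUM_GPUS             # = 2.0
for steps in LR_SCHEDULE:
    epoch = int(steps * factor // STEPS_PER_EPOCH)
    print("decrease LR at (tensorpack) epoch", epoch)   # 960, 1280, 1440
```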
......@@ -292,7 +292,7 @@ def get_train_dataflow():
class: numpy array of k integers
is_crowd: k booleans. Use k False values if you don't know what it means.
segmentation: k lists of numpy arrays (one for each box).
Each list of numpy array corresponds to the mask for one instance.
Each list of numpy arrays corresponds to the mask for one instance.
Each numpy array in the list is a polygon of shape Nx2,
because one mask can be represented by N polygons.
......
......@@ -417,6 +417,8 @@ class EvalCallback(Callback):
self.dataflows = [get_eval_dataflow(shard=k, num_shards=self.num_predictor)
for k in range(self.num_predictor)]
else:
if hvd.size() > hvd.local_size():
logger.warn("Distributed evaluation with horovod is unstable. Sometimes MPI hangs for unknown reasons.")
self.predictor = self._build_coco_predictor(0)
self.dataflow = get_eval_dataflow(shard=hvd.rank(), num_shards=hvd.size())
......@@ -495,7 +497,7 @@ if __name__ == '__main__':
if get_tf_version_tuple() < (1, 6):
# https://github.com/tensorflow/tensorflow/issues/14657
logger.warn("TF<1.6 has a bug which may lead to crash in FasterRCNN training if you're unlucky.")
logger.warn("TF<1.6 has a bug which may lead to crash in FasterRCNN if you're unlucky.")
args = parser.parse_args()
if args.config:
......@@ -540,7 +542,7 @@ if __name__ == '__main__':
init_lr = cfg.TRAIN.BASE_LR * 0.33 * min(8. / cfg.TRAIN.NUM_GPUS, 1.)
warmup_schedule = [(0, init_lr), (cfg.TRAIN.WARMUP, cfg.TRAIN.BASE_LR)]
warmup_end_epoch = cfg.TRAIN.WARMUP * 1. / stepnum
lr_schedule = [(int(np.ceil(warmup_end_epoch)), cfg.TRAIN.BASE_LR)]
lr_schedule = [(int(warmup_end_epoch + 0.5), cfg.TRAIN.BASE_LR)]
factor = 8. / cfg.TRAIN.NUM_GPUS
for idx, steps in enumerate(cfg.TRAIN.LR_SCHEDULE[:-1]):
......@@ -549,6 +551,10 @@ if __name__ == '__main__':
(steps * factor // stepnum, cfg.TRAIN.BASE_LR * mult))
logger.info("Warm Up Schedule (steps, value): " + str(warmup_schedule))
logger.info("LR Schedule (epochs, value): " + str(lr_schedule))
train_dataflow = get_train_dataflow()
# This is what's commonly referred to as "epochs"
total_passes = cfg.TRAIN.LR_SCHEDULE[-1] * 8 / train_dataflow.size()
logger.info("Total passes of the training set is: {}".format(total_passes))
callbacks = [
PeriodicCallback(
......@@ -573,7 +579,7 @@ if __name__ == '__main__':
traincfg = TrainConfig(
model=MODEL,
data=QueueInput(get_train_dataflow()),
data=QueueInput(train_dataflow),
callbacks=callbacks,
steps_per_epoch=stepnum,
max_epoch=cfg.TRAIN.LR_SCHEDULE[-1] * factor // stepnum,
......
......@@ -170,7 +170,7 @@ class MultiGPUGANTrainer(TowerTrainer):
list(range(num_gpu)),
lambda: self.tower_func(*input.get_input_tensors()),
devices)
# Simply average the cost here. It might be faster to average the gradients
# For simplicity, average the cost here. It might be faster to average the gradients
with tf.name_scope('optimize'):
d_loss = tf.add_n([x[0] for x in cost_list]) * (1.0 / num_gpu)
g_loss = tf.add_n([x[1] for x in cost_list]) * (1.0 / num_gpu)
......
......@@ -35,10 +35,20 @@ class InjectShell(Callback):
"""
Allow users to create a specific file as a signal to pause
and iteratively debug the training.
Once triggered, it detects whether the file exists, and opens an
Once the :meth:`trigger` method is called, it detects whether the file exists, and opens an
IPython/pdb shell if yes.
In the shell, `self` is this callback, `self.trainer` is the trainer, and
from that you can access everything else.
Example:

.. code-block:: python

    callbacks=[InjectShell('/path/to/pause-training.tmp'), ...]

    # the following command will pause the training when the epoch finishes:
    $ touch /path/to/pause-training.tmp
"""
def __init__(self, file='INJECT_SHELL.tmp', shell='ipython'):
......