update docs; Mask R-CNN horovod mode eval only on master machine

Commit f63e0ee4 authored Sep 01, 2018 by Yuxin Wu
10 changed files
--- a/.github/ISSUE_TEMPLATE.md
+++ b/.github/ISSUE_TEMPLATE.md
+## DO NOT post an issue if you're seeing this. You're at the wrong place.
+To post an issue, please:
+1. Click the "New Issue" button
+2. __Choose your category__!
+3. __Read instructions there__!
 An issue has to be one of the following:
 - Unexpected Problems / Potential Bugs
 - Feature Requests
 - Questions on Using/Understanding Tensorpack
-To post an issue, please click "New Issue", choose your category, and read
-instructions there.
--- a/.github/ISSUE_TEMPLATE/feature-requests.md
+++ b/.github/ISSUE_TEMPLATE/feature-requests.md
@@ -7,8 +7,9 @@ about: Suggest an idea for Tensorpack
 + Note that you can implement a lot of features by extending Tensorpack
  (See http://tensorpack.readthedocs.io/en/latest/tutorial/index.html#extend-tensorpack).
  It does not have to be added to Tensorpack unless you have a good reason.
 + "Could you improve/implement an example/paper ?"
  -- The answer is: we have no plans to do so. We don't consider feature
  requests for examples or implement a paper for you, unless it demonstrates 
  some Tensorpack features not yet demonstrated in the existing examples.
-  If you don't know how to do it, you may ask a usage question.
+  If you don't know how to do something yourself, you may ask a usage question.
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -11,7 +11,7 @@ TensorFlow itself also changes API and those are not listed here.
 + [2018/08/27] msgpack is used again for "serialization to disk", because pyarrow
  has no compatibility between versions. To use pyarrow instead, `export TENSORPACK_COMPATIBLE_SERIALIZE=pyarrow`.
 + [2018/04/05] msgpack is replaced by pyarrow in favor of its speed. If you want old behavior,
-	`export TENSORPACK_SERIALIZE=msgpack`.
+	`export TENSORPACK_SERIALIZE=msgpack`. It was later found that pyarrow is unstable and may crash.
 + [2018/03/20] `ModelDesc` starts to use simplified interfaces:
 	+ `_get_inputs()` renamed to `inputs()` and returns `tf.placeholder`s.
 	+ `build_graph(self, tensor1, tensor2)` returns the cost tensor directly.

--- a/examples/FasterRCNN/NOTES.md
+++ b/examples/FasterRCNN/NOTES.md
@@ -46,11 +46,10 @@ Model:
 Speed:
-1. The training will start very slowly due to convolution warmup, until about
+1. If cudnn warmup is on, the training will start very slowly, until about
   10k steps (or more if scale augmentation is used) to reach a maximum speed.
   As a result, the ETA is also inaccurate at the beginning.
-   You can disable warmup by `export TF_CUDNN_USE_AUTOTUNE=0`, which makes the
+   Warmup is by default on when no scale augmentation is used.
-   training faster at the beginning, but perhaps not in the end.
 1. After warmup, the training speed will slowly decrease due to more accurate proposals.

--- a/examples/FasterRCNN/README.md
+++ b/examples/FasterRCNN/README.md
-# Faster-RCNN / Mask-RCNN on COCO
+# Faster R-CNN / Mask R-CNN on COCO
 This example provides a minimal (2k lines) and faithful implementation of the following papers:
 + [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
@@ -73,7 +73,7 @@ prediction will need to be run with the corresponding training configs.
 These models are trained with different configurations on trainval35k and evaluated on minival using mAP@IoU=0.50:0.95.
 Performance in [Detectron](https://github.com/facebookresearch/Detectron/) can be roughly reproduced.
-MaskRCNN results contain both box and mask mAP.
+Mask R-CNN results contain both box and mask mAP.
 | Backbone | mAP<br/>(box;mask) | Detectron mAP <sup>[1](#ft1)</sup><br/> (box;mask) | Time on 8 V100s | Configurations <br/> (click to expand) |
 | - | - | - | - | - |

--- a/examples/FasterRCNN/config.py
+++ b/examples/FasterRCNN/config.py
@@ -215,6 +215,10 @@ def finalize_configs(is_training):
            assert len(_C.CASCADE.BBOX_REG_WEIGHTS) == num_cascade
    if is_training:
+        train_scales = _C.PREPROC.TRAIN_SHORT_EDGE_SIZE
+        if train_scales[1] - train_scales[0] > 100:
+            # don't warmup if augmentation is on
+            os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'
        os.environ['TF_AUTOTUNE_THRESHOLD'] = '1'
        assert _C.TRAINER in ['horovod', 'replicated'], _C.TRAINER
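
Note on the config.py hunk above: the added lines turn off cuDNN autotune (the "warmup" described in NOTES.md) when the training short-edge range is wider than 100 pixels. A minimal standalone sketch of that decision follows. The function name `configure_cudnn_autotune`, the `threshold` parameter, and the `environ` argument are hypothetical and exist only so the logic can be exercised without touching the real process environment.

```python
import os


def configure_cudnn_autotune(train_short_edge_size, threshold=100, environ=None):
    """Set the autotune variables the way the hunk does.

    `train_short_edge_size` is a (min, max) pair, like
    `_C.PREPROC.TRAIN_SHORT_EDGE_SIZE` in the diff. Everything else in this
    signature is an assumption made for illustration.
    """
    env = os.environ if environ is None else environ
    lo, hi = train_short_edge_size
    if hi - lo > threshold:
        # Wide scale range means scale augmentation is on: skip warmup.
        env['TF_CUDNN_USE_AUTOTUNE'] = '0'
    # Set unconditionally for training, as in the diff.
    env['TF_AUTOTUNE_THRESHOLD'] = '1'
    return env


# With a (640, 800) range the difference is 160, so autotune is disabled.
print(configure_cudnn_autotune((640, 800), environ={}))
# With a fixed (800, 800) size the variable is left untouched.
print(configure_cudnn_autotune((800, 800), environ={}))
```

Passing a plain dict as `environ` keeps the sketch side-effect free; the real code writes to `os.environ` before TensorFlow is used.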

--- a/examples/FasterRCNN/train.py
+++ b/examples/FasterRCNN/train.py
@@ -24,7 +24,6 @@ from tensorpack import *
 from tensorpack.tfutils.summary import add_moving_summary
 from tensorpack.tfutils import optimizer
 from tensorpack.tfutils.common import get_tf_version_tuple
-from tensorpack.utils.serialize import loads, dumps
 import tensorpack.utils.viz as tpviz
 from coco import COCODetection
@@ -417,16 +416,14 @@ class EvalCallback(Callback):
            self.dataflows = [get_eval_dataflow(shard=k, num_shards=self.num_predictor)
                              for k in range(self.num_predictor)]
        else:
-            if hvd.size() > hvd.local_size():
+            # Only eval on the first machine.
-                logger.warn("Distributed evaluation with horovod is unstable. Sometimes MPI hangs for unknown reasons.")
+            # Alternatively, can eval on all ranks and use allgather, but allgather sometimes hangs
+            self._horovod_run_eval = hvd.rank() == hvd.local_rank()
+            if self._horovod_run_eval:
                self.predictor = self._build_coco_predictor(0)
-            self.dataflow = get_eval_dataflow(shard=hvd.rank(), num_shards=hvd.size())
+                self.dataflow = get_eval_dataflow(shard=hvd.local_rank(), num_shards=hvd.local_size())
-            # use uint8 to aggregate strings
+            self.barrier = hvd.allreduce(tf.random_normal(shape=[1]))
-            self.local_result_tensor = tf.placeholder(tf.uint8, shape=[None], name='local_result_string')
-            self.concat_results = hvd.allgather(self.local_result_tensor, name='concat_results')
-            local_size = tf.expand_dims(tf.size(self.local_result_tensor), 0)
-            self.string_lens = hvd.allgather(local_size, name='concat_sizes')
    def _build_coco_predictor(self, idx):
        graph_func = self.trainer.get_predictor(self._in_names, self._out_names, device=idx)
@@ -443,6 +440,7 @@ class EvalCallback(Callback):
            logger.info("[EvalCallback] Will evaluate every {} epochs".format(interval))
    def _eval(self):
+        logdir = args.logdir
        if cfg.TRAINER == 'replicated':
            with ThreadPoolExecutor(max_workers=self.num_predictor, thread_name_prefix='EvalWorker') as executor, \
                    tqdm.tqdm(total=sum([df.size() for df in self.dataflows])) as pbar:
@@ -451,23 +449,26 @@ class EvalCallback(Callback):
                    futures.append(executor.submit(eval_coco, dataflow, pred, pbar))
                all_results = list(itertools.chain(*[fut.result() for fut in futures]))
        else:
+            if self._horovod_run_eval:
                local_results = eval_coco(self.dataflow, self.predictor)
-            results_as_arr = np.frombuffer(dumps(local_results), dtype=np.uint8)
+                output_partial = os.path.join(
-            sizes, concat_arrs = tf.get_default_session().run(
+                    logdir, 'outputs{}-part{}.json'.format(self.global_step, hvd.local_rank()))
-                [self.string_lens, self.concat_results],
+                with open(output_partial, 'w') as f:
-                feed_dict={self.local_result_tensor: results_as_arr})
+                    json.dump(local_results, f)
+            self.barrier.eval()
            if hvd.rank() > 0:
                return
            all_results = []
-            start = 0
+            for k in range(hvd.local_size()):
-            for size in sizes:
+                output_partial = os.path.join(
-                substr = concat_arrs[start: start + size]
+                    logdir, 'outputs{}-part{}.json'.format(self.global_step, k))
-                results = loads(substr.tobytes())
+                with open(output_partial, 'r') as f:
-                all_results.extend(results)
+                    obj = json.load(f)
-                start = start + size
+                all_results.extend(obj)
+                os.unlink(output_partial)
        output_file = os.path.join(
-            logger.get_logger_dir(), 'outputs{}.json'.format(self.global_step))
+            logdir, 'outputs{}.json'.format(self.global_step))
        with open(output_file, 'w') as f:
            json.dump(all_results, f)
        try:
@@ -572,6 +573,9 @@ if __name__ == '__main__':
        if not is_horovod:
            callbacks.append(GPUUtilizationTracker())
+        if is_horovod and hvd.rank() > 0:
+            session_init = None
+        else:
            if args.load:
                session_init = get_model_loader(args.load)
            else:
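
Note on the train.py hunks above: the horovod path replaces the `allgather` of serialized results with a file-based gather. Each local rank on the first machine writes its partial results to `outputs{step}-part{rank}.json`, all ranks pass a barrier, and rank 0 then reads, merges, and deletes the partial files. The sketch below reproduces only the write and merge steps in a single process, with no horovod involved. The helper names `partial_path`, `write_partial`, and `merge_partials` are hypothetical, and the barrier (`hvd.allreduce` on a dummy tensor in the diff) is not modeled.

```python
import json
import os
import tempfile


def partial_path(logdir, global_step, local_rank):
    # Same file naming scheme as in the diff.
    return os.path.join(
        logdir, 'outputs{}-part{}.json'.format(global_step, local_rank))


def write_partial(logdir, global_step, local_rank, local_results):
    """What each evaluating rank does: dump its own results to a file."""
    with open(partial_path(logdir, global_step, local_rank), 'w') as f:
        json.dump(local_results, f)


def merge_partials(logdir, global_step, local_size):
    """What rank 0 does after the barrier: read, merge, and delete the parts."""
    all_results = []
    for k in range(local_size):
        path = partial_path(logdir, global_step, k)
        with open(path, 'r') as f:
            all_results.extend(json.load(f))
        os.unlink(path)
    return all_results


# Simulate two local ranks sequentially in one process.
with tempfile.TemporaryDirectory() as logdir:
    write_partial(logdir, 100, 0, [{'image_id': 1}])
    write_partial(logdir, 100, 1, [{'image_id': 2}])
    merged = merge_partials(logdir, 100, local_size=2)
    print(merged)
```

This pattern relies on every writer and the reader seeing the same `logdir`, which holds in the diff because only ranks on the first machine evaluate. The barrier matters in the real code: without it, rank 0 could try to read a partial file that another rank has not finished writing.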

--- a/tensorpack/dataflow/parallel.py
+++ b/tensorpack/dataflow/parallel.py
@@ -447,9 +447,11 @@ class PlasmaGetData(ProxyDataFlow):
            yield dp
-try:
+plasma = None
-    import pyarrow.plasma as plasma
+# This plasma code is experimental only
-except ImportError:
+# try:
-    from ..utils.develop import create_dummy_class
+#     import pyarrow.plasma as plasma
-    PlasmaPutData = create_dummy_class('PlasmaPutData', 'pyarrow')   # noqa
+# except ImportError:
-    PlasmaGetData = create_dummy_class('PlasmaGetData', 'pyarrow')   # noqa
+#     from ..utils.develop import create_dummy_class
+#     PlasmaPutData = create_dummy_class('PlasmaPutData', 'pyarrow')   # noqa
+#     PlasmaGetData = create_dummy_class('PlasmaGetData', 'pyarrow')   # noqa
--- a/tensorpack/libinfo.py
+++ b/tensorpack/libinfo.py
@@ -37,11 +37,11 @@ os.environ['TF_SYNC_ON_FINISH'] = '0'   # will become default
 os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
 os.environ['TF_GPU_THREAD_COUNT'] = '2'
-# Available in TF1.6+. Haven't seen different performance on R50.
+# Available in TF1.6+ & cudnn7. Haven't seen different performance on R50.
-# NOTE TF set it to 0 by default, because:
+# NOTE we disable it because:
 # this mode may use scaled atomic integer reduction that may cause a numerical
 # overflow for certain input data range.
-# os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
+os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '0'
 try:
    import tensorflow as tf  # noqa

--- a/tensorpack/utils/serialize.py
+++ b/tensorpack/utils/serialize.py
@@ -64,7 +64,7 @@ except ImportError:
    dumps_msgpack = create_dummy_func(  # noqa
        'dumps_msgpack', ['msgpack', 'msgpack_numpy'])
-if os.environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
+if pa is None or os.environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
    loads = loads_msgpack
    dumps = dumps_msgpack
 else:
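
Note on the serialize.py hunk above: the new condition falls back to msgpack both when the user asks for it through `TENSORPACK_SERIALIZE=msgpack` and when `pa` is `None`. The sketch below isolates that selection rule. Two things are assumptions here: that `pa` is `None` when pyarrow could not be imported, and that the `else` branch (cut off in the diff) selects the pyarrow functions, as the CHANGES.md entry suggests. The function name `choose_backend` is hypothetical.

```python
def choose_backend(pa_module, environ):
    """Return the name of the serializer backend to use.

    `pa_module` stands in for the `pa` name in serialize.py; it is assumed
    to be None when pyarrow is unavailable. `environ` is any mapping that
    behaves like os.environ.
    """
    if pa_module is None or environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
        return 'msgpack'
    # Assumed default, since the else branch is not shown in the diff.
    return 'pyarrow'


fake_pyarrow = object()  # placeholder for an imported pyarrow module
print(choose_backend(None, {}))
print(choose_backend(fake_pyarrow, {}))
print(choose_backend(fake_pyarrow, {'TENSORPACK_SERIALIZE': 'msgpack'}))
```

Before this change, a missing pyarrow with no environment override would have reached the `else` branch; the added `pa is None` check routes that case to msgpack instead.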