Commit f63e0ee4 authored by Yuxin Wu

update docs; Mask R-CNN horovod mode eval only on master machine

parent 7b8728f9
## DO NOT post an issue if you're seeing this. You're at the wrong place.
To post an issue, please:
1. Click the "New Issue" button
2. __Choose your category__!
3. __Read instructions there__!
An issue has to be one of the following:
- Unexpected Problems / Potential Bugs
- Feature Requests
- Questions on Using/Understanding Tensorpack
...@@ -7,8 +7,9 @@ about: Suggest an idea for Tensorpack
+ Note that you can implement a lot of features by extending Tensorpack
(See http://tensorpack.readthedocs.io/en/latest/tutorial/index.html#extend-tensorpack).
It does not have to be added to Tensorpack unless you have a good reason.
+ "Could you improve/implement an example/paper?"
-- The answer is: we have no plans to do so. We don't consider feature
requests for examples or implement a paper for you, unless it demonstrates
some Tensorpack features not yet demonstrated in the existing examples.
If you don't know how to do something yourself, you may ask a usage question.
...@@ -11,7 +11,7 @@ TensorFlow itself also changes API and those are not listed here.
+ [2018/08/27] msgpack is used again for "serialization to disk", because pyarrow
has no compatibility between versions. To use pyarrow instead, `export TENSORPACK_COMPATIBLE_SERIALIZE=pyarrow`.
+ [2018/04/05] msgpack is replaced by pyarrow in favor of its speed. If you want the old behavior,
`export TENSORPACK_SERIALIZE=msgpack`. It was later found that pyarrow is unstable and may lead to crashes.
+ [2018/03/20] `ModelDesc` starts to use simplified interfaces (a minimal sketch follows this list):
+ `_get_inputs()` renamed to `inputs()` and returns `tf.placeholder`s.
+ `build_graph(self, tensor1, tensor2)` returns the cost tensor directly.
......
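A minimal sketch of the simplified `ModelDesc` interface described above; the toy model, shapes, and tensor names are illustrative assumptions, not taken from this commit:

```python
import tensorflow as tf
from tensorpack import ModelDesc

class ToyModel(ModelDesc):
    def inputs(self):
        # inputs() replaces _get_inputs() and returns tf.placeholders
        return [tf.placeholder(tf.float32, (None, 28, 28), 'image'),
                tf.placeholder(tf.int32, (None,), 'label')]

    def build_graph(self, image, label):
        # build_graph() receives the input tensors and returns the cost directly
        logits = tf.layers.dense(tf.layers.flatten(image), 10)
        cost = tf.losses.sparse_softmax_cross_entropy(labels=label, logits=logits)
        return tf.identity(cost, name='total_cost')

    def optimizer(self):
        return tf.train.AdamOptimizer(1e-3)
```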
...@@ -46,11 +46,10 @@ Model:
Speed:
1. If cudnn warmup is on, the training will start very slowly, until about
10k steps (or more if scale augmentation is used) to reach a maximum speed.
As a result, the ETA is also inaccurate at the beginning.
Warmup is on by default when no scale augmentation is used (a sketch for
disabling it manually follows this list).
1. After warmup, the training speed will slowly decrease due to more accurate proposals.
......
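For reference, a minimal sketch of turning warmup off by hand. This is the same environment variable that the example's `config.py` sets automatically when scale augmentation is on; it must be set before TensorFlow runs any convolution:

```python
import os

# Skip cudnn autotuning ("warmup"): training starts fast immediately,
# but may not reach the best steady-state speed.
os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'

import tensorflow as tf  # noqa: imported after setting the variable so it takes effect
```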
# Faster R-CNN / Mask R-CNN on COCO
This example provides a minimal (2k lines) and faithful implementation of the following papers:
+ [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
...@@ -73,7 +73,7 @@ prediction will need to be run with the corresponding training configs.
These models are trained with different configurations on trainval35k and evaluated on minival using mAP@IoU=0.50:0.95.
Performance in [Detectron](https://github.com/facebookresearch/Detectron/) can be roughly reproduced.
Mask R-CNN results contain both box and mask mAP.

| Backbone | mAP<br/>(box;mask) | Detectron mAP <sup>[1](#ft1)</sup><br/> (box;mask) | Time on 8 V100s | Configurations <br/> (click to expand) |
| - | - | - | - | - |
......
...@@ -215,6 +215,10 @@ def finalize_configs(is_training):
        assert len(_C.CASCADE.BBOX_REG_WEIGHTS) == num_cascade

    if is_training:
        train_scales = _C.PREPROC.TRAIN_SHORT_EDGE_SIZE
        if train_scales[1] - train_scales[0] > 100:
            # don't warmup if augmentation is on
            os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'
        os.environ['TF_AUTOTUNE_THRESHOLD'] = '1'
        assert _C.TRAINER in ['horovod', 'replicated'], _C.TRAINER
......
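For context, `TRAIN_SHORT_EDGE_SIZE` is a `[min, max]` range for the short edge of training images, and a spread larger than 100 px is what the branch above treats as scale augmentation. The values below are hypothetical, for illustration only:

```python
# Hypothetical config values (not from this commit):
_C.PREPROC.TRAIN_SHORT_EDGE_SIZE = [640, 800]   # spread > 100 px: scale augmentation on, warmup skipped
# _C.PREPROC.TRAIN_SHORT_EDGE_SIZE = [800, 800] # fixed scale: warmup stays on
```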
...@@ -24,7 +24,6 @@ from tensorpack import *
from tensorpack.tfutils.summary import add_moving_summary
from tensorpack.tfutils import optimizer
from tensorpack.tfutils.common import get_tf_version_tuple
import tensorpack.utils.viz as tpviz
from coco import COCODetection
...@@ -417,16 +416,14 @@ class EvalCallback(Callback):
            self.dataflows = [get_eval_dataflow(shard=k, num_shards=self.num_predictor)
                              for k in range(self.num_predictor)]
        else:
            # Only eval on the first machine.
            # Alternatively, can eval on all ranks and use allgather, but allgather sometimes hangs
            self._horovod_run_eval = hvd.rank() == hvd.local_rank()
            if self._horovod_run_eval:
                self.predictor = self._build_coco_predictor(0)
                self.dataflow = get_eval_dataflow(shard=hvd.local_rank(), num_shards=hvd.local_size())
            self.barrier = hvd.allreduce(tf.random_normal(shape=[1]))
    def _build_coco_predictor(self, idx):
        graph_func = self.trainer.get_predictor(self._in_names, self._out_names, device=idx)
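In the hunk above, `hvd.allreduce` on a dummy tensor serves as a barrier: evaluating the op forces collective communication, so each rank blocks until every rank reaches it. A standalone sketch of the same trick, assuming horovod with TF1 (run under `mpirun`/`horovodrun`):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
# allreduce is a collective op, so running it blocks each rank
# until every other rank runs it too.
barrier = hvd.allreduce(tf.random_normal(shape=[1]))

with tf.Session() as sess:
    sess.run(barrier)  # returns only after all ranks arrive here
```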
...@@ -443,6 +440,7 @@ class EvalCallback(Callback):
        logger.info("[EvalCallback] Will evaluate every {} epochs".format(interval))

    def _eval(self):
        logdir = args.logdir
        if cfg.TRAINER == 'replicated':
            with ThreadPoolExecutor(max_workers=self.num_predictor, thread_name_prefix='EvalWorker') as executor, \
                    tqdm.tqdm(total=sum([df.size() for df in self.dataflows])) as pbar:
...@@ -451,23 +449,26 @@ class EvalCallback(Callback):
                    futures.append(executor.submit(eval_coco, dataflow, pred, pbar))
                all_results = list(itertools.chain(*[fut.result() for fut in futures]))
        else:
            if self._horovod_run_eval:
                local_results = eval_coco(self.dataflow, self.predictor)
                output_partial = os.path.join(
                    logdir, 'outputs{}-part{}.json'.format(self.global_step, hvd.local_rank()))
                with open(output_partial, 'w') as f:
                    json.dump(local_results, f)
            self.barrier.eval()
            if hvd.rank() > 0:
                return
            all_results = []
            for k in range(hvd.local_size()):
                output_partial = os.path.join(
                    logdir, 'outputs{}-part{}.json'.format(self.global_step, k))
                with open(output_partial, 'r') as f:
                    obj = json.load(f)
                all_results.extend(obj)
                os.unlink(output_partial)
        output_file = os.path.join(
            logdir, 'outputs{}.json'.format(self.global_step))
        with open(output_file, 'w') as f:
            json.dump(all_results, f)
        try:
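The `_eval` logic above replaces the old string-`allgather` (which sometimes hung) with part files on disk plus a barrier. A generic, hedged sketch of the same gather-via-files idea; the helper function and its arguments are hypothetical:

```python
import json
import os

def gather_via_files(rank, num_workers, local_results, logdir, step, barrier):
    """Hypothetical helper: gather per-worker results through the filesystem."""
    part = os.path.join(logdir, 'outputs{}-part{}.json'.format(step, rank))
    with open(part, 'w') as f:
        json.dump(local_results, f)
    barrier()           # every worker must finish writing before anyone reads
    if rank != 0:
        return None     # only rank 0 merges
    merged = []
    for k in range(num_workers):
        part = os.path.join(logdir, 'outputs{}-part{}.json'.format(step, k))
        with open(part, 'r') as f:
            merged.extend(json.load(f))
        os.unlink(part)  # clean up the part file
    return merged
```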
...@@ -572,10 +573,13 @@ if __name__ == '__main__':
        if not is_horovod:
            callbacks.append(GPUUtilizationTracker())

        if is_horovod and hvd.rank() > 0:
            session_init = None
        else:
            if args.load:
                session_init = get_model_loader(args.load)
            else:
                session_init = get_model_loader(cfg.BACKBONE.WEIGHTS) if cfg.BACKBONE.WEIGHTS else None

        traincfg = TrainConfig(
            model=MODEL,
......
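Setting `session_init = None` on non-zero horovod ranks is safe because horovod broadcasts rank 0's variables to every worker at startup. A minimal sketch of that standard pattern, not taken from this commit:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
# ... build the model; only rank 0 restores weights from disk ...
# The hook copies rank 0's variables to all other ranks after initialization,
# so the other ranks never need to read the checkpoint file.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    pass  # training loop would go here
```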
...@@ -447,9 +447,11 @@ class PlasmaGetData(ProxyDataFlow):
            yield dp

plasma = None
# The plasma code is only experimental.
# try:
#     import pyarrow.plasma as plasma
# except ImportError:
#     from ..utils.develop import create_dummy_class
#     PlasmaPutData = create_dummy_class('PlasmaPutData', 'pyarrow')  # noqa
#     PlasmaGetData = create_dummy_class('PlasmaGetData', 'pyarrow')  # noqa
...@@ -37,11 +37,11 @@ os.environ['TF_SYNC_ON_FINISH'] = '0'  # will become default
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_GPU_THREAD_COUNT'] = '2'

# Available in TF1.6+ & cudnn7. Haven't seen different performance on R50.
# NOTE we disable it because:
# this mode may use scaled atomic integer reduction that may cause a numerical
# overflow for certain input data range.
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '0'

try:
    import tensorflow as tf  # noqa
......
...@@ -64,7 +64,7 @@ except ImportError:
    dumps_msgpack = create_dummy_func(  # noqa
        'dumps_msgpack', ['msgpack', 'msgpack_numpy'])

if pa is None or os.environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
    loads = loads_msgpack
    dumps = dumps_msgpack
else:
......
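With the fallback above, msgpack is selected whenever pyarrow is unavailable or the environment variable requests it. A quick hedged sketch from the user side; the sample payload is made up:

```python
import os
# Must be set before tensorpack is imported, since the backend is chosen at import time.
os.environ['TENSORPACK_SERIALIZE'] = 'msgpack'

from tensorpack.utils.serialize import dumps, loads

payload = {'step': 1000, 'results': [0.372, 0.339]}
roundtrip = loads(dumps(payload))  # serialize to bytes and back with the chosen backend
print(roundtrip)
```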