Commit f63e0ee4 authored by Yuxin Wu

update docs; Mask R-CNN horovod mode eval only on master machine

parent 7b8728f9
## DO NOT post an issue if you're seeing this. You're at the wrong place.
To post an issue, please:
1. Click the "New Issue" button
2. __Choose your category__!
3. __Read instructions there__!
An issue has to be one of the following:
- Unexpected Problems / Potential Bugs
- Feature Requests
- Questions on Using/Understanding Tensorpack
To post an issue, please click "New Issue", choose your category, and read
instructions there.
......@@ -7,8 +7,9 @@ about: Suggest an idea for Tensorpack
+ Note that you can implement a lot of features by extending Tensorpack
(See http://tensorpack.readthedocs.io/en/latest/tutorial/index.html#extend-tensorpack).
It does not have to be added to Tensorpack unless you have a good reason.
+ "Could you improve/implement an example/paper ?"
-- The answer is: we have no plans to do so. We don't consider feature
requests for examples or implement a paper for you, unless it demonstrates
some Tensorpack features not yet demonstrated in the existing examples.
If you don't know how to do it, you may ask a usage question.
If you don't know how to do something yourself, you may ask a usage question.
......@@ -11,7 +11,7 @@ TensorFlow itself also changes API and those are not listed here.
+ [2018/08/27] msgpack is used again for "serialization to disk", because pyarrow
has no compatibility between versions. To use pyarrow instead, `export TENSORPACK_COMPATIBLE_SERIALIZE=pyarrow`.
+ [2018/04/05] msgpack is replaced by pyarrow in favor of its speed. If you want old behavior,
`export TENSORPACK_SERIALIZE=msgpack`.
`export TENSORPACK_SERIALIZE=msgpack`. It's later found that pyarrow is unstable and may lead to crash.
+ [2018/03/20] `ModelDesc` starts to use simplified interfaces:
+ `_get_inputs()` renamed to `inputs()` and returns `tf.placeholder`s.
+ `build_graph(self, tensor1, tensor2)` returns the cost tensor directly.
......
......@@ -46,11 +46,10 @@ Model:
Speed:
1. The training will start very slowly due to convolution warmup, until about
1. If cudnn warmup is on, the training will start very slowly, until about
10k steps (or more if scale augmentation is used) to reach a maximum speed.
As a result, the ETA is also inaccurate at the beginning.
You can disable warmup by `export TF_CUDNN_USE_AUTOTUNE=0`, which makes the
training faster at the beginning, but perhaps not in the end.
Warmup is by default on when no scale augmentation is used.
1. After warmup, the training speed will slowly decrease due to more accurate proposals.
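
The warmup rule described in these notes can be sketched roughly as below. This is only an illustration, not the example's actual code: `autotune_env_for_scales` and its `threshold` parameter are hypothetical names, and the 100-pixel cutoff is taken from the `finalize_configs` hunk of this commit. The sketch covers only `TF_CUDNN_USE_AUTOTUNE` and returns the settings instead of mutating the environment.

```python
import os


def autotune_env_for_scales(train_scales, threshold=100):
    """Return env vars to apply for a (min, max) short-edge training range.

    Hypothetical helper: it mirrors the rule in this commit, where cuDNN
    autotune warmup is turned off once the scale range exceeds 100 pixels.
    """
    smallest, largest = train_scales
    if largest - smallest > threshold:
        # Wide scale augmentation is on, so warmup is disabled.
        return {'TF_CUDNN_USE_AUTOTUNE': '0'}
    # Narrow range: leave the warmup default untouched.
    return {}


# Example: apply the settings for a 640-800 short-edge range.
settings = autotune_env_for_scales((640, 800))
os.environ.update(settings)
```

Returning a dict keeps the decision testable on its own; a caller that wants the behavior in the diff would copy the result into `os.environ`, as the last two lines do.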
......
# Faster-RCNN / Mask-RCNN on COCO
# Faster R-CNN / Mask R-CNN on COCO
This example provides a minimal (2k lines) and faithful implementation of the following papers:
+ [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
......@@ -73,7 +73,7 @@ prediction will need to be run with the corresponding training configs.
These models are trained with different configurations on trainval35k and evaluated on minival using mAP@IoU=0.50:0.95.
Performance in [Detectron](https://github.com/facebookresearch/Detectron/) can be roughly reproduced.
MaskRCNN results contain both box and mask mAP.
Mask R-CNN results contain both box and mask mAP.
| Backbone | mAP<br/>(box;mask) | Detectron mAP <sup>[1](#ft1)</sup><br/> (box;mask) | Time on 8 V100s | Configurations <br/> (click to expand) |
| - | - | - | - | - |
......
......@@ -215,6 +215,10 @@ def finalize_configs(is_training):
assert len(_C.CASCADE.BBOX_REG_WEIGHTS) == num_cascade
if is_training:
train_scales = _C.PREPROC.TRAIN_SHORT_EDGE_SIZE
if train_scales[1] - train_scales[0] > 100:
# don't warmup if augmentation is on
os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'
os.environ['TF_AUTOTUNE_THRESHOLD'] = '1'
assert _C.TRAINER in ['horovod', 'replicated'], _C.TRAINER
......
......@@ -24,7 +24,6 @@ from tensorpack import *
from tensorpack.tfutils.summary import add_moving_summary
from tensorpack.tfutils import optimizer
from tensorpack.tfutils.common import get_tf_version_tuple
from tensorpack.utils.serialize import loads, dumps
import tensorpack.utils.viz as tpviz
from coco import COCODetection
......@@ -417,16 +416,14 @@ class EvalCallback(Callback):
self.dataflows = [get_eval_dataflow(shard=k, num_shards=self.num_predictor)
for k in range(self.num_predictor)]
else:
if hvd.size() > hvd.local_size():
logger.warn("Distributed evaluation with horovod is unstable. Sometimes MPI hangs for unknown reasons.")
self.predictor = self._build_coco_predictor(0)
self.dataflow = get_eval_dataflow(shard=hvd.rank(), num_shards=hvd.size())
# Only eval on the first machine.
# Alternatively, can eval on all ranks and use allgather, but allgather sometimes hangs
self._horovod_run_eval = hvd.rank() == hvd.local_rank()
if self._horovod_run_eval:
self.predictor = self._build_coco_predictor(0)
self.dataflow = get_eval_dataflow(shard=hvd.local_rank(), num_shards=hvd.local_size())
# use uint8 to aggregate strings
self.local_result_tensor = tf.placeholder(tf.uint8, shape=[None], name='local_result_string')
self.concat_results = hvd.allgather(self.local_result_tensor, name='concat_results')
local_size = tf.expand_dims(tf.size(self.local_result_tensor), 0)
self.string_lens = hvd.allgather(local_size, name='concat_sizes')
self.barrier = hvd.allreduce(tf.random_normal(shape=[1]))
def _build_coco_predictor(self, idx):
graph_func = self.trainer.get_predictor(self._in_names, self._out_names, device=idx)
......@@ -443,6 +440,7 @@ class EvalCallback(Callback):
logger.info("[EvalCallback] Will evaluate every {} epochs".format(interval))
def _eval(self):
logdir = args.logdir
if cfg.TRAINER == 'replicated':
with ThreadPoolExecutor(max_workers=self.num_predictor, thread_name_prefix='EvalWorker') as executor, \
tqdm.tqdm(total=sum([df.size() for df in self.dataflows])) as pbar:
......@@ -451,23 +449,26 @@ class EvalCallback(Callback):
futures.append(executor.submit(eval_coco, dataflow, pred, pbar))
all_results = list(itertools.chain(*[fut.result() for fut in futures]))
else:
local_results = eval_coco(self.dataflow, self.predictor)
results_as_arr = np.frombuffer(dumps(local_results), dtype=np.uint8)
sizes, concat_arrs = tf.get_default_session().run(
[self.string_lens, self.concat_results],
feed_dict={self.local_result_tensor: results_as_arr})
if self._horovod_run_eval:
local_results = eval_coco(self.dataflow, self.predictor)
output_partial = os.path.join(
logdir, 'outputs{}-part{}.json'.format(self.global_step, hvd.local_rank()))
with open(output_partial, 'w') as f:
json.dump(local_results, f)
self.barrier.eval()
if hvd.rank() > 0:
return
all_results = []
start = 0
for size in sizes:
substr = concat_arrs[start: start + size]
results = loads(substr.tobytes())
all_results.extend(results)
start = start + size
for k in range(hvd.local_size()):
output_partial = os.path.join(
logdir, 'outputs{}-part{}.json'.format(self.global_step, k))
with open(output_partial, 'r') as f:
obj = json.load(f)
all_results.extend(obj)
os.unlink(output_partial)
output_file = os.path.join(
logger.get_logger_dir(), 'outputs{}.json'.format(self.global_step))
logdir, 'outputs{}.json'.format(self.global_step))
with open(output_file, 'w') as f:
json.dump(all_results, f)
try:
......@@ -572,10 +573,13 @@ if __name__ == '__main__':
if not is_horovod:
callbacks.append(GPUUtilizationTracker())
if args.load:
session_init = get_model_loader(args.load)
if is_horovod and hvd.rank() > 0:
session_init = None
else:
session_init = get_model_loader(cfg.BACKBONE.WEIGHTS) if cfg.BACKBONE.WEIGHTS else None
if args.load:
session_init = get_model_loader(args.load)
else:
session_init = get_model_loader(cfg.BACKBONE.WEIGHTS) if cfg.BACKBONE.WEIGHTS else None
traincfg = TrainConfig(
model=MODEL,
......
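
The new evaluation path in the `_eval` hunk above (each local rank writes a partial JSON file, then rank 0 merges and deletes them) can be sketched without Horovod as follows. This is a simplified illustration under stated assumptions: `write_partial` and `merge_partials` are hypothetical helper names, the file naming follows the diff, and the barrier that the real code runs between writing and merging is left out because everything here runs in one process.

```python
import json
import os


def write_partial(logdir, step, local_rank, results):
    """Write one rank's results to its own partial file (hypothetical helper)."""
    path = os.path.join(logdir, 'outputs{}-part{}.json'.format(step, local_rank))
    with open(path, 'w') as f:
        json.dump(results, f)
    return path


def merge_partials(logdir, step, local_size):
    """Merge all partial files into one output file and delete the partials.

    In the diff this step runs only on rank 0, after a barrier has ensured
    that every local rank finished writing its file.
    """
    all_results = []
    for k in range(local_size):
        path = os.path.join(logdir, 'outputs{}-part{}.json'.format(step, k))
        with open(path, 'r') as f:
            all_results.extend(json.load(f))
        os.unlink(path)
    output_file = os.path.join(logdir, 'outputs{}.json'.format(step))
    with open(output_file, 'w') as f:
        json.dump(all_results, f)
    return all_results
```

Exchanging results through files in a shared log directory avoids the `allgather` call that the removed code relied on, at the cost of requiring all evaluating ranks to see the same directory, which holds when they are on one machine.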
......@@ -447,9 +447,11 @@ class PlasmaGetData(ProxyDataFlow):
yield dp
try:
import pyarrow.plasma as plasma
except ImportError:
from ..utils.develop import create_dummy_class
PlasmaPutData = create_dummy_class('PlasmaPutData', 'pyarrow') # noqa
PlasmaGetData = create_dummy_class('PlasmaGetData', 'pyarrow') # noqa
plasma = None
# These plasma code is only experimental
# try:
# import pyarrow.plasma as plasma
# except ImportError:
# from ..utils.develop import create_dummy_class
# PlasmaPutData = create_dummy_class('PlasmaPutData', 'pyarrow') # noqa
# PlasmaGetData = create_dummy_class('PlasmaGetData', 'pyarrow') # noqa
......@@ -37,11 +37,11 @@ os.environ['TF_SYNC_ON_FINISH'] = '0' # will become default
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_GPU_THREAD_COUNT'] = '2'
# Available in TF1.6+. Haven't seen different performance on R50.
# NOTE TF set it to 0 by default, because:
# Available in TF1.6+ & cudnn7. Haven't seen different performance on R50.
# NOTE we disable it because:
# this mode may use scaled atomic integer reduction that may cause a numerical
# overflow for certain input data range.
# os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '0'
try:
import tensorflow as tf # noqa
......
......@@ -64,7 +64,7 @@ except ImportError:
dumps_msgpack = create_dummy_func( # noqa
'dumps_msgpack', ['msgpack', 'msgpack_numpy'])
if os.environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
if pa is None or os.environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
loads = loads_msgpack
dumps = dumps_msgpack
else:
......
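
The backend selection changed in the hunk above can be sketched as a small pure function. This is an illustration only: `choose_backend` is a hypothetical name, and the `'pyarrow'` result for the elided `else:` branch is an assumption based on the changelog entry earlier in this commit.

```python
import os


def choose_backend(pyarrow_available, env=None):
    """Pick a serialization backend name (hypothetical helper).

    Mirrors the condition in the diff: fall back to msgpack when pyarrow
    is missing, or when TENSORPACK_SERIALIZE=msgpack is set.
    """
    env = os.environ if env is None else env
    if not pyarrow_available or env.get('TENSORPACK_SERIALIZE') == 'msgpack':
        return 'msgpack'
    # Assumed from the changelog: pyarrow is the default otherwise.
    return 'pyarrow'
```

Before this commit only the environment variable was checked, so a missing pyarrow installation was not handled by this branch; the added `pa is None` test makes msgpack the fallback in that case.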