Commit eafe564b authored by Yuxin Wu

update docs; add retrying when importing metagraph (fix #1184)

parent f6ede612
@@ -49,7 +49,7 @@ Model:
 2. We use ROIAlign, and `tf.image.crop_and_resize` is __NOT__ ROIAlign.
-3. We currently only support single image per GPU.
+3. We currently only support single image per GPU in this example.
 4. Because of (3), BatchNorm statistics are supposed to be frozen during fine-tuning.
...
@@ -3,6 +3,7 @@
 # File: dump-model-params.py
 import argparse
+import sys
 import numpy as np
 import os
 import six
@@ -11,6 +12,19 @@ import tensorflow as tf
 from tensorpack.tfutils import varmanip
 from tensorpack.tfutils.common import get_op_tensor_name

+def _import_external_ops(op_name):
+    if "horovod" in op_name.lower():
+        import horovod.tensorflow  # noqa
+        return
+    if op_name == "MaxBytesInUse":
+        from tensorflow.contrib.memory_stats import MaxBytesInUse  # noqa
+        return
+    print("Your graph contains op '{}' which is not loaded into your Tensorflow runtime.".format(op_name))
+    print("Therefore the graph cannot be loaded unless you import the relevant libraries first.")
+    sys.exit(1)
+
+
 if __name__ == '__main__':
     parser = argparse.ArgumentParser(
         description='Keep only TRAINABLE and MODEL variables in a checkpoint.')
@@ -22,11 +36,15 @@ if __name__ == '__main__':
     # this script does not need GPU
     os.environ['CUDA_VISIBLE_DEVICES'] = ''
-    try:
-        tf.train.import_meta_graph(args.meta, clear_devices=True)
-    except KeyError:
-        print("If your graph contains non-standard ops, you need to import the relevant library first.")
-        raise
+    while True:
+        try:
+            tf.reset_default_graph()
+            tf.train.import_meta_graph(args.meta, clear_devices=True)
+        except KeyError as e:
+            op_name = e.args[0]
+            _import_external_ops(op_name)
+        else:
+            break

     # loading...
     if args.input.endswith('.npz'):
...
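
For reference, the retry logic above can be exercised on its own. The sketch below is a minimal standalone version of the same pattern, not the script itself: the helper name `load_graph_with_retries` and the `max_retries` bound are illustrative assumptions. `tf.train.import_meta_graph` raises a `KeyError` that names the unregistered op, so the loop imports the library that registers it and tries again.

```python
import sys
import tensorflow as tf


def load_graph_with_retries(meta_path, max_retries=10):
    """Import a metagraph, importing libraries that register missing ops on demand.

    Illustrative helper; `max_retries` is an assumed safety bound (the script uses `while True`).
    """
    for _ in range(max_retries):
        tf.reset_default_graph()
        try:
            tf.train.import_meta_graph(meta_path, clear_devices=True)
            return
        except KeyError as e:
            op_name = e.args[0]  # import_meta_graph reports the name of the unknown op
            if "horovod" in op_name.lower():
                import horovod.tensorflow  # noqa -- registers horovod ops with the TF runtime
            elif op_name == "MaxBytesInUse":
                from tensorflow.contrib.memory_stats import MaxBytesInUse  # noqa
            else:
                print("Op '{}' is not loaded into the TensorFlow runtime; "
                      "import the library that provides it first.".format(op_name))
                sys.exit(1)
```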
@@ -76,13 +76,14 @@ def BatchNorm(inputs, axis=None, training=None, momentum=0.9, epsilon=1e-5,
               sync_statistics=None,
               internal_update=None):
     """
-    Almost equivalent to `tf.layers.batch_normalization`, but different (and more powerful)
-    in the following:
+    A more powerful version of `tf.layers.batch_normalization`. It differs from
+    the official one in the following aspects:

-    1. Accepts an alternative `data_format` option when `axis` is None. For 2D input, this argument will be ignored.
-    2. Default value for `momentum` and `epsilon` is different.
-    3. Default value for `training` is automatically obtained from tensorpack's `TowerContext`, but can be overwritten.
-    4. Support the ``ema_update`` option, which cover more use cases than the standard EMA update.
+    1. Accepts an alternative ``data_format`` option when ``axis`` is None. For 2D input, this argument will be ignored.
+    2. Default value for ``momentum`` and ``epsilon`` is different.
+    3. Default value for ``training`` is automatically obtained from tensorpack's ``TowerContext``.
+       User-provided value can overwrite this behavior.
+    4. Support the ``ema_update`` option, which covers broader use cases than the standard EMA update.
     5. Support the ``sync_statistics`` option, which implements "SyncBN" and is very useful in small-batch models.

     Args:
@@ -90,46 +91,53 @@ def BatchNorm(inputs, axis=None, training=None, momentum=0.9, epsilon=1e-5,
             to normalize. By default, it is equal to `get_current_tower_context().is_training`.
             This is not a good argument name, but it is what the Tensorflow layer uses.
         ema_update (str): Only effective when ``training=True``. It has the following options:

             * "default": same as "collection". Because this is the default behavior in tensorflow.
-            * "skip": do not update EMA.
+            * "skip": do not update EMA. This can be useful when you reuse a batch norm layer in several places
+              but do not want them to all update your EMA.
             * "collection": Add EMA update ops to collection `tf.GraphKeys.UPDATE_OPS`.
-              The ops in the collection will be run automatically by the callback :class:`RunUpdateOps`.
+              The ops in the collection will be run automatically by the callback :class:`RunUpdateOps`, along with
+              your training iterations. This can waste compute if your training iterations do not always depend
+              on the BatchNorm layer.
             * "internal": EMA is updated inside this layer itself by control dependencies.
-              It has similar speed to "collection", but "internal" is recommended and can be helpful when:
+              In common cases, it has similar speed to "collection". But it covers more cases, e.g.:

              1. BatchNorm is used inside dynamic control flow.
                 The collection-based update does not support dynamic control flows.
-             2. BatchNorm layer is sometimes unused (e.g., when you have two networks to train alternatively).
+             2. BatchNorm layer is sometimes unused (e.g., in GANs you have two networks to train alternatively).
                 Putting all update ops into a single collection will waste a lot of compute.
+             3. Other parts of the model rely on the "updated" EMA. The collection-based method does not update
+                EMA immediately.

              Corresponding TF issue: https://github.com/tensorflow/tensorflow/issues/14699
         sync_statistics (str or None): one of None, "nccl", or "horovod". It determines how to compute the
             "per-batch statistics" when ``training==True``.
-            By default (None), it uses statistics of the input tensor to normalize during training.
-            This is the standard way BatchNorm was implemented in most frameworks.
-            When set to "nccl", this layer must be used under tensorpack's multi-GPU trainers.
-            It uses the aggregated statistics of the whole batch (across all GPUs) to normalize.
-            When set to "horovod", this layer must be used under tensorpack's :class:`HorovodTrainer`.
-            It uses the aggregated statistics of the whole batch (across all MPI ranks) to normalize.
-            Note that on single machine this is significantly slower than the "nccl" implementation.
+            * None: it uses statistics of the input tensor to normalize during training.
+              This is the standard way BatchNorm was implemented in most frameworks.
+            * "nccl": this layer must be used under tensorpack's multi-GPU trainers.
+              It uses the aggregated statistics of the whole batch (across all GPUs) to normalize.
+            * "horovod": this layer must be used under tensorpack's :class:`HorovodTrainer`.
+              It uses the aggregated statistics of the whole batch (across all MPI ranks) to normalize.
+              Note that on single machine this is significantly slower than the "nccl" implementation.

-            When enabled, per-GPU E[x] and E[x^2] among all GPUs are averaged to compute
-            global mean & variance. Therefore each GPU needs to have the same batch size.
+            When not None, each GPU computes its own E[x] and E[x^2],
+            which are then averaged among all GPUs to compute global mean & variance.
+            Therefore each GPU needs to have the same batch size.

             The synchronization is based on the current variable scope + the name of the layer
             (`BatchNorm('name', input)`). Therefore, you need to make sure that:

             1. The BatchNorm layer on different GPUs needs to have the same name, so that
                statistics can be synchronized. If names do not match, this layer will hang.
-            2. Different BatchNorm layers in one tower cannot share the same name.
+            2. A BatchNorm layer cannot be reused within one tower.
             3. A BatchNorm layer needs to be executed for the same number of times by all GPUs.
                If different GPUs execute one BatchNorm layer for different number of times
                (e.g., if some GPUs do not execute it), this layer may hang.

-            This option is also known as "SyncBN" or Cross-GPU BatchNorm" as mentioned in:
+            This option is also known as "SyncBN" or "Cross-GPU BatchNorm" as mentioned in:
             `MegDet: A Large Mini-Batch Object Detector <https://arxiv.org/abs/1711.07240>`_.
             Corresponding TF issue: https://github.com/tensorflow/tensorflow/issues/18222.
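
A small numpy check (not tensorpack code; the array shapes and the number of simulated GPUs are arbitrary) of the statement in the docstring above: averaging per-GPU E[x] and E[x^2] reproduces the global mean and variance only because every GPU contributes the same batch size.

```python
import numpy as np

# Simulate 4 GPUs, each seeing a batch of 32 samples with 64 channels.
per_gpu_batches = [np.random.randn(32, 64) for _ in range(4)]

# SyncBN-style aggregation: average per-GPU first and second moments.
ex = np.mean([b.mean(axis=0) for b in per_gpu_batches], axis=0)          # averaged E[x]
ex2 = np.mean([(b ** 2).mean(axis=0) for b in per_gpu_batches], axis=0)  # averaged E[x^2]
sync_mean, sync_var = ex, ex2 - ex ** 2

# Compare against statistics of the full (concatenated) batch.
full = np.concatenate(per_gpu_batches, axis=0)
assert np.allclose(sync_mean, full.mean(axis=0))
assert np.allclose(sync_var, full.var(axis=0))
```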
@@ -147,8 +155,9 @@ def BatchNorm(inputs, axis=None, training=None, momentum=0.9, epsilon=1e-5,

     Note:
         This layer is more flexible than the standard "BatchNorm" layer and provides more features:
-        1. No matter whether you're doing training or not, you can set the `training` argument
-           to use batch statistics / EMA statistics.
+
+        1. No matter whether you're doing training or not, you can set the ``training`` argument
+           to use batch statistics or EMA statistics.
            i.e., you can use batch statistics during inference, or use EMA statistics during training.
            Using EMA statistics in training is useful when you load a pre-trained BN and
            don't want to update it.
...
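
To make the "collection" vs. "internal" distinction in the ema_update docstring above concrete, here is a minimal plain-TensorFlow 1.x sketch (using `tf.layers.batch_normalization`, not tensorpack's layer): in "collection" mode the EMA update ops land in `tf.GraphKeys.UPDATE_OPS` and only run when something, such as the train op, explicitly depends on them; "internal" mode instead attaches that control dependency inside the layer itself.

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 16])
# "collection" style: the layer registers its EMA update ops in UPDATE_OPS.
y = tf.layers.batch_normalization(x, training=True, momentum=0.9, epsilon=1e-5)
loss = tf.reduce_mean(tf.square(y))

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
# Running the updates only together with the train op is what the collection mode
# (and tensorpack's RunUpdateOps callback) relies on; "internal" mode would place
# this control dependency inside the BatchNorm layer itself.
with tf.control_dependencies(update_ops):
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, feed_dict={x: np.random.rand(8, 16).astype("float32")})
```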