Commit cbd698ad authored by Yuxin Wu

get_model_loader -> SmartRestore; improve Horovod integration

parent d71184c4
@@ -377,6 +377,7 @@ _DEPRECATED_NAMES = set([
    'InputDesc',
    'inputs_desc',
    'Augmentor',
+   "get_model_loader",
    # renamed items that should not appear in docs
    'DumpTensor',
...
@@ -21,7 +21,7 @@ By writing callbacks to implement what to do at each place, tensorpack trainers
will call the callbacks at the proper time.
Therefore these features can be reused with one single line, as long as you are using tensorpack trainers.
-For example, these are the callbacks I used when training a ResNet:
+For example, here are some useful callbacks I used during model development:
```python
callbacks=[
@@ -43,7 +43,7 @@ callbacks=[
        -d type=note -d title="validation error" \\
        -d body={val-error-top1} > /dev/null 2>&1',
      'val-error-top1'),
-  # record GPU utilizations during training
+  # record GPU utilization during training
  GPUUtilizationTracker(),
  # touch a file to pause the training and start a debug shell, to observe what's going on
  InjectShell(shell='ipython'),
@@ -69,12 +69,12 @@ monitors=[  # monitors are a special kind of callbacks. these are also ena
]
```
-You can see from the above snippet, that callbacks cover every detail of training, ranging from graph operations to the progress bar.
+You can see from the above snippet that callbacks cover every detail of training, from graph operations to the progress bar.
This means you can customize every part of the training to your preference, e.g. display something
different in the progress bar, evaluate part of the summaries at a different frequency, etc.
Similar concepts also exist in other frameworks, such as Keras callbacks, or
`tf.train.SessionRunHook`. But tensorpack callbacks have more functionalities in
-design, and can achive much more features, as you can see above.
+design, and can support many more features, as you can see above.
These features are not always necessary, but think about how messy the main loop would look if you
were to write this logic together with the loops, and how easy your life will be if you could enable
...
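As a concrete illustration of the callback interface described in this doc hunk, here is a minimal sketch of a custom callback. The class name and the logged value are made up for this example and are not part of the commit:

```python
from tensorpack.callbacks import Callback

class GlobalStepLogger(Callback):
    """A hypothetical callback: report the global step once per epoch."""

    def _trigger_epoch(self):
        # `trainer.monitors` is the interface callbacks use to report scalars
        self.trainer.monitors.put_scalar("custom/global_step", self.trainer.global_step)
```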
@@ -20,10 +20,10 @@ demos how to print all variables and their shapes in a checkpoint.
Tensorpack includes another tool to save variables to a TF checkpoint, see
[save_chkpt_vars](../modules/tfutils.html#tensorpack.tfutils.varmanip.save_chkpt_vars).
-## Work with npz Files in Model Zoo
+## Work with .npz Files in the Model Zoo
Most models provided by tensorpack are in npz (dictionary) format,
-because it's easy to manipulate without TF dependency.
+because it's easy to use without a TF dependency.
You can read/write them with `np.load` and `np.savez`.
[scripts/dump-model-params.py](../scripts/dump-model-params.py) can be used to remove unnecessary variables in a checkpoint
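Because these files are plain npz dictionaries, nothing beyond numpy is needed to edit them. A minimal sketch, with file and variable names as placeholders:

```python
import numpy as np

params = dict(np.load("ResNet50.npz"))  # {var-name: ndarray}
params.pop("linear/W", None)            # e.g. drop the final classifier weights
np.savez("ResNet50-edited.npz", **params)
```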
@@ -34,24 +34,24 @@ It dumps the model to a `var-name: value` dict saved in npz format.
## Load a Model to a Session
Model loading (in both training and inference) is through the `session_init` interface.
-For training, use `session_init` in `TrainConfig` or `Trainer.train()`.
-For inference, use `session_init` in `PredictConfig`.
+For training, use `session_init` in `TrainConfig(...)` or `Trainer.train(...)`.
+For inference, use `session_init` in `PredictConfig(...)`.
-There are two ways a session can be initialized:
-[session_init=SaverRestore(...)](../modules/tfutils.html#tensorpack.tfutils.sessinit.SaverRestore)
-which restores a TF checkpoint,
-or [session_init=DictRestore(...)](../modules/tfutils.html#tensorpack.tfutils.sessinit.DictRestore) which restores a dict.
-`DictRestore` is the most general loader because you can make arbitrary changes
-you need (e.g., remove variables, rename variables) to the dict.
-To load multiple models, use [ChainInit](../modules/tfutils.html#tensorpack.tfutils.sessinit.ChainInit).
-To load an npz file from tensorpack model zoo to a session, you can use `DictRestore(dict(np.load(filename)))`.
-You can also use
-[get_model_loader(filename)](../modules/tfutils.html#tensorpack.tfutils.sessinit.get_model_loader),
-a small helper which returns either a `SaverRestore` or a `DictRestore` based on the file name.
+There are a few ways a session can be initialized:
+```
+session_init=SmartRestore("path/to/checkpoint")     # load a TF checkpoint
+session_init=SmartRestore("path/to/model_zoo.npz")  # load a tensorpack model zoo model
+session_init=SmartRestore(dict_of_parameters)       # load a dictionary
+session_init=SmartRestore(["path1", dict2])         # load them sequentially
+```
+[SmartRestore](../modules/tfutils.html#tensorpack.tfutils.sessinit.SmartRestore)
+is in fact a small helper which uses some heuristics to return one of
+[SaverRestore](../modules/tfutils.html#tensorpack.tfutils.sessinit.SaverRestore) or
+[DictRestore](../modules/tfutils.html#tensorpack.tfutils.sessinit.DictRestore).
+They are responsible for the actual initialization work.
-Whatever you use in `session_init`, this is what happen during the loading:
+Whatever you use in `session_init`, this is what happens during loading:
* Variable restoring is completely based on __exact name match__ between
variables in the current graph and variables in the `session_init` initializer.
...
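To make those heuristics concrete, here is a sketch of what `SmartRestore` returns for each kind of input. The paths are placeholders and are assumed to exist; `w` is a placeholder array:

```python
import numpy as np
from tensorpack.tfutils.sessinit import SmartRestore

w = np.zeros((3, 3), dtype="float32")           # placeholder weight
SmartRestore("train_log/model-10000")           # checkpoint prefix -> SaverRestore
SmartRestore("ResNet50.npz")                    # npz file          -> DictRestore
SmartRestore({"conv1/W": w})                    # dict              -> DictRestore
SmartRestore(["ResNet50.npz", {"conv1/W": w}])  # list              -> ChainInit of the above
SmartRestore(None)                              # empty             -> JustCurrentSession
```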
# Trainers
-Tensorpack follows the "define-and-run" paradigm.
+TensorFlow & Tensorpack follow the "define-and-run" paradigm.
Therefore training consists of two steps:
1. __Define__: Build the graph for the model.
   Users can call any TensorFlow functions to set up the graph.
   Users may or may not use tensorpack `InputSource`, `ModelDesc` or other utilities to build the graph.
-   The goal of this step is to define "what to run" in later training steps,
-   and it can happen __either inside or outside__ tensorpack trainer.
+   The goal of this step is to define "what to run" in later training steps.
2. __Run__: Train the model (the [Trainer.train() method](/modules/train.html#tensorpack.train.Trainer.train)):
@@ -26,7 +25,7 @@ by exploiting some universal patterns.
In research we do training of various kinds.
Tensorpack trainers avoid making assumptions on what type of training
you want to do. For example, unlike Keras, tensorpack does not wrongly assume that:
-1. Your training is batched
+1. Your training data is batched
2. Your training is gradient-based optimization
3. Your data has `X` (inputs) and `y` (outputs)
4. You want to evaluate on zero or one validation dataset
@@ -48,7 +47,8 @@ Users or derived trainers should implement __what the iterations are__.
In fact, the steps per epoch can be any number
and it only affects the [schedule of callbacks](callback.html).
In other words, an "epoch" in tensorpack is the __default period to run
-callbacks__ (validation, summary, checkpoint, etc.). It has nothing to do with your dataset.
+callbacks__ (validation, summary, checkpoint, etc.).
+So this assumption effectively puts no extra constraints.
### Built-in Trainers
...
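For example, since an epoch is only the callback period, you can pick it freely to run validation callbacks more or less often. A sketch in the same placeholder style as the docs (`MyModel` and `my_dataflow` are placeholders):

```python
from tensorpack import TrainConfig

config = TrainConfig(
    model=MyModel(),
    dataflow=my_dataflow,
    steps_per_epoch=500,  # callbacks fire every 500 iterations,
                          # regardless of how large the dataset is
)
```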
@@ -42,7 +42,7 @@ After defining such a model, use it with `TrainConfig` and `launch_train_with_co
config = TrainConfig(
    model=MyModel(),
    dataflow=my_dataflow,
-   # data=my_inputsource, # alternatively, use a customized InputSource
+   # data=my_inputsource, # alternatively, use an InputSource
    callbacks=[...],      # some default callbacks are automatically applied
    # some default monitors are automatically applied
    steps_per_epoch=300,  # defaults to the size of your InputSource/DataFlow
...
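The config is then handed to a trainer to actually start the loop. A minimal sketch of the launch step, assuming the `config` object from the snippet above:

```python
from tensorpack import SimpleTrainer, launch_train_with_config

launch_train_with_config(config, SimpleTrainer())
```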
@@ -284,6 +284,8 @@ def finalize_configs(is_training):
    if _C.TRAINER == 'horovod':
        import horovod.tensorflow as hvd
        ngpu = hvd.size()
+       logger.info("Horovod Rank={}, Size={}, LocalRank={}".format(
+           hvd.rank(), hvd.size(), hvd.local_rank()))
    else:
        assert 'OMPI_COMM_WORLD_SIZE' not in os.environ
        ngpu = get_num_gpu()
...
@@ -45,18 +45,17 @@ if __name__ == '__main__':
    register_coco(cfg.DATA.BASEDIR)  # add COCO datasets to the registry
    register_balloon(cfg.DATA.BASEDIR)  # add the demo balloon datasets to the registry
-   # Setup logger ...
+   # Setup logging ...
    is_horovod = cfg.TRAINER == 'horovod'
-   if is_horovod:
-       hvd.init()
-       logger.info("Horovod Rank={}, Size={}".format(hvd.rank(), hvd.size()))
    if not is_horovod or hvd.rank() == 0:
        logger.set_logger_dir(args.logdir, 'd')
    logger.info("Environment Information:\n" + collect_env_info())
    finalize_configs(is_training=True)
+   # Create model
+   MODEL = ResNetFPNModel() if cfg.MODE_FPN else ResNetC4Model()
    # Compute the training schedule from the number of GPUs ...
    stepnum = cfg.TRAIN.STEPS_PER_EPOCH
    # warmup is step based, lr is epoch based
@@ -77,9 +76,7 @@ if __name__ == '__main__':
    total_passes = cfg.TRAIN.LR_SCHEDULE[-1] * 8 / train_dataflow.size()
    logger.info("Total passes of the training set is: {:.5g}".format(total_passes))
-   # Create model and callbacks ...
-   MODEL = ResNetFPNModel() if cfg.MODE_FPN else ResNetC4Model()
+   # Create callbacks ...
    callbacks = [
        PeriodicCallback(
            ModelSaver(max_to_keep=10, keep_checkpoint_every_n_hours=1),
@@ -93,23 +90,22 @@ if __name__ == '__main__':
        ThroughputTracker(samples_per_step=cfg.TRAIN.NUM_GPUS),
        EstimatedTimeLeft(median=True),
        SessionRunTimeout(60000),  # 1 minute timeout
+       GPUUtilizationTracker()
    ]
    if cfg.TRAIN.EVAL_PERIOD > 0:
        callbacks.extend([
            EvalCallback(dataset, *MODEL.get_inference_tensor_names(), args.logdir)
            for dataset in cfg.DATA.VAL
        ])
-   if not is_horovod:
-       callbacks.append(GPUUtilizationTracker())
    if is_horovod and hvd.rank() > 0:
        session_init = None
    else:
        if args.load:
            # ignore mismatched values, so you can `--load` a model for fine-tuning
-           session_init = get_model_loader(args.load, ignore_mismatch=True)
+           session_init = SmartRestore(args.load, ignore_mismatch=True)
        else:
-           session_init = get_model_loader(cfg.BACKBONE.WEIGHTS) if cfg.BACKBONE.WEIGHTS else None
+           session_init = SmartRestore(cfg.BACKBONE.WEIGHTS)
    traincfg = TrainConfig(
        model=MODEL,
@@ -120,6 +116,7 @@ if __name__ == '__main__':
        session_init=session_init,
        starting_epoch=cfg.TRAIN.STARTING_EPOCH
    )
    if is_horovod:
        trainer = HorovodTrainer(average=False)
    else:
...
@@ -37,26 +37,41 @@ class GPUUtilizationTracker(Callback):
    def __init__(self, devices=None):
        """
        Args:
-           devices (list[int]): physical GPU ids. If None, will use CUDA_VISIBLE_DEVICES
+           devices (list[int]): physical GPU ids to monitor. If None, will guess from the environment.
        """
        assert os.name != 'nt', "GPUUtilizationTracker does not support windows!"
-       if devices is None:
-           env = os.environ.get('CUDA_VISIBLE_DEVICES')
-           if env is None:
-               self._devices = list(range(get_num_gpu()))
-               if len(self._devices) > 1:
-                   logger.warn("[GPUUtilizationTracker] Both devices and CUDA_VISIBLE_DEVICES are None! "
-                               "Will monitor all {} visible GPUs!".format(len(self._devices)))
-           else:
-               if len(env):
-                   self._devices = list(map(int, env.split(',')))
-               else:
-                   self._devices = []
-       else:
-           self._devices = devices
-       assert len(self._devices), "[GPUUtilizationTracker] No GPU device given!"
+       self._devices = devices
+       self._enabled = True
+
+   def _guess_devices(self):
+       env = os.environ.get('CUDA_VISIBLE_DEVICES')
+       if env is None:
+           devices = list(range(get_num_gpu()))
+           if len(devices) > 1:
+               logger.warn("[GPUUtilizationTracker] Both devices and CUDA_VISIBLE_DEVICES are None! "
+                           "Will monitor all {} visible GPUs!".format(len(devices)))
+       else:
+           if len(env):
+               devices = list(map(int, env.split(',')))
+           else:
+               devices = []
+       return devices

    def _setup_graph(self):
+       # special heuristics for Horovod
+       from ..train import HorovodTrainer
+       if isinstance(self.trainer, HorovodTrainer):
+           if self.trainer.mpi_enabled():
+               logger.warn("GPUUtilizationTracker is disabled under MPI.")
+               self._enabled = False
+               return
+           else:
+               self._devices = [self.trainer.hvd.local_rank()]
+
+       if self._devices is None:
+           self._devices = self._guess_devices()
+       assert len(self._devices), "[GPUUtilizationTracker] No GPU device given!"
        self._evt = mp.Event()
        self._stop_evt = mp.Event()
        self._queue = mp.Queue()
@@ -69,16 +84,20 @@ class GPUUtilizationTracker(Callback):
        assert gpu_available_in_session(), "[GPUUtilizationTracker] needs GPU!"

    def _before_epoch(self):
-       self._evt.set()
+       if self._enabled:
+           self._evt.set()

    def _after_epoch(self):
-       while self._evt.is_set():   # unlikely, unless the epoch is extremely fast
-           pass
-       self._evt.set()
+       if self._enabled:
+           while self._evt.is_set():   # unlikely, unless the epoch is extremely fast
+               pass
+           self._evt.set()

    def _trigger_epoch(self):
        # Don't do this in after_epoch because
        # before_epoch/after_epoch are supposed to be extremely fast by design.
+       if not self._enabled:
+           return
        try:
            stats = self._queue.get(timeout=60)
        except queue.Empty:
@@ -94,9 +113,10 @@ class GPUUtilizationTracker(Callback):
            self.trainer.monitors.put_scalar('GPUUtil/{}'.format(dev), stats[idx])

    def _after_train(self):
-       self._stop_evt.set()
-       self._evt.set()
-       self._proc.terminate()
+       if self._enabled:
+           self._stop_evt.set()
+           self._evt.set()
+           self._proc.terminate()

    @staticmethod
    def worker(evt, rst_queue, stop_evt, devices):
...
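For reference, a sketch of how this callback is attached after the change; the device ids are illustrative:

```python
from tensorpack.callbacks import GPUUtilizationTracker

GPUUtilizationTracker()             # devices=None: guessed in _setup_graph (now Horovod-aware)
GPUUtilizationTracker(devices=[0])  # monitor one specific physical GPU
```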
@@ -221,7 +221,7 @@ def collect_env_info():
    # Other important dependencies:
    try:
        import horovod
-       data.append(("horovod", horovod.__version__))
+       data.append(("Horovod", horovod.__version__))
    except ImportError:
        pass
...
@@ -12,7 +12,7 @@ from .varmanip import SessionUpdate, get_checkpoint_path, get_savename_from_varn
__all__ = ['SessionInit', 'ChainInit',
           'SaverRestore', 'SaverRestoreRelaxed', 'DictRestore',
-          'JustCurrentSession', 'get_model_loader']
+          'JustCurrentSession', 'get_model_loader', 'SmartRestore']

class SessionInit(object):
@@ -260,32 +260,52 @@ class ChainInit(SessionInit):
        i._run_init(sess)
-def get_model_loader(filename, ignore_mismatch=False):
+def SmartRestore(obj, ignore_mismatch=False):
    """
-   Get a corresponding model loader by looking at the file name.
+   Create a :class:`SessionInit` to be loaded to a session,
+   automatically from any supported objects, with some smart heuristics.
+   The object can be:
+
+   + A TF checkpoint
+   + A dict of numpy arrays
+   + A npz file
+   + An empty string or None
+   + A list of supported objects

    Args:
-       filename (str): either a tensorflow checkpoint, or a npz file.
+       obj: a supported object
        ignore_mismatch (bool): ignore failures when the value and the
            variable do not match in their shapes.
            If False, it will throw exception on such errors.
            If True, it will only print a warning.

    Returns:
-       SessInit: either a :class:`DictRestore` (if name ends with 'npy/npz') or
-       :class:`SaverRestore` (otherwise).
+       SessionInit:
    """
-   assert isinstance(filename, six.string_types), filename
-   filename = os.path.expanduser(filename)
-   if filename.endswith('.npy'):
-       assert tf.gfile.Exists(filename), filename
-       return DictRestore(np.load(filename, encoding='latin1').item(), ignore_mismatch=ignore_mismatch)
-   elif filename.endswith('.npz'):
-       assert tf.gfile.Exists(filename), filename
-       obj = np.load(filename)
-       return DictRestore(dict(obj), ignore_mismatch=ignore_mismatch)
-   else:
-       if ignore_mismatch:
-           return SaverRestoreRelaxed(filename)
-       else:
-           return SaverRestore(filename)
+   if not obj:
+       return JustCurrentSession()
+   if isinstance(obj, list):
+       return ChainInit([SmartRestore(x, ignore_mismatch=ignore_mismatch) for x in obj])
+   if isinstance(obj, six.string_types):
+       obj = os.path.expanduser(obj)
+       if obj.endswith(".npy") or obj.endswith(".npz"):
+           assert tf.gfile.Exists(obj), "File {} does not exist!".format(obj)
+           filename = obj
+           logger.info("Loading dictionary from {} ...".format(filename))
+           if filename.endswith('.npy'):
+               obj = np.load(filename, encoding='latin1').item()
+           elif filename.endswith('.npz'):
+               obj = dict(np.load(filename))
+       elif len(tf.gfile.Glob(obj + "*")):
+           # Assume to be a TF checkpoint.
+           # A TF checkpoint must be a prefix of an actual file.
+           return (SaverRestoreRelaxed if ignore_mismatch else SaverRestore)(obj)
+       else:
+           raise ValueError("Invalid argument to SmartRestore: " + obj)
+   if isinstance(obj, dict):
+       return DictRestore(obj, ignore_mismatch=ignore_mismatch)
+   raise ValueError("Invalid argument to SmartRestore: {}".format(type(obj)))
+
+
+get_model_loader = SmartRestore
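Since the old name is kept as an alias, existing code keeps working unchanged; a quick sketch:

```python
# after this commit, both names refer to the same helper
from tensorpack.tfutils.sessinit import SmartRestore, get_model_loader
assert get_model_loader is SmartRestore
```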
@@ -74,11 +74,11 @@ class QueueInputTrainer(SimpleTrainer):

class SyncMultiGPUTrainerParameterServer(SingleCostTrainer):
-   __doc__ = SyncMultiGPUParameterServerBuilder.__doc__
-
-   devices = None
-   """
-   List of GPU ids.
-   """
+   __doc__ = SyncMultiGPUParameterServerBuilder.__doc__ + """
+   Attributes:
+       devices (list[int]): List of GPU ids.
+   """

    @map_arg(gpus=_int_to_range)
@@ -117,11 +117,11 @@ def SyncMultiGPUTrainer(gpus):

class AsyncMultiGPUTrainer(SingleCostTrainer):
-   __doc__ = AsyncMultiGPUBuilder.__doc__
-
-   devices = None
-   """
-   List of GPU ids.
-   """
+   __doc__ = AsyncMultiGPUBuilder.__doc__ + """
+   Attributes:
+       devices (list[int]): List of GPU ids.
+   """

    @map_arg(gpus=_int_to_range)
@@ -146,21 +146,20 @@ class AsyncMultiGPUTrainer(SingleCostTrainer):

class SyncMultiGPUTrainerReplicated(SingleCostTrainer):
-   __doc__ = SyncMultiGPUReplicatedBuilder.__doc__
-
-   devices = None
-   """
-   List of GPU ids.
-   """
+   __doc__ = SyncMultiGPUReplicatedBuilder.__doc__ + """
+   Attributes:
+       devices (list[int]): List of GPU ids.
+
+       BROADCAST_EVERY_EPOCH (bool):
+           Whether to broadcast the variables every epoch.
+           Theoretically this is a no-op (because the variables
+           are supposed to be in-sync).
+           But this cheap operation may help prevent
+           certain numerical issues in practice.
+   """

    BROADCAST_EVERY_EPOCH = True
-   """
-   Whether to broadcast the variables every epoch.
-   Theoretically this is a no-op (because the variables
-   are supposed to be in-sync).
-   But this cheap operation may help prevent
-   certain numerical issues in practice.
-   """

    @map_arg(gpus=_int_to_range)
    def __init__(self, gpus, average=True, mode=None):
@@ -338,6 +337,10 @@ class HorovodTrainer(SingleCostTrainer):
        # If using all GPUs, you can always skip the `CUDA_VISIBLE_DEVICES` option.
        # There are other MPI options that can potentially improve performance especially on special hardware.

+   Horovod can also be launched without MPI. See
+   `its documentation <https://github.com/horovod/horovod#running-horovod>`_
+   for more details.
+
    Note:
        1. To reach the maximum speed in your system, there are many options to tune
           for Horovod installation and in the MPI command line.
@@ -348,9 +351,10 @@ class HorovodTrainer(SingleCostTrainer):
           must be avoided.
           You can, however, use `tf.config.experimental.list_physical_devices('GPU')`, introduced in TF 1.14.

-       2. MPI does not like `fork()`. If your dataflow contains multiprocessing, it may cause problems.
-       3. MPI sometimes fails to kill all processes in the end. Be sure to check it afterwards.
+       3. Horovod supports both MPI and gloo. There are a few drawbacks of the MPI backend:
+
+          + MPI does not like `fork()`. If your code (e.g. dataflow) contains multiprocessing, it may cause problems.
+          + MPI sometimes fails to kill all processes in the end. Be sure to check it afterwards.

        4. Keep in mind that there is one process running the script per GPU, therefore:
@@ -364,7 +368,8 @@ class HorovodTrainer(SingleCostTrainer):
        + Callbacks have an option to be run only in the chief process, or in all processes.
          See :meth:`Callback.set_chief_only()`. Most callbacks have a reasonable
-         default already, but certain callbacks may not behave properly by default. Report an issue if you find any.
+         default already, but certain callbacks may need your customization.
+         Report an issue if you find any bad defaults.
        + You can use the Horovod API such as `hvd.rank()` to know which process you are, and choose
          a different code path. The chief process has rank 0.
@@ -373,7 +378,18 @@ class HorovodTrainer(SingleCostTrainer):
    `ResNet-Horovod <https://github.com/tensorpack/benchmarks/tree/master/ResNet-Horovod>`_
    for a full example which has handled these common issues.
    This example can train ImageNet in roughly an hour following the paper's setup.
+
+   Attributes:
+       BROADCAST_EVERY_EPOCH (bool):
+           Whether to broadcast the variables every epoch.
+           Theoretically this is a no-op (because the variables
+           are supposed to be in-sync).
+           But this cheap operation may help prevent
+           certain numerical issues in practice.
    """

+   BROADCAST_EVERY_EPOCH = True
+
    def __init__(self, average=True, compression=None):
        """
        Args:
@@ -399,6 +415,16 @@ class HorovodTrainer(SingleCostTrainer):
        logger.info("[HorovodTrainer] local rank={}".format(self._local_rank))
        super(HorovodTrainer, self).__init__()

+   def mpi_enabled(self):
+       """
+       Returns:
+           bool: whether hvd is currently running under MPI
+       """
+       try:
+           return self.hvd.mpi_enabled()
+       except AttributeError:
+           return False
+
    def allreduce(self, grads):
        if self.hvd.size() == 1:
            return grads
@@ -424,7 +450,10 @@ class HorovodTrainer(SingleCostTrainer):
        opt = get_opt_fn()
        self.train_op = opt.apply_gradients(grads, name='train_op')

-       cb = CallbackFactory(before_train=self.broadcast, trigger=self.broadcast).set_chief_only(False)
+       cb = CallbackFactory(
+           before_train=self.broadcast,
+           trigger=self.broadcast if self.BROADCAST_EVERY_EPOCH else None
+       ).set_chief_only(False)
        return [cb]

    def broadcast(self, _):
@@ -502,3 +531,10 @@ class BytePSTrainer(HorovodTrainer):
        self._has_compression = False
        logger.info("[BytePSTrainer] local rank={}".format(self._local_rank))
        SingleCostTrainer.__init__(self)
+
+   def mpi_enabled(self):
+       """
+       Returns:
+           bool: whether hvd is currently running under MPI
+       """
+       return False
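A sketch of using the knobs this commit adds. It assumes Horovod is installed and the script is launched with one process per GPU; `BROADCAST_EVERY_EPOCH` and `mpi_enabled()` are the additions shown in the diffs above:

```python
from tensorpack.train import HorovodTrainer

HorovodTrainer.BROADCAST_EVERY_EPOCH = False  # skip the normally redundant per-epoch broadcast
trainer = HorovodTrainer(average=False)
if trainer.mpi_enabled():                     # new in this commit
    print("Horovod is running on the MPI backend")
```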