Shashank Suhas / seminar-breakout

Commit bed3fa19, authored Nov 23, 2017 by Yuxin Wu
Support horovod distributed training (#422)
parent 9c2e2226
Showing 2 changed files, with 22 additions and 3 deletions:
  docs/tutorial/intro.rst        +1   -1
  tensorpack/train/trainers.py   +21  -2
docs/tutorial/intro.rst  (view file @ bed3fa19)

@@ -15,7 +15,7 @@ which as a result makes people think TensorFlow is slow.
 Tensorpack uses TensorFlow efficiently, and hides these details under its APIs.
 You no longer need to learn about
-multi-GPU model replication, variables synchronization, queues, tf.data
+multi-GPU model replication, device placement, variables synchronization, queues
 -- anything that's unrelated to the model itself.
 You still need to learn to write models with TF, but everything else is taken care of by tensorpack, in the efficient way.

 A High Level Glance
tensorpack/train/trainers.py  (view file @ bed3fa19)

@@ -212,9 +212,28 @@ class DistributedTrainerReplicated(SingleCostTrainer):
 class HorovodTrainer(SingleCostTrainer):
     """
-    Horovod trainer, currently support multi-GPU training.
+    Horovod trainer, support multi-GPU and distributed training.
+
+    It will use the first k GPUs in CUDA_VISIBLE_DEVICES.
+
+    To use for multi-GPU training:
+        CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py
+
+    To use for distributed training:
+        /path/to/mpirun -np 8 -H server1:4,server2:4 \
+            -bind-to none -map-by slot \
+            --output-filename mylog -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 \
+            python train.py
+
+    Note:
+        1. If using all GPUs, you can always skip the `CUDA_VISIBLE_DEVICES` option.
+        2. About performance, horovod is expected to be slightly
+           slower than native tensorflow on multi-GPU training, but faster in distributed training.
+        3. Due to the use of MPI, training is less informative (no progress bar).
+           It's recommended to use other multi-GPU trainers for single-node
+           experiments, and scale to multi nodes by horovod.
     """
     def __init__(self):
         hvd.init()
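
For context, here is a minimal sketch (not part of this commit) of how the new trainer would typically be plugged into a tensorpack training script and then launched with the mpirun commands quoted in the docstring above. It assumes the usual launch_train_with_config() entry point; MyModel and my_dataflow() are hypothetical placeholders for a user-defined ModelDesc and DataFlow.

    # Minimal sketch, not from this commit: selecting HorovodTrainer in a
    # tensorpack training script. MyModel and my_dataflow() are hypothetical
    # placeholders for user code.
    from tensorpack import TrainConfig, launch_train_with_config
    from tensorpack.train import HorovodTrainer

    from mycode import MyModel, my_dataflow   # hypothetical user module

    if __name__ == '__main__':
        config = TrainConfig(
            model=MyModel(),          # a tensorpack ModelDesc
            dataflow=my_dataflow(),   # a tensorpack DataFlow
            max_epoch=100,
        )
        # HorovodTrainer calls hvd.init() in its constructor; mpirun starts one
        # such process per GPU, e.g. (from the docstring above):
        #   CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py
        launch_train_with_config(config, HorovodTrainer())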