Shashank Suhas / seminar-breakout

Commit a12872dc, authored Sep 03, 2020 by Yuxin Wu

    remove extra kwargs in TrainConfig

Parent: 229e991a

Showing 3 changed files with 9 additions and 15 deletions (+9 -15):

    tensorpack/callbacks/group.py    +1  -1
    tensorpack/train/config.py       +1  -2
    tensorpack/train/trainers.py     +7 -12
tensorpack/callbacks/group.py

@@ -33,7 +33,7 @@ class CallbackTimeLogger(object):
     def log(self):
         """ log the time of some heavy callbacks """
-        if self.tot < 3:
+        if self.tot < 2:
             return
         msgs = []
         for name, t in self.times:
...
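The hunk above lowers the threshold below which callback timings are not logged, from 3 seconds to 2 seconds of total callback time. A minimal, self-contained sketch of that logic (illustrative only, not the actual tensorpack class; the message format here is made up):

```python
class CallbackTimeLogger:
    """Accumulates per-callback run times; log() reports a summary only
    when callbacks were expensive enough overall to be worth mentioning."""

    TOTAL_THRESHOLD = 2.0  # seconds; this commit lowers it from 3 to 2

    def __init__(self):
        self.times = []  # list of (callback name, seconds)
        self.tot = 0.0

    def add(self, name, seconds):
        self.times.append((name, seconds))
        self.tot += seconds

    def log(self):
        # Cheap callbacks are not worth a log line.
        if self.tot < self.TOTAL_THRESHOLD:
            return None
        msgs = ["{}: {:.3f}s".format(name, t) for name, t in self.times]
        return "Callbacks took {:.3f}s in total. {}".format(
            self.tot, "; ".join(msgs))
```

With the lower threshold, a run whose callbacks total 2.5s now produces a summary, where before the change it would have stayed silent.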
tensorpack/train/config.py

@@ -61,8 +61,7 @@ class TrainConfig(object):
                  model=None,
                  callbacks=None, extra_callbacks=None, monitors=None,
                  session_creator=None, session_config=None, session_init=None,
-                 starting_epoch=1, steps_per_epoch=None, max_epoch=99999,
-                 **kwargs):
+                 starting_epoch=1, steps_per_epoch=None, max_epoch=99999):
         """
         Args:
             dataflow (DataFlow):
...
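Dropping `**kwargs` from a constructor signature, as this hunk does for `TrainConfig`, changes how unknown keyword arguments fail. A minimal sketch (generic Python, not tensorpack itself; class names are illustrative):

```python
class ConfigWithKwargs:
    """Permissive: any extra keyword argument is silently swallowed."""
    def __init__(self, steps_per_epoch=None, max_epoch=99999, **kwargs):
        self.steps_per_epoch = steps_per_epoch
        self.max_epoch = max_epoch
        self.extra = kwargs  # typos such as max_epochs=10 land here unnoticed

class ConfigStrict:
    """Strict: the signature is the contract, typos fail fast."""
    def __init__(self, steps_per_epoch=None, max_epoch=99999):
        self.steps_per_epoch = steps_per_epoch
        self.max_epoch = max_epoch

# With **kwargs, a misspelled argument has no effect and no error:
lenient = ConfigWithKwargs(max_epochs=10)
assert lenient.max_epoch == 99999       # the typo was ignored

# Without **kwargs, the same typo raises TypeError immediately:
try:
    ConfigStrict(max_epochs=10)
except TypeError:
    print("typo rejected")
```

Failing fast at construction time is generally preferable for a config object, since a silently ignored option only surfaces much later as wrong training behavior.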
tensorpack/train/trainers.py

@@ -344,28 +344,19 @@ class HorovodTrainer(SingleCostTrainer):
         .. code-block:: bash

             # First, change trainer to HorovodTrainer(), then
-            CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO mpirun -np 4 --output-filename mylog python train.py
+            CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO horovodrun -np 4 --output-filename mylog python train.py

     To use for distributed training:

         .. code-block:: bash

             # First, change trainer to HorovodTrainer(), then
-            mpirun -np 8 -H server1:4,server2:4 \\
-                -bind-to none -map-by slot \\
-                --output-filename mylog -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \\
+            horovodrun -np 8 -H server1:4,server2:4 --output-filename mylog \\
                 python train.py
-            # Add other environment variables you need by -x, e.g. PYTHONPATH, PATH.
-            # If using all GPUs, you can always skip the `CUDA_VISIBLE_DEVICES` option.
-            # There are other MPI options that can potentially improve performance especially on special hardwares.

+    Horovod can also be launched without MPI. See
+    `its documentation <https://github.com/horovod/horovod#running-horovod>`_
+    for more details.

     Note:
         1. To reach the maximum speed in your system, there are many options to tune
-           for Horovod installation and in the MPI command line.
+           in Horovod installation, horovodrun arguments, and in the MPI command line.
            See Horovod docs for details.

         2. Due to a TF bug (#8136), you must not initialize CUDA context before the trainer starts training.
...

@@ -378,6 +369,10 @@ class HorovodTrainer(SingleCostTrainer):
           + MPI does not like `fork()`. If your code (e.g. dataflow) contains multiprocessing, it may cause problems.
           + MPI sometimes fails to kill all processes in the end. Be sure to check it afterwards.

+          The gloo backend is recommended though it may come with very minor slow down.
+          To use gloo backend, see
+          `horovod documentation <https://github.com/horovod/horovod#running-horovod>`_ for more details.
+
        4. Keep in mind that there is one process running the script per GPU, therefore:

           + Make sure your InputSource has reasonable randomness.
...
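The docstring's point 4 warns that Horovod runs one process per GPU, so each worker's InputSource needs its own randomness; otherwise every worker trains on identical, identically ordered data. A framework-free sketch of the usual fix (names are illustrative, not tensorpack's API): shard the dataset by worker rank and seed the shuffle per rank and epoch.

```python
import random

def shard_for_worker(items, rank, size, epoch):
    """Return this worker's disjoint shard of `items`, shuffled with a
    rank- and epoch-dependent seed so no two workers see the same order."""
    rng = random.Random((epoch + 1) * 1000003 + rank)  # arbitrary mixing
    shard = [x for i, x in enumerate(items) if i % size == rank]
    rng.shuffle(shard)
    return shard

data = list(range(8))
w0 = shard_for_worker(data, rank=0, size=2, epoch=0)
w1 = shard_for_worker(data, rank=1, size=2, epoch=0)
assert set(w0) | set(w1) == set(data)  # shards cover the dataset
assert set(w0).isdisjoint(w1)          # and do not overlap
```

Libraries such as Horovod expose the rank (`hvd.rank()`) and world size (`hvd.size()`) for exactly this purpose; the mixing constant above is just an illustrative way to derive distinct per-worker seeds.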