Shashank Suhas / seminar-breakout / Commits

Commit d7a13cb7, authored Apr 26, 2019 by Yuxin Wu

    Use experimental.list_physical_devices to avoid side effects

parent 88ce4a90

Showing 4 changed files with 39 additions and 12 deletions (+39 −12)
docs/tutorial/performance-tuning.md  (+5 −4)
tensorpack/libinfo.py                (+21 −3)
tensorpack/train/trainers.py         (+1 −0)
tensorpack/utils/gpu.py              (+12 −5)
docs/tutorial/performance-tuning.md

@@ -2,15 +2,16 @@
 # Performance Tuning
 __We do not know why your training is slow__
-(and most of the times it's not due to issues in tensorpack).
+(and most of the times it's not due to issues in tensorpack),
+unless we can reproduce the slowness with your instructions.
 Tensorpack is designed to be high-performance, as can be seen in the
 [benchmarks](https://github.com/tensorpack/benchmarks).
 But performance is different across machines and tasks,
 and it requires knowledge of the entire stack to understand what might be wrong.
-Therefore when you have a performance issue,
-it's not easy to let others understand what goes wrong without doing some investigations by
-your own.
-Tensorpack has some tools to make it easier to understand the performance.
+If you need help from others to understand a performance issue you saw, you have to either
+allow others to reproduce your slowness, or do some investigations on
+your own.
+Tensorpack has some tools to make it easier to investigate the performance.
 Here we provide a list of things you can do to understand why your training is slow.
 If you ask for help to understand and improve the speed, PLEASE do the
...
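The tutorial's advice amounts to isolating the bottleneck before asking for help. As a minimal, tensorpack-agnostic sketch of one such investigation (the `benchmark_iterator` helper and the dummy generator are illustrative, not tensorpack APIs), one can time the input pipeline alone to rule data loading in or out as the slow part:

```python
import time

def benchmark_iterator(it, warmup=10, steps=100):
    """Measure throughput of any iterator, in items per second."""
    for _ in range(warmup):          # let caches/queues fill before timing
        next(it)
    start = time.perf_counter()
    for _ in range(steps):
        next(it)
    elapsed = time.perf_counter() - start
    return steps / elapsed

# A dummy generator stands in for a real dataflow here:
dummy = iter(range(10 ** 9))
print("%.0f items/s" % benchmark_iterator(dummy))
```

If the input pipeline alone is already slower than the desired training speed, the GPU side is not the problem.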
tensorpack/libinfo.py

@@ -54,14 +54,32 @@ try:
     _version = tf.__version__.split('.')
     assert (int(_version[0]), int(_version[1])) >= (1, 3), "TF>=1.3 is required!"
     _HAS_TF = True
-
+except ImportError:
+    print("Failed to import tensorflow.")
+    _HAS_TF = False
+else:
     # Install stacktrace handler
     try:
         from tensorflow.python.framework import test_util
         test_util.InstallStackTraceHandler()
     except Exception:
         pass
-except ImportError:
-    print("Failed to import tensorflow.")
-    _HAS_TF = False
+
+    # Monkey-patch tf.test.is_gpu_available to avoid side effects:
+    # https://github.com/tensorflow/tensorflow/issues/26460
+    try:
+        list_dev = tf.config.experimental.list_physical_devices
+    except AttributeError:
+        pass
+    else:
+        old_is_gpu_available = tf.test.is_gpu_available
+
+        def is_gpu_available(*args, **kwargs):
+            if len(args) == 0 and len(kwargs) == 0:
+                return len(list_dev('GPU')) > 0
+            return old_is_gpu_available(*args, **kwargs)
+
+        tf.test.is_gpu_available = is_gpu_available

 # These lines will be programmatically read/write by setup.py
...
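The patch above follows a general pattern: wrap an existing function so that the common no-argument call takes a cheaper, side-effect-free path, while any other call falls through to the original. A self-contained sketch of the same pattern, with stand-ins replacing TensorFlow (`fake_test` plays the role of `tf.test`, and `list_physical_devices` the role of `tf.config.experimental.list_physical_devices`; all names here are illustrative):

```python
import types

fake_test = types.SimpleNamespace()

def _original_is_gpu_available(cuda_only=False):
    # Pretend this is the expensive original that initializes CUDA.
    raise RuntimeError("side effect: CUDA context would be initialized")

fake_test.is_gpu_available = _original_is_gpu_available

def list_physical_devices(device_type):
    # Stand-in for the side-effect-free device query; pretend no GPUs.
    return []

# The patch: intercept only the bare no-argument call.
_old = fake_test.is_gpu_available

def is_gpu_available(*args, **kwargs):
    if not args and not kwargs:
        return len(list_physical_devices('GPU')) > 0
    return _old(*args, **kwargs)   # anything else hits the original

fake_test.is_gpu_available = is_gpu_available

print(fake_test.is_gpu_available())  # False, without the side effect
```

Keeping a reference to the original (`old_is_gpu_available` in the commit, `_old` here) preserves the full signature for callers who pass arguments.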
tensorpack/train/trainers.py

@@ -340,6 +340,7 @@ class HorovodTrainer(SingleCostTrainer):
     2. Due to a TF bug (#8136), you must not initialize CUDA context before the trainer starts training.
        Therefore TF functions like `is_gpu_available()` or `list_local_devices()`
        must be avoided.
+       You can, however, use `tf.config.experimental.list_physical_devices('GPU')`, introduced in TF 1.14.
     2. MPI does not like `fork()`. If your dataflow contains multiprocessing, it may cause problems.
...
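A hypothetical pre-flight check following the new docstring advice: count visible GPUs before launching the trainer, without touching the CUDA context. The `except` clause is a sketch of how one might degrade gracefully when TF is absent or predates 1.14:

```python
try:
    import tensorflow as tf
    # Available since TF 1.14; does not initialize a CUDA context,
    # unlike tf.test.is_gpu_available() or device_lib.list_local_devices().
    n_gpus = len(tf.config.experimental.list_physical_devices('GPU'))
except (ImportError, AttributeError):
    n_gpus = 0  # TF missing or too old; assume no visible GPUs

print("GPUs visible: %d" % n_gpus)
```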
tensorpack/utils/gpu.py

@@ -56,12 +56,19 @@ def get_num_gpu():
         return warn_return(ctx.num_devices(), "NVML found nvidia devices. ")
     except Exception:
         # Fallback
-        # Note this will initialize all GPUs and therefore has side effect
-        # https://github.com/tensorflow/tensorflow/issues/8136
         logger.info("Loading local devices by TensorFlow ...")
-        from tensorflow.python.client import device_lib
-        local_device_protos = device_lib.list_local_devices()
-        return len([x.name for x in local_device_protos if x.device_type == 'GPU'])
+        try:
+            import tensorflow as tf
+            # available since TF 1.14
+            gpu_devices = tf.config.experimental.list_physical_devices('GPU')
+        except AttributeError:
+            from tensorflow.python.client import device_lib
+            local_device_protos = device_lib.list_local_devices()
+            # Note this will initialize all GPUs and therefore has side effect
+            # https://github.com/tensorflow/tensorflow/issues/8136
+            gpu_devices = [x.name for x in local_device_protos if x.device_type == 'GPU']
+        return len(gpu_devices)

 get_nr_gpu = get_num_gpu
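The new fallback is plain feature detection: try the newer, side-effect-free API first and catch `AttributeError` on older TF versions. A runnable sketch of that control flow, with stand-in classes (`OldTF`, `NewTF` are illustrative mocks, not real TF objects):

```python
class OldTF:
    """Mimics TF < 1.14: tf.config.experimental.list_physical_devices is absent."""

class NewTF:
    """Mimics TF >= 1.14 with two visible GPUs."""
    class config:
        class experimental:
            @staticmethod
            def list_physical_devices(device_type):
                return ["GPU:0", "GPU:1"] if device_type == "GPU" else []

def count_gpus(tf):
    try:
        devices = tf.config.experimental.list_physical_devices("GPU")
    except AttributeError:
        # Here the real code falls back to device_lib.list_local_devices(),
        # which initializes every GPU (TF issue #8136); we just return none.
        devices = []
    return len(devices)

print(count_gpus(NewTF))  # 2
print(count_gpus(OldTF))  # 0
```

Probing with `try/except AttributeError` rather than parsing `tf.__version__` also works on forks and nightlies where version strings are unreliable.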