Shashank Suhas / seminar-breakout

Commit c667b1de authored Dec 22, 2018 by Yuxin Wu
CheckNumerics Callback

parent ac9ac2a4
Showing 4 changed files with 24 additions and 7 deletions.
examples/ImageNetModels/README.md (+3, -3)
tensorpack/callbacks/graph.py (+15, -1)
tensorpack/models/regularize.py (+4, -1)
tensorpack/train/trainers.py (+2, -2)
examples/ImageNetModels/README.md

@@ -42,13 +42,13 @@ See `./alexnet.py --help` for usage.
 ### VGG16
 This VGG16 script, when trained with 8 GPUs and 32 batch size per GPU, reaches the following
-validation error after 100 epochs (30h with 8 P100s). This is the code for the VGG
+validation error after 100 epochs (30h with 8 P100s). This reproduces the VGG
 experiments in the paper [Group Normalization](https://arxiv.org/abs/1803.08494).
 See `./vgg16.py --help` for usage.
 | No Normalization | Batch Normalization | Group Normalization |
-|:------------------------------------------|---------------------|--------------------:|
-| 29~30% (large variation with random seed) | 28% | 27.6% |
+|:------------------------------------------|:-------------------:|:-------------------:|
+| 29~30% (large variation with random seed) | 28% | 27.6% |
 Note that the purpose of this experiment in the paper is not to claim GroupNorm
 has better performance than BatchNorm.
tensorpack/callbacks/graph.py

@@ -14,7 +14,7 @@ from ..utils import logger
 from .base import Callback
 
 __all__ = ['RunOp', 'RunUpdateOps', 'ProcessTensors', 'DumpTensors',
-           'DumpTensor', 'DumpTensorAsImage', 'DumpParamAsImage']
+           'DumpTensor', 'DumpTensorAsImage', 'DumpParamAsImage', 'CheckNumerics']
 
 
 class RunOp(Callback):

@@ -213,6 +213,20 @@ class DumpTensorAsImage(Callback):
         cv2.imwrite(fname, res.astype('uint8'))
 
 
+class CheckNumerics(Callback):
+    """
+    When triggered, check variables in the graph for NaN and Inf.
+    Raise exceptions if such an error is found.
+    """
+    def _setup_graph(self):
+        vars = tf.trainable_variables()
+        ops = [tf.check_numerics(v, "CheckNumerics['{}']".format(v.op.name)).op for v in vars]
+        self._check_op = tf.group(*ops)
+
+    def _trigger(self):
+        self._check_op.run()
+
+
 try:
     import cv2
 except ImportError:
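The `CheckNumerics` callback added in this commit builds one `tf.check_numerics` op per trainable variable in `_setup_graph`, groups them with `tf.group`, and runs the group in `_trigger`, so any NaN or Inf raises an error tagged with the variable's name. The sketch below mirrors that setup/trigger split without TensorFlow. It assumes variables are exposed as a hypothetical dict mapping names to flat lists of floats; `check_numerics` and `NumericsCheckCallback` are illustrative names, not tensorpack API.

```python
import math
from typing import Callable, Dict, List

# Hypothetical stand-in for the graph: variable name -> flat list of values.
Variables = Dict[str, List[float]]


def check_numerics(variables: Variables) -> None:
    """Raise ValueError if any variable contains NaN or Inf.

    Plays the role of one tf.check_numerics op per variable, reusing the
    "CheckNumerics['<name>']" prefix so the error names the bad variable.
    """
    for name, values in variables.items():
        for x in values:
            if math.isnan(x) or math.isinf(x):
                raise ValueError("CheckNumerics['{}']: found {}".format(name, x))


class NumericsCheckCallback:
    """Hypothetical analogue of the callback: set up once, check on every trigger."""

    def __init__(self, get_variables: Callable[[], Variables]):
        # Like _setup_graph: remember how to fetch the variables to check.
        self._get_variables = get_variables

    def trigger(self) -> None:
        # Like _trigger: run the grouped check against the current values.
        check_numerics(self._get_variables())
```

Because each trigger inspects every trainable variable, the cost of the check grows with model size and with how often the callback is triggered.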
tensorpack/models/regularize.py

@@ -167,4 +167,7 @@ def Dropout(x, *args, **kwargs):
     if kwargs.get('training', None) is None:
         kwargs['training'] = get_current_tower_context().is_training
-    return tf.layers.dropout(x, rate=rate, **kwargs)
+    if get_tf_version_tuple() <= (1, 12):
+        return tf.layers.dropout(x, rate=rate, **kwargs)
+    else:
+        return tf.nn.dropout(x, rate=rate if kwargs['training'] else 0.)
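The `Dropout` change keeps `tf.layers.dropout` for TensorFlow 1.12 and older and switches to `tf.nn.dropout` otherwise. Since `tf.nn.dropout` takes no `training` argument, the new code passes `rate=0.` outside training, which makes dropout the identity. Below is a framework-free sketch of that dispatch, using plain Python lists in place of tensors. `version_tuple` is a hypothetical helper that keeps only the major and minor version numbers (as the comparison against `(1, 12)` suggests), and `dropout` is an illustrative inverted-dropout function, not the TensorFlow implementation.

```python
import random
from typing import List, Tuple


def version_tuple(version: str) -> Tuple[int, int]:
    """Hypothetical helper: '1.13.1' -> (1, 13)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def uses_legacy_layer(tf_version: str) -> bool:
    """Mirror of `get_tf_version_tuple() <= (1, 12)` in the diff."""
    return version_tuple(tf_version) <= (1, 12)


def dropout(x: List[float], rate: float, training: bool,
            rng: random.Random) -> List[float]:
    """Inverted dropout: drop each value with probability `rate` and scale
    survivors by 1 / (1 - rate). An effective rate of 0 returns x unchanged."""
    effective_rate = rate if training else 0.0   # same trick as the new code path
    if effective_rate == 0.0:
        return list(x)
    keep_prob = 1.0 - effective_rate
    return [v / keep_prob if rng.random() >= effective_rate else 0.0 for v in x]
```

Folding `training` into the rate keeps a single call site for the newer API instead of branching on the training flag.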
tensorpack/train/trainers.py

@@ -341,14 +341,14 @@ class HorovodTrainer(SingleCostTrainer):
           + Make sure your InputSource has reasonable randomness.
-          + If your data processing is heavy, doing it in a separate dedicated process might be
+          + If your data processing is heavy, doing it in a single dedicated process might be
             a better choice than doing them repeatedly in each process.
           + You need to make sure log directories in each process won't conflict.
             You can set it only for the chief process, or set a different one for each process.
           + Callbacks have an option to be run only in the chief process, or in all processes.
-            See :meth:`callback.set_chief_only()`. Most callbacks have a reasonable
+            See :meth:`Callback.set_chief_only()`. Most callbacks have a reasonable
             default already, but certain callbacks may not behave properly by default. Report an issue if you find any.
           + You can use Horovod API such as `hvd.rank()` to know which process you are and choose
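The docstring leaves per-process log directories to the user: set one only in the chief process, or give each process its own so they cannot conflict. A minimal sketch of that choice, assuming the caller passes in the process rank (for example from `hvd.rank()`) and treats rank 0 as the chief, as is conventional with Horovod. `choose_log_dir` and the `rank<N>` subdirectory naming are hypothetical.

```python
import os
from typing import Optional


def choose_log_dir(base_dir: str, rank: int, chief_only: bool = True) -> Optional[str]:
    """Pick a log directory for the process with the given rank.

    chief_only=True:  only rank 0 logs (to base_dir); other ranks get None.
    chief_only=False: every rank logs to its own subdirectory, so no two
                      processes ever write to the same place.
    """
    if chief_only:
        return base_dir if rank == 0 else None
    return os.path.join(base_dir, "rank{}".format(rank))
```

A process that receives `None` would simply skip setting a log directory.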