Commit f63e0ee4 authored Sep 01, 2018 by Yuxin Wu

update docs; Mask R-CNN horovod mode eval only on master machine

parent 7b8728f9
Showing 10 changed files with 61 additions and 45 deletions (+61 -45)

.github/ISSUE_TEMPLATE.md +9 -3
.github/ISSUE_TEMPLATE/feature-requests.md +2 -1
CHANGES.md +1 -1
examples/FasterRCNN/NOTES.md +2 -3
examples/FasterRCNN/README.md +2 -2
examples/FasterRCNN/config.py +4 -0
examples/FasterRCNN/train.py +29 -25
tensorpack/dataflow/parallel.py +8 -6
tensorpack/libinfo.py +3 -3
tensorpack/utils/serialize.py +1 -1
.github/ISSUE_TEMPLATE.md
+## DO NOT post an issue if you're seeing this. You're at the wrong place.
+To post an issue, please:
+1. Click the "New Issue" button
+2. __Choose your category__!
+3. __Read instructions there__!
 An issue has to be one of the following:
 - Unexpected Problems / Potential Bugs
 - Feature Requests
 - Questions on Using/Understanding Tensorpack
-To post an issue, please click "New Issue", choose your category, and read
-instructions there.
.github/ISSUE_TEMPLATE/feature-requests.md
@@ -7,8 +7,9 @@ about: Suggest an idea for Tensorpack
 + Note that you can implement a lot of features by extending Tensorpack
   (See http://tensorpack.readthedocs.io/en/latest/tutorial/index.html#extend-tensorpack).
   It does not have to be added to Tensorpack unless you have a good reason.
 + "Could you improve/implement an example/paper ?"
   -- The answer is: we have no plans to do so. We don't consider feature
   requests for examples or implement a paper for you, unless it demonstrates
   some Tensorpack features not yet demonstrated in the existing examples.
-  If you don't know how to do it, you may ask a usage question.
+  If you don't know how to do something yourself, you may ask a usage question.
CHANGES.md
@@ -11,7 +11,7 @@ TensorFlow itself also changes API and those are not listed here.
 + [2018/08/27] msgpack is used again for "serialization to disk", because pyarrow
   has no compatibility between versions. To use pyarrow instead, `export TENSORPACK_COMPATIBLE_SERIALIZE=pyarrow`.
 + [2018/04/05] msgpack is replaced by pyarrow in favor of its speed. If you want old behavior,
-  `export TENSORPACK_SERIALIZE=msgpack`.
+  `export TENSORPACK_SERIALIZE=msgpack`. It's later found that pyarrow is unstable and may lead to crash.
 + [2018/03/20] `ModelDesc` starts to use simplified interfaces:
   + `_get_inputs()` renamed to `inputs()` and returns `tf.placeholder`s.
   + `build_graph(self, tensor1, tensor2)` returns the cost tensor directly.
examples/FasterRCNN/NOTES.md
@@ -46,11 +46,10 @@ Model:
 Speed:
-1. The training will start very slowly due to convolution warmup, until about
+1. If cudnn warmup is on, the training will start very slowly, until about
    10k steps (or more if scale augmentation is used) to reach a maximum speed.
    As a result, the ETA is also inaccurate at the beginning.
-   You can disable warmup by `export TF_CUDNN_USE_AUTOTUNE=0`, which makes the
-   training faster at the beginning, but perhaps not in the end.
+   Warmup is by default on when no scale augmentation is used.
 1. After warmup, the training speed will slowly decrease due to more accurate proposals.
examples/FasterRCNN/README.md
-# Faster-RCNN / Mask-RCNN on COCO
+# Faster R-CNN / Mask R-CNN on COCO
 This example provides a minimal (2k lines) and faithful implementation of the following papers:
 + [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
@@ -73,7 +73,7 @@ prediction will need to be run with the corresponding training configs.
 These models are trained with different configurations on trainval35k and evaluated on minival using mAP@IoU=0.50:0.95.
 Performance in [Detectron](https://github.com/facebookresearch/Detectron/) can be roughly reproduced.
-MaskRCNN results contain both box and mask mAP.
+Mask R-CNN results contain both box and mask mAP.
 | Backbone | mAP<br/>(box;mask) | Detectron mAP <sup>[1](#ft1)</sup><br/>(box;mask) | Time on 8 V100s | Configurations <br/> (click to expand) |
 | - | - | - | - | - |
examples/FasterRCNN/config.py
@@ -215,6 +215,10 @@ def finalize_configs(is_training):
         assert len(_C.CASCADE.BBOX_REG_WEIGHTS) == num_cascade

     if is_training:
+        train_scales = _C.PREPROC.TRAIN_SHORT_EDGE_SIZE
+        if train_scales[1] - train_scales[0] > 100:
+            # don't warmup if augmentation is on
+            os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'
         os.environ['TF_AUTOTUNE_THRESHOLD'] = '1'
         assert _C.TRAINER in ['horovod', 'replicated'], _C.TRAINER
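The check added to `finalize_configs` can be read in isolation. The following is a minimal sketch, not the repository's code: the helper name `maybe_disable_cudnn_autotune` and the `environ` and `threshold` parameters are hypothetical, and it assumes `train_scales` is a `(min, max)` pair of short-edge sizes like `_C.PREPROC.TRAIN_SHORT_EDGE_SIZE`.

```python
import os


def maybe_disable_cudnn_autotune(train_scales, environ=None, threshold=100):
    """Hypothetical helper mirroring the check in the config.py hunk.

    train_scales: (min, max) short-edge sizes used for scale augmentation.
    environ: mapping to modify; defaults to os.environ.
    Returns True if autotune (warmup) was disabled.
    """
    if environ is None:
        environ = os.environ
    smallest, largest = train_scales
    if largest - smallest > threshold:
        # The commit's comment: "don't warmup if augmentation is on".
        environ['TF_CUDNN_USE_AUTOTUNE'] = '0'
        return True
    return False
```

Passing a plain dict as `environ` lets the logic be tried without touching the real process environment.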
examples/FasterRCNN/train.py
@@ -24,7 +24,6 @@ from tensorpack import *
 from tensorpack.tfutils.summary import add_moving_summary
 from tensorpack.tfutils import optimizer
 from tensorpack.tfutils.common import get_tf_version_tuple
-from tensorpack.utils.serialize import loads, dumps
 import tensorpack.utils.viz as tpviz

 from coco import COCODetection
@@ -417,16 +416,14 @@ class EvalCallback(Callback):
             self.dataflows = [get_eval_dataflow(shard=k, num_shards=self.num_predictor)
                               for k in range(self.num_predictor)]
         else:
-            if hvd.size() > hvd.local_size():
-                logger.warn("Distributed evaluation with horovod is unstable. Sometimes MPI hangs for unknown reasons.")
-            self.predictor = self._build_coco_predictor(0)
-            self.dataflow = get_eval_dataflow(shard=hvd.rank(), num_shards=hvd.size())
+            # Only eval on the first machine.
+            # Alternatively, can eval on all ranks and use allgather, but allgather sometimes hangs
+            self._horovod_run_eval = hvd.rank() == hvd.local_rank()
+            if self._horovod_run_eval:
+                self.predictor = self._build_coco_predictor(0)
+                self.dataflow = get_eval_dataflow(shard=hvd.local_rank(), num_shards=hvd.local_size())

-            # use uint8 to aggregate strings
-            self.local_result_tensor = tf.placeholder(tf.uint8, shape=[None], name='local_result_string')
-            self.concat_results = hvd.allgather(self.local_result_tensor, name='concat_results')
-            local_size = tf.expand_dims(tf.size(self.local_result_tensor), 0)
-            self.string_lens = hvd.allgather(local_size, name='concat_sizes')
+            self.barrier = hvd.allreduce(tf.random_normal(shape=[1]))

     def _build_coco_predictor(self, idx):
         graph_func = self.trainer.get_predictor(self._in_names, self._out_names, device=idx)
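To see why `hvd.rank() == hvd.local_rank()` selects only the processes on the first machine, the rank layout can be simulated without Horovod. This is a sketch under one stated assumption: global ranks are assigned contiguously per machine, so that `local_rank = rank % gpus_per_machine`. The function name `ranks_that_evaluate` is hypothetical.

```python
def ranks_that_evaluate(num_machines, gpus_per_machine):
    """Return the global ranks for which rank == local_rank.

    Assumes contiguous per-machine rank assignment, i.e.
    local_rank = rank % gpus_per_machine (an assumption about the launcher).
    """
    selected = []
    for rank in range(num_machines * gpus_per_machine):
        local_rank = rank % gpus_per_machine
        if rank == local_rank:
            selected.append(rank)
    return selected


# Two machines with four GPUs each: only ranks on the first machine match.
print(ranks_that_evaluate(2, 4))  # prints [0, 1, 2, 3]
```

Under that assumption, every process on the first machine evaluates one shard, and processes on the other machines skip evaluation and only take part in the barrier.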
@@ -443,6 +440,7 @@ class EvalCallback(Callback):
         logger.info("[EvalCallback] Will evaluate every {} epochs".format(interval))

     def _eval(self):
+        logdir = args.logdir
         if cfg.TRAINER == 'replicated':
             with ThreadPoolExecutor(max_workers=self.num_predictor, thread_name_prefix='EvalWorker') as executor, \
                     tqdm.tqdm(total=sum([df.size() for df in self.dataflows])) as pbar:
@@ -451,23 +449,26 @@ class EvalCallback(Callback):
                     futures.append(executor.submit(eval_coco, dataflow, pred, pbar))
                 all_results = list(itertools.chain(*[fut.result() for fut in futures]))
         else:
-            local_results = eval_coco(self.dataflow, self.predictor)
-            results_as_arr = np.frombuffer(dumps(local_results), dtype=np.uint8)
-            sizes, concat_arrs = tf.get_default_session().run(
-                [self.string_lens, self.concat_results],
-                feed_dict={self.local_result_tensor: results_as_arr})
+            if self._horovod_run_eval:
+                local_results = eval_coco(self.dataflow, self.predictor)
+                output_partial = os.path.join(
+                    logdir, 'outputs{}-part{}.json'.format(self.global_step, hvd.local_rank()))
+                with open(output_partial, 'w') as f:
+                    json.dump(local_results, f)
+            self.barrier.eval()
             if hvd.rank() > 0:
                 return
             all_results = []
-            start = 0
-            for size in sizes:
-                substr = concat_arrs[start: start + size]
-                results = loads(substr.tobytes())
-                all_results.extend(results)
-                start = start + size
+            for k in range(hvd.local_size()):
+                output_partial = os.path.join(
+                    logdir, 'outputs{}-part{}.json'.format(self.global_step, k))
+                with open(output_partial, 'r') as f:
+                    obj = json.load(f)
+                all_results.extend(obj)
+                os.unlink(output_partial)

         output_file = os.path.join(
-            logger.get_logger_dir(), 'outputs{}.json'.format(self.global_step))
+            logdir, 'outputs{}.json'.format(self.global_step))
         with open(output_file, 'w') as f:
             json.dump(all_results, f)
         try:
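The new `_eval` path replaces the `allgather` of serialized results with part files on disk: each evaluating rank writes `outputs{step}-part{local_rank}.json`, all ranks wait at a barrier, and rank 0 merges and deletes the parts. A minimal single-process sketch of the write and merge steps follows. The function names `write_partial` and `merge_partials` are hypothetical, the barrier is not reproduced, and it assumes all parts land in one directory that rank 0 can read.

```python
import json
import os
import tempfile


def write_partial(logdir, step, local_rank, results):
    """Write one rank's results to its own part file and return the path."""
    path = os.path.join(logdir, 'outputs{}-part{}.json'.format(step, local_rank))
    with open(path, 'w') as f:
        json.dump(results, f)
    return path


def merge_partials(logdir, step, local_size):
    """Read every part file in rank order, concatenate, and delete the parts."""
    all_results = []
    for k in range(local_size):
        path = os.path.join(logdir, 'outputs{}-part{}.json'.format(step, k))
        with open(path, 'r') as f:
            all_results.extend(json.load(f))
        os.unlink(path)
    return all_results


# Usage: three simulated ranks write their parts, then one merge collects them.
with tempfile.TemporaryDirectory() as demo_dir:
    for rank in range(3):
        write_partial(demo_dir, 100, rank, [{'image_id': rank}])
    merged = merge_partials(demo_dir, 100, 3)
    print(len(merged))  # prints 3
```

In the real callback the barrier between the two steps matters: without it, rank 0 could try to open a part file that another rank has not finished writing.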
@@ -572,6 +573,9 @@ if __name__ == '__main__':
         if not is_horovod:
             callbacks.append(GPUUtilizationTracker())

-        if args.load:
-            session_init = get_model_loader(args.load)
-        else:
+        if is_horovod and hvd.rank() > 0:
+            session_init = None
+        else:
+            if args.load:
+                session_init = get_model_loader(args.load)
+            else:
tensorpack/dataflow/parallel.py
@@ -447,9 +447,11 @@ class PlasmaGetData(ProxyDataFlow):
             yield dp


-try:
-    import pyarrow.plasma as plasma
-except ImportError:
-    from ..utils.develop import create_dummy_class
-    PlasmaPutData = create_dummy_class('PlasmaPutData', 'pyarrow')   # noqa
-    PlasmaGetData = create_dummy_class('PlasmaGetData', 'pyarrow')   # noqa
+plasma = None
+# These plasma code is only experimental
+# try:
+#     import pyarrow.plasma as plasma
+# except ImportError:
+#     from ..utils.develop import create_dummy_class
+#     PlasmaPutData = create_dummy_class('PlasmaPutData', 'pyarrow')   # noqa
+#     PlasmaGetData = create_dummy_class('PlasmaGetData', 'pyarrow')   # noqa
tensorpack/libinfo.py
@@ -37,11 +37,11 @@ os.environ['TF_SYNC_ON_FINISH'] = '0' # will become default
 os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
 os.environ['TF_GPU_THREAD_COUNT'] = '2'

-# Available in TF1.6+. Haven't seen different performance on R50.
-# NOTE TF set it to 0 by default, because:
+# Available in TF1.6+ & cudnn7. Haven't seen different performance on R50.
+# NOTE we disable it because:
 # this mode may use scaled atomic integer reduction that may cause a numerical
 # overflow for certain input data range.
-# os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
+os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '0'

 try:
     import tensorflow as tf  # noqa
tensorpack/utils/serialize.py
@@ -64,7 +64,7 @@ except ImportError:
     dumps_msgpack = create_dummy_func(  # noqa
         'dumps_msgpack', ['msgpack', 'msgpack_numpy'])

-if os.environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
+if pa is None or os.environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
     loads = loads_msgpack
     dumps = dumps_msgpack
 else:
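The changed condition makes msgpack the fallback whenever pyarrow failed to import, not only when the environment variable asks for it. A small sketch of that decision is given here. The function `choose_serializer` is hypothetical, and the `'pyarrow'` result for the `else` branch is an assumption, since that branch is not shown in the hunk.

```python
import os


def choose_serializer(pa, environ=None):
    """Return the backend name picked by the condition in the serialize.py hunk.

    pa: the imported pyarrow module, or None if the import failed.
    environ: mapping to read; defaults to os.environ.
    """
    if environ is None:
        environ = os.environ
    if pa is None or environ.get('TENSORPACK_SERIALIZE', None) == 'msgpack':
        return 'msgpack'
    # Assumed: the elided else-branch selects the pyarrow implementation.
    return 'pyarrow'
```

With the old condition, a missing pyarrow and an unset `TENSORPACK_SERIALIZE` would have fallen through to the `else` branch.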