Commit 843990f5 authored by Yuxin Wu

misc updates

parent 180a3461
...@@ -32,7 +32,7 @@ Failure to hide the data preparation latency is the major reason why people ...@@ -32,7 +32,7 @@ Failure to hide the data preparation latency is the major reason why people
cannot see good GPU utilization. You should __always choose a framework that enables latency hiding.__ cannot see good GPU utilization. You should __always choose a framework that enables latency hiding.__
However most other TensorFlow wrappers are designed to be `feed_dict` based. However most other TensorFlow wrappers are designed to be `feed_dict` based.
Tensorpack has built-in mechanisms to hide latency of the above stages. Tensorpack has built-in mechanisms to hide latency of the above stages.
This is the major reason why tensorpack is [faster](https://github.com/tensorpack/benchmarks). This is one of the reasons why tensorpack is [faster](https://github.com/tensorpack/benchmarks).
## Python Reader or TF Reader ? ## Python Reader or TF Reader ?
......
...@@ -15,26 +15,32 @@ If you ask for help to understand and improve the speed, PLEASE do them and incl ...@@ -15,26 +15,32 @@ If you ask for help to understand and improve the speed, PLEASE do them and incl
1. If you use feed-based input (unrecommended) and datapoints are large, data is likely to become the bottleneck. 1. If you use feed-based input (unrecommended) and datapoints are large, data is likely to become the bottleneck.
2. If you use queue-based input + DataFlow, always pay attention to the queue size statistics in 2. If you use queue-based input + DataFlow, always pay attention to the queue size statistics in
training log. Ideally the input queue should be nearly full (default size is 50). training log. Ideally the input queue should be nearly full (default size is 50).
__If the queue size is close to zero, data is the bottleneck. Otherwise, it's not.__ __If the queue size is close to zero, data is the bottleneck. Otherwise, it's not.__
3. If GPU utilization is low but queue is full. It's because of the graph.
Either there are some communication inefficiency or some ops you use are inefficient (e.g. CPU ops). Also make sure GPUs are not locked in P8 state. The size is by default printed after every epoch. Set `steps_per_epoch` to a
smaller number (e.g. 100) to see this number earlier.
3. If GPU utilization is low but queue is full, the graph is inefficient.
Either there are some communication inefficiency, or some ops in the graph are inefficient (e.g. CPU ops). Also make sure GPUs are not locked in P8 state.
## Benchmark the components ## Benchmark the components
Whatever benchmarks you're doing, never look at the speed of the first 50 iterations.
Everything is slow at the beginning.
1. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow by a constant DataFlow. 1. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow by a constant DataFlow.
This will benchmark the graph without the possible overhead of DataFlow. This will benchmark the graph, without the possible overhead of DataFlow.
2. (usually not needed) Use `data=DummyConstantInput(shapes)` for training, so that the iterations only take data from a constant tensor. 2. (usually not needed) Use `data=DummyConstantInput(shapes)` for training, so that the iterations only take data from a constant tensor.
No DataFlow is involved in this case. No DataFlow is involved in this case.
3. If you're using a TF-based input pipeline you wrote, you can simply run it in a loop and test its speed. 3. If you're using a TF-based input pipeline you wrote, you can simply run it in a loop and test its speed.
4. Use `TestDataSpeed(mydf).start()` to benchmark your DataFlow. 4. Use `TestDataSpeed(mydf).start()` to benchmark your DataFlow.
A benchmark will give you more precise information about which part you should improve. A benchmark will give you more precise information about which part you should improve.
Note that you should only look at iteration speed after about 50 iterations, since everything is slow at the beginning.
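The advice to ignore the first 50 iterations can be sketched as a small timing helper. This is only an illustration, not part of tensorpack: `measure_throughput` and `step_fn` are hypothetical names, and `step_fn` stands in for whatever runs one iteration (a training step, or fetching one datapoint from a DataFlow).

```python
import time


def measure_throughput(step_fn, num_iters=200, warmup=50):
    """Return iterations per second, excluding the first `warmup` calls.

    `step_fn` is a hypothetical zero-argument callable that runs one iteration.
    """
    # Run the warm-up iterations without timing them: everything is slow
    # at the beginning, so these would skew the measurement.
    for _ in range(warmup):
        step_fn()

    start = time.perf_counter()
    for _ in range(num_iters):
        step_fn()
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very coarse clocks.
    return num_iters / max(elapsed, 1e-12)


if __name__ == "__main__":
    # Stand-in workload; replace it with a real iteration.
    speed = measure_throughput(lambda: sum(range(10000)), num_iters=100)
    print("{:.1f} it/s".format(speed))
```

The warm-up count of 50 mirrors the number quoted above; a different value may suit other setups.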
## Investigate DataFlow ## Investigate DataFlow
Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial, so you know what your DataFlow is doing. Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial, so you know what your DataFlow is doing.
Then, make modifications and benchmark to understand which part is the bottleneck. Then, make modifications and benchmark to understand which part of dataflow is the bottleneck.
Use [TestDataSpeed](../modules/dataflow.html#tensorpack.dataflow.TestDataSpeed). Use [TestDataSpeed](../modules/dataflow.html#tensorpack.dataflow.TestDataSpeed).
Do __NOT__ look at training speed when you benchmark a DataFlow. Do __NOT__ look at training speed when you benchmark a DataFlow.
...@@ -76,7 +82,7 @@ But there may be something cheap you can try: ...@@ -76,7 +82,7 @@ But there may be something cheap you can try:
### Cannot scale to multi-GPU ### Cannot scale to multi-GPU
If you're unable to scale to multiple GPUs almost linearly: If you're unable to scale to multiple GPUs almost linearly:
1. First make sure that the ResNet example can scale. Run it with `--fake` to use fake data. 1. First make sure that the ImageNet-ResNet example can scale. Run it with `--fake` to use fake data.
If not, it's a bug or an environment setup problem. If not, it's a bug or an environment setup problem.
2. Then note that your model may have a different communication-computation pattern that affects efficiency. 2. Then note that your model may have a different communication-computation pattern that affects efficiency.
There isn't a simple answer to this. There isn't a simple answer to this.
......
...@@ -8,6 +8,7 @@ This example provides a minimal (<2k lines) and faithful implementation of the f ...@@ -8,6 +8,7 @@ This example provides a minimal (<2k lines) and faithful implementation of the f
with the support of: with the support of:
+ Multi-GPU / distributed training + Multi-GPU / distributed training
+ [Cross-GPU BatchNorm](https://arxiv.org/abs/1711.07240) + [Cross-GPU BatchNorm](https://arxiv.org/abs/1711.07240)
+ [Group Normalization](https://arxiv.org/abs/1803.08494)
## Dependencies ## Dependencies
+ Python 3; TensorFlow >= 1.6 (1.4 or 1.5 can run but may crash due to a TF bug); + Python 3; TensorFlow >= 1.6 (1.4 or 1.5 can run but may crash due to a TF bug);
...@@ -65,12 +66,13 @@ MaskRCNN results contain both box and mask mAP. ...@@ -65,12 +66,13 @@ MaskRCNN results contain both box and mask mAP.
| Backbone | mAP<br/>(box/mask) | Detectron mAP <br/> (box/mask) | Time | Configurations <br/> (click to expand) | | Backbone | mAP<br/>(box/mask) | Detectron mAP <br/> (box/mask) | Time | Configurations <br/> (click to expand) |
| - | - | - | - | - | | - | - | - | - | - |
| R50-C4 | 33.1 | | 18h on 8 V100s | <details><summary>super quick</summary>`MODE_MASK=False FRCNN.BATCH_PER_IM=64`<br/>`PREPROC.SHORT_EDGE_SIZE=600 PREPROC.MAX_SIZE=1024`<br/>`TRAIN.LR_SCHEDULE=[150000,230000,280000]` </details> | | R50-C4 | 33.1 | | 18h on 8 V100s | <details><summary>super quick</summary>`MODE_MASK=False FRCNN.BATCH_PER_IM=64`<br/>`PREPROC.SHORT_EDGE_SIZE=600 PREPROC.MAX_SIZE=1024`<br/>`TRAIN.LR_SCHEDULE=[150000,230000,280000]` </details> |
| R50-C4 | 36.6 | 36.5 | 49h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False` </details> | | R50-C4 | 36.6 | 36.5 | 44h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False` </details> |
| R50-FPN | 37.5 | 37.9<sup>[1](#ft1)</sup> | 28h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False MODE_FPN=True` </details> | | R50-FPN | 37.5 | 37.9<sup>[1](#ft1)</sup> | 28h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False MODE_FPN=True` </details> |
| R50-C4 | 36.8/32.1 | | 39h on 8 P100s | <details><summary>quick</summary>`MODE_MASK=True FRCNN.BATCH_PER_IM=256`<br/>`TRAIN.LR_SCHEDULE=[150000,230000,280000]` </details> | | R50-C4 | 36.8/32.1 | | 39h on 8 P100s | <details><summary>quick</summary>`MODE_MASK=True FRCNN.BATCH_PER_IM=256`<br/>`TRAIN.LR_SCHEDULE=[150000,230000,280000]` </details> |
| R50-C4 | 37.8/33.1 | 37.8/32.8 | 51h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True` </details> | | R50-C4 | 37.8/33.1 | 37.8/32.8 | 45h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True` </details> |
| R50-FPN | 38.1/34.9 | 38.6/34.5<sup>[1](#ft1)</sup> | 32h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True MODE_FPN=True` </details> | | R50-FPN | 38.1/34.9 | 38.6/34.5<sup>[1](#ft1)</sup> | 32h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True MODE_FPN=True` </details> |
| R50-FPN | 38.5/34.8 | 38.6/34.2<sup>[2](#ft2)</sup> | 34h on 8 V100s | <details><summary>standard+convhead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_head` </details> | | R50-FPN | 38.5/34.8 | 38.6/34.2<sup>[2](#ft2)</sup> | 34h on 8 V100s | <details><summary>standard+ConvHead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_head` </details> |
| R50-FPN | 39.5/35.2 | 39.5/34.4<sup>[2](#ft2)</sup> | 34h on 8 V100s | <details><summary>standard+ConvGNHead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head` </details> |
| R101-C4 | 40.8/35.1 | | 63h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True `<br/>`BACKBONE.RESNET_NUM_BLOCK=[3,4,23,3]` </details> | | R101-C4 | 40.8/35.1 | | 63h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True `<br/>`BACKBONE.RESNET_NUM_BLOCK=[3,4,23,3]` </details> |
<a id="ft1">1</a>: Slightly different configurations. <a id="ft1">1</a>: Slightly different configurations.
......
...@@ -281,8 +281,8 @@ def get_train_dataflow(): ...@@ -281,8 +281,8 @@ def get_train_dataflow():
# Valid training images should have at least one fg box. # Valid training images should have at least one fg box.
# But this filter shall not be applied for testing. # But this filter shall not be applied for testing.
num = len(imgs) num = len(imgs)
imgs = list(filter(lambda img: len(img['boxes']) > 0, imgs)) # log invalid training imgs = list(filter(lambda img: len(img['boxes'][img['is_crowd'] == 0]) > 0, imgs))
logger.info("Filtered {} images which contain no groudtruth boxes. Total #images for training: {}".format( logger.info("Filtered {} images which contain no non-crowd groudtruth boxes. Total #images for training: {}".format(
num - len(imgs), len(imgs))) num - len(imgs), len(imgs)))
ds = DataFromList(imgs, shuffle=True) ds = DataFromList(imgs, shuffle=True)
......
...@@ -111,7 +111,8 @@ def crop_and_resize(image, boxes, box_ind, crop_size, pad_border=True): ...@@ -111,7 +111,8 @@ def crop_and_resize(image, boxes, box_ind, crop_size, pad_border=True):
However, what we want is (with fpcoor box): However, what we want is (with fpcoor box):
Spacing: w_box / W_crop Spacing: w_box / W_crop
Initial point: x0_box + spacing/2 - 0.5 Initial point: x0_box + spacing/2 - 0.5
(-0.5 because bilinear sample assumes floating point coordinate (0.0, 0.0) is the same as pixel value (0, 0)) (-0.5 because bilinear sample (in my definition) assumes floating point coordinate
(0.0, 0.0) is the same as pixel value (0, 0))
This function transform fpcoor boxes to a format to be used by tf.image.crop_and_resize This function transform fpcoor boxes to a format to be used by tf.image.crop_and_resize
......
...@@ -214,7 +214,7 @@ def fastrcnn_predictions(boxes, probs): ...@@ -214,7 +214,7 @@ def fastrcnn_predictions(boxes, probs):
""" """
FC Heads: FastRCNN heads for FPN:
""" """
......
...@@ -574,7 +574,7 @@ if __name__ == '__main__': ...@@ -574,7 +574,7 @@ if __name__ == '__main__':
ScheduledHyperParamSetter('learning_rate', lr_schedule), ScheduledHyperParamSetter('learning_rate', lr_schedule),
EvalCallback(*MODEL.get_inference_tensor_names()), EvalCallback(*MODEL.get_inference_tensor_names()),
PeakMemoryTracker(), PeakMemoryTracker(),
EstimatedTimeLeft(), EstimatedTimeLeft(median=True),
SessionRunTimeout(60000).set_chief_only(True), # 1 minute timeout SessionRunTimeout(60000).set_chief_only(True), # 1 minute timeout
] ]
if not is_horovod: if not is_horovod:
......
...@@ -75,13 +75,14 @@ class EstimatedTimeLeft(Callback): ...@@ -75,13 +75,14 @@ class EstimatedTimeLeft(Callback):
""" """
Estimate the time left until completion of training. Estimate the time left until completion of training.
""" """
def __init__(self, last_k_epochs=5): def __init__(self, last_k_epochs=5, median=False):
""" """
Args: Args:
last_k_epochs (int): Use the time spent on last k epochs to last_k_epochs (int): Use the time spent on last k epochs to estimate total time left.
estimate total time left. median (bool): Use mean by default. If True, use the median time spent on last k epochs.
""" """
self._times = deque(maxlen=last_k_epochs) self._times = deque(maxlen=last_k_epochs)
self._median = median
def _before_train(self): def _before_train(self):
self._max_epoch = self.trainer.max_epoch self._max_epoch = self.trainer.max_epoch
...@@ -92,7 +93,7 @@ class EstimatedTimeLeft(Callback): ...@@ -92,7 +93,7 @@ class EstimatedTimeLeft(Callback):
self._last_time = time.time() self._last_time = time.time()
self._times.append(duration) self._times.append(duration)
average_epoch_time = np.mean(self._times) epoch_time = np.median(self._times) if self._median else np.mean(self._times)
time_left = (self._max_epoch - self.epoch_num) * average_epoch_time time_left = (self._max_epoch - self.epoch_num) * epoch_time
if time_left > 0: if time_left > 0:
logger.info("Estimated Time Left: " + humanize_time_delta(time_left)) logger.info("Estimated Time Left: " + humanize_time_delta(time_left))
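To see why the `median` option can matter, here is a standalone sketch of the same estimate. It is not the callback's code: it uses the standard-library `statistics` module instead of NumPy, and `estimate_time_left` and the sample epoch times are made up for illustration.

```python
from statistics import mean, median


def estimate_time_left(epoch_times, epochs_left, use_median=False):
    """Estimate remaining time from the durations of recent epochs.

    Hypothetical helper mirroring the idea in the diff above: multiply the
    number of remaining epochs by the mean (default) or median epoch time.
    """
    per_epoch = median(epoch_times) if use_median else mean(epoch_times)
    return epochs_left * per_epoch


if __name__ == "__main__":
    # Made-up durations in seconds: four ordinary epochs and one slow outlier.
    recent = [100, 102, 98, 400, 100]
    print("mean-based:  ", estimate_time_left(recent, epochs_left=10))
    print("median-based:", estimate_time_left(recent, epochs_left=10, use_median=True))
```

With one unusually slow epoch in the window, the mean-based estimate is pulled up noticeably, while the median-based estimate stays close to the typical epoch time.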
...@@ -121,7 +121,7 @@ def dump_dataflow_to_tfrecord(df, path): ...@@ -121,7 +121,7 @@ def dump_dataflow_to_tfrecord(df, path):
sz = 0 sz = 0
with get_tqdm(total=sz) as pbar: with get_tqdm(total=sz) as pbar:
for dp in df.get_data(): for dp in df.get_data():
writer.write(dumps(dp)) writer.write(dumps(dp).to_pybytes())
pbar.update() pbar.update()
......
...@@ -230,4 +230,6 @@ def is_training_name(name): ...@@ -230,4 +230,6 @@ def is_training_name(name):
return True return True
if name.startswith('AccumGrad') or name.endswith('/AccumGrad'): if name.startswith('AccumGrad') or name.endswith('/AccumGrad'):
return True return True
if name.startswith('apply_gradients'):
return True
return False return False