Commit 9e8e9fb8 authored by Yuxin Wu

add benchmark

parent a7f4094d
@@ -16,7 +16,7 @@ This is a minimal implementation that simply contains these files:
### Implementation Notes
Data:
#### Data:
1. It's easy to train on your own data by calling `DatasetRegistry.register(name, lambda: YourDatasetSplit())`
and modifying `cfg.DATA.*` accordingly. Afterwards, "name" can be used in `cfg.DATA.TRAIN` (see the sketch below).
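As an illustration only (not taken from this README), the minimal sketch below shows what such a registration might look like. It assumes the `DatasetSplit`-style interface described in `dataset.py`; the class name `MyTrainSplit`, the import path, and the roidb field names are assumptions that should be checked against `dataset.py`.

```python
import numpy as np

# Assumed import path; in this example the registry and the base class live in dataset.py.
from dataset import DatasetRegistry, DatasetSplit


class MyTrainSplit(DatasetSplit):
    """Hypothetical split. Field names follow the roidb convention documented in dataset.py."""

    def training_roidbs(self):
        # Return one dict per training image.
        return [{
            "file_name": "/path/to/image.jpg",
            "boxes": np.asarray([[10, 20, 200, 240]], dtype=np.float32),  # XYXY, float
            "class": np.asarray([1], dtype=np.int32),    # category ids; 0 is reserved for background
            "is_crowd": np.asarray([0], dtype=np.int8),
        }]


DatasetRegistry.register("my_data_train", lambda: MyTrainSplit())
# Afterwards, "my_data_train" can be used in cfg.DATA.TRAIN.
```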
@@ -38,7 +38,7 @@ Data:
which is probably not optimal.
A TODO is to generate bounding boxes from segmentation masks (see the sketch below), so that more augmentations can be supported naturally.
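Purely for illustration (this helper is not part of the codebase), deriving a tight box from a binary mask is a small numpy operation, assuming the edge-based floating-point box convention described under Model below:

```python
import numpy as np

def mask_to_box(mask):
    """Tight float box (x1, y1, x2, y2) around a binary HxW mask.
    Standalone sketch, not code from this example."""
    ys, xs = np.where(mask)
    # +1 because, with edge-based boxes, the pixel at (x, y) is covered
    # by the box (x, y, x + 1, y + 1).
    return np.asarray([xs.min(), ys.min(), xs.max() + 1, ys.max() + 1], dtype=np.float32)
```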
Model:
#### Model:
1. Floating-point boxes are defined like this:
@@ -54,15 +54,25 @@ Model:
GPUs (the `BACKBONE.NORM=SyncBN` option).
Another alternative to BatchNorm is GroupNorm (`BACKBONE.NORM=GN`), which has better performance (see the sketch below).
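As a rough, non-authoritative sketch: these norm choices are ordinary config keys, so besides the command line they can also be set programmatically before training. The import path below assumes this example's `config.py` exposes its global config object as `config`.

```python
# Minimal sketch of setting the normalization options in Python,
# equivalent to passing BACKBONE.NORM=... FPN.NORM=... on the command line.
from config import config as cfg  # assumed import path; see this example's config.py

cfg.BACKBONE.NORM = "GN"   # or "SyncBN" when training with enough GPUs
cfg.FPN.NORM = "GN"        # GroupNorm in the FPN as well (used by the GN entries in the table below)
```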
Efficiency:
#### Efficiency:
1. This implementation does not use specialized CUDA ops (e.g. NMS, ROIAlign).
Training throughput (larger is better) of a standard R50-FPN Mask R-CNN on 8 V100s:
| Implementation | Throughput (img/s) |
| - | - |
| [torchvision](https://pytorch.org/blog/torchvision03/#segmentation-models) | 59 |
| tensorpack | 50 |
| [maskrcnn-benchmark](https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/MODEL_ZOO.md#end-to-end-faster-and-mask-r-cnn-baselines) | 35 |
| [mmdetection](https://github.com/open-mmlab/mmdetection/blob/master/docs/MODEL_ZOO.md#mask-r-cnn) | 35 |
| [Detectron](https://github.com/facebookresearch/Detectron) | 19 |
| [matterport/Mask_RCNN](https://github.com/matterport/Mask_RCNN/) | 11 |
1. This implementation does not use specialized CUDA ops (e.g. ROIAlign),
and does not use batch of images.
Therefore it might be slower than other highly-optimized implementations.
With CUDA kernel of NMS (available only in TF master) and `HorovodTrainer`,
this implementation can train a standard R50-FPN at 50 img/s on 8 V100s,
compared to 35 img/s in [maskrcnn-benchmark](https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/MODEL_ZOO.md#end-to-end-faster-and-mask-r-cnn-baselines)
and [mmdetection](https://github.com/open-mmlab/mmdetection/blob/master/docs/MODEL_ZOO.md#mask-r-cnn),
and 59 img/s in [torchvision](https://pytorch.org/blog/torchvision03/#detection-models).
Our number in the table above uses the CUDA kernel of NMS (available only in TF
master with [PR30893](https://github.com/tensorflow/tensorflow/pull/30893))
and `TRAINER=horovod`.
1. If CuDNN warmup is on, training starts very slowly and takes about
10k steps (or more if scale augmentation is used) to reach its maximum speed.
@@ -99,8 +99,8 @@ Performance in [Detectron](https://github.com/facebookresearch/Detectron/) can b
| R50-FPN-GN | 40.4;36.3 [:arrow_down:][R50FPN2xGN] | 40.3;35.7 | 29h | <details><summary>2x+GN</summary>`FPN.NORM=GN BACKBONE.NORM=GN`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head`<br/>`FPN.MRCNN_HEAD_FUNC=maskrcnn_up4conv_gn_head` <br/>`TRAIN.LR_SCHEDULE=2x` |
| R50-FPN | 41.7;36.2 [:arrow_down:][R50FPN1xCas] | | 16h | <details><summary>+Cascade</summary>`FPN.CASCADE=True` </details> |
| R101-C4 | 40.1;34.6 [:arrow_down:][R101C41x] | | 27h | <details><summary>standard</summary>`MODE_FPN=False`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]` </details> |
| R101-FPN | 40.7;36.8 [:arrow_down:][R101FPN1x] | 40.0;35.9 | 17h | <details><summary>standard</summary>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]` </details> |
| R101-FPN | 46.6;40.3 [:arrow_down:][R101FPN3xCasAug] <sup>[2](#ft2)</sup> | | 64h | <details><summary>3x+Cascade+TrainAug</summary>` FPN.CASCADE=True`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]`<br/>`TEST.RESULT_SCORE_THRESH=1e-4`<br/>`PREPROC.TRAIN_SHORT_EDGE_SIZE=[640,800]`<br/>`TRAIN.LR_SCHEDULE=3x` </details> |
| R101-FPN | 40.7;36.8 [:arrow_down:][R101FPN1x] <sup>[2](#ft2)</sup> | 40.0;35.9 | 17h | <details><summary>standard</summary>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]` </details> |
| R101-FPN | 46.6;40.3 [:arrow_down:][R101FPN3xCasAug] | | 64h | <details><summary>3x+Cascade+TrainAug</summary>` FPN.CASCADE=True`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]`<br/>`TEST.RESULT_SCORE_THRESH=1e-4`<br/>`PREPROC.TRAIN_SHORT_EDGE_SIZE=[640,800]`<br/>`TRAIN.LR_SCHEDULE=3x` </details> |
| R101-FPN-GN<br/>(From Scratch) | 47.7;41.7 [:arrow_down:][R101FPN9xGNCasAugScratch] <sup>[3](#ft3)</sup> | 47.4;40.5 | 28h (on 64 V100s) | <details><summary>9x+GN+Cascade+TrainAug</summary>` FPN.CASCADE=True`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]`<br/>`FPN.NORM=GN BACKBONE.NORM=GN`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head`<br/>`FPN.MRCNN_HEAD_FUNC=maskrcnn_up4conv_gn_head`<br/>`PREPROC.TRAIN_SHORT_EDGE_SIZE=[640,800]`<br/>`TRAIN.LR_SCHEDULE=9x`<br/>`BACKBONE.FREEZE_AT=0`</details> |
[R50C41x]: http://models.tensorpack.com/FasterRCNN/COCO-MaskRCNN-R50C41x.npz
@@ -116,7 +116,9 @@ Performance in [Detectron](https://github.com/facebookresearch/Detectron/) can b
We compare models that have identical training & inference cost between the two implementations.
Their numbers can be different due to small implementation details.
<a id="ft2">2</a>: Our mAP is __10+ point__ better than the official model in [matterport/Mask_RCNN](https://github.com/matterport/Mask_RCNN/releases/tag/v2.0) with the same R101-FPN backbone.
<a id="ft2">2</a>: Our mAP is __7 points__ better than the official model in
[matterport/Mask_RCNN](https://github.com/matterport/Mask_RCNN/releases/tag/v2.0), which has the same architecture.
Our implementation is also [5x faster](https://github.com/tensorpack/benchmarks/tree/master/MaskRCNN).
<a id="ft3">3</a>: This entry does not use ImageNet pre-training. Detectron numbers are taken from Fig. 5 in [Rethinking ImageNet Pre-training](https://arxiv.org/abs/1811.08883).
Note that our training strategy is slightly different: we enable cascade throughout the entire training.