misc updates

Commit 843990f5 authored Jul 09, 2018 by Yuxin Wu
10 changed files
--- a/docs/tutorial/input-source.md
+++ b/docs/tutorial/input-source.md
@@ -32,7 +32,7 @@ Failure to hide the data preparation latency is the major reason why people
 cannot see good GPU utilization. You should __always choose a framework that enables latency hiding.__
 However most other TensorFlow wrappers are designed to be `feed_dict` based.
 Tensorpack has built-in mechanisms to hide latency of the above stages.
-This is the major reason why tensorpack is [faster](https://github.com/tensorpack/benchmarks).
+This is one of the reasons why tensorpack is [faster](https://github.com/tensorpack/benchmarks).

 ## Python Reader or TF Reader ?


--- a/docs/tutorial/performance-tuning.md
+++ b/docs/tutorial/performance-tuning.md
@@ -15,26 +15,32 @@ If you ask for help to understand and improve the speed, PLEASE do them and incl

 1. If you use feed-based input (unrecommended) and datapoints are large, data is likely to become the bottleneck.
 2. If you use queue-based input + DataFlow, always pay attention to the queue size statistics in
-	 training log. Ideally the input queue should be nearly full (default size is 50).
- 	 __If the queue size is close to zero, data is the bottleneck. Otherwise, it's not.__
-3. If GPU utilization is low but queue is full. It's because of the graph.
-	Either there are some communication inefficiency or some ops you use are inefficient (e.g. CPU ops). Also make sure GPUs are not locked in P8 state.
+   training log. Ideally the input queue should be nearly full (default size is 50).
+   __If the queue size is close to zero, data is the bottleneck. Otherwise, it's not.__
+
+   The size is by default printed after every epoch. Set `steps_per_epoch` to a
+   smaller number (e.g. 100) to see this number earlier.
+3. If GPU utilization is low but queue is full, the graph is inefficient.
+   Either there are some communication inefficiency, or some ops in the graph are inefficient (e.g. CPU ops). Also make sure GPUs are not locked in P8 state.

 ## Benchmark the components
+
+Whatever benchmarks you're doing, never look at the speed of the first 50 iterations.
+Everything is slow at the beginning.
+
 1. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow by a constant DataFlow.
-	This will benchmark the graph without the possible overhead of DataFlow.
+	This will benchmark the graph, without the possible overhead of DataFlow.
 2. (usually not needed) Use `data=DummyConstantInput(shapes)` for training, so that the iterations only take data from a constant tensor.
 	No DataFlow is involved in this case.
 3. If you're using a TF-based input pipeline you wrote, you can simply run it in a loop and test its speed.
 4. Use `TestDataSpeed(mydf).start()` to benchmark your DataFlow.

 A benchmark will give you more precise information about which part you should improve.
-Note that you should only look at iteration speed after about 50 iterations, since everything is slow at the beginning.

 ## Investigate DataFlow

 Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial, so you know what your DataFlow is doing.
-Then, make modifications and benchmark to understand which part is the bottleneck.
+Then, make modifications and benchmark to understand which part of dataflow is the bottleneck.
 Use [TestDataSpeed](../modules/dataflow.html#tensorpack.dataflow.TestDataSpeed).
 Do __NOT__ look at training speed when you benchmark a DataFlow.

@@ -76,7 +82,7 @@ But there may be something cheap you can try:

 ### Cannot scale to multi-GPU
 If you're unable to scale to multiple GPUs almost linearly:
-1. First make sure that the ResNet example can scale. Run it with `--fake` to use fake data.
+1. First make sure that the ImageNet-ResNet example can scale. Run it with `--fake` to use fake data.
 	If not, it's a bug or an environment setup problem.
 2. Then note that your model may have a different communication-computation pattern that affects efficiency.
 	 There isn't a simple answer to this.
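
The warm-up advice added to `performance-tuning.md` above (never look at the speed of the first 50 iterations) can be illustrated with a small timing helper. This is a minimal sketch, not part of tensorpack: `steady_state_speed` and `step_fn` are hypothetical names, and `step_fn` stands in for whatever runs one training or data-loading iteration.

```python
import time


def steady_state_speed(step_fn, total_iters=200, warmup_iters=50):
    """Return iterations per second, measured only after the warm-up steps.

    `step_fn` is a hypothetical zero-argument callable that runs one iteration.
    """
    if total_iters <= warmup_iters:
        raise ValueError("total_iters must be larger than warmup_iters")

    # Warm-up: run the slow initial iterations without timing them.
    for _ in range(warmup_iters):
        step_fn()

    # Timed region: only steady-state iterations are counted.
    timed_iters = total_iters - warmup_iters
    start = time.perf_counter()
    for _ in range(timed_iters):
        step_fn()
    elapsed = time.perf_counter() - start

    # Guard against a zero reading from a very coarse clock.
    return timed_iters / max(elapsed, 1e-12)
```

The default of 50 warm-up iterations mirrors the number quoted in the tutorial text; a different workload may need more.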

--- a/examples/FasterRCNN/README.md
+++ b/examples/FasterRCNN/README.md
@@ -8,6 +8,7 @@ This example provides a minimal (<2k lines) and faithful implementation of the f
 with the support of:
 + Multi-GPU / distributed training
 + [Cross-GPU BatchNorm](https://arxiv.org/abs/1711.07240)
++ [Group Normalization](https://arxiv.org/abs/1803.08494)

 ## Dependencies
 + Python 3; TensorFlow >= 1.6 (1.4 or 1.5 can run but may crash due to a TF bug);
@@ -65,12 +66,13 @@ MaskRCNN results contain both box and mask mAP.
 | Backbone | mAP<br/>(box/mask) | Detectron mAP <br/> (box/mask) | Time           | Configurations <br/> (click to expand)                                                                                                                                                           |
 | -        | -                  | -                              | -              | -                                                                                                                                                                                                |
 | R50-C4   | 33.1               |                                | 18h on 8 V100s | <details><summary>super quick</summary>`MODE_MASK=False FRCNN.BATCH_PER_IM=64`<br/>`PREPROC.SHORT_EDGE_SIZE=600 PREPROC.MAX_SIZE=1024`<br/>`TRAIN.LR_SCHEDULE=[150000,230000,280000]` </details> |
- | R50-C4   | 36.6               | 36.5                           | 49h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False` </details>                                                                                                                                 |
+ | R50-C4   | 36.6               | 36.5                           | 44h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False` </details>                                                                                                                                 |
 | R50-FPN  | 37.5               | 37.9<sup>[1](#ft1)</sup>       | 28h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False MODE_FPN=True` </details>                                                                                                                   |
 | R50-C4   | 36.8/32.1          |                                | 39h on 8 P100s | <details><summary>quick</summary>`MODE_MASK=True FRCNN.BATCH_PER_IM=256`<br/>`TRAIN.LR_SCHEDULE=[150000,230000,280000]` </details>                                                               |
- | R50-C4   | 37.8/33.1          | 37.8/32.8                      | 51h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True` </details>                                                                                                                                  |
+ | R50-C4   | 37.8/33.1          | 37.8/32.8                      | 45h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True` </details>                                                                                                                                  |
 | R50-FPN  | 38.1/34.9          | 38.6/34.5<sup>[1](#ft1)</sup>  | 32h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True MODE_FPN=True` </details>                                                                                                                    |
- | R50-FPN  | 38.5/34.8          | 38.6/34.2<sup>[2](#ft2)</sup>  | 34h on 8 V100s | <details><summary>standard+convhead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_head` </details>                                                          |
+ | R50-FPN  | 38.5/34.8          | 38.6/34.2<sup>[2](#ft2)</sup>  | 34h on 8 V100s | <details><summary>standard+ConvHead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_head` </details>                                                          |
+ | R50-FPN  | 39.5/35.2          | 39.5/34.4<sup>[2](#ft2)</sup>  | 34h on 8 V100s | <details><summary>standard+ConvGNHead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head` </details>                                                          |
 | R101-C4  | 40.8/35.1          |                                | 63h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True `<br/>`BACKBONE.RESNET_NUM_BLOCK=[3,4,23,3]` </details>                                                                                      |
 
 <a id="ft1">1</a>: Slightly different configurations.

--- a/examples/FasterRCNN/data.py
+++ b/examples/FasterRCNN/data.py
@@ -281,8 +281,8 @@ def get_train_dataflow():
    # Valid training images should have at least one fg box.
    # But this filter shall not be applied for testing.
    num = len(imgs)
-    imgs = list(filter(lambda img: len(img['boxes']) > 0, imgs))    # log invalid training
-    logger.info("Filtered {} images which contain no groudtruth boxes. Total #images for training: {}".format(
+    imgs = list(filter(lambda img: len(img['boxes'][img['is_crowd'] == 0]) > 0, imgs))
+    logger.info("Filtered {} images which contain no non-crowd groudtruth boxes. Total #images for training: {}".format(
        num - len(imgs), len(imgs)))

    ds = DataFromList(imgs, shuffle=True)
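
The effect of the new filter in `data.py` can be seen on toy data. The sketch below assumes, as the hunk implies, that `img['boxes']` is an N x 4 NumPy array and `img['is_crowd']` is a length-N array of 0/1 flags; the helper name `has_non_crowd_box` and the sample images are made up for illustration.

```python
import numpy as np


def has_non_crowd_box(img):
    """True if the image has at least one ground-truth box that is not a crowd."""
    boxes = np.asarray(img['boxes'])
    is_crowd = np.asarray(img['is_crowd'])
    # Boolean-mask indexing keeps only the rows whose crowd flag is 0.
    return len(boxes[is_crowd == 0]) > 0


# Hypothetical images: one mixed, one crowd-only, one without any boxes.
imgs = [
    {'boxes': np.zeros((2, 4), dtype='float32'), 'is_crowd': np.array([0, 1])},
    {'boxes': np.zeros((1, 4), dtype='float32'), 'is_crowd': np.array([1])},
    {'boxes': np.zeros((0, 4), dtype='float32'), 'is_crowd': np.array([], dtype='int32')},
]

# Old behaviour: keep any image with at least one box, crowd or not.
kept_old = [img for img in imgs if len(img['boxes']) > 0]
# New behaviour: keep only images with at least one non-crowd box.
kept_new = [img for img in imgs if has_non_crowd_box(img)]
```

With these inputs the old filter keeps the crowd-only image, while the new one drops it.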

--- a/examples/FasterRCNN/model_box.py
+++ b/examples/FasterRCNN/model_box.py
@@ -111,7 +111,8 @@ def crop_and_resize(image, boxes, box_ind, crop_size, pad_border=True):
        However, what we want is (with fpcoor box):
        Spacing: w_box / W_crop
        Initial point: x0_box + spacing/2 - 0.5
-        (-0.5 because bilinear sample assumes floating point coordinate (0.0, 0.0) is the same as pixel value (0, 0))
+        (-0.5 because bilinear sample (in my definition) assumes floating point coordinate
+         (0.0, 0.0) is the same as pixel value (0, 0))

        This function transform fpcoor boxes to a format to be used by tf.image.crop_and_resize


--- a/examples/FasterRCNN/model_frcnn.py
+++ b/examples/FasterRCNN/model_frcnn.py
@@ -214,7 +214,7 @@ def fastrcnn_predictions(boxes, probs):


 """
-FC Heads:
+FastRCNN heads for FPN:
 """



--- a/examples/FasterRCNN/train.py
+++ b/examples/FasterRCNN/train.py
@@ -574,7 +574,7 @@ if __name__ == '__main__':
            ScheduledHyperParamSetter('learning_rate', lr_schedule),
            EvalCallback(*MODEL.get_inference_tensor_names()),
            PeakMemoryTracker(),
-            EstimatedTimeLeft(),
+            EstimatedTimeLeft(median=True),
            SessionRunTimeout(60000).set_chief_only(True),   # 1 minute timeout
        ]
        if not is_horovod:

--- a/tensorpack/callbacks/misc.py
+++ b/tensorpack/callbacks/misc.py
@@ -75,13 +75,14 @@ class EstimatedTimeLeft(Callback):
    """
    Estimate the time left until completion of training.
    """
-    def __init__(self, last_k_epochs=5):
+    def __init__(self, last_k_epochs=5, median=False):
        """
        Args:
-            last_k_epochs (int): Use the time spent on last k epochs to
-                estimate total time left.
+            last_k_epochs (int): Use the time spent on last k epochs to estimate total time left.
+            median (bool): Use mean by default. If True, use the median time spent on last k epochs.
        """
        self._times = deque(maxlen=last_k_epochs)
+        self._median = median

    def _before_train(self):
        self._max_epoch = self.trainer.max_epoch
@@ -92,7 +93,7 @@ class EstimatedTimeLeft(Callback):
        self._last_time = time.time()
        self._times.append(duration)

-        average_epoch_time = np.mean(self._times)
-        time_left = (self._max_epoch - self.epoch_num) * average_epoch_time
+        epoch_time = np.median(self._times) if self._median else np.mean(self._times)
+        time_left = (self._max_epoch - self.epoch_num) * epoch_time
        if time_left > 0:
            logger.info("Estimated Time Left: " + humanize_time_delta(time_left))
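
Why a median option can help: a single unusually slow epoch (for example, one that includes a long evaluation) inflates the mean but barely moves the median. The sketch below reproduces the arithmetic of the changed lines outside of tensorpack. It uses the standard-library `statistics` module instead of NumPy to stay dependency-free, and `estimate_time_left` is a hypothetical helper, not the callback's API.

```python
from collections import deque
from statistics import mean, median


def estimate_time_left(epoch_times, epochs_left, use_median=False):
    """Estimate remaining seconds from recent per-epoch durations."""
    per_epoch = median(epoch_times) if use_median else mean(epoch_times)
    return epochs_left * per_epoch


# Keep only the last 5 epoch durations, like `deque(maxlen=last_k_epochs)`.
times = deque(maxlen=5)
for duration in [90.0, 100.0, 100.0, 100.0, 100.0, 600.0]:
    times.append(duration)  # the oldest value (90.0) is evicted by the sixth append

mean_estimate = estimate_time_left(times, epochs_left=10)
median_estimate = estimate_time_left(times, epochs_left=10, use_median=True)
```

Here the window holds four 100-second epochs and one 600-second outlier, so the mean-based estimate is twice the median-based one.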
--- a/tensorpack/dataflow/dftools.py
+++ b/tensorpack/dataflow/dftools.py
@@ -121,7 +121,7 @@ def dump_dataflow_to_tfrecord(df, path):
            sz = 0
        with get_tqdm(total=sz) as pbar:
            for dp in df.get_data():
-                writer.write(dumps(dp))
+                writer.write(dumps(dp).to_pybytes())
                pbar.update()



--- a/tensorpack/tfutils/varmanip.py
+++ b/tensorpack/tfutils/varmanip.py
@@ -230,4 +230,6 @@ def is_training_name(name):
        return True
    if name.startswith('AccumGrad') or name.endswith('/AccumGrad'):
        return True
+    if name.startswith('apply_gradients'):
+        return True
    return False
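
The name checks visible in this last hunk can be exercised in isolation. The sketch below covers only the two rules shown in the hunk; the real `is_training_name` has earlier checks that are outside the hunk and are not reproduced here, so `matches_visible_rules` is a hypothetical, partial stand-in.

```python
def matches_visible_rules(name):
    """Apply only the checks visible in the hunk above (partial sketch)."""
    # Gradient-accumulation variables, at the top level or nested in a scope.
    if name.startswith('AccumGrad') or name.endswith('/AccumGrad'):
        return True
    # New in this commit: variables created under an `apply_gradients` scope.
    if name.startswith('apply_gradients'):
        return True
    return False


# Hypothetical variable names used only for illustration.
examples = ['apply_gradients/beta1_power', 'conv1/W/AccumGrad', 'conv1/W']
flags = [matches_visible_rules(n) for n in examples]
```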