Commit 843990f5 authored by Yuxin Wu

misc updates

parent 180a3461
......@@ -32,7 +32,7 @@ Failure to hide the data preparation latency is the major reason why people
cannot see good GPU utilization. You should __always choose a framework that enables latency hiding.__
However, most other TensorFlow wrappers are designed to be `feed_dict`-based.
Tensorpack has built-in mechanisms to hide latency of the above stages.
This is the major reason why tensorpack is [faster](https://github.com/tensorpack/benchmarks).
This is one of the reasons why tensorpack is [faster](https://github.com/tensorpack/benchmarks).
## Python Reader or TF Reader?
......
......@@ -17,24 +17,30 @@ If you ask for help to understand and improve the speed, PLEASE do them and incl
2. If you use queue-based input + DataFlow, always pay attention to the queue size statistics in
training log. Ideally the input queue should be nearly full (default size is 50).
__If the queue size is close to zero, data is the bottleneck. Otherwise, it's not.__
3. If GPU utilization is low but queue is full. It's because of the graph.
Either there are some communication inefficiency or some ops you use are inefficient (e.g. CPU ops). Also make sure GPUs are not locked in P8 state.
The size is by default printed after every epoch. To see this number earlier, set
`steps_per_epoch` to a smaller number (e.g. 100); see also the sketch after this list.
3. If GPU utilization is low but queue is full, the graph is inefficient.
Either there is some communication inefficiency, or some ops in the graph are inefficient (e.g. CPU ops). Also make sure GPUs are not locked in the P8 power state.
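A rough sketch of the `steps_per_epoch` tip above, assuming a typical `TrainConfig`-based setup. `MyModel` and `my_dataflow` are placeholders for your own `ModelDesc` and `DataFlow`; only the `steps_per_epoch` argument matters here:

```python
from tensorpack import TrainConfig, SimpleTrainer, launch_train_with_config

# Hypothetical model / dataflow, stand-ins for whatever you already train with.
config = TrainConfig(
    model=MyModel(),          # placeholder ModelDesc
    dataflow=my_dataflow,     # placeholder DataFlow
    steps_per_epoch=100,      # short epochs, so the queue-size summary is printed early
)
launch_train_with_config(config, SimpleTrainer())
```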
## Benchmark the components
Whatever benchmarks you're doing, never look at the speed of the first 50 iterations.
Everything is slow at the beginning.
1. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow with a constant DataFlow (see the sketch after this list).
This will benchmark the graph without the possible overhead of DataFlow.
This will benchmark the graph, without the possible overhead of DataFlow.
2. (usually not needed) Use `data=DummyConstantInput(shapes)` for training, so that the iterations only take data from a constant tensor.
No DataFlow is involved in this case.
3. If you're using a TF-based input pipeline you wrote, you can simply run it in a loop and test its speed.
4. Use `TestDataSpeed(mydf).start()` to benchmark your DataFlow.
A benchmark will give you more precise information about which part you should improve.
Note that you should only look at iteration speed after about 50 iterations, since everything is slow at the beginning.
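For step 1, a minimal sketch; the shapes below are made-up ImageNet-style inputs, so substitute whatever shapes your model actually expects:

```python
from tensorpack.dataflow import FakeData

# Two components per datapoint: a batch of images and a batch of labels.
# random=False generates the data once and replays it, so the DataFlow itself
# costs almost nothing and any remaining slowness must come from the graph.
fake_df = FakeData([[64, 224, 224, 3], [64]], size=1000, random=False)
# Then pass `dataflow=fake_df` to your TrainConfig in place of the real DataFlow.
```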
## Investigate DataFlow
Understand the [Efficient DataFlow](efficient-dataflow.html) tutorial, so you know what your DataFlow is doing.
Then, make modifications and benchmark to understand which part is the bottleneck.
Then, make modifications and benchmark to understand which part of the DataFlow is the bottleneck.
Use [TestDataSpeed](../modules/dataflow.html#tensorpack.dataflow.TestDataSpeed).
Do __NOT__ look at training speed when you benchmark a DataFlow.
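A minimal sketch of such a benchmark. Here the `FakeData` is only a stand-in for whatever DataFlow you actually want to measure:

```python
from tensorpack.dataflow import TestDataSpeed, FakeData

# Replace this FakeData with the DataFlow you want to measure.
my_dataflow = FakeData([[224, 224, 3]], size=1000, random=False)
# Pulls datapoints through the DataFlow and prints its speed,
# completely independent of any training.
TestDataSpeed(my_dataflow, size=1000).start()
```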
......@@ -76,7 +82,7 @@ But there may be something cheap you can try:
### Cannot scale to multi-GPU
If you're unable to scale to multiple GPUs almost linearly:
1. First make sure that the ResNet example can scale. Run it with `--fake` to use fake data.
1. First make sure that the ImageNet-ResNet example can scale. Run it with `--fake` to use fake data.
If not, it's a bug or an environment setup problem.
2. Then note that your model may have a different communication-computation pattern that affects efficiency.
There isn't a simple answer to this.
......
......@@ -8,6 +8,7 @@ This example provides a minimal (<2k lines) and faithful implementation of the f
with the support of:
+ Multi-GPU / distributed training
+ [Cross-GPU BatchNorm](https://arxiv.org/abs/1711.07240)
+ [Group Normalization](https://arxiv.org/abs/1803.08494)
## Dependencies
+ Python 3; TensorFlow >= 1.6 (1.4 or 1.5 can run but may crash due to a TF bug);
......@@ -65,12 +66,13 @@ MaskRCNN results contain both box and mask mAP.
| Backbone | mAP<br/>(box/mask) | Detectron mAP <br/> (box/mask) | Time | Configurations <br/> (click to expand) |
| - | - | - | - | - |
| R50-C4 | 33.1 | | 18h on 8 V100s | <details><summary>super quick</summary>`MODE_MASK=False FRCNN.BATCH_PER_IM=64`<br/>`PREPROC.SHORT_EDGE_SIZE=600 PREPROC.MAX_SIZE=1024`<br/>`TRAIN.LR_SCHEDULE=[150000,230000,280000]` </details> |
| R50-C4 | 36.6 | 36.5 | 49h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False` </details> |
| R50-C4 | 36.6 | 36.5 | 44h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False` </details> |
| R50-FPN | 37.5 | 37.9<sup>[1](#ft1)</sup> | 28h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=False MODE_FPN=True` </details> |
| R50-C4 | 36.8/32.1 | | 39h on 8 P100s | <details><summary>quick</summary>`MODE_MASK=True FRCNN.BATCH_PER_IM=256`<br/>`TRAIN.LR_SCHEDULE=[150000,230000,280000]` </details> |
| R50-C4 | 37.8/33.1 | 37.8/32.8 | 51h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True` </details> |
| R50-C4 | 37.8/33.1 | 37.8/32.8 | 45h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True` </details> |
| R50-FPN | 38.1/34.9 | 38.6/34.5<sup>[1](#ft1)</sup> | 32h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True MODE_FPN=True` </details> |
| R50-FPN | 38.5/34.8 | 38.6/34.2<sup>[2](#ft2)</sup> | 34h on 8 V100s | <details><summary>standard+convhead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_head` </details> |
| R50-FPN | 38.5/34.8 | 38.6/34.2<sup>[2](#ft2)</sup> | 34h on 8 V100s | <details><summary>standard+ConvHead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_head` </details> |
| R50-FPN | 39.5/35.2 | 39.5/34.4<sup>[2](#ft2)</sup> | 34h on 8 V100s | <details><summary>standard+ConvGNHead</summary>`MODE_MASK=True MODE_FPN=True`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head` </details> |
| R101-C4 | 40.8/35.1 | | 63h on 8 V100s | <details><summary>standard</summary>`MODE_MASK=True `<br/>`BACKBONE.RESNET_NUM_BLOCK=[3,4,23,3]` </details> |
<a id="ft1">1</a>: Slightly different configurations.
......
......@@ -281,8 +281,8 @@ def get_train_dataflow():
# Valid training images should have at least one fg box.
# But this filter shall not be applied for testing.
num = len(imgs)
imgs = list(filter(lambda img: len(img['boxes']) > 0, imgs)) # log invalid training
logger.info("Filtered {} images which contain no groudtruth boxes. Total #images for training: {}".format(
imgs = list(filter(lambda img: len(img['boxes'][img['is_crowd'] == 0]) > 0, imgs))
logger.info("Filtered {} images which contain no non-crowd groudtruth boxes. Total #images for training: {}".format(
num - len(imgs), len(imgs)))
ds = DataFromList(imgs, shuffle=True)
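To make the new filter concrete: assuming `boxes` is an (N, 4) float array and `is_crowd` an (N,) array of 0/1 flags, as in this codebase, the boolean mask keeps only non-crowd boxes. A toy illustration with made-up numbers:

```python
import numpy as np

# A hypothetical image record with two boxes, both marked as crowd regions.
img = {
    'boxes': np.array([[10, 10, 50, 50], [5, 5, 80, 80]], dtype=np.float32),  # (N, 4)
    'is_crowd': np.array([1, 1], dtype=np.int8),                              # (N,)
}
non_crowd = img['boxes'][img['is_crowd'] == 0]   # boolean mask -> (0, 4) array
print(len(non_crowd))  # 0 -> this image would now be filtered out of training
```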
......
......@@ -111,7 +111,8 @@ def crop_and_resize(image, boxes, box_ind, crop_size, pad_border=True):
However, what we want is (with fpcoor box):
Spacing: w_box / W_crop
Initial point: x0_box + spacing/2 - 0.5
(-0.5 because bilinear sample assumes floating point coordinate (0.0, 0.0) is the same as pixel value (0, 0))
(-0.5 because bilinear sampling (in my definition) assumes the floating point
coordinate (0.0, 0.0) is the same as pixel (0, 0))
This function transforms fpcoor boxes to a format to be used by tf.image.crop_and_resize
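A rough scalar sketch of the transform this docstring describes. The real function operates on batched TF tensors; the helper name and scalar signature below are hypothetical and only meant to make the arithmetic concrete:

```python
def fpcoor_box_to_tf_crop(box, image_hw, crop_hw):
    """Map an fpcoor box [x0, y0, x1, y1] to the normalized [y0, x0, y1, x1]
    box expected by tf.image.crop_and_resize, following the spacing and
    initial-point formulas above."""
    x0, y0, x1, y1 = box
    img_h, img_w = image_hw
    crop_h, crop_w = crop_hw
    spacing_w = (x1 - x0) / crop_w
    spacing_h = (y1 - y0) / crop_h
    # Initial sample point (x0_box + spacing/2 - 0.5), then normalize so that
    # pixel 0 maps to 0.0 and pixel W-1 maps to 1.0.
    nx0 = (x0 + spacing_w / 2 - 0.5) / (img_w - 1)
    ny0 = (y0 + spacing_h / 2 - 0.5) / (img_h - 1)
    nw = spacing_w * (crop_w - 1) / (img_w - 1)
    nh = spacing_h * (crop_h - 1) / (img_h - 1)
    return [ny0, nx0, ny0 + nh, nx0 + nw]

print(fpcoor_box_to_tf_crop([0.0, 0.0, 100.0, 100.0], image_hw=(100, 100), crop_hw=(50, 50)))
```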
......
......@@ -214,7 +214,7 @@ def fastrcnn_predictions(boxes, probs):
"""
FC Heads:
FastRCNN heads for FPN:
"""
......
......@@ -574,7 +574,7 @@ if __name__ == '__main__':
ScheduledHyperParamSetter('learning_rate', lr_schedule),
EvalCallback(*MODEL.get_inference_tensor_names()),
PeakMemoryTracker(),
EstimatedTimeLeft(),
EstimatedTimeLeft(median=True),
SessionRunTimeout(60000).set_chief_only(True), # 1 minute timeout
]
if not is_horovod:
......
......@@ -75,13 +75,14 @@ class EstimatedTimeLeft(Callback):
"""
Estimate the time left until completion of training.
"""
def __init__(self, last_k_epochs=5):
def __init__(self, last_k_epochs=5, median=False):
"""
Args:
last_k_epochs (int): Use the time spent on last k epochs to
estimate total time left.
last_k_epochs (int): Use the time spent on last k epochs to estimate total time left.
median (bool): Use the mean by default. If True, use the median of the time spent on the last k epochs instead.
"""
self._times = deque(maxlen=last_k_epochs)
self._median = median
def _before_train(self):
self._max_epoch = self.trainer.max_epoch
......@@ -92,7 +93,7 @@ class EstimatedTimeLeft(Callback):
self._last_time = time.time()
self._times.append(duration)
average_epoch_time = np.mean(self._times)
time_left = (self._max_epoch - self.epoch_num) * average_epoch_time
epoch_time = np.median(self._times) if self._median else np.mean(self._times)
time_left = (self._max_epoch - self.epoch_num) * epoch_time
if time_left > 0:
logger.info("Estimated Time Left: " + humanize_time_delta(time_left))
......@@ -121,7 +121,7 @@ def dump_dataflow_to_tfrecord(df, path):
sz = 0
with get_tqdm(total=sz) as pbar:
for dp in df.get_data():
writer.write(dumps(dp))
writer.write(dumps(dp).to_pybytes())
pbar.update()
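The `.to_pybytes()` change suggests that `dumps` here returns a pyarrow buffer rather than raw bytes, while `TFRecordWriter.write` needs plain Python `bytes`. A hedged sketch of that conversion, assuming a pyarrow-based serializer and a pyarrow version that still provides `serialize` (it was deprecated in later releases); this is an assumption about what `dumps` does, not something stated in the diff:

```python
import pyarrow as pa

# serialize() -> SerializedPyObject, to_buffer() -> pyarrow.Buffer,
# to_pybytes() -> plain Python bytes, suitable for TFRecordWriter.write().
buf = pa.serialize({'image_id': 3, 'boxes': [[1, 2, 3, 4]]}).to_buffer()
record = buf.to_pybytes()
assert isinstance(record, bytes)
```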
......
......@@ -230,4 +230,6 @@ def is_training_name(name):
return True
if name.startswith('AccumGrad') or name.endswith('/AccumGrad'):
return True
if name.startswith('apply_gradients'):
return True
return False