Commit 0ee4d8b0 authored by Yuxin Wu

Add AlexNet script; Memory tracker only at the end of epoch.

parent ef3bceff
Official code and model for the paper:
+ [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](http://arxiv.org/abs/1606.06160).
It also contains an implementation of the following papers:
+ [Binary Weight Network](https://arxiv.org/abs/1511.00363), with (W,A,G)=(1,32,32).
+ [Trained Ternary Quantization](https://arxiv.org/abs/1612.01064), with (W,A,G)=(t,32,32).
+ [Binarized Neural Networks](https://arxiv.org/abs/1602.02830), with (W,A,G)=(1,1,32).
This is a good set of baselines for research in model quantization.
These quantization techniques achieve the following ImageNet performance in this implementation:
| Model | W,A,G | Top 1 Validation Error |
|:---------------|----------|-----------------------:|
| Full Precision | 32,32,32 | 40.3% |
| TTQ | t,32,32 | 42.0% |
| BWN | 1,32,32 | 44.6% |
We hosted a demo at CVPR16 on behalf of Megvii, Inc., running a real-time 1/4-VGG size DoReFa-Net on ARM and a half-VGG size DoReFa-Net on FPGA.
We're not planning to release our C++ runtime for bit-operations.
In this repo, quantized operations are all performed through `tf.float32`.
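As a rough illustration, the k-bit quantizer described in the paper can be emulated in float32 along these lines. This is a minimal TF 1.x sketch, not necessarily identical to this repo's `dorefa.py`:

```python
import tensorflow as tf

def quantize(x, k):
    # Map x in [0, 1] onto the 2^k - 1 evenly spaced levels representable
    # with k bits, but keep the result stored in float32.
    # The gradient override is the straight-through estimator: gradients
    # pass through the non-differentiable round() unchanged.
    n = float(2 ** k - 1)
    with tf.get_default_graph().gradient_override_map({'Round': 'Identity'}):
        return tf.round(x * n) / n
```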
Pretrained models for (1,4,32)-ResNet18 and (1,2,6)-AlexNet are available at
[tensorpack model zoo](http://models.tensorpack.com/DoReFa-Net/).
They're provided in the format of a numpy dictionary.
The __binary-weight 4-bit-activation ResNet-18__ model has 59.2% top-1 validation accuracy.
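For example, the downloaded weights can be inspected with numpy alone (a sketch; the filename is a placeholder for whatever you fetch from the model zoo):

```python
import numpy as np

f = np.load('downloaded-model.npz')            # placeholder filename
weights = {name: f[name] for name in f.files}  # variable name -> array
# (older releases may instead be a pickled dict:
#  weights = np.load('model.npy', encoding='latin1').item())
for name, arr in sorted(weights.items())[:5]:
    print(name, arr.shape, arr.dtype)
```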
Alternative link to this page: [http://dorefa.net](http://dorefa.net)
## Use
+ Install [tensorpack](https://github.com/ppwwyyxx/tensorpack) and scipy.
ImageNet training code of ResNet, ShuffleNet, DoReFa-Net, AlexNet, Inception, VGG with tensorpack.
To train any of the models, just do `./{model}.py --data /path/to/ilsvrc`.
Expected format of data directory is described in [docs](http://tensorpack.readthedocs.io/en/latest/modules/dataflow.dataset.html#tensorpack.dataflow.dataset.ILSVRC12).
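Roughly, that is the original decompressed ILSVRC12 layout; the linked docs are authoritative, and the names below are just examples:

```
/path/to/ilsvrc/
  train/
    n01440764/
      n01440764_10026.JPEG
      ...
  val/
    ILSVRC2012_val_00000001.JPEG
    ...
```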
Pretrained models can be downloaded at [tensorpack model zoo](http://models.tensorpack.com/).

### ShuffleNet
Reproduce [ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices](https://arxiv.org/abs/1707.01083)
on ImageNet.
This is a 38Mflops ShuffleNet, corresponding to `ShuffleNet 0.5x g=3` in __the 2nd arxiv version__ of the paper.
After 240 epochs (36 hours on 8 P100s) it reaches top-1 error of 42.32%, matching the paper's number.
To print flops:
```bash
./shufflenet.py --flops
```

Evaluate the [pretrained model](http://models.tensorpack.com/ShuffleNet/):
```bash
./shufflenet.py --eval --data /path/to/ilsvrc --load /path/to/model
```
### AlexNet
This AlexNet script is quite close to the setting in its
[original paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks).
Trained with 64x2 batch size, the script reaches 58% single-crop validation
accuracy after 100 epochs. It also generates first-layer filter visualizations
in TensorBoard, similar to those in the paper.
### Inception-BN, VGG16
This Inception-BN script reaches 27% single-crop validation error after 300k steps with 6 GPUs.
The training recipe is very different from the original paper because the paper
is a bit vague on these details.
This VGG16 script, when trained with 32x8 batch size, reaches the following
validation error after 100 epochs (30h with 8 P100s). This is the code for the VGG
experiments in the paper [Group Normalization](https://arxiv.org/abs/1803.08494).
| No Normalization | Batch Normalization | Group Normalization |
|:------------------------------------------|---------------------|--------------------:|
| 29~30% (large variation with random seed) | 28% | 27.6% |
### ResNet, DoReFa-Net

See [ResNet examples](../ResNet) and [DoReFa-Net examples](../DoReFa-Net).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# File: alexnet.py

import argparse
import os

import cv2
import numpy as np
import tensorflow as tf

from tensorpack import *
from tensorpack.dataflow import imgaug
from tensorpack.tfutils import argscope
from tensorpack.utils.gpu import get_num_gpu

from imagenet_utils import ImageNetModel, get_imagenet_dataflow

def visualize_conv1_weights(filters):
    # only visualize on the main training tower
    ctx = get_current_tower_context()
    if not ctx.is_main_training_tower:
        return
    with tf.name_scope('visualize_conv1'):
        # lay out the 96 11x11x3 filters as an 8x12 grid of RGB patches,
        # i.e. a single (8*11) x (12*11) image
        filters = tf.reshape(filters, [11, 11, 3, 8, 12])
        filters = tf.transpose(filters, [3, 0, 4, 1, 2])    # 8,11,12,11,3
        filters = tf.reshape(filters, [1, 88, 132, 3])
        tf.summary.image('visualize_conv1', filters, max_outputs=1, collections=['AAA'])

class Model(ImageNetModel):
    weight_decay = 5e-4
    data_format = 'NHWC'  # LRN only supports NHWC

    def get_logits(self, image):
        gauss_init = tf.random_normal_initializer(stddev=0.01)
        with argscope(Conv2D,
                      kernel_initializer=tf.variance_scaling_initializer(scale=2.)), \
                argscope([Conv2D, FullyConnected], activation=tf.nn.relu), \
                argscope([Conv2D, MaxPooling], data_format='channels_last'):
            # necessary padding to get 55x55 after conv1:
            # with an 11x11 kernel, stride 4, VALID: (228 - 11) // 4 + 1 = 55
            image = tf.pad(image, [[0, 0], [2, 2], [2, 2], [0, 0]])
            l = Conv2D('conv1', image, filters=96, kernel_size=11, strides=4, padding='VALID')
            # size: 55
            visualize_conv1_weights(l.variables.W)
            l = tf.nn.lrn(l, 2, bias=1.0, alpha=2e-5, beta=0.75, name='norm1')
            l = MaxPooling('pool1', l, 3, strides=2, padding='VALID')
            # 27
            l = Conv2D('conv2', l, filters=256, kernel_size=5, split=2)
            l = tf.nn.lrn(l, 2, bias=1.0, alpha=2e-5, beta=0.75, name='norm2')
            l = MaxPooling('pool2', l, 3, strides=2, padding='VALID')
            # 13
            l = Conv2D('conv3', l, filters=384, kernel_size=3)
            l = Conv2D('conv4', l, filters=384, kernel_size=3, split=2)
            l = Conv2D('conv5', l, filters=256, kernel_size=3, split=2)
            l = MaxPooling('pool3', l, 3, strides=2, padding='VALID')

            l = FullyConnected('fc6', l, 4096,
                               kernel_initializer=gauss_init,
                               bias_initializer=tf.ones_initializer())
            l = Dropout(l, rate=0.5)
            l = FullyConnected('fc7', l, 4096, kernel_initializer=gauss_init)
            l = Dropout(l, rate=0.5)
        # fc8 sits outside the ReLU argscope: the logits get no activation
        logits = FullyConnected('fc8', l, 1000, kernel_initializer=gauss_init)
        return logits

def get_data(name, batch):
    isTrain = name == 'train'
    if isTrain:
        augmentors = [
            imgaug.ResizeShortestEdge(256, cv2.INTER_CUBIC),
            imgaug.RandomCrop(224),
            # PCA ("fancy") lighting noise from the AlexNet paper;
            # eigval/eigvec are reversed because images are loaded as BGR
            imgaug.Lighting(0.1,
                            eigval=np.asarray(
                                [0.2175, 0.0188, 0.0045][::-1]) * 255.0,
                            eigvec=np.array(
                                [[-0.5675, 0.7192, 0.4009],
                                 [-0.5808, -0.0045, -0.8140],
                                 [-0.5836, -0.6948, 0.4203]],
                                dtype='float32')[::-1, ::-1]),
            imgaug.Flip(horiz=True)]
    else:
        augmentors = [
            imgaug.ResizeShortestEdge(256, cv2.INTER_CUBIC),
            imgaug.CenterCrop((224, 224))]
    return get_imagenet_dataflow(args.data, name, batch, augmentors)

def get_config():
    nr_tower = max(get_num_gpu(), 1)
    batch = args.batch
    total_batch = batch * nr_tower
    if total_batch != 128:
        logger.warn("AlexNet needs to be trained with a total batch size of 128.")
    # linearly scale the learning rate with the total batch size
    BASE_LR = 0.01 * (total_batch / 128.)
    logger.info("Running on {} towers. Batch size per tower: {}".format(nr_tower, batch))

    dataset_train = get_data('train', batch)
    dataset_val = get_data('val', batch)

    infs = [ClassificationError('wrong-top1', 'val-error-top1'),
            ClassificationError('wrong-top5', 'val-error-top5')]
    callbacks = [
        ModelSaver(),
        GPUUtilizationTracker(),
        EstimatedTimeLeft(),
        # decay the learning rate by 10x at epochs 30, 60 and 80
        ScheduledHyperParamSetter(
            'learning_rate',
            [(30, BASE_LR * 1e-1), (60, BASE_LR * 1e-2), (80, BASE_LR * 1e-3)]),
        DataParallelInferenceRunner(
            dataset_val, infs, list(range(nr_tower))),
    ]
    return TrainConfig(
        model=Model(),
        data=StagingInput(QueueInput(dataset_train)),
        callbacks=callbacks,
        steps_per_epoch=1281167 // total_batch,  # one epoch = the full ImageNet train set
        max_epoch=100,
    )

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', help='comma separated list of GPU(s) to use.')
    parser.add_argument('--data', help='ILSVRC dataset dir')
    parser.add_argument('--batch', type=int, default=32, help='batch per GPU')
    parser.add_argument('--load', help='load model')
    args = parser.parse_args()

    if args.gpu:
        os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu

    logger.set_logger_dir(os.path.join('train_log', 'AlexNet'))
    config = get_config()
    nr_tower = max(get_num_gpu(), 1)
    trainer = SyncMultiGPUTrainerReplicated(nr_tower)
    launch_train_with_config(config, trainer)

class PeakMemoryTracker(Callback):
    """
    Track peak memory used on each GPU device every epoch, by :mod:`tf.contrib.memory_stats`.
    The peak memory comes from the `MaxBytesInUse` op, which might span
    multiple session.run.
    See https://github.com/tensorflow/tensorflow/pull/13107.
    """
    def _setup_graph(self):
        # the body above this point was elided by the diff; a plausible
        # reconstruction: create one MaxBytesInUse op per tracked device
        from tensorflow.contrib.memory_stats import MaxBytesInUse
        ops = []
        for dev in self._devices:
            with tf.device(dev):
                ops.append(MaxBytesInUse())
        self._fetches = tf.train.SessionRunArgs(fetches=ops)
    def _before_run(self, _):
        # only fetch the memory-usage ops on the last step of each epoch
        if self.local_step == self.trainer.steps_per_epoch - 1:
            return self._fetches
        return None
    def _after_run(self, _, rv):
        results = rv.results
        if results is not None:
            for mem, dev in zip(results, self._devices):
                self.trainer.monitors.put_scalar('PeakMemory(MB) ' + dev, mem / 1e6)
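For context, here is how this callback might be attached to a training setup like the `alexnet.py` one above. A minimal sketch assuming tensorpack's public API, not code from this commit:

```python
from tensorpack import TrainConfig, QueueInput
from tensorpack.callbacks import ModelSaver, PeakMemoryTracker

def make_config(model, dataset_train, steps_per_epoch):
    # With this change, PeakMemoryTracker fetches MaxBytesInUse only on the
    # last step of each epoch, so it adds no per-step overhead.
    return TrainConfig(
        model=model,
        data=QueueInput(dataset_train),
        callbacks=[ModelSaver(), PeakMemoryTracker()],
        steps_per_epoch=steps_per_epoch,
        max_epoch=100,
    )
```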