Commit 04a64849 authored by Yuxin Wu

update examples

parent 7d0152ca
@@ -7,16 +7,22 @@ It also contains an implementation of the following papers:
 + [Trained Ternary Quantization](https://arxiv.org/abs/1612.01064), with (W,A,G)=(t,32,32).
 + [Binarized Neural Networks](https://arxiv.org/abs/1602.02830), with (W,A,G)=(1,1,32).
+This is a solid baseline for research in model quantization.
 These quantization techniques achieves the following ImageNet performance in this implementation:
 | Model | W,A,G | Top 1 Error |
 |:-------------------|-------------|------------:|
-| Full Precision | 32,32,32 | 41.4% |
-| TTQ | t,32,32 | 41.9% |
-| BWN | 1,32,32 | 44.3% |
+| Full Precision | 32,32,32 | 40.9% |
+| TTQ | t,32,32 | 41.5% |
+| BWN | 1,32,32 | 43.7% |
 | BNN | 1,1,32 | 53.4% |
-| DoReFa | 1,2,6 | 47.6% |
-| DoReFa | 1,2,4 | 58.4% |
+| DoReFa | 1,2,32 | 47.2% |
+| DoReFa | 1,2,6 | 47.2% |
+| DoReFa | 1,2,4 | 60.9% |
+These numbers were obtained by training on 8 GPUs with a total batch size of 256.
+The DoReFa-Net models reach slightly better performance than our paper, due to
+more sophisticated augmentations.
 We hosted a demo at CVPR16 on behalf of Megvii, Inc, running a real-time 1/4-VGG size DoReFa-Net on ARM and half-VGG size DoReFa-Net on FPGA.
 We're not planning to release our C++ runtime for bit-operations.
......
@@ -29,27 +29,7 @@ http://arxiv.org/abs/1606.06160
 The original experiements are performed on a proprietary framework.
 This is our attempt to reproduce it on tensorpack & TensorFlow.
-Accuracy:
-Trained with 4 GPUs and (W,A,G)=(1,2,6), it can reach top-1 single-crop validation error of 47.6%,
-after 70 epochs. This number is better than what's in the paper due to more sophisticated augmentations.
-With (W,A,G)=(32,32,32) -- full precision baseline, 41.4% error.
-With (W,A,G)=(t,32,32) -- TTQ, 41.9% error
-With (W,A,G)=(1,32,32) -- BWN, 44.3% error
-With (W,A,G)=(1,1,32) -- BNN, 53.4% error
-With (W,A,G)=(1,2,6), 47.6% error
-With (W,A,G)=(1,2,4), 58.4% error
-Training with 2 or 8 GPUs is supported but the result may get slightly
-different, due to limited per-GPU batch size.
-You may want to adjust total batch size and learning rate accordingly.
-Speed:
-About 11 iteration/s on 4 P100s. (Each epoch is set to 10000 iterations)
-Note that this code was written early without using NCHW format. You
-should expect a speed up if the code is ported to NCHW format.
-To Train, for example:
+To Train:
 ./alexnet-dorefa.py --dorefa 1,2,6 --data PATH --gpu 0,1
 PATH should look like:
@@ -75,7 +55,7 @@ To run pretrained model:
 BITW = 1
 BITA = 2
 BITG = 6
-TOTAL_BATCH_SIZE = 128
+TOTAL_BATCH_SIZE = 256
 BATCH_SIZE = None
@@ -86,6 +66,7 @@ class Model(ModelDesc):
     def build_graph(self, image, label):
         image = image / 255.0
+        image = tf.transpose(image, [0, 3, 1, 2])
         if BITW == 't':
             fw, fa, fg = get_dorefa(32, 32, 32)
@@ -112,6 +93,7 @@ class Model(ModelDesc):
             return fa(nonlin(x))
         with remap_variables(new_get_variable), \
+                argscope([Conv2D, BatchNorm, MaxPooling], data_format='channels_first'), \
                 argscope(BatchNorm, momentum=0.9, epsilon=1e-4), \
                 argscope(Conv2D, use_bias=False):
             logits = (LinearWrap(image)
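Note on the layout change: the added `tf.transpose(image, [0, 3, 1, 2])` reorders the input from NHWC to NCHW, which matches the `data_format='channels_first'` argument now applied to `Conv2D`, `BatchNorm` and `MaxPooling`. The sketch below only illustrates how that axis permutation maps shapes and indices. It uses plain Python, the helper names `permute_shape` and `permute_index` are hypothetical, and the batch and image sizes are assumed for illustration.

```python
# Axis permutation used by the commit: output axis i takes input axis PERM[i].
PERM = (0, 3, 1, 2)  # NHWC -> NCHW


def permute_shape(shape, perm):
    """Return the shape a tensor has after its axes are permuted by `perm`."""
    return tuple(shape[axis] for axis in perm)


def permute_index(index, perm):
    """Return where the element at `index` ends up after the permutation."""
    return tuple(index[axis] for axis in perm)


# Assumed example: a batch of 32 RGB images of 224x224 pixels in NHWC layout.
nhwc_shape = (32, 224, 224, 3)
nchw_shape = permute_shape(nhwc_shape, PERM)

# The pixel at (image 1, row 5, column 7, channel 2) in NHWC ...
nhwc_index = (1, 5, 7, 2)
# ... is addressed as (image 1, channel 2, row 5, column 7) in NCHW.
nchw_index = permute_index(nhwc_index, PERM)

print(nchw_shape)
print(nchw_index)
```

The values are unchanged by the transpose; only the order of the axes differs, so every layer that consumes the tensor has to agree on the layout.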
@@ -170,7 +152,7 @@ class Model(ModelDesc):
         return total_cost
     def optimizer(self):
-        lr = tf.get_variable('learning_rate', initializer=1e-4, trainable=False)
+        lr = tf.get_variable('learning_rate', initializer=2e-4, trainable=False)
         return tf.train.AdamOptimizer(lr, epsilon=1e-5)
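Note on the learning rate: this hunk doubles the initial learning rate from 1e-4 to 2e-4, and the same commit doubles `TOTAL_BATCH_SIZE` from 128 to 256. The commit does not state a rationale. The numbers are consistent with the common linear-scaling heuristic, which the sketch below illustrates. `scale_lr` is a hypothetical helper, not part of tensorpack or TensorFlow.

```python
import math


def scale_lr(base_lr, base_batch, new_batch):
    """Scale a learning rate in proportion to the total batch size.

    This is the linear-scaling heuristic; it is an assumption about the
    intent of the commit, not something the commit message states.
    """
    return base_lr * new_batch / base_batch


old_lr, old_batch = 1e-4, 128
new_batch = 256
new_lr = scale_lr(old_lr, old_batch, new_batch)

# Compare against the value the commit sets, allowing for float rounding.
print(math.isclose(new_lr, 2e-4))
```

When training with a different total batch size, the same function gives a starting point for the learning rate, which may still need tuning.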
@@ -189,17 +171,15 @@ def get_config():
         dataflow=data_train,
         callbacks=[
             ModelSaver(),
-            # HumanHyperParamSetter('learning_rate'),
             ScheduledHyperParamSetter(
-                'learning_rate', [(56, 2e-5), (64, 4e-6)]),
+                'learning_rate', [(60, 4e-5), (75, 8e-6)]),
             InferenceRunner(data_test,
-                            [ScalarStats('cost'),
-                             ClassificationError('wrong-top1', 'val-error-top1'),
+                            [ClassificationError('wrong-top1', 'val-error-top1'),
                              ClassificationError('wrong-top5', 'val-error-top5')])
         ],
         model=Model(),
-        steps_per_epoch=10000,
-        max_epoch=100,
+        steps_per_epoch=1280000 // TOTAL_BATCH_SIZE,
+        max_epoch=90,
     )
......
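Note on the new schedule: with `TOTAL_BATCH_SIZE = 256`, the expression `1280000 // TOTAL_BATCH_SIZE` gives 5000 steps per epoch, so 90 epochs amount to 450,000 iterations, compared with 100 x 10,000 iterations at batch size 128 before. The sketch below works through that arithmetic and the piecewise learning rate. It assumes that a value in the `ScheduledHyperParamSetter` list takes effect once the listed epoch has completed, and `lr_for_epoch` is a hypothetical helper written for this note, not a tensorpack function.

```python
TOTAL_BATCH_SIZE = 256
SAMPLES_PER_EPOCH = 1280000  # constant taken from the diff

steps_per_epoch = SAMPLES_PER_EPOCH // TOTAL_BATCH_SIZE
max_epoch = 90
total_steps = steps_per_epoch * max_epoch

# Samples processed over the whole run, before and after the commit.
old_samples = 10000 * 100 * 128
new_samples = total_steps * TOTAL_BATCH_SIZE


def lr_for_epoch(epoch, initial=2e-4, schedule=((60, 4e-5), (75, 8e-6))):
    """Learning rate in effect while training the 1-based `epoch`.

    Assumption: each (boundary, value) pair applies only after epoch
    `boundary` has finished, so it is first used in epoch boundary + 1.
    """
    lr = initial
    for boundary, value in schedule:
        if epoch > boundary:
            lr = value
    return lr


print(steps_per_epoch, total_steps)
print(old_samples, new_samples)
for epoch in (1, 60, 61, 75, 76, 90):
    print(epoch, lr_for_epoch(epoch))
```

Under these assumptions the new run processes about 115 million samples in total, somewhat fewer than the roughly 128 million of the old configuration.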