Commit f636a657 authored by Yuxin Wu

update docs

parent 16c04d1f
@@ -26,51 +26,52 @@ For example, these are the callbacks I used when training a ResNet:
```python
callbacks=[
    # save the model every epoch
    ModelSaver(),
    # backup the model with best validation error
    MinSaver('val-error-top1'),
    # run inference on another Dataflow every epoch, compute classification error and log to monitors
    InferenceRunner(dataset_val, [
        ClassificationError('wrong-top1', 'val-error-top1'),
        ClassificationError('wrong-top5', 'val-error-top5')]),
    # schedule the learning rate based on epoch number
    ScheduledHyperParamSetter('learning_rate',
                              [(30, 1e-2), (60, 1e-3), (85, 1e-4), (95, 1e-5)]),
    # can manually change the learning rate through a file during training
    HumanHyperParamSetter('learning_rate'),
    # send validation error to my phone through pushbullet
    SendStat('curl -u your_id_xxx: https://api.pushbullet.com/v2/pushes \\
             -d type=note -d title="validation error" \\
             -d body={val-error-top1} > /dev/null 2>&1',
             'val-error-top1'),
    # record GPU utilizations during training
    GPUUtilizationTracker(),
    # can pause the training and start a debug shell, to observe what's going on
    InjectShell(shell='ipython')
] + [   # these callbacks are enabled by default already, though you can customize them
    # maintain those moving average summaries already defined in the model (e.g. training loss, training error)
    MovingAverageSummary(),
    # draw a nice progress bar
    ProgressBar(),
    # run `tf.summary.merge_all` every epoch and log to monitors
    MergeAllSummaries(),
    # run ops in GraphKeys.UPDATE_OPS collection along with training, if any
    RunUpdateOps(),
],
monitors=[   # monitors are a special kind of callbacks. these are also enabled by default
    # write everything to tensorboard
    TFEventWriter(),
    # write all scalar data to a json file, for easy parsing
    JSONWriter(),
    # print all scalar data every epoch (can be configured differently)
    ScalarPrinter(),
]
```
Notice that callbacks cover every detail of training, ranging from graph operations to the progress bar.
This means you can customize every part of the training to your preference, e.g. display something
different in the progress bar, or evaluate part of the summaries at a different frequency.
These features may not always be useful, but think about how messy the main loop would look if you
were to write all this logic together with the loops, and how easy your life would be if you could enable
these features with one line when you need them.
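As a small illustration of such customization, the default callbacks shown above accept arguments; the tensor name and the period below are made up, so check the API documentation for the exact signatures:

```python
from tensorpack import ProgressBar, MergeAllSummaries

callbacks = [
    # also display this (hypothetical) tensor's value in the progress bar
    ProgressBar(['tower0/cost']),
    # evaluate the merged summaries every 100 steps instead of only at the end of each epoch
    MergeAllSummaries(period=100),
]
```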
@@ -54,7 +54,7 @@ the rest of the data pipeline.
Nevertheless, tensorpack supports data loading with native TF operators / TF datasets as well.

### Use DataFlow (outside Tensorpack)
The tensorpack `InputSource` interface works with DataFlow out of the box.
If you use DataFlow in some custom code, call `reset_state()` first to initialize it,
and then use the generator however you like:
```python
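# A rough sketch rather than the tutorial's original snippet; `MyDataFlow`
# stands for whatever DataFlow you constructed.
df = MyDataFlow()
df.reset_state()                   # initialize it (e.g. RNG, forked resources) before first use
for datapoint in df.get_data():    # in newer versions, simply `for datapoint in df:`
    # datapoint is a list of components; consume it however you like
    ...
```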
@@ -18,7 +18,7 @@ We will need to reach a speed of, roughly **1k ~ 2k images per second**, to keep
Some things to know before reading:
1. For smaller datasets (e.g. several GBs of images with lightweight preprocessing), a simple reader plus some prefetch should usually work well enough (see the short sketch after this list).
   Therefore you don't have to understand this tutorial in depth unless you really find your data to be the bottleneck.
   This tutorial could be a bit complicated for people new to system architectures, but you do need these techniques to run fast enough on an ImageNet-scale dataset.
2. Having a fast Python generator **alone** may or may not improve your overall training speed.
   You need mechanisms to hide the latency of **all** preprocessing stages, as mentioned in the
   [previous tutorial](input-source.html).
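A minimal sketch of the "simple reader plus some prefetch" pattern mentioned in point 1; the dataset path, the number of worker processes, and the batch size are placeholders, not recommendations:

```python
from tensorpack.dataflow import dataset, PrefetchDataZMQ, BatchData

# a simple reader: yields [image, label] datapoints one by one
ds = dataset.ILSVRC12('/path/to/ILSVRC12', 'train', shuffle=True)
# hide read/preprocess latency by running the reader in 16 worker processes
ds = PrefetchDataZMQ(ds, nr_proc=16)
ds = BatchData(ds, 128)
```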
@@ -20,31 +20,32 @@ def train(self):
```
Note that at each of these places, the callbacks are called in the order they were given to the trainer.

### Explain the Callback Methods
To write a callback, subclass `Callback` and implement the corresponding underscore-prefixed methods.
You can override any of the following methods to define a new callback:

* `_setup_graph(self)`

Create any ops / tensors in the graph which you might need to use in the callback.
This method exists to separate "define" and "run", and also to
avoid the common mistake of creating ops inside
loops. All changes to the graph should be made in this method.

To access ops which are already defined,
you can use TF methods such as
[`graph.get_tensor_by_name`](https://www.tensorflow.org/api_docs/python/tf/Graph#get_tensor_by_name).
If you're using a `TowerTrainer` instance, more tools are available:

* Use `self.trainer.tower_func.towers` to access the
  [tower handles](../modules/tfutils.html#tensorpack.tfutils.tower.TowerTensorHandles),
  and therefore the tensors in each tower.
* [self.get_tensors_maybe_in_tower()](../modules/callbacks.html#tensorpack.callbacks.Callback.get_tensors_maybe_in_tower)
  is a helper function to access tensors in the first training tower.
* [self.trainer.get_predictor()](../modules/train.html#tensorpack.train.TowerTrainer.get_predictor)
  is a helper function to create a callable under inference mode.
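For example, a tiny custom callback that only implements `_setup_graph` could look like the sketch below; the tensor name is hypothetical:

```python
import tensorflow as tf
from tensorpack.callbacks import Callback

class GrabLoss(Callback):
    """Toy callback: look up an existing tensor once, at graph-construction time."""
    def _setup_graph(self):
        # "define": grab the handle here, never inside the training loop
        self._loss = tf.get_default_graph().get_tensor_by_name('tower0/cost:0')  # hypothetical name
```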
* `_before_train(self)`

Can be used to run some manual initialization of variables, or start some services for the training.
@@ -60,7 +61,7 @@ Otherwise, `_trigger_epoch` should be enough.
* `_before_run(self, ctx)`, `_after_run(self, ctx, values)`

These are the equivalent of [tf.train.SessionRunHook](https://www.tensorflow.org/api_docs/python/tf/train/SessionRunHook).
Please refer to the TensorFlow documentation for the detailed API.
They are used to run extra ops / evaluate extra tensors / feed extra values __along with__ the actual training iterations.
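Continuing the toy callback sketched earlier (still an illustration, not the library's own example), the pair could be used to fetch an extra tensor in every training step:

```python
import tensorflow as tf
from tensorpack.callbacks import Callback

class GrabLoss(Callback):
    def _setup_graph(self):
        # hypothetical tensor name, as in the earlier sketch
        self._loss = tf.get_default_graph().get_tensor_by_name('tower0/cost:0')

    def _before_run(self, ctx):
        # ask the hooked session to additionally fetch this tensor in this iteration
        return [self._loss]

    def _after_run(self, ctx, values):
        # values.results holds the fetched values, in the order returned by _before_run
        loss = values.results[0]
        # ... e.g. accumulate `loss` here and report it in _trigger_epoch
```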
@@ -25,7 +25,7 @@ The reasons are:
Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
4 P100 GPUs can train ResNet50 at 852 images/sec, and the total size of those images is 852\*224\*224\*3\*4 bytes = 489MB.
Assuming you have 5GB/s `memcpy` bandwidth (roughly what a single-threaded copy achieves), simply copying the data once would take 0.1s -- slowing
down your training by 10%. Think about how many more copies are made during your preprocessing.
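Spelled out, the arithmetic behind those two numbers:

```python
# back-of-the-envelope check of the figures above
bytes_per_second = 852 * 224 * 224 * 3 * 4        # ~489 MB of float32 image data per second
copy_time = bytes_per_second / (5 * 1024 ** 3)    # ~0.1 s of copying per second of training at 5GB/s
```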
Failure to hide the data preparation latency is the major reason why people
@@ -74,6 +74,7 @@ Let's take a look at what users are asking for:
* [Handle dataset that's not a multiple of batch size](https://github.com/tensorflow/tensorflow/issues/13745)
* [Take variable-length np array](https://github.com/tensorflow/tensorflow/issues/13018)
* [Different levels of determinism](https://github.com/tensorflow/tensorflow/issues/13932)

To support these features, which could've been done with 3 lines of code in Python, you need either a new TF
API, or have to ask [Dataset.from_generator](https://www.tensorflow.org/versions/r1.4/api_docs/python/tf/contrib/data/Dataset#from_generator)
(i.e. Python again) to come to the rescue.
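As an illustration of the "3 lines of code" claim, the first request above (a dataset whose size is not a multiple of the batch size) looks roughly like this in plain Python; `data` and `batch_size` are placeholders:

```python
def batched(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]   # the final batch is simply smaller
```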
@@ -15,7 +15,7 @@ This is how TensorFlow summaries eventually get logged/saved/printed:
   It runs ops in the `SUMMARIES` collection (by default) every epoch (by default),
   and writes results to the monitors.
3. __Where to Log__:
   Several monitors are [enabled by default](../modules/train.html#tensorpack.train.DEFAULT_MONITORS).
   * A [TFEventWriter](../modules/callbacks.html#tensorpack.callbacks.TFEventWriter)
     writes things to an event file used by tensorboard.
   * A [ScalarPrinter](../modules/callbacks.html#tensorpack.callbacks.ScalarPrinter)
@@ -36,7 +36,7 @@ are likely to have too much variance. To address this issue, you can:
   [MovingAverageSummary](../modules/callbacks.html#tensorpack.callbacks.MovingAverageSummary)
   callback (enabled by default).
### Other Logging Data

Besides TensorFlow summaries,
a callback can also write other data to the monitor backend at any time once training has started.
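For instance, a sketch of pushing a custom scalar from within a callback; the metric name and value are made up:

```python
from tensorpack.callbacks import Callback

class ReportMetric(Callback):
    def _trigger_epoch(self):
        # compute something yourself, then hand it to every registered monitor
        self.trainer.monitors.put_scalar('custom/val-metric', 0.42)
```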
@@ -6,7 +6,7 @@ Tensorpack follows the "define-and-run" paradigm. A training has two steps:
1. __Define__: Build the graph for the model.
   Users can call whatever tensorflow functions they need to set up the graph.
   Users may or may not use tensorpack `InputSource`, `ModelDesc` or other utilities to build the graph.
   The goal of this step is to define "what to run" in later training steps,
   and it can happen either inside or outside a tensorpack trainer.
2. __Run__: Train the model (the [Trainer.train() method](../modules/train.html#tensorpack.train.Trainer.train)):
@@ -58,7 +58,7 @@ Existing multi-GPU trainers include the logic of single-cost data-parallel train
You can enable them with just one line, and all the necessary logic to achieve the best performance is already baked into the trainers.
The trainers can reach the same performance as the [official tensorflow benchmark](https://www.tensorflow.org/performance/benchmarks).

Please note that in data-parallel training, in each iteration all GPUs (all replicates of the model) will take
tensors from the `InputSource` (instead of taking one for all and splitting). So the total batch size
would be ``(batch size of InputSource/DataFlow) * #GPU``.
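That "one line" is roughly the following sketch; which trainer class you pick and how many GPUs you use are up to you, and the exact constructor arguments may differ between versions:

```python
from tensorpack import SyncMultiGPUTrainerReplicated, launch_train_with_config

# data-parallel training on 2 GPUs, reusing the same TrainConfig built elsewhere in the tutorials
launch_train_with_config(config, SyncMultiGPUTrainerReplicated(2))
```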
@@ -29,16 +29,16 @@ expects 4 arguments in `setup_graph`: `InputDesc`, `InputSource`, get_cost funct
```python
class MyModel(ModelDesc):
    def _get_inputs(self):
        return [InputDesc(...), InputDesc(...)]

    def _build_graph(self, inputs):
        tensorA, tensorB = inputs
        # build the graph
        self.cost = xxx   # define the cost tensor

    def _get_optimizer(self):
        return tf.train.GradientDescentOptimizer(0.1)
```
`_get_inputs` should define the metainfo of all the inputs your graph needs in order to build.
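For example, an image classifier might declare its inputs as follows; the dtypes, shapes, and names here are illustrative rather than taken from the tutorial:

```python
import tensorflow as tf
from tensorpack import ModelDesc, InputDesc

class MyModel(ModelDesc):
    def _get_inputs(self):
        # dtype, shape (None = variable-size batch dimension), and a name used to refer to the input
        return [InputDesc(tf.float32, [None, 224, 224, 3], 'image'),
                InputDesc(tf.int32, [None], 'label')]
```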
@@ -59,9 +59,9 @@ config = TrainConfig(
    model=MyModel(),
    dataflow=my_dataflow,
    # data=my_inputsource, # alternatively, use a customized InputSource
    callbacks=[...],       # some default callbacks are automatically applied
    # some default monitors are automatically applied
    steps_per_epoch=300,   # defaults to the size of your InputSource/DataFlow
)
trainer = SomeTrainer()