Commit f636a657 authored by Yuxin Wu's avatar Yuxin Wu

update docs

parent 16c04d1f
@@ -26,51 +26,52 @@ For example, these are the callbacks I used when training a ResNet:
```python
callbacks=[
    # save the model every epoch
    ModelSaver(),
    # backup the model with best validation error
    MinSaver('val-error-top1'),
    # run inference on another Dataflow every epoch, compute classification error and log to monitors
    InferenceRunner(dataset_val, [
        ClassificationError('wrong-top1', 'val-error-top1'),
        ClassificationError('wrong-top5', 'val-error-top5')]),
    # schedule the learning rate based on epoch number
    ScheduledHyperParamSetter('learning_rate',
                              [(30, 1e-2), (60, 1e-3), (85, 1e-4), (95, 1e-5)]),
    # can manually change the learning rate through a file during training
    HumanHyperParamSetter('learning_rate'),
    # send validation error to my phone through pushbullet
    SendStat('curl -u your_id_xxx: https://api.pushbullet.com/v2/pushes \\
             -d type=note -d title="validation error" \\
             -d body={val-error-top1} > /dev/null 2>&1',
             'val-error-top1'),
    # record GPU utilizations during training
    GPUUtilizationTracker(),
    # can pause the training and start a debug shell, to observe what's going on
    InjectShell(shell='ipython')
] + [ # these callbacks are enabled by default already, though you can customize them
    # maintain those moving average summaries already defined in the model (e.g. training loss, training error)
    MovingAverageSummary(),
    # draw a nice progress bar
    ProgressBar(),
    # run `tf.summary.merge_all` every epoch and log to monitors
    MergeAllSummaries(),
    # run ops in GraphKeys.UPDATE_OPS collection along with training, if any
    RunUpdateOps(),
],
monitors=[  # monitors are a special kind of callback; these are also enabled by default
    # write everything to tensorboard
    TFEventWriter(),
    # write all scalar data to a json file, for easy parsing
    JSONWriter(),
    # print all scalar data every epoch (can be configured differently)
    ScalarPrinter(),
]
```
Notice that callbacks cover every detail of training, ranging from graph operations to the progress bar.
This means you can customize every part of the training to your preference, e.g. display something
different in the progress bar, or evaluate part of the summaries at a different frequency.
These features may not always be useful, but think about how messy the main loop would look if you
were to write this logic together with the loops, and how easy your life would be if you could enable
these features with one line when you need them.
@@ -54,7 +54,7 @@ the rest of the data pipeline.
Nevertheless, tensorpack supports data loading with native TF operators / TF datasets as well.
### Use DataFlow (outside Tensorpack)
The tensorpack `InputSource` interface works with DataFlow out-of-the-box.
If you use DataFlow in some custom code, call `reset_state()` first to initialize it,
and then use the generator however you like:
```python
df = MyDataFlow()   # any DataFlow instance (illustrative name)
df.reset_state()    # necessary initialization before use
for datapoint in df.get_data():
    # each datapoint is a list of components
    ...
```
@@ -18,7 +18,7 @@ We will need to reach a speed of, roughly **1k ~ 2k images per second**, to keep
Some things to know before reading:
1. For smaller datasets (e.g. several GBs of images with lightweight preprocessing), a simple reader plus some prefetch should usually work well enough.
Therefore you don't have to understand this tutorial in depth unless you really find your data to be the bottleneck.
This tutorial could be a bit complicated for people new to system architectures, but you do need these techniques to run fast enough on an ImageNet-scale dataset.
2. Having a fast Python generator **alone** may or may not improve your overall training speed.
You need mechanisms to hide the latency of **all** preprocessing stages, as mentioned in the
[previous tutorial](input-source.html).
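As a concrete illustration of this mechanism, the latency of a slow generator can be hidden by decoupling production from consumption with a background thread and a bounded queue. This is a generic sketch of the idea, not tensorpack's actual implementation:

```python
import queue
import threading

def prefetch(generator, buffer_size=32):
    """Run `generator` in a background thread, buffering up to
    `buffer_size` items so the consumer doesn't wait on each item."""
    q = queue.Queue(maxsize=buffer_size)
    _end = object()  # sentinel marking exhaustion

    def worker():
        for item in generator:
            q.put(item)
        q.put(_end)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _end:
            return
        yield item

# The consumer (e.g. a training loop) now overlaps with preprocessing:
total = sum(prefetch(iter(range(5))))
print(total)  # 10
```

A single prefetch stage only hides the latency of the stage directly behind it, which is why the tutorial stresses hiding the latency of **all** stages.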
@@ -20,31 +20,32 @@ def train(self):
```
Note that at each place, the callbacks are called in the order they were given to the trainer.
### Explain the Callback Methods
To write a callback, subclass `Callback` and implement the corresponding underscore-prefixed methods.
You can override any of the following methods to define a new callback:
* `_setup_graph(self)`
  Create any ops / tensors in the graph which you might need to use in the callback.
  This method exists to separate "define" and "run", and to
  avoid the common mistake of creating ops inside loops.
  All changes to the graph should be made in this method.
  To access ops which are already defined,
  you can use TF methods such as
  [`graph.get_tensor_by_name`](https://www.tensorflow.org/api_docs/python/tf/Graph#get_tensor_by_name).
  If you're using a `TowerTrainer` instance, more tools are available:

  * Use `self.trainer.tower_func.towers` to access the
    [tower handles](../modules/tfutils.html#tensorpack.tfutils.tower.TowerTensorHandles),
    and therefore the tensors in each tower.
  * [self.get_tensors_maybe_in_tower()](../modules/callbacks.html#tensorpack.callbacks.Callback.get_tensors_maybe_in_tower)
    is a helper function to access tensors in the first training tower.
  * [self.trainer.get_predictor()](../modules/train.html#tensorpack.train.TowerTrainer.get_predictor)
    is a helper function to create a callable under inference mode.
* `_before_train(self)`
Can be used to run some manual initialization of variables, or start some services for the training.
@@ -60,7 +61,7 @@ Otherwise, `_trigger_epoch` should be enough.
* `_before_run(self, ctx)`, `_after_run(self, ctx, values)`
These are the equivalent of [tf.train.SessionRunHook](https://www.tensorflow.org/api_docs/python/tf/train/SessionRunHook).
Please refer to the TensorFlow documentation for the detailed API.
They are used to run extra ops / eval extra tensors / feed extra values __along with__ the actual training iterations.
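The call order of these methods can be sketched with a minimal stand-in. The classes below only mimic the shape of tensorpack's `Callback` interface to show when each hook fires; they are not the real implementation:

```python
# Schematic mimic of the trainer/callback protocol -- illustrative only.
class Callback:
    def _setup_graph(self): pass
    def _before_train(self): pass
    def _before_run(self, ctx): pass
    def _after_run(self, ctx, values): pass
    def _trigger_epoch(self): pass
    def _after_train(self): pass

class StepCounter(Callback):
    """Counts steps and epochs via the run/trigger hooks."""
    def __init__(self):
        self.steps, self.epochs = 0, 0
    def _before_run(self, ctx):
        self.steps += 1
    def _trigger_epoch(self):
        self.epochs += 1

# A toy training loop invoking the hooks in the documented order:
cb = StepCounter()
cb._setup_graph()            # at graph construction time
cb._before_train()           # once, before the first iteration
for epoch in range(2):
    for step in range(3):
        cb._before_run(ctx=None)
        # ... the actual sess.run(...) happens here ...
        cb._after_run(ctx=None, values=None)
    cb._trigger_epoch()      # once per epoch
cb._after_train()
print(cb.steps, cb.epochs)   # 6 2
```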
@@ -25,7 +25,7 @@ The reasons are:
Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
4 P100 GPUs can train ResNet50 at 852 images/sec, and the size of those images is 852\*224\*224\*3\*4bytes = 489MB.
Assuming you have 5GB/s `memcpy` bandwidth (roughly what a single-threaded copy achieves), simply copying the data once would take 0.1s -- slowing
down your training by 10%. Think about how many more copies are made during your preprocessing.
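The arithmetic above can be checked directly (using the stated, not measured, figures of 852 images/sec and 5GB/s bandwidth):

```python
images_per_sec = 852
bytes_per_image = 224 * 224 * 3 * 4            # 224x224 RGB float32
data_rate = images_per_sec * bytes_per_image   # bytes produced per second
print(round(data_rate / 2**20))                # ~489 (MB per second)

memcpy_bandwidth = 5 * 2**30                   # assumed 5GB/s memcpy
print(round(data_rate / memcpy_bandwidth, 2))  # ~0.1 (seconds spent per extra copy)
```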
Failure to hide the data preparation latency is the major reason why people
@@ -74,6 +74,7 @@ Let's take a look at what users are asking for:
* [Handle dataset that's not a multiple of batch size](https://github.com/tensorflow/tensorflow/issues/13745)
* [Take variable-length np array](https://github.com/tensorflow/tensorflow/issues/13018)
* [Different levels of determinism](https://github.com/tensorflow/tensorflow/issues/13932)
To support these features, which could've been done with 3 lines of code in Python, you need either a new TF
API, or to ask [Dataset.from_generator](https://www.tensorflow.org/versions/r1.4/api_docs/python/tf/contrib/data/Dataset#from_generator)
(i.e. Python again) to come to the rescue.
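For instance, the first request above (a dataset whose size isn't a multiple of the batch size) really is a few lines of plain Python:

```python
def batch(iterable, batch_size, keep_remainder=True):
    """Group items into lists of `batch_size`; the last batch may be smaller."""
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf and keep_remainder:
        yield buf

print(list(batch(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```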
@@ -15,7 +15,7 @@ This is how TensorFlow summaries eventually get logged/saved/printed:
It runs ops in the `SUMMARIES` collection (by default) every epoch (by default),
and writes results to the monitors.
3. __Where to Log__:
Several monitors are [enabled by default](../modules/train.html#tensorpack.train.DEFAULT_MONITORS).
* A [TFEventWriter](../modules/callbacks.html#tensorpack.callbacks.TFEventWriter)
writes things to an event file used by tensorboard.
* A [ScalarPrinter](../modules/callbacks.html#tensorpack.callbacks.ScalarPrinter)
@@ -36,7 +36,7 @@ are likely to have too much variance. To address this issue, you can:
[MovingAverageSummary](../modules/callbacks.html#tensorpack.callbacks.MovingAverageSummary)
callback (enabled by default).
### Other Logging Data
Besides TensorFlow summaries,
a callback can also write other data to the monitor backend at any time once training has started.
@@ -6,7 +6,7 @@ Tensorpack follows the "define-and-run" paradigm. A training has two steps:
1. __Define__: Build graph for the model.
Users can call whatever TensorFlow functions they need to set up the graph.
Users may or may not use tensorpack `InputSource`, `ModelDesc` or other utilities to build the graph.
The goal of this step is to define "what to run" in later training steps,
and it can happen either inside or outside a tensorpack trainer.
2. __Run__: Train the model (the [Trainer.train() method](../modules/train.html#tensorpack.train.Trainer.train)):
@@ -58,7 +58,7 @@ Existing multi-GPU trainers include the logic of single-cost data-parallel train
You can enable them with just one line, and all the necessary logic to achieve the best performance is already baked into the trainers.
The trainers can reach the same performance as the [official tensorflow benchmark](https://www.tensorflow.org/performance/benchmarks).
Please note that in data-parallel training, in each iteration all GPUs (all replicas of the model) will take
tensors from the `InputSource` (instead of taking one batch and splitting it). So the total batch size
would be ``(batch size of InputSource/DataFlow) * #GPU``.
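With hypothetical numbers (a per-tower batch of 64 on 4 GPUs), the arithmetic is:

```python
batch_per_tower = 64     # batch size produced by the InputSource/DataFlow
num_gpus = 4             # number of model replicas
total_batch_size = batch_per_tower * num_gpus
print(total_batch_size)  # 256
```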
@@ -29,16 +29,16 @@ expects 4 arguments in `setup_graph`: `InputDesc`, `InputSource`, get_cost function
```python
class MyModel(ModelDesc):
    def _get_inputs(self):
        return [InputDesc(...), InputDesc(...)]

    def _build_graph(self, inputs):
        tensorA, tensorB = inputs
        # build the graph
        self.cost = xxx   # define the cost tensor

    def _get_optimizer(self):
        return tf.train.GradientDescentOptimizer(0.1)
```
`_get_inputs` should define the metainfo of all the inputs your graph needs in order to be built.
@@ -59,9 +59,9 @@ config = TrainConfig(
    model=MyModel(),
    dataflow=my_dataflow,
    # data=my_inputsource, # alternatively, use a customized InputSource
    callbacks=[...],      # some default callbacks are automatically applied
    # some default monitors are automatically applied
    steps_per_epoch=300,  # defaults to the size of your InputSource/DataFlow
)
trainer = SomeTrainer()