Commit f636a657 authored by Yuxin Wu

update docs

parent 16c04d1f
@@ -71,6 +71,7 @@ monitors=[ # monitors are a special kind of callbacks. these are also ena
Notice that callbacks cover every detail of training, ranging from graph operations to the progress bar.
This means you can customize every part of the training to your preference, e.g. display something
different in the progress bar, evaluate part of the summaries at a different frequency, etc.
These features may not always be useful, but think about how messy the main loop would look if you
were to write this logic together with the loops, and how easy your life would be if you could enable
these features with one line when you need them.
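For instance (a hedged sketch: the callback names below come from tensorpack's callbacks module, but the metric name and the schedule values are made up for illustration), enabling such a feature is typically a single entry in the callbacks list:
```python
from tensorpack.callbacks import ModelSaver, MinSaver, ScheduledHyperParamSetter

callbacks = [
    ModelSaver(),                      # save a checkpoint every epoch
    MinSaver('val-error'),             # also keep the checkpoint with the lowest 'val-error'
    ScheduledHyperParamSetter(         # decay the learning rate on a fixed schedule
        'learning_rate', [(30, 1e-2), (60, 1e-3)]),
]
```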
@@ -54,7 +54,7 @@ the rest of the data pipeline.
Nevertheless, tensorpack supports data loading with native TF operators / TF datasets as well.
### Use DataFlow (outside Tensorpack)
The tensorpack `InputSource` interface works with DataFlow out-of-the-box.
If you use DataFlow in some custom code, call `reset_state()` first to initialize it,
and then use the generator however you like:
```python
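# A minimal sketch of the usage described above; `MyDataFlow` and `run_step`
# are placeholders for your own DataFlow and consumer code.
df = MyDataFlow()
df.reset_state()            # initialize the DataFlow before first use
for dp in df.get_data():    # dp is a list of components, e.g. [image, label]
    run_step(dp)
```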
@@ -18,7 +18,7 @@ We will need to reach a speed of, roughly **1k ~ 2k images per second**, to keep
Some things to know before reading:
1. For smaller datasets (e.g. several GBs of images with lightweight preprocessing), a simple reader plus some prefetch should usually work well enough.
Therefore you don't have to understand this tutorial in depth unless you really find your data to be the bottleneck.
This tutorial can be a bit involved for people new to system architectures, but you do need these techniques to run fast enough on an ImageNet-scale dataset.
2. Having a fast Python generator **alone** may or may not improve your overall training speed.
You need mechanisms to hide the latency of **all** preprocessing stages, as mentioned in the
[previous tutorial](input-source.html); a rough sketch of such a pipeline follows.
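As a rough sketch of what such a pipeline can look like (the component names follow the tensorpack dataflow module, but treat the exact arguments and the dataset path as assumptions for illustration):
```python
from tensorpack.dataflow import BatchData, PrefetchDataZMQ, dataset

df = dataset.ILSVRC12('/path/to/ILSVRC12', 'train', shuffle=True)  # raw image reader
df = BatchData(df, 256)                # group datapoints into batches of 256
df = PrefetchDataZMQ(df, nr_proc=25)   # run the pipeline in 25 processes to hide its latency
```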
@@ -20,31 +20,32 @@ def train(self):
```
Note that at each place, the callbacks will be called in the order they are given to the trainer.
### Explain the Callback Methods
To write a callback, subclass `Callback` and implement the corresponding underscore-prefixed methods.
You can override any of the following methods to define a new callback:
* `_setup_graph(self)`
Create any ops / tensors in the graph which you might need to use in the callback.
This method exists to separate "define" and "run", and also to
avoid the common mistake of creating ops inside
loops. All changes to the graph should be made in this method.
To access ops which are already defined,
you can use TF methods such as
[`graph.get_tensor_by_name`](https://www.tensorflow.org/api_docs/python/tf/Graph#get_tensor_by_name).
If you're using a `TowerTrainer` instance, more tools are available:
* Use `self.trainer.tower_func.towers` to access the
[tower handles](../modules/tfutils.html#tensorpack.tfutils.tower.TowerTensorHandles),
and therefore the tensors in each tower.
* [self.get_tensors_maybe_in_tower()](../modules/callbacks.html#tensorpack.callbacks.Callback.get_tensors_maybe_in_tower)
is a helper function to access tensors in the first training tower.
* [self.trainer.get_predictor()](../modules/train.html#tensorpack.train.TowerTrainer.get_predictor)
is a helper function to create a callable under inference mode.
* `_before_train(self)`
Can be used to run some manual initialization of variables, or start some services for the training.
@@ -60,7 +61,7 @@ Otherwise, `_trigger_epoch` should be enough.
* `_before_run(self, ctx)`, `_after_run(self, ctx, values)`
These are the equivalents of [tf.train.SessionRunHook](https://www.tensorflow.org/api_docs/python/tf/train/SessionRunHook).
Please refer to the TensorFlow documentation for the detailed API.
They are used to run extra ops / eval extra tensors / feed extra values __along with__ the actual training iterations.
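Putting these pieces together, a minimal custom callback might look like the sketch below. This is an illustration, not tensorpack's own code: the tensor name `'tower0/loss:0'` is a made-up example, and you should check the exact `Callback` API of your version.
```python
import tensorflow as tf
from tensorpack.callbacks import Callback

class TrackLoss(Callback):
    def _setup_graph(self):
        # All graph access happens here, never inside the run loop.
        self._loss = self.graph.get_tensor_by_name('tower0/loss:0')  # hypothetical tensor name

    def _before_run(self, ctx):
        # Fetch the loss tensor along with the regular training iteration.
        return tf.train.SessionRunArgs(fetches=self._loss)

    def _after_run(self, ctx, run_values):
        # run_values.results contains whatever _before_run asked to fetch.
        self._last_loss = run_values.results

    def _trigger_epoch(self):
        print('loss at the end of this epoch:', self._last_loss)
```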
@@ -25,7 +25,7 @@ The reasons are:
Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
4 P100 GPUs can train ResNet50 at 852 images/sec, and the size of those images is 852\*224\*224\*3\*4 bytes = 489MB.
Assuming you have 5GB/s `memcpy` bandwidth (roughly what a single-threaded copy achieves), simply copying the data once would take 0.1s -- slowing
down your training by 10%. Think about how many more copies are made during your preprocessing.
Failure to hide the data preparation latency is the major reason why people
@@ -74,6 +74,7 @@ Let's take a look at what users are asking for:
* [Handle dataset that's not a multiple of batch size](https://github.com/tensorflow/tensorflow/issues/13745)
* [Take variable-length np array](https://github.com/tensorflow/tensorflow/issues/13018)
* [Different levels of determinism](https://github.com/tensorflow/tensorflow/issues/13932)
To support these features, which could've been done with 3 lines of code in Python, you need either a new TF
API, or to ask [Dataset.from_generator](https://www.tensorflow.org/versions/r1.4/api_docs/python/tf/contrib/data/Dataset#from_generator)
(i.e. Python again) to come to the rescue.
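To make the "a few lines of Python" point concrete, here is a small illustration (not tensorpack or TF code) of the first request above, batching a dataset whose size is not a multiple of the batch size, with a plain generator:
```python
def batches(datapoints, batch_size):
    buf = []
    for dp in datapoints:
        buf.append(dp)
        if len(buf) == batch_size:
            yield buf          # a full batch
            buf = []
    if buf:
        yield buf              # the leftover, smaller batch
```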
@@ -15,7 +15,7 @@ This is how TensorFlow summaries eventually get logged/saved/printed:
It runs ops in the `SUMMARIES` collection (by default) every epoch (by default),
and writes results to the monitors.
3. __Where to Log__:
Several monitors are [enabled by default](../modules/train.html#tensorpack.train.DEFAULT_MONITORS).
* A [TFEventWriter](../modules/callbacks.html#tensorpack.callbacks.TFEventWriter)
writes things to an event file used by tensorboard.
* A [ScalarPrinter](../modules/callbacks.html#tensorpack.callbacks.ScalarPrinter)
@@ -36,7 +36,7 @@ are likely to have too much variance. To address this issue, you can:
[MovingAverageSummary](../modules/callbacks.html#tensorpack.callbacks.MovingAverageSummary)
callback (enabled by default).
### Other Logging Data
Besides TensorFlow summaries,
a callback can also write other data to the monitor backend at any time once the training has started.
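As a minimal sketch of what "writing other data to the monitor backend" could look like (the callback class, the metric name, and the `_compute_error` helper are hypothetical; only the `monitors` interface follows tensorpack's documented API):
```python
from tensorpack.callbacks import Callback

class PutValidationError(Callback):
    def _trigger_epoch(self):
        err = self._compute_error()   # hypothetical helper computing a validation metric
        # The value is dispatched to every registered monitor (event file, logs, JSON, ...).
        self.trainer.monitors.put_scalar('val-error', err)
```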
@@ -6,7 +6,7 @@ Tensorpack follows the "define-and-run" paradigm. A training has two steps:
1. __Define__: Build graph for the model.
Users can call whatever tensorflow functions to set up the graph.
Users may or may not use tensorpack `InputSource`, `ModelDesc` or other utilities to build the graph.
The goal of this step is to define "what to run" in later training steps,
and it can happen either inside or outside a tensorpack trainer.
2. __Run__: Train the model (the [Trainer.train() method](../modules/train.html#tensorpack.train.Trainer.train)):
@@ -58,7 +58,7 @@ Existing multi-GPU trainers include the logic of single-cost data-parallel train
You can enable them with just one line, and all the necessary logic to achieve the best performance has been baked into the trainers already.
The trainers can reach the same performance as the [official tensorflow benchmark](https://www.tensorflow.org/performance/benchmarks).
Please note that in data-parallel training, in each iteration all GPUs (all replicas of the model) will take
tensors from the `InputSource` (instead of taking one batch and splitting it). So the total batch size
would be ``(batch size of InputSource/DataFlow) * #GPU``.
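For example (a hedged sketch; `config` is assumed to be an existing `TrainConfig`, and the trainer class name follows tensorpack's training tutorial, so check your version): if the DataFlow in `config` produces batches of 64, training on 4 GPUs gives an effective total batch size of 64 * 4 = 256.
```python
from tensorpack import launch_train_with_config
from tensorpack.train import SyncMultiGPUTrainerParameterServer

# The one-line switch to data-parallel training on 4 GPUs.
launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(4))
```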