Commit e0c1ee77 authored by Yuxin Wu

update docs about trainer

parent fa69c70a
...@@ -51,10 +51,10 @@ the rest of the data pipeline.

If you're using DataFlow with tensorpack, also see [Input Pipeline tutorial](input-source.html)
on how tensorpack further accelerates data loading in the graph.
Nevertheless, tensorpack supports data loading with native TF operators / TF datasets as well.
### Use DataFlow (outside Tensorpack)

Normally, the tensorpack `InputSource` interface links DataFlow to the graph for training.
If you use DataFlow in some custom code, call `reset_state()` first to initialize it,
and then use the generator however you like:
```python
...
```
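As a concrete illustration, the pattern looks roughly like this (a minimal sketch; `Mnist` and `BatchData` are just one possible DataFlow, and `get_data()` is the generator interface of this tensorpack version):

```python
from tensorpack.dataflow import BatchData, dataset

df = BatchData(dataset.Mnist('train'), 128)   # any DataFlow can be used the same way
df.reset_state()                              # initialize it once before iterating
for datapoint in df.get_data():               # get_data() is a plain Python generator
    images, labels = datapoint                # a datapoint is a list of components
    pass                                      # ... use the data however you like
```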
## Understand Trainer
### Role of Trainer
Tensorpack follows the "define-and-run" paradigm. Training happens in two steps:

1. __Define__: Build the graph for the model.
Users can call whatever TensorFlow functions they need to set up the graph.
Users may or may not use tensorpack `InputSource`, `ModelDesc` or other utilities to build the graph.
The goal of this step is to define "what to run" in later training steps,
and it can happen __either inside or outside__ a tensorpack trainer.
2. __Run__: Train the model (the [Trainer.train() method](../modules/train.html#tensorpack.train.Trainer.train)):
1. Setup callbacks/monitors.
2. Finalize graph, initialize session.
3. Run the training loop.
### Assumptions of Base Trainer
* Q: What types of training can you do with tensorpack?
* A: Anything that runs in a loop.
In research we do training of various kinds.
Tensorpack trainers avoid making assumptions on what type of training
you want to do (e.g., it doesn't have to be batched, SGD-like, or have `X` (inputs) and `y` (outputs)).
The only assumption is that your training follows this pattern:
```python
for epoch_num in range(starting_epoch, max_epoch):
for local_step in range(steps_per_epoch):
run_step()
```
1. Training is **running some iterations**.
Tensorpack base trainer implements the logic of __running the iteration__.
Users or derived trainers should implement __what the iteration is__.
2. Trainer assumes the existence of __"epoch"__, i.e. that the iterations run in double for-loops.
But `steps_per_epoch` can be any number you set
and it only affects the [schedule of callbacks](extend/callback.html).
In other words, an "epoch" in tensorpack is the __default period to run callbacks__ (validation, summary, checkpoint, etc.).
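To make the role of `steps_per_epoch` concrete, the double loop together with callbacks can be pictured like this (a conceptual sketch, not the actual tensorpack source; `callbacks` stands for everything registered with the trainer):

```python
for epoch_num in range(starting_epoch, max_epoch):
    for local_step in range(steps_per_epoch):
        run_step()                   # whatever the trainer defines as one iteration
        callbacks.trigger_step()     # cheap per-step hooks
    callbacks.trigger_epoch()        # validation, summary, checkpoint, ...
```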
### How Existing (Single-Cost) Trainers Work
Most neural network training tasks are single-cost optimization.
Tensorpack provides some trainer implementations for such tasks.
These trainers will take care of step 1 (define the graph), with the following arguments:
1. Some `InputDesc`, the metadata about the input.
2. An `InputSource`, where the input comes from. See [Input Pipeline](input-source.html).
3. A function which takes input tensors and returns the cost.
4. A function which returns an optimizer.
These are documented in [SingleCostTrainer.setup_graph](../modules/train.html#tensorpack.train.SingleCostTrainer.setup_graph).
In practice you will rarely call this method directly; use the [high-level interface](training-interface.html#with-modeldesc-and-trainconfig) instead.
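For illustration, the four arguments might look roughly like this (a sketch only; `my_dataflow` is a placeholder, and the exact import paths and signatures are the ones documented on the linked API page):

```python
import tensorflow as tf
from tensorpack import InputDesc, QueueInput

inputs_desc = [InputDesc(tf.float32, [None, 28, 28], 'image'),   # 1. metadata about the input
               InputDesc(tf.int32, [None], 'label')]
my_input = QueueInput(my_dataflow)            # 2. where the input comes from (any DataFlow)

def get_cost_fn(image, label):                # 3. input tensors -> a cost tensor
    logits = tf.layers.dense(tf.reshape(image, [-1, 28 * 28]), 10)
    return tf.losses.sparse_softmax_cross_entropy(labels=label, logits=logits)

def get_opt_fn():                             # 4. returns an optimizer
    return tf.train.AdamOptimizer(1e-3)
```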
### Write a Trainer
The existing trainers should be enough for single-tower single-cost optimization tasks.
If you just want to do some extra work during training, first consider writing it as a callback,
or write an issue to see if there is a better solution than creating new trainers.
If your task is fundamentally different from single-cost optimization, you will need to write a trainer.
The existing common trainers all implement two things:
1. Set up the graph and input pipeline, using the given `InputSource` and `get_cost_fn`.
2. Minimize `model.cost` in each iteration.

But you can customize training by either using or inheriting the base `Trainer` class.
Trainers just run __some__ iterations, so there is no limit on where the data comes from or what an iteration does.
You will need to define two things for a new Trainer:

1. Define the graph.
   Add any tensors and ops you like, either before creating the trainer or inside `Trainer.__init__`.
2. Define the iteration. There are two ways to do it (see the sketch after this list):
   1. Set `Trainer.train_op`. This op will be run by default.
   2. Subclass `Trainer` and override the `run_step()` method. This way you can do more than just running an op.
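A minimal sketch of such a trainer, under the assumptions above (the top-level import and the no-argument `Trainer()` constructor reflect this version of the API; `loss` and `optimizer` are placeholders):

```python
from tensorpack import Trainer

class MyTrainer(Trainer):
    def __init__(self, loss, optimizer):
        super(MyTrainer, self).__init__()
        # 1. the graph: add any tensors/ops you like here (or build them beforehand)
        self.train_op = optimizer.minimize(loss)   # 2a. the default iteration runs this op

    # 2b. alternatively, override run_step() to do more than running a single op:
    # def run_step(self):
    #     self.hooked_sess.run(self.train_op)
    #     # ... fetch extra tensors, update Python-side state, etc.
```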
...
...@@ -13,7 +13,7 @@ A High Level Glance

They will eventually be wrapped under the same ``InputSource`` interface and go through prefetching.
* You can use any TF-based symbolic function library to define a model, including
a small set of functions within tensorpack. ``ModelDesc`` is an interface to connect the graph with the
``InputSource`` interface.
* tensorpack trainers manage the training loops for you.
...@@ -38,7 +38,6 @@ User Tutorials

dataflow
input-source
symbolic
trainer
training-interface

...@@ -47,8 +46,19 @@ User Tutorials

summary
faq
Performance
============

.. toctree::
  :maxdepth: 1

  efficient-dataflow
  performance-tuning
Extend Tensorpack
==================

.. toctree::
  :maxdepth: 1

...@@ -58,10 +68,3 @@ Extend Tensorpack

  extend/model
  extend/callback
  extend/trainer
...@@ -102,6 +102,6 @@ For example,

Come from some `InputSource`, then prefetched on GPU by a TF StagingArea.
4. Come from a DataFlow, and further processed by `tf.data.Dataset`.
5. [TensorInput](../modules/input_source.html#tensorpack.input_source.TensorInput):
Come from some TF reading ops (see the sketch after this list).
6. Come from some ZMQ pipe, where the load/preprocessing may happen on a different machine.
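For item 5, a rough sketch of wrapping TF reading ops with `TensorInput` (the `tf.data` pipeline and the numpy arrays `images`/`labels` are placeholders for your own reading ops; the constructor arguments follow the `TensorInput(get_tensor_fn, size=None)` signature):

```python
import tensorflow as tf
from tensorpack import TensorInput

def get_tensors():
    ds = tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(32)
    img, label = ds.make_one_shot_iterator().get_next()
    return [img, label]

my_input = TensorInput(get_tensors, size=1000)   # size: how many iterations it can serve (optional)
```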
# Trainers

Tensorpack trainers contain the logic of:

1. Building the graph.
2. Running the iterations (with callbacks).

Usually you won't touch these methods directly, but use a
[higher-level interface](training-interface.html) on trainers.
You'll only need to __select__ what trainer to use.
### Tower Trainer

Following the terminology in TensorFlow,
a "tower" function is something that takes input tensors and adds __one replicate__ of the model to the graph.
Most types of neural-network training could fall into this category.
This concept is used mainly to support:

1. Data-parallel multi-GPU training, where a replicate is built on each GPU.
2. Automatically building the graph for inference, where a replicate is built under inference mode.
### MultiGPU Trainers

For data-parallel multi-GPU training, different [multi-GPU trainers](http://tensorpack.readthedocs.io/en/latest/modules/train.html)
implement different parallel logic, all reaching the same performance as the
[official TF benchmark](https://www.tensorflow.org/performance/benchmarks).
It takes only one line of code change to use them.
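For example, switching to data-parallel training is typically just a different trainer object passed at launch time (a sketch; `config` is a `TrainConfig` as in the training-interface tutorial, and `8` is the number of GPUs/towers):

```python
from tensorpack import SyncMultiGPUTrainerParameterServer, launch_train_with_config

# before: launch_train_with_config(config, SimpleTrainer())
launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(8))
# note: the total batch size becomes (batch size of your DataFlow) * 8
```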
Note some common problems when using these trainers:

1. In each iteration, all GPUs (all replicates of the model) take tensors from the `InputSource`,
instead of taking one batch and splitting it among them.
So the total batch size becomes ``(batch size of InputSource/DataFlow) * #GPU``.
Splitting one tensor across GPUs makes little sense; it only puts unnecessary shape constraints on the data.
By letting each GPU train on its own input tensors, they can train on inputs of different shapes simultaneously.

2. Your model code (the tower function) will get called multiple times.
You'll need to be very careful when modifying global state in those functions, e.g. adding ops to TF collections.

### Custom Trainers

You can easily write a trainer for other types of training.
See [Write a Trainer](extend/trainer.html).
# Training Interface

Tensorpack trainers have a verbose interface for maximum flexibility.
Then, there are interfaces built on top of trainers to simplify their use
when you don't want to customize too much.
### With ModelDesc and TrainConfig

This is an interface that's most familiar to old tensorpack users,
and is now mainly useful for single-cost tasks.
A lot of examples are written in this interface.

[SingleCost trainers](../modules/train.html#tensorpack.train.SingleCostTrainer)
expect 4 arguments to set up the graph: `InputDesc`, `InputSource`, a get_cost function, and an optimizer.
`ModelDesc` describes a model by packing 3 of them together into one object:
```python
...@@ -65,7 +53,7 @@ config = TrainConfig(
)

trainer = SomeTrainer()
# trainer = SyncMultiGPUTrainerParameterServer(8)
launch_train_with_config(config, trainer)
```
See the docs of

...@@ -73,3 +61,19 @@ See the docs of

and
[launch_train_with_config](../modules/train.html#tensorpack.train.launch_train_with_config)
for usage and detailed functionalities.
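For reference, a `ModelDesc` packing the three pieces together might look roughly like this (a sketch based on the pre-1.0 API; the underscored method names `_get_inputs`, `_build_graph`, `_get_optimizer` are an assumption and may differ across versions):

```python
import tensorflow as tf
from tensorpack import ModelDesc, InputDesc

class MyModel(ModelDesc):
    def _get_inputs(self):              # the InputDesc metadata
        return [InputDesc(tf.float32, [None, 28, 28], 'image'),
                InputDesc(tf.int32, [None], 'label')]

    def _build_graph(self, inputs):     # the get_cost function: must set self.cost
        image, label = inputs
        logits = tf.layers.dense(tf.reshape(image, [-1, 28 * 28]), 10)
        self.cost = tf.losses.sparse_softmax_cross_entropy(labels=label, logits=logits)

    def _get_optimizer(self):           # the optimizer
        return tf.train.AdamOptimizer(1e-3)
```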
### Raw Trainer Interface

You can also access methods of the trainer directly, for finer control:

__Build__ the graph: For a general trainer, build the graph by yourself.
For a single-cost trainer, build the graph with
[SingleCostTrainer.setup_graph](../modules/train.html#tensorpack.train.SingleCostTrainer.setup_graph).

__Run__ the iterations: Call
[Trainer.train()](../modules/train.html#tensorpack.train.Trainer.train),
or
[Trainer.train_with_defaults()](../modules/train.html#tensorpack.train.Trainer.train_with_defaults),
which applies some default options for normal use cases.

Read the API documentation for detailed usage.
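A sketch of this raw flow, reusing the four ingredients described in the trainer tutorial (`inputs_desc`, `my_input`, `get_cost_fn`, `get_opt_fn` are the hypothetical objects defined there; keyword names follow the linked API docs and may vary between versions):

```python
from tensorpack import SimpleTrainer

trainer = SimpleTrainer()                                             # a single-cost trainer
trainer.setup_graph(inputs_desc, my_input, get_cost_fn, get_opt_fn)   # build the graph
trainer.train_with_defaults(
    callbacks=[],            # your extra callbacks
    steps_per_epoch=100,
    max_epoch=10)            # run the iterations
```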
...@@ -316,7 +316,9 @@ class BatchQueueInput(QueueInput):

# TODO tensor inputs can be drained? look at the new dataset API.
class TensorInput(FeedfreeInput):
    """ Input from a list of tensors, e.g. a TF data reading pipeline.

    The PTB training example shows how to use it.
    """
    def __init__(self, get_tensor_fn, size=None):
        """
...
...@@ -122,7 +122,7 @@ class SingleCostTrainer(TowerTrainer):

Note:
    1. `get_cost_fn` will always be called under a :class:`TowerContext`,
       which will contain information about reuse,
       training/inference, scope name, etc.
    2. `get_cost_fn` might get called multiple times for data-parallel training or inference.
    3. To respect variable reuse, use `tf.get_variable` instead of
...
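For point 3, a hypothetical `get_cost_fn` that stays reuse-safe by only creating variables through `tf.get_variable` (here indirectly via `tf.layers`; the function and tensor names are illustrative):

```python
import tensorflow as tf

def get_cost_fn(feature, label):
    # tf.layers creates its weights with tf.get_variable, which lets the
    # TowerContext share them when this function is called once per tower
    logits = tf.layers.dense(feature, 10, name='fc')
    return tf.losses.sparse_softmax_cross_entropy(labels=label, logits=logits)
```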