Commit f0243500 authored by Yuxin Wu

Merge branch 'model-redesign'

parents d38d22bf a867fa57
Bug Reports/Feature Requests/Usage Questions Only:

Bug Reports (including performance bug):
Some part of the code (either the library or examples) doesn't work as expected.
Always include the following:
1. What you did. (the command you ran if using examples; post or describe your code if not)
@@ -13,7 +13,7 @@ Feature Requests:
2. Add a new feature. Please note that you can implement a lot of features by extending tensorpack
(See http://tensorpack.readthedocs.io/en/latest/tutorial/index.html#extend-tensorpack).
It may not have to be added to tensorpack unless you have a good reason.
3. Note that we don't implement papers at others' requests.

Usage Questions, e.g.:
"How do I do [this specific thing] in tensorpack?"
...
@@ -8,6 +8,14 @@ so you won't need to look at here very often.

Here is a list of things that were changed, starting from an early version.
TensorFlow itself also changed APIs before 1.0 and those are not listed here.

+ [2017/10/21]
  tensorpack is gradually switching to a new Trainer API.
  The old API will keep working for a while.
  To switch to the new API, the easiest way is to:
  1. `export TENSORPACK_TRAIN_API=v2` (will become the default soon).
  2. Replace `SomeTrainer(config, ...).train()` with `launch_train_with_config(config, SomeTrainer(...))`.
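  For example, the migration for a typical multi-GPU script is only a two-line change (a sketch; `config` is an existing `TrainConfig` and `NR_GPU` is assumed to be defined elsewhere):

  ```python
  import os
  os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # must be set before importing tensorpack
  from tensorpack import *

  # old API:
  #   SyncMultiGPUTrainer(config).train()
  # new API:
  launch_train_with_config(config, SyncMultiGPUTrainer(NR_GPU))
  ```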
+ [2017/10/18]
  `TrainConfig(predict_tower)` was deprecated. You can set the inference device directly when creating the `InferenceRunner` callback.
+ [2017/10/12](https://github.com/ppwwyyxx/tensorpack/commit/7e963996f615b85f7459455596b4ee9bbd0bce8e).
...
@@ -367,6 +367,7 @@ def autodoc_skip_member(app, what, name, obj, skip, options):
            'VisualQA',
            'huber_loss',
            'DumpTensor',
            'StagingInputWrapper',
            'StepTensorPrinter'
            ]:
        return True
...
@@ -25,58 +25,54 @@ Therefore these features can be reused with one single line, as long as you are

For example, these are the callbacks I used when training a ResNet:

```python
callbacks=[
    # save the model every epoch
    ModelSaver(),
    # backup the model with best validation error
    MinSaver('val-error-top1'),
    # run inference on another Dataflow every epoch, compute classification error and log to monitors
    InferenceRunner(dataset_val, [
        ClassificationError('wrong-top1', 'val-error-top1'),
        ClassificationError('wrong-top5', 'val-error-top5')]),
    # schedule the learning rate based on epoch number
    ScheduledHyperParamSetter('learning_rate',
                              [(30, 1e-2), (60, 1e-3), (85, 1e-4), (95, 1e-5)]),
    # can manually change the learning rate through a file during training
    HumanHyperParamSetter('learning_rate'),
    # send validation error to my phone through pushbullet
    SendStat('curl -u your_id_xxx: https://api.pushbullet.com/v2/pushes \\
             -d type=note -d title="validation error" \\
             -d body={val-error-top1} > /dev/null 2>&1',
             'val-error-top1'),
    # record GPU utilizations during training
    GPUUtilizationTracker(),
    # can pause the training and start a debug shell, to observe what's going on
    InjectShell(shell='ipython')
] + [   # these callbacks are enabled by default already, though you can customize them
    # maintain those moving average summaries already defined in the model (e.g. training loss, training error)
    MovingAverageSummary(),
    # draw a nice progress bar
    ProgressBar(),
    # run `tf.summary.merge_all` every epoch and log to monitors
    MergeAllSummaries(),
    # run ops in GraphKeys.UPDATE_OPS collection along with training, if any
    RunUpdateOps(),
],
monitors=[   # monitors are a special kind of callbacks. these are also enabled by default
    # write everything to tensorboard
    TFEventWriter(),
    # write all scalar data to a json file, for easy parsing
    JSONWriter(),
    # print all scalar data every epoch (can be configured differently)
    ScalarPrinter(),
]
```

Notice that callbacks cover every detail of training, ranging from graph operations to the progress bar.
This means you can customize every part of the training to your preference, e.g. display something
different in the progress bar, evaluate part of the summaries at a different frequency, etc.

These features may not always be useful, but think about how messy the main loop would look if you
were to write this logic together with the loops, and how easy your life would be if you could enable
these features with one line when you need them.

See [Write a callback](http://tensorpack.readthedocs.io/en/latest/tutorial/extend/callback.html)
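To get a sense of what writing one involves, here is a hedged sketch of a minimal custom callback (it assumes the graph defines a variable named `learning_rate`, as the hyperparameter setters above do; the tutorial linked above is authoritative):

```python
import tensorflow as tf
from tensorpack import Callback

class PrintLearningRate(Callback):
    """Print the current learning rate at the end of every epoch."""
    def _setup_graph(self):
        # the graph is finalized before training starts, so look up the tensor now
        self._lr = tf.get_default_graph().get_tensor_by_name('learning_rate:0')

    def _trigger_epoch(self):
        lr = self.trainer.sess.run(self._lr)
        print('learning rate after epoch {}: {}'.format(self.epoch_num, lr))
```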
...
## Write a Trainer

The existing trainers should be enough for single-tower single-cost optimization tasks.
If you just want to do some extra work during training, first consider writing it as a callback,
or write an issue to see if there is a better solution than creating new trainers.
If your task is fundamentally different from single-cost optimization, you will need to write a trainer.

Trainers just run __some__ iterations, so there is no limit in where the data come from or what to do in an iteration.
The existing common trainers all implement two things:

1. Setup the graph and input pipeline, using the given `InputSource` and `get_cost_fn`.
2. Minimize `model.cost` in each iteration.

But you can customize it by using or inheriting the base `Trainer` class.
You will need to define two things for a new Trainer (see the sketch after this list):

1. What is the graph.
   Add any tensors and ops you like, either before creating the trainer or inside `Trainer.__init__`.

2. What is the iteration. There are 2 ways to define an iteration:

   1. Set `Trainer.train_op`. This op will be run by default.
   2. Subclass `Trainer` and override the `run_step()` method. This way you can do something more than running an op.

There are several different [GAN trainers](../../examples/GAN/GAN.py) for reference.
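To make the two options concrete, here is a hedged sketch (the `my_train_op` argument is a placeholder for an op you built yourself, not tensorpack API):

```python
from tensorpack import Trainer

class MyTrainer(Trainer):
    def __init__(self, my_train_op):
        super(MyTrainer, self).__init__()
        self.train_op = my_train_op   # option 1: this op is run by default

    # option 2: override run_step() to do more than running one op
    def run_step(self):
        self.hooked_sess.run(self.train_op)
        # any extra per-iteration work (fetching stats, extra ops, ...) goes here
```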
# Build the Graph

This tutorial explains how a graph is built in tensorpack.

### ModelDesc

`ModelDesc` is an abstraction over the most common type of models people train.
It assumes:

1. Training is single-cost optimization, performed by a single `tf.train.Optimizer`.
2. The graph can be trivially duplicated for data-parallel training or inference.

If your task is single-cost optimization,
you can subclass `ModelDesc` and implement several methods:
```python
class MyModel(ModelDesc):
    def _get_inputs(self):
        return [InputDesc(...), InputDesc(...)]

    def _build_graph(self, inputs):
        tensorA, tensorB = inputs
        # build the graph
        self.cost = xxx   # define the cost tensor

    def _get_optimizer(self):
        return tf.train.GradientDescentOptimizer(0.1)
```
`_get_inputs` should define the metainfo of all the inputs your graph may need.
`_build_graph` should add tensors/operations to the graph, where
the argument `inputs` is the list of input tensors matching `_get_inputs`.
You can use any symbolic functions in `_build_graph`, including TensorFlow core library
functions and other symbolic libraries.
### How it is Used:

Most tensorpack trainers expect a `ModelDesc`, and use it as a __description
of the TF graph to be built__.
These trainers will use `_get_inputs` to connect the given `InputSource` to the graph.
They will then use `_build_graph` to create the model, and `_get_optimizer` to create the
minimization op, which is run in every iteration.

Note that data-parallel multi-GPU trainers will call `_build_graph` __multiple times__, once on each GPU.
A trainer may also make __extra calls__ to `_build_graph` for inference, if used by some callbacks.

`_build_graph` will always be called under some `TowerContext`, which contains such context information
(e.g. training or inference, reuse or not, scope name) for you to access.
Also, to respect variable reuse among multiple calls, use `tf.get_variable()` instead of `tf.Variable` in `_build_graph`
if you need to create any variables.
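For instance, here is an illustrative sketch of a reuse-safe `_build_graph` (the layer name and shapes are made up for the example):

```python
def _build_graph(self, inputs):
    image, label = inputs
    # tf.get_variable respects the reuse flag set by the surrounding
    # TowerContext, so repeated calls to _build_graph share these weights:
    W = tf.get_variable('fc/W', shape=[784, 10],
                        initializer=tf.random_normal_initializer(stddev=0.01))
    b = tf.get_variable('fc/b', shape=[10], initializer=tf.zeros_initializer())
    logits = tf.matmul(image, W) + b
    self.cost = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label))
```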
### Build It Manually

When you need to deal with a complicated graph, it may be easier to build the graph manually.
You are free to do so, as long as you tell the trainer what to do in each step.

Check out [Write a Trainer](extend/trainer.html)
for how to use a custom graph with a trainer.
@@ -39,9 +39,9 @@ User Tutorials

    dataflow
    input-source
    efficient-dataflow
    graph
    symbolic
    trainer
    training-interface
    callback
    summary
    faq
...
# Trainer

Tensorpack follows the "define-and-run" paradigm. A training has two steps:

1. Build graph for the model.
   Users can call whatever tensorflow functions to setup the graph.
   Users may or may not use tensorpack `InputSource`, `ModelDesc` to build the graph.
   This step defines "what to run" in every training step.
   It can happen either inside or outside the trainer.

2. Train the model (the [Trainer.train() method](http://tensorpack.readthedocs.io/en/latest/modules/train.html#tensorpack.train.Trainer.train)):

   1. Setup callbacks/monitors.
   2. Finalize the graph, initialize session.
   3. Run the main loop.

## Assumptions of Base Trainer

In research we do training of various kinds.
Tensorpack trainers try to avoid making assumptions on what type of training
you want to do (e.g., it doesn't have to be batched, SGD-like, or have `X` (inputs) and `y` (outputs)).

The only assumption the tensorpack `Trainer` class makes about your training is that it
follows this pattern:

```python
@@ -15,50 +34,36 @@ Tensorpack base trainer implements the logic of __running the iteration__.
Users or derived trainers should implement __what the iteration is__.

2. Trainer assumes the existence of __"epoch"__, i.e. that the iterations run in double for-loops.
   But the epoch size can actually be any number you set
   and it only affects the [schedule of callbacks](extend/callback.html).
   In other words, an "epoch" in tensorpack is the __default period to run callbacks__ (validation, summary, checkpoint, etc.).
### Single-Cost Trainers

Most neural network training tasks are single-cost optimization.
Tensorpack provides some trainer implementations for such tasks.
These trainers will build the graph by themselves, with the following arguments:

1. Some `InputDesc`, the metadata about the input.
2. An `InputSource`, where the input comes from. See [Input Pipeline](input-source.html).
3. A function which takes input tensors and returns the cost.
4. A function which returns an optimizer.

See [SingleCostTrainer.setup_graph](http://tensorpack.readthedocs.io/en/latest/modules/train.html#tensorpack.train.SingleCostTrainer.setup_graph)
for details. A sketch of these four arguments follows.
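Here is a hedged sketch of those four arguments (`my_network` and `my_dataflow` are hypothetical placeholders, not tensorpack API):

```python
import tensorflow as tf
from tensorpack import SimpleTrainer, QueueInput, InputDesc

inputs_desc = [InputDesc(tf.float32, [None, 28, 28], 'image'),
               InputDesc(tf.int32, [None], 'label')]

def get_cost_fn(image, label):      # takes input tensors, returns the cost
    logits = my_network(image)      # hypothetical model-building function
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label))

def get_opt_fn():                   # returns an optimizer
    return tf.train.GradientDescentOptimizer(0.1)

trainer = SimpleTrainer()
trainer.setup_graph(inputs_desc, QueueInput(my_dataflow), get_cost_fn, get_opt_fn)
```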
Existing multi-GPU trainers include the logic of data-parallel training.
You can enable them with just one line, and all the necessary logic to achieve the best performance is already baked into the trainers.
The trainers can reach the same performance as the [official tensorflow benchmark](https://www.tensorflow.org/performance/benchmarks).

Please note that in data-parallel training, in each iteration all towers (all replicates of the model) will take
tensors from the `InputSource` (instead of taking one batch and splitting it). So the total batch size
would be ``(batch size of InputSource/DataFlow) * #GPU``.

There are also high-level wrappers that have a slightly simpler interface (but exist mainly for old users).
See [High-Level Training Interface](training-interface.html).
### Custom Trainers

You can easily write a trainer for other types of training.
...
# Training Interface

Tensorpack trainers have an interface for maximum flexibility.
There are also interfaces built on top of trainers to simplify their use,
when you don't need to customize too much.

### Raw Trainer Interface

For a general trainer, build the graph by yourself.
For a single-cost trainer, build the graph with
[SingleCostTrainer.setup_graph](http://tensorpack.readthedocs.io/en/latest/modules/train.html#tensorpack.train.SingleCostTrainer.setup_graph).

Then, call
[Trainer.train()](http://tensorpack.readthedocs.io/en/latest/modules/train.html#tensorpack.train.Trainer.train)
or
[Trainer.train_with_defaults()](http://tensorpack.readthedocs.io/en/latest/modules/train.html#tensorpack.train.Trainer.train_with_defaults),
which applies some default options for normal use cases.
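A minimal sketch of the raw interface, assuming the base `Trainer` can be driven by setting `train_op` as described in [Write a Trainer](extend/trainer.html) (the quadratic "model" is purely illustrative):

```python
import tensorflow as tf
from tensorpack import Trainer

# build the graph yourself: a dummy cost, minimized by SGD
x = tf.get_variable('x', initializer=5.0)
cost = tf.square(x)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(cost)

trainer = Trainer()
trainer.train_op = train_op          # what one iteration runs
trainer.train_with_defaults(steps_per_epoch=100, max_epoch=10)
```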
### With ModelDesc and TrainConfig

[SingleCost trainers](trainer.html#single-cost-trainers)
expect 4 arguments in `setup_graph`: `InputDesc`, `InputSource`, a get-cost function, and an optimizer.
`ModelDesc` describes a model by packing 3 of them together into one object:
```python
class MyModel(ModelDesc):
    def _get_inputs(self):
        return [InputDesc(...), InputDesc(...)]

    def _build_graph(self, inputs):
        tensorA, tensorB = inputs
        # build the graph
        self.cost = xxx   # define the cost tensor

    def _get_optimizer(self):
        return tf.train.GradientDescentOptimizer(0.1)
```
`_get_inputs` should define the metainfo of all the inputs your graph needs to build.

`_build_graph` takes a list of tensors `inputs`, which will match `_get_inputs`.
You can use any symbolic functions in `_build_graph`, including TensorFlow core library
functions and other symbolic libraries.
But because `_build_graph` will be used as part of `get_cost_fn`, it needs to follow the requirements of
[get_cost_fn](http://tensorpack.readthedocs.io/en/latest/modules/train.html#tensorpack.train.SingleCostTrainer.setup_graph).
In the end, you need to set `self.cost`.
After defining such a model, use it with `TrainConfig` and `launch_train_with_config`:
```python
config = TrainConfig(
    model=MyModel(),
    dataflow=my_dataflow,
    # data=my_inputsource,  # alternatively, use a customized InputSource
    callbacks=[...]
)
trainer = SomeTrainer()
# trainer = SyncMultiGPUTrainerParameterServer([0, 1, 2])
launch_train_with_config(config, trainer)
```
See the docs of
[launch_train_with_config](http://tensorpack.readthedocs.io/en/latest/modules/train.html#tensorpack.train.launch_train_with_config)
for its usage and detailed functionalities.
@@ -18,6 +18,7 @@ import tensorflow as tf
import six
from six.moves import queue

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils.concurrency import *
from tensorpack.utils.serialize import *
@@ -303,5 +304,5 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = get_model_loader(args.load)
    trainer = SimpleTrainer() if config.nr_tower == 1 else AsyncMultiGPUTrainer(config.tower)
    launch_train_with_config(config, trainer)
@@ -12,6 +12,7 @@ import operator
import six
from six.moves import map, range

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils.gradproc import SummaryGradient, GlobalNormClip
from tensorpack.utils.globvars import globalns as param
@@ -94,7 +95,7 @@ def get_data(path, isTrain, stat_file):
def get_config(ds_train, ds_test):
    return TrainConfig(
        data=QueueInput(ds_train),
        callbacks=[
            ModelSaver(),
            StatMonitorParamSetter('learning_rate', 'error',
@@ -128,4 +129,4 @@ if __name__ == '__main__':
    config = get_config(ds_train, ds_test)
    if args.load:
        config.session_init = SaverRestore(args.load)
    launch_train_with_config(config, SimpleTrainer())
@@ -12,6 +12,7 @@ import operator
import six
from six.moves import map, range

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils import symbolic_functions, summary, optimizer
from tensorpack.tfutils.gradproc import GlobalNormClip
@@ -116,7 +117,7 @@ def get_config():
    ds = BatchData(ds, param.batch_size)
    return TrainConfig(
        data=QueueInput(ds),
        callbacks=[
            ModelSaver(),
            ScheduledHyperParamSetter('learning_rate', [(25, 2e-4)])
@@ -190,4 +191,4 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = SaverRestore(args.load)
    launch_train_with_config(config, SimpleTrainer())
@@ -16,6 +16,7 @@ import multiprocessing
import threading
from collections import deque

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils.concurrency import *
import tensorflow as tf
@@ -105,7 +106,7 @@ def get_config():
    )
    return TrainConfig(
        data=QueueInput(expreplay),
        model=Model(),
        callbacks=[
            ModelSaver(),
@@ -166,4 +167,4 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = get_model_loader(args.load)
    launch_train_with_config(config, SimpleTrainer())
@@ -79,24 +79,22 @@ def eval_with_funcs(predictors, nr_eval, get_player_fn):
        k.start()
        time.sleep(0.1)  # avoid simulator bugs
    stat = StatCounter()

    for _ in tqdm(range(nr_eval), **get_tqdm_kwargs()):
        r = q.get()
        stat.feed(r)
    logger.info("Waiting for all the workers to finish the last run...")
    for k in threads:
        k.stop()
    for k in threads:
        k.join()
    while q.qsize():
        r = q.get()
        stat.feed(r)

    if stat.count > 0:
        return (stat.average, stat.max)
    return (0, 0)


def eval_model_multithread(pred, nr_eval, get_player_fn):
...
@@ -258,7 +258,7 @@ class ExpReplay(DataFlow, Callback):
            mean, max = v.average, v.max
            self.trainer.monitors.put_scalar('expreplay/mean_score', mean)
            self.trainer.monitors.put_scalar('expreplay/max_score', max)
        except Exception:
            logger.exception("Cannot log training scores.")
        v.reset()
...
@@ -8,6 +8,7 @@ import os
import sys
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.dataflow import dataset
import tensorflow as tf
@@ -65,4 +66,4 @@ if __name__ == '__main__':
    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
    config = get_config()
    launch_train_with_config(config, SimpleTrainer())
@@ -8,6 +8,7 @@ import numpy as np
import os
import imp

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.dataflow import dataset
@@ -56,4 +57,4 @@ if __name__ == '__main__':
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    config = get_config(args.prob)
    launch_train_with_config(config, SimpleTrainer())
@@ -11,6 +11,7 @@ import multiprocessing
import os
import sys

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils.symbolic_functions import *
from tensorpack.tfutils.summary import *
@@ -226,7 +227,7 @@ def get_data(dataset_name):
        ds = AugmentImageComponent(ds, augmentors, copy=False)
    ds = BatchData(ds, BATCH_SIZE, remainder=not isTrain)
    if isTrain:
        ds = PrefetchDataZMQ(ds, min(25, multiprocessing.cpu_count()))
    return ds
@@ -321,5 +322,4 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = SaverRestore(args.load)
    launch_train_with_config(config, SyncMultiGPUTrainer(nr_tower))
@@ -7,6 +7,7 @@ import argparse
import numpy as np
import os

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils.symbolic_functions import *
from tensorpack.tfutils.summary import *
@@ -163,7 +164,7 @@ def get_config():
    data_test = BatchData(data_test, 128, remainder=True)
    return TrainConfig(
        data=QueueInput(data_train),
        callbacks=[
            ModelSaver(),
            InferenceRunner(data_test,
@@ -183,4 +184,4 @@ if __name__ == '__main__':
    BITW, BITA, BITG = map(int, args.dorefa.split(','))
    config = get_config()
    launch_train_with_config(config, SimpleTrainer())
@@ -6,10 +6,12 @@ import argparse
import numpy as np
import tensorflow as tf
import cv2
import os
from scipy.signal import convolve2d
from six.moves import range, zip
import multiprocessing

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils import logger
from tensorpack.utils.viz import *
@@ -262,5 +264,4 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = SaverRestore(args.load)
    launch_train_with_config(config, SyncMultiGPUTrainer(NR_GPU))
@@ -13,6 +13,7 @@ import numpy as np
import json
import tensorflow as tf

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
import tensorpack.tfutils.symbolic_functions as symbf
from tensorpack.tfutils.summary import add_moving_summary
@@ -222,12 +223,13 @@ class EvalCallback(Callback):
    def _setup_graph(self):
        self.pred = self.trainer.get_predictor(['image'], ['fastrcnn_fg_probs', 'fastrcnn_fg_boxes'])
        self.df = PrefetchDataZMQ(get_eval_dataflow(), 1)
        get_tf_nms()   # just to make sure the nms part of graph is created

    def _before_train(self):
        EVAL_TIMES = 5  # eval 5 times during training
        interval = self.trainer.max_epoch // (EVAL_TIMES + 1)
        self.epochs_to_eval = set([interval * k for k in range(1, EVAL_TIMES)])
        self.epochs_to_eval.add(self.trainer.max_epoch)

    def _eval(self):
        all_results = eval_on_dataflow(self.df, lambda img: detect_one_image(img, self.pred))
@@ -300,6 +302,6 @@ if __name__ == '__main__':
        steps_per_epoch=stepnum,
        max_epoch=230000 * factor // stepnum,
        session_init=get_model_loader(args.load) if args.load else None,
    )
    trainer = SyncMultiGPUTrainerReplicated(get_nr_gpu())
    launch_train_with_config(cfg, trainer)
@@ -6,6 +6,7 @@
import os
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils.summary import add_moving_summary
from tensorpack.utils.gpu import get_nr_gpu
@@ -145,8 +146,6 @@ if __name__ == '__main__':
    logger.auto_set_dir()
    config = TrainConfig(
        callbacks=[
            ModelSaver(),
            StatMonitorParamSetter(
@@ -155,9 +154,12 @@ if __name__ == '__main__':
        steps_per_epoch=500,
        max_epoch=400,
        session_init=SaverRestore(args.load) if args.load else None,
    )

    input = QueueInput(DCGAN.get_data(args.data))
    model = Model()
    nr_tower = max(get_nr_gpu(), 1)
    if nr_tower == 1:
        trainer = GANTrainer(input, model)
    else:
        trainer = MultiGPUGANTrainer(nr_tower, input, model)
    trainer.train_with_config(config)
@@ -10,6 +10,7 @@ import sys
import cv2
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils.viz import *
import tensorpack.tfutils.symbolic_functions as symbf
@@ -104,18 +105,6 @@ def get_data():
    return BatchData(ds, BATCH)


def sample(model_path):
    pred = PredictConfig(
        session_init=get_model_loader(model_path),
@@ -144,7 +133,10 @@ if __name__ == '__main__':
    if args.sample:
        sample(args.load)
    else:
        logger.auto_set_dir()
        GANTrainer(QueueInput(get_data()), Model()).train_with_defaults(
            callbacks=[ModelSaver()],
            steps_per_epoch=500,
            max_epoch=100,
            session_init=SaverRestore(args.load) if args.load else None
        )
@@ -9,6 +9,7 @@ import glob
from six.moves import map, zip, range
import numpy as np

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils.viz import *
import tensorpack.tfutils.symbolic_functions as symbf
@@ -217,9 +218,7 @@ if __name__ == '__main__':
    data = get_data(args.data)
    data = PrintData(data)

    GANTrainer(QueueInput(data), Model()).train_with_defaults(
        callbacks=[
            ModelSaver(),
            ScheduledHyperParamSetter(
@@ -228,7 +227,6 @@ if __name__ == '__main__':
            PeriodicTrigger(VisualizeTestSet(), every_k_epochs=3),
        ],
        max_epoch=195,
        session_init=SaverRestore(args.load) if args.load else None
    )
@@ -8,6 +8,7 @@ import numpy as np
import os, sys
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils.viz import *
from tensorpack.tfutils.summary import add_moving_summary
@@ -155,12 +156,11 @@ if __name__ == '__main__':
    else:
        assert args.data
        logger.auto_set_dir()
        GANTrainer(
            input=QueueInput(get_data(args.data)),
            model=Model()).train_with_defaults(
            callbacks=[ModelSaver()],
            steps_per_epoch=300,
            max_epoch=200,
            session_init=SaverRestore(args.load) if args.load else None
        )
@@ -8,6 +8,7 @@ import argparse
from six.moves import map, zip
import numpy as np

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils.viz import *
import tensorpack.tfutils.symbolic_functions as symbf
@@ -216,14 +217,11 @@ if __name__ == '__main__':
    data = get_celebA_data(args.data, args.style_A, args.style_B)

    # train 1 D after 2 G
    SeparateGANTrainer(
        QueueInput(data), Model(), d_period=3).train_with_defaults(
        callbacks=[ModelSaver()],
        steps_per_epoch=300,
        max_epoch=250,
        session_init=SaverRestore(args.load) if args.load else None
    )
@@ -6,9 +6,9 @@
import tensorflow as tf
import numpy as np
import time

from tensorpack import (TowerTrainer, QueueInput,
                        ModelDescBase, DataFlow, StagingInput,
                        TowerContext, TowerFuncWrapper)
from tensorpack.graph_builder import DataParallelBuilder, LeastLoadedDeviceSetter
from tensorpack.tfutils.summary import add_moving_summary
from tensorpack.utils.argtools import memoized
@@ -64,20 +64,17 @@ class GANModelDesc(ModelDescBase):
        return self._get_optimizer()


class GANTrainer(TowerTrainer):
    def __init__(self, input, model):
        super(GANTrainer, self).__init__()
        assert isinstance(model, GANModelDesc), model
        inputs_desc = model.get_inputs_desc()

        cbs = input.setup(inputs_desc)
        tower_func = TowerFuncWrapper(
            model.build_graph, inputs_desc)
        with TowerContext('', is_training=True):
            tower_func(*input.get_input_tensors())

        opt = model.get_optimizer()
        # by default, run one d_min after one g_min
@@ -86,29 +83,29 @@ class GANTrainer(Trainer):
        with tf.control_dependencies([g_min]):
            d_min = opt.minimize(model.d_loss, var_list=model.d_vars, name='d_op')
        self.train_op = d_min

        self.set_tower_func(tower_func)
        for cb in cbs:
            self._register_callback(cb)


class SeparateGANTrainer(TowerTrainer):
    """ A GAN trainer which runs two optimization ops with a certain ratio."""
    def __init__(self, input, model, d_period=1, g_period=1):
        """
        Args:
            d_period(int): period of each d_opt run
            g_period(int): period of each g_opt run
        """
        super(SeparateGANTrainer, self).__init__()
        self._d_period = int(d_period)
        self._g_period = int(g_period)
        assert min(d_period, g_period) == 1

        cbs = input.setup(model.get_inputs_desc())
        tower_func = TowerFuncWrapper(model.build_graph, model.get_inputs_desc())
        with TowerContext('', is_training=True):
            tower_func(*input.get_input_tensors())

        opt = model.get_optimizer()
        with tf.name_scope('optimize'):
@@ -117,7 +114,9 @@ class SeparateGANTrainer(Trainer):
            self.g_min = opt.minimize(
                model.g_loss, var_list=model.g_vars, name='g_min')

        self.set_tower_func(tower_func)
        for cb in cbs:
            self._register_callback(cb)

    def run_step(self):
        if self.global_step % (self._d_period) == 0:
@@ -126,26 +125,28 @@ class SeparateGANTrainer(Trainer):
            self.hooked_sess.run(self.g_min)


class MultiGPUGANTrainer(TowerTrainer):
    """
    A replacement of GANTrainer (optimize d and g one by one) with multi-gpu support.
    """
    def __init__(self, nr_gpu, input, model):
        super(MultiGPUGANTrainer, self).__init__()
        assert nr_gpu > 1
        raw_devices = ['/gpu:{}'.format(k) for k in range(nr_gpu)]

        # setup input
        input = StagingInput(input, list(range(nr_gpu)))
        cbs = input.setup(model.get_inputs_desc())

        def get_cost(*inputs):
            model.build_graph(inputs)
            return [model.d_loss, model.g_loss]

        tower_func = TowerFuncWrapper(get_cost, model.get_inputs_desc())
        devices = [LeastLoadedDeviceSetter(d, raw_devices) for d in raw_devices]
        cost_list = DataParallelBuilder.build_on_towers(
            list(range(nr_gpu)),
            lambda: tower_func(*input.get_input_tensors()),
            devices)
        # simply average the cost. It might get faster to average the gradients
        with tf.name_scope('optimize'):
            d_loss = tf.add_n([x[0] for x in cost_list]) * (1.0 / nr_gpu)
@@ -159,7 +160,9 @@ class MultiGPUGANTrainer(Trainer):
            d_min = opt.minimize(d_loss, var_list=model.d_vars,
                                 colocate_gradients_with_ops=True, name='d_op')
        self.train_op = d_min

        self.set_tower_func(tower_func)
        for cb in cbs:
            self._register_callback(cb)


class RandomZData(DataFlow):
...
@@ -12,6 +12,7 @@ import os
import sys
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils.viz import *
from tensorpack.tfutils.summary import add_moving_summary
@@ -168,21 +169,6 @@ def get_data():
    return ds


def sample(datadir, model_path):
    pred = PredictConfig(
        session_init=get_model_loader(model_path),
@@ -218,9 +204,19 @@ if __name__ == '__main__':
    BATCH = args.batch

    if args.sample:
        assert args.load
        sample(args.data, args.load)
    else:
        logger.auto_set_dir()

        data = QueueInput(get_data())

        GANTrainer(data, Model()).train_with_defaults(
            callbacks=[
                PeriodicTrigger(ModelSaver(), every_k_epochs=3),
                ScheduledHyperParamSetter('learning_rate', [(200, 1e-4)])
            ],
            steps_per_epoch=data.size(),
            max_epoch=300,
            session_init=SaverRestore(args.load) if args.load else None
        )
@@ -6,6 +6,7 @@
import os
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils.summary import add_moving_summary
from tensorpack.utils.globvars import globalns as G
@@ -94,12 +95,11 @@ if __name__ == '__main__':
    else:
        assert args.data
        logger.auto_set_dir()
        SeparateGANTrainer(
            QueueInput(DCGAN.get_data(args.data)),
            Model(), g_period=6).train_with_defaults(
            callbacks=[ModelSaver()],
            steps_per_epoch=300,
            max_epoch=200,
            session_init=SaverRestore(args.load) if args.load else None
        )
@@ -10,6 +10,7 @@ import os
import sys
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.utils import viz
from tensorpack.tfutils.scope_utils import auto_reuse_variable_scope, under_name_scope
@@ -189,17 +190,6 @@ def get_data():
    return ds


def sample(model_path):
    pred = OfflinePredictor(PredictConfig(
        session_init=get_model_loader(model_path),
@@ -254,7 +244,11 @@ if __name__ == '__main__':
        BATCH = 100
        sample(args.load)
    else:
        logger.auto_set_dir()
        GANTrainer(QueueInput(get_data()),
                   Model()).train_with_defaults(
            callbacks=[ModelSaver(keep_freq=0.1)],
            steps_per_epoch=500,
            max_epoch=100,
            session_init=SaverRestore(args.load) if args.load else None
        )
@@ -6,6 +6,7 @@
import os
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils import optimizer
from tensorpack.tfutils.summary import add_moving_summary
@@ -75,14 +76,15 @@ if __name__ == '__main__':
    else:
        assert args.data
        logger.auto_set_dir()

        # The original code uses a different schedule, but this seems to work well.
        # Train 1 D after 2 G
        SeparateGANTrainer(
            input=QueueInput(DCGAN.get_data(args.data)),
            model=Model(),
            d_period=3).train_with_defaults(
            callbacks=[ModelSaver(), ClipCallback()],
            steps_per_epoch=500,
            max_epoch=200,
            session_init=SaverRestore(args.load) if args.load else None
        )
@@ -11,6 +11,7 @@ from six.moves import zip
import os
import sys

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
import tensorpack.tfutils.symbolic_functions as symbf
from tensorpack.dataflow import dataset
@@ -231,5 +232,6 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = get_model_loader(args.load)
    launch_train_with_config(
        config,
        SyncMultiGPUTrainer(max(get_nr_gpu(), 1)))
@@ -9,6 +9,7 @@ import numpy as np
import os
import tensorflow as tf

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils.symbolic_functions import *
from tensorpack.tfutils.summary import *
@@ -192,6 +193,6 @@ if __name__ == '__main__':
    if args.load:
        config.session_init = SaverRestore(args.load)
    if args.gpu:
        nr_tower = len(args.gpu.split(','))
        assert nr_tower == NR_GPU
    launch_train_with_config(config, SyncMultiGPUTrainer(NR_GPU))
@@ -10,6 +10,7 @@ import os
import tensorflow as tf
import multiprocessing

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils.symbolic_functions import *
from tensorpack.tfutils.summary import *
@@ -298,5 +299,4 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = SaverRestore(args.load)
    launch_train_with_config(config, SyncMultiGPUTrainer(NR_GPU))
@@ -7,6 +7,7 @@ import numpy as np
import os
import argparse

os.environ['TENSORPACK_TRAIN_API'] = 'v2'   # will become default soon
from tensorpack import *
from tensorpack.tfutils.gradproc import *
from tensorpack.tfutils import optimizer, summary
@@ -174,4 +175,4 @@ if __name__ == '__main__':
    config = get_config()
    if args.load:
        config.session_init = SaverRestore(args.load)
    launch_train_with_config(config, SimpleTrainer())
...@@ -7,6 +7,7 @@ import numpy as np ...@@ -7,6 +7,7 @@ import numpy as np
import argparse import argparse
import os import os
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import * from tensorpack import *
from tensorpack.tfutils.symbolic_functions import * from tensorpack.tfutils.symbolic_functions import *
from tensorpack.tfutils.summary import * from tensorpack.tfutils.summary import *
...@@ -171,7 +172,7 @@ if __name__ == '__main__': ...@@ -171,7 +172,7 @@ if __name__ == '__main__':
[(1, 0.1), (82, 0.01), (123, 0.001), (300, 0.0002)]) [(1, 0.1), (82, 0.01), (123, 0.001), (300, 0.0002)])
], ],
max_epoch=400, max_epoch=400,
nr_tower=max(get_nr_gpu(), 1),
session_init=SaverRestore(args.load) if args.load else None session_init=SaverRestore(args.load) if args.load else None
) )
SyncMultiGPUTrainerParameterServer(config).train() nr_gpu = max(get_nr_gpu(), 1)
launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(nr_gpu))
...@@ -9,10 +9,12 @@ import os ...@@ -9,10 +9,12 @@ import os
import tensorflow as tf import tensorflow as tf
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import logger, QueueInput from tensorpack import logger, QueueInput
from tensorpack.models import * from tensorpack.models import *
from tensorpack.callbacks import * from tensorpack.callbacks import *
from tensorpack.train import TrainConfig, SyncMultiGPUTrainerParameterServer from tensorpack.train import (
TrainConfig, SyncMultiGPUTrainerParameterServer, launch_train_with_config)
from tensorpack.dataflow import imgaug, FakeData from tensorpack.dataflow import imgaug, FakeData
from tensorpack.tfutils import argscope, get_model_loader from tensorpack.tfutils import argscope, get_model_loader
from tensorpack.utils.gpu import get_nr_gpu from tensorpack.utils.gpu import get_nr_gpu
...@@ -132,4 +134,5 @@ if __name__ == '__main__': ...@@ -132,4 +134,5 @@ if __name__ == '__main__':
config = get_config(model, fake=args.fake) config = get_config(model, fake=args.fake)
if args.load: if args.load:
config.session_init = get_model_loader(args.load) config.session_init = get_model_loader(args.load)
SyncMultiGPUTrainerParameterServer(config).train() trainer = SyncMultiGPUTrainerParameterServer(max(get_nr_gpu(), 1))
launch_train_with_config(config, trainer)
...@@ -152,7 +152,7 @@ def convert_param_name(param): ...@@ -152,7 +152,7 @@ def convert_param_name(param):
for k, v in six.iteritems(param): for k, v in six.iteritems(param):
try: try:
newname = name_conversion(k) newname = name_conversion(k)
except: except Exception:
logger.error("Exception when processing caffe layer {}".format(k)) logger.error("Exception when processing caffe layer {}".format(k))
raise raise
logger.info("Name Transform: " + k + ' --> ' + newname) logger.info("Name Transform: " + k + ' --> ' + newname)
......
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# File: svhn-resnet.py
# Author: Yuxin Wu <ppwwyyxxc@gmail.com>
import argparse
import numpy as np
import os
from tensorpack import *
from tensorpack.tfutils.symbolic_functions import *
from tensorpack.tfutils.summary import *
from tensorpack.dataflow import dataset
from tensorpack.utils.gpu import get_nr_gpu
import tensorflow as tf
"""
ResNet-110 for SVHN Digit Classification.
Reaches 1.8% validation error after 70 epochs, with 2 TitanX GPUs at 2 it/s.
You might need to adjust the learning rate schedule when running with 1 GPU.
"""
import imp
cifar_example = imp.load_source('cifar_example',
                                os.path.join(os.path.dirname(__file__), 'cifar10-resnet.py'))
Model = cifar_example.Model
BATCH_SIZE = 128
def get_data(train_or_test):
    isTrain = train_or_test == 'train'
    pp_mean = dataset.SVHNDigit.get_per_pixel_mean()
    if isTrain:
        d1 = dataset.SVHNDigit('train')
        d2 = dataset.SVHNDigit('extra')
        ds = RandomMixData([d1, d2])
    else:
        ds = dataset.SVHNDigit('test')

    if isTrain:
        augmentors = [
            imgaug.CenterPaste((40, 40)),
            imgaug.Brightness(10),
            imgaug.Contrast((0.8, 1.2)),
            imgaug.GaussianDeform(  # this is slow; without it, error only reaches 1.9%
                [(0.2, 0.2), (0.2, 0.8), (0.8, 0.8), (0.8, 0.2)],
                (40, 40), 0.2, 3),
            imgaug.RandomCrop((32, 32)),
            imgaug.MapImage(lambda x: x - pp_mean),
        ]
    else:
        augmentors = [
            imgaug.MapImage(lambda x: x - pp_mean)
        ]
    ds = AugmentImageComponent(ds, augmentors)
    ds = BatchData(ds, BATCH_SIZE, remainder=not isTrain)
    if isTrain:
        ds = PrefetchData(ds, 5, 5)
    return ds
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', help='comma separated list of GPU(s) to use.')
    parser.add_argument('--load', help='load model')
    args = parser.parse_args()

    if args.gpu:
        os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu

    logger.auto_set_dir()

    dataset_train = get_data('train')
    dataset_test = get_data('test')

    config = TrainConfig(
        model=Model(n=18),
        dataflow=dataset_train,
        callbacks=[
            ModelSaver(),
            InferenceRunner(dataset_test,
                            [ScalarStats('cost'), ClassificationError()]),
            ScheduledHyperParamSetter('learning_rate',
                                      [(1, 0.1), (20, 0.01), (28, 0.001), (50, 0.0001)])
        ],
        nr_tower=max(get_nr_gpu(), 1),
        session_init=SaverRestore(args.load) if args.load else None,
        max_epoch=500,
    )
    SyncMultiGPUTrainerParameterServer(config).train()
...@@ -9,6 +9,7 @@ import numpy as np ...@@ -9,6 +9,7 @@ import numpy as np
import os import os
import multiprocessing import multiprocessing
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
import tensorflow as tf import tensorflow as tf
from tensorflow.contrib.layers import variance_scaling_initializer from tensorflow.contrib.layers import variance_scaling_initializer
from tensorpack import * from tensorpack import *
...@@ -19,9 +20,10 @@ from tensorpack.tfutils.summary import * ...@@ -19,9 +20,10 @@ from tensorpack.tfutils.summary import *
from tensorpack.utils.gpu import get_nr_gpu from tensorpack.utils.gpu import get_nr_gpu
from tensorpack.utils import viz from tensorpack.utils import viz
from imagenet_resnet_utils import ( from imagenet_utils import (
fbresnet_augmentor, preresnet_basicblock, preresnet_group, fbresnet_augmentor, image_preprocess, compute_loss_and_error)
image_preprocess, compute_loss_and_error) from resnet_model import (
preresnet_basicblock, preresnet_group)
TOTAL_BATCH_SIZE = 256 TOTAL_BATCH_SIZE = 256
...@@ -90,10 +92,6 @@ def get_data(train_or_test): ...@@ -90,10 +92,6 @@ def get_data(train_or_test):
def get_config(): def get_config():
nr_gpu = get_nr_gpu()
global BATCH_SIZE
BATCH_SIZE = TOTAL_BATCH_SIZE // nr_gpu
dataset_train = get_data('train') dataset_train = get_data('train')
dataset_val = get_data('val') dataset_val = get_data('val')
...@@ -111,7 +109,6 @@ def get_config(): ...@@ -111,7 +109,6 @@ def get_config():
], ],
steps_per_epoch=5000, steps_per_epoch=5000,
max_epoch=105, max_epoch=105,
nr_tower=nr_gpu
) )
...@@ -163,6 +160,9 @@ if __name__ == '__main__': ...@@ -163,6 +160,9 @@ if __name__ == '__main__':
if args.gpu: if args.gpu:
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
nr_gpu = get_nr_gpu()
BATCH_SIZE = TOTAL_BATCH_SIZE // nr_gpu
if args.cam: if args.cam:
BATCH_SIZE = 128 # something that can run on one gpu BATCH_SIZE = 128 # something that can run on one gpu
viz_cam(args.load, args.data) viz_cam(args.load, args.data)
...@@ -172,4 +172,4 @@ if __name__ == '__main__': ...@@ -172,4 +172,4 @@ if __name__ == '__main__':
config = get_config() config = get_config()
if args.load: if args.load:
config.session_init = get_model_loader(args.load) config.session_init = get_model_loader(args.load)
SyncMultiGPUTrainerParameterServer(config).train() launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(nr_gpu))
../ResNet/imagenet_resnet_utils.py
\ No newline at end of file
../ResNet/imagenet_utils.py
\ No newline at end of file
../ResNet/resnet_model.py
\ No newline at end of file
...@@ -10,10 +10,11 @@ import cv2 ...@@ -10,10 +10,11 @@ import cv2
import tensorflow as tf import tensorflow as tf
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import logger, QueueInput, InputDesc, PlaceholderInput, TowerContext from tensorpack import logger, QueueInput, InputDesc, PlaceholderInput, TowerContext
from tensorpack.models import * from tensorpack.models import *
from tensorpack.callbacks import * from tensorpack.callbacks import *
from tensorpack.train import TrainConfig, SyncMultiGPUTrainerParameterServer from tensorpack.train import *
from tensorpack.dataflow import imgaug from tensorpack.dataflow import imgaug
from tensorpack.tfutils import argscope, get_model_loader from tensorpack.tfutils import argscope, get_model_loader
from tensorpack.tfutils.scope_utils import under_name_scope from tensorpack.tfutils.scope_utils import under_name_scope
...@@ -141,8 +142,7 @@ def get_data(name, batch): ...@@ -141,8 +142,7 @@ def get_data(name, batch):
args.data, name, batch, augmentors) args.data, name, batch, augmentors)
def get_config(model): def get_config(model, nr_tower):
nr_tower = max(get_nr_gpu(), 1)
batch = TOTAL_BATCH_SIZE // nr_tower batch = TOTAL_BATCH_SIZE // nr_tower
logger.info("Running on {} towers. Batch size per tower: {}".format(nr_tower, batch)) logger.info("Running on {} towers. Batch size per tower: {}".format(nr_tower, batch))
...@@ -170,7 +170,6 @@ def get_config(model): ...@@ -170,7 +170,6 @@ def get_config(model):
callbacks=callbacks, callbacks=callbacks,
steps_per_epoch=5000, steps_per_epoch=5000,
max_epoch=100, max_epoch=100,
nr_tower=nr_tower
) )
...@@ -205,5 +204,6 @@ if __name__ == '__main__': ...@@ -205,5 +204,6 @@ if __name__ == '__main__':
logger.set_logger_dir( logger.set_logger_dir(
os.path.join('train_log', 'shufflenet')) os.path.join('train_log', 'shufflenet'))
config = get_config(model) nr_tower = max(get_nr_gpu(), 1)
SyncMultiGPUTrainerParameterServer(config).train() config = get_config(model, nr_tower)
launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(nr_tower))
...@@ -9,7 +9,7 @@ import argparse ...@@ -9,7 +9,7 @@ import argparse
import tensorflow as tf import tensorflow as tf
import tensorflow.contrib.slim as slim import tensorflow.contrib.slim as slim
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import * from tensorpack import *
import tensorpack.tfutils.symbolic_functions as symbf import tensorpack.tfutils.symbolic_functions as symbf
from tensorpack.tfutils.summary import add_moving_summary from tensorpack.tfutils.summary import add_moving_summary
...@@ -442,4 +442,4 @@ if __name__ == '__main__': ...@@ -442,4 +442,4 @@ if __name__ == '__main__':
if args.load: if args.load:
config.session_init = SaverRestore(args.load) config.session_init = SaverRestore(args.load)
else: else:
SimpleTrainer(config).train() launch_train_with_config(config, SimpleTrainer())
...@@ -10,6 +10,7 @@ import os ...@@ -10,6 +10,7 @@ import os
import sys import sys
import argparse import argparse
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import * from tensorpack import *
from tensorpack.dataflow import dataset from tensorpack.dataflow import dataset
from tensorpack.tfutils import sesscreate, optimizer, summary from tensorpack.tfutils import sesscreate, optimizer, summary
...@@ -186,4 +187,4 @@ if __name__ == '__main__': ...@@ -186,4 +187,4 @@ if __name__ == '__main__':
config = get_config() config = get_config()
if args.load: if args.load:
config.session_init = SaverRestore(args.load) config.session_init = SaverRestore(args.load)
SimpleTrainer(config).train() launch_train_with_config(config, SimpleTrainer())
...@@ -5,6 +5,7 @@ ...@@ -5,6 +5,7 @@
import os import os
import argparse import argparse
import tensorflow as tf import tensorflow as tf
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import * from tensorpack import *
""" """
...@@ -51,7 +52,7 @@ def get_config(): ...@@ -51,7 +52,7 @@ def get_config():
return TrainConfig( return TrainConfig(
model=Model(), model=Model(),
dataflow=ds_train, data=QueueInput(ds_train),
callbacks=[ callbacks=[
ModelSaver(), ModelSaver(),
InferenceRunner(ds_test, [ScalarStats('total_costs')]), InferenceRunner(ds_test, [ScalarStats('total_costs')]),
...@@ -77,4 +78,4 @@ if __name__ == '__main__': ...@@ -77,4 +78,4 @@ if __name__ == '__main__':
if args.load: if args.load:
config.session_init = SaverRestore(args.load) config.session_init = SaverRestore(args.load)
SyncMultiGPUTrainer(config).train() launch_train_with_config(config, SimpleTrainer())
...@@ -2,12 +2,13 @@ ...@@ -2,12 +2,13 @@
# -*- coding: UTF-8 -*- # -*- coding: UTF-8 -*-
# File: cifar-convnet.py # File: cifar-convnet.py
# Author: Yuxin Wu <ppwwyyxxc@gmail.com> # Author: Yuxin Wu <ppwwyyxxc@gmail.com>
from tensorpack import *
import tensorflow as tf import tensorflow as tf
import argparse import argparse
import numpy as np import numpy as np
import os import os
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import *
import tensorpack.tfutils.symbolic_functions as symbf import tensorpack.tfutils.symbolic_functions as symbf
from tensorpack.tfutils.summary import * from tensorpack.tfutils.summary import *
from tensorpack.dataflow import dataset from tensorpack.dataflow import dataset
...@@ -151,8 +152,7 @@ if __name__ == '__main__': ...@@ -151,8 +152,7 @@ if __name__ == '__main__':
if args.load: if args.load:
config.session_init = SaverRestore(args.load) config.session_init = SaverRestore(args.load)
config.nr_tower = max(len(args.gpu.split(',')), 1) nr_gpu = len(args.gpu.split(','))
if config.nr_tower <= 1: trainer = QueueInputTrainer() if nr_gpu <= 1 \
QueueInputTrainer(config).train() else SyncMultiGPUTrainerParameterServer(nr_gpu)
else: launch_train_with_config(config, trainer)
SyncMultiGPUTrainerParameterServer(config).train()
...@@ -12,6 +12,7 @@ MNIST ConvNet example. ...@@ -12,6 +12,7 @@ MNIST ConvNet example.
about 0.6% validation error after 30 epochs. about 0.6% validation error after 30 epochs.
""" """
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
# Just import everything into current namespace # Just import everything into current namespace
from tensorpack import * from tensorpack import *
from tensorpack.tfutils import summary from tensorpack.tfutils import summary
...@@ -142,4 +143,4 @@ if __name__ == '__main__': ...@@ -142,4 +143,4 @@ if __name__ == '__main__':
config.session_init = SaverRestore(args.load) config.session_init = SaverRestore(args.load)
# SimpleTrainer is slow, this is just a demo. # SimpleTrainer is slow, this is just a demo.
# You can use QueueInputTrainer instead # You can use QueueInputTrainer instead
SimpleTrainer(config).train() launch_train_with_config(config, SimpleTrainer())
...@@ -14,6 +14,7 @@ the only differences are: ...@@ -14,6 +14,7 @@ the only differences are:
2. use slim names to summarize weights 2. use slim names to summarize weights
""" """
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import * from tensorpack import *
from tensorpack.dataflow import dataset from tensorpack.dataflow import dataset
import tensorflow as tf import tensorflow as tf
...@@ -101,4 +102,4 @@ if __name__ == '__main__': ...@@ -101,4 +102,4 @@ if __name__ == '__main__':
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
config = get_config() config = get_config()
SimpleTrainer(config).train() launch_train_with_config(config, SimpleTrainer())
...@@ -11,6 +11,7 @@ import argparse ...@@ -11,6 +11,7 @@ import argparse
MNIST ConvNet example with weights/activations visualization. MNIST ConvNet example with weights/activations visualization.
""" """
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import * from tensorpack import *
from tensorpack.dataflow import dataset from tensorpack.dataflow import dataset
import tensorflow as tf import tensorflow as tf
...@@ -161,4 +162,4 @@ if __name__ == '__main__': ...@@ -161,4 +162,4 @@ if __name__ == '__main__':
config = get_config() config = get_config()
if args.load: if args.load:
config.session_init = SaverRestore(args.load) config.session_init = SaverRestore(args.load)
SimpleTrainer(config).train() launch_train_with_config(config, SimpleTrainer())
...@@ -7,6 +7,7 @@ import argparse ...@@ -7,6 +7,7 @@ import argparse
import numpy as np import numpy as np
import os import os
os.environ['TENSORPACK_TRAIN_API'] = 'v2' # will become default soon
from tensorpack import * from tensorpack import *
from tensorpack.tfutils.symbolic_functions import prediction_incorrect from tensorpack.tfutils.symbolic_functions import prediction_incorrect
from tensorpack.dataflow import dataset from tensorpack.dataflow import dataset
...@@ -99,7 +100,7 @@ def get_config(): ...@@ -99,7 +100,7 @@ def get_config():
return TrainConfig( return TrainConfig(
model=Model(), model=Model(),
dataflow=data_train, data=QueueInput(data_train),
callbacks=[ callbacks=[
ModelSaver(), ModelSaver(),
InferenceRunner(data_test, InferenceRunner(data_test,
...@@ -125,4 +126,4 @@ if __name__ == '__main__': ...@@ -125,4 +126,4 @@ if __name__ == '__main__':
config = get_config() config = get_config()
if args.load: if args.load:
config.session_init = SaverRestore(args.load) config.session_init = SaverRestore(args.load)
QueueInputTrainer(config).train() launch_train_with_config(config, SimpleTrainer())
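This diff repeats a move made earlier in the commit: the queueing decision now lives in the config (`data=QueueInput(...)`), so the queue-specific trainer is dropped in favor of the generic one. A sketch of the resulting pattern, where `Model` and `data_train` are the names used in this example:

```python
config = TrainConfig(
    model=Model(),
    data=QueueInput(data_train),   # input pipeline chosen on the config side
    callbacks=[ModelSaver()],      # trimmed; the example passes more callbacks
)
launch_train_with_config(config, SimpleTrainer())
```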
[flake8] [flake8]
max-line-length = 120 max-line-length = 120
ignore = F403,F401,F405,F841,E401 ignore = F403,F401,F405,F841,E4,E741,E742,E743
exclude = private, exclude = private,
FasterRCNN/utils FasterRCNN/utils
...@@ -18,9 +18,9 @@ if _HAS_TF: ...@@ -18,9 +18,9 @@ if _HAS_TF:
# In development. Default to v1 # In development. Default to v1
if _os.environ.get('TENSORPACK_TRAIN_API', 'v1') == 'v2': if _os.environ.get('TENSORPACK_TRAIN_API', 'v1') == 'v2':
from tensorpack.trainv2 import *
else:
from tensorpack.train import * from tensorpack.train import *
else:
from tensorpack.trainv1 import *
from tensorpack.graph_builder import InputDesc, ModelDesc, ModelDescBase from tensorpack.graph_builder import InputDesc, ModelDesc, ModelDescBase
from tensorpack.input_source import * from tensorpack.input_source import *
from tensorpack.predict import * from tensorpack.predict import *
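The dispatch above runs once, at the first `import tensorpack`, so the environment variable only takes effect if it is set beforehand; this is why every migrated example in this commit places the assignment above its tensorpack imports. A minimal sketch:

```python
import os
# Must be set before the first tensorpack import: the if/else in
# tensorpack/__init__.py shown above reads it at import time.
os.environ['TENSORPACK_TRAIN_API'] = 'v2'

from tensorpack import *  # now resolves to the new trainer package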
...@@ -38,7 +38,7 @@ class Inferencer(Callback): ...@@ -38,7 +38,7 @@ class Inferencer(Callback):
for k, v in six.iteritems(ret): for k, v in six.iteritems(ret):
try: try:
v = float(v) v = float(v)
except: except ValueError:
logger.warn("{} returns a non-scalar statistics!".format(type(self).__name__)) logger.warn("{} returns a non-scalar statistics!".format(type(self).__name__))
continue continue
else: else:
......
...@@ -203,7 +203,7 @@ class DataParallelInferenceRunner(InferenceRunnerBase): ...@@ -203,7 +203,7 @@ class DataParallelInferenceRunner(InferenceRunnerBase):
self._input_callbacks = Callbacks(input_callbacks) self._input_callbacks = Callbacks(input_callbacks)
# InputSource might have hooks which break us. # InputSource might have hooks which break us.
# e.g. hooks from StagingInputWrapper will force the consumption # e.g. hooks from StagingInput will force the consumption
# of nr_tower datapoints in every run. # of nr_tower datapoints in every run.
input_hooks = self._input_callbacks.get_hooks() input_hooks = self._input_callbacks.get_hooks()
self._hooks = [self._build_hook(inf) for inf in self.infs] + input_hooks self._hooks = [self._build_hook(inf) for inf in self.infs] + input_hooks
......
...@@ -199,7 +199,7 @@ class HumanHyperParamSetter(HyperParamSetter): ...@@ -199,7 +199,7 @@ class HumanHyperParamSetter(HyperParamSetter):
dic = {str(k): float(v) for k, v in lines} dic = {str(k): float(v) for k, v in lines}
ret = dic[self.param.readable_name] ret = dic[self.param.readable_name]
return ret return ret
except: except Exception:
logger.warn( logger.warn(
"Cannot find {} in {}".format( "Cannot find {} in {}".format(
self.param.readable_name, self.file_name)) self.param.readable_name, self.file_name))
......
...@@ -129,7 +129,7 @@ class BatchData(ProxyDataFlow): ...@@ -129,7 +129,7 @@ class BatchData(ProxyDataFlow):
else: else:
try: try:
tp = dt.dtype tp = dt.dtype
except: except AttributeError:
raise TypeError("Unsupported type to batch: {}".format(type(dt))) raise TypeError("Unsupported type to batch: {}".format(type(dt)))
try: try:
result.append( result.append(
...@@ -144,7 +144,7 @@ class BatchData(ProxyDataFlow): ...@@ -144,7 +144,7 @@ class BatchData(ProxyDataFlow):
try: try:
# open an ipython shell if possible # open an ipython shell if possible
import IPython as IP; IP.embed() # noqa import IPython as IP; IP.embed() # noqa
except: except ImportError:
pass pass
return result return result
......
...@@ -247,7 +247,7 @@ class ILSVRC12(ILSVRC12Files): ...@@ -247,7 +247,7 @@ class ILSVRC12(ILSVRC12Files):
cnt += 1 cnt += 1
except KeyboardInterrupt: except KeyboardInterrupt:
raise raise
except: except Exception:
ret.append(None) ret.append(None)
logger.info("{}/{} images have bounding box.".format(cnt, len(imglist))) logger.info("{}/{} images have bounding box.".format(cnt, len(imglist)))
return ret return ret
......
...@@ -61,7 +61,7 @@ def _zmq_catch_error(name): ...@@ -61,7 +61,7 @@ def _zmq_catch_error(name):
raise DataFlowTerminated() raise DataFlowTerminated()
else: else:
raise raise
except: except Exception:
raise raise
...@@ -110,7 +110,7 @@ class _MultiProcessZMQDataFlow(DataFlow): ...@@ -110,7 +110,7 @@ class _MultiProcessZMQDataFlow(DataFlow):
x.terminate() x.terminate()
try: try:
print("{} successfully cleaned-up.".format(type(self).__name__)) print("{} successfully cleaned-up.".format(type(self).__name__))
except: except Exception:
pass pass
...@@ -347,7 +347,7 @@ class MultiThreadMapData(ProxyDataFlow): ...@@ -347,7 +347,7 @@ class MultiThreadMapData(ProxyDataFlow):
return return
# cannot ignore None here. will lead to unsynced send/recv # cannot ignore None here. will lead to unsynced send/recv
self.outq.put(self.func(dp)) self.outq.put(self.func(dp))
except: except Exception:
if self.stopped(): if self.stopped():
pass # skip duplicated error messages pass # skip duplicated error messages
else: else:
......
...@@ -86,16 +86,25 @@ class ModelDescBase(object): ...@@ -86,16 +86,25 @@ class ModelDescBase(object):
:returns: a list of InputDesc :returns: a list of InputDesc
""" """
def build_graph(self, inputs): def build_graph(self, *args):
""" """
Build the whole symbolic graph. Build the whole symbolic graph.
Args: Args:
inputs (list[tf.Tensor]): a list of tensors, args (list[tf.Tensor]): a list of tensors,
that match the list of :class:`InputDesc` defined by ``_get_inputs``. that match the list of :class:`InputDesc` defined by ``_get_inputs``.
""" """
if isinstance(inputs, InputSource): if len(args) == 1:
inputs = inputs.get_input_tensors() arg = args[0]
if isinstance(arg, InputSource):
inputs = arg.get_input_tensors() # remove in the future?
if isinstance(arg, (list, tuple)):
inputs = arg
else:
inputs = [arg]
else:
inputs = args
assert len(inputs) == len(self.get_inputs_desc()), \ assert len(inputs) == len(self.get_inputs_desc()), \
"Number of inputs passed to the graph != number of inputs defined " \ "Number of inputs passed to the graph != number of inputs defined " \
"in ModelDesc! ({} != {})".format(len(inputs), len(self.get_inputs_desc())) "in ModelDesc! ({} != {})".format(len(inputs), len(self.get_inputs_desc()))
...@@ -148,14 +157,11 @@ class ModelDesc(ModelDescBase): ...@@ -148,14 +157,11 @@ class ModelDesc(ModelDescBase):
def _get_optimizer(self): def _get_optimizer(self):
raise NotImplementedError() raise NotImplementedError()
def build_graph_get_cost(self, *inputs): def _build_graph_get_cost(self, *inputs):
"""
Build the graph from inputs and return the cost tensor.
"""
self.build_graph(inputs) self.build_graph(inputs)
return self.get_cost() return self.get_cost()
def build_graph_get_grads(self, *inputs): def _build_graph_get_grads(self, *inputs):
""" """
Build the graph from inputs and return the grads. Build the graph from inputs and return the grads.
This is useful for most of the :class:`GraphBuilder` which expects such a function. This is useful for most of the :class:`GraphBuilder` which expects such a function.
...@@ -164,7 +170,7 @@ class ModelDesc(ModelDescBase): ...@@ -164,7 +170,7 @@ class ModelDesc(ModelDescBase):
[(grad, var)] [(grad, var)]
""" """
ctx = get_current_tower_context() ctx = get_current_tower_context()
cost = self.build_graph_get_cost(*inputs) cost = self._build_graph_get_cost(*inputs)
varlist = ctx.filter_vars_by_vs_name(tf.trainable_variables()) varlist = ctx.filter_vars_by_vs_name(tf.trainable_variables())
opt = self.get_optimizer() opt = self.get_optimizer()
......
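For reference, the widened `build_graph(*args)` above normalizes several call forms before the length assertion. A sketch of the equivalents, where `model`, `a`, `b`, and `src` are assumed placeholders (a `ModelDescBase` with two declared inputs, two matching tensors, and an `InputSource`):

```python
model.build_graph(a, b)    # tensors as positional arguments
model.build_graph([a, b])  # one list/tuple of tensors
model.build_graph(src)     # an InputSource, kept for compatibility ("remove in the future?")
```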
...@@ -28,7 +28,8 @@ __all__ = ['PlaceholderInput', 'FeedInput', ...@@ -28,7 +28,8 @@ __all__ = ['PlaceholderInput', 'FeedInput',
'QueueInput', 'BatchQueueInput', 'QueueInput', 'BatchQueueInput',
'DummyConstantInput', 'TensorInput', 'DummyConstantInput', 'TensorInput',
'TFDatasetInput', 'TFDatasetInput',
'StagingInputWrapper'] 'StagingInputWrapper',
'StagingInput']
class PlaceholderInput(InputSource): class PlaceholderInput(InputSource):
...@@ -398,7 +399,7 @@ class TFDatasetInput(FeedfreeInput): ...@@ -398,7 +399,7 @@ class TFDatasetInput(FeedfreeInput):
return self._iterator.get_next() return self._iterator.get_next()
class StagingInputWrapper(FeedfreeInput): class StagingInput(FeedfreeInput):
""" """
A wrapper around a feedfree input, A wrapper around a feedfree input,
to prefetch the input in StagingArea (on GPUs). to prefetch the input in StagingArea (on GPUs).
...@@ -433,7 +434,7 @@ class StagingInputWrapper(FeedfreeInput): ...@@ -433,7 +434,7 @@ class StagingInputWrapper(FeedfreeInput):
self._input = input self._input = input
if not isinstance(towers[0], int): if not isinstance(towers[0], int):
# API changed # API changed
log_deprecated("StagingInputWrapper(devices=)", "Use (towers=) instead!", "2018-01-31") log_deprecated("StagingInput(devices=)", "Use (towers=) instead!", "2018-01-31")
self._devices = towers self._devices = towers
else: else:
self._devices = ['/gpu:{}'.format(k) for k in towers] self._devices = ['/gpu:{}'.format(k) for k in towers]
...@@ -451,7 +452,7 @@ class StagingInputWrapper(FeedfreeInput): ...@@ -451,7 +452,7 @@ class StagingInputWrapper(FeedfreeInput):
cbs = self._input.get_callbacks() cbs = self._input.get_callbacks()
cbs.append( cbs.append(
StagingInputWrapper.StagingCallback( StagingInput.StagingCallback(
self._get_stage_op(), self._get_unstage_op(), self._nr_stage)) self._get_stage_op(), self._get_unstage_op(), self._nr_stage))
return cbs return cbs
...@@ -488,3 +489,6 @@ class StagingInputWrapper(FeedfreeInput): ...@@ -488,3 +489,6 @@ class StagingInputWrapper(FeedfreeInput):
with self.cached_name_scope(): with self.cached_name_scope():
all_outputs = list(chain.from_iterable(self._unstage_ops)) all_outputs = list(chain.from_iterable(self._unstage_ops))
return tf.group(*all_outputs) return tf.group(*all_outputs)
StagingInputWrapper = StagingInput
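The assignment above keeps the old name importable during the deprecation window, so both spellings construct the same class. A sketch, with `df` an assumed DataFlow:

```python
staged = StagingInput(QueueInput(df), towers=[0, 1])         # new name
legacy = StagingInputWrapper(QueueInput(df), towers=[0, 1])  # old alias, same class
assert StagingInputWrapper is StagingInput
```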
...@@ -16,7 +16,7 @@ class StaticDynamicAxis(object): ...@@ -16,7 +16,7 @@ class StaticDynamicAxis(object):
try: try:
st = f(self.static) st = f(self.static)
return StaticDynamicAxis(st, st) return StaticDynamicAxis(st, st)
except: except TypeError:
return StaticDynamicAxis(None, f(self.dynamic)) return StaticDynamicAxis(None, f(self.dynamic))
def __str__(self): def __str__(self):
...@@ -53,7 +53,7 @@ class StaticDynamicShape(object): ...@@ -53,7 +53,7 @@ class StaticDynamicShape(object):
self.static[axis] = st self.static[axis] = st
self.dynamic[axis] = StaticLazyAxis(st) self.dynamic[axis] = StaticLazyAxis(st)
return return
except: except TypeError:
pass pass
self.static[axis] = None self.static[axis] = None
dyn = self.dynamic[axis] dyn = self.dynamic[axis]
......
...@@ -5,13 +5,13 @@ ...@@ -5,13 +5,13 @@
import tensorflow as tf import tensorflow as tf
from ..input_source import ( from ..input_source import (
InputSource, FeedInput, QueueInput, StagingInputWrapper, DummyConstantInput) InputSource, FeedInput, QueueInput, StagingInput, DummyConstantInput)
from ..train.config import TrainConfig from ..trainv1.config import TrainConfig
from .base import SingleCostTrainer from .tower import SingleCostTrainer
from .trainers import SimpleTrainer, DistributedTrainerReplicated from .trainers import SimpleTrainer, DistributedTrainerReplicated
__all__ = ['launch_train_with_config', 'TrainConfig', 'apply_default_prefetch'] __all__ = ['launch_train_with_config', 'apply_default_prefetch']
def apply_default_prefetch(input_source_or_dataflow, trainer, towers): def apply_default_prefetch(input_source_or_dataflow, trainer, towers):
...@@ -36,19 +36,26 @@ def apply_default_prefetch(input_source_or_dataflow, trainer, towers): ...@@ -36,19 +36,26 @@ def apply_default_prefetch(input_source_or_dataflow, trainer, towers):
assert not isinstance(trainer, SimpleTrainer) assert not isinstance(trainer, SimpleTrainer)
assert tf.test.is_gpu_available() assert tf.test.is_gpu_available()
if not isinstance(input, (StagingInputWrapper, DummyConstantInput)): if not isinstance(input, (StagingInput, DummyConstantInput)):
input = StagingInputWrapper(input, towers) input = StagingInput(input, towers)
return input return input
def launch_train_with_config(config, trainer): def launch_train_with_config(config, trainer):
""" """
Train with a :class:`TrainConfig` and a new version of :class:`Trainer`, to Train with a :class:`TrainConfig` and a :class:`Trainer`, to
mimic the old training interface. mimic the old training interface. It basically does the following
3 things (and you can easily do them yourself; a sketch follows this diff):
1. Set up the :class:`InputSource` with automatic prefetching,
for `config.data` or `config.dataflow`.
2. Call `trainer.setup_graph` with the :class:`InputSource`,
as well as `config.model`.
3. Call `trainer.train` with the rest of the attributes of config.
Args: Args:
config (TrainConfig): config (TrainConfig):
trainer (Trainer): an instance of the new trainer trainer (Trainer): an instance of a SingleCostTrainer
Examples: Examples:
...@@ -78,7 +85,7 @@ def launch_train_with_config(config, trainer): ...@@ -78,7 +85,7 @@ def launch_train_with_config(config, trainer):
trainer.setup_graph( trainer.setup_graph(
inputs_desc, input, inputs_desc, input,
model.build_graph_get_cost, model.get_optimizer) model._build_graph_get_cost, model.get_optimizer)
trainer.train( trainer.train(
config.callbacks, config.monitors, config.callbacks, config.monitors,
config.session_creator, config.session_init, config.session_creator, config.session_init,
......
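The sketch promised in the docstring's list: a hand-rolled equivalent of `launch_train_with_config`, following its three steps. `config.tower` and the trailing `trainer.train` arguments are assumptions here, since the diff cuts off before them:

```python
input = apply_default_prefetch(config.data or config.dataflow,    # step 1
                               trainer, config.tower)             # config.tower: assumed
trainer.setup_graph(                                              # step 2
    config.model.get_inputs_desc(), input,
    config.model._build_graph_get_cost, config.model.get_optimizer)
trainer.train(                                                    # step 3
    config.callbacks, config.monitors,
    config.session_creator, config.session_init,
    config.steps_per_epoch, config.starting_epoch, config.max_epoch)  # assumed tail
```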
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# File: tower.py
import tensorflow as tf
import six
from abc import abstractmethod, ABCMeta
from ..utils.argtools import call_only_once, memoized
from ..graph_builder.predictor_factory import SimplePredictBuilder
from ..input_source import PlaceholderInput
from ..predict.base import OnlinePredictor
from ..tfutils.tower import TowerFuncWrapper, get_current_tower_context
from ..tfutils.gradproc import FilterNoneGrad
from .base import Trainer
__all__ = ['SingleCostTrainer', 'TowerTrainer']
class TowerTrainer(Trainer):
    """
    Base trainer for models that can be built by calling a tower function under a :class:`TowerContext`.

    This is required by some features that replicate the model
    automatically, e.g. creating a predictor.
    """

    tower_func = None
    """
    A :class:`TowerFuncWrapper` instance.
    A callable which takes some input tensors and builds one replica of the model.
    """

    @call_only_once
    def set_tower_func(self, tower_func):
        """
        Args:
            tower_func (TowerFuncWrapper)
        """
        assert isinstance(tower_func, TowerFuncWrapper), tower_func
        self.tower_func = tower_func

    @property
    def inputs_desc(self):
        """
        Returns:
            list[InputDesc]: metainfo about the inputs to the tower.
        """
        return self.tower_func.inputs_desc

    def get_predictor(self, input_names, output_names, device=0):
        """
        Returns a callable predictor built under ``TowerContext(is_training=False)``.

        Args:
            input_names (list), output_names (list): list of names
            device (int): build the predictor on device '/gpu:{device}' or use -1 for '/cpu:0'.

        Returns:
            an :class:`OnlinePredictor`.
        """
        assert self.tower_func is not None, "Must set tower_func on the trainer to use get_predictor()!"
        tower_name = 'tower-pred-{}'.format(device) if device >= 0 else 'tower-pred-cpu'

        try:
            tower = self.tower_func.towers[tower_name]
        except KeyError:
            input = PlaceholderInput()
            input.setup(self.inputs_desc)
            with tf.variable_scope(tf.get_variable_scope(), reuse=True):
                SimplePredictBuilder(
                    ns_name=tower_name, vs_name=self._main_tower_vs_name,
                    device=device).build(input, self.tower_func)
            tower = self.tower_func.towers[tower_name]
        input_tensors = tower.get_tensors(input_names)
        output_tensors = tower.get_tensors(output_names)
        return OnlinePredictor(input_tensors, output_tensors)

    @property
    def _main_tower_vs_name(self):
        """
        The vs name for the "main" copy of the model,
        to be used to build predictors.
        """
        return ""
@six.add_metaclass(ABCMeta)
class SingleCostTrainer(TowerTrainer):
    """
    Base class for single-cost trainers.

    A single-cost trainer has a :meth:`setup_graph` method which takes
    (inputs_desc, input, get_cost_fn, get_opt_fn), and builds the training operations from them.

    To use a :class:`SingleCostTrainer` object, call `trainer.setup_graph(...); trainer.train(...)`.
    """

    @call_only_once
    def setup_graph(self, inputs_desc, input, get_cost_fn, get_opt_fn):
        """
        Responsible for building the main training graph for single-cost training.

        Args:
            inputs_desc ([InputDesc]):
            input (InputSource):
            get_cost_fn ([tf.Tensor] -> tf.Tensor): callable, takes some input tensors and returns a cost tensor.
            get_opt_fn (-> tf.train.Optimizer): callable which returns an
                optimizer. Will only be called once.

        Note:
            1. `get_cost_fn` will always be called under a :class:`TowerContext`,
               which will contain information about reuse,
               training/inference, scope name, etc.
            2. `get_cost_fn` might get called multiple times for data-parallel training or inference.
            3. To respect variable reuse, use `tf.get_variable` instead of
               `tf.Variable` in `get_cost_fn`.
        """
        get_cost_fn = TowerFuncWrapper(get_cost_fn, inputs_desc)
        get_opt_fn = memoized(get_opt_fn)
        self.set_tower_func(get_cost_fn)

        input_callbacks = self._setup_input(inputs_desc, input)
        train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn)
        internal_callbacks = input_callbacks + train_callbacks
        for cb in internal_callbacks:
            self._register_callback(cb)

        # TODO register directly instead of return?

    @abstractmethod
    def _setup_graph(self, input, get_cost_fn, get_opt_fn):
        """
        Implement the logic to build the graph, with an :class:`InputSource`
        that has been set up already.

        Returns:
            [Callback]: list of callbacks needed
        """

    def _setup_input(self, inputs_desc, input):
        assert not input.setup_done()
        return input.setup(inputs_desc)

    def _make_get_grad_fn(self, input, get_cost_fn, get_opt_fn):
        """
        Returns:
            a get_grad_fn for GraphBuilder to use.
        """
        # internal use only
        assert input.setup_done()

        def get_grad_fn():
            ctx = get_current_tower_context()
            cost = get_cost_fn(*input.get_input_tensors())
            varlist = ctx.filter_vars_by_vs_name(tf.trainable_variables())
            opt = get_opt_fn()
            grads = opt.compute_gradients(
                cost, var_list=varlist,
                gate_gradients=False, colocate_gradients_with_ops=True)
            grads = FilterNoneGrad().process(grads)
            return grads

        return get_grad_fn
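A usage sketch for the two classes above, as their docstrings describe. `inputs_desc`, `input`, `cost_fn`, `opt_fn`, and the `train()` arguments are assumed placeholders; `SimpleTrainer` (a concrete `SingleCostTrainer`) is defined in `trainers.py` below:

```python
trainer = SimpleTrainer()
trainer.setup_graph(inputs_desc, input, cost_fn, opt_fn)   # build the training ops
trainer.train(callbacks, monitors, session_creator, session_init,
              steps_per_epoch, starting_epoch, max_epoch)  # assumed signature

# setup_graph registered the tower function, so the same trainer
# can also build a predictor:
pred = trainer.get_predictor(['input_name'], ['output_name'], device=0)
```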
...@@ -8,6 +8,7 @@ from ..callbacks.graph import RunOp ...@@ -8,6 +8,7 @@ from ..callbacks.graph import RunOp
from ..tfutils.sesscreate import NewSessionCreator from ..tfutils.sesscreate import NewSessionCreator
from ..utils import logger from ..utils import logger
from ..utils.argtools import map_arg
from ..tfutils import get_global_step_var from ..tfutils import get_global_step_var
from ..tfutils.distributed import get_distributed_session_creator from ..tfutils.distributed import get_distributed_session_creator
from ..tfutils.tower import TowerContext from ..tfutils.tower import TowerContext
...@@ -20,16 +21,24 @@ from ..graph_builder.training import ( ...@@ -20,16 +21,24 @@ from ..graph_builder.training import (
from ..graph_builder.distributed import DistributedReplicatedBuilder from ..graph_builder.distributed import DistributedReplicatedBuilder
from ..graph_builder.utils import override_to_local_variable from ..graph_builder.utils import override_to_local_variable
from .base import SingleCostTrainer from .tower import SingleCostTrainer
__all__ = ['SimpleTrainer', __all__ = ['SimpleTrainer',
'QueueInputTrainer', 'QueueInputTrainer',
'SyncMultiGPUTrainer',
'SyncMultiGPUTrainerReplicated', 'SyncMultiGPUTrainerReplicated',
'SyncMultiGPUTrainerParameterServer', 'SyncMultiGPUTrainerParameterServer',
'AsyncMultiGPUTrainer', 'AsyncMultiGPUTrainer',
'DistributedTrainerReplicated'] 'DistributedTrainerReplicated']
def _int_to_range(x):
    if isinstance(x, int):
        assert x > 0, x
        return list(range(x))
    return x
class SimpleTrainer(SingleCostTrainer): class SimpleTrainer(SingleCostTrainer):
""" """
Single-GPU single-cost single-tower trainer. Single-GPU single-cost single-tower trainer.
...@@ -53,13 +62,14 @@ class SyncMultiGPUTrainerParameterServer(SingleCostTrainer): ...@@ -53,13 +62,14 @@ class SyncMultiGPUTrainerParameterServer(SingleCostTrainer):
__doc__ = SyncMultiGPUParameterServerBuilder.__doc__ __doc__ = SyncMultiGPUParameterServerBuilder.__doc__
def __init__(self, towers, ps_device='gpu'): @map_arg(gpus=_int_to_range)
def __init__(self, gpus, ps_device='gpu'):
""" """
Args: Args:
towers ([int]): list of GPU ids. gpus ([int]): list of GPU ids.
ps_device: either 'gpu' or 'cpu', where variables are stored. Setting to 'cpu' might help when #gpu>=4 ps_device: either 'gpu' or 'cpu', where variables are stored. Setting to 'cpu' might help when #gpu>=4
""" """
self._builder = SyncMultiGPUParameterServerBuilder(towers, ps_device) self._builder = SyncMultiGPUParameterServerBuilder(gpus, ps_device)
super(SyncMultiGPUTrainerParameterServer, self).__init__() super(SyncMultiGPUTrainerParameterServer, self).__init__()
def _setup_graph(self, input, get_cost_fn, get_opt_fn): def _setup_graph(self, input, get_cost_fn, get_opt_fn):
...@@ -68,17 +78,29 @@ class SyncMultiGPUTrainerParameterServer(SingleCostTrainer): ...@@ -68,17 +78,29 @@ class SyncMultiGPUTrainerParameterServer(SingleCostTrainer):
return [] return []
def SyncMultiGPUTrainer(gpus):
    """
    Return a default multi-GPU trainer if you don't care about the details.
    It may not be the most efficient one for your task.

    Args:
        gpus (list[int]): list of GPU ids.
    """
    return SyncMultiGPUTrainerParameterServer(gpus, ps_device='gpu')
class AsyncMultiGPUTrainer(SingleCostTrainer): class AsyncMultiGPUTrainer(SingleCostTrainer):
__doc__ = AsyncMultiGPUBuilder.__doc__ __doc__ = AsyncMultiGPUBuilder.__doc__
def __init__(self, towers, scale_gradient=True): @map_arg(gpus=_int_to_range)
def __init__(self, gpus, scale_gradient=True):
""" """
Args: Args:
towers ([int]): list of GPU ids. gpus ([int]): list of GPU ids.
scale_gradient (bool): if True, will scale each gradient by ``1.0/nr_gpu``. scale_gradient (bool): if True, will scale each gradient by ``1.0/nr_gpu``.
""" """
self._builder = AsyncMultiGPUBuilder(towers, scale_gradient) self._builder = AsyncMultiGPUBuilder(gpus, scale_gradient)
super(AsyncMultiGPUTrainer, self).__init__() super(AsyncMultiGPUTrainer, self).__init__()
def _setup_graph(self, input, get_cost_fn, get_opt_fn): def _setup_graph(self, input, get_cost_fn, get_opt_fn):
...@@ -91,12 +113,13 @@ class SyncMultiGPUTrainerReplicated(SingleCostTrainer): ...@@ -91,12 +113,13 @@ class SyncMultiGPUTrainerReplicated(SingleCostTrainer):
__doc__ = SyncMultiGPUReplicatedBuilder.__doc__ __doc__ = SyncMultiGPUReplicatedBuilder.__doc__
def __init__(self, towers): @map_arg(gpus=_int_to_range)
def __init__(self, gpus):
""" """
Args: Args:
towers ([int]): list of GPU ids. gpus ([int]): list of GPU ids.
""" """
self._builder = SyncMultiGPUReplicatedBuilder(towers) self._builder = SyncMultiGPUReplicatedBuilder(gpus)
super(SyncMultiGPUTrainerReplicated, self).__init__() super(SyncMultiGPUTrainerReplicated, self).__init__()
def _setup_graph(self, input, get_cost_fn, get_opt_fn): def _setup_graph(self, input, get_cost_fn, get_opt_fn):
...@@ -113,10 +136,11 @@ class DistributedTrainerReplicated(SingleCostTrainer): ...@@ -113,10 +136,11 @@ class DistributedTrainerReplicated(SingleCostTrainer):
__doc__ = DistributedReplicatedBuilder.__doc__ __doc__ = DistributedReplicatedBuilder.__doc__
def __init__(self, towers, server): @map_arg(gpus=_int_to_range)
def __init__(self, gpus, server):
""" """
Args: Args:
towers (list[int]): list of GPU ids. gpus (list[int]): list of GPU ids.
server (tf.train.Server): the server with ps and workers. server (tf.train.Server): the server with ps and workers.
The job_name must be 'worker' because 'ps' job doesn't need to The job_name must be 'worker' because 'ps' job doesn't need to
build any graph. build any graph.
...@@ -127,7 +151,7 @@ class DistributedTrainerReplicated(SingleCostTrainer): ...@@ -127,7 +151,7 @@ class DistributedTrainerReplicated(SingleCostTrainer):
if self.job_name == 'worker': if self.job_name == 'worker':
# ps doesn't build any graph # ps doesn't build any graph
self._builder = DistributedReplicatedBuilder(towers, server) self._builder = DistributedReplicatedBuilder(gpus, server)
self.is_chief = self._builder.is_chief self.is_chief = self._builder.is_chief
else: else:
self.is_chief = False self.is_chief = False
......
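Since every constructor above is wrapped in `@map_arg(gpus=_int_to_range)`, an integer GPU count and an explicit id list are interchangeable (sketch):

```python
t1 = SyncMultiGPUTrainerParameterServer(2)       # int -> list(range(2)) == [0, 1]
t2 = SyncMultiGPUTrainerParameterServer([0, 1])  # explicit GPU ids, same result
```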
...@@ -19,7 +19,7 @@ def global_import(name): ...@@ -19,7 +19,7 @@ def global_import(name):
_CURR_DIR = os.path.dirname(__file__) _CURR_DIR = os.path.dirname(__file__)
_SKIP = [] _SKIP = ['utility']
for _, module_name, _ in iter_modules( for _, module_name, _ in iter_modules(
[_CURR_DIR]): [_CURR_DIR]):
srcpath = os.path.join(_CURR_DIR, module_name + '.py') srcpath = os.path.join(_CURR_DIR, module_name + '.py')
......
...@@ -17,9 +17,21 @@ from ..utils.develop import log_deprecated ...@@ -17,9 +17,21 @@ from ..utils.develop import log_deprecated
__all__ = ['TrainConfig'] __all__ = ['TrainConfig']
def DEFAULT_CALLBACKS():
    return [
        MovingAverageSummary(),
        ProgressBar(),
        MergeAllSummaries(),
        RunUpdateOps()]


def DEFAULT_MONITORS():
    return [TFEventWriter(), JSONWriter(), ScalarPrinter()]
class TrainConfig(object): class TrainConfig(object):
""" """
Config for trainer. A collection of options to be used for trainers.
""" """
def __init__(self, def __init__(self,
...@@ -84,9 +96,9 @@ class TrainConfig(object): ...@@ -84,9 +96,9 @@ class TrainConfig(object):
callbacks = [] callbacks = []
assert_type(callbacks, list) assert_type(callbacks, list)
self._callbacks = callbacks + \ self._callbacks = callbacks + \
(extra_callbacks or TrainConfig.DEFAULT_EXTRA_CALLBACKS()) (extra_callbacks or DEFAULT_CALLBACKS())
self.monitors = monitors or TrainConfig.DEFAULT_MONITORS() self.monitors = monitors or DEFAULT_MONITORS()
if session_init is None: if session_init is None:
session_init = JustCurrentSession() session_init = JustCurrentSession()
...@@ -148,15 +160,3 @@ class TrainConfig(object): ...@@ -148,15 +160,3 @@ class TrainConfig(object):
@property @property
def callbacks(self): # disable setter def callbacks(self): # disable setter
return self._callbacks return self._callbacks
    @staticmethod
    def DEFAULT_EXTRA_CALLBACKS():
        return [
            MovingAverageSummary(),
            ProgressBar(),
            MergeAllSummaries(),
            RunUpdateOps()]

    @staticmethod
    def DEFAULT_MONITORS():
        return [TFEventWriter(), JSONWriter(), ScalarPrinter()]
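With the defaults promoted from `TrainConfig` staticmethods to module-level functions, leaving `extra_callbacks` and `monitors` unset is equivalent to passing them explicitly. A sketch, where `Model` and `df` are assumed placeholders:

```python
config = TrainConfig(
    model=Model(), dataflow=df,
    extra_callbacks=DEFAULT_CALLBACKS(),  # MovingAverageSummary, ProgressBar, MergeAllSummaries, RunUpdateOps
    monitors=DEFAULT_MONITORS(),          # TFEventWriter, JSONWriter, ScalarPrinter
)
```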
...@@ -64,7 +64,7 @@ class DistributedTrainerReplicated(Trainer): ...@@ -64,7 +64,7 @@ class DistributedTrainerReplicated(Trainer):
self._config.callbacks.extend(cbs) self._config.callbacks.extend(cbs)
self.train_op, initial_sync_op, model_sync_op = self._builder.build( self.train_op, initial_sync_op, model_sync_op = self._builder.build(
lambda: self.model.build_graph_get_grads( lambda: self.model._build_graph_get_grads(
*self._input_source.get_input_tensors()), *self._input_source.get_input_tensors()),
self.model.get_optimizer) self.model.get_optimizer)
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# File: interface.py
__all__ = ['launch_train_with_config']
def launch_train_with_config(config, trainer):
    from ..train.interface import launch_train_with_config as old_launch
    old_launch(config, trainer)
...@@ -8,7 +8,7 @@ import tensorflow as tf ...@@ -8,7 +8,7 @@ import tensorflow as tf
from ..callbacks.graph import RunOp from ..callbacks.graph import RunOp
from ..utils.develop import log_deprecated from ..utils.develop import log_deprecated
from ..input_source import QueueInput, StagingInputWrapper, DummyConstantInput from ..input_source import QueueInput, StagingInput, DummyConstantInput
from ..graph_builder.training import ( from ..graph_builder.training import (
SyncMultiGPUParameterServerBuilder, SyncMultiGPUParameterServerBuilder,
SyncMultiGPUReplicatedBuilder, SyncMultiGPUReplicatedBuilder,
...@@ -43,8 +43,8 @@ def apply_prefetch_policy(config, gpu_prefetch=True): ...@@ -43,8 +43,8 @@ def apply_prefetch_policy(config, gpu_prefetch=True):
assert tf.test.is_gpu_available() assert tf.test.is_gpu_available()
# seem to only improve on >1 GPUs # seem to only improve on >1 GPUs
if not isinstance(config.data, (StagingInputWrapper, DummyConstantInput)): if not isinstance(config.data, (StagingInput, DummyConstantInput)):
config.data = StagingInputWrapper(config.data, config.tower) config.data = StagingInput(config.data, config.tower)
class SyncMultiGPUTrainerParameterServer(Trainer): class SyncMultiGPUTrainerParameterServer(Trainer):
...@@ -70,7 +70,7 @@ class SyncMultiGPUTrainerParameterServer(Trainer): ...@@ -70,7 +70,7 @@ class SyncMultiGPUTrainerParameterServer(Trainer):
self.train_op = SyncMultiGPUParameterServerBuilder( self.train_op = SyncMultiGPUParameterServerBuilder(
self._config.tower, self._ps_device).build( self._config.tower, self._ps_device).build(
lambda: self.model.build_graph_get_grads( lambda: self.model._build_graph_get_grads(
*self._input_source.get_input_tensors()), *self._input_source.get_input_tensors()),
self.model.get_optimizer) self.model.get_optimizer)
...@@ -104,7 +104,7 @@ class SyncMultiGPUTrainerReplicated(Trainer): ...@@ -104,7 +104,7 @@ class SyncMultiGPUTrainerReplicated(Trainer):
self.train_op, post_init_op = SyncMultiGPUReplicatedBuilder( self.train_op, post_init_op = SyncMultiGPUReplicatedBuilder(
self._config.tower).build( self._config.tower).build(
lambda: self.model.build_graph_get_grads( lambda: self.model._build_graph_get_grads(
*self._input_source.get_input_tensors()), *self._input_source.get_input_tensors()),
self.model.get_optimizer) self.model.get_optimizer)
...@@ -134,7 +134,7 @@ class AsyncMultiGPUTrainer(Trainer): ...@@ -134,7 +134,7 @@ class AsyncMultiGPUTrainer(Trainer):
self.train_op = AsyncMultiGPUBuilder( self.train_op = AsyncMultiGPUBuilder(
self._config.tower, self._scale_gradient).build( self._config.tower, self._scale_gradient).build(
lambda: self.model.build_graph_get_grads( lambda: self.model._build_graph_get_grads(
*self._input_source.get_input_tensors()), *self._input_source.get_input_tensors()),
self.model.get_optimizer) self.model.get_optimizer)
......
...@@ -43,7 +43,7 @@ class SimpleTrainer(Trainer): ...@@ -43,7 +43,7 @@ class SimpleTrainer(Trainer):
cbs = self._input_source.setup(self.model.get_inputs_desc()) cbs = self._input_source.setup(self.model.get_inputs_desc())
with TowerContext('', is_training=True): with TowerContext('', is_training=True):
grads = self.model.build_graph_get_grads( grads = self.model._build_graph_get_grads(
*self._input_source.get_input_tensors()) *self._input_source.get_input_tensors())
opt = self.model.get_optimizer() opt = self.model.get_optimizer()
self.train_op = opt.apply_gradients(grads, name='min_op') self.train_op = opt.apply_gradients(grads, name='min_op')
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# File: utility.py
# for backwards-compatibility
from ..graph_builder.utils import (  # noqa
    OverrideToLocalVariable,
    override_to_local_variable, LeastLoadedDeviceSetter)
...@@ -15,7 +15,7 @@ from tensorpack.user_ops.zmq_recv import ( # noqa ...@@ -15,7 +15,7 @@ from tensorpack.user_ops.zmq_recv import ( # noqa
try: try:
num = int(sys.argv[1]) num = int(sys.argv[1])
except: except ValueError:
num = 2 num = 2
ENDPOINT = 'ipc://test-pipe' ENDPOINT = 'ipc://test-pipe'
......
...@@ -53,7 +53,7 @@ def download(url, dir, filename=None): ...@@ -53,7 +53,7 @@ def download(url, dir, filename=None):
fpath, _ = urllib.request.urlretrieve(url, fpath, reporthook=hook(t)) fpath, _ = urllib.request.urlretrieve(url, fpath, reporthook=hook(t))
statinfo = os.stat(fpath) statinfo = os.stat(fpath)
size = statinfo.st_size size = statinfo.st_size
except: except IOError:
logger.error("Failed to download {}".format(url)) logger.error("Failed to download {}".format(url))
raise raise
assert size > 0, "Download an empty file!" assert size > 0, "Download an empty file!"
......
...@@ -135,7 +135,7 @@ def get_caffe_pb(): ...@@ -135,7 +135,7 @@ def get_caffe_pb():
version = version.decode('utf-8') version = version.decode('utf-8')
version = float('.'.join(version.split(' ')[1].split('.')[:2])) version = float('.'.join(version.split(' ')[1].split('.')[:2]))
assert version >= 2.7, "Require protoc>=2.7 for Python3" assert version >= 2.7, "Require protoc>=2.7 for Python3"
except: except Exception:
logger.exception("protoc --version gives: " + str(version)) logger.exception("protoc --version gives: " + str(version))
raise raise
......
[flake8] [flake8]
max-line-length = 120 max-line-length = 120
ignore = E265 ignore = E265,E741,E742,E743
exclude = .git, exclude = .git,
tensorpack/__init__.py, tensorpack/__init__.py,
setup.py, setup.py,
snippet,
docs, docs,
examples, examples,
docs/conf.py
snippet,
examples-old, examples-old,
_test.py, _test.py,
docs/conf.py