Commit 3476cb43 authored by Yuxin Wu

update docs

parent 45d63caf
......@@ -8,5 +8,7 @@ about: Suggest an idea for Tensorpack
(See http://tensorpack.readthedocs.io/en/latest/tutorial/index.html#extend-tensorpack).
It does not have to be added to Tensorpack unless you have a good reason.
+ "Could you improve/implement an example/paper ?"
-- The answer is: we have no plans to do so. We don't take feature requests for
examples or implement a paper for you. If you don't know how to do it, you may ask a usage question.
-- The answer is: we have no plans to do so. We don't consider feature
requests for examples or implement a paper for you, unless it demonstrates
some Tensorpack features not yet demonstrated in the existing examples.
If you don't know how to do it, you may ask a usage question.
......@@ -14,7 +14,7 @@ that yields datapoints (lists) of two components:
a numpy array of shape (64, 28, 28), and an array of shape (64,).
As you saw,
DataFlow is __independent__ of TensorFlow since it produces any python objects
DataFlow is __independent of TensorFlow__ since it produces any python objects
(usually numpy arrays).
To `import tensorpack.dataflow`, you don't even have to install TensorFlow.
You can simply use DataFlow as a data processing pipeline and plug it into any other frameworks.
......@@ -24,7 +24,7 @@ You can simply use DataFlow as a data processing pipeline and plug it into any o
One good thing about having a standard interface is that it enables
the greatest code reusability.
There are a lot of existing DataFlow utilities in tensorpack, which you can use to compose
complex DataFlow with a long data pipeline. A common pipeline usually
DataFlow with a complex data pipeline. A common pipeline usually
would __read from disk (or other sources), apply transformations, group into batches,
prefetch data__, etc. A simple example is as follows:
......@@ -38,13 +38,12 @@ df = BatchData(df, 128)
# start 3 processes to run the dataflow in parallel
df = PrefetchDataZMQ(df, 3)
```
You can find more complicated DataFlow in the [ResNet training script](../examples/ResNet/imagenet_utils.py)
You can find more complicated DataFlow in the [ImageNet training script](../examples/ImageNetModels/imagenet_utils.py)
with all the data preprocessing.
Unless you are working with standard data types (image folders, LMDB, etc.),
you would usually want to write the source DataFlow (`MyDataFlow` in the above example) for your data format.
See [another tutorial](extend/dataflow.html)
for simple instructions on writing a DataFlow.
See [another tutorial](extend/dataflow.html) for simple instructions on writing a DataFlow.
Once you have the source reader, all the [existing DataFlows](../modules/dataflow.html) are ready for you to complete
the rest of the data pipeline.
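For reference, here is a minimal sketch of what a source DataFlow such as `MyDataFlow` above might look like (the exact method names, `get_data`/`size` vs. `__iter__`/`__len__`, depend on your tensorpack version):
```python
import numpy as np
from tensorpack.dataflow import DataFlow

class MyDataFlow(DataFlow):
    """A toy source DataFlow yielding random image/label datapoints."""
    def get_data(self):  # newer tensorpack versions use __iter__ instead
        for _ in range(self.size()):
            # each datapoint is a list of components, e.g. [image, label]
            yield [np.random.rand(28, 28).astype('float32'), np.random.randint(10)]

    def size(self):  # optional; newer tensorpack versions use __len__
        return 100
```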
......@@ -62,7 +61,7 @@ Nevertheless, tensorpack supports data loading with native TF operators / TF dat
### Use DataFlow (outside Tensorpack)
Normally, tensorpack `InputSource` interface links DataFlow to the graph for training.
If you use DataFlow in some custom code, call `reset_state()` first to initialize it,
If you use DataFlow in other places such as your custom code, call `reset_state()` first to initialize it,
and then use the generator however you like:
```python
df = SomeDataFlow()
......
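For illustration, the elided example above presumably continues along these lines (a sketch; `SomeDataFlow` and `process` are placeholders):
```python
df = SomeDataFlow()
df.reset_state()                 # must be called once before using the generator
for datapoint in df.get_data():  # newer tensorpack versions: `for datapoint in df:`
    # datapoint is a list of components, e.g. [image, label]
    process(datapoint)
```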
......@@ -11,7 +11,7 @@ There are two ways to do inference during training.
See [Write a Callback](extend/callback.html).
2. If your inference follows the paradigm of:
"fetch some tensors for each input, and aggregate the results".
"evaluate some tensors for each input, and aggregate the results in the end".
You can use the `InferenceRunner` interface with some `Inferencer`s (see the sketch below).
This will further support prefetch & data-parallel inference.
More details to come.
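A minimal sketch of that usage (assuming a validation DataFlow `dataset_val` and tensor names `cost` / `wrong-top1` defined by your own model):
```python
from tensorpack.callbacks import InferenceRunner, ScalarStats, ClassificationError

callbacks = [
    # run the Inferencers over dataset_val at the end of each epoch
    InferenceRunner(dataset_val,
                    [ScalarStats('cost'), ClassificationError('wrong-top1')]),
]
```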
......@@ -22,18 +22,19 @@ You can use this predicate to choose a different code path in inference mode.
## Inference After Training
Tensorpack is a training interface -- __it doesn't care what happened after training__.
You have everything needed for inference or model diagnosis after
You have everything you need for inference or model diagnosis after
training:
1. The trained weights: tensorpack saves them in standard TF checkpoint format.
2. The model: you've already written it yourself with TF symbolic functions.
Therefore, you can build the graph for inference, load the checkpoint, and then use whatever deployment methods TensorFlow supports.
Therefore, you can build the graph for inference, load the checkpoint, and apply
any processing or deployment TensorFlow supports.
And you'll need to read TF docs and __do it on your own__.
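As one example, a sketch of this workflow using tensorpack's own `OfflinePredictor` (the model class, checkpoint path, and tensor names below are placeholders):
```python
from tensorpack.predict import PredictConfig, OfflinePredictor
from tensorpack.tfutils.sessinit import get_model_loader

# MyModel, the checkpoint path and the tensor names are placeholders for your own.
pred = OfflinePredictor(PredictConfig(
    model=MyModel(),
    session_init=get_model_loader('/path/to/train_log/checkpoint'),
    input_names=['input'],
    output_names=['prob']))
prob = pred(one_image)[0]   # feed numpy arrays, get numpy outputs back
```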
### Don't Use Training Metagraph for Inference
Metagraph is the wrong abstraction for a "model".
It stores the entire graph which contains not only the model, but also all the
It stores the entire graph which contains not only the mathematical model, but also all the
training settings (queues, iterators, summaries, evaluations, multi-gpu replications).
Therefore it is usually wrong to import a training metagraph for inference.
......
......@@ -39,5 +39,5 @@ Variables that appear in only one side will be printed as warning.
## Transfer Learning
Therefore, transfer learning is trivial.
If you want to load some model, just use the same variable names.
If you want to load a pre-trained model, just use the same variable names.
If you want to re-train some layer, just rename it.
......@@ -2,7 +2,8 @@
# Symbolic Layers
Tensorpack contains a small collection of common model primitives,
such as conv/deconv, fc, bn, pooling layers.
such as conv/deconv, fc, bn, pooling layers. **You do not need to learn them.**
These layers were written only because there were no alternatives when
tensorpack was first developed.
Nowadays, these implementations actually call `tf.layers` directly.
......
......@@ -46,12 +46,13 @@ Model:
Speed:
1. The training will start very slow due to convolution warmup, until about 10k
steps to reach a maximum speed.
1. The training will start very slowly due to convolution warmup, until about
10k steps (or more if scale augmentation is used) to reach a maximum speed.
As a result, the ETA is also inaccurate at the beginning.
You can disable warmup by `export TF_CUDNN_USE_AUTOTUNE=0`, which makes the
training faster at the beginning, but perhaps not in the end.
1. After warmup the training speed will slowly decrease due to more accurate proposals.
1. After warmup, the training speed will slowly decrease due to more accurate proposals.
1. This implementation is about 10% slower than detectron,
probably due to the lack of specialized ops (e.g. AffineChannel, ROIAlign) in TensorFlow.
......@@ -62,7 +63,7 @@ Speed:
Possible Future Enhancements:
1. Define an interface to load custom dataset.
1. Define a better interface to load custom dataset.
1. Support batch>1 per GPU.
......
......@@ -24,7 +24,9 @@ class _COCOMeta(object):
'val2014': 'val2014',
'valminusminival2014': 'val2014',
'minival2014': 'val2014',
'test2014': 'test2014'
'test2014': 'test2014',
'train2017': 'train2017',
'val2017': 'val2017',
}
def valid(self):
......
......@@ -88,7 +88,9 @@ _C.BACKBONE.FREEZE_AT = 2 # options: 0, 1, 2
# See https://github.com/tensorflow/tensorflow/issues/18213
# In tensorpack model zoo, ResNet models with TF_PAD_MODE=False are marked with "-AlignPadding".
# All other models under `ResNet/` in the model zoo are trained with TF_PAD_MODE=True.
# All other models under `ResNet/` in the model zoo are using TF_PAD_MODE=True.
# Using either one should probably give the same performance.
# We use the "AlignPadding" one just to be consistent with caffe2.
_C.BACKBONE.TF_PAD_MODE = False
_C.BACKBONE.STRIDE_1X1 = False # True for MSRA models
......@@ -101,6 +103,7 @@ _C.TRAIN.STEPS_PER_EPOCH = 500
# LR_SCHEDULE means "steps" only when total batch size is 8.
# Otherwise the actual steps to decrease learning rate are computed from the schedule.
# Therefore, there is *no need* to modify the config if you only change the number of GPUs.
# LR_SCHEDULE = [120000, 160000, 180000] # "1x" schedule in detectron
_C.TRAIN.LR_SCHEDULE = [240000, 320000, 360000] # "2x" schedule in detectron
_C.TRAIN.NUM_EVALS = 20 # number of evaluations to run during training
......
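To make the rescaling described in the comments above concrete, here is a small sketch of the computation (NUM_GPUS and STEPS_PER_EPOCH are hypothetical values; the actual logic lives in `train.py`):
```python
# "2x" schedule, defined for a total batch size of 8 (8 GPUs x 1 image)
LR_SCHEDULE = [240000, 320000, 360000]
NUM_GPUS = 4                       # hypothetical: half the reference batch size
STEPS_PER_EPOCH = 500

factor = 8. / NUM_GPUS             # = 2.0
for steps in LR_SCHEDULE:
    epoch = int(steps * factor // STEPS_PER_EPOCH)
    print("decrease LR at (tensorpack) epoch", epoch)   # 960, 1280, 1440
```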
......@@ -292,7 +292,7 @@ def get_train_dataflow():
class: numpy array of k integers
is_crowd: k booleans. Use k False values if you don't know what it means.
segmentation: k lists of numpy arrays (one for each box).
Each list of numpy array corresponds to the mask for one instance.
Each list of numpy arrays corresponds to the mask for one instance.
Each numpy array in the list is a polygon of shape Nx2,
because one mask can be represented by N polygons.
......
......@@ -417,6 +417,8 @@ class EvalCallback(Callback):
self.dataflows = [get_eval_dataflow(shard=k, num_shards=self.num_predictor)
for k in range(self.num_predictor)]
else:
if hvd.size() > hvd.local_size():
logger.warn("Distributed evaluation with horovod is unstable. Sometimes MPI hangs for unknown reasons.")
self.predictor = self._build_coco_predictor(0)
self.dataflow = get_eval_dataflow(shard=hvd.rank(), num_shards=hvd.size())
......@@ -495,7 +497,7 @@ if __name__ == '__main__':
if get_tf_version_tuple() < (1, 6):
# https://github.com/tensorflow/tensorflow/issues/14657
logger.warn("TF<1.6 has a bug which may lead to crash in FasterRCNN training if you're unlucky.")
logger.warn("TF<1.6 has a bug which may lead to crash in FasterRCNN if you're unlucky.")
args = parser.parse_args()
if args.config:
......@@ -540,7 +542,7 @@ if __name__ == '__main__':
init_lr = cfg.TRAIN.BASE_LR * 0.33 * min(8. / cfg.TRAIN.NUM_GPUS, 1.)
warmup_schedule = [(0, init_lr), (cfg.TRAIN.WARMUP, cfg.TRAIN.BASE_LR)]
warmup_end_epoch = cfg.TRAIN.WARMUP * 1. / stepnum
lr_schedule = [(int(np.ceil(warmup_end_epoch)), cfg.TRAIN.BASE_LR)]
lr_schedule = [(int(warmup_end_epoch + 0.5), cfg.TRAIN.BASE_LR)]
factor = 8. / cfg.TRAIN.NUM_GPUS
for idx, steps in enumerate(cfg.TRAIN.LR_SCHEDULE[:-1]):
......@@ -549,6 +551,10 @@ if __name__ == '__main__':
(steps * factor // stepnum, cfg.TRAIN.BASE_LR * mult))
logger.info("Warm Up Schedule (steps, value): " + str(warmup_schedule))
logger.info("LR Schedule (epochs, value): " + str(lr_schedule))
train_dataflow = get_train_dataflow()
# This is what's commonly referred to as "epochs"
total_passes = cfg.TRAIN.LR_SCHEDULE[-1] * 8 / train_dataflow.size()
logger.info("Total passes of the training set is: {}".format(total_passes))
callbacks = [
PeriodicCallback(
......@@ -573,7 +579,7 @@ if __name__ == '__main__':
traincfg = TrainConfig(
model=MODEL,
data=QueueInput(get_train_dataflow()),
data=QueueInput(train_dataflow),
callbacks=callbacks,
steps_per_epoch=stepnum,
max_epoch=cfg.TRAIN.LR_SCHEDULE[-1] * factor // stepnum,
......
......@@ -170,7 +170,7 @@ class MultiGPUGANTrainer(TowerTrainer):
list(range(num_gpu)),
lambda: self.tower_func(*input.get_input_tensors()),
devices)
# Simply average the cost here. It might be faster to average the gradients
# For simplicity, average the cost here. It might be faster to average the gradients
with tf.name_scope('optimize'):
d_loss = tf.add_n([x[0] for x in cost_list]) * (1.0 / num_gpu)
g_loss = tf.add_n([x[1] for x in cost_list]) * (1.0 / num_gpu)
......
......@@ -35,10 +35,20 @@ class InjectShell(Callback):
"""
Allow users to create a specific file as a signal to pause
and iteratively debug the training.
Once triggered, it detects whether the file exists, and opens an
Once the :meth:`trigger` method is called, it detects whether the file exists, and opens an
IPython/pdb shell if yes.
In the shell, `self` is this callback, `self.trainer` is the trainer, and
from that you can access everything else.
Example:

.. code-block:: python

    callbacks=[InjectShell('/path/to/pause-training.tmp'), ...]

    # the following command will pause the training when the epoch finishes:
    $ touch /path/to/pause-training.tmp
"""
def __init__(self, file='INJECT_SHELL.tmp', shell='ipython'):
......