Commit 167fa123 authored by Yuxin Wu's avatar Yuxin Wu

update docs

parent 353cd04f
......@@ -49,6 +49,6 @@ Note that the above methods only prevent variables being updated by SGD.
Some variables may be updated by other means,
e.g., BatchNorm statistics are updated through the `UPDATE_OPS` collection and the [RunUpdateOps](../modules/callbacks.html#tensorpack.callbacks.RunUpdateOps) callback.
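For reference, a minimal sketch of the plain-TensorFlow pattern behind this (the `optimizer` and `loss` objects below are hypothetical placeholders; tensorpack's `RunUpdateOps` callback runs these ops for you, so you normally don't write this yourself):

```python
import tensorflow as tf

# Ops in the UPDATE_OPS collection (e.g. BatchNorm moving-average updates)
# are not run by SGD; they have to be executed explicitly.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)  # hypothetical optimizer/loss
```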
## My training is slow!
## My training seems slow. Why?
Check out the [Performance Tuning tutorial](performance-tuning.html)
......@@ -157,11 +157,11 @@ with TowerContext('', is_training=False):
You can just use `tf.train.Saver` for all the work.
Alternatively, use tensorpack's `get_model_loader(path).init(tf.get_default_session())`
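A minimal sketch of both options, assuming the inference graph has already been built in the default graph and `model_path` (a placeholder) points to your checkpoint:

```python
import tensorflow as tf
from tensorpack.tfutils.sessinit import get_model_loader

sess = tf.Session()
# tensorpack helper: restores the checkpoint into the given session
get_model_loader(model_path).init(sess)

# or, with plain TensorFlow:
# tf.train.Saver().restore(sess, model_path)
```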
Now, you've already built a graph for inference, and the checkpoint is loaded.
Now, you've already built a graph for inference, and the checkpoint is also loaded.
You may now:
1. use `sess.run` to do inference
2. save the grpah to some formats for further processing
2. save the graph to some formats for further processing
3. apply graph transformation for efficient inference
These steps are unrelated to tensorpack, and you'll need to learn TensorFlow and
......
......@@ -17,7 +17,8 @@ The Tensorpack API brings speed and flexibility together.
Is TensorFlow Slow?
~~~~~~~~~~~~~~~~~~~~~
No it's not, but it's not easy to write it in an efficient way.
There is a common misconception,
but no, it's not slow. It is, however, not easy to write TensorFlow code in an efficient way.
When **speed** is a concern, users will have to worry a lot about things unrelated to the model.
Code written with low-level APIs or other existing high-level wrappers is often suboptimal in speed.
......@@ -28,7 +29,7 @@ The `official TensorFlow benchmark <https://github.com/tensorflow/benchmarks/tre
These models are designed for performance. For models that have clean and easy-to-read implementations, see the TensorFlow Official Models.
which seems to suggest that you cannot have performance and ease-of-use together.
which seems to suggest that you cannot have **performance and ease-of-use together**.
However, you can have them both in tensorpack.
Tensorpack uses TensorFlow efficiently, and hides performance details under its APIs.
You no longer need to write
......
......@@ -6,13 +6,20 @@ __We do not know why your training is slow__
Tensorpack is designed to be high-performance, as can be seen in the [benchmarks](https://github.com/tensorpack/benchmarks).
But performance is different across machines and tasks,
so it's not easy to let others understand what goes wrong without doing some investigations by your own.
and it requires knowledge of the entire stack to understand what might be wrong.
Therefore when you have a performance issue,
it's not easy for others to understand what went wrong unless you do some investigation on your own.
Tensorpack has some tools to make it easier to understand the performance.
Here is a list of things you can do to understand why your training is slow.
Here we provide a list of things you can do to understand why your training is slow.
If you ask for help to understand and improve the speed, PLEASE do the
investigations below, post your hardware information and your findings from the investigation, such as what changes
you've made and what performance numbers you've seen.
investigations below, and post your hardware information & your findings from the investigation.
The findings should be something like:
1. [... your code ...], performance: ...
1. [... made change A ...], performance: ...
1. [... made change B ...], performance: ...
## Figure out the bottleneck
......@@ -29,7 +36,7 @@ you've made and what performance numbers you've seen.
## Benchmark the components
Whatever benchmarks you're doing, never look at the speed of the first 50 iterations.
Everything is slow at the beginning.
Things can be slow at the beginning.
1. Use `dataflow=FakeData(shapes, random=False)` to replace your original DataFlow by a constant DataFlow.
This will benchmark the graph, without the possible overhead of DataFlow.
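A minimal sketch, assuming (hypothetically) a model that takes a batch of 224x224 RGB images and integer labels; match the shapes to your own inputs:

```python
from tensorpack.dataflow import FakeData

# constant (non-random) fake data: benchmarks the graph without DataFlow overhead
df = FakeData([[64, 224, 224, 3], [64]], random=False)
# then pass `dataflow=df` to your TrainConfig in place of the real DataFlow
```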
......
......@@ -55,7 +55,7 @@ graph or the variables in your loader.
## Resume Training
"resume training" means "loading the last known checkpoint".
"resume training" is mostly just "loading the last known checkpoint".
Therefore you should refer to the [previous section](#load-a-model-to-a-session)
on how to load a model.
......@@ -63,14 +63,15 @@ on how to load a model.
.. note:: **A checkpoint does not resume everything!**
The TensorFlow checkpoint only saves TensorFlow variables,
which means other Python states that are not TensorFlow variables will not be saved
and resumed. This often include:
which means Python state that is not stored in TensorFlow variables will not be saved
and resumed. This means:
1. Training epoch number. You can set it by providing a `starting_epoch` to
your resume job.
2. State in your callbacks. Certain callbacks maintain a state
1. Training epoch number will not be resumed.
You can set it by providing a ``starting_epoch`` to your resume job.
2. State in your callbacks will not be resumed. Certain callbacks maintain a state
(e.g., current best accuracy) in Python, which cannot be saved automatically.
```
The [AutoResumeTrainConfig](../modules/train.html#tensorpack.train.AutoResumeTrainConfig)
is an alternative to `TrainConfig` which applies some heuristics to
automatically resume both the checkpoint and the epoch number from your log directory.
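A hedged sketch of using it (`MyModel` and `my_dataflow` are placeholders for your own ModelDesc and DataFlow):

```python
from tensorpack import AutoResumeTrainConfig, SimpleTrainer, launch_train_with_config

config = AutoResumeTrainConfig(
    model=MyModel(),          # your ModelDesc
    dataflow=my_dataflow,     # your DataFlow
    callbacks=[],
    max_epoch=100,
    # starting_epoch=51,      # only needed when resuming manually with plain TrainConfig
)
launch_train_with_config(config, SimpleTrainer())
```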
......@@ -40,14 +40,15 @@ for easier options.
### Noisy TensorFlow Summaries
Since TF summaries are evaluated infrequently (every epoch) by default,
if the content is data-dependent, the values could have high variance.
if the content is data-dependent (e.g., training loss),
the infrequently-sampled values could have high variance.
To address this issue, you can:
1. Change "When to Log": log more frequently, but note that certain summaries can be expensive to
log. You may want to use a separate collection for frequent logging.
2. Change "What to Log": you can call
[tfutils.summary.add_moving_summary](../modules/tfutils.html#tensorpack.tfutils.summary.add_moving_summary)
on scalar tensors, which will summarize the moving average of those scalars, instead of their instant values.
The moving averages are maintained by the
The moving averages are updated every step by the
[MovingAverageSummary](../modules/callbacks.html#tensorpack.callbacks.MovingAverageSummary)
callback (enabled by default).
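A minimal sketch of option 2, inside your graph-building code (`cost` and `accuracy` stand for scalar tensors you have already defined):

```python
from tensorpack.tfutils.summary import add_moving_summary

# summarize moving averages of the scalars instead of their instant values
add_moving_summary(cost, accuracy)
```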
......@@ -56,7 +57,7 @@ To address this issue, you can:
Besides TensorFlow summaries,
a callback can also write other data to the monitor backend at any time once the training has started,
by `self.trainer.monitors.put_xxx`.
As long as the type of data is supported, the data will be dispatched to and logged to the same place.
As long as the type of data is supported, the data will be dispatched to and logged to the same places.
As a result, tensorboard will show not only summaries in the graph, but also your custom data.
For example, a precise validation error often needs to be computed manually, outside the TensorFlow graph.
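A hedged sketch of such a callback (`compute_validation_error` stands for whatever Python code produces your metric outside the TensorFlow graph):

```python
from tensorpack.callbacks import Callback

class ValidationErrorLogger(Callback):
    def _trigger_epoch(self):
        err = compute_validation_error()          # plain Python, outside the TF graph
        self.trainer.monitors.put_scalar('val-error', err)
```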
......