Commit 10c6f81d authored by Yuxin Wu

update docs

parent db5019b8
......@@ -49,8 +49,3 @@ TrainConfig(
]
)
```
## Write a callback
TODO
# Dataflow
# DataFlow
Dataflow is a library to help you build Python iterators to load data.
DataFlow is a library to help you build Python iterators to load data.
A Dataflow has a `get_data()` generator method,
A DataFlow has a `get_data()` generator method,
which yields `datapoints`.
A datapoint must be a **list** of Python objects, which I call the `components` of the datapoint.
For example, to train on the MNIST dataset, you can build a Dataflow with a `get_data()` method
For example, to train on the MNIST dataset, you can build a DataFlow with a `get_data()` method
that yields datapoints of two elements (components):
a numpy array of shape (64, 28, 28), and an array of shape (64,).
......@@ -15,13 +15,13 @@ a numpy array of shape (64, 28, 28), and an array of shape (64,).
One good thing about having a standard interface is that it provides
the greatest code reusability.
There are a lot of existing modules in tensorpack which you can use to compose
complex Dataflow instances with a long pre-processing pipeline. A whole pipeline usually
complex DataFlow instances with a long pre-processing pipeline. A whole pipeline usually
would __read from disk (or other sources), apply augmentations, group into batches,
prefetch data__, etc. A simple example is the following:
````python
# define a Dataflow which produces image-label pairs from a caffe lmdb database
df = CaffeLMDB('/path/to/caffe/lmdb', shuffle=False)
# a DataFlow you implement to produce [image,label] pairs from whatever sources:
df = MyDataFlow(shuffle=True)
# resize the image component of each datapoint
df = AugmentImageComponent(df, [imgaug.Resize((225, 225))])
# group data into batches of size 128
......@@ -43,46 +43,21 @@ tasks as large as ImageNet training.
-->
### Reuse in other frameworks
Another good thing about Dataflow is that it is independent of
Another good thing about DataFlow is that it is independent of
tensorpack internals. You can just use it as an efficient data processing pipeline,
and plug it into other frameworks.
To use a DataFlow, you'll need to call `reset_state()` first to initialize it, and then use the generator however you
want:
To use a DataFlow independently, you'll need to call `reset_state()` first to initialize it,
and then use the generator however you want:
```python
df = get_some_df()
df.reset_state()
generator = df.get_data()
for dp in generator:
# dp is now a list. do whatever
```
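As a rough, self-contained sketch of the same pattern without tensorpack installed: `ToyDataFlow` and `consume` below are hypothetical names made up for illustration, and `ToyDataFlow` is only a stand-in that mimics the `reset_state()` / `get_data()` interface described above. It is not the real class.

```python
import itertools
import random


class ToyDataFlow:
    """Hypothetical stand-in for a DataFlow; not tensorpack's class."""

    def reset_state(self):
        # Per-process state (here, a random number generator) is created here.
        self.rng = random.Random(0)

    def get_data(self):
        # Each datapoint is a list of components: [pixels, label].
        for _ in range(5):
            pixels = [self.rng.random() for _ in range(4)]
            label = self.rng.randrange(10)
            yield [pixels, label]


def consume(df, limit=None):
    """Initialize the DataFlow first, then collect up to `limit` datapoints."""
    df.reset_state()
    return list(itertools.islice(df.get_data(), limit))


datapoints = consume(ToyDataFlow(), limit=3)
for pixels, label in datapoints:
    print(len(pixels), label)
```

The only contract the consumer relies on is that `reset_state()` is called once before iterating and that every datapoint is a list, so the loop can hand the components to whatever framework you use.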
### Write your own Dataflow
There are several existing Dataflows, e.g. `ImageFromFile` and `DataFromList`, which you can
use to read images or load data from a list.
But in general, you'll probably need to write a new Dataflow to produce data for your task.
Dataflow implementations for several well-known datasets are provided in the
[dataflow.dataset](http://tensorpack.readthedocs.io/en/latest/modules/tensorpack.dataflow.dataset.html)
module; you can take them as a reference.
Usually you just need to implement the `get_data()` method which yields a datapoint every time.
```python
class MyDataFlow(DataFlow):
def get_data(self):
for k in range(100):
digit = np.random.rand(28, 28)
label = np.random.randint(10)
yield [digit, label]
```
Optionally, a Dataflow can implement the following two methods:
+ `size()`. Return the number of elements the generator can produce. Certain modules might require this.
For example, only Dataflows with the same number of elements can be joined together.
+ `reset_state()`. It's guaranteed that the actual process which runs a DataFlow will invoke this method before using it.
So if this DataFlow needs to do something after a `fork()`, you should put it here.
A typical situation is when your Dataflow uses a random number generator (RNG). Then you'd need to reset the RNG here,
otherwise child processes will have the same random seed. The `RNGDataFlow` class does this already.
With a "low-level" Dataflow defined, you can then compose it with existing modules.
Unless you're working with standard data types (image folders, LMDB, etc),
you would usually want to write your own DataFlow.
See [another tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/extend/dataflow.html)
for details.
## Write a callback
TODO
### Write a DataFlow
There are several existing DataFlows, e.g. `ImageFromFile` and `DataFromList`, which you can
use to read images or load data from a list.
But in general, you'll probably need to write a new DataFlow to produce data for your task.
DataFlow implementations for several well-known datasets are provided in the
[dataflow.dataset](http://tensorpack.readthedocs.io/en/latest/modules/tensorpack.dataflow.dataset.html)
module; you can take them as a reference.
Usually you just need to implement the `get_data()` method which yields a datapoint every time.
```python
class MyDataFlow(DataFlow):
def get_data(self):
for k in range(100):
digit = np.random.rand(28, 28)
label = np.random.randint(10)
yield [digit, label]
```
Optionally, a DataFlow can implement the following two methods:
+ `size()`. Return the number of elements the generator can produce. Certain modules might require this.
For example, only DataFlows with the same number of elements can be joined together.
+ `reset_state()`. It's guaranteed that the actual process which runs a DataFlow will invoke this method before using it.
So if this DataFlow needs to do something after a `fork()`, you should put it here.
A typical situation is when your DataFlow uses a random number generator (RNG). Then you'd need to reset the RNG here,
otherwise child processes will have the same random seed. The `RNGDataFlow` class does this already.
With a "low-level" DataFlow defined, you can then compose it with existing modules.
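To make the two optional methods concrete, here is a minimal sketch that runs without tensorpack. `DataFlowBase` and `RandomDigits` are hypothetical names, and the base class is only a stand-in for the interface described above. The seeding scheme (process id mixed with `os.urandom`) is an assumption chosen for illustration, not necessarily what `RNGDataFlow` does.

```python
import os
import random


class DataFlowBase:
    """Hypothetical minimal stand-in for the DataFlow interface."""

    def get_data(self):
        raise NotImplementedError

    def size(self):
        raise NotImplementedError

    def reset_state(self):
        pass


class RandomDigits(DataFlowBase):
    """Yields `num` datapoints of [28x28 nested list, integer label]."""

    def __init__(self, num=100):
        self.num = num
        self.rng = None  # created in reset_state(), i.e. after any fork()

    def reset_state(self):
        # Re-seed in the process that actually runs the DataFlow, so that
        # forked children do not all replay the same random stream.
        seed = (os.getpid() << 32) ^ int.from_bytes(os.urandom(4), "little")
        self.rng = random.Random(seed)

    def size(self):
        # Number of datapoints that get_data() will produce.
        return self.num

    def get_data(self):
        for _ in range(self.num):
            digit = [[self.rng.random() for _ in range(28)] for _ in range(28)]
            label = self.rng.randrange(10)
            yield [digit, label]


df = RandomDigits(num=10)
df.reset_state()
print(df.size(), sum(1 for _ in df.get_data()))
```

Creating the RNG in `reset_state()` rather than in `__init__` is the point of the sketch: the constructor typically runs in the parent process, while `reset_state()` runs in whichever process consumes the data.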
## Implement a layer
Symbolic functions should be nothing new to you.
Using symbolic functions is not special in tensorpack: you can use any symbolic functions you've
made or seen elsewhere with tensorpack layers.
You can use symbolic functions from slim/tflearn/tensorlayer, and even Keras ([with some tricks](../../examples/mnist-keras.py)).
So you never **have to** implement a tensorpack layer.
If you'd like, you can make a symbolic function become a "layer" by following some simple rules, and then gain benefits from the framework.
Take a look at the [Convolutional Layer](../../tensorpack/models/conv2d.py#L14) implementation for an example of how to define a layer:
```python
@layer_register()
def Conv2D(x, out_channel, kernel_shape,
padding='SAME', stride=1,
W_init=None, b_init=None,
nl=tf.nn.relu, split=1, use_bias=True):
```
Basically, a layer is a symbolic function with the following rules:
+ It is decorated by `@layer_register`.
+ The first argument is its "input". It must be a **tensor or a list of tensors**.
+ It returns either a tensor or a list of tensors as its "output".
By making a symbolic function a "layer", the following things will happen:
+ You will need to call the function with a scope name as the first argument, e.g. `Conv2D('conv0', x, 32, 3)`.
Everything happening in this function will be under the variable scope 'conv0'.
You can register the layer with `use_scope=False` to disable this feature.
+ Static shapes of input/output will be printed to screen.
+ `argscope` will then work for all its arguments except the input tensor(s).
+ It will work with `LinearWrap`: you can use it if the output of one layer matches the input of the next layer.
There are also a number of (non-layer) symbolic functions in the `tfutils.symbolic_functions` module.
There isn't a rule about what kind of symbolic functions should be made a layer -- they're quite
similar anyway. But in general I define the following symbolic functions as layers:
+ Functions which contain variables. A variable scope is almost always helpful for such functions.
+ Functions which are commonly referred to as "layers", such as pooling. This makes a model
definition more straightforward.
## Write a trainer
The existing trainers should be enough for single-cost optimization tasks. If you
want to do something inside the trainer, consider writing it as a callback, or
open an issue to see if there is a better solution than creating new trainers.
For certain tasks, you might need a new trainer.
The [GAN trainer](../../examples/GAN/GAN.py) is one example of how to implement
new trainers.
More details to come.
......@@ -2,7 +2,10 @@
Tutorials
---------------------
Test.
To be completed.
user tutorials
========================
.. toctree::
:maxdepth: 1
......@@ -13,3 +16,14 @@ Test.
model
trainer
callback
extend tensorpack
=================
.. toctree::
:maxdepth: 1
extend/dataflow
extend/model
extend/trainer
extend/callback
......@@ -59,43 +59,3 @@ l = tf.multiply(l, 0.5)
l = func(l, *args, **kwargs)
l = FullyConnected('fc1', l, 10, nl=tf.identity)
```
## Implement a layer
Symbolic functions should be nothing new to you, and writing a simple symbolic function is nothing special in tensorpack.
But you can make a symbolic function become a "layer" by following some very simple rules, and then gain benefits from the framework.
Take a look at the [Convolutional Layer](../tensorpack/models/conv2d.py#L14) implementation for an example of how to define a layer:
```python
@layer_register()
def Conv2D(x, out_channel, kernel_shape,
padding='SAME', stride=1,
W_init=None, b_init=None,
nl=tf.nn.relu, split=1, use_bias=True):
```
Basically, a layer is a symbolic function with the following rules:
+ It is decorated by `@layer_register`.
+ The first argument is its "input". It must be a tensor or a list of tensors.
+ It returns either a tensor or a list of tensors as its "output".
By making a symbolic function a "layer", the following things will happen:
+ You will call the function with a scope argument, e.g. `Conv2D('conv0', x, 32, 3)`.
Everything happening in this function will be under the variable scope 'conv0'. You can register
the layer with `use_scope=False` to disable this feature.
+ Static shapes of input/output will be logged.
+ `argscope` will then work for all its arguments except the first one (input).
+ It will work with `LinearWrap`: you can use it if the output of a previous layer is the input of a next layer.
Take a look at the [Inception example](../examples/Inception/inception-bn.py#L36) to see how a complicated model can be described with these primitives.
There are also a number of (non-layer) symbolic functions in the `tfutils.symbolic_functions` module.
There isn't a rule about what kind of symbolic functions should be made a layer -- they're quite
similar anyway. But in general I define the following kinds of symbolic functions as layers:
+ Functions which contain variables. A variable scope is almost always helpful for such functions.
+ Functions which are commonly referred to as "layers", such as pooling. This makes a model
definition more straightforward.
......@@ -42,16 +42,3 @@ For example, [GAN trainer](../examples/GAN/GAN.py) minimizes
two cost functions alternately.
Some trainers take data from a TensorFlow reading pipeline instead of a Dataflow
([PTB example](../examples/PennTreebank)).
## Write a trainer
The existing trainers should be enough for single-cost optimization tasks. If you
want to do something inside the trainer, consider writing it as a callback, or
open an issue to see if there is a better solution than creating new trainers.
For certain tasks, you might need a new trainer.
The [GAN trainer](../examples/GAN/GAN.py) is one example of how to implement
new trainers.
More details to come.