DataFlow is a library to build Python iterators for efficient data loading.
**Definition**: A DataFlow is an idiomatic Python container object that has a `__iter__()` generator method,
which yields `datapoints`, and optionally a `__len__()` method returning the size of the flow.
A datapoint is a **list** of Python objects which are called the `components` of a datapoint.
**Example**: to train on MNIST dataset, you may need a DataFlow with a `__iter__()` method
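A minimal sketch of this interface, using random stand-in data instead of real MNIST images (the class name and sizes here are illustrative, not part of tensorpack):

```python
import random

class FakeMNISTFlow:
    """A toy DataFlow-style iterator: yields datapoints, each a list of components."""

    def __init__(self, size=100):
        self._size = size

    def __len__(self):
        # optional: the size of the flow
        return self._size

    def __iter__(self):
        for _ in range(self._size):
            # datapoint = [image, label]; the "image" is a 28x28 nested list here
            image = [[random.random() for _ in range(28)] for _ in range(28)]
            label = random.randint(0, 9)
            yield [image, label]

df = FakeMNISTFlow(size=4)
for dp in df:
    assert len(dp) == 2  # each datapoint has two components
```

Any object with this shape, including a plain generator wrapped in a class, can serve as a DataFlow.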
You can find more complicated DataFlow in the [ImageNet training script](../examples/ImageNetModels/imagenet_utils.py)
with all the data preprocessing.
### Work with Your Data
Unless you are working with standard data types (image folders, LMDB, etc),
you would usually want to write the source DataFlow (`MyDataFlow` in the above example) for your data format.
See [another tutorial](extend/dataflow.html) for simple instructions on writing a DataFlow.
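The typical pattern is a small source DataFlow wrapped by reusable transformation flows. A standalone sketch of that composition, with illustrative class names standing in for tensorpack's built-in wrappers such as `BatchData`:

```python
class MyDataFlow:
    """Source flow: yields one datapoint (a list of components) per record."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __iter__(self):
        for text, label in self.records:
            yield [text, label]

class BatchFlow:
    """Wrapper flow: groups datapoints from another flow into batches."""

    def __init__(self, df, batch_size):
        self.df, self.batch_size = df, batch_size

    def __len__(self):
        return len(self.df) // self.batch_size

    def __iter__(self):
        batch = []
        for dp in self.df:
            batch.append(dp)
            if len(batch) == self.batch_size:
                # transpose: yield one list per component
                yield [list(col) for col in zip(*batch)]
                batch = []

df = MyDataFlow([("a", 0), ("b", 1), ("c", 0), ("d", 1)])
df = BatchFlow(df, batch_size=2)
```

Because every flow exposes the same `__iter__()` interface, wrappers like this nest arbitrarily deep, which is how the more complicated pipelines in the examples are built.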
Nevertheless, tensorpack supports data loading with native TF operators / TF datasets as well.
### Use DataFlow outside Tensorpack
Normally, tensorpack `InputSource` interface links DataFlow to the graph for training.
If you use DataFlow in other places such as your custom code, call `reset_state()` first to initialize it,
and then use the generator however you like.
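The usage pattern looks like the following sketch, where `ToyFlow` is a stand-in mimicking the DataFlow interface (a real tensorpack DataFlow would come from the library instead):

```python
class ToyFlow:
    """Stand-in mimicking the DataFlow interface, including reset_state()."""

    def __init__(self):
        self._initialized = False

    def reset_state(self):
        # real DataFlows use this hook to (re)initialize state,
        # e.g. reseeding RNGs after the process is forked
        self._initialized = True

    def __iter__(self):
        assert self._initialized, "call reset_state() before iterating"
        for i in range(3):
            yield [i]

df = ToyFlow()
df.reset_state()          # initialize first
datapoints = [dp for dp in df]  # then iterate like any Python iterable
```

Forgetting `reset_state()` is a common source of subtle bugs (e.g. identical random augmentations across worker processes), which is why it must be called before the first iteration.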