Shashank Suhas / seminar-breakout / Commits

Commit 784e2b7b
authored Jul 31, 2017 by Yuxin Wu

update docs

parent 5d0d6d16
Showing 3 changed files with 45 additions and 43 deletions (+45 -43)

docs/tutorial/dataflow.md            +20 -17
docs/tutorial/efficient-dataflow.md  +1 -1
docs/tutorial/input-source.md        +24 -25
docs/tutorial/dataflow.md (view file @ 784e2b7b)

# DataFlow

-DataFlow is a library to easily build Python iterators for efficient data loading.
+### What is DataFlow
+
+DataFlow is a library to build Python iterators for efficient data loading.

A DataFlow has a `get_data()` generator method, which yields `datapoints`.
-A datapoint must be a **list** of Python objects which I called the `components` of a datapoint.
+A datapoint is a **list** of Python objects which is called the `components` of a datapoint.

For example, to train on the MNIST dataset, you can build a DataFlow with a `get_data()` method
-that yields datapoints of two elements (components):
+that yields datapoints (lists) of two components:
a numpy array of shape (64, 28, 28), and an array of shape (64,).
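As an illustration of the description above, here is a minimal sketch of a DataFlow whose datapoints match that MNIST example. The class name `FakeMNISTFlow` and the use of random arrays are assumptions for demonstration, not code from the tutorial:

```python
import numpy as np
from tensorpack.dataflow import DataFlow

class FakeMNISTFlow(DataFlow):
    """Illustrative DataFlow: each datapoint is a list [images, labels]."""
    def __init__(self, num_batches=100):
        self.num_batches = num_batches

    def get_data(self):
        rng = np.random.RandomState(0)
        for _ in range(self.num_batches):
            images = rng.rand(64, 28, 28).astype('float32')  # component 1: shape (64, 28, 28)
            labels = rng.randint(0, 10, size=(64,))           # component 2: shape (64,)
            yield [images, labels]                            # a datapoint is a list of components
```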
### Composition of DataFlow

One good thing about having a standard interface is the great code reusability it enables.
-There are a lot of existing modules in tensorpack, which you can use to compose
-complex DataFlow with a long pre-processing pipeline. A whole pipeline usually
+There are a lot of existing DataFlow utilities in tensorpack, which you can use to compose
+complex DataFlow with a long pre-processing pipeline. A common pipeline usually
would __read from disk (or other sources), apply augmentations, group into batches,
prefetch data__, etc. A simple example is as the following:

````python
# a DataFlow you implement to produce [tensor1, tensor2, ..] lists from whatever sources:
-df = MyDataFlow(shuffle=True)
+df = MyDataFlow(dir='/my/data', shuffle=True)
# resize the image component of each datapoint
df = AugmentImageComponent(df, [imgaug.Resize((225, 225))])
# group data into batches of size 128
df = BatchData(df, 128)
# start 3 processes to run the dataflow in parallel, and communicate with ZeroMQ
df = PrefetchDataZMQ(df, 3)
````
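As a rough usage sketch (assuming `df` is the composed DataFlow above; the shapes in the comment are illustrative and depend on your data), the result is just a Python iterator:

```python
df.reset_state()   # initialize resources (RNG, forked worker processes, ...) before iterating
for images, labels in df.get_data():
    # e.g. images: (128, 225, 225, 3) array, labels: (128,)
    print(images.shape, labels.shape)
    break
```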
-A more complicated example is the [ResNet training script](../examples/ResNet/imagenet-resnet.py)
+You can find more complicated DataFlow in the [ResNet training script](../examples/ResNet/imagenet-resnet.py)
with all the data preprocessing.
All these modules are written in Python,
so you can easily implement whatever operations/transformations you need,
without worrying about adding operators to TensorFlow.
Unless you are working with standard data types (image folders, LMDB, etc),
-you would usually want to write your own DataFlow.
+you would usually want to write the base DataFlow (`MyDataFlow` in the above example) for your data format.
See [another tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/extend/dataflow.html)
-for details on handling your own data format.
+for details on writing a DataFlow.

### Why DataFlow

1. It's easy: write everything in pure Python, and reuse existing utilities. On the contrary,
   writing data loaders in TF operators is painful.
2. It's fast (enough): see the [Input Pipeline tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/input-source.html)
   on how tensorpack handles data loading.

<!-- - TODO mention RL, distributed data, and zmq operator in the future. -->

Nevertheless, tensorpack supports data loading with native TF operators as well.

### Use DataFlow outside Tensorpack

-DataFlow is independent of both tensorpack and TensorFlow.
+DataFlow is __independent__ of both tensorpack and TensorFlow.
You can simply use it as a data processing pipeline and plug it into any other framework.
To use a DataFlow independently, you will need to call `reset_state()` first to initialize it,
...
docs/tutorial/efficient-dataflow.md (view file @ 784e2b7b)

@@ -33,7 +33,7 @@

```python
ds1 = BatchData(ds0, 256, use_list=True)
TestDataSpeed(ds1).start()
```

-Here `ds0` simply reads original images from the filesystem. It is implemented simply by:
+Here `ds0` reads original images from the filesystem. It is implemented simply by:
```python
for filename, label in filelist:
    yield [cv2.imread(filename), label]
```
...
docs/tutorial/input-source.md (view file @ 784e2b7b)

# Input Pipeline

-This tutorial covers some general basics of the possible methods to send data from external sources to TensorFlow graph,
+This tutorial covers some general basics of the possible methods to send data from external sources to a TensorFlow graph,
and how tensorpack supports these methods.
You don't have to read it because these are details under the tensorpack interface,
but knowing it could help you understand the efficiency and choose the best input pipeline for your task.

## Prepare Data in Parallel

-<!-- ![prefetch](input-source.png) -->
+![prefetch](...)

Common sense, no matter what framework you use:
...
@@ -19,9 +15,9 @@ Start to prepare the next (batch of) data while you're training!
The reasons are:
1. Data preparation often consumes non-trivial time (depending on the actual problem).
-2. Data preparation often uses completely different resources from training --
+2. Data preparation often uses completely different resources from training (see figure above) --
   doing them together doesn't slow you down. In fact you can further parallelize different stages in
-  the preparation, because they also use different resources (as shown in the figure).
+  the preparation, because they also use different resources.
3. Data preparation often doesn't depend on the result of the previous training step.
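A framework-agnostic sketch of the idea behind this section: one thread keeps preparing batches while the main loop trains on the current one. `load_and_augment` and `train_step` below are stand-ins, not tensorpack or TensorFlow APIs:

```python
import queue
import threading
import time

def load_and_augment(i):
    time.sleep(0.01)          # stand-in for disk I/O and augmentation
    return [i] * 64           # stand-in for a prepared batch

def train_step(batch):
    time.sleep(0.02)          # stand-in for the GPU compute of one step

batch_queue = queue.Queue(maxsize=8)   # a small buffer is enough to hide preparation latency

def producer(num_batches):
    for i in range(num_batches):
        batch_queue.put(load_and_augment(i))   # blocks when the buffer is full

threading.Thread(target=producer, args=(100,), daemon=True).start()

for _ in range(100):
    train_step(batch_queue.get())   # the next batch is usually already waiting
```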
Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
...
@@ -30,29 +26,33 @@ Assuming you have 5GB/s `memcpy` bandwidth, simply copying the data once would t
down your training by 10%. Think about how many more copies are made during your preprocessing.
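A hedged back-of-envelope check of that figure (the ~0.5 GB/s data rate is inferred from the 10% and 5 GB/s numbers above, not stated explicitly in this excerpt):

```python
# one extra memcpy of the training data, per second of training
data_rate = 0.5e9          # assumed bytes of input consumed per second of training
memcpy_bandwidth = 5e9     # 5 GB/s, from the text above
extra_copy_cost = data_rate / memcpy_bandwidth
print('{:.0%} slowdown per extra copy'.format(extra_copy_cost))   # -> 10%
```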
Failure to hide the data preparation latency is the major reason why people
-cannot see good GPU utilization. Always choose a framework that allows latency hiding.
+cannot see good GPU utilization. __Always choose a framework that allows latency hiding.__
+However most other TensorFlow wrappers are designed to be `feed_dict` based -- no latency hiding at all.
+This is the major reason why tensorpack is [faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6).
## Python or C++ ?
The above discussion is valid regardless of what you use to load/preprocess, Python code or TensorFlow operators (written in C++).
-The benefit of using TensorFlow ops is:
+The benefits of using TensorFlow ops are:
* Faster preprocessing.
-* No "Copy to TF" (i.e. `feed_dict`) stage.
-While Python is much easier to write, and has much more libraries to use.
+  * Potentially true, but not necessarily. With Python code you can call a variety of other fast libraries (e.g. lmdb), which
+    you have no access to in TF ops.
+  * Python may be just fast enough.
-Though C++ ops are potentially faster, they're usually __not necessary__.
-As long as data preparation runs faster than training, it makes no difference at all.
-And for most types of problems, up to the scale of multi-GPU ImageNet training,
-Python can offer enough speed if written properly (e.g. use `tensorpack.dataflow`).
-See the [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html) tutorial.
+    As long as data preparation runs faster than training, it makes no difference at all.
+    And for most types of problems, up to the scale of multi-GPU ImageNet training,
+    Python can offer enough speed if you use a fast library (e.g. `tensorpack.dataflow`).
+    See the [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html) tutorial.
-When you use Python to load/preprocess data, TF `QueueBase` can help hide the "Copy to TF" latency,
-and TF `StagingArea` can help hide the "Copy to GPU" latency.
-They are used by most examples in tensorpack,
-however most other TensorFlow wrappers are designed to be `feed_dict` based -- no latency hiding at all.
-This is the major reason why tensorpack is [faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6).
+* No "Copy to TF" (i.e. `feed_dict`) stage.
+  * True. But as mentioned above, the latency can usually be hidden.
+
+    In tensorpack, TF queues are used to hide the "Copy to TF" latency,
+    and TF `StagingArea` can help hide the "Copy to GPU" latency.
+    They are used by most examples in tensorpack.
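A rough sketch of those two latency-hiding stages with plain TF 1.x primitives, for orientation only; tensorpack wraps this internally, the placeholder names are made up, and the exact import location of `StagingArea` may differ between TF 1.x releases:

```python
import tensorflow as tf
from tensorflow.python.ops.data_flow_ops import StagingArea  # location may vary by TF version

image_ph = tf.placeholder(tf.float32, [None, 224, 224, 3])
label_ph = tf.placeholder(tf.int32, [None])

# "Copy to TF": a queue fed by a background Python thread,
# so the training step never blocks on feed_dict.
q = tf.FIFOQueue(capacity=50, dtypes=[tf.float32, tf.int32])
enqueue_op = q.enqueue([image_ph, label_ph])   # run repeatedly by a feeding thread
images, labels = q.dequeue()

# "Copy to GPU": stage the next batch onto the GPU while the current one is in use.
stage = StagingArea(dtypes=[tf.float32, tf.int32])
stage_op = stage.put([images, labels])         # run one step ahead of the train op
gpu_images, gpu_labels = stage.get()
```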
## InputSource
...
@@ -65,10 +65,9 @@ For example,
4. Come from some TF native reading pipeline.
5. Come from some ZMQ pipe, where the load/preprocessing may happen on a different machine.

-You can use `TrainConfig(data=)` option to use a customized `InputSource`.
-Usually you don't need this API, and only have to specify `TrainConfig(dataflow=)`, because
-tensorpack trainers automatically adds proper prefetching for you.
-In cases you want to use TF ops rather than DataFlow, you can use `TensorInput` as the `InputSource`
+When you set `TrainConfig(dataflow=)`, tensorpack trainers automatically add proper prefetching for you.
+You can also use the `TrainConfig(data=)` option to use a customized `InputSource`.
+In cases you want to use TF ops rather than a DataFlow, you can use `TensorInput` as the `InputSource`
(See the [PTB example](https://github.com/ppwwyyxx/tensorpack/tree/master/examples/PennTreebank)).
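A hedged sketch of the two configurations just described, assuming a DataFlow `df` and a ModelDesc subclass `MyModel` defined elsewhere; other `TrainConfig` arguments are omitted, and the `QueueInput` import path may differ by tensorpack version:

```python
from tensorpack import TrainConfig, QueueInput  # import path is an assumption

# the common case: hand the trainer a DataFlow and let it add prefetching
config = TrainConfig(dataflow=df, model=MyModel())

# or pick the InputSource explicitly, e.g. a queue-based one
config = TrainConfig(data=QueueInput(df), model=MyModel())
```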
## Figure out the Bottleneck
...