Commit 03c16776 authored Aug 08, 2017 by Yuxin Wu
refine the dataflow/input doc
parent 42322257
Showing 4 changed files with 102 additions and 65 deletions:
docs/tutorial/dataflow.md: +2 -2
docs/tutorial/efficient-dataflow.md: +75 -49
docs/tutorial/input-source.md: +22 -12
tensorpack/dataflow/prefetch.py: +3 -2
docs/tutorial/dataflow.md
...
...
@@ -9,7 +9,7 @@
 A DataFlow has a `get_data()` generator method, which yields `datapoints`.
 A datapoint is a **list** of Python objects which is called the `components` of a datapoint.
-For example, to train on MNIST dataset, you can build a DataFlow with a `get_data()` method
+For example, to train on MNIST dataset, you can write a DataFlow with a `get_data()` method
 that yields datapoints (lists) of two components:
 a numpy array of shape (64, 28, 28), and an array of shape (64,).
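To make the interface concrete, here is a minimal sketch of a DataFlow matching the description in this hunk. Only `DataFlow` and `get_data()` come from the tutorial itself; the class name and the random data are hypothetical, for illustration only:

```python
import numpy as np
from tensorpack.dataflow import DataFlow

class FakeMNISTBatches(DataFlow):
    """Yields datapoints of two components: a (64, 28, 28) image array and a (64,) label array."""
    def get_data(self):
        for _ in range(100):
            images = np.random.rand(64, 28, 28).astype('float32')
            labels = np.random.randint(0, 10, size=(64,))
            yield [images, labels]
```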
...
...
@@ -28,7 +28,7 @@ df = MyDataFlow(dir='/my/data', shuffle=True)
 df = AugmentImageComponent(df, [imgaug.Resize((225, 225))])
 # group data into batches of size 128
 df = BatchData(df, 128)
-# start 3 processes to run the dataflow in parallel, and communicate with ZeroMQ
+# start 3 processes to run the dataflow in parallel
 df = PrefetchDataZMQ(df, 3)
 ````
 You can find more complicated DataFlow in the [ResNet training script](../examples/ResNet/imagenet-resnet.py)
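As a usage note (not part of the diff): once composed, the dataflow above can be iterated directly; calling `reset_state()` once before iteration is the usual tensorpack convention so that wrappers such as `PrefetchDataZMQ` can initialize themselves. A hypothetical consumption of the `df` built above:

```python
df.reset_state()                      # let each wrapper initialize (processes, RNG, ...)
for images, labels in df.get_data():  # each datapoint is [image_batch, label_batch]
    pass                              # feed the batch to a training step here
```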
...
...
docs/tutorial/efficient-dataflow.md
(This diff is collapsed and not shown here.)
docs/tutorial/input-source.md
 # Input Pipeline
-This tutorial covers some general basics of the possible methods to send data from external sources to a TensorFlow graph,
+This tutorial contains some general discussions on the topic of "how to read data efficiently to work with TensorFlow",
 and how tensorpack support these methods.
 You don't have to read it because these are details under the tensorpack interface,
 but knowing it could help understand the efficiency and choose the best input pipeline for your task.
...
...
@@ -11,13 +12,15 @@ but knowing it could help understand the efficiency and choose the best input pi
 [figure]
 A common sense no matter what framework you use:
-Start to prepare the next (batch of) data while you're training!
+<center>
+Prepare data in parallel with the training!
+</center>
 The reasons are:
 1. Data preparation often consumes non-trivial time (depend on the actual problem).
 2. Data preparation often uses completely different resources from training (see figure above) --
    doing them together doesn't slow you down. In fact you can further parallelize different stages in
-   the preparation, because they also use different resources.
+   the preparation since they also use different resources.
 3. Data preparation often doesn't depend on the result of the previous training step.
 Let's do some simple math: according to [tensorflow/benchmarks](https://www.tensorflow.org/performance/benchmarks),
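The point of this hunk, overlapping data preparation with training, can be illustrated with a toy sketch in plain Python. This is only an illustration of the idea, not tensorpack code; `prepare_next_batch` and `train_step` are stand-ins:

```python
import queue
import threading
import time

def prepare_next_batch():
    time.sleep(0.05)            # stands in for reading + augmenting data (CPU / disk)
    return [0] * 128

def train_step(batch):
    time.sleep(0.10)            # stands in for the training computation (GPU)

def producer(q):
    while True:
        q.put(prepare_next_batch())

q = queue.Queue(maxsize=10)
threading.Thread(target=producer, args=(q,), daemon=True).start()

for _ in range(100):
    train_step(q.get())         # the next batch is usually already waiting: its latency is hidden
```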
...
...
@@ -27,24 +30,27 @@ down your training by 10%. Think about how many more copies are made during your
 Failure to hide the data preparation latency is the major reason why people
 cannot see good GPU utilization. __Always choose a framework that allows latency hiding.__
-However most other TensorFlow wrappers are designed to be `feed_dict` based -- no latency hiding at all.
+However most other TensorFlow wrappers are designed to be `feed_dict` based.
 This is the major reason why tensorpack is [faster](https://gist.github.com/ppwwyyxx/8d95da79f8d97036a7d67c2416c851b6).
-## Python or C++ ?
+## Python Reader or TF Reader ?
-The above discussion is valid regardless of what you use to load/preprocess, Python code or TensorFlow operators (written in C++).
+The above discussion is valid regardless of what you use to load/preprocess data,
+either Python code or TensorFlow operators (written in C++).
 The benefits of using TensorFlow ops are:
-* Faster preprocessing.
+* Faster read/preprocessing.
   * Potentially true, but not necessarily. With Python code you can call a variety of other fast libraries (e.g. lmdb), which
-    you have no access to in TF ops.
+    you have no access to in TF ops. For example, LMDB could be faster than TFRecords.
 * Python may be just fast enough.
-  As long as data preparation runs faster than training, it makes no difference at all.
-  And for most types of problems, up to the scale of multi-GPU ImageNet training,
+  As long as data preparation runs faster than training, and the latency of all four blocks in the
+  above figure is hidden, it makes no difference at all.
+  For most types of problems, up to the scale of multi-GPU ImageNet training,
   Python can offer enough speed if you use a fast library (e.g. `tensorpack.dataflow`).
-  See the [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html) tutorial.
+  See the [Efficient DataFlow](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html) tutorial
+  on how to build a fast Python reader with DataFlow.
 * No "Copy to TF" (i.e. `feed_dict`) stage.
...
...
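As an aside on the "fast libraries (e.g. lmdb)" point in the hunk above, reading an LMDB database from Python takes only a few lines with the `lmdb` package. The path and the pickle serialization below are hypothetical:

```python
import pickle
import lmdb

# Hypothetical database; assumes each value was pickled when the database was built.
with lmdb.open('/path/to/dataset.lmdb', readonly=True) as env:
    with env.begin() as txn:
        for key, value in txn.cursor():
            datapoint = pickle.loads(value)
```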
@@ -54,6 +60,10 @@ The benefits of using TensorFlow ops are:
   and TF `StagingArea` can help hide the "Copy to GPU" latency.
   They are used by most examples in tensorpack.
+The benefits of using Python reader is obvious:
+it's much much easier to write Python to read different data format,
+handle corner cases in noisy data, preprocess, etc.
 ## InputSource
 `InputSource` is an abstract interface in tensorpack, to describe where the input come from and how they enter the graph.
...
...
@@ -67,7 +77,7 @@ For example,
 When you set `TrainConfig(dataflow=)`, tensorpack trainers automatically adds proper prefetching for you.
 You can also use `TrainConfig(data=)` option to use a customized `InputSource`.
-In cases you want to use TF ops rather than a DataFlow, you can use `TensorInput` as the `InputSource`
+In case you want to use TF ops rather than a DataFlow, you can use `TensorInput` as the `InputSource`
 (See the [PTB example](https://github.com/ppwwyyxx/tensorpack/tree/master/examples/PennTreebank)).
 ## Figure out the Bottleneck
...
...
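A sketch of the two options mentioned in this hunk. This is hypothetical: the import paths may differ between tensorpack versions, `MyModel` and `df` are placeholders defined elsewhere, and `QueueInput` is just one example of a customized `InputSource`:

```python
from tensorpack import TrainConfig, QueueInput

# Option 1: hand the trainer a DataFlow; it adds proper prefetching by itself.
config = TrainConfig(dataflow=df, model=MyModel())

# Option 2: wrap the DataFlow in an explicit InputSource and pass it via `data=`.
config = TrainConfig(data=QueueInput(df), model=MyModel())
```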
tensorpack/dataflow/prefetch.py
...
...
@@ -246,10 +246,11 @@ class ThreadedMapData(ProxyDataFlow):
     Note:
         1. There is tiny communication overhead with threads, but you
-           should avoid starting many threads in your main process to avoid GIL.
+           should avoid starting many threads in your main process to reduce GIL contention.
            The threads will only start in the process which calls :meth:`reset_state()`.
-           Therefore you can use ``PrefetchDataZMQ(ThreadedMapData(...), 1)`` to avoid GIL.
+           Therefore you can use ``PrefetchDataZMQ(ThreadedMapData(...), 1)``
+           to reduce GIL contention.
         2. Threads run in parallel and can take different time to run the
            mapping function. Therefore the order of datapoints won't be
...
...
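A sketch of the pattern recommended in this note. Everything here is hypothetical: `MyDataFlow` is the placeholder from the dataflow tutorial above, `decode` stands in for an expensive per-datapoint function, and the keyword names of `ThreadedMapData` may differ slightly between versions:

```python
from tensorpack.dataflow import PrefetchDataZMQ, ThreadedMapData

def decode(dp):
    # stands in for expensive per-datapoint work, e.g. JPEG decoding + augmentation
    return dp

df = MyDataFlow(dir='/my/data', shuffle=True)
df = ThreadedMapData(df, nr_thread=25, map_func=decode)
# Fork once: all 25 threads live in the child process, keeping the main process free of them.
df = PrefetchDataZMQ(df, 1)
```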