Shashank Suhas / seminar-breakout / Commits / 6c482180

Commit 6c482180, authored Jun 27, 2019 by Yuxin Wu
Parent: ad8aa0a5

update docs

Showing 4 changed files with 29 additions and 30 deletions (+29 -30)
docs/modules/dataflow.rst             +1  -1
docs/tutorial/dataflow.md             +14 -16
docs/tutorial/philosophy/dataflow.md  +14 -12
tensorpack/__init__.py                +0  -1
docs/modules/dataflow.rst (view file @ 6c482180)

 tensorpack.dataflow package
 ===========================

-Relevant tutorials: :doc:`../tutorial/dataflow`, :doc:`../tutorial/extend/input-source`.
+Relevant tutorials: :doc:`../tutorial/dataflow`, :doc:`../tutorial/philosophy/dataflow`.

 .. container:: custom-index

 ...
docs/tutorial/dataflow.md (view file @ 6c482180)

@@ -2,6 +2,7 @@
 # DataFlow

 DataFlow is a pure-Python library to create iterators for efficient data loading.
+It is originally part of tensorpack, and now also available as a [separate library](https://github.com/tensorpack/dataflow).

 ### What is DataFlow
@@ -13,12 +14,9 @@ A datapoint is a **list or dict** of Python objects, each of which are called th
 that yields datapoints (lists) of two components:
 a numpy array of shape (64, 28, 28), and an array of shape (64,).

-As you saw,
-DataFlow is independent of the training frameworks since it produces any python objects
+DataFlow is __independent of TensorFlow__ since it produces any python objects
 (usually numpy arrays).
-You can simply use DataFlow as a data processing pipeline and plug it into your own training code.
-And we plan to make it installable as a separate project.
+To `import tensorpack.dataflow`, you don't even have to install TensorFlow.
+You can simply use DataFlow as a data processing pipeline and plug it into any other frameworks.

 ### Load Raw Data
 We do not make any assumptions about your data format.
@@ -45,14 +43,16 @@ df = BatchData(df, 128)
 df = MultiProcessRunnerZMQ(df, 3)
 ````

-A list of built-in DataFlow to compose with can be found at [API docs](../modules/dataflow.html).
+A list of built-in DataFlow to use can be found at [API docs](../modules/dataflow.html).
 You can also find complicated real-life DataFlow pipelines in the
 [ImageNet training script](../examples/ImageNetModels/imagenet_utils.py)
 or other tensorpack examples.

 ### Parallelize the Pipeline
-DataFlow includes optimized parallel runner and parallel mapper.
-You can find them in the [API docs](../modules/dataflow.html) under the
+DataFlow includes carefully optimized parallel runners and parallel mappers:
+`Multi{Thread,Process}{Runner,MapData}`.
+Runners execute multiple clones of a dataflow in parallel.
+Mappers execute a mapping function in parallel on top of an existing dataflow.
+You can find details in the [API docs](../modules/dataflow.html) under the
 "parallel" and "parallel_map" section.
 The [Efficient DataFlow](efficient-dataflow.html) give a deeper dive
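To make the runner/mapper distinction in this hunk concrete, here is a minimal editorial sketch (not part of the commit) of the pipeline pattern the new text describes. The toy data and sizes are placeholders; `DataFromList`, `MapData`, `BatchData`, and `MultiProcessRunnerZMQ` are real tensorpack.dataflow classes, and the ZMQ runner assumes a POSIX platform:

```python
import numpy as np
from tensorpack.dataflow import BatchData, DataFromList, MapData, MultiProcessRunnerZMQ

# Placeholder raw data: 256 datapoints, each a [28x28 image, label] list.
data = [[np.random.rand(28, 28), np.random.randint(10)] for _ in range(256)]

df = DataFromList(data, shuffle=True)              # wrap raw data in a DataFlow
df = MapData(df, lambda dp: [dp[0] * 2.0, dp[1]])  # any per-datapoint transform
df = BatchData(df, 64)                             # batch into [(64, 28, 28), (64,)]
# Runner: execute 3 clones of the whole pipeline in parallel processes.
# Each clone is a full copy, so keep shuffling on to avoid identical streams.
df = MultiProcessRunnerZMQ(df, 3)

df.reset_state()                                   # initialize before iterating
for _, dp in zip(range(8), df):                    # take a few batches
    assert dp[0].shape == (64, 28, 28)
```

A parallel mapper (e.g. `MultiThreadMapData`) would instead parallelize only the mapping step on top of one underlying dataflow, which avoids the duplication concern of clones.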
@@ -61,8 +61,9 @@ on how to use them to optimize your data pipeline.
 ### Run the DataFlow

 When training with tensorpack, typically it is the `InputSource` interface that runs the DataFlow.
-However, DataFlow can be used without other tensorpack components.
-To run a DataFlow by yourself, call `reset_state()` first to initialize it,
+When using DataFlow alone without other tensorpack components,
+you need to call `reset_state()` first to initialize it,
 and then use the generator however you like:

 ```python
@@ -70,14 +71,11 @@ df = SomeDataFlow()

 df.reset_state()
 for dp in df:
-    # dp is now a list. do whatever
+    # dp is now a list/dict. do whatever with it
 ```

 ### Why DataFlow?

-It's easy and fast. For more discussions, see [Why DataFlow?](/tutorial/philosophy/dataflow.html)
-Nevertheless, using DataFlow is not required.
+It's **easy and fast**. For more discussions, see [Why DataFlow?](/tutorial/philosophy/dataflow.html)
+Nevertheless, using DataFlow is not required in tensorpack.
 Tensorpack supports data loading with native TF operators / TF datasets as well.
+Read the [API documentation](../../modules/dataflow.html) to see API details of DataFlow and a complete list of built-in DataFlow.
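As a companion to the `reset_state()` contract shown in this hunk, a custom DataFlow is just an iterator class. The following is an editorial sketch (the class and data are made up, not from the repository); RNGs and other process-local state belong in `reset_state()` so that parallel runners can re-initialize forked copies:

```python
import numpy as np
from tensorpack.dataflow import DataFlow

class RandomSquares(DataFlow):
    """A toy DataFlow yielding datapoints [x, x**2] for random ints x."""

    def reset_state(self):
        # Per-process initialization: parallel runners fork copies of the
        # dataflow, so the RNG is created here rather than in __init__.
        self.rng = np.random.RandomState()

    def __iter__(self):
        for _ in range(5):
            x = int(self.rng.randint(100))
            yield [x, x * x]    # a datapoint: a list of components

df = RandomSquares()
df.reset_state()    # required before iteration, as the tutorial says
for dp in df:
    print(dp)       # e.g. [42, 1764]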
docs/tutorial/philosophy/dataflow.md (view file @ 6c482180)

@@ -150,8 +150,11 @@ or when you need to filter your data on the fly.
 `torch.utils.data.DataLoader` is quite good, despite that it also makes some
 **bad assumptions on batching** and is not always efficient:

-1. It assumes you always do batch training, has a constant batch size, and
-   the batch grouping can be purely determined by indices.
+1. `torch.utils.data.DataLoader` assumes that:
+   1. You do batch training
+   1. You use a constant batch size
+   1. Indices are sufficient to determine the samples to batch together

    None of these are necessarily true.
 2. Its multiprocessing implementation is efficient on `torch.Tensor`,
@@ -161,21 +164,20 @@ or when you need to filter your data on the fly.
 On the other hand, DataFlow:

 1. Is a pure iterator, not necessarily has a length or can be indexed. This is more generic.
-2. Does not assume batches, and allow you to implement different batching logic easily.
+2. Parallelization and batching are disentangled concepts.
+   You do not need to use batches, and can implement different batching logic easily.
 3. Is optimized for generic data type and numpy arrays.

 ```eval_rst
 .. note::
-    **Why is an iterator more general than ``__getitem__``? **
+    Why is an iterator interface more generic than ``__getitem__``?

-    DataFlow's iterator interface can perfectly simulate the behavior of ``__getitem__`` interface like this:
+    DataFlow's iterator interface can perfectly simulate the behavior of indexing interface like this:

     .. code-block:: python

-        df = SomeIndexGenerator()
-        # A dataflow which produces indices, like [0], [1], [2], ...
-        # The indices can be either sequential, or more fancy, akin to torch.utils.data.Sampler.
-        df = MapData(df, lambda idx: dataset[idx[0]])
-        # Map the indices to datapoints by ``__getitem__``.
+        # A dataflow which produces indices, like [0], [1], [2], ...
+        # The indices can be either sequential, or more fancy, akin to `torch.utils.data.Sampler`.
+        df = SomeIndexGenerator()
+        # Map the indices to datapoints by ``__getitem__``.
+        df = MapData(df, lambda idx: dataset[idx[0]])
 ```
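A runnable rendering of the note's pseudocode (an editorial sketch: `dataset` is any indexable stand-in, and `DataFromList` substitutes for the hypothetical `SomeIndexGenerator`):

```python
from tensorpack.dataflow import DataFromList, MapData

dataset = ["a", "b", "c", "d"]   # anything supporting __getitem__
# A dataflow producing index datapoints [0], [1], ... in random order,
# playing the role of torch.utils.data.Sampler.
indices = DataFromList([[i] for i in range(len(dataset))], shuffle=True)
# Map each index datapoint to a real datapoint via __getitem__.
df = MapData(indices, lambda idx: [dataset[idx[0]]])

df.reset_state()
for dp in df:
    print(dp)                    # e.g. ['c']
```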
tensorpack/__init__.py (view file @ 6c482180)

@@ -8,7 +8,6 @@ from tensorpack.utils import *
 from tensorpack.dataflow import *

 # dataflow can be used alone without installing tensorflow
-# TODO maybe separate dataflow to a new project if it's good enough
 # https://github.com/celery/kombu/blob/7d13f9b95d0b50c94393b962e6def928511bfda6/kombu/__init__.py#L34-L36
 STATICA_HACK = True