Shashank Suhas / seminar-breakout / Commits

Commit 9b710110, authored May 19, 2017 by Yuxin Wu
Parent: ada058f3

    docs & tfrecord dump bar

4 changed files with 26 additions and 5 deletions (+26 -5)
docs/tutorial/dataflow.md            +10 -1
docs/tutorial/efficient-dataflow.md   +1 -1
tensorpack/dataflow/dftools.py        +8 -2
tensorpack/dataflow/format.py         +7 -1
docs/tutorial/dataflow.md

 # DataFlow
-DataFlow is a library to help you build Python iterators to load data.
+DataFlow is a library to easily build Python iterators for efficient data loading.

 A DataFlow has a `get_data()` generator method,
 which yields `datapoints`.
...
@@ -61,3 +61,12 @@ generator = df.get_data()
 for dp in generator:
     # dp is now a list. do whatever
 ```
+
+### Efficiency
+DataFlow is purely Python -- a convenient and slow language (w.r.t. C++). But faster data loading doesn't always mean faster
+training: we only need data to be __fast enough__.
+
+DataFlow is fast enough for problems up to the scale of multi-GPU ImageNet training.
+See [efficient dataflow tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html)
+for details.
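The tutorial text describes the DataFlow protocol: an object whose `get_data()` method is a generator yielding "datapoints", each a list of components. A minimal self-contained sketch of that idea (this toy class mirrors the tutorial's naming but is not tensorpack's actual base class):

```python
# A minimal DataFlow-style object: get_data() is a generator
# yielding datapoints, where each datapoint is a list of components.
class FakeData:
    """Yields `num` datapoints, each the list [index, index squared]."""
    def __init__(self, num):
        self._num = num

    def size(self):
        # Total number of datapoints per epoch.
        return self._num

    def get_data(self):
        for i in range(self._num):
            yield [i, i * i]

df = FakeData(3)
points = list(df.get_data())
print(points)  # [[0, 0], [1, 1], [2, 4]]
```

Consuming it with `for dp in df.get_data():` matches the loop shown in the diff above.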
docs/tutorial/efficient-dataflow.md

...
@@ -203,7 +203,7 @@ So DataFlow will not be a serious bottleneck if configured properly.

 ## More Efficient DataFlow
-To work with larger datasets (or smaller networks, or more GPUs) you could be severely bounded by CPU or disk speed of a single machine.
+To work with larger datasets (or smaller networks, or more/better GPUs) you could be severely bounded by CPU or disk speed of a single machine.
 One way is to optimize the preprocessing routine (e.g. write something in C++ or use TF reading operators).
 Another way to scale is to run DataFlow in a distributed fashion and collect them on the
 training machine. E.g.:
...
tensorpack/dataflow/dftools.py

...
@@ -140,8 +140,14 @@ def dump_dataflow_to_tfrecord(df, path):
     """
     df.reset_state()
     with tf.python_io.TFRecordWriter(path) as writer:
-        for dp in df.get_data():
-            writer.write(dumps(dp))
+        try:
+            sz = df.size()
+        except NotImplementedError:
+            sz = 0
+        with get_tqdm(total=sz) as pbar:
+            for dp in df.get_data():
+                writer.write(dumps(dp))
+                pbar.update()

 from ..utils.develop import create_dummy_func  # noqa
...
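This hunk adds the "tfrecord dump bar" from the commit message: before writing, it probes the DataFlow for its size and falls back to `total=0` when `size()` raises `NotImplementedError`, so the progress bar still works for flows of unknown length. A sketch of that control flow with stand-ins (the real `get_tqdm`, `dumps`, and `TFRecordWriter` come from tensorpack/TensorFlow and are not reproduced here):

```python
# A flow whose length is unknown: size() raises NotImplementedError,
# as tensorpack DataFlows may do.
class UnknownSizeFlow:
    def size(self):
        raise NotImplementedError
    def get_data(self):
        for i in range(4):  # kept finite for the demo
            yield [i]

def probe_size(df):
    # Same pattern as the diff: ask for size, fall back to 0
    # (an "unknown total") if the flow cannot report one.
    try:
        return df.size()
    except NotImplementedError:
        return 0

df = UnknownSizeFlow()
total = probe_size(df)
written = []
for dp in df.get_data():
    written.append(repr(dp))  # stand-in for writer.write(dumps(dp))
print(total, len(written))  # 0 4
```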
tensorpack/dataflow/format.py

...
@@ -244,8 +244,14 @@ class TFRecordData(DataFlow):
     This class works with :func:`dftools.dump_dataflow_to_tfrecord`.
     """
     def __init__(self, path, size=None):
+        """
+        Args:
+            path (str): path to the tfrecord file
+            size (int): total number of records, because this metadata is not
+                stored in the tfrecord file.
+        """
         self._gen = tf.python_io.tf_record_iterator(path)
-        self._size = size
+        self._size = int(size)

     def size(self):
         if self._size:
...
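One observable consequence of this hunk: `__init__` now calls `int(size)`, yet the parameter still defaults to `None`, and `int(None)` raises `TypeError`. So under this revision a caller must pass an explicit size. A minimal stand-in (no TF record reader) demonstrating both paths:

```python
# Mimics only the size-handling logic of the changed __init__;
# RecordMeta is a hypothetical stand-in, not the real TFRecordData.
class RecordMeta:
    def __init__(self, size=None):
        self._size = int(size)  # raises TypeError if size is None

    def size(self):
        if self._size:
            return self._size
        raise NotImplementedError()

assert RecordMeta(size=10).size() == 10

try:
    RecordMeta()  # size left as the default None
    none_accepted = True
except TypeError:
    none_accepted = False
print(none_accepted)  # False: int(None) raises TypeError
```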