Commit 9b710110 authored by Yuxin Wu

docs & tfrecord dump bar

parent ada058f3
# DataFlow
DataFlow is a library to easily build Python iterators for efficient data loading.
A DataFlow has a `get_data()` generator method,
which yields `datapoints`.
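
For illustration, a minimal sketch of a custom DataFlow (the class name and the yielded values here are illustrative, not part of this commit):

```python
from tensorpack.dataflow import DataFlow

class MyDataFlow(DataFlow):
    """A toy DataFlow: yields 100 datapoints, each a list of two components."""
    def get_data(self):
        for k in range(100):
            yield [k, k * k]

    def size(self):  # optional; the base class raises NotImplementedError
        return 100
```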
@@ -61,3 +61,12 @@ generator = df.get_data()
for dp in generator:
    # dp is now a list. do whatever
```
### Efficiency
DataFlow is pure Python -- a convenient but slow language (compared to C++). But faster data loading doesn't always mean faster
training: the data only needs to be __fast enough__.
DataFlow is fast enough for problems up to the scale of multi-GPU ImageNet training.
See [efficient dataflow tutorial](http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html)
for details.
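
As a rough sketch of what a "fast enough" configuration can look like (following the linked tutorial; `MyDataFlow` is the toy class from the sketch above, and `PrefetchDataZMQ`/`BatchData` are tensorpack's dataflow wrappers):

```python
from tensorpack.dataflow import BatchData, PrefetchDataZMQ

df = MyDataFlow()                    # placeholder for an expensive per-sample pipeline
df = PrefetchDataZMQ(df, nr_proc=4)  # run 4 copies of the pipeline in separate processes
df = BatchData(df, 256)              # then group datapoints into batches of 256
```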
@@ -203,7 +203,7 @@ So DataFlow will not be a serious bottleneck if configured properly.
## More Efficient DataFlow
To work with larger datasets (or smaller networks, or more/better GPUs) you could be severely bounded by the CPU or disk speed of a single machine.
One way is to optimize the preprocessing routine (e.g. write something in C++ or use TF reading operators).
Another way to scale is to run DataFlow in a distributed fashion and collect the datapoints on the
training machine. E.g.:
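
A sketch of that setup, assuming tensorpack's ZMQ-based remote dataflow utilities (`send_dataflow_zmq` / `RemoteDataZMQ`; the addresses are placeholders):

```python
# On each worker machine: run the dataflow and push datapoints to the trainer.
from tensorpack.dataflow import send_dataflow_zmq
send_dataflow_zmq(df, 'tcp://trainer-host:8877')  # df is the worker's DataFlow

# On the training machine: a DataFlow that receives and yields them.
from tensorpack.dataflow import RemoteDataZMQ
df = RemoteDataZMQ('tcp://0.0.0.0:8877')
```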
...
@@ -140,8 +140,14 @@ def dump_dataflow_to_tfrecord(df, path):
    """
    df.reset_state()
    with tf.python_io.TFRecordWriter(path) as writer:
        try:
            sz = df.size()  # not every DataFlow implements size()
        except NotImplementedError:
            sz = 0
        with get_tqdm(total=sz) as pbar:
            for dp in df.get_data():
                writer.write(dumps(dp))
                pbar.update()
from ..utils.develop import create_dummy_func  # noqa
...
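
For context, a minimal usage sketch of the function above (the output path is a placeholder):

```python
from tensorpack.dataflow import dftools

# Serialize every datapoint of `df` into one tfrecord file,
# with the new tqdm progress bar shown while dumping.
dftools.dump_dataflow_to_tfrecord(df, '/tmp/mydata.tfrecord')
```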
@@ -244,8 +244,14 @@ class TFRecordData(DataFlow):
    This class works with :func:`dftools.dump_dataflow_to_tfrecord`.
    """
    def __init__(self, path, size=None):
        """
        Args:
            path (str): path to the tfrecord file
            size (int): total number of records, because this metadata is not
                stored in the tfrecord file.
        """
        self._gen = tf.python_io.tf_record_iterator(path)
        self._size = int(size) if size is not None else None  # size defaults to None

    def size(self):
        if self._size:
...
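
And a sketch of reading the file back with the class above (the path and record count are placeholders; `size` must be passed manually, as the new docstring notes):

```python
from tensorpack.dataflow import TFRecordData

df = TFRecordData('/tmp/mydata.tfrecord', size=50000)
df.reset_state()
for dp in df.get_data():
    pass  # dp is the deserialized datapoint list
```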