update docs

215a4d6d · Yuxin Wu · a6936913 · 215a4d6d · 215a4d6d
Commit 215a4d6d authored Mar 16, 2018 by Yuxin Wu

Showing with 21 additions and 8 deletions

docs/tutorial/extend/dataflow.md +20 -8

tensorpack/dataflow/raw.py +1 -0

--- a/docs/tutorial/extend/dataflow.md
+++ b/docs/tutorial/extend/dataflow.md
@@ -9,6 +9,17 @@ which you can use if your data format is simple.
 In general, you probably need to write a source DataFlow to produce data for your task,
 and then compose it with existing modules (e.g. mapping, batching, prefetching, ...).
+The easiest way to create a DataFlow that loads custom data is to wrap a custom generator, e.g.:
+```python
+def my_data_loader():
+  while True:
+    # load data from somewhere
+    yield [my_array, my_label]
+dataflow = DataFromGenerator(my_data_loader)
+```
+To write a more complicated DataFlow, you need to inherit from the base `DataFlow` class.
 Usually, you just need to implement the `get_data()` method which yields a datapoint every time.
 ```python
 class MyDataFlow(DataFlow):
@@ -24,12 +35,12 @@ Optionally, you can implement the following two methods:
 + `size()`. Return the number of elements the generator can produce. Certain tensorpack features might use it.
 + `reset_state()`. It is guaranteed that the actual process which runs a DataFlow will invoke this method before using it.
-	So if this DataFlow needs to do something after a `fork()`, you should put it here.
+  So if this DataFlow needs to do something after a `fork()`, you should put it here.
-	The convention is that, `reset_state()` must be called once and usually only once for each DataFlow instance.
+  `reset_state()` must be called once and only once for each DataFlow instance.
-	A typical situation is when your DataFlow uses random number generator (RNG). Then you would need to reset the RNG here.
+  A typical example is when your DataFlow uses a random number generator (RNG). Then you would need to reset the RNG here.
-	Otherwise, child processes will have the same random seed. The `RNGDataFlow` base class does this for you.
+  Otherwise, child processes will have the same random seed. The `RNGDataFlow` base class does this for you.
-	You can subclass `RNGDataFlow` to access `self.rng` whose seed has been taken care of.
+  You can subclass `RNGDataFlow` to access `self.rng` whose seed has been taken care of.
 DataFlow implementations for several well-known datasets are provided in the
 [dataflow.dataset](../../modules/dataflow.dataset.html)
@@ -37,15 +48,16 @@ module, you can take them as a reference.
 #### More Data Processing
-You can put any data processing you need in the source DataFlow, or write a new DataFlow for data
+You can put any data processing you need in the source DataFlow you write, or you can write a new DataFlow for data
 processing on top of the source DataFlow, e.g.:
 ```python
 class ProcessingDataFlow(DataFlow):
  def __init__(self, ds):
-	  self.ds = ds
+    self.ds = ds
  def get_data(self):
    for datapoint in self.ds.get_data():
      # do something
-			yield new_datapoint
+      yield new_datapoint
 ```
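
Reviewer note: the pattern described in the tutorial hunks above (a source DataFlow that seeds its RNG in `reset_state()`, with a processing DataFlow layered on top) can be sketched without importing tensorpack. Everything here is an assumption made for illustration: the `DataFlow` class is a hypothetical stand-in, not tensorpack's real base class, `RandomPairs` is an invented name, and the seeding scheme only mimics what the docs say `RNGDataFlow` takes care of.

```python
import os
import random


class DataFlow(object):
    """Hypothetical stand-in for the base class, so this sketch runs on its own."""

    def get_data(self):
        raise NotImplementedError

    def reset_state(self):
        pass


class RandomPairs(DataFlow):
    """Source DataFlow (invented for this sketch): yields [value, label] datapoints."""

    def __init__(self, num):
        self.num = num
        self.rng = None  # created in reset_state(), i.e. after any fork()

    def reset_state(self):
        # Assumption: mix the process id with OS entropy so that forked
        # workers do not end up sharing one random stream.
        seed = os.getpid() ^ int.from_bytes(os.urandom(4), "little")
        self.rng = random.Random(seed)

    def size(self):
        return self.num

    def get_data(self):
        for _ in range(self.num):
            value = self.rng.random()
            yield [value, int(value > 0.5)]


class ProcessingDataFlow(DataFlow):
    """Processing on top of a source DataFlow, as in the tutorial snippet."""

    def __init__(self, ds):
        self.ds = ds

    def reset_state(self):
        # Forwarding reset_state() is a choice made in this sketch.
        self.ds.reset_state()

    def get_data(self):
        for value, label in self.ds.get_data():
            # "do something": here, scale the value and keep the label.
            yield [value * 2.0, label]


ds = ProcessingDataFlow(RandomPairs(5))
ds.reset_state()  # must happen before the first get_data()
datapoints = list(ds.get_data())
```

In this sketch, skipping `ds.reset_state()` would leave `self.rng` as `None` and `get_data()` would fail, which mirrors the rule in the docs that the method is invoked before the DataFlow is used.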
--- a/tensorpack/dataflow/raw.py
+++ b/tensorpack/dataflow/raw.py
@@ -104,6 +104,7 @@ class DataFromGenerator(DataFlow):
        """
        Args:
            gen: iterable, or a callable that returns an iterable
+            size: deprecated
        """
        if not callable(gen):
            self._gen = lambda: gen
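
Reviewer note: the `callable(gen)` check in the hunk above matters for the new tutorial example, which passes the function `my_data_loader` rather than the generator object `my_data_loader()`. Below is a minimal sketch of that difference. `GeneratorSource` is a hypothetical stand-in written for this note, not the real `DataFromGenerator`; the `else` branch and the `get_data()` body are assumptions, since the diff only shows the first two lines of the check.

```python
class GeneratorSource(object):
    """Hypothetical stand-in that mirrors the callable-or-iterable check in the diff."""

    def __init__(self, gen):
        if not callable(gen):
            # An iterable was passed: wrap it so self._gen() always works.
            self._gen = lambda: gen
        else:
            # Assumed branch: keep the callable and call it on every pass.
            self._gen = gen

    def get_data(self):
        for dp in self._gen():
            yield dp


def my_data_loader():
    # A finite loader keeps the example short; the tutorial version loops forever.
    for i in range(3):
        yield [i, i % 2]


# Passing the function: each get_data() call starts a fresh generator.
from_callable = GeneratorSource(my_data_loader)
first_pass = list(from_callable.get_data())
second_pass = list(from_callable.get_data())

# Passing a generator object: it can only be consumed once.
from_object = GeneratorSource(my_data_loader())
only_pass = list(from_object.get_data())
exhausted = list(from_object.get_data())
```

With a generator object the second pass yields nothing, which is plain Python generator behavior and a reason to prefer passing the callable when the data must be iterated more than once.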