Commit 215a4d6d authored by Yuxin Wu

update docs

parent a6936913
@@ -9,6 +9,17 @@ which you can use if your data format is simple.
In general, you probably need to write a source DataFlow to produce data for your task,
and then compose it with existing modules (e.g. mapping, batching, prefetching, ...).
The easiest way to create a DataFlow that loads custom data is to wrap a custom generator, e.g.:
```python
def my_data_loader():
    while True:
        # load data from somewhere
        yield [my_array, my_label]

dataflow = DataFromGenerator(my_data_loader)
```
To write a more complicated DataFlow, you need to inherit from the base `DataFlow` class.
Usually, you just need to implement the `get_data()` method which yields a datapoint every time.
```python
class MyDataFlow(DataFlow):
@@ -24,12 +35,12 @@ Optionally, you can implement the following two methods:
+ `size()`. Return the number of elements the generator can produce. Certain tensorpack features might use it.
+ `reset_state()`. It is guaranteed that the actual process which runs a DataFlow will invoke this method before using it.
  So if this DataFlow needs to do something after a `fork()`, you should put it here.
  `reset_state()` must be called once and only once for each DataFlow instance.
  A typical example is when your DataFlow uses a random number generator (RNG). Then you would need to reset the RNG here.
  Otherwise, child processes will have the same random seed. The `RNGDataFlow` base class does this for you.
  You can subclass `RNGDataFlow` to access `self.rng` whose seed has been taken care of.
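The RNG point can be illustrated with a minimal, self-contained sketch. It does not use tensorpack: the `DataFlow` base class here is a stand-in, `RandomPairs` is a hypothetical example class, and the seed formula is an assumption chosen for illustration, not what `RNGDataFlow` actually does.

```python
import os
import random
import time


class DataFlow:
    """Minimal stand-in for the DataFlow base class (assumption, not tensorpack's)."""

    def get_data(self):
        raise NotImplementedError

    def reset_state(self):
        pass


class RandomPairs(DataFlow):
    """Hypothetical DataFlow that yields [x, y] datapoints of random floats."""

    def __init__(self, num):
        self.num = num
        self.rng = None  # created in reset_state(), i.e. after any fork()

    def size(self):
        # Number of datapoints one pass over get_data() produces.
        return self.num

    def reset_state(self):
        # Mix the process id into the seed so that forked children
        # do not all draw the same random sequence.
        seed = (os.getpid() << 32) ^ time.time_ns()
        self.rng = random.Random(seed)

    def get_data(self):
        for _ in range(self.num):
            yield [self.rng.random(), self.rng.random()]


df = RandomPairs(3)
df.reset_state()  # called once, in the process that consumes the data
points = list(df.get_data())
```

Creating the RNG in `reset_state()` rather than in `__init__()` matters because `__init__()` typically runs in the parent process, before any `fork()`.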
DataFlow implementations for several well-known datasets are provided in the
[dataflow.dataset](../../modules/dataflow.dataset.html)
@@ -37,15 +48,16 @@ module, you can take them as a reference.
#### More Data Processing
You can put any data processing you need in the source DataFlow you write, or you can write a new DataFlow for data
processing on top of the source DataFlow, e.g.:
```python
class ProcessingDataFlow(DataFlow):
    def __init__(self, ds):
        self.ds = ds

    def get_data(self):
        for datapoint in self.ds.get_data():
            # do something
            yield new_datapoint
```
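As a rough end-to-end sketch of this pattern, the following runs without tensorpack installed. The `DataFlow` and `DataFromGenerator` classes below are simplified stand-ins written for this example (assumptions, not the library's code), and `ScaleFirstComponent` and its `factor` argument are hypothetical names.

```python
class DataFlow:
    """Minimal stand-in for the DataFlow base class (assumption)."""

    def get_data(self):
        raise NotImplementedError

    def reset_state(self):
        pass


class DataFromGenerator(DataFlow):
    """Simplified stand-in: accepts an iterable, or a callable returning one."""

    def __init__(self, gen):
        self._gen = gen if callable(gen) else (lambda: gen)

    def get_data(self):
        for dp in self._gen():
            yield dp


class ScaleFirstComponent(DataFlow):
    """Hypothetical processing DataFlow layered on top of a source DataFlow."""

    def __init__(self, ds, factor):
        self.ds = ds
        self.factor = factor

    def reset_state(self):
        # Forward the call so the wrapped DataFlow is also reset.
        self.ds.reset_state()

    def get_data(self):
        for dp in self.ds.get_data():
            # Transform the first component, pass the rest through unchanged.
            yield [dp[0] * self.factor] + list(dp[1:])


def my_data_loader():
    # Finite generator standing in for "load data from somewhere".
    for i in range(3):
        yield [i, "label%d" % i]


ds = ScaleFirstComponent(DataFromGenerator(my_data_loader), factor=10)
ds.reset_state()
result = list(ds.get_data())
```

Forwarding `reset_state()` to the wrapped DataFlow is a design choice of this sketch, made so that a single call on the outermost DataFlow reaches every layer.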
@@ -104,6 +104,7 @@ class DataFromGenerator(DataFlow):
        """
        Args:
            gen: iterable, or a callable that returns an iterable
            size: deprecated
        """
        if not callable(gen):
            self._gen = lambda: gen
......