Commit ccef4d4f authored by Yuxin Wu

update docs

parent 796a4353
@@ -18,7 +18,7 @@ One good thing about having a standard interface is to be able to provide
the greatest code reusability.
There are a lot of existing DataFlow utilities in tensorpack, which you can use to compose
a complex DataFlow with a long data pipeline. A common pipeline usually
would __read from disk (or other sources), apply transformations, group into batches,
prefetch data__, etc. A simple example is the following:
````python
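# A minimal sketch of such a pipeline (MyDataFlow is the custom source reader
# discussed below; the augmentation, batch and prefetch settings are illustrative):
from tensorpack.dataflow import AugmentImageComponent, BatchData, PrefetchData, imgaug

df = MyDataFlow(dir='/my/data', shuffle=True)                # read from disk
df = AugmentImageComponent(df, [imgaug.Resize((225, 225))])  # apply transformations
df = BatchData(df, 128)                                      # group into batches
df = PrefetchData(df, 3)                                     # prefetch data
````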
@@ -35,16 +35,17 @@ You can find more complicated DataFlow in the [ResNet training script](../exampl
with all the data preprocessing.
Unless you are working with standard data types (image folders, LMDB, etc.),
you would usually want to write the source DataFlow (`MyDataFlow` in the above example) for your data format.
See [another tutorial](extend/dataflow.html)
for simple instructions on writing a DataFlow.
Once you have the source reader, all the [existing DataFlows](../modules/dataflow.html) are ready for you to complete
the rest of the data pipeline.
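For illustration, a minimal source DataFlow might look like the sketch below, assuming the `get_data()` interface of this era of the library; `load_my_index` and `read_image` are hypothetical helpers for your own format:

````python
from tensorpack.dataflow import DataFlow

class MyDataFlow(DataFlow):
    """Sketch of a source reader that yields [image, label] datapoints."""
    def __init__(self, dir, shuffle=True):
        self.dir = dir
        self.shuffle = shuffle

    def get_data(self):
        # iterate over your own index and produce one datapoint at a time
        for fname, label in load_my_index(self.dir, shuffle=self.shuffle):
            yield [read_image(fname), label]
````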
### Why DataFlow

1. It's easy: write everything in pure Python, and reuse existing utilities.
On the contrary, writing data loaders in TF operators is usually painful, and performance is hard to tune.
See more discussions in [Python Reader or TF Reader](input-source.html#python-reader-or-tf-reader).
2. It's fast: see [Efficient DataFlow](efficient-dataflow.html)
on how to build a fast DataFlow with parallelism (a minimal sketch follows this list).
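A minimal sketch of that parallelism, assuming the `PrefetchDataZMQ` utility (the process count is illustrative):

````python
from tensorpack.dataflow import PrefetchDataZMQ

# Fork 25 identical copies of the DataFlow in separate processes and
# collect their outputs through ZMQ pipes.
df = PrefetchDataZMQ(df, nr_proc=25)
````

Each worker runs its own copy of `df`, so this pattern fits readers that are already randomized or shuffled.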
If you're using DataFlow with tensorpack, also see the [Input Pipeline tutorial](input-source.html)
@@ -60,9 +60,18 @@ The benefits of using TensorFlow ops are:
and TF `StagingArea` can help hide the "Copy to GPU" latency.
They are used by most examples in tensorpack.
The benefit of using a Python reader is obvious: it's __much much easier__.
Reading data is a much more complicated and much less structured job than training a model.
You need to handle different data formats and corner cases in noisy data,
all of which require logical operations, conditional operations, loops, etc. These operations
are __naturally not suitable__ for a graph computation framework.
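As a concrete illustration of why such logic is natural in Python, a hypothetical reader that handles corrupt or inconsistent records might look like:

````python
import cv2  # one example of a library you are free to use in a Python reader

def read_records(file_list):
    """Hypothetical loader: plain loops and conditionals deal with noisy data."""
    for fname, label in file_list:
        img = cv2.imread(fname)
        if img is None:           # corner case: unreadable or corrupt file
            continue              # skipping it is trivial in Python
        if img.ndim == 2:         # corner case: grayscale instead of color
            img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
        yield [img, label]
````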
It only makes sense to use TF to read data if your data is originally very clean and well-formatted.
You may want to write a script to clean your data, but then you're almost writing a Python loader already!

Think about it: it's a waste of time to write a Python script to transform raw data into TFRecords,
and then a TF script to transform TFRecords into tensors.
The intermediate step (TFRecords) doesn't have to exist.
## InputSource