Commit e0391e29 authored Jun 23, 2019 by Yuxin Wu
update docs
parent 22f410e2
Showing 1 changed file with 23 additions and 19 deletions.

docs/tutorial/philosophy/dataflow.md (+23, -19)
```diff
@@ -49,23 +49,27 @@ We think you usually do not, at least not after you try DataFlow, because they are
 to this new format. Then you read data from this format to training workers.
 It's a waste of your effort: the intermediate format does not have to exist.
-1. **Still Not Easy**: There are cases when having an intermediate format is useful
-for performance reasons.
-For example, to apply some one-time expensive preprocessing to your dataset, or
-merge small files to large files to reduce disk burden.
-However, those binary data formats are not necessarily good for the cases.
 1. **Not Easy**: Even when you do need to use an intermediate format that's different from your
 original data format
 (for performance reasons, for example), there are many formats you can choose from.
-Why use a single dedicated binary format when you could use something else?
+Why use a special binary format when you could use something else?
 A different format may bring you:
 * Simpler code for data loading.
 * Easier visualization.
 * Interoperability with other libraries.
 * More functionalities.
-After all, why merging all the images into a binary file on the disk,
-when you know that saving all the images separately is fast enough for your task?
 Different formats have their strength and weakness in the above aspects.
 Forcing a single binary format on users is certainly not ideal.
 We should let users make the choice.
+1. **Not Necessarily Fast**:
+There are cases when having an intermediate format is useful for performance reasons.
+For example, to apply some one-time expensive preprocessing to your dataset.
+But other formats are probably equally fast.
+Formats like TFRecords and RecordIO are just as fast as your disk, and of course,
+as fast as other libraries.
 Decades of engineering in dataset systems have provided
```
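The "let users make the choice" point in this hunk maps directly onto tensorpack's own API. Below is a minimal, hypothetical sketch (assuming tensorpack is installed) of the case the hunk concedes: a one-time expensive preprocessing step cached in an intermediate format, here LMDB via `LMDBSerializer`. The two-file dataset and `expensive_preprocess` are stand-ins.

```python
# Hypothetical sketch: run a one-time expensive preprocessing step and
# cache the result in LMDB, one intermediate format among several.
from tensorpack.dataflow import DataFromList, MapData, LMDBSerializer

def expensive_preprocess(fname):
    # Stand-in for real one-time work (decoding, resizing, ...).
    return fname.upper()

# Each datapoint is a list of components; here, just a filename.
ds = DataFromList([["a.jpg"], ["b.jpg"]], shuffle=False)
ds = MapData(ds, lambda dp: [dp[0], expensive_preprocess(dp[0])])

# Run the expensive step once and store the results on disk:
LMDBSerializer.save(ds, "/tmp/cache.lmdb")

# Training workers later stream from the intermediate file:
ds_train = LMDBSerializer.load("/tmp/cache.lmdb", shuffle=True)
```

If another format suits the task better, swapping the serializer (tensorpack also ships, e.g., `HDF5Serializer` and `TFRecordSerializer`) is a one-line change, which is exactly the choice the paragraph argues should stay with the user.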
```diff
@@ -92,8 +96,8 @@ On the other hand, DataFlow is:
 ### Alternative Data Loading Solutions:
 Some frameworks have also provided good framework-specific solutions for data loading.
-In addition to that DataFlow is framework-agnostic, there are other reasons you
-might prefer DataFlow over the alternatives:
+On the contrary, DataFlow is framework-agnostic: you can use it in any Python environment.
+In addition to this benefit, there are other reasons you
+might prefer DataFlow over the alternatives:
 #### tf.data or other TF operations
```
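The sentence this hunk adds, "you can use it in any Python environment", is easy to demonstrate: a DataFlow is just an iterable of datapoints. A minimal sketch, assuming tensorpack is installed; the generator is a hypothetical data source.

```python
# Minimal sketch of framework-agnosticism: a DataFlow is consumed with a
# plain Python for-loop, so any framework's training loop can use it.
from tensorpack.dataflow import DataFromGenerator

def gen():
    # Hypothetical data source; yields datapoints (lists of components).
    for i in range(100):
        yield [i, i * i]

df = DataFromGenerator(gen)
df.reset_state()          # one-time initialization before iterating
for dp in df:             # ordinary iteration, no TF or PyTorch needed
    pass                  # hand `dp` to whatever framework you train with
```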
```diff
@@ -131,7 +135,7 @@ It only makes sense to use TF to read data, if your data is originally very clean
 If not, you may feel like writing a Python script to reformat your data, but then you're
 almost writing a DataFlow (a DataFlow can be made from a Python iterator)!
-As for speed, when TF happens to support the operators you need,
+As for speed, when TF happens to support and optimize the operators you need,
 it does offer a similar or higher speed (it takes effort to tune, of course).
 But how do you make sure you'll not run into one of the unsupported situations listed above?
```
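The parenthetical "(a DataFlow can be made from a Python iterator)" can be shown concretely. A hedged sketch, assuming tensorpack is installed: the reformatting script you would have written anyway becomes a DataFlow by implementing `__iter__`; `MessyTextFile` and `parse_line` are hypothetical.

```python
# Hypothetical sketch: a data-reformatting script expressed as a DataFlow
# by subclassing and implementing __iter__.
from tensorpack.dataflow import DataFlow

def parse_line(line):
    # Stand-in for whatever cleanup your messy data needs.
    return line.strip().split(",")

class MessyTextFile(DataFlow):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                if line.strip():              # skip blank lines on the fly
                    yield [parse_line(line)]  # one datapoint per line
```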
```diff
@@ -144,7 +148,7 @@ which does not work when you have a dynamic/unreliable data source,
 or when you need to filter your data on the fly.
 `torch.utils.data.DataLoader` is quite good, despite that it also makes some
-**bad assumptions on batching** and is not always efficient.
+**bad assumptions on batching** and is not always efficient:
 1. It assumes you always do batch training, has a constant batch size, and
 the batch grouping can be purely determined by indices.
```
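The numbered complaint about index-based, constant-size batching contrasts with DataFlow, where batching is just another transformation over a stream. A hypothetical sketch assuming tensorpack is installed: a dynamic source is filtered on the fly and then batched, with no dataset indices involved.

```python
# Hypothetical sketch: an unreliable/dynamic source, filtered on the fly,
# then batched. Batching is just another DataFlow, not an assumption
# baked into the loader.
import random
from tensorpack.dataflow import DataFromGenerator, BatchData

def filtered_stream():
    while True:
        x = random.random()    # stands in for a dynamic data source
        if x > 0.5:            # drop datapoints as they stream past
            yield [x]

df = DataFromGenerator(filtered_stream)
df = BatchData(df, batch_size=32)  # grouping needs no dataset indices
df.reset_state()
first_batch = next(iter(df))       # 32 datapoints stacked per component
```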