Commit e0391e29 authored Jun 23, 2019 by Yuxin Wu
update docs
parent 22f410e2
Showing 1 changed file with 23 additions and 19 deletions.

docs/tutorial/philosophy/dataflow.md (+23, -19)
```diff
@@ -49,23 +49,27 @@ We think you usually do not, at least not after you try DataFlow, because they are
 to this new format. Then you read data from this format to training workers.
 It's a waste of your effort: the intermediate format does not have to exist.
-1. **Still Not Easy**: There are cases when having an intermediate format is useful
-for performance reasons.
-For example, to apply some one-time expensive preprocessing to your dataset, or
-merge small files to large files to reduce disk burden.
-However, those binary data formats are not necessarily good for the cases.
 1. **Not Easy**: Even when you do need to use an intermediate format that's different from your
 original data format
 (for performance reasons, for example), there are many formats you can choose from.
-Why use a single dedicated binary format when you could use something else?
+Why use a special binary format when you could use something else?
 A different format may bring you:
 * Simpler code for data loading.
 * Easier visualization.
 * Interoperability with other libraries.
 * More functionalities.
-After all, why merging all the images into a binary file on the disk,
-when you know that saving all the images separately is fast enough for your task?
 Different formats have their strength and weakness in the above aspects.
 Forcing a single binary format on users is certainly not ideal.
 We should let users make the choice.
+1. **Not Necessarily Fast**:
+There are cases when having an intermediate format is useful for performance reasons.
+For example, to apply some one-time expensive preprocessing to your dataset.
+But other formats are probably equally fast.
+Formats like TFRecords and RecordIO are just as fast as your disk, and of course,
+as fast as other libraries.
 Decades of engineering in dataset systems have provided
```
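The "let users make the choice" point in this hunk maps directly onto tensorpack's own API. Below is a minimal, hypothetical sketch (assuming tensorpack is installed) of the case the hunk concedes: a one-time expensive preprocessing step cached in an intermediate format, here LMDB via `LMDBSerializer`. The two-file dataset and `expensive_preprocess` are stand-ins.

```python
# Hypothetical sketch: run a one-time expensive preprocessing step and
# cache the result in LMDB, one intermediate format among several.
from tensorpack.dataflow import DataFromList, MapData, LMDBSerializer

def expensive_preprocess(fname):
    # Stand-in for real one-time work (decoding, resizing, ...).
    return fname.upper()

# Each datapoint is a list of components; here, just a filename.
ds = DataFromList([["a.jpg"], ["b.jpg"]], shuffle=False)
ds = MapData(ds, lambda dp: [dp[0], expensive_preprocess(dp[0])])

# Run the expensive step once and store the results on disk:
LMDBSerializer.save(ds, "/tmp/cache.lmdb")

# Training workers later stream from the intermediate file:
ds_train = LMDBSerializer.load("/tmp/cache.lmdb", shuffle=True)
```

If another format suits the task better, swapping the serializer (tensorpack also ships, e.g., `HDF5Serializer` and `TFRecordSerializer`) is a one-line change, which is exactly the choice the paragraph argues should stay with the user.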
```diff
@@ -92,8 +96,8 @@ On the other hand, DataFlow is:
 ### Alternative Data Loading Solutions:
 Some frameworks have also provided good framework-specific solutions for data loading.
-In addition to that DataFlow is framework-agnostic, there are other reasons you
-might prefer DataFlow over the alternatives:
+On the contrary, DataFlow is framework-agnostic: you can use it in any Python environment.
+In addition to this benefit, there are other reasons you
+might prefer DataFlow over the alternatives:
 #### tf.data or other TF operations
```
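The sentence this hunk adds, "you can use it in any Python environment", is easy to demonstrate: a DataFlow is just an iterable of datapoints. A minimal sketch, assuming tensorpack is installed; the generator is a hypothetical data source.

```python
# Minimal sketch of framework-agnosticism: a DataFlow is consumed with a
# plain Python for-loop, so any framework's training loop can use it.
from tensorpack.dataflow import DataFromGenerator

def gen():
    # Hypothetical data source; yields datapoints (lists of components).
    for i in range(100):
        yield [i, i * i]

df = DataFromGenerator(gen)
df.reset_state()          # one-time initialization before iterating
for dp in df:             # ordinary iteration, no TF or PyTorch needed
    pass                  # hand `dp` to whatever framework you train with
```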
```diff
@@ -131,7 +135,7 @@ It only makes sense to use TF to read data, if your data is originally very clean
 If not, you may feel like writing a Python script to reformat your data, but then you're
 almost writing a DataFlow (a DataFlow can be made from a Python iterator)!
-As for speed, when TF happens to support the operators you need,
+As for speed, when TF happens to support and optimize the operators you need,
 it does offer a similar or higher speed (it takes effort to tune, of course).
 But how do you make sure you'll not run into one of the unsupported situations listed above?
```
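The parenthetical "(a DataFlow can be made from a Python iterator)" can be shown concretely. A hedged sketch, assuming tensorpack is installed: the reformatting script you would have written anyway becomes a DataFlow by implementing `__iter__`; `MessyTextFile` and `parse_line` are hypothetical.

```python
# Hypothetical sketch: a data-reformatting script expressed as a DataFlow
# by subclassing and implementing __iter__.
from tensorpack.dataflow import DataFlow

def parse_line(line):
    # Stand-in for whatever cleanup your messy data needs.
    return line.strip().split(",")

class MessyTextFile(DataFlow):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                if line.strip():              # skip blank lines on the fly
                    yield [parse_line(line)]  # one datapoint per line
```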
```diff
@@ -144,7 +148,7 @@ which does not work when you have a dynamic/unreliable data source,
 or when you need to filter your data on the fly.
 `torch.utils.data.DataLoader` is quite good, despite that it also makes some
-**bad assumptions on batching** and is not always efficient.
+**bad assumptions on batching** and is not always efficient:
 1. It assumes you always do batch training, has a constant batch size, and
 the batch grouping can be purely determined by indices.
```
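The numbered complaint about index-based, constant-size batching contrasts with DataFlow, where batching is just another transformation over a stream. A hypothetical sketch assuming tensorpack is installed: a dynamic source is filtered on the fly and then batched, with no dataset indices involved.

```python
# Hypothetical sketch: an unreliable/dynamic source, filtered on the fly,
# then batched. Batching is just another DataFlow, not an assumption
# baked into the loader.
import random
from tensorpack.dataflow import DataFromGenerator, BatchData

def filtered_stream():
    while True:
        x = random.random()    # stands in for a dynamic data source
        if x > 0.5:            # drop datapoints as they stream past
            yield [x]

df = DataFromGenerator(filtered_stream)
df = BatchData(df, batch_size=32)  # grouping needs no dataset indices
df.reset_state()
first_batch = next(iter(df))       # 32 datapoints stacked per component
```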