Simple pipe reader for hdfs or other service #5282

typhoonzero · 2017-11-01T13:02:10Z

helinwang · 2017-11-09T00:56:10Z

python/paddle/v2/reader/decorator.py

+def pipe_reader(left_cmd,
+                parser,
+                bufsize=8192,
+                file_type="plain",


Maybe we only need to support "plain"; the user can decompress the data outside of Paddle using a pipe.

I thought it might be inconvenient for users to decompress stream data in their parsers.

I meant the user can decompress the data using shell commands, not in the parsers, e.g.:

hadoop fs -cat /path/to/some/file | gzip -d
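For comparison, the in-Python alternative discussed in this thread (decompressing the stream inside the reader rather than in the shell) might look roughly like the sketch below. It uses only the standard `zlib` and `gzip` modules; the function name `decompress_stream` and the sample payload are hypothetical and not taken from the PR.

```python
import gzip
import zlib


def decompress_stream(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks."""
    # 32 + MAX_WBITS lets zlib auto-detect a gzip or zlib header.
    decomp = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data
    tail = decomp.flush()
    if tail:
        yield tail


payload = b"sample line\n" * 1000
compressed = gzip.compress(payload)
# Feed the compressed bytes in small pieces, as a pipe would deliver them.
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(decompress_stream(pieces))
```

Either approach decodes incrementally, so neither needs the whole file in memory.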

That is simpler, but my concern is that with bash the pipe size is set by ulimit. On a training cluster, users may not have control over every node's ulimit configuration, whereas Python code can set the size itself.

I don't understand bash very well, but doesn't the pipe just block when it's full? gzip can probably decode in a streaming fashion and will consume the pipe buffer, so the writer will be unblocked.

By default, pipes can block both producer and consumer:

If a process attempts to read from an empty pipe, then read(2) will
block until data is available. If a process attempts to write to a
full pipe (see below), then write(2) blocks until sufficient data has
been read from the pipe to allow the write to complete.
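The blocking behavior quoted above can be observed with a small sketch. It assumes a Unix-like system and puts the write end in non-blocking mode, so a full pipe raises `BlockingIOError` instead of stalling the process. The measured capacity depends on the platform and its configuration, so no value is shown.

```python
import os

read_fd, write_fd = os.pipe()
os.set_blocking(write_fd, False)  # raise instead of blocking when full

chunk = b"x" * 1024
capacity = 0
try:
    while True:
        capacity += os.write(write_fd, chunk)
except BlockingIOError:
    pass  # the pipe is full; a blocking writer would stall here

# Draining the pipe frees space, so the next write succeeds again.
drained = 0
while drained < capacity:
    drained += len(os.read(read_fd, 65536))
written_after_drain = os.write(write_fd, chunk)

os.close(read_fd)
os.close(write_fd)
```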

My point is that using pipes in Python code lets users define the pipe buffer size, which is critical to reader performance.
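A minimal sketch of that idea, assuming the reader launches the command with `subprocess.Popen`. The helper name `read_command_output` and the stand-in producer command are hypothetical. Note that `bufsize` here sets the size of Python's user-space buffer on the pipe file object; it does not change the kernel's pipe capacity.

```python
import subprocess
import sys


def read_command_output(cmd, bufsize=8192):
    """Yield the stdout of `cmd` in chunks of at most `bufsize` bytes."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=bufsize)
    try:
        while True:
            chunk = proc.stdout.read(bufsize)
            if not chunk:
                break
            yield chunk
    finally:
        proc.stdout.close()
        proc.wait()


# Stand-in producer; in the thread's scenario this would be a command
# such as `hadoop fs -cat ...`.
producer = [sys.executable, "-c", "import sys; sys.stdout.write('a' * 20000)"]
chunks = list(read_command_output(producer, bufsize=4096))
```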

I see, ok. Thanks!

helinwang · 2017-11-09T01:00:29Z

python/paddle/v2/reader/decorator.py

@@ -323,3 +323,101 @@ def xreader():
                yield sample

    return xreader
+
+
+def _buf2lines(buf, line_break="\n"):


A line break won't work for binary data. Maybe we should let the parser decide when to output a new data item?

If cut_lines=False, the binary data is sent to the parser directly. Do you mean we should let the user's parser generate the data, and make pipe_reader a decorator?

Yes. I think pipe_reader should not cut the lines, since it does not have sufficient information. We might want to leave it to the user's parser to cut and generate the data.
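To illustrate why record boundaries are better left to the parser, here is a sketch of a user-side parser for a binary format in which each record is a 4-byte big-endian length followed by the payload. The format and the name `parse_length_prefixed` are assumptions made for illustration; the PR does not define them. The parser receives raw chunks and handles records that span chunk boundaries, which a line-break splitter could not do.

```python
import struct


def parse_length_prefixed(chunks):
    """Yield payloads from chunks of length-prefixed binary records."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        while len(pending) >= 4:
            (size,) = struct.unpack(">I", pending[:4])
            if len(pending) < 4 + size:
                break  # record continues in the next chunk
            yield pending[4:4 + size]
            pending = pending[4 + size:]


records = [b"first", b"", b"third\nwith newline"]
stream = b"".join(struct.pack(">I", len(r)) + r for r in records)
# Split at arbitrary offsets, as a pipe read would.
chunks = [stream[i:i + 7] for i in range(0, len(stream), 7)]
parsed = list(parse_length_prefixed(chunks))
```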

Agree, will update.

fix pipe_reader unimport packages

helinwang

LGTM! Let's update https://github.com/PaddlePaddle/Paddle/pull/5282/files#r150348324 with a follow-up commit.

simple pipe reader for hdfs or other service

aeeb77d

typhoonzero assigned helinwang Nov 1, 2017

helinwang reviewed Nov 9, 2017


seiriosPlus and others added 2 commits November 27, 2017 22:19

fix pipe_reader unimport packages

bf360c7

Merge pull request #1 from seiriosPlus/tangw/simple_pipe_reader

cd36531


helinwang approved these changes Dec 1, 2017


typhoonzero merged commit d89061c into PaddlePaddle:develop Dec 2, 2017

typhoonzero deleted the simple_pipe_reader branch December 22, 2017 05:46
