Simple pipe reader for hdfs or other service #5282

Merged
typhoonzero merged 3 commits into PaddlePaddle:develop from simple_pipe_reader
Dec 2, 2017

typhoonzero
Contributor

Fix #5011

def pipe_reader(left_cmd,
                parser,
                bufsize=8192,
                file_type="plain",
Contributor

Maybe we only need to support "plain"; the user can decompress the data outside of Paddle using a pipe.

Contributor Author

I thought it might be inconvenient for users to decompress stream data in their parsers.
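
For reference, streaming decompression inside user code might look like the sketch below. It assumes gzip-compressed input arriving in arbitrary chunks; `decompress_chunks` is a hypothetical helper for illustration, not part of this PR.

```python
import gzip
import zlib


def decompress_chunks(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks.

    Hypothetical helper: the name and interface are assumptions.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data
    # Emit anything still held in the decompressor's internal buffer.
    tail = decomp.flush()
    if tail:
        yield tail


# Simulate a compressed stream that arrives in 8-byte pieces.
payload = gzip.compress(b"line one\nline two\n")
pieces = [payload[i:i + 8] for i in range(0, len(payload), 8)]
restored = b"".join(decompress_chunks(pieces))
```

The user's parser would still have to carry the decompressor state between chunks, which is the inconvenience mentioned above.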

Contributor

I meant the user can decompress the data using shell commands, not in the parsers, e.g.:

hadoop fs -cat /path/to/some/file | gzip -d

Contributor Author

@typhoonzero typhoonzero Nov 13, 2017

Well, this is simpler, but my concern is that the pipe size in bash is set by ulimit. In cluster training, users may not have control over every node's ulimit configuration, whereas they can set the buffer size from Python code.

Contributor

I don't know bash very well, but doesn't the pipe just "block" when it's full? gzip can probably decode in a streaming fashion, so it will consume the pipe buffer and the writer will be unblocked.

Contributor Author

By default, pipes can block both producer and consumer:

If a process attempts to read from an empty pipe, then read(2) will
block until data is available. If a process attempts to write to a
full pipe (see below), then write(2) blocks until sufficient data has
been read from the pipe to allow the write to complete.

Well, my point is that using pipes from Python code lets users define the pipe buffer size, which is critical to reader performance.
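
A minimal sketch of the Python-side approach, assuming `subprocess.Popen` is used to launch the command. `read_chunks` is a hypothetical name and this is not the code in this PR. The child command here is a stand-in that writes 20000 bytes; in practice it would be something like the `hadoop fs -cat` command above.

```python
import subprocess
import sys


def read_chunks(cmd, bufsize=8192):
    """Run cmd and yield its stdout in chunks of at most bufsize bytes.

    Hypothetical helper: the name and interface are assumptions.
    """
    # bufsize sizes Python's buffer around the read end of the pipe;
    # it does not change the kernel's pipe capacity.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=bufsize)
    try:
        while True:
            chunk = proc.stdout.read(bufsize)
            if not chunk:
                break  # EOF: the child closed its stdout
            yield chunk
    finally:
        proc.stdout.close()
        proc.wait()


# Stand-in producer so the sketch runs without a Hadoop installation.
cmd = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 20000)"]
chunks = list(read_chunks(cmd, bufsize=8192))
total = sum(len(c) for c in chunks)
```

Because the buffer size is an argument, each caller can tune it without touching node-level configuration.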

Contributor

@helinwang helinwang Nov 14, 2017

I see, ok. Thanks!

@@ -323,3 +323,101 @@ def xreader():
yield sample

return xreader


def _buf2lines(buf, line_break="\n"):
Contributor

Line breaks won't work for binary data. Maybe we should let the parser decide when to output a new data item?

Contributor Author

If cut_lines=False, the binary data is sent to the parser directly. Do you mean we should let the user's parser generate the data, and make pipe_reader a decorator?

Contributor

Yes, I thought maybe pipe_reader should not cut the lines, since it does not have sufficient information. We might want to leave it to the user's parser to do so (cut and generate data).
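
The design discussed above could be sketched as follows, assuming the parser is a callable that takes a raw chunk and returns zero or more samples. `chunk_reader` and `LineParser` are hypothetical names for illustration, not the merged implementation.

```python
def chunk_reader(chunks, parser):
    """Build a reader that feeds raw chunks to parser and yields its samples.

    Hypothetical sketch: the reader never inspects the data itself.
    """
    def reader():
        for chunk in chunks:
            for sample in parser(chunk):
                yield sample
    return reader


class LineParser(object):
    """Hypothetical user parser that carries partial lines across chunks."""

    def __init__(self):
        self.remained = b""

    def __call__(self, chunk):
        lines = (self.remained + chunk).split(b"\n")
        # The last piece is incomplete until the next chunk arrives.
        # Data after the final newline is never emitted in this sketch.
        self.remained = lines.pop()
        return lines


# The second record is split across two chunks.
reader = chunk_reader([b"a,1\nb,", b"2\nc,3\n"], LineParser())
samples = list(reader())
```

A parser for binary data could use the same interface and split on record lengths instead of line breaks.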

Contributor Author

Agreed, will update.

Contributor

@helinwang helinwang left a comment

@typhoonzero typhoonzero merged commit d89061c into PaddlePaddle:develop Dec 2, 2017
@typhoonzero typhoonzero deleted the simple_pipe_reader branch December 22, 2017 05:46