-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add chomp option to readline(s) #19944
Conversation
line = readuntil(s, 0x0a) | ||
i = length(line) | ||
if i < 1 || line[i] != 0x0a | ||
return String(line[1:i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do String(line)
? This could also be merged with the simple case above, as getting the length of line
should have a negligible cost.
if i < 1 || line[i] != 0x0a | ||
return String(line[1:i]) | ||
elseif i < 2 || line[i-1] != 0x0d | ||
return String(line[1:i-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unnecessarily creates a copy of line
and explains the allocations you get. Use unsafe_string(pointer(line), i-1)
. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsafe_string
still allocates a copy. Maybe String(resize!(line, i-1))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try!
readlines(filename::AbstractString) | ||
line = readuntil(s, 0x0a) | ||
i = length(line) | ||
if i < 1 || line[i] != 0x0a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably clearer as i == 0
? I wondered for one moment when it would be negative.
Also, it could be more efficient to start with the i >= 2
case, which is the common one, to avoid an unnecessary branch in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the order branches really make a difference? I've sometimes tried to benchmark with cases I thought were very obvious, but didn't find a clear effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea, would need to check. You're right that it's probably going to be negligible except maybe for very short strings, so maybe it's OK to have the first branch precisely for them.
@@ -74,6 +74,12 @@ Base.compact(io) | |||
@test_throws ArgumentError seek(io,0) | |||
@test_throws ArgumentError truncate(io,0) | |||
@test readline(io) == "whipped cream\n" | |||
@test write(io,"pancakes\nwaffles\nblueberries\n") > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a few non-ASCII characters like ë
and €
? This would ensure we don't assume 1-byte characters.
readline(stream::IO=STDIN) | ||
readline(filename::AbstractString) | ||
readline(stream::IO=STDIN, chomp::Bool=false) | ||
readline(filename::AbstractString, chomp::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if chomp=true
should be the default (here and in eachline
)? Almost all uses of readline
that I can find in the packages either (a) are insensitive to whether it ends with a newline or (b) explicitly strip the newline.
That would be a breaking change, but I doubt it would break much actual code, since most code calling readline
or eachline
has to handle the case where the last line in the file does not end in a newline anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan was to merge this first, and then open a PR to deprecate the one-argument form of readline
in 0.6. Then the default would be changed in 1.0.
We could make the breaking change if you say packages are fine, but the deprecation would allow people to notice they no longer need to remove trailing newlines, which will make code simpler/faster. Might be worth the inconvenience, not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Would prefer to do the deprecation in the same PR, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Let's ensure we're happy with the implementation, and then have a commit to make the switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for making chomp=true
the default. Does adding a deprecation mean fixing all of deprecation warnings from base as well? I'm afraid I won't have time to do it before 0.6 feature freeze.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but really that should be much easier than what you've done until now. If you really can't, I'll do it for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'll look into it first. I guess the easiest would be to change all the calls to readline to `readline(..., false).
How do you add a deprecation warning?
readline(filename::AbstractString, chomp::Bool=false) | ||
|
||
Read a single line of text from the given I/O stream or file (defaults to `STDIN`). | ||
Lines in the input can end in `'\\n'` or `'\\r\\n'`. When reading from a file, the text is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, should be "\r\n"
, so do that and also "\n"
.
While you're at it, the description of chomp
would better be in a separate paragraph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it does look better with separate paragraphs
s = ccall(:jl_readuntil, Ref{String}, (Ptr{Void}, UInt8, UInt8), s.ios, '\n', 1) | ||
i = endof(s) | ||
if i < 1 || codeunit(s,i) != 0x0a | ||
return s[1:i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to find a strategy similar to above to avoid a copy. But I'm not sure how to do that with the new String
design.
BTW, is this expected? I thought no copy was involved when creating the string:
julia> x=UInt8['a', 'b']
2-element Array{UInt8,1}:
0x61
0x62
julia> s=String(x)
"ab"
julia> x[1] = 'c'
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
julia> s # String hasn't changed
"ab"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, String(::Vector)
is no longer copy-free unless the vector is allocated with the (undocumented) Base.StringVector
:
julia> x = Base.StringVector(2); x[1] = 'a'; x[2] = 'b'
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
julia> x = Base.StringVector(2); x[1] = 'a'; x[2] = 'b'; s = String(x)
"ab"
julia> x[1] = 'c'
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
julia> s
"cb"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, in order to avoid an extra copy, it seems like the chomping will have to be done in C, in jl_readuntil
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also #19945.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR at least, we don't need these functions to be exported since we're in Base. But here we would need jl_readuntil
to return a StringVector
. Is that possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jl_readuntil
can return a String
already if str=1
is passed (see also #19946). It would be easy to add an additional chomp
argument to it that chomped LF/CRLF, and this might be the only way to completely avoid copies with the current String
implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chomp || return ccall(:jl_readuntil, Ref{String}, (Ptr{Void}, UInt8, UInt8), s.ios, '\n', 1) | ||
|
||
s = ccall(:jl_readuntil, Ref{String}, (Ptr{Void}, UInt8, UInt8), s.ios, '\n', 1) | ||
i = endof(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
endof
is costly since it needs to look for the start of the last Unicode char. Here we're working at the byte level, so use sizeof(s)
instead. Then, you cannot index directly into s
as you might be in the middle of a character: you need to access the underlying array of bytes directly (cf. point below).
|
||
s = ccall(:jl_readuntil, Ref{String}, (Ptr{Void}, UInt8, UInt8), s.ios, '\n', 1) | ||
i = endof(s) | ||
if i < 1 || codeunit(s,i) != 0x0a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add @inbounds
here since you did the check manually. Same below.
IMO the Another option would be, rather than adding Apologies for the noise if the ship has already sailed on the design discussion. |
@ararslan, a function argument would not allow the |
@ararslan I think the idea is that people won't use that argument that often, it's mostly useful as a deprecation path in order to chomp by default. |
Okay, thanks for the explanation, both of you. Sorry for the noise! (I do still think a kwarg would be a nice choice, but I will now sink back into the shadows.) |
The positional vs. keyword choice is a hard one. In general I'd also prefer using keyword arguments everywhere for booleans, but they come at a price currently, so they are seldom used in Base. Maybe something to think about for 2.0. |
I would also like to have a keyword argument, but @nalimilan pointed out that it comes with a performance penalty. As |
After the last two commits there is no difference in performance using |
@@ -292,6 +292,51 @@ JL_DLLEXPORT jl_value_t *jl_readuntil(ios_t *s, uint8_t delim, uint8_t str) | |||
return (jl_value_t*)a; | |||
} | |||
|
|||
JL_DLLEXPORT jl_value_t *jl_readline(ios_t *s, uint8_t chomp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of near-duplicate code. Could we add an argument to jl_readuntil
instead? Similarly ios_copyline
does not seem totally necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I've changed it now and its much cleaner.
Should I go ahead deprecate the old one argument version? What's the best practice here, should I rebase (and what's the correct git command) from master to avoid conflicts as Or is there an easy way to check for conflicts for a file before modifying it? |
Since there are no conflicts here at the moment, you should be able to |
""" | ||
function readlines(filename::AbstractString, chomp::Bool=false) | ||
open(filename) do f | ||
readlines(f, chomp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indented a bit too far
eltype(::Type{EachLine}) = String | ||
|
||
readlines(s=STDIN) = collect(eachline(s)) | ||
readlines(s::IO=STDIN, chomp::Bool=false) = collect(eachline(s, chomp)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this ever make sense to send in to eachline on a non-IO first argument? If so this might be restricting vs what worked before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell no (there is a separate definition for strings). This also matches documentation.
1cd2051
to
98fa88c
Compare
I have no added deprecation warnings to old methods and tried to fix all warnings from master. I have done clean build locally from scratch, but tests are still running. |
@@ -165,6 +165,9 @@ Compiler/Runtime improvements | |||
Deprecated or removed | |||
--------------------- | |||
|
|||
* One argument methods to `readline`, `readlines` and `eachline` have been depracated in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"deprecated"
eachline(stream::IO) = EachLine(stream) | ||
function eachline(filename::AbstractString) | ||
eachline(stream::IO, chomp::Bool=true) = EachLine(stream, chomp) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single new line is probably enough, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks, fixed
@@ -821,7 +821,8 @@ size_t ios_copyuntil(ios_t *to, ios_t *from, char delim) | |||
} | |||
else { | |||
size_t ntowrite = pd - (from->buf+from->bpos) + 1; | |||
written = ios_write(to, from->buf+from->bpos, ntowrite); | |||
size_t nchomp = ios_nchomp(from, ntowrite, chomp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd check chomp
here, and not pass it at all to ios_nchomp
. That can avoid a function call when chomp
is false. Then you can assume chomp
is true inside ios_nchomp
.
@test_throws ArgumentError seek(io,0) | ||
@test_throws ArgumentError truncate(io,0) | ||
@test readline(io) == "whipped cream\n" | ||
@test readline(io, false) == "whipped cream\n" | ||
@test write(io,"pancakes\nwaffles\nblueberries\n") > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-ASCII characters, please! :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
@@ -33,6 +33,9 @@ Breaking changes | |||
|
|||
This section lists changes that do not have deprecation warnings. | |||
|
|||
* `readline`, `readlines` and `eachline` return lines without line ends by default. | |||
You can use `readline(s, false`) to get the old behavior and include EOL character(s). ([#19944]). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misplaced closing backtick: should be after the )
@@ -33,6 +33,9 @@ Breaking changes | |||
|
|||
This section lists changes that do not have deprecation warnings. | |||
|
|||
* `readline`, `readlines` and `eachline` return lines without line ends by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"line endings" instead of "line ends" might be clearer
@stevengj Great. If you post a list of affected packages here, I can help filing issues. |
I have updated the news according to comments. I think this should be ready to merge. |
readline(stdout_read) | ||
readline(stdout_read) | ||
readline(stdout_read, false) | ||
readline(stdout_read, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it also be a good idea to test the default behavior with readline(....)
without line endings or is that already being tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's tested in iobuffer
and read
@@ -31,7 +31,7 @@ function prompt(msg::AbstractString; default::AbstractString="", password::Bool= | |||
Base.getpass(msg) | |||
else | |||
print(msg) | |||
chomp(readline(STDIN)) | |||
readline(STDIN, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe leave off true
since it's now default?
@@ -7,7 +7,7 @@ end | |||
|
|||
function parserow(stream::IO) | |||
withstream(stream) do | |||
line = readline(stream) |> chomp | |||
line = readline(stream, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe leave off true
since it's now default?
@@ -46,7 +46,7 @@ function linecontains(io::IO, chars; allow_whitespace = true, | |||
eat = true, | |||
allowempty = false) | |||
start = position(io) | |||
l = readline(io) |> chomp | |||
l = readline(io, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe leave off true
since it's now default?
@@ -99,7 +99,7 @@ function startswith(stream::IO, r::Regex; eat = true, padding = false) | |||
@assert Base.startswith(r.pattern, "^") | |||
start = position(stream) | |||
padding && skipwhitespace(stream) | |||
line = chomp(readline(stream)) | |||
line = readline(stream, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe leave off true
since it's now default?
@@ -67,7 +67,7 @@ end | |||
|
|||
function getmetabranch() | |||
try | |||
chomp(readline(joinpath(path(),"META_BRANCH"))) | |||
readline(joinpath(path(),"META_BRANCH"), true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe leave off true
since it's now default?
@@ -65,8 +65,7 @@ end | |||
|
|||
function read(readable::Union{IO,Base.AbstractCmd}) | |||
lines = Line[] | |||
for line in eachline(readable) | |||
line = chomp(line) | |||
for line in eachline(readable, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe leave off true
since it's now default?
unmark(sock) | ||
@test !ismarked(sock) | ||
@test_throws ArgumentError reset(sock) | ||
@test !unmark(sock) | ||
@test readline(sock) == "Goodbye, world...\n" | ||
@test readline(sock, true) == "Goodbye, world..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe leave off true
since it's now default?
One more bit of bikeshedding. How do people feel about |
We touched this above. Overall I prefer keyword arguments in many cases, but they are rarely used in the API currently anyway. In the present case, the performance impact could be significant if one calls |
Would be good to do that performance experiment. |
As a worst case, I tried calling buf = IOBuffer(join(["a" for i = 1:10000], "\n"))
function foo(io)
len = 0
while !eof(io)
len += sizeof(readline(io))
end
return len
end
@benchmark foo($buf)
buf = IOBuffer(join(["a" for i = 1:10000], "\n"))
myreadline(io::IO; chomp=false) = chomp ? chomp(readline(io)) : readline(io)
function bar(io)
len = 0
while !eof(io)
len += sizeof(myreadline(io))
end
return len
end
@benchmark bar($buf)
|
No, wait, that benchmark isn't right. The 6ns time is for subsequent benchmark runs when the buffer is empty. (6ns is obviously too small a time to read 10^4 lines!) |
Okay, here is a correct benchmark: @benchmark foo(b) setup=(b=IOBuffer(join(["a" for i = 1:10000], "\n")))
@benchmark bar(b) setup=(b=IOBuffer(join(["a" for i = 1:10000], "\n"))) The minimum time for |
The call to chomp in myreadline should fail since chomp is a Boolean, not a function. |
Oh right. It doesn't fail because I'm calling it with chomp=false, though |
Given the overhead I also think its better to leave it as positional argument. I have now cleaned the code as suggested in the last review by removing extra |
I realized that I forgot to give a type to the keyword argument (since type specialization does not occur on keyword args, I think?). I changed the definition to
and got:
i.e. a 15% overhead. Since this case (10000 lines in a memory buffer, with 1 char per line) is so pessimistic, it seems like a keyword argument will be fine. |
@stevengj I think your benchmark is not completely correct, since myreadline2(io::IO, c::Bool=false) = c ? chomp(readline(io)) : readline(io)
function foo(io)
len = 0
while !eof(io)
len += sizeof(myreadline2(io))
end
return len
end With this, I get an even smaller penalty due to the keyword argument: 5%. So +1 for the keyword argument. But let's make sure that's the final decision before asking @mpastell to update the PR once again. |
Adding a 0.6 milestone so that the decision is made before the feature freeze. @mpastell Since everybody seems to be in favor of a keyword argument, maybe it's worth updating the PR to use it so that it can be merged immediately if the consensus is confirmed. |
I'm in favor of the keyword argument and then merging this as a breaking change. |
I'm going to go ahead and make a PR squashing these changes into a single commit and then changing chomp to a keyword argument. |
Superseded by #20203. |
This pull request adds
chomp::Bool=false
option toreadline
,readlines
andeachline
This is "fresh start" based on discussion in #19877 .
The implementation is completely based on Julia, but it could be made a bit faster by modifying C code.
An example with a text file with 1e6 lines, this matches current master:
Setting chomp to true has 2x the number allocations and is therefore a bit slower: