-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace ASCIIString & UTF8String with String #16058
Conversation
74bb1af
to
d80ed00
Compare
145 files changed == smaller change, eh? |
convert(::Type{ASCIIString}, a::Array{UInt8,1}, invalids_as::AbstractString) = | ||
convert(ASCIIString, a, ascii(invalids_as)) | ||
convert(::Type{ASCIIString}, s::AbstractString) = ascii(bytestring(s)) | ||
ascii(x) = ascii(convert(String, x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So are we going to keep the ascii
methods here around instead of deprecating? I guess there's still a use-case for when you want to ensure your string is only ASCII characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first stage, I figured it would be less disruptive to keep it. We should probably get rid of ascii
and utf8
later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not deprecate them at the same time as the type renaming? It's easier to fix everything at the same time for users.
Also, EDIT: I misread the code.ascii
should definitely continue checking that the input is ASCII, for both safety and reliability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I say "at the same time", I mean that it should be done soon, but not necessarily in this PR, which is already big enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan: that's the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this nevermind it does so, got confusedascii(x)
method verify its input is ascii?
The downside of losing an |
Yeah, this change is a beast but there's even more to come... |
@@ -82,7 +82,7 @@ finalize(o::ANY) = ccall(:jl_finalize, Void, (Any,), o) | |||
gc(full::Bool=true) = ccall(:jl_gc_collect, Void, (Cint,), full) | |||
gc_enable(on::Bool) = ccall(:jl_gc_enable, Cint, (Cint,), on)!=0 | |||
|
|||
bytestring(str::ByteString) = str | |||
bytestring(str::String) = str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my question above about ascii
; is there a reason to keep bytestring
around as opposed to deprecating? I also understand that we want to keep things minimal, but it might be nice to just mark the places we want to follow up on later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to review, @quinnj. Yes, we should deprecate all of ascii
, utf8
, bytestring
, maybe string
and replace them all with String
, which will take any number of values, convert each to a string and produce a single String
object for the concatenation of all of those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started down that road, but that's a very big sweater once you start pulling the strings, and this change plus that one were just too much when combined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, understood. Definitely excited for this and I'll try to take some more time to review/test things out to help out.
@iamed2: that can be provided by an ASCII string type in an external package. |
@@ -277,8 +273,6 @@ function readdlm_string(sbuff::ByteString, dlm::Char, T::Type, eol::Char, auto:: | |||
catch ex | |||
if isa(ex, TypeError) && (ex.func == :store_cell) | |||
T = ex.expected | |||
elseif get(optsd, :ignore_invalid_chars, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this option need to be removed from docs and elsewhere in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll just reinstate it instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this whole option doesn't really make sense, which is why I deleted it. The transformation is now a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I hate this file. It should not be in Base, imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OT: Judging by the frequent complaints about performance we get, I also think these functions should be deprecated in favor of CSV.jl as soon as it's ready. CSV.jl+DataStreams.jl will also unify reading data into different structures like Array
, DataFrame
, TimeSeries
, data bases, etc.
data::Array{UInt8,1} | ||
ASCIIString(d::Array{UInt8,1}) = new(d) | ||
String(d::Array{UInt8,1}) = new(d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this inner constructor useless?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it is. Of course, the old code had it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not remove it then, we're likely to forget about it after merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I'm trying to get this PR merged before someone introduces more conflicts and/or breaks the tests. You have no idea how many times I've rebased this and how often it breaks when someone changes something seemingly unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was indeed my concern too (hence the quick review). Go ahead then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is no longer fresh and I'm going to have to rebase and rerun CI anyway, I'll include this change (and a few others) when I do.
dac2e0c
to
d74cbe7
Compare
I'm inclined to merge this but it's going to wreak havoc on packages. Thoughts? |
I guess I should add |
@@ -3728,7 +3728,7 @@ second variant. | |||
popdisplay | |||
|
|||
""" | |||
readdlm(source, delim::Char, T::Type, eol::Char; header=false, skipstart=0, skipblanks=true, use_mmap, ignore_invalid_chars=false, quotes=true, dims, comments=true, comment_char='#') | |||
readdlm(source, delim::Char, T::Type, eol::Char; header=false, skipstart=0, skipblanks=true, use_mmap, quotes=true, dims, comments=true, comment_char='#') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's another mention of ignore_invalid_chars
below
d74cbe7
to
8509862
Compare
Nice! |
Lineedit change is due to
Before:
After:
I believe the following is the correct patch:
|
Ah, I see. That's pretty subtle. Thanks for figuring it out. |
Since |
This PR seems to have dropped code coverage by nearly 10% (https://codecov.io/gh/JuliaLang/julia/commit/5de52cf9c9343cfcf50be4c7c736290d3f985961/changes). |
Is this expected? julia> String("1")
ERROR: MethodError: no method matching convert(::Type{Array{UInt8,1}}, ::String)
you may have intended to import Base.convert
Closest candidates are:
convert(::Type{Any}, ::ANY)
convert{T}(::Type{T}, ::T)
in String(::String) at ./boot.jl:222
in eval(::Module, ::Any) at ./boot.jl:228 |
Yeah, turns out there was a reason to leave this "useless" inner constructor in and it's the problem @yuyichao just found. I think I discovered that months ago and had left it in for a reason. That's the problem with really long-running PRs. Fixed here: #16212. |
Hard to believe that wasn't tested. |
@tkelman: Yeah, I made a PR but I'm also adding tests to it for that. @vtjnash: I'm not sure what's up with that coverage drop. Some of it is legit in that we relied on ACSIIString and UTF8String to exercise different code paths. But most of it seems totally unrelated – if you look at the actual changes, they look like code paths that don't depend on string types at all. I wonder if the code coverage machinery broke somehow? |
The coverage tests run with inlining off. There are occasionally tests that fail in that situation but pass when inlining is on. https://build.julialang.org/builders/coverage_ubuntu14.04-x64/builds/216/steps/Run%20non-inlined%20tests/logs/stdio
|
Ah, I guess I should make sure that tests pass with inlining off. |
It's definitely well past time to get moving on this change, and I've tried bigger PRs but I think I need to start smaller. Here I'm separating out just the part that merges ASCIIString and UTF8String and replaces them with a single String type, which is basically just a rename of UTF8String. The deprecation is pretty minimal, we'll have to run PkgEval and figure out what breaks out there and try to mitigate.
This is kind of an awkward middle state since UTF8String no longer exists but UTF16String and UTF32String still do. Eventually, these should all move out into a package: if you want fully validated UTF8-encoded strings, then you should use a UTF8String type, etc. The Base String type will then be the somewhat permissive default string type that can handle any UTF8-like data, valid or not.