replace ASCIIString & UTF8String with String #16058

StefanKarpinski · 2016-04-26T20:57:24Z

It's definitely well past time to get moving on this change. I've tried bigger PRs, but I think I need to start smaller. Here I'm separating out just the part that merges ASCIIString and UTF8String and replaces them with a single String type, which is basically just a rename of UTF8String. The deprecation is pretty minimal; we'll have to run PkgEval, figure out what breaks out there, and try to mitigate.

This is kind of an awkward middle state since UTF8String no longer exists but UTF16String and UTF32String still do. Eventually, these should all move out into a package: if you want fully validated UTF8-encoded strings, then you should use a UTF8String type, etc. The Base String type will then be the somewhat permissive default string type that can handle any UTF8-like data, valid or not.

quinnj · 2016-04-26T22:38:11Z

145 files changed == smaller change, eh?

quinnj · 2016-04-26T22:39:44Z

base/ascii.jl

-convert(::Type{ASCIIString}, a::Array{UInt8,1}, invalids_as::AbstractString) =
-    convert(ASCIIString, a, ascii(invalids_as))
-convert(::Type{ASCIIString}, s::AbstractString) = ascii(bytestring(s))
+ascii(x) = ascii(convert(String, x))


So are we going to keep the ascii methods here around instead of deprecating? I guess there's still a use-case for when you want to ensure your string is only ASCII characters?

For the first stage, I figured it would be less disruptive to keep it. We should probably get rid of ascii and utf8 later.

Why not deprecate them at the same time as the type renaming? It's easier to fix everything at the same time for users.

~~Also, ascii should definitely continue checking that the input is ASCII, for both safety and reliability.~~ EDIT: I misread the code.

When I say "at the same time", I mean that it should be done soon, but not necessarily in this PR, which is already big enough.

@nalimilan: that's the idea.

~~shouldn't this ascii(x) method verify its input is ascii?~~ Never mind, it does; I got confused.
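
To illustrate the check discussed in this thread, here is a small sketch of how `ascii` behaves on a current Julia 1.x release. That environment is an assumption: the 0.5-era definition in the diff above may differ in detail. On 1.x, `ascii` returns the string when every character is ASCII and throws otherwise.

```julia
# Behavior of `ascii` on Julia 1.x (assumed environment, not the 0.5 build under review).
ok = ascii("hello")          # every character is ASCII, so the string is returned

# A non-ASCII character makes `ascii` throw an ArgumentError.
rejected = try
    ascii("héllo")
    false
catch err
    err isa ArgumentError
end
```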

iamed2 · 2016-04-26T22:43:00Z

The downside of losing an ASCIIString type is that wrapping an API that only accepts ASCII becomes harder as you can't rely on the type system to help you. Perhaps there could be an ASCIIString type that just wraps String but ensures all content isascii?
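
A minimal sketch of the wrapper idea suggested here, written for current Julia 1.x syntax rather than the 0.5 syntax of this PR. The type name `ASCIIOnly` and the function `send_ascii` are hypothetical and do not come from Base or from any existing package.

```julia
# Hypothetical wrapper type: the name `ASCIIOnly` is illustrative only.
struct ASCIIOnly
    data::String
    function ASCIIOnly(s::AbstractString)
        # Validate once, at construction time.
        isascii(s) || throw(ArgumentError("non-ASCII input: " * repr(s)))
        return new(String(s))
    end
end

# A function wrapping an ASCII-only API can then demand the checked type.
# `send_ascii` is a stand-in for such an API, not a real function.
send_ascii(s::ASCIIOnly) = ncodeunits(s.data)

n = send_ascii(ASCIIOnly("abc"))
```

Because validation happens in the inner constructor, any method that accepts `ASCIIOnly` can rely on the content being ASCII without checking again.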

StefanKarpinski · 2016-04-26T22:44:06Z

Yeah, this change is a beast but there's even more to come...

quinnj · 2016-04-26T22:46:14Z

base/base.jl

@@ -82,7 +82,7 @@ finalize(o::ANY) = ccall(:jl_finalize, Void, (Any,), o)
 gc(full::Bool=true) = ccall(:jl_gc_collect, Void, (Cint,), full)
 gc_enable(on::Bool) = ccall(:jl_gc_enable, Cint, (Cint,), on)!=0

-bytestring(str::ByteString) = str
+bytestring(str::String) = str


Similar to my question above about ascii; is there a reason to keep bytestring around as opposed to deprecating? I also understand that we want to keep things minimal, but it might be nice to just mark the places we want to follow up on later.

Thanks for taking the time to review, @quinnj. Yes, we should deprecate all of ascii, utf8, bytestring, and maybe string, and replace them all with String, which will take any number of values, convert each to a string, and produce a single String object that is the concatenation of all of them.

I started down that road, but that's a very big sweater once you start pulling the strings, and this change plus that one were just too much when combined.

Yep, understood. Definitely excited for this and I'll try to take some more time to review/test things out to help out.
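
For reference, the lowercase `string` function in current Julia 1.x releases has the concatenating behavior described above for the proposed `String`. This sketch only illustrates that behavior and says nothing about how the proposal itself was resolved.

```julia
# `string` converts each argument to its printed form and concatenates the results.
s = string(1, 'a', "b", 2.5)
```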

StefanKarpinski · 2016-04-26T22:46:26Z

@iamed2: that can be provided by an ASCII string type in an external package.

tkelman · 2016-04-27T01:11:45Z

base/datafmt.jl

@@ -277,8 +273,6 @@ function readdlm_string(sbuff::ByteString, dlm::Char, T::Type, eol::Char, auto::
        catch ex
            if isa(ex, TypeError) && (ex.func == :store_cell)
                T = ex.expected
-            elseif get(optsd, :ignore_invalid_chars, false)


does this option need to be removed from docs and elsewhere in the code?

I think I'll just reinstate it instead.

Actually, this whole option doesn't really make sense, which is why I deleted it. The transformation is now a no-op.

Also, I hate this file. It should not be in Base, imo.

OT: Judging by the frequent complaints about performance we get, I also think these functions should be deprecated in favor of CSV.jl as soon as it's ready. CSV.jl+DataStreams.jl will also unify reading data into different structures like Array, DataFrame, TimeSeries, databases, etc.

nalimilan · 2016-04-27T09:26:02Z

base/boot.jl

    data::Array{UInt8,1}
-    ASCIIString(d::Array{UInt8,1}) = new(d)
+    String(d::Array{UInt8,1}) = new(d)


Isn't this inner constructor useless?

Yes, I think it is. Of course, the old code had it too.

Why not remove it then, we're likely to forget about it after merging.

Because I'm trying to get this PR merged before someone introduces more conflicts and/or breaks the tests. You have no idea how many times I've rebased this and how often it breaks when someone changes something seemingly unrelated.

That was indeed my concern too (hence the quick review). Go ahead then.

Since this is no longer fresh and I'm going to have to rebase and rerun CI anyway, I'll include this change (and a few others) when I do.

StefanKarpinski · 2016-04-28T22:19:32Z

I'm inclined to merge this but it's going to wreak havoc on packages. Thoughts?

StefanKarpinski · 2016-04-28T22:29:42Z

I guess I should add String to Compat first so that people can fix their packages.

tkelman · 2016-04-29T00:59:31Z

base/docs/helpdb/Base.jl

@@ -3728,7 +3728,7 @@ second variant.
 popdisplay

 """
-    readdlm(source, delim::Char, T::Type, eol::Char; header=false, skipstart=0, skipblanks=true, use_mmap, ignore_invalid_chars=false, quotes=true, dims, comments=true, comment_char='#')
+    readdlm(source, delim::Char, T::Type, eol::Char; header=false, skipstart=0, skipblanks=true, use_mmap, quotes=true, dims, comments=true, comment_char='#')


there's another mention of ignore_invalid_chars below

quinnj · 2016-05-04T16:49:26Z

Nice!

Keno · 2016-05-04T17:55:32Z

The LineEdit change is due to a change in prevind behavior:

1|debug > n
In REPL.jl:538
568           if match != 0:-1 && h != response_str && haskey(hist.mode_mapping, hist.modes[idx])
569               truncate(response_buffer, 0)
570               write(response_buffer, h)
571               seek(response_buffer, prevind(response_str, first(match)))
572               hist.cur_idx = idx
573               return true

Before:

1|julia > prevind("ll", first(5:5))
4

After:

1|julia > prevind("ll", first(5:5))
2

I believe the following is the correct patch:

diff --git a/base/REPL.jl b/base/REPL.jl
index bc3e4f9..90929b0 100644
--- a/base/REPL.jl
+++ b/base/REPL.jl
@@ -568,7 +568,7 @@ function history_search(hist::REPLHistoryProvider, query_buffer::IOBuffer, respo
         if match != 0:-1 && h != response_str && haskey(hist.mode_mapping, hist.modes[idx])
             truncate(response_buffer, 0)
             write(response_buffer, h)
-            seek(response_buffer, prevind(response_str, first(match)))
+            seek(response_buffer, prevind(h, first(match)))
             hist.cur_idx = idx
             return true
         end
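
The reasoning behind the patch is that a match range is a set of byte indices into the string that was searched, so it is only meaningful for that string. A small sketch on Julia 1.x (an assumed environment), where the values of `h` and `response_str` are made up for illustration and `findfirst` stands in for whatever search the REPL actually performs:

```julia
h = "αβ end"                 # hypothetical history entry; α and β take 2 bytes each
response_str = "ll"          # hypothetical, shorter, unrelated string

r = findfirst("end", h)      # byte range of the match inside h
i = prevind(h, first(r))     # index of the character just before the match, in h

# first(r) lies beyond the end of response_str, so using it there is meaningless.
beyond = first(r) > ncodeunits(response_str)
```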

StefanKarpinski · 2016-05-04T18:17:51Z

Ah, I see. That's pretty subtle. Thanks for figuring it out.

nalimilan · 2016-05-04T18:30:08Z

Since "ll"[1:5] throws a BoundsError, shouldn't prevind("ll", 5) do the same? That would catch this kind of weird behavior.
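
For comparison, a sketch of how this behaves on a current Julia 1.x release, which is an assumption about the environment and not the 0.5 build discussed here. There, `prevind` accepts indices up to one past the last code unit and throws a `BoundsError` beyond that, matching the slicing behavior. The helper `throws_bounds` is hypothetical.

```julia
# Hypothetical helper: true if calling `f` raises a BoundsError.
throws_bounds(f) = try
    f()
    false
catch err
    err isa BoundsError
end

s = "ll"
in_range = prevind(s, ncodeunits(s) + 1)          # one past the end is still accepted
slice_throws = throws_bounds(() -> s[1:5])        # out-of-range slice
prevind_throws = throws_bounds(() -> prevind(s, 5))
```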

vtjnash · 2016-05-05T00:01:05Z

This PR seems to have dropped code coverage by nearly 10% (https://codecov.io/gh/JuliaLang/julia/commit/5de52cf9c9343cfcf50be4c7c736290d3f985961/changes).

yuyichao · 2016-05-05T02:13:14Z

Is this expected?

julia> String("1")
ERROR: MethodError: no method matching convert(::Type{Array{UInt8,1}}, ::String)
you may have intended to import Base.convert
Closest candidates are:
  convert(::Type{Any}, ::ANY)
  convert{T}(::Type{T}, ::T)
 in String(::String) at ./boot.jl:222
 in eval(::Module, ::Any) at ./boot.jl:228

StefanKarpinski · 2016-05-05T13:24:15Z

Yeah, turns out there was a reason to leave this "useless" inner constructor in and it's the problem @yuyichao just found. I think I discovered that months ago and had left it in for a reason. That's the problem with really long-running PRs. Fixed here: #16212.

tkelman · 2016-05-05T13:34:31Z

Hard to believe that wasn't tested.

StefanKarpinski · 2016-05-05T13:40:15Z

@tkelman: Yeah, I made a PR but I'm also adding tests to it for that.

@vtjnash: I'm not sure what's up with that coverage drop. Some of it is legit in that we relied on ASCIIString and UTF8String to exercise different code paths. But most of it seems totally unrelated: if you look at the actual changes, they look like code paths that don't depend on string types at all. I wonder if the code coverage machinery broke somehow?
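
A sketch of the kind of regression test that would have caught the `String("1")` failure reported above. It is written against the `Test` standard library of current Julia 1.x, which is an assumption: the actual tests added in #16212 are not shown in this thread and may look different.

```julia
using Test

@testset "String constructor round trips" begin
    @test String("1") == "1"            # String from an existing String
    @test String(UInt8[0x31]) == "1"    # String from raw bytes
    @test String(['a', 'b']) == "ab"    # String from a vector of characters
end
```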

tkelman · 2016-05-05T15:13:51Z

The coverage tests run with inlining off. There are occasionally tests that fail in that situation but pass when inlining is on. https://build.julialang.org/builders/coverage_ubuntu14.04-x64/builds/216/steps/Run%20non-inlined%20tests/logs/stdio

read
ERROR: LoadError: MethodError: no method matching convert(::Type{Array{UInt8,1}}, ::Array{Char,1})
you may have intended to import Base.convert
Closest candidates are:
  convert(!Matched::Type{Any}, ::ANY)
  convert{T}(::Type{T}, !Matched::T)
 in String(::Array{Char,1}) at ./boot.jl:222
 in (::##895#909)(::String) at /home/ubuntu/buildbot/slave/coverage_ubuntu14_04-x64/build/julia-6c67baee1a/share/julia/test/read.jl:179
 in mktempdir(::##895#909, ::String) at ./file.jl:272
 in mktempdir(::Function) at ./file.jl:270
 in include_from_node1(::String) at ./loading.jl:426
 in (::CoverageBase.##13#14{Array{String,1}})() at /home/ubuntu/.julia/v0.5/CoverageBase/src/CoverageBase.jl:41
 in cd(::CoverageBase.##13#14{Array{String,1}}, ::String) at ./file.jl:48
 in runtests(::Array{String,1}) at /home/ubuntu/.julia/v0.5/CoverageBase/src/CoverageBase.jl:38
 in eval(::Module, ::Any) at ./boot.jl:228
 [inlined code] from ./sysimg.jl:11
 in process_options(::Base.JLOptions) at ./client.jl:240
 in _start() at ./client.jl:319
while loading /home/ubuntu/buildbot/slave/coverage_ubuntu14_04-x64/build/julia-6c67baee1a/share/julia/test/read.jl, in expression starting on line 3

StefanKarpinski · 2016-05-05T16:40:54Z

Ah, I guess I should make sure that tests pass with inlining off.

StefanKarpinski force-pushed the sk/highlander1 branch from 74bb1af to d80ed00 Compare April 26, 2016 20:59

quinnj reviewed Apr 26, 2016

tkelman reviewed Apr 27, 2016

ivarne added this to the 0.6.0 milestone Apr 27, 2016

nalimilan reviewed Apr 27, 2016

StefanKarpinski force-pushed the sk/highlander1 branch 2 times, most recently from dac2e0c to d74cbe7 Compare April 28, 2016 19:41

StefanKarpinski mentioned this pull request Apr 28, 2016

Stringapalooza #16107

Closed

32 tasks

tkelman reviewed Apr 29, 2016

StefanKarpinski added 5 commits May 4, 2016 11:14

replace ASCIIString & UTF8String with String

9e1b5dd

REPL: make all string fields concrete (i.e. String)

011f4ee

delete redundant String inner constructor

530ba95

replace jl_is_utf8_string and jl_is_byte_string with jl_is_string

16a5d05

error on non-ASCII names passed to SuiteSparse

8509862

StefanKarpinski force-pushed the sk/highlander1 branch from d74cbe7 to 8509862 Compare May 4, 2016 15:44

StefanKarpinski merged commit 5de52cf into master May 4, 2016

StefanKarpinski deleted the sk/highlander1 branch May 4, 2016 16:44

stevengj mentioned this pull request May 4, 2016

String is back JuliaLang/Compat.jl#193

Closed

ViralBShah assigned StefanKarpinski May 9, 2016

jrevels mentioned this pull request May 9, 2016

misc. benchmark regressions since 0.4 #16128

Closed

nalimilan mentioned this pull request Oct 24, 2016

Simplify takebuf() API #19088

Merged
