Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the TOML stdlib in code loading #36018

Merged
merged 1 commit into from
Aug 28, 2020
Merged

Use the TOML stdlib in code loading #36018

merged 1 commit into from
Aug 28, 2020

Conversation

KristofferC
Copy link
Member

@KristofferC KristofferC commented May 24, 2020

The current way TOML files are parsed in Base during code loading has a few issues:

  • The TOML "parser" in Base does pattern matching to try to find exactly the information code loading in its current state needs. This makes it a bit hard to extend since you need to write a new parsing routine for every new feature. We might want to try add some new features (e.g. sub-projects) that would require some changes in code loading which the current parser causes a bit of a hindrance to.
  • The TOML parser in Base is "stateless" so every query it makes it starts to parse the file from scratch. This means that in some cases it can re-parse the file many times (Code loading might be better just fully parsing TOML files #27414 (comment)) unnecessarily.
  • The TOML parser used by code loading and Pkg is different which means that they support different parts of the spec (AFAIU, the Base parser doesn't support inline tables for example).

Therefore, it would make sense to try:

  • Make Base use a proper TOML parser that implements the spec.
  • Make Pkg and Base share this TOML parser implementation.
  • Eventually make the TOML parser a fully supported stdlib. TOML files are a big part of packages in Julia and it makes sense to me to have an easy option to parse these since they are needed for a lot of the ecosystem.

This PR starts this process by putting a newly written TOML parser into Base and re-implements code loading to take advantage of it. The parser is available as a package with proper API, docs, tests, etc (https://github.com/KristofferC/TOMLX.jl) but the only thing this PR adds is the parser part (with a more low-level API) since that is what is needed in Base and I wanted to make this PR have as little "noise" as possible.
It has a significant number of tests (https://github.com/KristofferC/TOMLX.jl/blob/master/test/readme.jl, https://github.com/KristofferC/TOMLX.jl/blob/master/test/toml_test.jl) and I have benchmarked it pretty thoroughly.

For this PR, I think it makes sense to mostly comment on the code loading parts (@StefanKarpinski for review). For the TOML parsing itself, perhaps it makes sense to focus that discussion to the https://github.com/KristofferC/TOMLX.jl/ repo?.

After this step is done, we update Pkg to start using the parser here for it's parsing.
Then we can try to finally make a TOML stdlib and have Pkg depend on that. This will remove the last bundled dependency for Pkg.

There is a TODO in this PR and that is that I right now use a global cache object to store the filename => Dict mapping in code loading which gets empited in Base.require. This should be moved to a function local cache that gets created in Base.require and threaded through all functions. I wanted to put up this PR anyway for people to get a chance to review things as soon as possible.


One possible question is why this parser is not just the one that currently lives in the ext folder in Pkg. The reason for this is that the ext- parser (forked from https://github.com/JuliaLang/TOML.jl) was written a quite long time ago for an old spec of TOML and requires some significant updates. I was also a bit unhappy with the performance of it (it allocates scratch buffers a bit too much and makes copies of strings etc) so I felt a clean implementation might lead to a better end result.


Fixes #27414.

@KristofferC KristofferC added the packages Package management and loading label May 24, 2020
@timholy
Copy link
Member

timholy commented May 24, 2020

Does this make Pkg operations faster? Thinking of #27414 (comment)

@KristofferC
Copy link
Member Author

Does this make Pkg operations faster?

If you mean loading packages, then I don't think it was bottlenecked on reading the TOML files, even though it did it quite inefficiently.
If you mean stuff like Pkg.add then TOML parsing is significant but not crucial for performance (anymore). I've already fixed the biggest performance problems with the TOML parser in Pkg (JuliaLang/Pkg.jl#113, JuliaLang/Pkg.jl#116, JuliaLang/Pkg.jl#1751) so while this parser is significantly faster there are some diminishing returns in effect.

@StefanKarpinski
Copy link
Member

Any measurements on how this affects code loading speed? One thought I’d had for that was to specify what values you’d like to extract from a file and only return that data instead of constructing the whole data structure. So if you’re just looking for a single value, it could be done with no allocation at all.

@KristofferC
Copy link
Member Author

KristofferC commented May 25, 2020

One thought I’d had for that was to specify what values you’d like to extract from a file and only return that data instead of constructing the whole data structure. So if you’re just looking for a single value, it could be done with no allocation at all.

Yeah, it is possible, but the time to fully parse my pretty big v1.4 Manifest.toml file is ~300 us (including reading it from disc). Compared to the cost of actually loading a package that is negligible. I don't think this really needs to be optimized further.

@StefanKarpinski
Copy link
Member

I was thinking about the issue of parsing the same file over and over. But maybe caching is better for that.

@KristofferC
Copy link
Member Author

I was thinking about the issue of parsing the same file over and over.

I think it is good to completely avoid opening and reading through the file which a cache should help with. Also, it is simpler in the sense that you don't need a second version of the parser that does something more akin lazy parsing.

# [a.b.c.d] doesn't "define" the table [a]
# so keys can later be added to [a], therefore
# we need to keep track of what tables are
# actualyl "defined
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# actualyl "defined
# actually defined

Comment on lines 156 to 159
@eval macro $(Symbol("try"))(expr)
:(
v = $(esc(expr));
v isa $ParserError && return v;
v;
)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@eval macro $(Symbol("try"))(expr)
:(
v = $(esc(expr));
v isa $ParserError && return v;
v;
)
end
@eval macro $(:var"try")(expr)
return quote
v = $(esc(expr))
v isa ParserError && return v
v
end
end

function recurse_dict!(l::Parser, d::Dict, dotted_keys::AbstractVector{String}, check=true)::Err{TOMLDict}
for i in 1:length(dotted_keys)
key = dotted_keys[i]
d = get!(() -> TOMLDict(), d, key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
d = get!(() -> TOMLDict(), d, key)
d = get!(TOMLDict, d, key)


isvalid_hex(c::Char) = isdigit(c) || ('a' <= c <= 'f') || ('A' <= c <= 'F')
isvalid_oct(c::Char) = '0' <= c <= '7'
isvalid_binary(c::Char) = '0' <= c <= '2'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
isvalid_binary(c::Char) = '0' <= c <= '2'
isvalid_binary(c::Char) = '0' <= c <= '1'

contains_underscore |= read_underscore
end
if !ok_end_value(peek(l))
error()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: message is required

subs = take_substring(l)
# Need to pass a AbstractString to `parse` so materialize it in case it
# contains underscore.
# vvvvvvv <- this looksl like a dude flipping the bird
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# vvvvvvv <- this looksl like a dude flipping the bird

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:(

e isa Base.OverflowError && return(ParserError(ErrOverflowError))
error("internal parser error: did not correctly discredit $(repr(s)) as an int")
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
end
return v
end

set_marker!(l)
@try accept_two(l, isdigit)
day = parse_int(l, false)
# Verify the real range in the constructor below
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true--as stated at the top of the file, no validity checking is done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is true when you have the Dates stdlib available:

julia> Date(1989, 02, 29)
ERROR: ArgumentError: Day: 29 out of range (1:28)

if ok_end_value(peek(l))
if (read_space = accept(l, ' '))
if !isdigit(peek(l))
return Date(year, month, day)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you intend for this to throw errors (whereas everything else returns them)?

Copy link
Member Author

@KristofferC KristofferC Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a try catch here as well.

accept_two(l, f::F) where {F} = accept_n(l, 2, f) || return(ParserError(ErrParsingDateTime))
function parse_datetime(l)::Err{Union{DateTime, Date}}
# Year has already been eaten when we reach here
year = parse_int(l, false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
year = parse_int(l, false)
year = parse_int(l, false)::Int

base/loading.jl Show resolved Hide resolved
base/loading.jl Outdated
entry = first(d[name])
if found_name
uuid = get(entry, "uuid", nothing)
return PkgId(UUID(uuid), name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERROR: MethodError: Cannot convert an object of type Nothing to an object of type UInt128

base/loading.jl Outdated
error("expected a single entry for $(repr(name)) in $(repr(project_file))")
end
entry = first(d[name])
if found_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always true here?

elseif deps isa Dict
for (dep, uuid) in deps
if dep === name
return PkgId(UUID(uuid), name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uuid might not be of the proper type if the file is malformed

base/loading.jl Outdated
return nothing
found_where || return nothing
found_name || return PkgId(name)
# Only here is deps was not a dict which mean we have a unique name for the dep
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Only here is deps was not a dict which mean we have a unique name for the dep
# Only here if deps was not a dict which mean we have a unique name for the dep

Comment on lines 10 to 29
# In case we do not have the Dates stdlib available
# we parse DateTime into these internal structs,
# note that these do not do any argument checking
struct Date
year::Int
month::Int
day::Int
end
struct Time
hour::Int
minute::Int
second::Int
ms::Int
end
struct DateTime
date::Date
time::Time
end
DateTime(y, m, d, h, mi, s, ms) =
DateTime(Date(y,m,d), Time(h, mi, s, ms))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be so much better if we could either use delayed definitions here (Requires.jl?) or make these a parser error (unsupported) rather than re-implementing incompatible structs with the same names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't make it a parser error (we need to support arbitrary TOML files) and we want code loading to work without Dates. So this was the best I could come up with.

@vtjnash vtjnash added needs docs Documentation for this change is required needs tests Unit tests are required for this change labels Jun 3, 2020
@vtjnash
Copy link
Member

vtjnash commented Jun 3, 2020

Added the docs and tests labels since the new toml_parser.jl file will need both. Otherwise, most looks pretty straight-forward here and should be in pretty good shape to merge.

base/loading.jl Outdated
p::TOML.Parser
d::Dict{String, Dict{String, Any}}
end
TOMLCache() = TOMLCache(TOML.Parser(), Dict())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it doesn't matter, but maybe:

Suggested change
TOMLCache() = TOMLCache(TOML.Parser(), Dict())
TOMLCache() = TOMLCache(TOML.Parser(), Dict{String,Dict{String,Any}}())

?

@timholy
Copy link
Member

timholy commented Jul 3, 2020

Thanks for pointing out the better alternative. It seems to be identical to the difference between

x::T = y

and

x = y::T

I'll switch to the style you demonstrated for future commits.

@KristofferC
Copy link
Member Author

I've rebased this and changed the target branch to https://github.com/JuliaLang/julia/tree/kc/load_toml for a simpler review.

@KristofferC
Copy link
Member Author

@timholy If you want to do some invalidation sanity checks, you could use this branch together with JuliaLang/Pkg.jl#1984 and see how things look.

@KristofferC KristofferC removed needs docs Documentation for this change is required needs tests Unit tests are required for this change labels Aug 26, 2020
Base automatically changed from kc/toml_stdlib to master August 26, 2020 19:09
@KristofferC
Copy link
Member Author

This is ready to go but it breaks Revise and I don't want to break people's workflow so I'll wait a bit with merging. It should be a fairly small change to https://github.com/timholy/Revise.jl/blob/8336b10a5e7d0e4b1bb0726cf426469c8f866d2e/src/pkgs.jl#L459-L488 @timholy. I didn't really know what the function does so I had a hard time updating it myself.

@KristofferC KristofferC changed the title WIP: Add a TOML parser to Base and use it in code loading Use the TOML stdlib in code loading Aug 28, 2020
@KristofferC
Copy link
Member Author

Revise has been updated to deal with this PR now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
packages Package management and loading
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code loading might be better just fully parsing TOML files
5 participants