-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extensible and fast date parsing #20952
Conversation
@nanosoldier |
base/dates/io.jl
Outdated
|
||
slot_defaults(::Type{Date}) = map(Int64, (1, 1, 1)) | ||
slot_defaults(::Type{DateTime}) = map(Int64, (1, 1, 1, 0, 0, 0, 0)) | ||
const FORMAT_DEFAULTS = Dict{Type, Any}( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Any
here allows TimeZones.jl to add and return a default timezone I presume?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any
is used for extensibility. Specifically for TimeZones.jl I'm using an empty string as the default. If/when this is converted into a ZonedDateTime an exception is raised telling the user that the timezone given is invalid.
Thanks @KristofferC. I was just looking up that how to run the benchmarks |
base/dates/parse.jl
Outdated
return tokens, directive_index, directive_letters | ||
end | ||
|
||
genvar(t::DataType) = Symbol(lowercase(string(t.name.name))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully this is ok or the right way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure this is going to generate lots of discussion. Basically I need to map the specifier types to a variable name. Using separate variable names allows the generated code to avoid type instability problems (occurs once extended). Additionally, I can't get away with using gensym
as some of the code requires me to match up the variable names. See tryparse_internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reflection function for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can use Base.datatype_name
to replace t.name.name
. There doesn't seem to be anything that will make a lowercase version of a Symbol
base/dates/parse.jl
Outdated
|
||
@label done | ||
parts = Base.@ntuple $N val | ||
return R(reorder_args(parts, $field_order, $field_defaults, err_idx)::$tuple_type) | ||
return Nullable{$R}($(Expr(:tuple, directive_names...))), directive_idx | ||
|
||
@label error | ||
# Note: Keeping exception generation in separate function helps with performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so is this comment no longer applicable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this again and it no longer seems to be an issue. I originally found this problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be great to have.
Failure on Appveyor was caused from me wrapping a long line incorrectly. |
@quinnj we've been rather inconsistent on what we call the special characters ( We should probably stick to a consistent term in both the code and in the manual. What do you prefer:
|
base/dates/io.jl
Outdated
const FORMAT_TRANSLATIONS = Dict{Type{<:TimeType}, Tuple}( | ||
Date => (Year, Month, Day), | ||
DateTime => (Year, Month, Day, Hour, Minute, Second, Millisecond), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make all these ImmutableDict? They're small enough that may be a performance and memory improvement (however inconsequential overall), and would help enforce that the data feeding into a generated function can't be modified (esp. if that code wants to eventually be precompiled)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vtjnash switching these to ImmutableDict
s would remove ability to extend code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally these were defined through methods (which I prefer):
slot_order(::Type{DateTime}) = (Year, Month, Day, Hour, Minute, Second, Millisecond)
Unfortunately this cannot be used here since generated functions cannot call methods defined after the generated function definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. A generated function cannot be relied upon to observe any external mutation.
@omus, yeah, i'm fine with whatever. Format specifier seems fine to me. |
@nanosoldier |
base/dates/parse.jl
Outdated
end | ||
end | ||
|
||
@generated function Base.parse(::Type{Vector}, str::AbstractString, df::DateFormat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit of an odd method, weren't you going to deprecate it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a deprecation for parse(::AbstractString, ::DateFormat)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the method signature is a bit odd. Maybe a different function name like parse_tokens(::AbstractString, ::DateFormat)
would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it doesn't have to be an externally visible method of parse, then yeah a local name would raise fewer eyebrows
(does the deprecation need to be eval'ed?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original parse
function was only defined in Base.Dates.parse
and wasn't exported. I believe I need the eval to define a function in another module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider having "Parser types" (kinda like the token types in this module) and using their objects of those types as the first argument to parse rather than have it always be Type
of the return value. That is not sustainable as you can see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But of course that's not in the scope of this changeset.
Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @jrevels |
base/dates/parse.jl
Outdated
|
||
# `slot_order`, `slot_defaults`, and `slot_types` return tuples of the same length | ||
assert(num_types == length(field_order) == length(field_defaults)) | ||
@generated function tryparse_core(str::AbstractString, df::DateFormat, raise::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be great if this function can take a start index and return an end position... I've had to hack around this to get this. It's very useful when you're parsing dates and some more stuff after / a list of dates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be supported.
base/dates/parse.jl
Outdated
end | ||
end | ||
@generated function tryparse_internal{T<:TimeType}( | ||
::Type{T}, str::AbstractString, df::DateFormat, raise::Bool=false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And/ or this method should also take a position and return an end position....
base/dates/parse.jl
Outdated
val[idx[i]] | ||
end | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool that you got rid of this! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the minor (but useful) changes to arguments / return values in tryparse_internal, this is good!
base/dates/parse.jl
Outdated
end of the string (`len`). | ||
|
||
Returns a 3-element tuple `(values, pos, num_parsed)`: | ||
* `values::Nullable{Tuple}`: A tuple which contains a values for each `DatePart` within the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a value for each
base/dates/parse.jl
Outdated
|
||
Returns a 3-element tuple `(values, pos, num_parsed)`: | ||
* `values::Nullable{Tuple}`: A tuple which contains a values for each `DatePart` within the | ||
`DateFormat` in the order in which the occur. If the string ends before we finish parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order in which they occur
Useful for distinguishing parsed values from default values. | ||
""" | ||
@generated function tryparsenext_core( | ||
str::AbstractString, pos::Int, len::Int, df::DateFormat, raise::Bool=false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise not documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was an oversight. I made it a keyword at one point and meant to mention it in the docstring details.
base/dates/parse.jl
Outdated
@generated function tryparse_internal{T<:TimeType}( | ||
::Type{T}, str::AbstractString, df::DateFormat, raise::Bool=false, | ||
Returns a 2-element tuple `(values, pos)`: | ||
* `values::Nullable{Tuple}`: A tuple which contains a values for each token as specified by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a value for each
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we be doing these edits on the PR branch?
base/dates/parse.jl
Outdated
|
||
Parses the string according to the directives within the DateFormat. Parsing will start at | ||
character index `pos` and will stop when all directives are used or we have parsed up to the | ||
end of the string (`len`). | ||
character index `pos` and will stop when all directives are used or we have parsed up to, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there shouldn't be a comma at the end of the line here
base/dates/parse.jl
Outdated
@@ -110,14 +111,14 @@ Returns a 3-element tuple `(values, pos, num_parsed)`: | |||
end | |||
|
|||
""" | |||
tryparsenext_internal(::Type{<:TimeType}, str::AbstractString, pos::Int, len::Int, df::DateFormat) | |||
tryparsenext_internal(::Type{<:TimeType}, str, pos, len, df::DateFormat, raise=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should raise also be described here?
@nanosoldier |
base/dates/parse.jl
Outdated
@@ -115,7 +115,8 @@ end | |||
|
|||
Parses the string according to the directives within the DateFormat. The specified TimeType | |||
type determines the type of and order of tokens returned. If the given DateFormat or string | |||
does not provide a required token a default value will be used. | |||
does not provide a required token a default value will be used. If the provided string | |||
cannot be parsed an exception will be thrown only if `raise` is true. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the natural question is what happens otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update the docstring to explain what happens in either case.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels |
Nanosoldier failures appear to be noise |
@nanosoldier |
Nanosoldier breakage is due to #21028 |
base/dates/parse.jl
Outdated
|
||
@label done | ||
parts = Base.@ntuple $N val | ||
return R(reorder_args(parts, $field_order, $field_defaults, err_idx)::$tuple_type) | ||
return Nullable{$R}($(Expr(:tuple, value_names...))), pos, num_parsed | ||
|
||
@label error | ||
# Note: Keeping exception generation in separate function helps with performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment no longer accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. I'll make sure to remove that.
This looks good to go as soon as you're done tweaking things, @omus. Merge when ready. |
Changes work in conjunction with: JuliaLang/julia#20952
Code changes are done. @tkelman any other comments? |
Changes work in conjunction with: JuliaLang/julia#20952
* Refactor date parsing to be fast and extensible * Deprecate parse(::AbstractString, ::DateFormat) * fixup * Switch to datatype_name * Move towards consistent terminology * Rename parse(::Vector, ...) to parse_components * Internal parse funcs now take and return position * Documentation for internal functions * Corrections to documentation * Corrections from review * More details about raise * Remove outdated comment
The introduction of fast date time parsing (#19545) no longer allows external packages to extend the format specifiers used within
DateFormat
. This PR re-introduces full extensibility without impacting the previous performance gains. I've also revised the code to make extended specifiers also fast.parse(::Type{Vector}, ...)
parse_components(...)
tryparse_core
tryparse_internal
Add additional benchmarksBenchmarks will be added to TimeZones.jlFixes #20876