Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extensible and fast date parsing #20952

Merged
merged 12 commits into from
Mar 16, 2017
Merged

Extensible and fast date parsing #20952

merged 12 commits into from
Mar 16, 2017

Conversation

omus
Copy link
Member

@omus omus commented Mar 8, 2017

The introduction of fast date time parsing (#19545) no longer allows external packages to extend the format specifiers used within DateFormat. This PR re-introduces full extensibility without impacting the previous performance gains. I've also revised the code to make extended specifiers also fast.

  • Document parse(::Type{Vector}, ...) parse_components(...)
  • Document tryparse_core
  • Document tryparse_internal
  • Add additional benchmarks Benchmarks will be added to TimeZones.jl

Fixes #20876

@omus omus added the dates Dates, times, and the Dates stdlib module label Mar 8, 2017
@omus omus requested a review from Sacha0 March 8, 2017 21:13
@omus omus added the performance Must go faster label Mar 8, 2017
@KristofferC
Copy link
Member

@nanosoldier runbenchmarks("dates", vs = ":master")

@KristofferC KristofferC requested a review from shashi March 8, 2017 21:17
base/dates/io.jl Outdated

slot_defaults(::Type{Date}) = map(Int64, (1, 1, 1))
slot_defaults(::Type{DateTime}) = map(Int64, (1, 1, 1, 0, 0, 0, 0))
const FORMAT_DEFAULTS = Dict{Type, Any}(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Any here allows TimeZones.jl to add and return a default timezone I presume?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any is used for extensibility. Specifically for TimeZones.jl I'm using an empty string as the default. If/when this is converted into a ZonedDateTime an exception is raised telling the user that the timezone given is invalid.

@omus
Copy link
Member Author

omus commented Mar 8, 2017

Thanks @KristofferC. I was just looking up that how to run the benchmarks

return tokens, directive_index, directive_letters
end

genvar(t::DataType) = Symbol(lowercase(string(t.name.name)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this is ok or the right way to do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure this is going to generate lots of discussion. Basically I need to map the specifier types to a variable name. Using separate variable names allows the generated code to avoid type instability problems (occurs once extended). Additionally, I can't get away with using gensym as some of the code requires me to match up the variable names. See tryparse_internal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reflection function for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can use Base.datatype_name to replace t.name.name. There doesn't seem to be anything that will make a lowercase version of a Symbol


@label done
parts = Base.@ntuple $N val
return R(reorder_args(parts, $field_order, $field_defaults, err_idx)::$tuple_type)
return Nullable{$R}($(Expr(:tuple, directive_names...))), directive_idx

@label error
# Note: Keeping exception generation in separate function helps with performance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so is this comment no longer applicable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this again and it no longer seems to be an issue. I originally found this problem.

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be great to have.

@omus
Copy link
Member Author

omus commented Mar 8, 2017

Failure on Appveyor was caused from me wrapping a long line incorrectly.

@omus omus removed the request for review from Sacha0 March 8, 2017 21:33
@nanosoldier
Copy link
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`make -j3`, ProcessExited(2)) [2]

Logs and partial data can be found here
cc @jrevels

@omus
Copy link
Member Author

omus commented Mar 8, 2017

@quinnj we've been rather inconsistent on what we call the special characters (yyyy) which we use to indicate a token we need to parse. We've used terms like slot, code, or format specifier in the past.

We should probably stick to a consistent term in both the code and in the manual. What do you prefer:

base/dates/io.jl Outdated
const FORMAT_TRANSLATIONS = Dict{Type{<:TimeType}, Tuple}(
Date => (Year, Month, Day),
DateTime => (Year, Month, Day, Hour, Minute, Second, Millisecond),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make all these ImmutableDict? They're small enough that may be a performance and memory improvement (however inconsequential overall), and would help enforce that the data feeding into a generated function can't be modified (esp. if that code wants to eventually be precompiled)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vtjnash switching these to ImmutableDicts would remove ability to extend code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally these were defined through methods (which I prefer):

slot_order(::Type{DateTime}) = (Year, Month, Day, Hour, Minute, Second, Millisecond)

Unfortunately this cannot be used here since generated functions cannot call methods defined after the generated function definition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. A generated function cannot be relied upon to observe any external mutation.

@quinnj
Copy link
Member

quinnj commented Mar 8, 2017

@omus, yeah, i'm fine with whatever. Format specifier seems fine to me.

@KristofferC
Copy link
Member

@nanosoldier runbenchmarks("dates", vs = ":master")

end
end

@generated function Base.parse(::Type{Vector}, str::AbstractString, df::DateFormat)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit of an odd method, weren't you going to deprecate it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a deprecation for parse(::AbstractString, ::DateFormat).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the method signature is a bit odd. Maybe a different function name like parse_tokens(::AbstractString, ::DateFormat) would be better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it doesn't have to be an externally visible method of parse, then yeah a local name would raise fewer eyebrows

(does the deprecation need to be eval'ed?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original parse function was only defined in Base.Dates.parse and wasn't exported. I believe I need the eval to define a function in another module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider having "Parser types" (kinda like the token types in this module) and using their objects of those types as the first argument to parse rather than have it always be Type of the return value. That is not sustainable as you can see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But of course that's not in the scope of this changeset.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @jrevels


# `slot_order`, `slot_defaults`, and `slot_types` return tuples of the same length
assert(num_types == length(field_order) == length(field_defaults))
@generated function tryparse_core(str::AbstractString, df::DateFormat, raise::Bool=false)
Copy link
Contributor

@shashi shashi Mar 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be great if this function can take a start index and return an end position... I've had to hack around this to get this. It's very useful when you're parsing dates and some more stuff after / a list of dates

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be supported.

end
end
@generated function tryparse_internal{T<:TimeType}(
::Type{T}, str::AbstractString, df::DateFormat, raise::Bool=false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And/ or this method should also take a position and return an end position....

val[idx[i]]
end
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool that you got rid of this! :)

Copy link
Contributor

@shashi shashi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the minor (but useful) changes to arguments / return values in tryparse_internal, this is good!

@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Mar 9, 2017
end of the string (`len`).

Returns a 3-element tuple `(values, pos, num_parsed)`:
* `values::Nullable{Tuple}`: A tuple which contains a values for each `DatePart` within the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a value for each


Returns a 3-element tuple `(values, pos, num_parsed)`:
* `values::Nullable{Tuple}`: A tuple which contains a values for each `DatePart` within the
`DateFormat` in the order in which the occur. If the string ends before we finish parsing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order in which they occur

Useful for distinguishing parsed values from default values.
"""
@generated function tryparsenext_core(
str::AbstractString, pos::Int, len::Int, df::DateFormat, raise::Bool=false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise not documented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was an oversight. I made it a keyword at one point and meant to mention it in the docstring details.

@generated function tryparse_internal{T<:TimeType}(
::Type{T}, str::AbstractString, df::DateFormat, raise::Bool=false,
Returns a 2-element tuple `(values, pos)`:
* `values::Nullable{Tuple}`: A tuple which contains a values for each token as specified by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a value for each

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we be doing these edits on the PR branch?


Parses the string according to the directives within the DateFormat. Parsing will start at
character index `pos` and will stop when all directives are used or we have parsed up to the
end of the string (`len`).
character index `pos` and will stop when all directives are used or we have parsed up to,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there shouldn't be a comma at the end of the line here

@@ -110,14 +111,14 @@ Returns a 3-element tuple `(values, pos, num_parsed)`:
end

"""
tryparsenext_internal(::Type{<:TimeType}, str::AbstractString, pos::Int, len::Int, df::DateFormat)
tryparsenext_internal(::Type{<:TimeType}, str, pos, len, df::DateFormat, raise=false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should raise also be described here?

@omus
Copy link
Member Author

omus commented Mar 13, 2017

@nanosoldier runbenchmarks("dates", vs = ":master")

@@ -115,7 +115,8 @@ end

Parses the string according to the directives within the DateFormat. The specified TimeType
type determines the type of and order of tokens returned. If the given DateFormat or string
does not provide a required token a default value will be used.
does not provide a required token a default value will be used. If the provided string
cannot be parsed an exception will be thrown only if `raise` is true.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the natural question is what happens otherwise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the docstring to explain what happens in either case.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@omus
Copy link
Member Author

omus commented Mar 13, 2017

Nanosoldier failures appear to be noise

@omus
Copy link
Member Author

omus commented Mar 15, 2017

@nanosoldier runbenchmarks("dates", vs = ":master")

@nanosoldier
Copy link
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @jrevels

@jrevels
Copy link
Member

jrevels commented Mar 15, 2017

Nanosoldier breakage is due to #21028


@label done
parts = Base.@ntuple $N val
return R(reorder_args(parts, $field_order, $field_defaults, err_idx)::$tuple_type)
return Nullable{$R}($(Expr(:tuple, value_names...))), pos, num_parsed

@label error
# Note: Keeping exception generation in separate function helps with performance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment no longer accurate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I'll make sure to remove that.

@StefanKarpinski
Copy link
Member

This looks good to go as soon as you're done tweaking things, @omus. Merge when ready.

omus added a commit to JuliaTime/TimeZones.jl that referenced this pull request Mar 16, 2017
@omus
Copy link
Member Author

omus commented Mar 16, 2017

Code changes are done. @tkelman any other comments?

@omus omus merged commit 4eb8c06 into master Mar 16, 2017
@tkelman tkelman deleted the cv/extensible-fast-parse branch March 17, 2017 00:02
omus added a commit to JuliaTime/TimeZones.jl that referenced this pull request Mar 17, 2017
ajkeller34 pushed a commit to ajkeller34/julia that referenced this pull request Mar 19, 2017
* Refactor date parsing to be fast and extensible

* Deprecate parse(::AbstractString, ::DateFormat)

* fixup

* Switch to datatype_name

* Move towards consistent terminology

* Rename parse(::Vector, ...) to parse_components

* Internal parse funcs now take and return position

* Documentation for internal functions

* Corrections to documentation

* Corrections from review

* More details about raise

* Remove outdated comment
@Sacha0 Sacha0 added the deprecation This change introduces or involves a deprecation label May 20, 2017
@Sacha0
Copy link
Member

Sacha0 commented May 20, 2017

@omus, might this work warrant a NEWS.md entry? Best! (Ref. #21475)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dates Dates, times, and the Dates stdlib module deprecation This change introduces or involves a deprecation performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants