Extensible and fast date parsing #20952

omus · 2017-03-08T21:13:05Z

Since the introduction of fast date-time parsing (#19545), external packages can no longer extend the format specifiers used within DateFormat. This PR re-introduces full extensibility without impacting the previous performance gains. I've also revised the code so that extended specifiers are fast as well.

Document ~~parse(::Type{Vector}, ...)~~ parse_components(...)
Document tryparse_core
Document tryparse_internal
~~Add additional benchmarks~~ Benchmarks will be added to TimeZones.jl

Fixes #20876

KristofferC · 2017-03-08T21:16:56Z

@nanosoldier runbenchmarks("dates", vs = ":master")

quinnj · 2017-03-08T21:20:51Z

base/dates/io.jl

-
-slot_defaults(::Type{Date}) = map(Int64, (1, 1, 1))
-slot_defaults(::Type{DateTime}) = map(Int64, (1, 1, 1, 0, 0, 0, 0))
+const FORMAT_DEFAULTS = Dict{Type, Any}(


The Any here allows TimeZones.jl to add and return a default timezone I presume?

Any is used for extensibility. Specifically for TimeZones.jl I'm using an empty string as the default. If/when this is converted into a ZonedDateTime an exception is raised telling the user that the timezone given is invalid.
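The registry idea discussed here can be sketched roughly as follows. This is a minimal illustration in Julia 1.x syntax, not the code from base/dates/io.jl: the `Year`, `Month`, `Day` and `TimeZoneCode` types and the `default_for` helper are hypothetical stand-ins, and only the `Dict{Type, Any}` shape and the empty-string default come from the discussion above.

```julia
# Hypothetical stand-ins for the real period/specifier types.
struct Year end
struct Month end
struct Day end
struct TimeZoneCode end   # imagined type owned by an external package

# Values are typed `Any` so that extensions are not forced to use integers.
const FORMAT_DEFAULTS = Dict{Type, Any}(
    Year => Int64(1),
    Month => Int64(1),
    Day => Int64(1),
)

# An external package registers its own default (an empty string here,
# mirroring what the comment above describes for TimeZones.jl).
FORMAT_DEFAULTS[TimeZoneCode] = ""

# Hypothetical lookup helper, for illustration only.
default_for(T::Type) = FORMAT_DEFAULTS[T]
```

The trade-off is that lookups return `Any`, so any code consuming these defaults has to recover concrete types some other way.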

omus · 2017-03-08T21:20:52Z

Thanks @KristofferC. I was just looking up how to run the benchmarks.

quinnj · 2017-03-08T21:24:16Z

base/dates/parse.jl

+    return tokens, directive_index, directive_letters
+end
+
+genvar(t::DataType) = Symbol(lowercase(string(t.name.name)))


Hopefully this is ok or the right way to do this?

I'm sure this is going to generate lots of discussion. Basically I need to map the specifier types to variable names. Using a separate variable name per type lets the generated code avoid type instability problems (which occur once the specifiers are extended). Additionally, I can't get away with using gensym, as some of the code requires me to match up the variable names. See tryparse_internal.

is there a reflection function for this?

I can use Base.datatype_name to replace t.name.name. There doesn't seem to be anything that will make a lowercase version of a Symbol
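A rough sketch of the type-to-variable-name mapping under discussion, written for Julia 1.x. It assumes `nameof` as the modern replacement for the `Base.datatype_name` call mentioned above (which belongs to the 0.6 era), and the tuple-building step is only an illustration of why stable, predictable names are useful in generated code.

```julia
using Dates

# Map a type to a lowercase variable name, e.g. Dates.Year -> :year.
# `nameof` is assumed here in place of the 0.6-era `Base.datatype_name`.
genvar(t::DataType) = Symbol(lowercase(string(nameof(t))))

# Because the names are deterministic (unlike `gensym`), separate pieces of
# generated code can refer to the same variables, e.g. when building a tuple.
names = map(genvar, (Dates.Year, Dates.Month, Dates.Day))
ex = Expr(:tuple, names...)
```

With `gensym`, two independently generated code fragments would have no way to agree on the variable that holds, say, the parsed year.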

quinnj · 2017-03-08T21:25:04Z

base/dates/parse.jl


        @label done
-        parts = Base.@ntuple $N val
-        return R(reorder_args(parts, $field_order, $field_defaults, err_idx)::$tuple_type)
+        return Nullable{$R}($(Expr(:tuple, directive_names...))), directive_idx

        @label error
        # Note: Keeping exception generation in separate function helps with performance


so is this comment no longer applicable?

I tested this again and it no longer seems to be an issue. I originally found this problem.

quinnj

This will be great to have.

omus · 2017-03-08T21:31:27Z

The failure on AppVeyor was caused by my wrapping a long line incorrectly.

nanosoldier · 2017-03-08T21:38:28Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`make -j3`, ProcessExited(2)) [2]

Logs and partial data can be found here
cc @jrevels

omus · 2017-03-08T21:41:27Z

@quinnj we've been rather inconsistent about what we call the special characters (e.g. yyyy) that indicate a token we need to parse. We've used terms like slot, code, or format specifier in the past.

We should probably stick to a consistent term in both the code and in the manual. What do you prefer:

vtjnash · 2017-03-08T21:42:09Z

base/dates/io.jl

+const FORMAT_TRANSLATIONS = Dict{Type{<:TimeType}, Tuple}(
+    Date => (Year, Month, Day),
+    DateTime => (Year, Month, Day, Hour, Minute, Second, Millisecond),
+)


Should we make all these ImmutableDict? They're small enough that may be a performance and memory improvement (however inconsequential overall), and would help enforce that the data feeding into a generated function can't be modified (esp. if that code wants to eventually be precompiled)

@vtjnash switching these to ImmutableDicts would remove the ability to extend the code.

Originally these were defined through methods (which I prefer):

slot_order(::Type{DateTime}) = (Year, Month, Day, Hour, Minute, Second, Millisecond)

Unfortunately this cannot be used here since generated functions cannot call methods defined after the generated function definition.

That is correct. A generated function cannot be relied upon to observe any external mutation.

quinnj · 2017-03-08T21:51:16Z

@omus, yeah, i'm fine with whatever. Format specifier seems fine to me.

KristofferC · 2017-03-08T21:57:20Z

@nanosoldier runbenchmarks("dates", vs = ":master")

tkelman · 2017-03-08T22:26:24Z

base/dates/parse.jl

+    end
+end
+
+@generated function Base.parse(::Type{Vector}, str::AbstractString, df::DateFormat)


this is a bit of an odd method, weren't you going to deprecate it?

I added a deprecation for parse(::AbstractString, ::DateFormat).

I agree that the method signature is a bit odd. Maybe a different function name like parse_tokens(::AbstractString, ::DateFormat) would be better?

if it doesn't have to be an externally visible method of parse, then yeah a local name would raise fewer eyebrows

(does the deprecation need to be eval'ed?)

The original parse function was only defined in Base.Dates.parse and wasn't exported. I believe I need the eval to define a function in another module.

We should consider having "Parser types" (kind of like the token types in this module) and using objects of those types as the first argument to parse, rather than always having it be the Type of the return value. That is not sustainable, as you can see.

But of course that's not in the scope of this changeset.

nanosoldier · 2017-03-08T22:49:16Z

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @jrevels

shashi · 2017-03-09T04:03:42Z

base/dates/parse.jl


-    # `slot_order`, `slot_defaults`, and `slot_types` return tuples of the same length
-    assert(num_types == length(field_order) == length(field_defaults))
+@generated function tryparse_core(str::AbstractString, df::DateFormat, raise::Bool=false)


It would be great if this function could take a start index and return an end position... I've had to hack around this to get that behavior. It's very useful when you're parsing dates followed by more content, or a list of dates.

I think this can be supported.
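To illustrate the request, here is a minimal sketch of a position-based parsing interface in Julia 1.x. The names `tryparsenext_digits` and `parse_year_list` are hypothetical and are not the functions in base/dates/parse.jl; the sketch only shows how accepting a start index and returning the next index lets a caller parse several values out of one string.

```julia
# Hypothetical helper: parse a run of ASCII digits starting at index `pos`.
# Returns `(value, next_pos)` on success, or `nothing` if no digit is found.
function tryparsenext_digits(str::AbstractString, pos::Int, len::Int=lastindex(str))
    value = 0
    i = pos
    while i <= len
        c = str[i]
        isdigit(c) || break
        value = 10 * value + (c - '0')
        i = nextind(str, i)
    end
    i == pos && return nothing
    return value, i
end

# Parse a comma-separated list by feeding each returned position back in.
function parse_year_list(str::AbstractString)
    years = Int[]
    pos = firstindex(str)
    len = lastindex(str)
    while pos <= len
        r = tryparsenext_digits(str, pos, len)
        r === nothing && break
        value, pos = r
        push!(years, value)
        # Skip a single separator if one follows.
        pos <= len && str[pos] == ',' && (pos = nextind(str, pos))
    end
    return years
end
```

Returning the end position means the caller never has to re-scan or slice the string to find where the parsed value stopped.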

shashi · 2017-03-09T04:09:46Z

base/dates/parse.jl

-    end
-end
+@generated function tryparse_internal{T<:TimeType}(
+    ::Type{T}, str::AbstractString, df::DateFormat, raise::Bool=false,


And/or this method should also take a position and return an end position...

shashi · 2017-03-09T04:10:36Z

base/dates/parse.jl

-            val[idx[i]]
-        end
-    end
-end


Cool that you got rid of this! :)

shashi

Apart from the minor (but useful) changes to arguments / return values in tryparse_internal, this is good!

tkelman · 2017-03-11T01:04:05Z

base/dates/parse.jl

+end of the string (`len`).
+
+Returns a 3-element tuple `(values, pos, num_parsed)`:
+* `values::Nullable{Tuple}`: A tuple which contains a values for each `DatePart` within the


a value for each

tkelman · 2017-03-11T01:05:24Z

base/dates/parse.jl

+
+Returns a 3-element tuple `(values, pos, num_parsed)`:
+* `values::Nullable{Tuple}`: A tuple which contains a values for each `DatePart` within the
+  `DateFormat` in the order in which the occur. If the string ends before we finish parsing


order in which they occur

tkelman · 2017-03-11T01:06:06Z

base/dates/parse.jl

+  Useful for distinguishing parsed values from default values.
+"""
+@generated function tryparsenext_core(
+    str::AbstractString, pos::Int, len::Int, df::DateFormat, raise::Bool=false,


raise not documented?

That was an oversight. I made it a keyword at one point and meant to mention it in the docstring details.

tkelman · 2017-03-11T01:08:27Z

base/dates/parse.jl

-@generated function tryparse_internal{T<:TimeType}(
-    ::Type{T}, str::AbstractString, df::DateFormat, raise::Bool=false,
+Returns a 2-element tuple `(values, pos)`:
+* `values::Nullable{Tuple}`: A tuple which contains a values for each token as specified by


a value for each

shouldn't we be doing these edits on the PR branch?

tkelman · 2017-03-13T14:59:48Z

base/dates/parse.jl


 Parses the string according to the directives within the DateFormat. Parsing will start at
-character index `pos` and will stop when all directives are used or we have parsed up to the
-end of the string (`len`).
+character index `pos` and will stop when all directives are used or we have parsed up to,


there shouldn't be a comma at the end of the line here

tkelman · 2017-03-13T15:00:19Z

base/dates/parse.jl

@@ -110,14 +111,14 @@ Returns a 3-element tuple `(values, pos, num_parsed)`:
 end

 """
-    tryparsenext_internal(::Type{<:TimeType}, str::AbstractString, pos::Int, len::Int, df::DateFormat)
+    tryparsenext_internal(::Type{<:TimeType}, str, pos, len, df::DateFormat, raise=false)


should raise also be described here?

omus · 2017-03-13T15:32:49Z

@nanosoldier runbenchmarks("dates", vs = ":master")

StefanKarpinski · 2017-03-13T15:43:52Z

base/dates/parse.jl

@@ -115,7 +115,8 @@ end

 Parses the string according to the directives within the DateFormat. The specified TimeType
 type determines the type of and order of tokens returned. If the given DateFormat or string
-does not provide a required token a default value will be used.
+does not provide a required token a default value will be used. If the provided string
+cannot be parsed an exception will be thrown only if `raise` is true.


I guess the natural question is what happens otherwise?

I'll update the docstring to explain what happens in either case.
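The two outcomes being asked about can be sketched as follows. This is a hypothetical Julia 1.x helper, not the PR's implementation: the functions under review return a `Nullable`, whereas the sketch assumes the later idiom of returning `nothing` for the non-throwing failure case.

```julia
# Hypothetical helper showing the `raise` convention: on failure, throw only
# if `raise` is true; otherwise signal failure through the return value.
function tryparse_month(str::AbstractString, raise::Bool=false)
    value = tryparse(Int, str)
    if value === nothing || !(1 <= value <= 12)
        raise && throw(ArgumentError("Unable to parse month from \"$str\""))
        return nothing
    end
    return value
end
```

Under this convention a `tryparse`-style caller passes `raise=false` and checks the result, while a `parse`-style caller passes `raise=true` and lets the exception propagate.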

nanosoldier · 2017-03-13T16:24:41Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

omus · 2017-03-13T17:12:14Z

Nanosoldier failures appear to be noise

omus · 2017-03-15T16:52:53Z

@nanosoldier runbenchmarks("dates", vs = ":master")

nanosoldier · 2017-03-15T17:55:02Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @jrevels

jrevels · 2017-03-15T18:38:28Z

Nanosoldier breakage is due to #21028

tkelman · 2017-03-16T14:55:35Z

base/dates/parse.jl


        @label done
-        parts = Base.@ntuple $N val
-        return R(reorder_args(parts, $field_order, $field_defaults, err_idx)::$tuple_type)
+        return Nullable{$R}($(Expr(:tuple, value_names...))), pos, num_parsed

        @label error
        # Note: Keeping exception generation in separate function helps with performance


is this comment no longer accurate?

Correct. I'll make sure to remove that.

StefanKarpinski · 2017-03-16T16:56:12Z

This looks good to go as soon as you're done tweaking things, @omus. Merge when ready.

omus · 2017-03-16T18:00:29Z

Code changes are done. @tkelman any other comments?

* Refactor date parsing to be fast and extensible
* Deprecate parse(::AbstractString, ::DateFormat)
* fixup
* Switch to datatype_name
* Move towards consistent terminology
* Rename parse(::Vector, ...) to parse_components
* Internal parse funcs now take and return position
* Documentation for internal functions
* Corrections to documentation
* Corrections from review
* More details about raise
* Remove outdated comment

Sacha0 · 2017-05-20T21:29:53Z

@omus, might this work warrant a NEWS.md entry? Best! (Ref. #21475)

omus added 2 commits March 8, 2017 14:38

Refactor date parsing to be fast and extensible

49143c5

Deprecate parse(::AbstractString, ::DateFormat)

eac850f

omus added the "dates" label (Dates, times, and the Dates stdlib module) Mar 8, 2017

omus requested a review from Sacha0 March 8, 2017 21:13

omus added the "performance" label (Must go faster) Mar 8, 2017

KristofferC requested a review from shashi March 8, 2017 21:17

quinnj reviewed Mar 8, 2017

quinnj approved these changes Mar 8, 2017

omus removed the request for review from Sacha0 March 8, 2017 21:33

vtjnash reviewed Mar 8, 2017

fixup

c5bd3c3

omus mentioned this pull request Mar 8, 2017

io.slotparse, io.slotformat fail on v0.6-dev JuliaTime/TimeZones.jl#44

Closed

tkelman reviewed Mar 8, 2017

shashi reviewed Mar 9, 2017

shashi approved these changes Mar 9, 2017

StefanKarpinski added this to the 0.6.0 milestone Mar 9, 2017

omus added 2 commits March 10, 2017 13:06

Switch to datatype_name

ad19d23

Move towards consistent terminology

d833ec4

tkelman reviewed Mar 11, 2017

Corrections to documentation

2b787f2

tkelman reviewed Mar 13, 2017

Corrections from review

b0f6c75

StefanKarpinski reviewed Mar 13, 2017

More details about raise

17d8ecc

tkelman reviewed Mar 16, 2017

omus added a commit to JuliaTime/TimeZones.jl that referenced this pull request Mar 16, 2017

Re-add support for parsing/formatting in Julia 0.6

19eb805

Changes work in conjunction with: JuliaLang/julia#20952

Remove outdated comment

510dd02

This was referenced Mar 16, 2017

Deprecate parse(::AbstractString, ::DateFormat) #20880

Closed

Compatibility with Julia 0.6 JuliaTime/TimeZones.jl#50

Merged

omus merged commit 4eb8c06 into master Mar 16, 2017

tkelman deleted the cv/extensible-fast-parse branch March 17, 2017 00:02

omus added a commit to JuliaTime/TimeZones.jl that referenced this pull request Mar 17, 2017

Re-add support for parsing/formatting in Julia 0.6

8cdb4a6

Changes work in conjunction with: JuliaLang/julia#20952

Sacha0 added the "deprecation" label (This change introduces or involves a deprecation) May 20, 2017

Sacha0 mentioned this pull request May 20, 2017

Make sure NEWS.md lists all 0.6 deprecations and breaking changes #21475

Closed
