define getindex on regex matches to return captures. #11566

malmaud · 2015-06-03T21:20:05Z

Two non-breaking changes to regular expressions related to named capture groups:

- Showing a match object on the REPL will display the name of a named capture group instead of its index.
- getindex on a match object returns the captured substring. The index can be either an integer, in which case it indicates the position of the capture group in the original regex, or a string, in which case it corresponds to the name of a capture group.

This introduces no additional overhead to match, since the mapping between capture group names and indices is only computed once, when the regular expression is first compiled. This should address @StefanKarpinski's concern about constructing a Dict for every RegexMatch object.

StefanKarpinski · 2015-06-03T21:22:04Z

Can't help wondering if it would be better/nicer to use Symbols for this.

malmaud · 2015-06-03T22:17:15Z

Ya, good point. Luckily the set of legal PCRE capture group names is a subset of the legal Julia symbol literals.

malmaud · 2015-06-04T13:13:25Z

Alright, I converted it to use symbols. This should be g2g, if people like the idea of getindex returning captures.
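For readers following along, here is a minimal sketch of what the proposed indexing looks like after the switch to symbols. It assumes a Julia version that includes this change; the regex, the input string, and the variable names are made up for illustration and are not taken from the PR's tests.

```julia
# Illustrative only: assumes a Julia build where RegexMatch supports getindex
# with an integer position or a Symbol group name, as proposed in this PR.
m = match(r"(?<hour>\d+):(?<minute>\d+)", "12:45")

by_position = m[1]        # first capture group, selected by integer position
by_name     = m[:minute]  # capture group named "minute", selected by Symbol

# `m.captures` still holds every capture in positional order.
println(by_position, " ", by_name, " ", m.captures)

# Showing the match should label named groups by name rather than by index.
println(repr(m))
```

Integer indexing stays the cheapest path, so the named form is mainly a readability convenience.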

vtjnash · 2015-06-04T16:25:37Z

base/regex.jl

@@ -15,6 +15,8 @@ type Regex
    extra::Ptr{Void}
    ovec::Vector{Csize_t}
    match_data::Ptr{Void}
+    capture_name_to_idx::Dict{Symbol, Int}
+    idx_to_capture_name::Dict{Int, Symbol}


i think it would be better to do the computation on-demand rather than caching it. a linear scan over the nametable will probably be the same time-performance as this dict lookup, but considerably less memory.

+1. This is not going to be something you want to use for truly high-performance code anyway – for that you'll want to do the indexed lookup.

Just to clarify, are you proposing eliminating both the index->name and the name->index dict from the Regex type, or just one of them?

I guess I didn't think memory was a significant concern here, since a user is probably not creating a lot of regex objects, and even if they did, the nametable dictionary would be a small increase in the total memory taken up by the regex object (which includes the original regex and the JITed regex program). For the few dozen bytes it takes to store the nametable in the regex object, you get much faster capture group extraction from match objects versus re-extracting the nametable from PCRE's internal representation.

I don't feel strongly though so I'll implement whatever you guys think is best.

both, and mostly because doing the linear scan against the internal representation should be at least as fast as this Dict lookup for all reasonable regexes. it's fine to ccall strncmp for this, so you don't need to constantly re-extract the table into a julia string.

OK, makes sense.

looking at the pcre2 api today, it looks like there is a function for doing exactly this: pcre2_substring_number_from_name_8

Yes, if we're not going the route of parsing the name table at regex compile time into a native Julia representation, then we might as well use the PCRE convenience methods.

Actually, the performance penalty is probably negligible/non-existent for using the convenience methods even if we did extract the whole name table at compile time.

Alright, I'm no longer caching the capture names in the regex table. The only time memory is allocated now is when show is called on a Match object.
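To make the "linear scan over the nametable" idea from this thread concrete, here is a rough pure-Julia sketch. It does not call PCRE and is not the code that was merged. It hand-builds a byte table in the layout the PCRE2 documentation describes for the 8-bit library (fixed-size entries, a 2-byte group number with the most significant byte first, then the NUL-terminated name) and scans it entry by entry. `build_nametable` and `group_number` are hypothetical helper names invented for this illustration.

```julia
# Hypothetical helper: build a byte table laid out like PCRE2's 8-bit name table.
# Each entry is `entry_size` bytes: a 2-byte big-endian group number followed by
# the NUL-terminated group name, zero-padded to the entry size.
function build_nametable(groups::Vector{Pair{String,Int}})
    entry_size = maximum(ncodeunits(first(g)) for g in groups) + 3
    table = zeros(UInt8, entry_size * length(groups))
    for (i, (name, idx)) in enumerate(groups)
        offset = (i - 1) * entry_size
        table[offset + 1] = UInt8(idx >> 8)     # high byte of the group number
        table[offset + 2] = UInt8(idx & 0xff)   # low byte of the group number
        copyto!(table, offset + 3, codeunits(name), 1, ncodeunits(name))
    end
    return table, entry_size
end

# Hypothetical helper: linear scan, comparing raw bytes so that no Julia
# string is allocated per entry. Returns `nothing` if the name is absent.
function group_number(table::Vector{UInt8}, entry_size::Int, name::AbstractString)
    wanted = codeunits(name)
    for offset in 0:entry_size:(length(table) - entry_size)
        # The name occupies bytes offset+3 up to (not including) the first NUL.
        stop = offset + 3
        while stop <= offset + entry_size && table[stop] != 0x00
            stop += 1
        end
        if view(table, (offset + 3):(stop - 1)) == wanted
            return (Int(table[offset + 1]) << 8) | Int(table[offset + 2])
        end
    end
    return nothing
end

table, entry_size = build_nametable(["hour" => 1, "minute" => 2])
println(group_number(table, entry_size, "minute"))
```

In the real implementation the table would come from PCRE itself, or the lookup would be delegated to pcre2_substring_number_from_name_8 as suggested above. The sketch only shows why a scan over a handful of short entries needs no cached Dict.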

vtjnash · 2015-06-24T15:08:40Z

this lgtm to merge once CI passes

malmaud · 2015-06-25T13:45:47Z

@vtjnash CI has passed

malmaud · 2015-06-27T17:18:08Z

@vtjnash Should be good to merge now

ivarne · 2015-06-27T23:00:49Z

Updating the Regex docs would be good.

malmaud · 2015-07-02T19:35:20Z

@ivarne Alright, I added a little section to the manual.

define getindex on regex matches to return captures.

14a02da

Store capture names as symbols instead of strings

0479e7a

vtjnash reviewed Jun 4, 2015
Don't cache capture names in regex object

1b0127c

malmaud changed the title ~~RFC: define getindex on regex matches to return captures.~~ define getindex on regex matches to return captures. Jun 27, 2015

Added manual section on accessing groups

1b8d47a

ivarne added a commit that referenced this pull request Jul 2, 2015

Merge pull request #11566 from malmaud/regex_named_subpattern

95bf20d

define getindex on regex matches to return captures.

ivarne merged commit 95bf20d into JuliaLang:master Jul 2, 2015

malmaud mentioned this pull request Jul 8, 2015

Named subpatterns in regex #11362

Closed


vtjnash Jun 4, 2015

StefanKarpinski Jun 4, 2015

malmaud Jun 8, 2015

vtjnash Jun 8, 2015

malmaud Jun 8, 2015

vtjnash Jun 9, 2015

malmaud Jun 9, 2015

malmaud Jun 9, 2015

malmaud Jun 24, 2015
