Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

define getindex on regex matches to return captures. #11566

Merged
merged 4 commits into from
Jul 2, 2015

Conversation

malmaud
Copy link
Contributor

@malmaud malmaud commented Jun 3, 2015

Two non-breaking changes to regular expressions related to named capture groups:

  • Showing a match object on the REPL will display the name of a named capture group instead of its index
  • getindex on a match object returns the captured substring. The index can either be an integer, in which case it indicates the position of the capture group in the original regex, or a string, in which case it corresponds to the name of a capture group.

This introduces no additional overhead to match, since the mapping between capture group names and indices is only computed once, when the regular expression is first compiled. This should address @StefanKarpinski's concern about constructing a Dict for every RegexMatch object.

Review on Reviewable

@StefanKarpinski
Copy link
Member

Can't help wondering if it would be better/nicer to use Symbols for this.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 3, 2015

Ya, good point. Luckily the legal PCRE capture group names is a subset of legal Julia symbol literals.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 4, 2015

Alright, I converted it to use symbols. This should be g2g, if people like the idea of getindex returning captures.

@@ -15,6 +15,8 @@ type Regex
extra::Ptr{Void}
ovec::Vector{Csize_t}
match_data::Ptr{Void}
capture_name_to_idx::Dict{Symbol, Int}
idx_to_capture_name::Dict{Int, Symbol}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would be better to do the computation on-demand rather than caching it. a linear scan over the nametable will probably be the same time-performance as this dict lookup, but considerably less memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. This is not going to be something you want to use for truly high-performance code anyway – for that you'll want to do the indexed lookup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, are you proposing eliminating both the index->name and the name->index dict from the Regex type, or just one of them?

I guess I didn't think memory was a significant concern here since a user is probably not creating a lot of regex objects and even if they did, the nametable dictionary be a small increase in the total memory taken up by the regex object (which includes the original regex and the JITed regex program). For the few dozen bytes it takes to store the nametable in the regex object, you get much faster capture group extraction from match objects verse re-extracting the nametable from PCRE's internal representation.

I don't feel strongly though so I'll implement whatever you guys think is best.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both, and mostly because doing the linear scan against the internal representation should be at least as fast as this Dict lookup for all reasonable regexes. it's fine to ccall strncmp for this, so you don't need to constantly re-extract the table into a julia string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at the pcre2 api today, it looks like there is a function for doing exactly this: pcre2_substring_number_from_name_8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we're not going the route of parsing the name table at regex compile time into a native Julia representation, then we might as well use the PCRE convenience methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the performance penalty is probably negligible/non-existent for using the convenience methods even if we did extract the whole name table at compile time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I'm no longer caching the capture names in the regex table. The only time memory is allocated now is when show is called on a Match object.

@vtjnash
Copy link
Member

vtjnash commented Jun 24, 2015

this lgtm to merge once CI passes

@malmaud
Copy link
Contributor Author

malmaud commented Jun 25, 2015

@vtjnash CI has passed

@malmaud
Copy link
Contributor Author

malmaud commented Jun 27, 2015

@vtjnash Should be good to merge now

@malmaud malmaud changed the title RFC: define getindex on regex matches to return captures. define getindex on regex matches to return captures. Jun 27, 2015
@ivarne
Copy link
Member

ivarne commented Jun 27, 2015

Updating the Regex docs would be good.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 2, 2015

@ivarne Alright, I added a little section to the manual.

ivarne added a commit that referenced this pull request Jul 2, 2015
define getindex on regex matches to return captures.
@ivarne ivarne merged commit 95bf20d into JuliaLang:master Jul 2, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants