-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
define getindex on regex matches to return captures. #11566
Conversation
Can't help wondering if it would be better/nicer to use Symbols for this. |
Ya, good point. Luckily the legal PCRE capture group names is a subset of legal Julia symbol literals. |
Alright, I converted it to use symbols. This should be g2g, if people like the idea of |
@@ -15,6 +15,8 @@ type Regex | |||
extra::Ptr{Void} | |||
ovec::Vector{Csize_t} | |||
match_data::Ptr{Void} | |||
capture_name_to_idx::Dict{Symbol, Int} | |||
idx_to_capture_name::Dict{Int, Symbol} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it would be better to do the computation on-demand rather than caching it. a linear scan over the nametable will probably be the same time-performance as this dict lookup, but considerably less memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. This is not going to be something you want to use for truly high-performance code anyway – for that you'll want to do the indexed lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, are you proposing eliminating both the index->name and the name->index dict from the Regex type, or just one of them?
I guess I didn't think memory was a significant concern here since a user is probably not creating a lot of regex objects and even if they did, the nametable dictionary be a small increase in the total memory taken up by the regex object (which includes the original regex and the JITed regex program). For the few dozen bytes it takes to store the nametable in the regex object, you get much faster capture group extraction from match objects verse re-extracting the nametable from PCRE's internal representation.
I don't feel strongly though so I'll implement whatever you guys think is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both, and mostly because doing the linear scan against the internal representation should be at least as fast as this Dict lookup for all reasonable regexes. it's fine to ccall strncmp
for this, so you don't need to constantly re-extract the table into a julia string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at the pcre2 api today, it looks like there is a function for doing exactly this: pcre2_substring_number_from_name_8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we're not going the route of parsing the name table at regex compile time into a native Julia representation, then we might as well use the PCRE convenience methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the performance penalty is probably negligible/non-existent for using the convenience methods even if we did extract the whole name table at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I'm no longer caching the capture names in the regex table. The only time memory is allocated now is when show
is called on a Match
object.
this lgtm to merge once CI passes |
@vtjnash CI has passed |
@vtjnash Should be good to merge now |
Updating the Regex docs would be good. |
@ivarne Alright, I added a little section to the manual. |
define getindex on regex matches to return captures.
Two non-breaking changes to regular expressions related to named capture groups:
getindex
on a match object returns the captured substring. The index can either be an integer, in which case it indicates the position of the capture group in the original regex, or a string, in which case it corresponds to the name of a capture group.This introduces no additional overhead to
match
, since the mapping between capture group names and indices is only computed once, when the regular expression is first compiled. This should address @StefanKarpinski's concern about constructing aDict
for everyRegexMatch
object.