Skip to content

RFC: Add match! function for in-place matching #12353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

malmaud
Copy link
Contributor

@malmaud malmaud commented Jul 28, 2015

This is to speed up inner loops that keep calling match with the same regex. Here's a benchmark for a hypothetical use case of collecting all letters that occur 1 or 2 characters after 'a' in a document:

s=readall("CONTRIBUTING.md"); 

function in_place(s;N=1000)
    r=r"a(.)(.)"
    for i=1:N
        offset = 1
        while true
            m = match(r, s, offset)
            m==nothing && break
            offset = m.offset + length(m.match)
        end
    end
end

function with_match(s;N=1000)
    r=r"a(.)(.)"
    m = RegexMatch(r)
    for i=1:N
        offset = 1
        while true
            Base.match!(m, s, offset) || break
            offset = m.offset + length(m.match)
        end
    end
end

function with_matchall(s;N=1000)
    r=r"a(.)(.)"
    for i=1:N
        matches = matchall(r, s)  # Note this API doesn't let you get the capture groups, but forms a reasonable baseline
    end
end

@time with_match(s);
 515.801 milliseconds (6253 k allocations: 251 MB, 23.24% gc time)


@time inplace(s);
 263.869 milliseconds (3137 k allocations: 85766 KB, 14.53% gc time)


@time with_matchall(s);
 272.747 milliseconds (5404 k allocations: 110 MB, 20.30% gc time)

@malmaud malmaud changed the title Add match! function for in-place matching RFC: Add match! function for in-place matching Jul 28, 2015
@StefanKarpinski
Copy link
Member

This seems like such an unfortunate API. What about making the RegexMatch object part of the Regex object itself? I mean, what else are you going to do with a regex, after all? #threadsafe

@malmaud
Copy link
Contributor Author

malmaud commented Jul 28, 2015

That's a good idea. Since the we're already allocating just one PCRE-internal buffer for the matches in the Regex constructor, it's no less thread-safe than the existing situation.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 28, 2015

(Which is to say, not thread safe)

@malmaud
Copy link
Contributor Author

malmaud commented Jul 28, 2015

I think we'd still need something like match! though, because if match returns a reference as opposed to a copy of the mutable regex-stored match object, that reference will become invalidated with the next match call.

@JeffBezanson
Copy link
Member

(more ! functions ... :cries: ... )

@malmaud
Copy link
Contributor Author

malmaud commented Jul 28, 2015

¡match!

@malmaud
Copy link
Contributor Author

malmaud commented Jul 28, 2015

Another option is to make a breaking API change where match does return a reference to a mutable match object and it's the user's responsibility to call copy on it.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 28, 2015

Or maybe best is to add a new default argument to match that indicates whether a copy is desired, and will be true by default.

@StefanKarpinski
Copy link
Member

Ah, but the point is that I want the fast version to be the default without any ugly nonsense. Not sure if we can accomplish that, but I'd like to.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 28, 2015

Just brainstorming here, we could also have a new "lightweight" match result object that is stack-allocatable - it only stores offset indices (not substrings), and uses tuples instead of arrays for storing the indices (so the type is parameterized by the number of capture groups).

@malmaud
Copy link
Contributor Author

malmaud commented Jul 28, 2015

Instead of introducing match!, we could also just name the new function match and differentiate it from the current match by the signature. It would be breaking convention that mutating functions use !, but arguably match was already since it changes the internal state of the regex object.

@StefanKarpinski
Copy link
Member

Now we're getting somewhere... Since the actual data is in the string itself, which is immutable (by convention), that's not an issue either. In practice, I suspect it's rare to hang onto anything from the match object but the substrings that are matched.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 29, 2015

@StefanKarpinski not sure which of my comments is the one that's 'getting somewhere'.

@StefanKarpinski
Copy link
Member

I was talking about this idea:

Just brainstorming here, we could also have a new "lightweight" match result object that is stack-allocatable - it only stores offset indices (not substrings), and uses tuples instead of arrays for storing the indices (so the type is parameterized by the number of capture groups).

That sounds like the kind of interface we'd want if you can make it work.

@malmaud
Copy link
Contributor Author

malmaud commented Aug 2, 2015

It's difficult to do while maintaining a nice API - if the match object maintains a reference to the regex and subject string, then it's going to be heap-allocated and performance is compromised. But it it doesn't, you'll have to manually dereference the subject with the indices stored in the match object to get the match and captures as substrings (or call some new method that takes both the match and the subject).

@User-764Q
Copy link

Hi,

It looks like this was merged a few year ago,

Perhaps this issue can be closed?

Matt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants