matchall is very slow #3719
I investigated this a little. From the profiler output it looks like a lot of time is spent creating and destroying the million or so match objects. Pythons Here's a Julia equivalent.

    function matchingstrings(re::Regex, str::ByteString)
        matches = String[]
        offset = 0
        opts = re.options & PCRE.EXECUTE_MASK
        # Output vector: pcre_exec writes the match's start and end byte offsets (0-based) here.
        ovec = Array(Int32, 3)
        while true
            result = ccall((:pcre_exec, :libpcre), Int32,
                           (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                            Int32, Int32, Ptr{Int32}, Int32),
                           re.regex, C_NULL, str, length(str.data),
                           offset, opts, ovec, 3)
            if result >= 0
                # Convert the 0-based, end-exclusive range to Julia's 1-based, inclusive range.
                push!(matches, str[ovec[1]+1:ovec[2]])
                # Resume one byte past the end of this match.
                offset = ovec[2] + 1
            else
                # A negative result means no further match (or an error): stop.
                break
            end
        end
        matches
    end

Using The Python is 0.33 for me, so it's a pretty close match for speed. Given the performance difference, maybe we should include something like this? p.s.
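The cost of per-match objects can also be observed on the Python side, where `re.findall` returns plain strings and `re.finditer` builds one match object per hit. The following is a minimal sketch, assuming a small synthetic text in place of big.txt; the sample text and the function names are illustrative and not taken from this thread.

```python
import re
import timeit

WORD = re.compile(r"\w+")

# Synthetic stand-in for big.txt (an assumption for this sketch).
TEXT = "the quick brown fox jumps over the lazy dog " * 20000


def words_via_findall(text):
    # findall returns a list of plain strings directly.
    return WORD.findall(text)


def words_via_match_objects(text):
    # finditer yields one match object per hit; the string is extracted afterwards.
    return [m.group() for m in WORD.finditer(text)]


if __name__ == "__main__":
    # Both approaches must produce the same tokens.
    assert words_via_findall(TEXT) == words_via_match_objects(TEXT)
    for fn in (words_via_findall, words_via_match_objects):
        seconds = timeit.timeit(lambda: fn(TEXT), number=3) / 3
        print(f"{fn.__name__}: {seconds:.3f} s per pass")
```

On CPython the match-object version usually comes out slower, but the exact numbers depend on the machine and interpreter version.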
Here's 0.22 seconds by returning SubStrings rather than Strings.

    function matchingstrings(re::Regex, str::ByteString)
        matches = SubString[]
        offset = 0
        opts = re.options & PCRE.EXECUTE_MASK
        ovec = Array(Int32, 3)
        while true
            result = ccall((:pcre_exec, :libpcre), Int32,
                           (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                            Int32, Int32, Ptr{Int32}, Int32),
                           re.regex, C_NULL, str, length(str.data),
                           offset, opts, ovec, 3)
            if result >= 0
                push!(matches, SubString(str, ovec[1]+1, ovec[2]))
                offset = ovec[2] + 1
            else
                break
            end
        end
        matches
    end
Nice work, @dcjones. Great that we're within striking distance for speed here. Maybe we should just change it so that
Ah, we really need to make more of our string stuff non-copying.
Awesome job, @dcjones. I personally think
The idea that we've kicked around is making all UTF8Strings (and other string types) have an offset as well as a length so that you can take a substring and get something of the normal UTF8String type.
Making UTF8Strings more like C pointers sounds like a great idea to me.
Well, they'd still have a bit more baggage since they'd have to contain a reference to an array object, an offset into that array to start at, and a length. That's significantly more stuff than just a pointer.
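The layout described here (a reference to the underlying array, an offset, and a length) can be sketched as follows. This is only an illustration in Python, not Julia's implementation; `StrView` and its `sub` method are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrView:
    """Hypothetical non-copying view: a reference to the data, an offset, a length."""
    data: bytes   # shared reference to the underlying buffer, never copied
    offset: int   # 0-based byte index where this view starts
    length: int   # number of bytes covered by the view

    def sub(self, start: int, stop: int) -> "StrView":
        # A substring only adjusts offset and length; the result has the same type.
        if not 0 <= start <= stop <= self.length:
            raise IndexError("substring out of range")
        return StrView(self.data, self.offset + start, stop - start)

    def __str__(self) -> str:
        # Bytes are copied and decoded only when a real string is requested.
        end = self.offset + self.length
        return self.data[self.offset:end].decode("utf-8")


buf = "hello world".encode("utf-8")
whole = StrView(buf, 0, len(buf))
word = whole.sub(6, 11)
print(word, word.data is buf)
```

Each view carries three fields instead of a single pointer, which is the extra baggage mentioned above; in exchange, taking a substring allocates no new buffer.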
We also talked about having a
-Jacob
On Mon, Jul 15, 2013 at 6:10 PM, Stefan Karpinski
The search function, however, returns the index or range of indices at which the match occurs, not the matching substring itself.
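The distinction between an index range and the substring can be illustrated with a Python analogue. This sketch uses Python's `re.search` and `Match.span`, not Julia's `search`, and the sample text is made up.

```python
import re

text = "the quick brown fox"
m = re.search(r"q\w+", text)

# span() gives the index range of the match, analogous to a search that returns indices.
start, stop = m.span()

# The substring itself is recovered by slicing with that range.
word = text[start:stop]
print((start, stop), word)
```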
The
I'd love to see @dcjones's change get into 0.2.
I nearly forgot about this. I'll make a PR later tonight. |
I recently tried translating Norvig's spellchecker into Julia. The following example shows that Julia's string performance needs a lot of work.
To get started, download the file http://norvig.com/ipython/big.txt for tokenization.
We'll tokenize it in Julia first:
This takes 10 seconds on my machine.
In contrast, the following Python code is simpler (because there's no notion that matchall won't return strings directly) and 20x faster.
This takes 0.4 seconds on my machine.
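The Python snippet itself did not survive in this copy of the thread. The following is a sketch of what such a tokenizer could look like, not the code the poster ran. It assumes that big.txt has been saved to the working directory and that a token is a run of `\w` characters; the original pattern is not known, and `tokenize` is an illustrative name.

```python
import re
import time
from pathlib import Path


def tokenize(text):
    # re.findall returns the matched substrings directly as a list of str.
    return re.findall(r"\w+", text.lower())


if __name__ == "__main__":
    path = Path("big.txt")  # assumed local copy of http://norvig.com/ipython/big.txt
    if path.exists():
        text = path.read_text(encoding="utf-8", errors="replace")
        start = time.perf_counter()
        words = tokenize(text)
        elapsed = time.perf_counter() - start
        print(f"{len(words)} tokens in {elapsed:.2f} s")
    else:
        print("big.txt not found; download it first")
```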