Skip to content

Conversation

@arthurscchan
Copy link
Contributor

This fixes a possible out of stack memory problem in src/org/joni/Regex.java for parsing long and complicated regex structures.

In the constructors of the Regex class, it will take in a string or a byte array as for generating a regex matcher. The received string or byte array will be stored and go through the compile() method of the Analyser class. The Analyser class will then use different parsing methods in the Parser class to parse the regular expression. Those parse methods will call each other recursively depending on the structure of the given string or byte array. If the input is too long or contains so many complicated or levelled structures, the depth of the recursion will be high. As each of the recursive calls will occupy part of the stack, the high depth of recursion will use up the stack quickly and result in a stack memory overflow problem.

This PR reduces the problem by adding an additional length limit as a private class variable. If the provided regex string or byte array is longer than the maximum length. The method will simply throw a ValueException.

We found this bug using fuzzing by way of OSS-Fuzz, where we recently integrated joni (google/oss-fuzz#10680). OSS-Fuzz is a free service run by Google for fuzzing important open source software. If you'd like to know more about this then I'm happy to go into detail and also set up things so you can receive emails and detailed reports when bugs are found.

Signed-off-by: Arthur Chan <[email protected]>
@enebo
Copy link
Member

enebo commented Aug 4, 2023

@arthurscchan I think this is a good idea but can you investigate making this a configurable item. joni has a lot of items you can set. You can look at #67 for how to do this.

@arthurscchan
Copy link
Contributor Author

@enebo Thanks for the suggestion. I will fix the PR accordingly.

Signed-off-by: Arthur Chan <[email protected]>
@arthurscchan
Copy link
Contributor Author

@enebo Fixed accordingly, thanks again for your suggestion.

@lopex
Copy link
Contributor

lopex commented Aug 7, 2023

shouldnt there be end - p instead of bytes.length ?

Signed-off-by: Arthur Chan <[email protected]>
@arthurscchan
Copy link
Contributor Author

arthurscchan commented Aug 7, 2023

@lopex Yes, I agree. Thanks for the suggestion.
@enebo I have fixed that accordingly. Thanks for your comment.

@enebo enebo merged commit c187bb2 into jruby:master Aug 9, 2023
@enebo
Copy link
Member

enebo commented Aug 9, 2023

@arthurscchan Thanks for the PR. I think this will add another lever to control time spent.

@kares
Copy link
Member

kares commented Sep 6, 2023

believe logstash's grok-ing patterns might end up with regular expressions even in the 1000s (although they still should not be that complicated and usually do not blow up the stack) ... hopefully not 10_000, let's see if they come around, later.

@enebo
Copy link
Member

enebo commented Jul 24, 2024

@kares @arthurscchan @lopex I did hit this the other month while running Ruby unit tests so I increased it....BUT...This is not a minor release change. Any existing user may stop working just because they have a big regexp they intend to have.

So I changed the default to -1 and you need to set the value to opt into it. We can open an issue for this if we want to change this to opt out with a reasonable default. The commit is 03ac9a5.

@headius
Copy link
Member

headius commented Jul 31, 2024

@enebo JRuby 10 might be the right time to attempt a hard limit.

@enebo
Copy link
Member

enebo commented Jul 31, 2024

@headius I am wondering if this should ever have a hard limit or should just be available as an option. Thus far it has just hit two projects with reasonable but big regexps. regexp instruction count does not imply performance issue but I can if you are building up something from a dsl or something perhaps you will want this.

@headius
Copy link
Member

headius commented Aug 1, 2024

@enebo Ahh I read more of the context and I agree this probably should only be an opt-in feature, at least until someone reports a CVE that we need to address. 😀

I would guess that CRuby has a similar problem in Onigmo, but perhaps nobody has tried.

@arthurscchan Is there some way we could reproduce the issue you found and test other Ruby Regexp implementations?

@arthurscchan
Copy link
Contributor Author

@headius Because this issue was reported almost a year ago, I don't have the "actual" input in hand anymore. But in my understanding and what has been discovered from OSS-Fuzz fuzzing, the input byte array that triggers the problem is a very long byte array with a complex structure, for example, a high dimension nested array or some invalid byte array that keeps opening nested elements until the heap memory is overflow. The length of the byte array reached more than 100,000 bytes in there as I recall.

@headius
Copy link
Member

headius commented Aug 1, 2024

@arthurscchan Ok thanks. I guess we'll just let it ride for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants