Add limit for regex length #75

arthurscchan · 2023-08-04T15:16:45Z

This fixes a possible out of stack memory problem in src/org/joni/Regex.java for parsing long and complicated regex structures.

In the constructors of the Regex class, it will take in a string or a byte array as for generating a regex matcher. The received string or byte array will be stored and go through the compile() method of the Analyser class. The Analyser class will then use different parsing methods in the Parser class to parse the regular expression. Those parse methods will call each other recursively depending on the structure of the given string or byte array. If the input is too long or contains so many complicated or levelled structures, the depth of the recursion will be high. As each of the recursive calls will occupy part of the stack, the high depth of recursion will use up the stack quickly and result in a stack memory overflow problem.

This PR reduces the problem by adding an additional length limit as a private class variable. If the provided regex string or byte array is longer than the maximum length. The method will simply throw a ValueException.

We found this bug using fuzzing by way of OSS-Fuzz, where we recently integrated joni (google/oss-fuzz#10680). OSS-Fuzz is a free service run by Google for fuzzing important open source software. If you'd like to know more about this then I'm happy to go into detail and also set up things so you can receive emails and detailed reports when bugs are found.

Signed-off-by: Arthur Chan <[email protected]>

enebo · 2023-08-04T22:08:17Z

@arthurscchan I think this is a good idea but can you investigate making this a configurable item. joni has a lot of items you can set. You can look at #67 for how to do this.

arthurscchan · 2023-08-07T11:10:21Z

@enebo Thanks for the suggestion. I will fix the PR accordingly.

Signed-off-by: Arthur Chan <[email protected]>

arthurscchan · 2023-08-07T13:08:24Z

@enebo Fixed accordingly, thanks again for your suggestion.

lopex · 2023-08-07T14:28:08Z

shouldnt there be end - p instead of bytes.length ?

Signed-off-by: Arthur Chan <[email protected]>

arthurscchan · 2023-08-07T14:42:53Z

@lopex Yes, I agree. Thanks for the suggestion.
@enebo I have fixed that accordingly. Thanks for your comment.

enebo · 2023-08-09T14:42:29Z

@arthurscchan Thanks for the PR. I think this will add another lever to control time spent.

kares · 2023-09-06T09:54:02Z

believe logstash's grok-ing patterns might end up with regular expressions even in the 1000s (although they still should not be that complicated and usually do not blow up the stack) ... hopefully not 10_000, let's see if they come around, later.

enebo · 2024-07-24T18:40:13Z

@kares @arthurscchan @lopex I did hit this the other month while running Ruby unit tests so I increased it....BUT...This is not a minor release change. Any existing user may stop working just because they have a big regexp they intend to have.

So I changed the default to -1 and you need to set the value to opt into it. We can open an issue for this if we want to change this to opt out with a reasonable default. The commit is 03ac9a5.

headius · 2024-07-31T02:35:18Z

@enebo JRuby 10 might be the right time to attempt a hard limit.

enebo · 2024-07-31T13:27:38Z

@headius I am wondering if this should ever have a hard limit or should just be available as an option. Thus far it has just hit two projects with reasonable but big regexps. regexp instruction count does not imply performance issue but I can if you are building up something from a dsl or something perhaps you will want this.

headius · 2024-08-01T19:47:15Z

@enebo Ahh I read more of the context and I agree this probably should only be an opt-in feature, at least until someone reports a CVE that we need to address. 😀

I would guess that CRuby has a similar problem in Onigmo, but perhaps nobody has tried.

@arthurscchan Is there some way we could reproduce the issue you found and test other Ruby Regexp implementations?

arthurscchan · 2024-08-01T19:56:45Z

@headius Because this issue was reported almost a year ago, I don't have the "actual" input in hand anymore. But in my understanding and what has been discovered from OSS-Fuzz fuzzing, the input byte array that triggers the problem is a very long byte array with a complex structure, for example, a high dimension nested array or some invalid byte array that keeps opening nested elements until the heap memory is overflow. The length of the byte array reached more than 100,000 bytes in there as I recall.

headius · 2024-08-01T20:26:47Z

@arthurscchan Ok thanks. I guess we'll just let it ride for now.

Add limit for regex length

029e378

Signed-off-by: Arthur Chan <[email protected]>

Fix config logic

434e5f6

Signed-off-by: Arthur Chan <[email protected]>

Fix length calculation

5cd8965

Signed-off-by: Arthur Chan <[email protected]>

enebo merged commit c187bb2 into jruby:master Aug 9, 2023

joegallo mentioned this pull request Dec 4, 2025

Bump jruby/joni to 2.2.6 elastic/elasticsearch#139075

Merged

Add limit for regex length #75

Add limit for regex length #75

Uh oh!

Conversation

arthurscchan commented Aug 4, 2023

Uh oh!

enebo commented Aug 4, 2023

Uh oh!

arthurscchan commented Aug 7, 2023

Uh oh!

arthurscchan commented Aug 7, 2023

Uh oh!

lopex commented Aug 7, 2023

Uh oh!

arthurscchan commented Aug 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enebo commented Aug 9, 2023

Uh oh!

kares commented Sep 6, 2023

Uh oh!

enebo commented Jul 24, 2024

Uh oh!

headius commented Jul 31, 2024

Uh oh!

enebo commented Jul 31, 2024

Uh oh!

headius commented Aug 1, 2024

Uh oh!

arthurscchan commented Aug 1, 2024

Uh oh!

headius commented Aug 1, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

arthurscchan commented Aug 7, 2023 •

edited

Loading