Skip to content

Conversation

@tvernum
Copy link
Contributor

@tvernum tvernum commented Dec 22, 2020

This change substantially reduces the CPU and Heap usage of
StringMatcher when processing large complex patterns.

The improvement is achieved by switching the order in which we
perform concatenation and union for common styles of wildcard patterns.

Given a set of wildcard strings:

  • "*-logs-*"
  • "*-metrics-*"
  • "web-*-prod-*"
  • "web-*-staging-*"

The old implementation would perform steps roughly like:

minimize {
    union {
        concatenate { MATCH_ANY, "-logs-", MATCH_ANY }
        concatenate { MATCH_ANY, "-metrics-", MATCH_ANY }
        concatenate { "web-", MATCH_ANY, "prod-", MATCH_ANY }
        concatenate { "web-", MATCH_ANY, "staging-", MATCH_ANY }
    }
}

The outer minimize would require determinizing the automaton, which
was highly inefficient

The new implementation is:

minimize {
    union {
        concatenate {
            MATCH_ANY ,
            minimize {
                union { "-logs-", "-metrics"- }
            }
            MATCH_ANY
        }
        concatenate {
            minimize {
                union {
                    concatenate { "web-", MATCH_ANY, "prod-" }
                    concatenate { "web-", MATCH_ANY, "staging-" }
                }
            }
            MATCH_ANY
        }
    }
}

By performing a union of the inner strings before concatenating the
MATCH_ANY ("*") the time & heap space spent on determinizing the
automaton is greatly reduced.

Resolves: #36062

This change substantially reduces the CPU and Heap usage of
StringMatcher when processing large complex patterns.

The improvement is achieved by switching the order in which we
perform concatenation and union for common styles of wildcard patterns.

Given a set of wildcard strings:
- "*-logs-*"
- "*-metrics-*"
- "web-*-prod-*"
- "web-*-staging-*"

The old implementation would perform steps roughly like:

    minimize {
        union {
            concatenate { MATCH_ANY, "-logs-", MATCH_ANY }
            concatenate { MATCH_ANY, "-metrics-", MATCH_ANY }
            concatenate { "web-", MATCH_ANY, "prod-", MATCH_ANY }
            concatenate { "web-", MATCH_ANY, "staging-", MATCH_ANY }
        }
    }

The outer minimize would require determinizing the automaton, which
was highly inefficient

The new implementation is:

    minimize {
        union {
            concatenate {
                MATCH_ANY ,
                minimize {
                    union { "-logs-", "-metrics"- }
                }
                MATCH_ANY
            }
            concatenate {
                minimize {
                    union {
                        concatenate { "web-", MATCH_ANY, "prod-" }
                        concatenate { "web-", MATCH_ANY, "staging-" }
                    }
                }
                MATCH_ANY
            }
        }
    }

By performing a union of the inner strings before concatenating the
MATCH_ANY ("*") the time & heap space spent on determinizing the
automaton is greatly reduced.
@tvernum tvernum added >enhancement :Security/Security Security issues without another label v8.0.0 v7.12.0 labels Dec 22, 2020
@elasticmachine elasticmachine added the Team:Security Meta label for security team label Dec 22, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@tvernum
Copy link
Contributor Author

tvernum commented Dec 22, 2020

To give an indication of the improvement...

If a user has this role:

{
    "indices": [
        {
            "names": [ "*-logs-*-prod-*", "*-logs-*-staging-*", "*-metrics-*-prod-*", "*-metrics-*-staging-*" ],
            "privileges": [ "read" ]
        },
        {
            "names": [ "*-logs-*-test-*", "*-logs-*-dev-*", "*-metrics-*-test-*", "*-metrics-*-dev-*" ],
            "privileges": [ "read" ]
        },
        {
            "names": [ "web-*-prod-*", "web-*-staging-*" ],
            "privileges": [ "read" ]
        },
        {
            "names": [ "web-*-test-*", "web-*-dev-*" ],
            "privileges": [ "read" ]
        },
        {
            "names": [ "*-host-zeus-*", "*-host-hermes-*", "*-host-hera-*", "*-host-ares-*", "*-host-orion-*", "*-host-atlas-*", "*-host-apollo-*" ],
            "privileges": [ "read" ]
        }
    ]
}

Then in current versions (7.11 snapshot), even with max_determinized_states set to 1,000,000 the user will not be able to search any indices, because determinizing the automaton will take about 6 seconds, and then fail with:

Determinizing automaton with 284 states and 356 transitions would result in more than 1000000 states.

After this PR, that role will work correctly, using the defaults (max_determinized_states: 100,000) and determinizes in < 100 milliseconds

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a few minor comments.

The efficiency improvement is outstanding. Great job!

Comment on lines +121 to +122
final char first = p.charAt(0);
final char last = p.charAt(p.length() - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoud we guard these with a check for p.length() == 0?

} else if (first == '*') {
if (last == '*') {
// *something*
infix.add(p.substring(1, p.length() - 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible that the pattern is **. As such the infix is an empty string and can be skipped. But as discussed, it is probably better handled as part of compacting all consecutive *, which can be done in a separate PR.

// But, that's not true if the string has an embedded '*' in it - in that case, we should handle them in this special way.
prefix.add(p.substring(0, p.length() - 1));
} else {
// some*thing / some?thing / etc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pattern like something* should also reach here and should be part of the comment.

// some*thing*
// For simple prefix patterns ("something*") it's more efficient to do a single pass
// Lucene handles the shared trailing '*' on an accept state well,
// and performing 2 minimizes (on for the union of strings, then on again after concatenating MATCH_ANY) is slower.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence reads weird to me. Should it be something like ... (one for the union of strings, and another one after ...)

return minimize(pattern(patterns.iterator().next()));
}

final Function<Collection<String>, Automaton> build = strings -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd prefer to name this variable with a noun, something like buildFunc.

@BigPandaToo
Copy link
Contributor

I am with Yang on comment on the line 122 ^^
Otherwise LGTM

@tvernum tvernum merged commit f2651e4 into elastic:master Dec 30, 2020
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Dec 31, 2020
This change substantially reduces the CPU and Heap usage of
StringMatcher when processing large complex patterns.

The improvement is achieved by switching the order in which we
perform concatenation and union for common styles of wildcard patterns.

Given a set of wildcard strings:
- "*-logs-*"
- "*-metrics-*"
- "web-*-prod-*"
- "web-*-staging-*"

The old implementation would perform steps roughly like:

    minimize {
        union {
            concatenate { MATCH_ANY, "-logs-", MATCH_ANY }
            concatenate { MATCH_ANY, "-metrics-", MATCH_ANY }
            concatenate { "web-", MATCH_ANY, "prod-", MATCH_ANY }
            concatenate { "web-", MATCH_ANY, "staging-", MATCH_ANY }
        }
    }

The outer minimize would require determinizing the automaton, which
was highly inefficient

The new implementation is:

    minimize {
        union {
            concatenate {
                MATCH_ANY ,
                minimize {
                    union { "-logs-", "-metrics"- }
                }
                MATCH_ANY
            }
            concatenate {
                minimize {
                    union {
                        concatenate { "web-", MATCH_ANY, "prod-" }
                        concatenate { "web-", MATCH_ANY, "staging-" }
                    }
                }
                MATCH_ANY
            }
        }
    }

By performing a union of the inner strings before concatenating the
MATCH_ANY ("*") the time & heap space spent on determinizing the
automaton is greatly reduced.

Backport of: elastic#66724
tvernum added a commit that referenced this pull request Dec 31, 2020
This change substantially reduces the CPU and Heap usage of
StringMatcher when processing large complex patterns.

The improvement is achieved by switching the order in which we
perform concatenation and union for common styles of wildcard patterns.

Given a set of wildcard strings:
- "*-logs-*"
- "*-metrics-*"
- "web-*-prod-*"
- "web-*-staging-*"

The old implementation would perform steps roughly like:

    minimize {
        union {
            concatenate { MATCH_ANY, "-logs-", MATCH_ANY }
            concatenate { MATCH_ANY, "-metrics-", MATCH_ANY }
            concatenate { "web-", MATCH_ANY, "prod-", MATCH_ANY }
            concatenate { "web-", MATCH_ANY, "staging-", MATCH_ANY }
        }
    }

The outer minimize would require determinizing the automaton, which
was highly inefficient

The new implementation is:

    minimize {
        union {
            concatenate {
                MATCH_ANY ,
                minimize {
                    union { "-logs-", "-metrics"- }
                }
                MATCH_ANY
            }
            concatenate {
                minimize {
                    union {
                        concatenate { "web-", MATCH_ANY, "prod-" }
                        concatenate { "web-", MATCH_ANY, "staging-" }
                    }
                }
                MATCH_ANY
            }
        }
    }

By performing a union of the inner strings before concatenating the
MATCH_ANY ("*") the time & heap space spent on determinizing the
automaton is greatly reduced.

Backport of: #66724
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Security/Security Security issues without another label Team:Security Meta label for security team v7.12.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate improvements to all pattern matching

5 participants