Commit

Merge branch 'main' into dependabot/github_actions/codecov/codecov-ac…
…tion-4.5.0
timbray authored Jun 17, 2024
2 parents e4783fd + 437889c commit 07030e7
Showing 23 changed files with 378 additions and 337 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/benchmarks.yml
@@ -20,7 +20,7 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

- name: Set up Go ${{ matrix.go-version }}
uses: actions/setup-go@cdcb36043654635271a94b9a6d1392de5bb323a7
2 changes: 1 addition & 1 deletion .github/workflows/codeql-analysis.yaml
@@ -39,7 +39,7 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
2 changes: 1 addition & 1 deletion .github/workflows/dep-review.yaml
@@ -17,7 +17,7 @@ jobs:
timeout-minutes: 5
steps:
- name: Checkout repository
uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

- name: Dependency Review
uses: actions/dependency-review-action@0659a74c94536054bfa5aeb92241f70d680cc78e
2 changes: 1 addition & 1 deletion .github/workflows/go-lint.yaml
@@ -19,7 +19,7 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332
with:
fetch-depth: 1

2 changes: 1 addition & 1 deletion .github/workflows/go-unit-tests.yaml
@@ -31,7 +31,7 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

- name: Set up Go ${{ matrix.go-version }}
uses: actions/setup-go@cdcb36043654635271a94b9a6d1392de5bb323a7
2 changes: 1 addition & 1 deletion .github/workflows/release.yaml
@@ -27,7 +27,7 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332
with:
fetch-depth: 0
ref: "main"
44 changes: 25 additions & 19 deletions README.md
@@ -14,7 +14,9 @@
create an instance and add multiple **Patterns** to it,
and then query data objects called **Events** to
discover which of the Patterns match
the fields in the Event.
the fields in the Event. In typical cases, Quamina
can match millions of Events per second, even with
many Patterns added to the instance.

Quamina has no run-time dependencies beyond built-in Go libraries.

@@ -292,33 +294,20 @@ Events through it as is practical.

### `AddPattern()` Performance

In **most** cases, tens of thousands of Patterns per second can
Tens of thousands of Patterns per second can
be added to a Quamina instance; the in-memory data structure will
become larger, but not unreasonably so. The amount of of
become larger, but not unreasonably so. The amount of
available memory is the only significant limit to the
number of patterns an instance can carry.

The exception is `shellstyle` Patterns. Adding many of these
can rapidly lead to degradation in elapsed time and memory
consumption, at a rate which is uneven but at worst
O(2<sup>N</sup>) in the number of patterns. A fuzz test
which adds random 5-letter words with a `*` at a random
location slows to a crawl after 30 or so `AddPattern()`
calls, with the Quamina instance having many millions of
states. Note that such instances, once built, can still
match Events at high speeds.

This is after some optimization. It is possible there is a
bug such that automaton-building is unduly wasteful but it
may remain the case that adding this flavor of Pattern is
simply not something that can be done at large scale.

### `MatchesForEvent()` Performance

I used to say that the performance of
`MatchesForEvent` was O(1) in the number of
Patterns. That’s probably a reasonable way to think
about it, because it’s *almost* right.
about it, because it’s *almost* right, except in the
case where a very large number of `shellstyle` patterns
have been added; this is discussed in the next section.

To be correct, the performance is a little worse than
O(N) where N is the average number of unique fields in an
@@ -361,6 +350,23 @@ So, adding a new Pattern that only mentions fields which are
already mentioned in previous Patterns is effectively free,
i.e. O(1) in terms of run-time performance.

### Quamina instances with large numbers of `shellstyle` Patterns

A study of the theory of finite automata reveals that processing
regular-expression constructs such as `*` increases the complexity of
the automaton necessary to match it. It develops that when
a large number of such automata are compiled together, the merged
output can contain a high degree of nondeterminism which can result
in a drastic slowdown.

A fuzz test which adds a pattern for each of 12,959 5-letter words with
one `*` embedded in each at a random offset slows matching speed down to
below 10,000/second, in stark contrast to most Quamina instances, which
can achieve millions of matches/second.

This slowdown is under active investigation and it is possible that the
situation will improve.
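
To make the nondeterminism concrete, here is a minimal, self-contained sketch of matching one shell-style pattern by tracking a set of simultaneously active pattern positions. It is not Quamina's implementation: the function `shellMatch` and its position-set representation are assumptions made purely for illustration. The point it shows is that a literal pattern keeps a single position active, while a `*` keeps several alive at once, and that extra bookkeeping is what can multiply when many such patterns are merged.

```go
package main

import "fmt"

// shellMatch (hypothetical helper, not part of Quamina) reports whether value
// matches a shell-style pattern in which '*' matches any run of bytes,
// including an empty one. It also returns the largest number of pattern
// positions that were active at the same time.
func shellMatch(pattern, value string) (matched bool, maxActive int) {
	n := len(pattern)
	active := make([]bool, n+1)
	next := make([]bool, n+1)

	// closure adds positions reachable without consuming input (a '*' may
	// match nothing) and returns how many positions are active.
	closure := func(set []bool) int {
		count := 0
		for i := 0; i <= n; i++ {
			if !set[i] {
				continue
			}
			count++
			if i < n && pattern[i] == '*' {
				set[i+1] = true
			}
		}
		return count
	}

	active[0] = true
	maxActive = closure(active)
	for j := 0; j < len(value); j++ {
		for i := range next {
			next[i] = false
		}
		for i := 0; i < n; i++ {
			if !active[i] {
				continue
			}
			if pattern[i] == '*' {
				next[i] = true // the star absorbs this byte and stays active
			} else if pattern[i] == value[j] {
				next[i+1] = true // literal byte matched, advance
			}
		}
		if c := closure(next); c > maxActive {
			maxActive = c
		}
		active, next = next, active
	}
	return active[n], maxActive
}

func main() {
	for _, p := range []string{"ab", "a*b"} {
		m, a := shellMatch(p, "aXXb")
		fmt.Printf("pattern %q: matched=%v maxActive=%d\n", p, m, a)
	}
}
```

A real automaton merges many patterns into one structure, so the growth in active states there can be much larger than in this single-pattern sketch.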

### Further documentation

There is a series of blog posts entitled
17 changes: 8 additions & 9 deletions anything_but.go
@@ -73,20 +73,19 @@ func readAnythingButSpecial(pb *patternBuild, valsIn []typedVal) (pathVals []typ
func makeMultiAnythingButFA(vals [][]byte) (*smallTable, *fieldMatcher) {
nextField := newFieldMatcher()
successStep := &faState{table: newSmallTable(), fieldTransitions: []*fieldMatcher{nextField}}
//DEBUG successStep.table.label = "(success)"
success := &faNext{steps: []*faState{successStep}}
success := &faNext{states: []*faState{successStep}}

ret, _ := oneMultiAnythingButStep(vals, 0, success), nextField
ret, _ := makeOneMultiAnythingButStep(vals, 0, success), nextField
return ret, nextField
}

// oneMultiAnythingButStep - spookeh. The idea is that there will be N smallTables in this FA, where N is
// makeOneMultiAnythingButStep - spookeh. The idea is that there will be N smallTables in this FA, where N is
// the longest among the vals. So for each value from 0 through N, we make a smallTable whose default is
// success but transfers to the next step on whatever the current byte in each of the vals that have not
// yet been exhausted. We notice when we get to the end of each val and put in a valueTerminator transition
// to a step with no nextField entry, i.e. failure because we've exactly matched one of the anything-but
// strings.
func oneMultiAnythingButStep(vals [][]byte, index int, success *faNext) *smallTable {
func makeOneMultiAnythingButStep(vals [][]byte, index int, success *faNext) *smallTable {
// this will be the default transition in all the anything-but tables.
var u unpackedTable
for i := range u {
@@ -115,18 +114,18 @@ func oneMultiAnythingButStep(vals [][]byte, index int, success *faNext) *smallTa

// for each val that still has bytes to process, recurse to process the next one
for utf8Byte, val := range valsWithBytesRemaining {
nextTable := oneMultiAnythingButStep(val, index+1, success)
nextTable := makeOneMultiAnythingButStep(val, index+1, success)
nextStep := &faState{table: nextTable}
u[utf8Byte] = &faNext{steps: []*faState{nextStep}}
u[utf8Byte] = &faNext{states: []*faState{nextStep}}
}

// for each val that ends at 'index', put a failure-transition for this anything-but
// if you hit the valueTerminator, success for everything else
for utf8Byte := range valsEndingHere {
failState := &faState{table: newSmallTable()} // note no transitions
lastStep := &faNext{steps: []*faState{failState}}
lastStep := &faNext{states: []*faState{failState}}
lastTable := makeSmallTable(success, []byte{valueTerminator}, []*faNext{lastStep})
u[utf8Byte] = &faNext{steps: []*faState{{table: lastTable}}}
u[utf8Byte] = &faNext{states: []*faState{{table: lastTable}}}
}

table := newSmallTable()
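
The idea described in the comment above (default to success, follow the bytes of the excluded values, and fail only when a value ends exactly on one of them) can be sketched with a plain byte trie. This is an illustrative stand-in, not the `smallTable`-based code in this file: `trieNode`, `buildExcluded` and `anythingBut` are hypothetical names, and the end-of-value check plays the role of the `valueTerminator` transition.

```go
package main

import "fmt"

// trieNode is a hypothetical stand-in for one step of the automaton.
type trieNode struct {
	children map[byte]*trieNode
	terminal bool // true if an excluded value ends exactly here
}

// buildExcluded builds a trie holding every value that must NOT match.
func buildExcluded(vals [][]byte) *trieNode {
	root := &trieNode{children: make(map[byte]*trieNode)}
	for _, val := range vals {
		node := root
		for _, b := range val {
			child, ok := node.children[b]
			if !ok {
				child = &trieNode{children: make(map[byte]*trieNode)}
				node.children[b] = child
			}
			node = child
		}
		node.terminal = true
	}
	return root
}

// anythingBut reports whether value differs from every excluded value.
func anythingBut(root *trieNode, value []byte) bool {
	node := root
	for _, b := range value {
		child, ok := node.children[b]
		if !ok {
			// default transition: this byte leaves every excluded value
			return true
		}
		node = child
	}
	// reaching the end of the value on a terminal node means an exact match
	// with an excluded value, which is the failure case
	return !node.terminal
}

func main() {
	root := buildExcluded([][]byte{[]byte("foo"), []byte("bar")})
	for _, v := range []string{"foo", "fo", "food", "baz", "bar"} {
		fmt.Printf("%q passes anything-but: %v\n", v, anythingBut(root, []byte(v)))
	}
}
```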
14 changes: 8 additions & 6 deletions cl2_test.go
@@ -187,20 +187,20 @@ func TestRulerCl2(t *testing.T) {

// initial run to stabilize memory
bm := newBenchmarker()
bm.addRules(exactRules, exactMatches)
bm.addRules(exactRules, exactMatches, false)

bm.run(t, lines)

bm = newBenchmarker()
bm.addRules(exactRules, exactMatches)
bm.addRules(exactRules, exactMatches, true)
fmt.Printf("EXACT events/sec: %.1f\n", bm.run(t, lines))

bm = newBenchmarker()
bm.addRules(prefixRules, prefixMatches)
bm.addRules(prefixRules, prefixMatches, true)
fmt.Printf("PREFIX events/sec: %.1f\n", bm.run(t, lines))

bm = newBenchmarker()
bm.addRules(anythingButRules, anythingButMatches)
bm.addRules(anythingButRules, anythingButMatches, true)
fmt.Printf("ANYTHING-BUT events/sec: %.1f\n", bm.run(t, lines))
}

@@ -214,13 +214,15 @@ func newBenchmarker() *benchmarker {
return &benchmarker{q: q, wanted: make(map[X]int)}
}

func (bm *benchmarker) addRules(rules []string, wanted []int) {
func (bm *benchmarker) addRules(rules []string, wanted []int, report bool) {
for i, rule := range rules {
rname := fmt.Sprintf("r%d", i)
_ = bm.q.AddPattern(rname, rule)
bm.wanted[rname] = wanted[i]
}
fmt.Println(matcherStats(bm.q.matcher.(*coreMatcher)))
if report {
fmt.Println(matcherStats(bm.q.matcher.(*coreMatcher)))
}
}

func (bm *benchmarker) run(t *testing.T, events [][]byte) float64 {
29 changes: 18 additions & 11 deletions core_matcher.go
@@ -129,7 +129,7 @@ func (m *coreMatcher) deletePatterns(_ X) error {
// matchesForJSONEvent calls the flattener to pull the fields out of the event and
// hands over to MatchesForFields
// This is a leftover from previous times, is only used by tests, but it's used by a *lot*
// so removing it would require a lot of tedious work
// and it's a convenient API for testing.
func (m *coreMatcher) matchesForJSONEvent(event []byte) ([]X, error) {
fields, err := newJSONFlattener().Flatten(event, m.getSegmentsTreeTracker())
if err != nil {
@@ -178,20 +178,27 @@ func (m *coreMatcher) matchesForFields(fields []Field) ([]X, error) {
}
matches := newMatchSet()

// pre-allocate a pair of buffers that will be used several levels down the call stack for efficiently
// traversing NFAs
bufs := &bufpair{
buf1: make([]*faState, 0),
buf2: make([]*faState, 0),
}

// for each of the fields, we'll try to match the automaton start state to that field - the tryToMatch
// routine will, in the case that there's a match, call itself to see if subsequent fields after the
// first matched will transition through the machine and eventually achieve a match
s := m.fields()
for i := 0; i < len(fields); i++ {
tryToMatch(fields, i, s.state, matches)
tryToMatch(fields, i, s.state, matches, bufs)
}
return matches.matches(), nil
}

// tryToMatch tries to match the field at fields[index] to the provided state. If it does match and generate
// 1 or more transitions to other states, it calls itself recursively to see if any of the remaining fields
// can continue the process by matching that state.
func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSet) {
func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSet, bufs *bufpair) {
stateFields := state.fields()

// transition on exists:true?
@@ -200,16 +207,16 @@ func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSe
matches = matches.addXSingleThreaded(existsTrans.fields().matches...)
for nextIndex := index + 1; nextIndex < len(fields); nextIndex++ {
if noArrayTrailConflict(fields[index].ArrayTrail, fields[nextIndex].ArrayTrail) {
tryToMatch(fields, nextIndex, existsTrans, matches)
tryToMatch(fields, nextIndex, existsTrans, matches, bufs)
}
}
}

// an exists:false transition is possible if there is no matching field in the event
checkExistsFalse(stateFields, fields, index, matches)
checkExistsFalse(stateFields, fields, index, matches, bufs)

// try to transition through the machine
nextStates := state.transitionOn(&fields[index])
nextStates := state.transitionOn(&fields[index], bufs)

// for each state in the possibly-empty list of transitions from this state on fields[index]
for _, nextState := range nextStates {
@@ -221,17 +228,17 @@ func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSe
// of the same array
for nextIndex := index + 1; nextIndex < len(fields); nextIndex++ {
if noArrayTrailConflict(fields[index].ArrayTrail, fields[nextIndex].ArrayTrail) {
tryToMatch(fields, nextIndex, nextState, matches)
tryToMatch(fields, nextIndex, nextState, matches, bufs)
}
}
// now we've run out of fields to match this state against. But suppose it has an exists:false
// transition, and it so happens that the exists:false pattern field is lexically larger than the other
// fields and that in fact such a field does not exist. That state would be left hanging. So…
checkExistsFalse(nextStateFields, fields, index, matches)
checkExistsFalse(nextStateFields, fields, index, matches, bufs)
}
}

func checkExistsFalse(stateFields *fmFields, fields []Field, index int, matches *matchSet) {
func checkExistsFalse(stateFields *fmFields, fields []Field, index int, matches *matchSet, bufs *bufpair) {
for existsFalsePath, existsFalseTrans := range stateFields.existsFalse {
// it seems like there ought to be a more state-machine-idiomatic way to do this, but
// I thought of a few and none of them worked. Quite likely someone will figure it out eventually.
@@ -250,9 +257,9 @@ func checkExistsFalse(stateFields *fmFields, fields []Field, index int, matches
if i == len(fields) {
matches = matches.addXSingleThreaded(existsFalseTrans.fields().matches...)
if thisFieldIsAnExistsFalse {
tryToMatch(fields, index+1, existsFalseTrans, matches)
tryToMatch(fields, index+1, existsFalseTrans, matches, bufs)
} else {
tryToMatch(fields, index, existsFalseTrans, matches)
tryToMatch(fields, index, existsFalseTrans, matches, bufs)
}
}
}
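
The `bufpair` threaded through `tryToMatch` above is pre-allocated once per call to `matchesForFields`. The diff does not show how the value matcher consumes it, so the following is only a sketch of one common way a pair of reusable slices can serve NFA traversal: one slice holds the current set of states, the other collects the next set, and they swap roles after every input byte. The `state` type and the `traverse` function are assumptions for illustration; only the names `bufpair`, `buf1` and `buf2` come from the diff.

```go
package main

import "fmt"

// state is a hypothetical NFA state: one byte may lead to several next states.
type state struct {
	next      map[byte][]*state
	accepting bool
}

// bufpair mirrors the shape shown in the diff, using the sketch's state type.
type bufpair struct {
	buf1, buf2 []*state
}

// traverse runs the NFA over input, reusing the two buffers instead of
// allocating a fresh slice for every step. Duplicate states are not removed,
// which keeps the sketch short.
func traverse(start *state, input []byte, bufs *bufpair) bool {
	current := append(bufs.buf1[:0], start)
	next := bufs.buf2[:0]
	for _, b := range input {
		next = next[:0]
		for _, s := range current {
			next = append(next, s.next[b]...)
		}
		if len(next) == 0 {
			// hand the (possibly grown) buffers back for the next call
			bufs.buf1, bufs.buf2 = current[:0], next[:0]
			return false
		}
		current, next = next, current
	}
	bufs.buf1, bufs.buf2 = current[:0], next[:0]
	for _, s := range current {
		if s.accepting {
			return true
		}
	}
	return false
}

func main() {
	// NFA accepting "ab" or "ac", with two competing transitions on 'a'.
	accept := &state{accepting: true}
	s1 := &state{next: map[byte][]*state{'b': {accept}}}
	s2 := &state{next: map[byte][]*state{'c': {accept}}}
	start := &state{next: map[byte][]*state{'a': {s1, s2}}}

	bufs := &bufpair{buf1: make([]*state, 0), buf2: make([]*state, 0)}
	for _, in := range []string{"ab", "ac", "ad"} {
		fmt.Printf("%q accepted: %v\n", in, traverse(start, []byte(in), bufs))
	}
}
```

Because the buffers are handed back after each call, later traversals can reuse whatever capacity earlier ones grew, which is the usual motivation for passing such a pair down the call stack.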
8 changes: 4 additions & 4 deletions field_matcher.go
@@ -6,7 +6,7 @@ import (

// fieldMatcher represents a state in the matching automaton, which matches field names and dispatches to
// valueMatcher to complete matching of field values.
// the fields that hold state are segregated in updateable so they can be replaced atomically and make the coreMatcher
// the fields that hold state are segregated in updateable, so they can be replaced atomically and make the coreMatcher
// thread-safe.
type fieldMatcher struct {
updateable atomic.Value // always holds an *fmFields
@@ -112,7 +112,7 @@ func (m *fieldMatcher) addTransition(field *patternField, printer printer) []*fi
}
freshStart.transitions[field.path] = vm

// suppose I'm adding the first pattern to a matcher and it has "x": [1, 2]. In principle the branches on
// suppose I'm adding the first pattern to a matcher, and it has "x": [1, 2]. In principle the branches on
"x": 1 and "x": 2 could go to the same next state. But we have to make a unique next state for each of them
// because some future other pattern might have "x": [2, 3] and thus we need a separate branch to potentially
// match two patterns on "x": 2 but not "x": 1. If you were optimizing the automaton for size you might detect
@@ -144,12 +144,12 @@ func (m *fieldMatcher) addTransition(field *patternField, printer printer) []*fi
// or nil if no transitions are possible. An example of name/value that could produce multiple next states
// would be if you had the pattern { "a": [ "foo" ] } and another pattern that matched any value with
// a prefix of "f".
func (m *fieldMatcher) transitionOn(field *Field) []*fieldMatcher {
func (m *fieldMatcher) transitionOn(field *Field, bufs *bufpair) []*fieldMatcher {
// are there transitions on this field name?
valMatcher, ok := m.fields().transitions[string(field.Path)]
if !ok {
return nil
}

return valMatcher.transitionOn(field.Val)
return valMatcher.transitionOn(field.Val, bufs)
}