Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Number Benchmarks are slow #25

Open
baldawar opened this issue Aug 22, 2022 · 8 comments
Open

Number Benchmarks are slow #25

baldawar opened this issue Aug 22, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@baldawar
Copy link
Collaborator

Describe the bug

The test setup surrounding Numeric Benchmarks was fixed as part of #18 and we found that the tests were actually slow. Documenting this comment to investigate later #18 (comment).

I'm pretty sure I know why numeric benchmarks are slow. I had forgotten that this was true in Ruler, and I implemented a very similar approach in Quamina and was shocked when it ran slowly to a degree very similar to what we see here. So I profiled it and found out the time was being sucked up by whatever the Java equivalent is of Go's strconv.ParseFloat() and Sprintf("%019.0f"...) - it's not that cheap to parse floats or generate a normalized 19-digit representation.

Which I think means we're probably stuck with it. Unless you think you can write better floating-point parse/format code that's faster than what comes with the core libraries.

I don't think we'll do anything but we should look into if there's a way to improve precision.

Additional Context

File for Ruler https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/ComparableNumber.java#L37.

@baldawar baldawar added the bug Something isn't working label Aug 22, 2022
@baldawar
Copy link
Collaborator Author

baldawar commented Sep 7, 2022

I looked into this a bit today. Sharing my thoughts before I get busy with operator duties for sometime.

  1. The long poll is the part of Range constructor where we generate ComparableNumber https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Range.java#L31. This as Tim mentioned in the comment above it affected by the padding + hex conversion. Benchmarks show its around 80~90% of the overall latency in CL2Benchmark benchmark for Numeric matchers.
  2. We can make significant improvements (> 20%) in the benchmark if we pass static pre-generated byte arrays instead of Constants for lessThan, greaterThan, and so on methods https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Range.java#L57. Real benefits will vary based on how often clients use in-between X & Y range or CIDR.
  3. We've also over-biased towards CIDR / Hex matching. We should investigate if directly converting long to bytes is sufficient. On the surface, seems "only" need to update digitSequence method but I've run out time to dig into the potential benefits.
  4. While we likely can't write code that's faster than core libraries, doesn't hurt to try as we have need a very limited set of features. It's worth looking if bitmasking or look-up tables lead to any benefits. One place more investigation is needed.

@baldawar
Copy link
Collaborator Author

baldawar commented Sep 8, 2022

Related discussion in #37

@baldawar
Copy link
Collaborator Author

baldawar commented Sep 9, 2022

huh, so the issue is actually in the rules used within the benchmarks. It's not obvious unless you dig into the cityplots test data but there's multiple layers of arrays involved before we get to co-ordinates. This slows down rule-matching because of Array Consistency checks.

When I update cityplots2 file to also have a "firstCoordinates" key (takes the top most JSON w x/y/z values[1]), then match the rules against this new key then the benchmarks show significant improvement

Reading citylots2
Read 213068 events
EXACT events/sec: 213281.3
WILDCARD events/sec: 168967.5
PREFIX events/sec: 202728.8
SUFFIX events/sec: 217638.4
EQUALS_IGNORE_CASE events/sec: 198572.2
NUMERIC events/sec: 124820.2
ANYTHING-BUT events/sec: 136845.2
COMBO events/sec: 87899.3

[1] Sample JSON:

{
  "type": "Feature",
  "properties": {
    "MAPBLKLOT": "0001001",
    "BLKLOT": "0001001",
    "BLOCK_NUM": "0001",
    "LOT_NUM": "001",
    "FROM_ST": "0",
    "TO_ST": "0",
    "STREET": "UNKNOWN",
    "ST_TYPE": null,
    "ODD_EVEN": "E"
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        {
          "x": -122.42200352825247,
          "y": 37.80848009696725,
          "z": 0
        },
        {
          "x": -122.42207601332528,
          "y": 37.808835019815085,
          "z": 0
        },
        {
          "x": -122.42110217434863,
          "y": 37.808803534992904,
          "z": 0
        },
        {
          "x": -122.42106256906727,
          "y": 37.80860105681815,
          "z": 0
        },
        {
          "x": -122.42200352825247,
          "y": 37.80848009696725,
          "z": 0
        }
      ]
    ],
    "firstCoordinates": {
      "x": -122.42200352825247,
      "y": 37.80848009696725,
      "z": 0
    }
  }
}

@timbray
Copy link
Collaborator

timbray commented Sep 9, 2022

Yes, but the benchmark is doing exactly what it was designed to - exercise the hell out of the flattener and number-processor and so on. One effect is that we're testing a worst-case, most people's real applications are going to run faster than the benchmark, which is a good thing. Having said that, The CityLots data is real live data from a real live dataset, so probably a realistic problem. At AWS I had one group complain to me about how matching fields that were arrays of numbers was so much slower than matching ARNs or detail-types, so the kind of change we just checked in is still useful.

BTW the Quamina array-consistency design is totally different from Ruler's. One of them is probably faster but I have no idea which one.

@rudygt
Copy link
Contributor

rudygt commented Sep 9, 2022

I did some exploration around this, using the AnythingButPerformanceBenchmark (there is a slow down around the 10k event mark, caused by a few very large events) and found some events where the number of coordinates inside the geometry of the event is very high, the sample I was testing with had

Event.fields: 2462 

ACTask.stepQueue: 3029492 <- this is the max size the stepQueue gets handling that event, feels a little high

the worst offender I found, had more than 10k coordinates, and did got to around 17 million size on stepQueue, i have the impression something is exploding to get that many steps, I remain curious and will keep learning a little more about how the whole thing works to see if I can find other ways to help.

@timbray
Copy link
Collaborator

timbray commented Sep 9, 2022

Yeah, the basic algorithm is in ACFinder.java, like this: You have a sorted list of fields, i.e. name/value pairs, in the event, in this case 2462 long. You have a certain number of steps N in the state machine, dunno what it is. So, you start with the first state and drop an entry on the StepQueue for each field, because the state machine might start matching at any field. Suppose you get matches for first state at fields 2 (transitioning to state X) and 4 (trans to state Y). So then you're going to drop entries on the stepQ for state X and all the fields starting with 3, and for state Y and all the fields starting with 5. I.e you keep adding things to the stepQ and the central loop is "while stepQ not empty".

OK, I acknowledge this is a bit indirect, not the most intuitive thing. Anyhow, per the above, if you have 2.4K fields and get a few interim matches, it's not crazy that the stepQueue could get long. On the other hand, it is totally possible that we're doing something dumb here and unnecessarily loading it up. A fresh pair of eyes would be welcome.

@timbray
Copy link
Collaborator

timbray commented Sep 9, 2022

And I've realized, at some point in the intervening years, that the whole thing could be done recursively without using the stepQueue at all. Which my intuition says would probably be faster but I wouldn't bet much on it. But the basic concept is the same, wherever you are in the state machine you have to give all the remaining fields a chance to match the current state.

@baldawar
Copy link
Collaborator Author

baldawar commented Sep 9, 2022

I don't want to interrupt the discussion here too much but just a minor note tied my previous comment that I'm planning on adding a change, that 1/ rename the current rules around numeric matches to be called "array consistent matchers: and 2/ have a different set of rules for numeric matches that aren't affected by arrays. Mostly because I got very close to this state while digging into this issue, so not a lot of effort to make the change. It avoids any confusion around ruler's numeric matching performance (like being slow because of sprintf) and makes it clear that current bottleneck is in ACFinder.

baldawar added a commit that referenced this issue Sep 10, 2022
As part of #25 realized that numeric matchers are orders of magnitude slow not because  of inherent issue within the matcher specific  code, but instead because the benchmarks were stressed while checking for complex arrays (don't have a better term for this yet, think "json arrays within arrays within ..." )

After splitting the matchers into two and introducing a second PARTIAL_COMBO benchmark, I was able to identify a regression I would have introduced within the ByteMachine.java for numeric ranges.

As part of this change we're also changing the citylots2.json.gz file and adding a new firstCoordinates key for numeric matching only. I tried other existing properties first but as none of them have floating points or large numbers, the benchmarks results were not matching my expectations.
baldawar added a commit that referenced this issue Sep 12, 2022
As part of #25 realized that numeric matchers are orders of magnitude slow not because  of inherent issue within the matcher specific  code, but instead because the benchmarks were stressed while checking for complex arrays (don't have a better term for this yet, think "json arrays within arrays within ..." )

After splitting the matchers into two and introducing a second PARTIAL_COMBO benchmark, I was able to identify a regression I would have introduced within the ByteMachine.java for numeric ranges.

As part of this change we're also changing the citylots2.json.gz file and adding a new firstCoordinates key for numeric matching only. I tried other existing properties first but as none of them have floating points or large numbers, the benchmarks results were not matching my expectations.


* Diff https://gist.github.com/baldawar/adc1283ac89c029dba6f52c4fcd51f08/revisions
* Before https://gist.github.com/baldawar/adc1283ac89c029dba6f52c4fcd51f08/8efdbf0da705d04e6a2cc4b4354d8573095bddcc
* After https://gist.github.com/baldawar/adc1283ac89c029dba6f52c4fcd51f08/eb140daed14771c0848cc83b63010898e83a19c3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants