Improve Numeric matching to support full range of float64 (#188)
This change follows the guidance from #179 to use a 10-byte, base-128 encoded format for numbers, similar to how Quamina does it.

I didn't see any performance implications from supporting the new range, but I had to fix a number of tests. Before merging, I will change the numbers we use for testing to better exercise the new range.

During debugging, I found it challenging to make sense of the encoded numbers, so I've also added a helper method in ComparableNumber and modified toString() methods in a few places.
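
As a rough illustration of why the encoded numbers stay comparable, the sketch below shows the sign-bit adjustment applied to the raw `double` bits before the base-128 step. The names `OrderedBitsSketch` and `toOrderedBits` are hypothetical and exist only for this example; the mask is computed with an arithmetic shift here, which is an assumption of the sketch and not the exact expression used in the commit.

```java
class OrderedBitsSketch {
    // Maps a double to a long whose UNSIGNED order matches the numeric order of the doubles.
    static long toOrderedBits(double d) {
        long bits = Double.doubleToRawLongBits(d);
        // Non-negative input: mask is 1 << 63, so only the sign bit is flipped.
        // Negative input: mask is all ones, so every bit is flipped.
        long mask = (bits >> 63) | Long.MIN_VALUE;
        return bits ^ mask;
    }

    public static void main(String[] args) {
        double[] ascending = {-Double.MAX_VALUE, -1.5, 0.0, 1.5, 5e11, Double.MAX_VALUE};
        for (int i = 1; i < ascending.length; i++) {
            long previous = toOrderedBits(ascending[i - 1]);
            long current = toOrderedBits(ascending[i]);
            // Each adjusted value should sort strictly after the one before it.
            System.out.println(ascending[i - 1] + " < " + ascending[i] + " : "
                    + (Long.compareUnsigned(previous, current) < 0));
        }
    }
}
```

Flipping all bits for negative values reverses their order (larger magnitude sorts first), while flipping only the sign bit for non-negative values places them above every negative value.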
baldawar authored Sep 19, 2024
1 parent c5dc202 commit 1406b64
Showing 14 changed files with 239 additions and 276 deletions.
6 changes: 2 additions & 4 deletions README.md
@@ -273,10 +273,8 @@ Anything-but wildcard list (strings):
```
Above, the references to `c-count`, `d-count`, and `x-limit` illustrate numeric matching,
and only
work with values that are JSON numbers. Numeric matching is limited to value between
-5.0e11 and +5.0e11 inclusive, with 17 digits of precision, that is to say 6 digits
to the right of the decimal point.
and only work with values that are JSON numbers. Numeric matching supports the same
precision and range as Java's `double` primitive which implements IEEE 754 `binary64` standard.
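
In practice, this means a numeric string is usable only if it survives a round trip through `double` without changing value. The sketch below illustrates that idea using the standard `java.math.BigDecimal` parser; the class and method names (`DoubleRoundTripSketch`, `isExactlyComparable`) are hypothetical and are not part of the library's API.

```java
import java.math.BigDecimal;

class DoubleRoundTripSketch {
    // Returns true when the decimal text is finite as a double and
    // converting it to double and back yields the same numeric value.
    static boolean isExactlyComparable(String text) {
        BigDecimal parsed = new BigDecimal(text);
        double asDouble = parsed.doubleValue();
        if (Double.isNaN(asDouble) || Double.isInfinite(asDouble)) {
            return false; // outside the binary64 range
        }
        // BigDecimal.valueOf uses the canonical string form of the double,
        // so extra digits that the double cannot hold show up as a mismatch.
        return BigDecimal.valueOf(asDouble).compareTo(parsed) == 0;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0.1", "5e11", "1e400", "0.12345678901234567890"}) {
            System.out.println(s + " -> " + isExactlyComparable(s));
        }
    }
}
```

Values such as `1e400` fall outside the `binary64` range, and values with more significant digits than a `double` can hold fail the round-trip comparison.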

### IP Address Matching
```javascript
2 changes: 1 addition & 1 deletion pom.xml
@@ -20,7 +20,7 @@
<groupId>software.amazon.event.ruler</groupId>
<artifactId>event-ruler</artifactId>
<name>Event Ruler</name>
<version>1.7.6</version>
<version>1.8.0</version>
<description>Event Ruler is a Java library that allows matching Rules to Events. An event is a list of fields,
which may be given as name/value pairs or as a JSON object. A rule associates event field names with lists of
possible values. There are two reasons to use Ruler: 1/ It's fast; the time it takes to match Events doesn't
154 changes: 86 additions & 68 deletions src/main/software/amazon/event/ruler/ComparableNumber.java
@@ -3,104 +3,122 @@
import ch.randelshofer.fastdoubleparser.JavaBigDecimalParser;

import java.math.BigDecimal;

import static software.amazon.event.ruler.Constants.BASE64_DIGITS;
import static software.amazon.event.ruler.Constants.MIN_NUM_DIGIT;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/**
* Represents a number as a comparable string.
* <br/>
* Numbers are allowed in the range -500,000,000,000 to +500,000,000,000 (inclusive).
* Comparisons are precise to 17 decimal places, with six to the right of the decimal.
* Numbers are treated as floating-point values.
* <br>
* Numbers are converted to strings by:
* 1. Multiplying by 1,000,000 to remove the decimal point and then adding 500,000,000,000 (to remove negatives), then
* 2. Formatting to a 12-character to base64 string with padding, because the base64 string
* converted from 500,000,000,000 * 1,000,000 = 500,000,000,000,000,000 has 12 characters.
* All possible double numbers (IEEE-754 binary64 standard) are allowed.
* Numbers are first standardized to floating-point values and then converted
* to a Base128 encoded string of 10 bytes.
* <br/>
* Hexadecimal representation is used because:
* 1. It saves 3 bytes of memory per number compared to decimal representation.
* 2. It is lexicographically comparable, which is useful for maintaining sorted order of numbers.
* 2. It aligns with the radix used for IP addresses.
* We use Base128 encoding offers a compact representation of decimal numbers
* as it preserves the lexicographical order of the numbers. See
* https://github.com/aws/event-ruler/issues/179 for more context.
* <br/>
* The number is parsed as a Java {@code BigDecimal} to support decimal fractions. We're avoiding double as
* there is a well-known issue that double numbers can lose precision when performing calculations involving
* other data types. The higher the number, the lower the accuracy that can be maintained. For example,
* {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
* where parsing directly to {@code Double} can lose precision when parsing doubles. It's
* probably possible to support wider ranges with our current implementation of parsing strings to
* BigDecimal, but it's not worth the effort as JSON also support upto float64 range. In
* case this requirement changes, it would be advisable to move away from using {@code Doubles}
* and {@code Long} in this class.
* <br/>
* CAVEAT:
* The current range of +/- 500,000,000,000 is selected as a balance between maintaining the committed 6
* decimal places of precision and memory cost (each number is parsed into a 12-character hexadecimal string).
* There are precision and memory implications of the implementation here.
* When trying to increase the maximum number, PLEASE BE VERY CAREFUL TO PRESERVE THE NUMBER PRECISION AND
* CONSIDER THE MEMORY COST.
* <br/>
* Also, while {@code BigDecimal} can ensure the precision of double calculations, it has been shown to be
* 2-4 times slower for basic mathematical and comparison operations, so we turn to long integer arithmetic.
* This will need to be modified if we ever need to support larger numbers.
*/
class ComparableNumber {
// Use scientific notation to define the double number directly to avoid losing Precision by calculation
// for example 5000 * 1000 *1000 will be wrongly parsed as 7.05032704E8 by computer.
static final double HALF_TRILLION = 5E11;

static final int MAX_LENGTH_IN_BYTES = 16;
static final int MAX_DECIMAL_PRECISON = 6;
static final int MAX_LENGTH_IN_BYTES = 10;
static final int BASE_128_BITMASK = 0x7f; // 127 or 01111111

public static final BigDecimal TEN_E_SIX = new BigDecimal("1E6"); // to remove decimals
public static final long HALF_TRILLION_TEN_E_SIX = new BigDecimal(ComparableNumber.HALF_TRILLION).multiply(TEN_E_SIX).longValueExact();

private ComparableNumber() {
}
private ComparableNumber() {}

/**
* Generates a hexadecimal string representation of a given decimal string value,
* with a maximum precision of 6 decimal places and a range between -5,000,000,000
* and 5,000,000,000 (inclusive).
* Generates a comparable number string from a given string representation
* using numbits representation.
*
* @param str the decimal string value to be converted
* @return the hexadecimal string representation of the input value
* @throws IllegalArgumentException if the input value has more than 6 decimal places
* or is outside the allowed range
* @param str the string representation of the number
* @return the comparable number string
* @throws NumberFormatException if the input isn't a number
* @throws IllegalArgumentException if the input isn't a number we can compare
*/
static String generate(final String str) {
final BigDecimal number = JavaBigDecimalParser.parseBigDecimal(str).stripTrailingZeros();
if (number.scale() > MAX_DECIMAL_PRECISON) {
throw new IllegalArgumentException("Only values upto 6 decimals are supported");
}

final long shiftedBySixDecimals = number.multiply(TEN_E_SIX).longValueExact();
final BigDecimal bigDecimal = JavaBigDecimalParser.parseBigDecimal(str);
final double doubleValue = bigDecimal.doubleValue();

// faster than doing bigDecimal comparisons
if (shiftedBySixDecimals < -HALF_TRILLION_TEN_E_SIX || shiftedBySixDecimals > HALF_TRILLION_TEN_E_SIX) {
throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
" and " + ComparableNumber.HALF_TRILLION + ", inclusive");
// make sure we have the comparable numbers and haven't eaten up decimals values
if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) ||
BigDecimal.valueOf(doubleValue).compareTo(bigDecimal) != 0) {
throw new IllegalArgumentException("Cannot compare number : " + str);
}
final long bits = Double.doubleToRawLongBits(doubleValue);

return longToBase64Bytes(shiftedBySixDecimals + HALF_TRILLION_TEN_E_SIX);
// if high bit is 0, we want to xor with sign bit 1 << 63, else negate (xor with ^0). Meaning,
// bits >= 0, mask = 1000000000000000000000000000000000000000000000000000000000000000
// bits < 0, mask = 1111111111111111111111111111111111111111111111111111111111111111
final long mask = ((bits >>> 63) * 0xFFFFFFFFFFFFFFFFL) | (1L << 63);
return numbits(bits ^ mask );
}

public static String longToBase64Bytes(long value) {
if (value < 0) {
throw new IllegalArgumentException("Input value must be non-negative");
/**
* Converts a long value to a Base128 encoded string representation.
* <br/>
* The Base128 encoding scheme is a way to represent a long value as a sequence
* of bytes, where each byte encodes 7 bits of the original value. This allows for
* efficient storage and transmission of large numbers.
* <br/>
* The method first determines the number of trailing zero bytes in the input
* value by iterating over the bytes from the most significant byte to the least
* significant byte, and counting the number of consecutive zero bytes at the end.
* It then creates a byte array of fixed length {@code MAX_LENGTH_IN_BYTES} and
* populates it with the Base128 encoded bytes of the input value, starting from
* the least significant byte.
* <br/>
* As shown in Quamina's numbits.go, it's possible to use variable length encoding
* to reduce storage for simple (often common) numbers but it's not done here to
* keep range comparisons simple for now.
*
* @param value the long value to be converted
* @return the Base128 encoded string representation of the input value
*/
public static String numbits(long value) {
int trailingZeroes = 0;
int index;
// Count the number of trailing zero bytes to skip setting them
for(index = MAX_LENGTH_IN_BYTES - 1; index >= 0; index--) {
if((value & BASE_128_BITMASK) != 0) {
break;
}
trailingZeroes ++;
value >>= 7;
}

char[] bytes = new char[12]; // Maximum length of base-64 encoded long is 12 bytes
int index = 11;

while (value > 0) {
int digit = (int) (value & 0x3F); // Get the lowest 6 bits
bytes[index--] = (char) BASE64_DIGITS[digit];
value >>= 6; // Shift the value right by 6 bits
}
byte[] result = new byte[MAX_LENGTH_IN_BYTES];

while(index >= 0) { // left padding
bytes[index--] = (char) MIN_NUM_DIGIT;
// Populate the byte array with the Base128 encoded bytes of the input value
for(; index >= 0; index--) {
result[index] = (byte)(value & BASE_128_BITMASK);
value >>= 7;
}

return new String(bytes);
return new String(result, StandardCharsets.UTF_8);
}

/**
* This is a utility function for debugging and tests.
* Converts a given string into a list of integers, where each integer represents
* the ASCII value of the corresponding character in the string.
*/
static List<Integer> toIntVals(String s) {
Integer[] arr = new Integer[s.length()];
for (int i=0; i<s.length(); i++) {
arr[i] = (int)s.charAt(i);
}
return Arrays.asList(arr);
}
}
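
To make the fixed-width base-128 step easier to follow, here is a minimal standalone sketch, not the committed `numbits` method. It assumes the input `long` has already been adjusted so that its unsigned order is the desired order, and it uses an unsigned shift (`>>>`) for simplicity. `Base128Sketch` and `encode` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class Base128Sketch {
    static final int LENGTH = 10; // 10 groups of 7 bits cover all 64 bits of a long

    // Encodes the value as 10 characters, most significant 7-bit group first.
    static String encode(long value) {
        byte[] out = new byte[LENGTH];
        for (int i = LENGTH - 1; i >= 0; i--) {
            out[i] = (byte) (value & 0x7f); // lowest 7 bits, always in 0..127
            value >>>= 7;                   // unsigned shift: no sign extension
        }
        // Every byte is below 128, so each one maps to a single ASCII character.
        return new String(out, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 127L, 128L, Long.MAX_VALUE, Long.MIN_VALUE, -1L};
        for (int i = 1; i < samples.length; i++) {
            String a = encode(samples[i - 1]);
            String b = encode(samples[i]);
            // String order should agree with the unsigned order of the inputs.
            System.out.println(Long.toUnsignedString(samples[i - 1]) + " vs "
                    + Long.toUnsignedString(samples[i]) + " : " + (a.compareTo(b) < 0));
        }
    }
}
```

Because every encoded value has the same length and the most significant group comes first, plain `String.compareTo` gives the same result as comparing the underlying unsigned values, which is what keeps range comparisons simple.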

20 changes: 9 additions & 11 deletions src/main/software/amazon/event/ruler/Constants.java
@@ -38,18 +38,16 @@ private Constants() {
final static byte MAX_HEX_DIGIT = HEX_DIGITS[HEX_DIGITS.length - 1]; // F
final static byte MIN_HEX_DIGIT = HEX_DIGITS[0]; // 0

static final byte[] BASE64_DIGITS = {
// numbers are ordered intentionally to based on ascii table value
'+', '/',
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
'W', 'X', 'Y', 'Z',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
'w', 'x', 'y', 'z',
};
static final byte[] BASE128_DIGITS = new byte[128];

static {
for (int i = 0; i < BASE128_DIGITS.length; i++) {
BASE128_DIGITS[i] = (byte) i;
}
}

final static byte MAX_NUM_DIGIT = BASE64_DIGITS[BASE64_DIGITS.length - 1]; // z
final static byte MIN_NUM_DIGIT = BASE64_DIGITS[0]; // +
final static byte MAX_NUM_DIGIT = BASE128_DIGITS[BASE128_DIGITS.length - 1];
final static byte MIN_NUM_DIGIT = BASE128_DIGITS[0];

final static List<String> RESERVED_FIELD_NAMES_IN_OR_RELATIONSHIP = Arrays.asList(
EXACT_MATCH,
49 changes: 17 additions & 32 deletions src/main/software/amazon/event/ruler/Range.java
@@ -4,7 +4,7 @@
import java.util.Arrays;
import java.util.function.Function;

import static software.amazon.event.ruler.Constants.BASE64_DIGITS;
import static software.amazon.event.ruler.Constants.BASE128_DIGITS;
import static software.amazon.event.ruler.Constants.HEX_DIGITS;
import static software.amazon.event.ruler.Constants.MAX_HEX_DIGIT;
import static software.amazon.event.ruler.Constants.MAX_NUM_DIGIT;
@@ -17,8 +17,8 @@
* implementation, the number of digits in the top and bottom of the range is the same.
*/
public final class Range extends Patterns {
private static final byte[] NEGATIVE_HALF_TRILLION_BYTES = doubleToComparableBytes(-ComparableNumber.HALF_TRILLION);
private static final byte[] POSITIVE_HALF_TRILLION_BYTES = doubleToComparableBytes(ComparableNumber.HALF_TRILLION);
private static final byte[] MIN_RANGE_BYTES = doubleToComparableBytes(-Double.MAX_VALUE);
private static final byte[] MAX_RANGE_BYTES = doubleToComparableBytes(Double.MAX_VALUE);
private static final int HEX_DIGIT_A_DECIMAL_VALUE = 10;
/**
* Bottom and top of the range. openBottom true means we're looking for > bottom, false means >=
@@ -51,22 +51,22 @@ private Range(Range range) {

public static Range lessThan(final String val) {
byte[] byteVal = stringToComparableBytes(val);
return between(NEGATIVE_HALF_TRILLION_BYTES, false, byteVal, true);
return between(MIN_RANGE_BYTES, false, byteVal, true);
}

public static Range lessThanOrEqualTo(final String val) {
byte[] byteVal = stringToComparableBytes(val);
return between(NEGATIVE_HALF_TRILLION_BYTES, false, byteVal, false);
return between(MIN_RANGE_BYTES, false, byteVal, false);
}

public static Range greaterThan(final String val) {
byte[] byteVal = stringToComparableBytes(val);
return between(byteVal, true, POSITIVE_HALF_TRILLION_BYTES, false);
return between(byteVal, true, MAX_RANGE_BYTES, false);
}

public static Range greaterThanOrEqualTo(final String val) {
byte[] byteVal = stringToComparableBytes(val);
return between(byteVal, false, POSITIVE_HALF_TRILLION_BYTES, false);
return between(byteVal, false, MAX_RANGE_BYTES, false);
}

public static Range between(final String bottom, final boolean openBottom, final String top, final boolean openTop) {
@@ -128,7 +128,7 @@ public byte minDigit() {
static byte[] digitSequence(byte first, byte last, boolean includeFirst, boolean includeLast, boolean isCIDR) {
return isCIDR ?
digitSequence(first, last, includeFirst, includeLast, HEX_DIGITS, Range::getHexByteIndex) :
digitSequence(first, last, includeFirst, includeLast, BASE64_DIGITS, Range::getNumByteIndex);
digitSequence(first, last, includeFirst, includeLast, BASE128_DIGITS, Integer::new);
}

private static byte[] digitSequence(byte first, byte last, boolean includeFirst, boolean includeLast,
@@ -157,28 +157,6 @@ private static byte[] digitSequence(byte first, byte last, boolean includeFirst,
return bytes;
}

// quickly find the index of chars within Constants.BASE64_DIGITS
private static int getNumByteIndex(byte value) {
if(value == '+') {
return 0;
}
if (value == '/') {
return 1;
}
// ['0'-'9'] maps to [2 - 11] indexes
if (value >= '0' && value <= '9') {
return value - '0' + 2;
}

// ['A'-'Z'] maps to [12-37] indexes
if (value >= 'A' && value <= 'Z') {
return (value - 'A') + 12;
}

// ['a'-'z'] maps to [38-64] indexes
return (value - 'a') + 38;
}

// quickly find the index of chars within Constants.HEX_DIGITS
private static int getHexByteIndex(byte value) {
// ['0'-'9'] maps to [0-9] indexes
@@ -226,8 +204,15 @@ public int hashCode() {
}

public String toString() {
return (new String(bottom, StandardCharsets.UTF_8)) + '/' + (new String(top, StandardCharsets.UTF_8))
+ ':' + openBottom + '/' + openTop + ':' + isCIDR +" (" + super.toString() + ")";
if(isCIDR) {
return (new String(bottom, StandardCharsets.UTF_8)) + '/' + (new String(top, StandardCharsets.UTF_8))
+ ':' + openBottom + '/' + openTop + ':' + isCIDR + " (" + super.toString() + ")";
} else {
return "" +
ComparableNumber.toIntVals(new String(bottom, StandardCharsets.UTF_8)) + '/' +
ComparableNumber.toIntVals(new String(top, StandardCharsets.UTF_8))
+ ':' + openBottom + '/' + openTop + ':' + isCIDR + " (" + super.toString() + ")";
}
}

private static byte[] doubleToComparableBytes(double d) {
6 changes: 5 additions & 1 deletion src/main/software/amazon/event/ruler/ValuePatterns.java
@@ -45,6 +45,10 @@ public int hashCode() {

@Override
public String toString() {
return "VP:" + pattern + " (" + super.toString() + ")";
if(type() == MatchType.NUMERIC_EQ) {
return "VP:" + ComparableNumber.toIntVals(pattern) + " (" + super.toString() + ")";
} else {
return "VP:" + pattern + " (" + super.toString() + ")";
}
}
}