
Add a fixed-point mapping type#15939

Closed
polyfractal wants to merge 4 commits intoelastic:masterfrom
polyfractal:feature/fixed_point

Conversation

@polyfractal (Contributor) commented Jan 12, 2016

Adds a new fixed mapping type, which stores real-valued numbers in a long via an associated scaling factor. It uses a base-10 scaling factor because base-2 can give results that look strange to a human when searching, since it doesn't line up with discrete decimal digits.
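The core encode/decode round-trip can be sketched like this (a minimal illustration; the class and method names are hypothetical, not taken from the actual patch):

```java
// Minimal sketch of base-10 fixed-point encoding; names are illustrative,
// not the actual patch implementation.
public class FixedPointSketch {
    private final long scalingFactor; // e.g. 100 preserves two decimal digits

    public FixedPointSketch(int decimalDigits) {
        long f = 1;
        for (int i = 0; i < decimalDigits; i++) {
            f *= 10;
        }
        this.scalingFactor = f;
    }

    // Scale and round the real value into a long for storage.
    public long encode(double value) {
        return Math.round(value * scalingFactor);
    }

    // Recover a double; any precision beyond the scale is lost.
    public double decode(long stored) {
        return (double) stored / scalingFactor;
    }
}
```

With two decimal digits, 1.23 round-trips exactly while 1.239 is rounded to the nearest representable value.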

I felt very much like a bull in a china shop here... unclear if all the right things were modified/overridden, or if there is a cleaner way to do this.

Related to #13625

@polyfractal added the review and :Search Foundations/Mapping (Index mappings, including merging and defining field types) labels on Jan 12, 2016
Member (inline review comment):

scalingFactor?

@jpountz (Contributor) commented Jan 13, 2016

I only skimmed through the patch but I think you are on the right track. I think we still need to sort out the following questions:

  • what should the API be, i.e. should users be able to pass a number of decimal digits, a number of bits, or even arbitrary scaling factors?
  • how can we make conversions efficient? In particular the conversion from fixed-point to double, since it could run billions of times for a single search request.
  • should we round down or to the closest value?

@rmuir Maybe you have opinions on this?

base-2 can give strange results (to a human) when searching, due to it not lining up with discrete digits

Why is it an issue? Don't floats have the same problem since they can't represent most values accurately?

Contributor (inline review comment):

I think we need to be careful here. For instance, if includeLower is true but parseDoubleValue rounds lowerTerm down, then we need to create a long range that has includeLower=false.
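The concern can be illustrated with a hypothetical helper (not the patch's actual code) that maps a double lower bound onto the scaled long domain; whenever the scaled bound falls between two representable longs, the effective bound moves up regardless of the requested inclusiveness:

```java
// Hypothetical sketch: translate a double lower bound into a long lower
// bound over values scaled by `scale`. If the scaled bound falls between
// two representable longs, the effective bound is the next long up,
// regardless of includeLower.
public class RangeBoundSketch {
    public static long lowerBound(double lowerTerm, boolean includeLower, long scale) {
        double scaled = lowerTerm * scale;
        long floor = (long) Math.floor(scaled);
        if (scaled > floor) {
            return floor + 1; // bound was rounded: next representable value up
        }
        return includeLower ? floor : floor + 1;
    }
}
```

Note that with scale 100, a bound of 1.005 scales to roughly 100.4999999999999 in double arithmetic, which already hints at the approximation problem raised below.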

Contributor (inline review comment):

Actually it's even worse than I thought given that doubles are approximate as well. Maybe we should make this API take a BigDecimal.

Contributor (inline review comment):

Shouldn't we use ParseField objects here?

@polyfractal (Contributor, author):

Why is it an issue? Don't floats have the same problem since they can't represent most values accurately?

Floats do, but it's a lot less noticeable in my opinion, since it happens at very extreme values. For most values that people actually work with, the error is unnoticeable. E.g. if I search for 1.23, I don't expect 1.25 to come back as a hit.

It may just be me, but I find base-2 very unintuitive for the end-user:

  • A user cares about the number of digits to preserve, not the number of bits. Specifying 2 digits is conceptually easier to understand than 6 bits
  • Specifying 2 digits always gives you two digits. Specifying 5 bits gives you 1-2 digits of precision. So we can either let users specify num_bits and hope they understand how it lines up with base-10, or we can let users specify num_digits and over-estimate the number of bits required so we can guarantee that number of digits.
  • Since base-2 only accurately represents powers of 2, I think we'll get a lot of complaints about accuracy because humans think in base-10, and the error is noticeable when you are only saving two digits worth of the fractional portion. It's not technically a bug, but it is confusing
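The digits-versus-bits relationship in the bullets above can be made concrete: each decimal digit costs about log2(10) ≈ 3.32 bits, so guaranteeing a digit count means over-estimating and rounding the bit count up (a back-of-the-envelope sketch, not code from the patch):

```java
public class DigitsToBits {
    // Bits needed to guarantee `digits` decimal digits of precision:
    // each decimal digit requires log2(10) ~= 3.32 binary bits,
    // rounded up to the next whole bit.
    public static int bitsForDigits(int digits) {
        return (int) Math.ceil(digits * (Math.log(10) / Math.log(2)));
    }
}
```

The mapping is inexact in the other direction (a fixed bit count yields a variable number of guaranteed digits), which is exactly the confusion described above.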

@rmuir (Contributor) commented Jan 13, 2016

what should the API be, i.e. should users be able to pass a number of decimal digits, a number of bits, or even arbitrary scaling factors?
how can we make conversions efficient? In particular the conversion from fixed-point to double, since it could run billions of times for a single search request.
should we round down or to the closest value?

@rmuir Maybe you have opinions on this?

It depends on the goals. I will say this: if it's named fixed, then what immediately springs to mind is a true fixed-point type, e.g. something you can use for financial data and all the use cases that come along with that. This is a fair amount of work: it means mathematical operations etc. have to work, and it means we can never involve double. Think about cases like a script used in an aggregation doing a sum: not all script engines even have BigDecimal support. So I personally think it's a challenging task.

On the other hand, if we just want an optimized way to encode a double, but with some restraints on range/precision, that e.g. takes advantage of things like GCD compression already implemented in docvalues, then maybe we treat it just like that, and figure out an alternative name?

@jpountz (Contributor) commented Jan 13, 2016

One thing I was thinking about: decimal encoding has the benefit that it would more likely leverage gcd compression in doc values. For instance if 3 decimal places are configured but only one is actually used then Lucene will figure out all values share 100 as a common factor and compress accordingly. This also means that we might not have to make the scale configurable: we could pick a fixed scale that should be enough for common use-cases and then rely on Lucene for using the right number of bits?
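The GCD effect described here can be sketched as follows: if values are scaled by 1000 but only one decimal place is actually used, every stored long shares a common factor of 100 (an illustration of the idea, not Lucene's actual code path):

```java
import java.math.BigInteger;

public class GcdSketch {
    // Greatest common divisor across all stored values, as Lucene's doc
    // values GCD compression would discover it.
    public static long commonFactor(long[] values) {
        BigInteger gcd = BigInteger.ZERO;
        for (long v : values) {
            gcd = gcd.gcd(BigInteger.valueOf(v));
        }
        return gcd.longValue();
    }
}
```

For example, 1.5, 2.3, and 0.7 scaled by 1000 become {1500, 2300, 700}, whose common factor is 100, so Lucene can effectively store value / 100 in far fewer bits per value.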

@polyfractal (Contributor, author):

On the other hand, if we just want an optimized way to encode a double, but with some restraints on range/precision, that e.g. takes advantage of things like GCD compression already implemented in docvalues, then maybe we treat it just like that, and figure out an alternative name?

Yeah, this was my goal personally. I just want better compression for when you don't need the full dynamic range of doubles. I even have a big, fat warning in the docs that this is not equivalent to Decimal and similar currency types, since it always casts back to double for arithmetic. Just a storage optimization, really.

++ to renaming to something more clear, I have no attachment to fixed. :) Random ideas:

  • variable_precision: to illustrate that you can change the precision. But it sounds like each value may have its own precision or something
  • real_storage: e.g. real-valued number optimized for storage (yuck)
  • truncated_real: e.g. real-valued with truncated precision
  • compressed_real, compressed_float
  • scaled_real, scaled_float

I dunno, those are all terrible.

@jpountz: Not sure I understand. Do you mean to pick a scale that provides, say, 6 significant figures and simply use that? I think that's fine on the fractional side of things (since Lucene will compress down unused bits), but it arbitrarily limits the range for not much change in complexity? Maybe I'm misunderstanding?

@rmuir (Contributor) commented Jan 13, 2016

OK, it makes a lot more sense if we just try to do the optimized thing and defer true fixed. As far as naming goes, I'll throw out sloppy as an option.

If we do it this way, then we use DoubleDocValues and other APIs just like now; it's just that how we encode the double itself is different. Like an alternative floating-point encoding. The main alternative would be to use an IEEE half-float (https://en.wikipedia.org/wiki/Half-precision_floating-point_format).

The advantage here is a wider range than a half-float, so larger numbers can be represented. We can also simplify a little bit and simply reject values like NaN/Inf and so on.
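For reference, IEEE half-precision packs a sign bit, 5 exponent bits, and 10 mantissa bits into 16 bits. A minimal float-to-half conversion for normal, in-range values (truncating, with no NaN/Inf/subnormal handling; illustration only):

```java
public class HalfFloatSketch {
    // Convert a float to IEEE 754 half-precision bits. Only handles normal,
    // in-range values: no NaN, infinity, subnormal, or rounding support.
    public static short toHalf(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = (bits >>> 16) & 0x8000;            // relocate the sign bit
        int exp = ((bits >>> 23) & 0xFF) - 127 + 15;  // rebias exponent (127 -> 15)
        int mantissa = (bits >>> 13) & 0x3FF;         // truncate mantissa to 10 bits
        return (short) (sign | (exp << 10) | mantissa);
    }
}
```

The 5-bit exponent limits half-floats to roughly ±65504, which is the range limitation being weighed against the scaled encoding here.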

The main disadvantage over existing float/half-float is uncertainty over how much space is really being saved (especially if it's configurable, which I would avoid if possible). Those standardized formats are simpler to think about, and you know what space is required.

So it would be good to see numbers where less precision is used and it really saves space. Keep in mind that in 3.0, dynamically mapped floating-point values already go to float instead of double, so they only use 32 bits per value.

@rmuir (Contributor) commented Jan 13, 2016

Also when thinking about some alternative format like this, keep in mind the bits-per-value "schedule" used by docvalues (https://github.com/apache/lucene-solr/blob/6f15d0282f17ab49ab434c54605f4b94f6c4d037/lucene/core/src/java/org/apache/lucene/util/packed/DirectWriter.java#L154-L156)

We round up to values that are efficient to encode/decode, so it does not make sense to think about saving a bit or two per value in general: e.g. using 13 bits per value is really no better than 16.
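Concretely, the rounding behavior looks like this (a sketch using the supported widths from the linked DirectWriter source; the lookup logic here is illustrative):

```java
public class BitsSchedule {
    // Bits-per-value widths supported by Lucene's DirectWriter (from the
    // linked source); anything in between is rounded up to the next width.
    private static final int[] SUPPORTED = {1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64};

    public static int roundBits(int bitsRequired) {
        for (int b : SUPPORTED) {
            if (b >= bitsRequired) {
                return b;
            }
        }
        return 64;
    }
}
```

So an encoding that needs 13 bits per value ends up occupying 16, and one that needs 5 occupies 8; shaving one or two bits off a value buys nothing.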

@polyfractal (Contributor, author):

Makes sense, lemme run some real-world tests under varying conditions to see how much space is actually saved (will be easy to compare against float/double, and easy enough to extrapolate to half-float)

I think we should still add half-float as well; I opened a ticket for it a while ago: #13626

@polyfractal (Contributor, author):

Alright, did some tests. Summary is that I think @jpountz's suggestion was right: we should consider limiting this to ~3 digits and not allow customization. In that configuration it can save considerable space for certain use-cases.

Testing methodology was:

  1. Create index with fixed, double and float mapped (with decimal_places between 1-10, depending on the test)
  2. Index 1 million docs
  3. Refresh, explicit flush, wait for all merges to stop
  4. Measure disk size of the three fields
  5. Repeat steps 2-4

The data being indexed was: Math.sin((i / 1000)) * 100 + (normal(0,1) * 3.0). E.g. a sine wave with a periodicity of ~160 docs, an amplitude of 100, and gaussian noise added to each point so periods aren't identical. I figured this was a good tradeoff to approximate "real" data, since values trend in the same direction and have some noise but aren't completely random.

Values are sent to Elasticsearch as a string ("value" : "1.23") so that each type coerces on its own. Sizes are inclusive of all components (postings, norms, doc_values, etc.).
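A sketch of the generator formula described above (using floating-point division for i / 1000; the exact division semantics of the original test aren't shown in the comment):

```java
import java.util.Random;

public class BenchmarkData {
    // Sine wave with amplitude 100 plus gaussian noise scaled by 3.0,
    // matching the value formula described in the methodology.
    public static double value(int i, Random rng) {
        return Math.sin(i / 1000.0) * 100 + rng.nextGaussian() * 3.0;
    }
}
```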

Fixed point compresses better for lower precision, as expected. At one SigFig it's about twice as good as a float:
[chart: field sizes at 1 significant figure]

The compression advantage decreases linearly until we hit 7 SigFigs, at which point it's about identical to a float:
[chart: field sizes at 7 significant figures]

And at 8+ SigFigs it is larger and no longer useful if your goal is to save space:
[chart: field sizes at 8 significant figures]

Predictably, if you increase the dynamic range of the data to +/- 10,000 (e.g. Math.sin((i / 1000)) * 10000 + (normal(0,1) * 3.0)), the gap closes a lot quicker and fixed is equivalent to float by four SigFigs:
[chart: field sizes at 4 significant figures, +/- 10,000 range]

And with a very narrow range the opposite occurs, giving you more SigFigs before it runs out of steam.

So I think the general conclusion here is that we can probably set this to ~3 SigFigs and it'll work nicely for most people who are storing real values but don't need extreme precision or a very large range (e.g. server metrics and the like).

@jpountz (Contributor) commented Jan 18, 2016

I re-read this discussion and realized I had not really digested this comment:

On the other hand, if we just want an optimized way to encode a double, but with some restraints on range/precision, that e.g. takes advantage of things like GCD compression already implemented in docvalues, then maybe we treat it just like that, and figure out an alternative name?

If significant digits are what we are after, then maybe we should indeed keep using the original double/float/half-float representation and allow users to configure how many bits of the mantissa they want to keep. At indexing time we would zero out the last bits that we don't care about (and doc values would compress with GCD compression), and at search time we would still just have to call Float.intBitsToFloat. The only search-time overhead would be an integer multiplication due to GCD (de)compression. I assume that is what was meant?
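The index-time step can be sketched as follows (a hypothetical helper; in practice the bit budget would come from the mapping configuration):

```java
public class MantissaTruncationSketch {
    // Zero out the low (23 - bitsToKeep) mantissa bits of a float at index
    // time. Doc values can then GCD-compress the shared zero low bits, and
    // search only needs Float.intBitsToFloat to decode.
    public static float truncate(float value, int bitsToKeep) {
        int bits = Float.floatToIntBits(value);
        int mask = ~((1 << (23 - bitsToKeep)) - 1);
        return Float.intBitsToFloat(bits & mask);
    }
}
```

Keeping 10 mantissa bits, for example, leaves the value within about 2^-10 of the original relative to its magnitude while freeing 13 low bits for compression.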

For instance (assuming my math is correct) then we would have 3 significant digits for 16 bits of storage on a float, and for 12 bits of storage on a half-float.

We don't even need to make things configurable; we could just register field types for the numbers that round nicely, e.g. "float3" (3 significant figures for 16 bits) and "double4" (4 significant figures for 24 bits, maybe even 5 if we round instead of rounding down).

@rjernst (Member) commented Jan 18, 2016

I am strongly against something like "float3" and "double4" as field types. I think we need to minimize the field types we have, as there is a real cost for a user to find what data type to specify, as opposed to tweaking settings for a field type to optimize storage/query/whatever.

@rmuir (Contributor) commented Jan 18, 2016

At indexing time we would zero out the last bits that we don't care about (and doc values would compress with gcd compression) and at search time we would still just have to call Float.intBitsToFloat. The only search-time overhead would be an integer multiplication due to gcd (de)compression. I assume that it is what was meant?

I wasn't trying to imply any particular implementation, more just how we think about it. I was hoping we could narrow the issue to "let's create an alternative encoding that saves space at the expense of range/precision" but otherwise the API is the same as float/double.

I think if this is our goal, we should still consider the IEEE half-float. There would be fewer surprises, it's standardized, and there may even be hardware support in the future (e.g. Intel CPUs can at least do conversion of vectors). It seems like the lowest-hanging fruit that would very easily use only 16bpv.

It's true we could take advantage of doc values and try to invent something that is maybe more "interesting" (e.g. if a larger range is needed, use 24bpv or whatever), but maybe we should start with the standard stuff? We have to be a little careful inventing our own encoding of this stuff, and we should also take care for it to not be too fragile.

But it might be easier as a start to just add a "simple" 16bpv option and think about how we want the APIs to work, e.g. address Ryan's concerns and so on.


Labels

>feature, :Search Foundations/Mapping (Index mappings, including merging and defining field types)


8 participants