Add a fixed-point mapping type #15939
Conversation
I only skimmed through the patch but I think you are on the right track. I think we still need to sort out the following questions:
@rmuir Maybe you have opinions on this?
Why is it an issue? Don't floats have the same problem, since they can't represent most values accurately?
I think we need to be careful here. For instance if includeLower is true but parseDoubleValue rounds lowerTerm down, then we need to create a long range that has includeLower=false
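The interaction between rounding and inclusivity can be sketched like this (a hypothetical helper for illustration, not code from this patch): when a double lower bound is converted into the scaled long domain, the rounding direction must agree with `includeLower`, otherwise the encoded range admits values the user excluded.

```java
public class ScaledBounds {
    // Hypothetical sketch: encoded values are integers value * scalingFactor.
    // The rounding direction must match whether the bound is inclusive.
    static long encodedLowerBound(double lowerTerm, boolean includeLower, long scalingFactor) {
        double scaled = lowerTerm * scalingFactor;
        if (includeLower) {
            // value >= lowerTerm  =>  encoded >= ceil(scaled)
            return (long) Math.ceil(scaled);
        } else {
            // value > lowerTerm   =>  encoded >= floor(scaled) + 1
            return (long) Math.floor(scaled) + 1;
        }
    }
}
```

Rounding the other way (e.g. flooring an inclusive bound that falls between representable steps) is exactly the bug described above: the long range would then need its own `includeLower=false` to compensate.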
Actually it's even worse than I thought given that doubles are approximate as well. Maybe we should make this API take a BigDecimal.
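A sketch of why BigDecimal sidesteps the double-rounding issue (the method name and signature here are hypothetical): the decimal digits the user typed are preserved exactly, so scaling by a power of ten involves no base-2 error at all.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ExactScale {
    // Hypothetical sketch: parse the term as a BigDecimal so the decimal
    // digits the user typed survive exactly, then scale in base 10.
    static long encodeExact(String term, int decimalPlaces) {
        return new BigDecimal(term)
                .movePointRight(decimalPlaces)       // exact base-10 shift
                .setScale(0, RoundingMode.HALF_UP)   // round only once, explicitly
                .longValueExact();
    }
}
```

With a double-based API, `1.275` would already have been perturbed before the scaling code ever sees it; with BigDecimal the rounding happens exactly once, under an explicit RoundingMode.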
Shouldn't we use ParseField objects here?
Floats do, but it's a lot less noticeable in my opinion, since it happens at very extreme values. For most values that people actually work with, the error is unnoticeable. E.g. if I search for […]

It may just be me, but I find base-2 very unintuitive for the end-user: […]
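The base-2 intuition gap is easy to reproduce in a couple of lines (a standalone illustration, not code from this patch): 0.1 has no exact binary representation, so it drifts under accumulation, while the same values encoded with a base-10 scaling factor stay exact.

```java
public class Base2Surprise {
    // Summing ten copies of 0.1 in binary floating point does not give 1.0,
    // because 0.1 is not exactly representable in base 2.
    static double sumTenths() {
        double sum = 0;
        for (int i = 0; i < 10; i++) sum += 0.1;
        return sum;
    }

    // With a base-10 scaling factor of 10, "0.1" is encoded as the long 1,
    // and the same accumulation is exact.
    static long sumScaledTenths() {
        long sum = 0;
        for (int i = 0; i < 10; i++) sum += 1;
        return sum;
    }
}
```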
It depends on the goals. I will say this: if it's named `fixed`, […]. On the other hand, if we just want an optimized way to encode a `float`, […]
One thing I was thinking about: decimal encoding has the benefit that it would more likely leverage gcd compression in doc values. For instance, if 3 decimal places are configured but only one is actually used, then Lucene will figure out all values share 100 as a common factor and compress accordingly. This also means that we might not have to make the scale configurable: we could pick a fixed scale that should be enough for common use-cases and then rely on Lucene for using the right number of bits?
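The gcd trick can be sketched as follows (my own illustration of the idea, not Lucene's actual codec code): if 3 decimal places are configured (scale 1000) but only one decimal is used, every encoded value is a multiple of 100, and a codec can store `value / gcd` plus the shared factor.

```java
public class GcdCompress {
    static long gcd(long a, long b) {
        while (b != 0) { long t = a % b; a = b; b = t; }
        return a;
    }

    // The common factor shared by all encoded values; a codec can divide it
    // out and store much smaller numbers, recovering values by multiplying back.
    static long commonFactor(long[] encoded) {
        long g = 0;
        for (long v : encoded) g = gcd(g, Math.abs(v));
        return g;
    }
}
```

E.g. 1.5, 2.3, and 4.1 at scale 1000 encode as 1500, 2300, 4100, which all share the factor 100, so only 15, 23, and 41 need to be stored per document.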
Yeah, this was my goal personally. I just want better compression for when you don't need the full dynamic range of doubles. I even have a big, fat warning in the docs that this is not equivalent to […].

++ to renaming to something more clear; I have no attachment to `fixed`.
I dunno, those are all terrible.

@jpountz: Not sure I understand. Do you mean to pick a scale that provides, say, 6 significant digits and simply use that? I think that's fine on the fractional side of things (since Lucene will compress down unused bits), but it arbitrarily limits the range for not much change in complexity? Maybe I'm misunderstanding?
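As a rough back-of-envelope for the "6 significant digits" idea (my own sketch, not from the patch): distinguishing d decimal digits needs about d·log₂(10) ≈ 3.32·d bits before docvalues rounds up to its bits-per-value schedule.

```java
public class SigBits {
    // Rule of thumb: d decimal digits of precision require about
    // ceil(d * log2(10)) bits, since 10^d distinct values need log2(10^d) bits.
    static int bitsFor(int decimalDigits) {
        return (int) Math.ceil(decimalDigits * (Math.log(10) / Math.log(2)));
    }
}
```

So 3 digits costs about 10 bits and 6 digits about 20 bits, which is where the tension with range comes from: every extra bit spent on precision is a bit not available for magnitude at a given storage budget.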
OK, it makes a lot more sense if we just try to do the optimized thing and defer true […]. If we do it this way, then we use DoubleDocValues and other APIs just like now; it's just that how we encode the […] changes.

The advantage here is a wider range than a half-float, so larger numbers can be represented. We can also simplify a little bit and simply reject values like NaN/Inf and so on.

The main disadvantage over existing float/half-float is uncertainty over how much space is really being saved (especially if it's configurable, which I would avoid if possible). Those standardized formats are simpler to think about and you know what space is required. So it would be good to see numbers where less precision is used and it really saves space.

Keep in mind that in 3.0, dynamically mapped floating point values already go to `float`.
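The encoding described here can be sketched in a few lines (a hypothetical illustration, assuming a round-to-nearest policy; the real patch may round differently): a double goes to a long via a base-10 scaling factor, and non-finite values are simply rejected.

```java
public class FixedEncode {
    // Hypothetical sketch: encode a double as a long using a base-10 scaling
    // factor, rejecting non-finite input (one simplification over
    // float/half-float, which must represent NaN/Inf).
    static long encode(double value, long scalingFactor) {
        if (!Double.isFinite(value)) {
            throw new IllegalArgumentException("non-finite values are not supported");
        }
        return Math.round(value * scalingFactor);
    }

    static double decode(long encoded, long scalingFactor) {
        return (double) encoded / scalingFactor;
    }
}
```

Doc values then see plain longs, so existing compression (gcd, bit-packing) applies unchanged, and query code can keep using the DoubleDocValues-style APIs on top of `decode`.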
Also, when thinking about some alternative format like this, keep in mind the bits-per-value "schedule" used by docvalues (https://github.com/apache/lucene-solr/blob/6f15d0282f17ab49ab434c54605f4b94f6c4d037/lucene/core/src/java/org/apache/lucene/util/packed/DirectWriter.java#L154-L156). We round up to values that are efficient to encode/decode, so it does not make sense to think about saving a bit or two per value in general: e.g. using 13 bits per value is really no better than 16.
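Concretely, the schedule at the linked revision rounds a raw bit count up to the next supported width, so shaving a bit or two below one of these steps buys nothing (sketch below; the supported widths are copied from DirectWriter at that revision):

```java
public class BitsSchedule {
    // Bits-per-value widths DirectWriter supports at the linked revision;
    // anything in between is rounded up to the next one.
    static final int[] SUPPORTED = {1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64};

    static int roundBits(int bitsRequired) {
        for (int b : SUPPORTED) {
            if (b >= bitsRequired) return b;
        }
        return 64;
    }
}
```

So an encoding that needs 13 bits per value ends up occupying 16 on disk anyway, which is why precision targets should aim at the schedule's steps rather than at arbitrary bit counts.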
Makes sense, lemme run some real-world tests under varying conditions to see how much space is actually saved (will be easy to compare against float/double, and easy enough to extrapolate to half-float). I think we should still add half-float too; I opened a ticket for it a while ago: #13626
Alright, did some tests. Summary is that I think @jpountz's suggestion was right: we should consider limiting this to ~3 digits and not allow customization. In that configuration it can save considerable space for certain use-cases.

Testing methodology was: […]

The data being indexed was: […]. Values are sent to Elasticsearch as a string ([…]).

Fixed point compresses better for lower precision, as expected. At one sigfig it's about twice as good as a float: […]. The compression advantage decreases linearly until we hit 7 sigfigs, at which point it's about identical to a float: […]. And at 8+ sigfigs it is larger and no longer useful if your goal is to save space: […].

Predictably, if you increase the dynamic range of the data to +/- 10,000 (e.g. […]) it runs out of steam at fewer sigfigs, and with a very narrow range the opposite occurs, giving you more sigfigs before it runs out of steam.

So I think the general conclusion here is that we can probably set this to ~3 sigfigs and it'll work nicely for most people that are storing real values but don't need extreme precision or a very large range (e.g. server metrics and the like).
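The range/precision trade-off in these results can be estimated on paper (my own back-of-envelope, not the test harness used above): the raw bits per value is the log₂ of the number of distinct encoded values, before docvalues rounds up to its schedule.

```java
public class SpaceEstimate {
    // Back-of-envelope: raw bits needed for values in [-range, +range]
    // at a given base-10 scale, before docvalues rounds up to its
    // supported bits-per-value widths.
    static int rawBits(double range, long scale) {
        long distinct = (long) (2 * range * scale) + 1;   // count of encoded longs
        return 64 - Long.numberOfLeadingZeros(distinct);   // ceil(log2) for positive counts
    }
}
```

E.g. values in ±100 at 3 decimal places need about 18 raw bits (rounded up to 20 on disk), versus a float's fixed 32, which matches the observation that savings shrink as either the range or the precision grows.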
I re-read this discussion and realized I had not really digested this comment:
If the significant digits are what we are after, then maybe we should keep using the original double/float/half-float representation indeed and allow users to configure how many bits of the mantissa they want to keep. At indexing time we would zero out the last bits that we don't care about (and doc values would compress with gcd compression), and at search time we would still just have to call Float.intBitsToFloat. The only search-time overhead would be an integer multiplication due to gcd (de)compression. I assume that is what was meant?

For instance (assuming my math is correct) we would have 3 significant digits for 16 bits of storage on a float, and for 12 bits of storage on a half-float. We don't even need to make things configurable, we could just register field types for the numbers that round nicely, e.g. "float3" (3 significant figures for 16 bits) and "double4" (4 significant figures for 24 bits, maybe even 5 if we round instead of rounding down).
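The mantissa-truncation idea above can be sketched in a few lines (an illustration of the scheme, not code from the patch; note it truncates rather than rounds to nearest, the "rounding down" caveat mentioned above):

```java
public class MantissaTrim {
    // Sketch: zero out the low mantissa bits of a float at index time; search
    // time only needs Float.intBitsToFloat (plus gcd (de)compression in doc
    // values, since all truncated values share the cleared low bits).
    // keepBits is how many of the float's 23 explicit mantissa bits survive;
    // ~10 bits gives roughly 3 significant decimal digits, since 2^10 ≈ 10^3.
    static float truncateMantissa(float f, int keepBits) {
        int drop = 23 - keepBits;
        int mask = ~((1 << drop) - 1);   // clears the low `drop` bits
        return Float.intBitsToFloat(Float.floatToIntBits(f) & mask);
    }
}
```

Because the cleared bits are identical across all values, the stored integers share a large power-of-two factor, which is exactly what gcd compression exploits.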
I am strongly against something like "float3" and "double4" as field types. I think we need to minimize the field types we have, as there is a real cost for a user to find what data type to specify, as opposed to tweaking settings for a field type to optimize storage/query/whatever.
I wasn't trying to imply any particular implementation, more just how we think about it. I was hoping we could narrow the issue to "let's create an alternative encoding that saves space at the expense of range/precision" but otherwise the API is the same as float/double.

I think if this is our goal, we should still consider the IEEE half-float. There would be fewer surprises, it's standardized, and there may even be hardware support in the future (e.g. Intel CPUs can do conversion of vectors at least). It seems like the lowest hanging fruit, and would very easily only use 16bpv.

It's true we could take advantage of docvalues and try to invent something that is maybe more "interesting" (e.g. if a larger range is needed, use 24bpv or whatever), but maybe we should start with the standard stuff? We have to be a little careful inventing our own encoding of this stuff, and we should also take care for it to not be too fragile. But it might be easier as a start to just add a "simple" 16bpv option and think about how we want the APIs to work, e.g. address Ryan's concerns and so on.
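For reference, a minimal IEEE binary16 round-trip looks like this (my own simplified sketch: it handles only normal, finite values in half-float range, truncates rather than rounds, and skips NaN/Inf/subnormals, all of which a real implementation must handle):

```java
public class Half {
    // Convert a float to IEEE 754 binary16: 1 sign bit, 5 exponent bits
    // (bias 15), 10 mantissa bits. Normal finite values only; truncating.
    static short toHalf(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = (bits >>> 16) & 0x8000;
        int exp  = ((bits >>> 23) & 0xFF) - 127 + 15;  // rebias 127 -> 15
        int mant = (bits >>> 13) & 0x3FF;              // keep top 10 mantissa bits
        return (short) (sign | (exp << 10) | mant);
    }

    static float fromHalf(short h) {
        int sign = (h & 0x8000) << 16;
        int exp  = ((h >>> 10) & 0x1F) - 15 + 127;     // rebias 15 -> 127
        int mant = (h & 0x3FF) << 13;
        return Float.intBitsToFloat(sign | (exp << 23) | mant);
    }
}
```

Even this stripped-down version shows the appeal: a fixed, well-understood 16 bits per value with no tunables, at the cost of ~3 decimal digits of precision and a maximum magnitude of 65504.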




Adds a new `fixed` mapping type, which provides storage of real-valued numbers in a `long` via an associated scaling factor. Uses a base-10 scaling factor because base-2 can give strange results (to a human) when searching, due to it not lining up with discrete digits.

I felt very much like a bull in a china shop here... unclear if all the right things were modified/overridden, or if there is a cleaner way to do this.
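As a sketch of what the mapping might look like in practice (the field name and the `scaling_factor` parameter name here are illustrative, not the final API from this patch):

```json
{
  "mappings": {
    "properties": {
      "response_time": {
        "type": "fixed",
        "scaling_factor": 100
      }
    }
  }
}
```

Here a value like `1.27` would be stored as the long `127`, and queries would translate bounds into the scaled domain before searching.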
Related to #13625