address data corruption in chunk encoding for long chunks with >=4.55 hours of nulls #1126
As for how to address this, there are 2 separate concerns: remediation and a long-term fix.

remediation (being able to decode the corrupted chunk data we currently have): this can use some more thinking, but I see 2 solutions.

long term fix
I've gone with this approach for now: when reading the first point, we clone the stream, read the upcoming dod, make adjustments as needed, and restore the stream.
Using a quick and dirty benchmark, I measured the alloc_objects overhead of the stream cloning.
(force-pushed from 656a61c to 38c708d)
I don't see any changes here to how chunks are handled. Just changes to comments/docs, a unit test and an updated dependency.
@woodsaj look at the last 2 commits, they modify tsz.go; because it's vendored, GH doesn't show the changes by default. If we take this path, I'll move these into our go-tsz fork so that dep doesn't complain.
		return true
	}
	if dod+int32(tDelta) < 0 {
		it.tDelta += 16384
Am I understanding it right that this will only work if the tDelta has wrapped around once? E.g. if we had 10h chunks, then this fix might not work if the tDelta has wrapped twice. I guess for us that's fine, it's just something we should remember.
Correct.
Note that the plan is, after putting the remediation in place, to deprecate this chunk format asap (at least for >4h chunks) and start using a chunk format that doesn't have this bug.
I wonder if that copying of .br could be saved if .dod() had a small read buffer that can store 1 return value. If there's a value in the read buffer when .dod() is called, it returns the content of the read buffer and clears it; otherwise it goes to the bstream as it does now. After that first call site of .dod(), where we know that the next call to .dod() needs to return the same value one more time, we could just put that value into the read buffer. That would save the copying of that bstream, but it makes everything a little more complicated, so it might not be worth it.
I think your current fix works for most cases. As replay already pointed out though, there is a possibility that it will wrap more than once, considering our max chunkspan is 24 hours. I think this also means we would need 17 bits to cover the entire range (24h is 86400s; 2^16 = 65536 is not enough, while 2^17 = 131072 is), if we decide to pursue that course of action.
As to the implementation, based on your numbers it doesn't seem like the extra copying / allocation is a big deal. If that changes we could look at implementing a peek function to avoid the allocations, although that of course brings its own computational overhead. Just a thought.
Other than that, it looks good to me.
Looks good to me.
I added one comment about how I think it could possibly be improved, but I'm not sure if my suggestion is really better than the current state. I would also prefer if some more people take a look at this, because I'm not fully confident about it, as this isn't simple.
(force-pushed from 38c708d to 8ab46e9)
(force-pushed from 8ab46e9 to a3e8219)
gorilla / go-tsz only has 14 bits for the delta between the t0 of the chunk and the first point. 2^14 is 16384 seconds, so about 4.55 hours.
Thus:
More detailed explanation of the problem, based on an example testcase: "no data for 5 hours, then 1 hour of 60s dense data".
It starts with t0 = 1540728000, which is stored in full.
The first point should have a delta of 5h (18000s), but due to the 14-bit overflow
we instead store a delta of 1616 (18000 - 2^14).
When decoding the first point, the timestamp should have been:
1540728000 + 18000 = 1540746000
instead we get:
1540728000 + 1616 = 1540729616
From the 2nd point onwards, we use delta-of-delta encoding.
So the 2nd point has a dod of -17940 (because the delta should change from 18000s to 60s);
this is supported fine and we store this dod.
However, at decode time, what should have happened is:
1540746000 + 18000 -17940 = 1540746060
Instead, what happens is:
1540729616 + 1616 -17940 = 1540713292
All subsequent points have dod 0, so instead of a delta of
60 + 0 = 60s
they get:
(1616 - 17940) + 0 = -16324
and they keep going back in time.