Skip to content

Conversation

@nongli
Copy link
Contributor

@nongli nongli commented Dec 1, 2015

This bug was exposed as memory corruption in Timsort which uses copyMemory to copy
large regions that can overlap. The prior implementation did not handle this case
half the time and always copied forward, resulting in the data being corrupt.

This bug was exposed as memory corruption in Timsort which uses copyMemory to copy
large regions that can overlap. The prior implementation did not handle this case
half the time and always copied forward, resulting in the data being corrupt.
@srowen
Copy link
Member

srowen commented Dec 1, 2015

Certainly that fixes the obvious issue here. Is a single call to _UNSAFE.copyMemory safe in this respect? for example imagine moving a copying a big block of memory where dst = src - 1. Maybe a unit test or two to cover the case you're fixing and these corner cases too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some benchmarks, there is no performance difference between chunked copy than single _UNSAFE.copyMemory, agree with @srowen , single call to _UNSAFE.copyMemory should be enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting at something different -- does _UNSAFE.copyMemory work reliably with overlapping memory regions? I hope so, I imagine so. But this method's correctness still depends on it, so it's probably worth unit-testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies I don't think it was chunked this way for perf. Looking at the comment for UNSAFE_COPY_THRESHOLD, it seems to be for some other purpose. I'm not familiar with this enough to judge if we need that.

@srowen I'm looking to see if I can find docs on what unsafe does but haven't found anything definitive yet. The test case you originally suggested is very similar to what I have no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm i don't think the copy block by block is done to improve performance. I believe it was done to allow safe points when copying a very large region.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have a question. Both dstOffset and srcOffset are offsets. They are not absolute addresses. Based on the values of dstOffset and srcOffset, we are unable to know if source is behind target, right?

Based on the code changes, I do not understand how it works. Could anybody give me a hint? Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies Thank you for your explanation! The code changes look confusing for the C/C++ people, I think. Do you think we should add the comments for clarifying your point for better readability?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies copyMemory is a general function. How can we know they are from the same object or not? Like in the following function putNewKey, the function caller has to guarantee they do not break our hidden assumptions when calling the function copyMemory, right?

I am just afraid the other coders might not realize we have such an assumption in this low-level general function copyMemory. Let me know if my concern is valid. Thanks!

      // --- Append the key and value data to the current data page --------------------------------
      final Object base = currentPage.getBaseObject();
      long offset = currentPage.getBaseOffset() + pageCursor;
      final long recordOffset = offset;
      Platform.putInt(base, offset, keyLength + valueLength + 4);
      Platform.putInt(base, offset + 4, keyLength);
      offset += 8;
      Platform.copyMemory(keyBase, keyOffset, base, offset, keyLength);
      offset += keyLength;
      Platform.copyMemory(valueBase, valueOffset, base, offset, valueLength);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the "hidden assumptions"? The src and dst can be same object or not, the offset can overlap or not, we don't have a limitation for copyMemory right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JVM, two objects (or memory block) can't overlap each other (assuming the two offsets are in valid range). The only case that src and dst could overlap is they are exactly the same value (pointer), so we can only compare the offsets.

After this patch, we can use copyMemory() without worrying about the src and dst are overlapped or not, it always works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh, I see. Thank you very much!

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46967 has finished for PR 10068 at commit ebf4d10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Dec 1, 2015

Thanks! Merging to master and branch 1.6.

asfgit pushed a commit that referenced this pull request Dec 1, 2015
This bug was exposed as memory corruption in Timsort which uses copyMemory to copy
large regions that can overlap. The prior implementation did not handle this case
half the time and always copied forward, resulting in the data being corrupt.

Author: Nong Li <[email protected]>

Closes #10068 from nongli/spark-12030.

(cherry picked from commit 2cef1cd)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 2cef1cd Dec 1, 2015
@nongli nongli deleted the spark-12030 branch December 1, 2015 21:48
@rxin
Copy link
Contributor

rxin commented Dec 1, 2015

This should also go into branch-1.5.

@rxin
Copy link
Contributor

rxin commented Dec 1, 2015

(unless this problem doesn't exist)

@nongli
Copy link
Contributor Author

nongli commented Dec 2, 2015

This is only triggered when copying large (larger than threshold) regions. Seems safer to backport this than audit the older use cases.

@yhuai
Copy link
Contributor

yhuai commented Dec 2, 2015

oh, i see. We actually can merge it to branch 1.5. Before this commit, Platform.scala had not been changed since 1.5.0.

asfgit pushed a commit that referenced this pull request Dec 2, 2015
This bug was exposed as memory corruption in Timsort which uses copyMemory to copy
large regions that can overlap. The prior implementation did not handle this case
half the time and always copied forward, resulting in the data being corrupt.

Author: Nong Li <[email protected]>

Closes #10068 from nongli/spark-12030.

(cherry picked from commit 2cef1cd)
Signed-off-by: Yin Huai <[email protected]>
@yhuai
Copy link
Contributor

yhuai commented Dec 2, 2015

Merged into 1.5 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants