Skip to content

Conversation

@BsoBird
Copy link
Contributor

@BsoBird BsoBird commented Dec 14, 2023

fix #4935

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 14, 2023

@deniskuzZ Hi, I resubmitted my PR and made the changes you suggested.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 14, 2023

@deniskuzZ However, although this solves the problem, I have another doubt. If we use static cache, then the elements in SplitGroup.cache[] will never be deleted, when the user changes the partition, the elements here are not updated, so is there a problem?
For PRE-JOB this is not too much of a problem, but for LLAP it is clearly very problematic.
So. really. maybe set cache is non-static is a not bad idea.At least this way, the data in the cache won't be leaked.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 14, 2023

@zhangbutao Hi butao. I thought we could discuss this together.

@zhangbutao
Copy link
Contributor

@deniskuzZ However, although this solves the problem, I have another doubt. If we use static cache, then the elements in SplitGroup.cache[] will never be deleted, when the user changes the partition, the elements here are not updated, so is there a problem? For PRE-JOB this is not too much of a problem, but for LLAP it is clearly very problematic. So. really. maybe set cache is non-static is a not bad idea.At least this way, the data in the cache won't be leaked.

Yes, I think the static cache will never be cleaned for LLAP, so this maybe lead some issue. But it seems this cache has been in there for long long time (HIVE-8409 -> HIVE-9976), and i don't have deep knowledge about this code at present. I'd like to hear others opinions.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 14, 2023

@deniskuzZ However, although this solves the problem, I have another doubt. If we use static cache, then the elements in SplitGroup.cache[] will never be deleted, when the user changes the partition, the elements here are not updated, so is there a problem? For PRE-JOB this is not too much of a problem, but for LLAP it is clearly very problematic. So. really. maybe set cache is non-static is a not bad idea.At least this way, the data in the cache won't be leaked.

Yes, I think the static cache will never be cleaned for LLAP, so this maybe lead some issue. But it seems this cache has been in there for long long time (HIVE-8409 -> HIVE-9976), and i don't have deep knowledge about this code at present. I'd like to hear others opinions.

yea. Currently, the solution I'm using in our production environment is to use a non-static cache. So far it's working fine and I haven't observed any OOM issues due to the large number of partitions. Incidentally, our ICEBERG table has close to 100,000 partitions.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 14, 2023

@deniskuzZ However, although this solves the problem, I have another doubt. If we use static cache, then the elements in SplitGroup.cache[] will never be deleted, when the user changes the partition, the elements here are not updated, so is there a problem? For PRE-JOB this is not too much of a problem, but for LLAP it is clearly very problematic. So. really. maybe set cache is non-static is a not bad idea.At least this way, the data in the cache won't be leaked.

Yes, I think the static cache will never be cleaned for LLAP, so this maybe lead some issue. But it seems this cache has been in there for long long time (HIVE-8409 -> HIVE-9976), and i don't have deep knowledge about this code at present. I'd like to hear others opinions.

@zhangbutao Hi. butao.
I read the code for HIVE-16079, and I think that we can use non-static caching without re-triggering the OOM problem. This is because HIVE-16079 handles this by INTERNER all strings and properties objects. For example, CopyOnFirstWriteProperties::INTERNER. Since we already have a static cache based on the JVM, I don't think it's necessary to set SplitGroup.cache[] to be static, given the scenario in which LLAP works.

@sonarqubecloud
Copy link

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 19, 2023

@deniskuzZ Hi. Can we discuss this issue further? I'm always worried that we'll forget about it over time. Tks.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 19, 2023

@deniskuzZ If it is indeed too late to discuss and review this patch in the near future, then please at least add it to the https://issues.apache.org/jira/browse/HIVE-27945

@deniskuzZ
Copy link
Member

@deniskuzZ If it is indeed too late to discuss and review this patch in the near future, then please at least add it to the https://issues.apache.org/jira/browse/HIVE-27945

apologies @BsoBird, it's been a busy week, please give me some time to validate this. Note: we'll probably fast-forward branch-4 in Jan to include recent bug-fixes from master

@deniskuzZ
Copy link
Member

yea. Currently, the solution I'm using in our production environment is to use a non-static cache.

any noticeable performance degradation?

@deniskuzZ
Copy link
Member

@deniskuzZ However, although this solves the problem, I have another doubt. If we use static cache, then the elements in SplitGroup.cache[] will never be deleted, when the user changes the partition, the elements here are not updated, so is there a problem? For PRE-JOB this is not too much of a problem, but for LLAP it is clearly very problematic. So. really. maybe set cache is non-static is a not bad idea.At least this way, the data in the cache won't be leaked.

Yes, I think the static cache will never be cleaned for LLAP, so this maybe lead some issue. But it seems this cache has been in there for long long time (HIVE-8409 -> HIVE-9976), and i don't have deep knowledge about this code at present. I'd like to hear others opinions.

@zhangbutao Hi. butao. I read the code for HIVE-16079, and I think that we can use non-static caching without re-triggering the OOM problem. This is because HIVE-16079 handles this by INTERNER all strings and properties objects. For example, CopyOnFirstWriteProperties::INTERNER. Since we already have a static cache based on the JVM, I don't think it's necessary to set SplitGroup.cache[] to be static, given the scenario in which LLAP works.

@deniskuzZ However, although this solves the problem, I have another doubt. If we use static cache, then the elements in SplitGroup.cache[] will never be deleted, when the user changes the partition, the elements here are not updated, so is there a problem? For PRE-JOB this is not too much of a problem, but for LLAP it is clearly very problematic. So. really. maybe set cache is non-static is a not bad idea.At least this way, the data in the cache won't be leaked.

Yes, I think the static cache will never be cleaned for LLAP, so this maybe lead some issue. But it seems this cache has been in there for long long time (HIVE-8409 -> HIVE-9976), and i don't have deep knowledge about this code at present. I'd like to hear others opinions.

Could we replace this map-based cache with caffeine cache?

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 22, 2023

yea. Currently, the solution I'm using in our production environment is to use a non-static cache.

any noticeable performance degradation?

I think... no. At least now I think LLAP runs faster than spark. Maybe I haven't observed the impact of using non-static caching.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 22, 2023

Could we replace this map-based cache with caffeine cache?

I think we can not do that. Because if we just change it to caffeine, it will still trigger properties.equals().
If we don't use static-cache, there is no point in using caffeine at all.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 22, 2023

@deniskuzZ If it is indeed too late to discuss and review this patch in the near future, then please at least add it to the https://issues.apache.org/jira/browse/HIVE-27945

apologies @BsoBird, it's been a busy week, please give me some time to validate this. Note: we'll probably fast-forward branch-4 in Jan to include recent bug-fixes from master

That's great news, sir.

@deniskuzZ
Copy link
Member

Could we replace this map-based cache with caffeine cache?

I think we can not do that. Because if we just change it to caffeine, it will still trigger properties.equals(). If we don't use static-cache, there is no point in using caffeine at all.

Cache is currently reused between multiple tasks. Caffeine was proposed as a way to define an eviction policy. I am checking the code cause existing structure Map<Map<Path, PartitionDesc>, Map<Path, PartitionDesc>> looks very weird.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 22, 2023

Could we replace this map-based cache with caffeine cache?

I think we can not do that. Because if we just change it to caffeine, it will still trigger properties.equals(). If we don't use static-cache, there is no point in using caffeine at all.

Cache is currently reused between multiple tasks. Caffeine was proposed as a way to define an eviction policy. I am checking the code cause existing structure Map<Map<Path, PartitionDesc>, Map<Path, PartitionDesc>> looks very weird.

yea. i think so... very strange

@deniskuzZ
Copy link
Member

deniskuzZ commented Dec 22, 2023

Could we replace this map-based cache with caffeine cache?

I think we can not do that. Because if we just change it to caffeine, it will still trigger properties.equals(). If we don't use static-cache, there is no point in using caffeine at all.

Cache is currently reused between multiple tasks. Caffeine was proposed as a way to define an eviction policy. I am checking the code cause existing structure Map<Map<Path, PartitionDesc>, Map<Path, PartitionDesc>> looks very weird.

yea. i think so... very strange

from HIVE-9976

There's a static cache in there which seems a little strange. Will create a follow up jira to investigate this. For now I've changed that to a ConcurrentMap since split generation can run in parallel.

is there an existing test that makes use of that cache in getFromPathRecursively?

@deniskuzZ
Copy link
Member

deniskuzZ commented Dec 23, 2023

in the whole project there is just single test that uses that cache and not from HiveSplitGenerator
https://github.com/apache/hive/pull/4969/files
http://ci.hive.apache.org/job/hive-precommit/job/PR-4969/1/

mvn test -Dtest=TestHiveFileFormatUtils#testGetPartitionDescFromPathRecursively

we should add the q test that would trigger that part of code

@BsoBird I think it's ok to remove static since this cache has static modifier only because the original signature of generateGroupedSplits (before HIVE-8409 refactor) was also static. Also no need for ConcurrentHashMap in that case.

public static Multimap<Integer, InputSplit> generateGroupedSplits 

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 24, 2023

in the whole project there is just single test that uses that cache and not from HiveSplitGenerator https://github.com/apache/hive/pull/4969/files http://ci.hive.apache.org/job/hive-precommit/job/PR-4969/1/

mvn test -Dtest=TestHiveFileFormatUtils#testGetPartitionDescFromPathRecursively

we should add the q test that would trigger that part of code

@BsoBird I think it's ok to remove static since this cache has static modifier only because the original signature of generateGroupedSplits (before HIVE-8409 refactor) was also static. Also no need for ConcurrentHashMap in that case.

public static Multimap<Integer, InputSplit> generateGroupedSplits 

So, are we going to go with the first version of the programme?

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 24, 2023

Hi @deniskuzZ . I will reopen the previous PR if we are sure to go with the first version of the programme.

@zhangbutao
Copy link
Contributor

in the whole project there is just single test that uses that cache and not from HiveSplitGenerator https://github.com/apache/hive/pull/4969/files http://ci.hive.apache.org/job/hive-precommit/job/PR-4969/1/

mvn test -Dtest=TestHiveFileFormatUtils#testGetPartitionDescFromPathRecursively

we should add the q test that would trigger that part of code

@BsoBird I think it's ok to remove static since this cache has static modifier only because the original signature of generateGroupedSplits (before HIVE-8409 refactor) was also static. Also no need for ConcurrentHashMap in that case.

public static Multimap<Integer, InputSplit> generateGroupedSplits 

Wow! Definitely a good catch! Remove static & replace ConcurrentHashMap with HashMap should be a right way.

@BsoBird
Copy link
Contributor Author

BsoBird commented Dec 25, 2023

@deniskuzZ @zhangbutao Actually, I think it would be ideal to provide a unified cache that all code that needs to use the partitioned cache should reference. The cache needs to maintain correctness. Similar to HIVE-16079. But this may require some extra work. We can fix this first and then introduce a jira that improves it.

@BsoBird BsoBird closed this Dec 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants