@ishnagy ishnagy commented Oct 21, 2025

What changes were proposed in this pull request?

I am proposing to remove the tight dependency coupling between hive-exec and hadoop-yarn-registry (to address HIVE-29284).

Why are the changes needed?

hive-exec pulls in hadoop-yarn-registry as a direct dependency, but the registry classes are only used to build a local resource map for LLAP. This map can be built from class name literals alone, without loading the actual classes. Removing hadoop-yarn-registry as a direct dependency avoids pulling in its whole transitive dependency tree for consumers that only want hive-exec functionality without LLAP (e.g., Apache Spark).
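
As a rough illustration of the "class name literals only" idea, the Java sketch below finds where a class would come from by looking up its `.class` resource, so the class itself is never loaded. `JarLocator`, `resourcePath`, `locate`, and `jarFileOf` are hypothetical names made up for this example, not code from this patch, and the registry class name in the usage note is an assumption based on the package discussed in this thread.

```java
import java.net.URL;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Hive or of this patch.
final class JarLocator {
    private JarLocator() {}

    /** Maps a fully qualified class name to its classpath resource path. */
    static String resourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /**
     * Looks up the .class resource for a class name literal.
     * getResource only searches the classpath; it does not load or initialize the class.
     */
    static Optional<URL> locate(ClassLoader loader, String className) {
        return Optional.ofNullable(loader.getResource(resourcePath(className)));
    }

    /** For a "jar:<jar-url>!/<entry>" resource URL, returns the "<jar-url>" part. */
    static Optional<String> jarFileOf(URL resource) {
        String s = resource.toExternalForm();
        int bang = s.indexOf("!/");
        if (!s.startsWith("jar:") || bang < 0) {
            return Optional.empty(); // e.g. a class directory or the JDK runtime image
        }
        return Optional.of(s.substring("jar:".length(), bang));
    }
}
```

With this, something like `JarLocator.locate(loader, "org.apache.hadoop.registry.client.api.RegistryOperations").flatMap(JarLocator::jarFileOf)` would yield the registry jar when it is on the classpath and an empty result otherwise, without a compile-time reference to the class.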

Does this PR introduce any user-facing change?

No

How was this patch tested?

This patch will be tested by the pre-merge tests executed for this pull request.

@deniskuzZ
Member

hadoop-yarn-registry is also not needed for clusters managed by Kubernetes.

@ishnagy
Author

ishnagy commented Nov 2, 2025

Hi @deniskuzZ

thanks for the comments,
and sorry for the delayed reply, I was AFK for the last few days.

I'll add my responses to the individual threads.

@ishnagy
Author

ishnagy commented Nov 3, 2025

> hadoop-yarn-registry is also not needed for clusters managed by Kubernetes.

@deniskuzZ
I tried to look for a k8s related hadoop-yarn-registry dependency, but I don't seem to find where it is. Can you help me with some pointers where to look for this?

@sonarqubecloud

sonarqubecloud bot commented Nov 3, 2025

deniskuzZ previously approved these changes Nov 5, 2025
Member

@deniskuzZ deniskuzZ left a comment

LGTM +1

@deniskuzZ
Member

deniskuzZ commented Nov 5, 2025

>> hadoop-yarn-registry is also not needed for clusters managed by Kubernetes.
>
> @deniskuzZ I tried to look for a k8s related hadoop-yarn-registry dependency, but I don't seem to find where it is. Can you help me with some pointers where to look for this?

Kubernetes manages resources on its own; it doesn't need YARN.
cc @abstractdog, please correct me if I am wrong

@deniskuzZ deniskuzZ dismissed their stale review November 5, 2025 15:49

would the jar still be present in Hive dist/packaging?

@deniskuzZ
Member

deniskuzZ commented Nov 5, 2025

@ishnagy, would the jar still be included in the Hive distribution tar.gz? Try building with the -Pdist profile and check whether it is still present in the lib directory.

@abstractdog
Contributor

>>> hadoop-yarn-registry is also not needed for clusters managed by Kubernetes.
>>
>> @deniskuzZ I tried to look for a k8s related hadoop-yarn-registry dependency, but I don't seem to find where it is. Can you help me with some pointers where to look for this?
>
> Kubernetes manages resources on its own, it doesn't need YARN cc @abstractdog, please correct me if i am wrong

YARN is not needed, correct

regarding the pom changes: see this comment in the scope of TEZ-4008, please try to update dependencies to hadoop-registry if possible (if this part is already changed :) )
the reason we use such dependency is that we reuse the ServiceRecord class from the registry module (regardless of the underlying framework, which takes care of resource management)

@deniskuzZ
Member

deniskuzZ commented Nov 6, 2025

@abstractdog, @ayushtkn, I can’t find RegistryOperations.class in any of the Hive dist jars under the lib directory. Is it pulled in from Hadoop?

find . -name "*.jar" -exec sh -c 'jar tf "$1" | grep -q "RegistryOperations.class" && echo "$1"' _ {} \; 

@ishnagy, are you using the core classifier for the hive-exec jar in Spark? I think the whole refactor is not needed.

@abstractdog
Contributor

> @abstractdog, @ayushtkn, I can’t find RegistryOperations.class in any of the Hive dist jars under the lib directory. Is it pulled in from Hadoop?
>
> find . -name "*.jar" -exec sh -c 'jar tf "$1" | grep -q "RegistryOperations.class" && echo "$1"' _ {} \;
>
> @ishnagy, are you using the core classifier for the hive-exec jar in Spark? I think the whole refactor is not needed.

RegistryOperations is just a "random" class used to locate the registry jar, I believe; what we really use from that artifact is ServiceRecord

@ishnagy
Author

ishnagy commented Nov 6, 2025

> @ishnagy, would the jar still be included in the Hive distribution tar.gz? Try building with the -Pdist profile and check whether it is still present in the lib directory.

checking

@ishnagy
Author

ishnagy commented Nov 6, 2025

oh, ok, I think I get the source of the confusion now (or at least some of it).

hadoop-YARN-registry is just a historical artifact in naming. It started out as something YARN-specific, but the functionality later proved generic enough to be pulled up into hadoop-registry (without any additional YARN deps). The old artifact name was kept as a compatibility measure: a dummy "wrapper" artifact that depends only on hadoop-registry.

(I thought you were telling me to remove some additional "yarn" deps from some other k8s-related component)

@ishnagy
Author

ishnagy commented Nov 6, 2025

I have found the registry classes only in hive-jdbc-4.2.0-SNAPSHOT-standalone.jar, both before and after my patch:

% find apache-hive-4.2.0*bin | grep jar$ | xargs -I% bash -c 'unzip -Z1 % | grep org/apache/hadoop/registry/ | sed "s#^#%:#"' | cut -d: -f1 | sort | uniq -c
    130 apache-hive-4.2.0-SNAPSHOT-bin.master/jdbc/hive-jdbc-4.2.0-SNAPSHOT-standalone.jar
    130 apache-hive-4.2.0-SNAPSHOT-bin.patched/jdbc/hive-jdbc-4.2.0-SNAPSHOT-standalone.jar

not sure if that's what's expected, but at least the patch doesn't seem to remove any Hadoop registry classes from the distributables.

*I used

mvn install -DskipTests -Pdist

to build the project.
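
For anyone who wants to repeat this check without the shell pipeline, here is a rough Java equivalent that uses only the JDK's zip support. `RegistryClassCounter`, `countEntries`, and `scan` are hypothetical names for this sketch. Like the `grep` above, it does a substring match on entry names, so directory entries are counted too.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; not part of Hive.
final class RegistryClassCounter {
    private RegistryClassCounter() {}

    /** Counts the entries of one jar whose names contain the given substring. */
    static long countEntries(Path jar, String needle) throws IOException {
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            return zip.stream().filter(e -> e.getName().contains(needle)).count();
        }
    }

    /** Walks a directory tree and reports each jar that has at least one match. */
    static Map<String, Long> scan(Path root, String needle) throws IOException {
        List<Path> jars;
        try (Stream<Path> paths = Files.walk(root)) {
            jars = paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".jar"))
                        .collect(Collectors.toList());
        }
        Map<String, Long> hits = new TreeMap<>(); // sorted, like `sort | uniq -c`
        for (Path jar : jars) {
            long n = countEntries(jar, needle);
            if (n > 0) {
                hits.put(jar.toString(), n);
            }
        }
        return hits;
    }
}
```

Calling `RegistryClassCounter.scan(distDir, "org/apache/hadoop/registry/")` on an unpacked distribution directory should list the same jars as the pipeline, keyed by path with the match count as the value.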

@ishnagy
Author

ishnagy commented Nov 6, 2025

https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/pom.xml

  <dependencies>

    <!-- The registry moved to Hadoop commons, this is just a stub pom. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-registry</artifactId>
    </dependency>

  </dependencies>

@deniskuzZ
Member

deniskuzZ commented Nov 6, 2025

there are no registry classes in Hive dist. maybe we should just change the scope for hadoop-registry to provided and skip the code refactor?

@ishnagy
Author

ishnagy commented Nov 6, 2025

> there are no registry classes in Hive dist. maybe we should just change the scope for hadoop-registry to provided and skip the code refactor?

hmm, it may work.

I'll have to rerun a few builds to check if this is enough from the spark side. I'll try to post an update tomorrow with my results.

@ishnagy
Author

ishnagy commented Nov 6, 2025

> there are no registry classes in Hive dist. [...]

not to abandon the "provided" dep method,
but in the meantime,
I'm a bit confused about the terminology, isn't hive-jdbc-4.2.0-SNAPSHOT-standalone.jar part of the "Hive distribution"?

@deniskuzZ
Member

deniskuzZ commented Nov 6, 2025

> not to abandon the "provided" dep method, but in the meantime, I'm a bit confused about the terminology, isn't hive-jdbc-4.2.0-SNAPSHOT-standalone.jar part of the "Hive distribution"?

It is, but I don’t understand why the JDBC jar ends up shading hadoop-yarn-server-resourcemanager. It was introduced by HIVE-19956, but the description is vague. It might just be a legacy leftover, because I don't see how it relates to JDBC at all.

In any case, your issue is with hive-exec bringing transitive dependencies, isn't it?
