Conversation

@pvary (Contributor) commented Apr 9, 2021:

In #2437 we seem to be hitting an issue where the tests fail because of resource problems.

GitHub infra handles the timeouts in this case, but when a timeout occurs, the logs are not collected.

We should handle the timeouts in the tests themselves. This is good practice, and in this case it might let us squeeze more information about the problems out of the logs.

I used a 40 s timeout in most tests (they usually run in 5-10 s on my local machine), and 100 s for the longer tests (they run in 20-30 s on my local machine). Let's see how they behave on the CI infra.

@github-actions bot added the MR label Apr 9, 2021
@marton-bod (Collaborator) commented:
Can we use

@Rule
public Timeout timeout = new Timeout(40000, TimeUnit.MILLISECONDS);

as a global timeout rule in the class, and only override it for that single test method where more time is needed?

@pvary (Contributor, Author) commented Apr 9, 2021:

> Can we use
>
> @Rule
> public Timeout timeout = new Timeout(40000, TimeUnit.MILLISECONDS);
>
> as a global timeout rule in the class, and only override it for that single test method where more time is needed?

I did not know about this before. Added the timeout via a global rule.
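For context, a global Timeout rule works by imposing a time limit on every test in the class; conceptually it amounts to running the test body with a deadline and failing once that deadline expires. Below is a minimal stdlib sketch of that mechanism. The `runWithTimeout` helper is hypothetical, written only to illustrate the idea, and is not JUnit's actual implementation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {
  // Runs the test body on a worker thread and reports failure if it does
  // not finish within the limit, roughly what a global timeout rule does.
  static boolean runWithTimeout(Runnable testBody, long millis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> result = executor.submit(testBody);
    try {
      result.get(millis, TimeUnit.MILLISECONDS);
      return true;                    // finished in time
    } catch (TimeoutException e) {
      result.cancel(true);            // interrupt the stuck test thread
      return false;                   // report a timeout failure
    } catch (Exception e) {
      return false;                   // the body itself threw
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("fast test passes: " + runWithTimeout(() -> { }, 1000));
    System.out.println("slow test fails:  " + runWithTimeout(() -> {
      try { Thread.sleep(5000); } catch (InterruptedException ignored) { }
    }, 100));
  }
}
```

The advantage over relying on the CI-level kill is that the test fails in-process, so the test runner can still collect and report the logs.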

@pvary (Contributor, Author) commented Apr 12, 2021:

Based on the logs, the current issue is that the HMS thread pool is exhausted:

StdErr [pool-33-thread-1] WARN org.apache.thrift.server.TThreadPoolServer - Task has been rejected by ExecutorService 10 times till timedout, reason: java.util.concurrent.RejectedExecutionException: Task org.apache.thrift.server.TThreadPoolServer$WorkerProcess@4c3ca2ea rejected from java.util.concurrent.ThreadPoolExecutor@7dcefa7e[Running, pool size = 5, active threads = 5, queued tasks = 0, completed tasks = 12]
StdErr [24fd6de1-23be-4f21-8aab-b48ebd01f071 Time-limited test] WARN hive.metastore - set_ugi() not successful, Likely cause: new client talking to old server. Continuing without it.
StdErr org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
StdErr 	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
StdErr 	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
StdErr 	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
StdErr 	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
StdErr 	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
StdErr 	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
StdErr 	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:4247)
StdErr 	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:4233)
StdErr 	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:498)
StdErr 	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:247)
StdErr 	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
StdErr 	at jdk.internal.reflect.GeneratedConstructorAccessor169.newInstance(Unknown Source)
StdErr 	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
StdErr 	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
StdErr 	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1707)
StdErr 	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:83)
StdErr 	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133)
StdErr 	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
StdErr 	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3600)
StdErr 	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3652)
StdErr 	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3632)
StdErr 	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:493)
StdErr 	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
StdErr 	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
StdErr 	at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
StdErr 	at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
StdErr 	at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
StdErr 	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
StdErr 	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:500)
StdErr 	at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:265)
StdErr 	at org.apache.iceberg.mr.hive.TestHiveShell.executeStatement(TestHiveShell.java:138)
StdErr 	at org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine.testJoinTablesSupportedTypes(TestHiveIcebergStorageHandlerWithEngine.java:266)

This might or might not be the root cause.
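The `RejectedExecutionException` above is standard `ThreadPoolExecutor` behavior once every worker is busy and the queue cannot absorb another task (note `pool size = 5, active threads = 5, queued tasks = 0` in the log). A stdlib sketch reproducing that failure mode with a fixed pool of 5; this is an analogy, not the actual TThreadPoolServer setup:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionDemo {
  // Occupies every worker in a fixed-size pool (no task queue), then
  // submits one more task; returns true if that extra task is rejected.
  static boolean extraTaskRejected(int poolSize) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < poolSize; i++) {      // block all workers
        pool.execute(() -> {
          try { release.await(); } catch (InterruptedException ignored) { }
        });
      }
      pool.execute(() -> { });                  // one task too many
      return false;
    } catch (RejectedExecutionException e) {
      return true;                              // same failure mode as the log
    } finally {
      release.countDown();
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("6th task rejected by pool of 5: " + extraTaskRejected(5));
  }
}
```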

@pvary (Contributor, Author) commented Apr 13, 2021:

We need the pool size change; the tests failed with the original pool size a second time.

@massdosage (Contributor) left a comment:
LGTM, fun!

@pvary (Contributor, Author) commented Apr 13, 2021:

@RussellSpitzer: Could you please review?

Thanks,
Peter

@RussellSpitzer (Member) left a comment:
Looks reasonable to me :)

  public void start() {
    // Create a copy of the HiveConf for the metastore
-   metastore.start(new HiveConf(hs2Conf));
+   metastore.start(new HiveConf(hs2Conf), 10);
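The diff raises the metastore's worker pool to 10 (assuming the second argument to `metastore.start` is the pool size, as the surrounding discussion suggests). In `ThreadPoolExecutor` terms, the effect is simply more headroom for concurrent clients. A stdlib sketch of the before/after behavior with, say, 6 concurrent connections:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizeDemo {
  // Returns true if `tasks` concurrent blocking tasks all fit into a
  // fixed pool of `poolSize` workers with no task queue.
  static boolean allAccepted(int poolSize, int tasks) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        pool.execute(() -> {
          try { release.await(); } catch (InterruptedException ignored) { }
        });
      }
      return true;
    } catch (RejectedExecutionException e) {
      return false;                   // pool exhausted, task rejected
    } finally {
      release.countDown();
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("6 clients, pool of 5:  " + allAccepted(5, 6));   // false
    System.out.println("6 clients, pool of 10: " + allAccepted(10, 6));  // true
  }
}
```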
A Contributor commented:

Do we know why the pool is exhausted? In the past, we had a few leaks in the Spark catalog code which led to this. It can also be a valid use case if we simply need a larger pool.

A Contributor commented:

Maybe it's worth creating an issue if we want to take a look later.

A Contributor commented:

Or is it related to the recent change around how we manage the pool?

@pvary (Contributor, Author) commented:

I have two ideas:

  • Maybe this was caused by "Hive: Configure catalog type on table level" (#2129), where we started to handle/cache the catalogs differently.
  • Or we added some more queries, and since Hive creates/caches HMSClients, we might have ended up using a different code path.

Created #2474

@pvary (Contributor, Author) commented Apr 14, 2021:

Thanks @marton-bod, @massdosage and @RussellSpitzer for the review!
