Hive metastore exception fail some task of connector #218

Open
shuhanding opened this issue Aug 8, 2017 · 6 comments

@shuhanding

shuhanding commented Aug 8, 2017

Hi,

We run our connectors in distributed mode with the number of tasks set to 16. Some of the tasks then begin failing one by one with a Hive metastore error (roughly 2 or 3 tasks fail per day). The connector as a whole keeps running because not all of its tasks have failed.

We can restart the failed tasks, but we want to find the root cause. Has anyone seen this error before, or can anyone offer some advice? Thanks.

The failed tasks of our connector report an error like this:
[2017-08-08 20:49:06,966] ERROR Task bid_parquet_prod_v00-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:449)
java.lang.RuntimeException: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:226)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:429)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:250)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:179)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:139)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:220)
... 12 more
Caused by: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:109)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:662)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:659)
... 4 more
Caused by: MetaException(message:Got exception: java.io.IOException Failed to move to trash: hdfs://nameservice1/topics/prod/bid_parquet_prod_v00/topic_name/year=2017/month=06/day=21/hour=14)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51637)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51596)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result.read(ThriftHiveMetastore.java:51519)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1667)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1651)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:606)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:600)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:152)
at com.sun.proxy.$Proxy52.appendPartition(Unknown Source)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:97)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:91)
at io.confluent.connect.hdfs.hive.HiveMetaStore.doAction(HiveMetaStore.java:87)
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:103)
... 6 more
[2017-08-08 20:49:06,967] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:450)
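
The manual restart mentioned above can be scripted against the Kafka Connect REST API, which exposes `POST /connectors/{name}/tasks/{id}/restart`. Below is a minimal sketch, not part of the connector. The worker address `http://localhost:8083` and the class and method names are assumptions for illustration, and the code relies on the task name in the log having the form `<connector>-<task number>`. Restarting only works around the failure; it does not address the root cause.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class TaskRestart {

    // Splits a task name as printed in the worker log, e.g.
    // "bid_parquet_prod_v00-0", into connector name and task number,
    // then builds the REST path for restarting that task.
    // Connector names with special characters would need URL encoding.
    static String restartPath(String taskName) {
        int dash = taskName.lastIndexOf('-');
        if (dash <= 0 || dash == taskName.length() - 1) {
            throw new IllegalArgumentException("Expected <connector>-<task>: " + taskName);
        }
        String connector = taskName.substring(0, dash);
        int task = Integer.parseInt(taskName.substring(dash + 1));
        return "/connectors/" + connector + "/tasks/" + task + "/restart";
    }

    // Sends an empty POST to the worker and returns the HTTP status code.
    static int restart(String workerUrl, String taskName) throws IOException {
        URL url = new URL(workerUrl + restartPath(taskName));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.getOutputStream().close(); // empty request body
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults; pass the real worker URL and task name as arguments.
        String workerUrl = args.length > 0 ? args[0] : "http://localhost:8083";
        String taskName = args.length > 1 ? args[1] : "bid_parquet_prod_v00-0";
        System.out.println("HTTP " + restart(workerUrl, taskName));
    }
}
```

A 2xx status indicates that the worker accepted the request. `GET /connectors/{name}/status` reports the state of each task afterwards.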

@User1773

User1773 commented Jan 12, 2018

Hello,

I have the same issue (see the trace below). Could someone offer some advice and an explanation? Thanks a lot!
I am using Kafka 0.10.1.0 and Confluent 3.1.1.

[2018-01-12 14:00:00,264] ERROR Task SYS_4G_PCMD_RAW-7 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:404)
java.lang.RuntimeException: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:239)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:384)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:240)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:233)
... 12 more
Caused by: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:136)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:662)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:659)
... 4 more
Caused by: MetaException(message:Got exception: java.io.IOException Failed to move to trash: hdfs://namenodeHA/apps/hdfs-writer/warehouse/CA4MN/topics/SYS_4G_PCMD_RAW/pktime=20180112140000)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51637)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51596)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result.read(ThriftHiveMetastore.java:51519)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1667)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1651)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:606)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:600)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:152)
at com.sun.proxy.$Proxy53.appendPartition(Unknown Source)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:124)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:118)
at io.confluent.connect.hdfs.hive.HiveMetaStore.doAction(HiveMetaStore.java:114)
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:130)
... 6 more
[2018-01-12 14:00:00,265] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:405)

@RussWinkler

RussWinkler commented Jan 25, 2018

We have also been experiencing this issue on a single Kafka Connect instance running in distributed mode (by itself). It seems to occur only when a message arrives with a new value for the field the topic is being partitioned on. We are on Kafka 0.10.1.1 and Confluent 3.1.2.

Task prod-connector-17 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:404)
java.lang.RuntimeException: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:226)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:384)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:240)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:220)
... 12 more
Caused by: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:109)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:662)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:659)
... 4 more
Caused by: MetaException(message:Got exception: java.io.IOException Failed to move to trash: hdfs://namenode:9000/user/topics/event/day=2018-01-25)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51637)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51596)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result.read(ThriftHiveMetastore.java:51519)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1667)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1651)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:606)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:600)
at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:152)
at com.sun.proxy.$Proxy52.appendPartition(Unknown Source)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:97)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:91)
at io.confluent.connect.hdfs.hive.HiveMetaStore.doAction(HiveMetaStore.java:87)
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:103)
... 6 more

@User1773

Let me give you more information about the issue. We have 5 connectors, each listening to one topic, and 8 workers writing in parallel. The issue does not happen with every connector, only with one of them. It is a real problem for us because the connector is systematically rebalanced in our distributed-worker setup, and during a rebalance the workers stop writing. That delays the writing process, which is bad because we have to make sure the data is available in less than 5 minutes.

@User1773

Could someone help? This issue has been known since the beginning of August 2017.

@prasanna1433

prasanna1433 commented Feb 28, 2018

We are also seeing this issue. It is critical for us because the connector is losing data. We were able to find the missing records in HDFS under .Trash/, in the same folder structure as the original path but prefixed with .Trash/.

We are thinking that we could catch the exception thrown in the function addHivePartition(final String location) and handle it so that the data ends up in the intended folder rather than in the .Trash folder.
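
One way to picture "catch and handle" is a bounded retry around the metastore call. The sketch below is generic Java, not the connector's code: `RetrySketch` and `withRetries` are hypothetical names, and the lambda in `main` only simulates a call that fails twice. Whether retrying is actually safe here is an open question, since nothing in this thread establishes why the metastore tries to move the partition directory to trash in the first place.

```java
import java.util.concurrent.Callable;

class RetrySketch {

    // Runs the action, retrying up to maxAttempts times with a fixed pause
    // between attempts. The last failure is rethrown so the caller can
    // still fail the task if every attempt fails.
    static <T> T withRetries(Callable<T> action, int maxAttempts, long pauseMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                System.err.println("Attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMs);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real metastore call: fails twice, then succeeds.
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("Hive MetaStore exception (simulated)");
            }
            return "partition added";
        }, 5, 100L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Rethrowing the last exception keeps today's behavior as the fallback: if every attempt fails, the task still dies and shows up as FAILED instead of silently skipping the partition.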

In some cases, if we increase rotate.schedule.interval.ms to more than an hour, we do not see the data being moved as often, but the exception still occurs and the data is sometimes moved to .Trash.

Please let me know if you can think of any other options that we can try here @Cricket007 @ewencp @kkonstantine @maxzheng

@prasanna1433

@ewencp @aayars Have you come across this issue? It is still open. After we changed rotate.schedule.interval.ms to more than an hour, we no longer saw items being moved to the .Trash folder, but the error still occurs regularly and we do see occasional data loss.
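
For reference, the setting described above would look roughly like this in the connector configuration. This is only a fragment: the value of two hours is an assumed example of "more than an hour", and all other connector settings are omitted.

```properties
# Assumed example value: 2 hours, in milliseconds.
rotate.schedule.interval.ms=7200000
```

Going by the reports in this thread, this makes the problem less frequent but does not fix it.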
