Hive metastore exception fail some task of connector #218

shuhanding · 2017-08-08T21:35:22Z

Hi,

We run our connectors in distributed mode with the number of tasks set to 16. Some of the tasks fail one after another with a Hive metastore error (roughly 2 or 3 tasks fail per day). The connector keeps running because not all of its tasks have failed.

We can restart the failed tasks, but we want to find the root cause. Has anyone seen this error before, or can anyone offer some advice? Thanks.
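For reference, restarting a single failed task can be done through the Kafka Connect REST API (`POST /connectors/{name}/tasks/{id}/restart`). The following is a minimal sketch, not a fix for the underlying error. The worker URL `http://localhost:8083` and the class and method names are assumptions for illustration. The connector name and task id come from the task name `bid_parquet_prod_v00-0` in the log. Connector names with special characters would need URL encoding, which this sketch skips.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: restarts one task of a connector via the Connect REST API.
final class RestartFailedTask {

    // Builds the restart endpoint for a task, e.g. .../connectors/<name>/tasks/<id>/restart
    static URI restartUri(String workerUrl, String connector, int taskId) {
        String base = workerUrl.endsWith("/")
                ? workerUrl.substring(0, workerUrl.length() - 1)
                : workerUrl;
        return URI.create(base + "/connectors/" + connector + "/tasks/" + taskId + "/restart");
    }

    // Sends the POST request and returns the HTTP status code reported by the worker.
    static int restart(String workerUrl, String connector, int taskId) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) restartUri(workerUrl, connector, taskId).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed worker address; replace with the REST address of one of your workers.
        int status = restart("http://localhost:8083", "bid_parquet_prod_v00", 0);
        System.out.println("Restart request returned HTTP " + status);
    }
}
```

A restart only brings the task back; if the metastore error recurs, the task will fail again.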

The failed tasks of our connector report an error like this:
[2017-08-08 20:49:06,966] ERROR Task bid_parquet_prod_v00-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:449)
java.lang.RuntimeException: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:226)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:429)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:250)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:179)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:139)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:220)
... 12 more
Caused by: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:109)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:662)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:659)
... 4 more
Caused by: MetaException(message:Got exception: java.io.IOException Failed to move to trash: hdfs://nameservice1/topics/prod/bid_parquet_prod_v00/topic_name/year=2017/month=06/day=21/hour=14)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51637)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51596)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result.read(ThriftHiveMetastore.java:51519)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1667)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1651)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:606)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:600)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:152)
at com.sun.proxy.$Proxy52.appendPartition(Unknown Source)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:97)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:91)
at io.confluent.connect.hdfs.hive.HiveMetaStore.doAction(HiveMetaStore.java:87)
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:103)
... 6 more
[2017-08-08 20:49:06,967] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:450)

User1773 · 2018-01-12T15:59:00Z

Hello,

I have the same issue (see the trace below). Could someone offer some advice or an explanation? Thanks a lot!
I am using Kafka 0.10.1.0 and Confluent 3.1.1.

[2018-01-12 14:00:00,264] ERROR Task SYS_4G_PCMD_RAW-7 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:404)
java.lang.RuntimeException: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:239)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:384)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:240)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:233)
... 12 more
Caused by: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:136)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:662)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:659)
... 4 more
Caused by: MetaException(message:Got exception: java.io.IOException Failed to move to trash: hdfs://namenodeHA/apps/hdfs-writer/warehouse/CA4MN/topics/SYS_4G_PCMD_RAW/pktime=20180112140000)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51637)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51596)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result.read(ThriftHiveMetastore.java:51519)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1667)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1651)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:606)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:600)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:152)
at com.sun.proxy.$Proxy53.appendPartition(Unknown Source)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:124)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:118)
at io.confluent.connect.hdfs.hive.HiveMetaStore.doAction(HiveMetaStore.java:114)
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:130)
... 6 more
[2018-01-12 14:00:00,265] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:405)

RussWinkler · 2018-01-25T16:44:15Z

We have also been experiencing this issue on a single Kafka Connect instance running in distributed mode (by itself). It seems to occur only when a message arrives with a new value for the field the topic is partitioned on. We are on Kafka 0.10.1.1 and Confluent 3.1.2.

Task prod-connector-17 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:404)
java.lang.RuntimeException: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:226)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:384)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:240)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:220)
... 12 more
Caused by: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Hive MetaStore exception
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:109)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:662)
at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:659)
... 4 more
Caused by: MetaException(message:Got exception: java.io.IOException Failed to move to trash: hdfs://namenode:9000/user/topics/event/day=2018-01-25)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51637)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51596)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result.read(ThriftHiveMetastore.java:51519)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1667)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1651)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:606)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:600)
at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:152)
at com.sun.proxy.$Proxy52.appendPartition(Unknown Source)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:97)
at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:91)
at io.confluent.connect.hdfs.hive.HiveMetaStore.doAction(HiveMetaStore.java:87)
at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:103)
... 6 more

User1773 · 2018-02-19T10:10:47Z

Let me give more information about the issue. We have 5 connectors, each listening to one topic, and 8 workers writing in parallel. The issue does not happen with every connector, only with one of them. It is really problematic because, in our distributed-worker setup, the connector is systematically rebalanced, and during rebalancing the workers stop writing. This delays the writing process, which is bad for us because we have to make sure the data is available in less than 5 minutes.

User1773 · 2018-02-19T10:12:04Z

Could someone please help? This issue has been known since the beginning of August 2017.

prasanna1433 · 2018-02-28T23:26:52Z

We are also seeing this issue. It is critical for us because it causes data loss in the connector. We were able to find the missing records under .Trash/ in HDFS, in the same folder structure prefixed with .Trash/.

We are thinking that we could catch the exception thrown in the function addHivePartition(final String location) and handle it so that the data ends up in the desired folder rather than in the .Trash folder.
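One way to sketch the "catch and handle" idea is a bounded retry around the Hive call, so that a single metastore failure does not immediately kill the task. This is only an illustration under stated assumptions: `HivePartitionRetry` and `callWithRetry` are hypothetical names, not part of the connector, and the thread does not establish whether retrying is safe or whether it would prevent the move to .Trash.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries an action a bounded number of times with linear backoff.
final class HivePartitionRetry {

    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long backoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                // Wait a little longer after each failure, but not after the final one.
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs * attempt);
                }
            }
        }
        // All attempts failed: surface the last error so the caller can decide what to do.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real Hive call; here it simply succeeds.
        String result = callWithRetry(() -> "partition added", 3, 100L);
        System.out.println(result);
    }
}
```

In the connector, the wrapped action would be the call that currently throws HiveMetaStoreException. If all attempts fail, the exception is rethrown and the task still stops rather than silently dropping the partition.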

In some cases, if we increase rotate.schedule.interval.ms to more than an hour, we see the move to .Trash less often, but the exception still occurs and sometimes the data is still moved to .Trash.

Please let me know if you can think of any other options that we can try here @Cricket007 @ewencp @kkonstantine @maxzheng

prasanna1433 · 2018-03-08T21:44:06Z

@ewencp @aayars Have you come across this issue? It is still open. After we changed rotate.schedule.interval.ms to more than an hour, we no longer saw items being moved to the .Trash folder, but the error still occurs regularly and we do see occasional data loss.
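For anyone who wants to try the same mitigation, the setting can be expressed as a small change to the connector configuration. This is a sketch only: the class and method names are hypothetical, the 2-hour value is just an example of "more than an hour", and as noted above this reduces the symptom but does not remove the error.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: returns a copy of a connector config with scheduled rotation set.
final class RotationConfigSketch {

    static Map<String, String> withScheduledRotation(Map<String, String> base, long hours) {
        Map<String, String> config = new LinkedHashMap<>(base);
        // The property expects milliseconds.
        config.put("rotate.schedule.interval.ms", Long.toString(TimeUnit.HOURS.toMillis(hours)));
        return config;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();
        base.put("tasks.max", "16"); // task count from the original report
        Map<String, String> config = withScheduledRotation(base, 2);
        config.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```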
