Exceptions when hdfs is down or hard disk is full #143
Comments
@skyahead These issues could be classified into two types:
@lakeofsand I have some code to prevent these exceptions from happening and make Kafka Connect survive network breakdowns, HDFS outages, etc. automatically. However, our QA reported missing records last week and I am in the process of resolving that issue. I will definitely report back once it is fixed.
@skyahead It seems like these three exceptions all stem from an inability to reconnect to HDFS, even though the tasks keep retrying to establish a connection. The first one looks like a problem with repeatedly trying to append to the WAL while the state machine is stuck in the TEMP_FILE_CLOSED state; this could be related to our inability to properly re-establish a lease, but without more of the log it's difficult to say for sure. The second one looks similar, except we're stuck in the SHOULD_ROTATE state. The third one looks like we've never even established a lease, and again could be related to the same problem. My comments on #142 are, I believe, relevant here and would help if we properly null out the writer when we fail to close it. With that explanation, do you think we can close this as a duplicate and reopen if the problem persists after pulling in an adapted version of #142? For reference, this is the state machine I'm talking about.
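For illustration, a minimal sketch of that "null out the writer when close fails" idea; the class, method, and field names here are invented for the example and are not the connector's actual code:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: simplified stand-in for the per-partition writer handling.
class TempFileCloser {
  private final Map<String, Closeable> writers = new HashMap<>();

  void closeTempFile(String encodedPartition) throws IOException {
    Closeable writer = writers.get(encodedPartition);
    if (writer == null) {
      return; // nothing open for this partition
    }
    try {
      writer.close();
    } finally {
      // Drop the reference even when close() throws (e.g. HDFS is unreachable),
      // so a later recovery attempt creates a fresh writer instead of looping
      // in TEMP_FILE_CLOSED or SHOULD_ROTATE with a half-closed one.
      writers.remove(encodedPartition);
    }
  }
}
```

The try/finally is the important part: the stale reference is cleared whether or not close() succeeds.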
@cotedm Could you have another look at the current PR when you get a chance?
Hitting a very similar issue when a rolling restart of HDFS datanodes is executed (one datanode every 120 seconds). The following ERROR loops indefinitely. The impacted tasks appear to be random: some survive the datanode rolling restart and some do not.
Still experiencing a similar issue with Confluent 4.1.0:
Our current resolution is to remove the entire /logs directory whenever this occurs, but that seems a heavy solution. This state is also hard to detect without parsing the logs. Or does it change the output on host:8083/connectors/hdfs-sink/status as well? In general, I'd prefer shutting down a connector or a task to endlessly looping exceptions.
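For what it's worth, a minimal sketch (assuming Java 11+ and the host, port, and connector name used above) of polling that status endpoint instead of parsing the worker logs. Note that a task which catches these exceptions and retries internally will typically still report RUNNING here, which is part of why this state is hard to detect:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorStatusCheck {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    // Hypothetical host/port and connector name, matching the ones mentioned above.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://host:8083/connectors/hdfs-sink/status"))
        .GET()
        .build();
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    // The JSON body has a "connector" object and a "tasks" array,
    // each with a "state" field (e.g. RUNNING, FAILED).
    System.out.println(response.body());
  }
}
```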
Have you waited for more than 1 hour after this happened? If yes, something else was going on. If not, please try.
Waiting 1 hour is a hard limit of the HDFS client code and cannot be avoided by the current design, which preserves the exactly-once guarantee between Kafka and HDFS.
If you delete the logs etc., then yes, you do not have to wait for this annoying timeout, but you will lose the exactly-once feature.
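For context, the one-hour figure is HDFS's hard lease limit: the namenode reclaims an abandoned writer's lease on the WAL file on its own only after roughly an hour, and until the file is closed, readers hit errors like the "Cannot obtain block length for LocatedBlock" below. A rough, illustrative sketch of explicitly asking the namenode to recover such a lease follows; it is not the connector's FSWAL code, the path is just the one from the stack trace quoted below, and it only illustrates the lease mechanics, not a claim that forcing recovery is safe for the connector's exactly-once guarantee:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class WalLeaseRecoveryExample {
  public static void main(String[] args) throws Exception {
    // Example WAL path taken from the stack trace quoted below; adjust as needed.
    Path walPath = new Path("hdfs://hdfs-namenode-1:8020/logs/mytopic/2/log");
    FileSystem fs = FileSystem.get(walPath.toUri(), new Configuration());

    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      boolean recovered = false;
      while (!recovered) {
        // Asks the namenode to start lease recovery; returns true once the file
        // is closed and safe to re-open for reading or appending.
        recovered = dfs.recoverLease(walPath);
        if (!recovered) {
          Thread.sleep(10_000L); // poll until recovery completes
        }
      }
    }
    fs.close();
  }
}
```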
On Thu, Jul 19, 2018 at 5:58 AM, Joris Borgdorff ***@***.***> wrote:
Still experiencing a similar issue with Confluent 4.1.0:
[2018-07-19 09:39:57,033] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
org.apache.kafka.connect.errors.DataException: Error creating writer for log file hdfs://hdfs-namenode-1:8020/logs/mytopic/2/log
at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:91)
at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:105)
at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:601)
at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:256)
at io.confluent.connect.hdfs.TopicPartitionWriter.write(TopicPartitionWriter.java:321)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:374)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:109)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:524)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:302)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Cannot obtain block length for LocatedBlock{BP-848146280-172.19.0.5-1531990635873:blk_1073741849_1025; getBlockSize()=35; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[172.19.0.6:9866,DS-44ed10ee-5e44-4189-b4c5-5f8579a21184,DISK], DatanodeInfoWithStorage[172.19.0.4:9866,DS-642d35fe-fe6b-43ae-82be-2c5817dbf478,DISK], DatanodeInfoWithStorage[172.19.0.3:9866,DS-b3765134-b9e3-4a99-bb2e-897ee45f3f76,DISK]]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:428)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:336)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at io.confluent.connect.hdfs.wal.WALFile$Reader.openFile(WALFile.java:551)
at io.confluent.connect.hdfs.wal.WALFile$Reader.<init>(WALFile.java:436)
at io.confluent.connect.hdfs.wal.WALFile$Writer.<init>(WALFile.java:156)
at io.confluent.connect.hdfs.wal.WALFile.createWriter(WALFile.java:75)
at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:73)
... 17 more
[2018-07-19 09:39:57,037] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
org.apache.kafka.connect.errors.DataException: Error creating writer for log file hdfs://hdfs-namenode-1:8020/logs/android_empatica_e4_electrodermal_activity/0/log
at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:91)
at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:105)
at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:601)
at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:256)
at io.confluent.connect.hdfs.TopicPartitionWriter.write(TopicPartitionWriter.java:321)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:374)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:109)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:524)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:302)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Cannot obtain block length for LocatedBlock{BP-848146280-172.19.0.5-1531990635873:blk_1073741854_1030; getBlockSize()=35; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[172.19.0.3:9866,DS-b3765134-b9e3-4a99-bb2e-897ee45f3f76,DISK], DatanodeInfoWithStorage[172.19.0.6:9866,DS-44ed10ee-5e44-4189-b4c5-5f8579a21184,DISK], DatanodeInfoWithStorage[172.19.0.4:9866,DS-642d35fe-fe6b-43ae-82be-2c5817dbf478,DISK]]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:428)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:336)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at io.confluent.connect.hdfs.wal.WALFile$Reader.openFile(WALFile.java:551)
at io.confluent.connect.hdfs.wal.WALFile$Reader.<init>(WALFile.java:436)
at io.confluent.connect.hdfs.wal.WALFile$Writer.<init>(WALFile.java:156)
at io.confluent.connect.hdfs.wal.WALFile.createWriter(WALFile.java:75)
at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:73)
... 17 more
Our current resolution is to remove the entire /logs directory whenever this occurs, but that seems a heavy solution. This state is also hard to detect without parsing the logs. Or does it change the output on host:8083/connectors/hdfs-sink/status as well?
In general, I'd prefer shutting down a connector or a task to endlessly looping exceptions.
@skyahead I will try to wait next time. If this behaviour is expected and automatically resolved by the code, I would not expect the logs to fill up with stack traces for every (very frequent) retry, though. A log message stating that the situation will be resolved automatically would also be useful.
Similarly, we still see TopicPartitionWriter stuck in a SHOULD_ROTATE loop:
@skyahead I wonder if we should bring back the change from your original commit to use try/finally.
If a hard disk in the HDFS cluster is full, or, even worse, if we shut down the HDFS cluster (e.g., run stop-dfs.sh) while Kafka Connect is writing into it, then depending on luck I keep getting one or all of the following exceptions.
After I restarted the HDFS cluster (i.e., ran start-dfs.sh; this is admittedly not something we should do in production, but our QA team runs extreme tests), the connectors keep throwing these exceptions forever.
Exception one:
Exception two:
Exception three: