Exceptions when network is broken #141

I have an HDFS connector running on my machine, which reads from a Kafka cluster and writes into an HDFS cluster. When I disable the machine's network connection and then re-enable it after a while, various exceptions are thrown.

The problem is that the HDFS connector keeps trying to acquire the lease on its WAL files forever, but never gets it.

The exceptions the HDFS connector throws are:

From the HDFS NameNode's log, I can see the problem is that the connector is still holding the lease:

Comments
Does the connector call recoverLease?
@cmccabe the only lease acquisition I'm aware of happens here: https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/wal/FSWAL.java#L73 I haven't seen an attempt at lease recovery. Would that help avoid this, though? It seems the task is mistakenly trying to get a new lease and is unable to, because it already holds one.
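For reference, this is roughly what an explicit lease recovery could look like; a minimal sketch only, assuming the underlying file system is a DistributedFileSystem, with a hypothetical helper name and retry policy that are not part of the connector:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryExample {
  // Hypothetical helper: ask the NameNode to release any outstanding lease on
  // the WAL file before reopening it. recoverLease() returns true once the
  // lease is released; recovery is asynchronous, so the call may need polling.
  static boolean recoverWalLease(FileSystem fs, Path walFile)
      throws IOException, InterruptedException {
    if (!(fs instanceof DistributedFileSystem)) {
      return true; // no HDFS lease semantics to recover
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    for (int attempt = 0; attempt < 10; attempt++) {
      if (dfs.recoverLease(walFile)) {
        return true;
      }
      Thread.sleep(1000L); // give the NameNode time to finish recovery
    }
    return false;
  }
}
```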
@cotedm sorry for replying so late. Yes, parsing error messages is a bad idea, and I figured out a way to resolve this issue without doing that :-) It turns out that the HDFS FileSystem has a final static CACHE, which HDFS Connect uses in two places.

Our concern is the second place. ALL the reading and writing of data files in /+tmp and /logs share a SINGLE FileSystem object, which lives in the static CACHE. If the network breaks and later recovers, this cache entry is NOT regenerated. Through that same entry, the HDFS cluster keeps seeing the same DFSClient, which already holds the lease, now trying to acquire the lease again. To solve this, therefore, we have to clear the cache entry properly. Please have a look at whether the new change does the trick.
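To make that concrete, here is a rough sketch of evicting and rebuilding the cached entry; the helper is illustrative only, not the actual patch, and it assumes the stale instance can be closed safely at that point:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CacheEvictionExample {
  // Illustrative only: drop the stale cached FileSystem so the next get()
  // constructs a fresh DFSClient, and with it a fresh lease-holder identity.
  static FileSystem refreshCachedFileSystem(URI hdfsUri, Configuration conf)
      throws IOException {
    FileSystem stale = FileSystem.get(hdfsUri, conf); // served from the CACHE
    stale.close();                                    // close() evicts the entry
    return FileSystem.get(hdfsUri, conf);             // cache miss: new client
  }
}
```

Note that close() affects every caller sharing that cached entry, so the eviction has to happen at a point where no other task is still using the instance.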
@cmccabe I feel like recoverLease may break the exactly-once semantics in a scenario like the following. Say a Connect instance is rebalanced and new tasks are started on a new node, while the old instance is NOT really dead but its communication with the Kafka brokers is delayed. In this case, if recoverLease() hits HDFS first from the new instance while the old instance is still writing into the same files, some records can be lost.
Quoting from another thread, which I believe applies to the issue you are currently facing:
This is the link to the original thread: |
Perhaps try FileSystem#newInstance instead of FileSystem#get |
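For context, a small demo of the difference between the two calls; the NameNode address is a placeholder:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    URI uri = URI.create("hdfs://namenode:8020"); // placeholder address

    // Cached: repeated calls with the same URI and user return the same
    // object, backed by one DFSClient (and therefore one lease holder).
    FileSystem shared = FileSystem.get(uri, conf);

    // Uncached: always constructs a new FileSystem with its own DFSClient,
    // regardless of what sits in the static CACHE.
    FileSystem fresh = FileSystem.newInstance(uri, conf);

    System.out.println(shared == FileSystem.get(uri, conf)); // true
    System.out.println(shared == fresh);                     // false
  }
}
```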
@sanxiago, can you give more insight on the potential for performance degradation in the FileSystem#newInstance approach? Our resistance to using it so far has been the unknown performance cost of bypassing the cache.
You will need to evaluate this by use case; there will be added latency when disabling the cache.
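For completeness, the cache can also be disabled per scheme through configuration, which makes FileSystem.get() behave like newInstance() for hdfs:// URIs; a minimal sketch, assuming a stock Hadoop Configuration:

```java
import org.apache.hadoop.conf.Configuration;

public class DisableCacheExample {
  static Configuration uncachedConf() {
    Configuration conf = new Configuration();
    // Per-scheme switch: with this set, FileSystem.get() skips the static
    // CACHE for hdfs:// URIs and builds a new DFSClient on every call,
    // which is where the added latency comes from.
    conf.setBoolean("fs.hdfs.impl.disable.cache", true);
    return conf;
  }
}
```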