Problem renewing Kerberos ticket #225
I am seeing the same issue; however, it seems the re-login code did work correctly. Our setup is HA HDFS and we don't use Hive. Below are logs from one Connect worker. The process runs fine for about 3 days and then we start getting these errors.
Any update on this behavior? Facing a similar issue.
Bump. Also seeing this issue.
I'm running into this issue currently with Confluent Connect 5.4.1 as well. I've been spending quite some time trying to track down where the issue is. I'm not a developer, but it does appear this problem might lie somewhere in the Hadoop client libraries (hadoop-common UserGroupInformation class). Following the debug logging, I've found the following:
The strange thing about this is that, in my case, I could see in my logs that it was renewing exactly as expected. Then, suddenly, 1.5 hours after the previous renewal, I started getting errors that I didn't have valid credentials. My ticket lifetime is 8 hours for each renewal. This leads me to believe something in the Hadoop client libraries is NOT actually renewing the ticket as expected, but it only fails sometimes; other times things DO renew as expected and I don't get the errors. Maybe this bug is related? Complete guess: https://issues.apache.org/jira/browse/HADOOP-13433
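For what it's worth, hadoop-common's UserGroupInformation schedules its ticket refresh once 80% of the ticket lifetime has elapsed (its TICKET_RENEW_WINDOW constant is 0.80f). A minimal plain-Java sketch of that arithmetic with an 8-hour lifetime as described above (the class and method names here are illustrative, not Hadoop's):

```java
import java.time.Duration;
import java.time.Instant;

public class RenewWindow {
    // Mirrors the TICKET_RENEW_WINDOW constant (0.80f) in hadoop-common's
    // UserGroupInformation: the renewal thread wakes once 80% of the
    // ticket lifetime has elapsed.
    static final float TICKET_RENEW_WINDOW = 0.80f;

    static Instant nextRefresh(Instant ticketStart, Instant ticketEnd) {
        long lifetimeMillis = Duration.between(ticketStart, ticketEnd).toMillis();
        return ticketStart.plusMillis((long) (lifetimeMillis * TICKET_RENEW_WINDOW));
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2020-01-01T00:00:00Z");
        Instant end = start.plus(Duration.ofHours(8)); // 8-hour ticket lifetime, as above
        // The refresh fires 6.4 h after issuance, leaving a 1.6 h window before
        // expiry -- roughly the gap in which the errors above started appearing
        // when the renewal silently failed.
        System.out.println(nextRefresh(start, end)); // prints 2020-01-01T06:24:00Z
    }
}
```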
I attempted running the connector using Hadoop 2.7.7 (I just updated the pom.xml to use Hadoop 2.7.7 and then rebuilt it with the new dependencies). We still saw this issue even with Hadoop 2.7.7 (which includes the fix for the bug I linked in my last comment). @jkt628 and I continued debugging further, and we think there could be some kind of race condition occurring in the getTGT() call of the Hadoop common code. We think this function is potentially finding the wrong TGT, though we haven't found a definitive way to prove that. If so, there is some scenario in which the Hadoop code renews a ticket that is not actually being used by the Connect sink; as a result, the sink starts throwing errors after the original ticket expires. We are testing a one-line change in the DataWriter class that forces the HDFS sink to instead use an externally managed keytab (on the system using kinit). The change is simply this on line 173: I'll report back after we've gone through our set of tests.
This one change led to the HDFS sink using an external Kerberos ticket, and all of our tests proved successful. Here is what we tested:
We are running this modified version of the connector in production at this point. In 6+ days, we have not seen a single instance of this Kerberos authentication issue recur on hosts using the updated code. UPDATE: We have continued running this modified code in production and have not had any recurrences of this issue. It really looks like the hadoop-common library is somehow getting things mixed up when each task controls its own ticket.
Confirming that this is a viable workaround while the Hadoop library issue is investigated and fixed.
FYI, https://issues.apache.org/jira/browse/HDFS-16165 was opened to track this issue, and a reproduction model has been made to consistently reproduce it.
Hi @heaje, we have been running more than 200 HDFS Sink connectors using your suggestion for 1 week and have had zero Kerberos issues. We tried migrating to HDFS Sink 3, but we still got the Kerberos ticket-renewal issue. Thank you very much.
@heaje Hi, do you have an actual lifehack for the 10.1.x versions?
Funny enough, I just started working today on making a patch for version 10.1.11. I should hopefully have something within a week.
@heaje I will expand the question: do you have any experience launching your patch in the strimzi-operator environment (example-basic-manifest)? Thanks a lot.
I haven't ever run anything in strimzi-operator. We just run this on bare metal.
A draft of the patch can be found at https://gist.github.com/heaje/924d070daedea904f7b4172b487e2433. I HAVE NOT TESTED THIS. I have no environment to test this in for at least a couple of weeks. You are free to give it a shot, but again, this is UNTESTED. I put the patch in a Gist because this is in no way something that should ever be upstreamed into the HDFS sink code.
@heaje are you sure there is a need to remove:
@lciolecki - Yes, that must be removed. The entire purpose of this code removal is to force the connector to use an externally managed keytab. Since the keytab is externally managed, some external service needs to take care of renewing the ticket. That said, I have been using the code from that draft for the past 3 months without any issues.
Ahh, I understand now, e.g. we can renew the Kerberos ticket via cron? Thanks, man!
Yes, that is correct. However you go about renewing the ticket, something external to Kafka Connect will need to do it.
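To illustrate the externally managed approach (the paths, principal, schedule, and hostname below are placeholders, not taken from this thread), a cron entry like the following could refresh the ticket cache from a keytab well inside the ticket lifetime:

```shell
# Hypothetical crontab entry: re-kinit every 4 hours (half of an 8-hour
# ticket lifetime) so the ticket cache the connector reads never expires.
# Adjust the keytab path and principal for your environment.
0 */4 * * * kinit -kt /etc/security/keytabs/connect.service.keytab connect/worker1.example.com@EXAMPLE.COM
```

Make sure the cron job runs as the same OS user as the Connect worker, so both see the same default ticket cache.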
@heaje Thank you for the v10 code. Thank you again for this brilliant code.
@heaje Thank you for the investigation and the proposed workaround!
Hi,
I am using the following docker image to run my workers: confluentinc/cp-kafka-connect:3.2.2.
I looked at the existing issues and it does look a lot like #178, but it seems to me that that problem should have been fixed in this version.
So what happens is that my worker runs fine until the Kerberos ticket expires. Once it is no longer valid, the worker is not able to renew the ticket, stops working, and just outputs a bunch of these in the logs:
In my configuration I use the FQDN for hive.metastore.uris and hdfs.namenode.principal, and I use the HA URL for hdfs.url (hdfs://ha-url/). What I find concerning is that there is no reference to the Connect code in the stack trace.
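For context, the relevant part of a sink configuration like the one described might look like this. The property names are the HDFS Sink connector's real Kerberos-related settings, but the hostnames, realm, and paths are placeholders:

```properties
# Placeholder values; only hdfs://ha-url/ is taken from the report above.
hdfs.url=hdfs://ha-url/
hdfs.namenode.principal=hdfs/_HOST@EXAMPLE.COM
hive.metastore.uris=thrift://hive-metastore.example.com:9083
hdfs.authentication.kerberos=true
connect.hdfs.principal=connect/_HOST@EXAMPLE.COM
connect.hdfs.keytab=/etc/security/keytabs/connect.keytab
```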