I have a topic containing roughly 42 million messages that I want to dump to HDFS. At first I had only one worker instance with the following connector:
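(The connector was created through the Connect REST API with something like the request below; the connector name, paths and sizes here are placeholders, not my exact configuration.)
# Hypothetical example only - placeholder values, not the actual connector config
curl -X POST -H "Content-Type: application/json" http://connect-host:8083/connectors -d '{
  "name": "hdfs-sink-topicname",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "3",
    "topics": "topicname",
    "hdfs.url": "hdfs://hdfsurl",
    "logs.dir": "/hdfs/path/topicname/logs",
    "flush.size": "10000"
  }
}'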
I had the feeling that it was not consuming enough from the topic (on and off, low volume), so I followed these steps to add two other worker instances (the corresponding calls are sketched after the list):
Paused the connector through the REST API
Stopped the running instance
Started the three worker instances sequentially (Docker containers, same CONNECT_GROUP_ID)
Resumed the connector through the REST API
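Concretely, the pause/resume calls and the worker containers looked roughly like this; the hostnames, connector name and topic names are placeholders, and several required CONNECT_* settings (converters, advertised host name, etc.) are omitted:
# Pause before the restart, resume once all three workers are up (placeholder names)
curl -X PUT http://connect-host:8083/connectors/hdfs-sink-topicname/pause
curl -X PUT http://connect-host:8083/connectors/hdfs-sink-topicname/resume

# Each worker container joins the same Connect group (assuming the confluentinc/cp-kafka-connect image)
docker run -d \
  -e CONNECT_BOOTSTRAP_SERVERS=kafka:9092 \
  -e CONNECT_GROUP_ID=connect-cluster \
  -e CONNECT_CONFIG_STORAGE_TOPIC=connect-configs \
  -e CONNECT_OFFSET_STORAGE_TOPIC=connect-offsets \
  -e CONNECT_STATUS_STORAGE_TOPIC=connect-status \
  confluentinc/cp-kafka-connect
# (key/value converter and rest.advertised.host.name settings omitted for brevity)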
It has been running for more than an hour, and what I see now is that only one worker is sometimes actually committing to HDFS, in an irregular way: a few thousand messages over 1 or 2 minutes, then nothing for 5 to 10 minutes. Most of the time all three workers are just outputting log lines like these:
....
[2017-08-31 13:46:32,845] INFO Started recovery for topic partition topicname-12 (io.confluent.connect.hdfs.TopicPartitionWriter)
[2017-08-31 13:46:34,138] INFO Finished recovery for topic partition topicname-15 (io.confluent.connect.hdfs.TopicPartitionWriter)
[2017-08-31 13:46:34,138] INFO Started recovery for topic partition topicname-14 (io.confluent.connect.hdfs.TopicPartitionWriter)
[2017-08-31 13:46:34,165] INFO Finished recovery for topic partition topicname-7 (io.confluent.connect.hdfs.TopicPartitionWriter)
.....
On one of the workers I also get a few of these:
[2017-08-31 13:48:05,417] INFO Cannot acquire lease on WAL hdfs://hdfsurl///hdfs/path/topicname/logs/topicname/12/log (io.confluent.connect.hdfs.wal.FSWAL)
[2017-08-31 13:48:06,285] INFO Cannot acquire lease on WAL hdfs://hdfsurl///hdfs/path/topicname/logs/topicname/8/log (io.confluent.connect.hdfs.wal.FSWAL)
...
To me it looks like a lot of rebalances are happening, and I don't understand why it is mainly one worker that is writing to HDFS. I thought 42M messages would have been ingested much faster through Connect.
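(I suppose I can double-check which worker each task is assigned to via the status endpoint, with something like the call below; the connector name and host are placeholders:)
# Shows connector/task state and the worker each task runs on (placeholder names)
curl http://connect-host:8083/connectors/hdfs-sink-topicname/status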
Right now the lag is decreasing very slowly, and it doesn't look like it will reach 0 anytime soon.
Do you have any guidance on how to properly configure a cluster of worker instances? Do you see anything weird in my configuration?
Thanks.
Actually, looking at the lag on my topic partitions, it seems that this one worker that was writing to HDFS did consume all of the messages for the partitions assigned to it. 9 partitions have been completely consumed (the topic has 28 partitions); the others still have a lag of ~1M. So it confirms that only one worker is doing its job in my cluster...
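(I'm reading the per-partition lag with the consumer groups tool, along these lines; I'm assuming the sink's consumer group follows the default connect-<connector-name> naming, and the broker address and connector name are placeholders:)
# Per-partition current offset, log-end offset and lag for the sink's consumer group
kafka-consumer-groups --bootstrap-server kafka:9092 \
  --describe --group connect-hdfs-sink-topicname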
talnicolas changed the title from "Tasks repartition with multiple workers" to "Problem consuming topic with multiple workers" on Aug 31, 2017