TimeBasedPartitioner is inefficient in number of files created in a partition #242

Open
patrikni opened this issue Oct 9, 2017 · 4 comments

patrikni commented Oct 9, 2017

Hi,
I configured a 5-minute time-based partition on a topic with 10 partitions. The time partitioner created many more files than necessary.
I expected just 10 files, one per topic partition, but it created 15.

Files created in one 5-minute partition:
23549 2017-09-29 11:50 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+9+0000388981+0000388981.avro
23588 2017-09-29 11:50 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+2+0000388777+0000388777.avro
23556 2017-09-29 11:51 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+5+0000388826+0000388826.avro
23591 2017-09-29 11:51 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+1+0000388400+0000388400.avro
23676 2017-09-29 11:52 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+0+0000389094+0000389094.avro
23642 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+7+0000389390+0000389390.avro
24045 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+9+0000388982+0000388983.avro
24039 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+0+0000389095+0000389096.avro
25836 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+2+0000388778+0000388783.avro
25453 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+4+0000389299+0000389303.avro
25541 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+6+0000388335+0000388339.avro
24475 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+8+0000388930+0000388932.avro
24545 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+1+0000388401+0000388403.avro
24120 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+3+0000389927+0000389928.avro
24000 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+5+0000388827+0000388828.avro
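
As a rough check of the listing above, the following sketch groups the 15 file names by Kafka partition. It assumes the name layout is `<topic>+<partition>+<start offset>+<end offset>.avro`; the helper `parse_name` is a hypothetical name used only for this illustration.

```python
from collections import Counter

# File names copied from the listing above (directory prefix omitted).
FILES = [
    "ACCOUNT_CORE+9+0000388981+0000388981.avro",
    "ACCOUNT_CORE+2+0000388777+0000388777.avro",
    "ACCOUNT_CORE+5+0000388826+0000388826.avro",
    "ACCOUNT_CORE+1+0000388400+0000388400.avro",
    "ACCOUNT_CORE+0+0000389094+0000389094.avro",
    "ACCOUNT_CORE+7+0000389390+0000389390.avro",
    "ACCOUNT_CORE+9+0000388982+0000388983.avro",
    "ACCOUNT_CORE+0+0000389095+0000389096.avro",
    "ACCOUNT_CORE+2+0000388778+0000388783.avro",
    "ACCOUNT_CORE+4+0000389299+0000389303.avro",
    "ACCOUNT_CORE+6+0000388335+0000388339.avro",
    "ACCOUNT_CORE+8+0000388930+0000388932.avro",
    "ACCOUNT_CORE+1+0000388401+0000388403.avro",
    "ACCOUNT_CORE+3+0000389927+0000389928.avro",
    "ACCOUNT_CORE+5+0000388827+0000388828.avro",
]


def parse_name(name):
    """Split '<topic>+<partition>+<start>+<end>.avro' into its parts (hypothetical helper)."""
    stem = name.rsplit(".", 1)[0]
    topic, partition, start, end = stem.rsplit("+", 3)
    return topic, int(partition), int(start), int(end)


# Count how many files each Kafka partition produced in this directory.
files_per_partition = Counter(parse_name(f)[1] for f in FILES)
duplicated = sorted(p for p, n in files_per_partition.items() if n > 1)

print(len(FILES), "files for", len(files_per_partition), "topic partitions")
print("partitions with more than one file:", duplicated)
```

Partitions 0, 1, 2, 5 and 9 each appear twice, which accounts for the 5 extra files.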

Configuration:
{
  "config": {
    "connect.hdfs.keytab": "/home/usr_kfk/usr_kfk.keytab",
    "connect.hdfs.principal": "[email protected]",
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "flush.size": "10000000",
    "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
    "hadoop.conf.dir": "/etc/hadoop/conf",
    "hdfs.authentication.kerberos": "true",
    "hdfs.namenode.principal": "hdfs/[email protected]",
    "hdfs.url": "hdfs://HOST-ns",
    "hive.conf.dir": "/etc/hive/conf",
    "hive.database": "kafkadb",
    "hive.integration": "false",
    "hive.metastore.uris": "thrift://srvhdp04:9083",
    "kerberos.ticket.renew.period.ms": "3600000",
    "logs.dir": "/data/speed/logs",
    "name": "sbd-sink",
    "partition.duration.ms": "300000",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/",
    "rotate.schedule.interval.ms": "300000",
    "schema.compatibility": "BACKWARD",
    "tasks.max": "1",
    "topics": "ACCOUNT_CORE",
    "topics.dir": "/data/speed/topics"
  },

This creates unnecessary files on HDFS and also slows down Hive queries.
I made a local fix to test the behavior, and it resolves the problem. I can provide it as an example.

@kkonstantine
Member

Can you say which version of the connector you are using?

Also, do you anticipate schema changes in the records that might have resulted in file rotation based on the rules here?

@patrikni
Author

It is kafka-connect-hdfs version 3.2.1.

@patrikni
Author

It is not caused by schema changes; the schema is almost constant. It is caused by TopicPartitionWriter and TimeBasedPartitioner using different times.
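
To illustrate the kind of mismatch described here, the sketch below is a toy model and not the connector's actual code. It assumes a file is closed either when a scheduled rotation fires or when the 5-minute directory changes, and that the rotation schedule is not aligned with the directory boundaries. The names `partition_dir` and `simulate` are hypothetical.

```python
from datetime import datetime, timedelta


def partition_dir(ts):
    """Floor a timestamp to its 5-minute window (models partition.duration.ms = 300000)."""
    floored = ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)
    return floored.strftime("hour=%H/minute=%M")


def simulate(record_times, rotation_times):
    """Return (directory, record_count) for each file one topic partition would write.

    Assumption of this toy model: a file is closed when a rotation tick has
    passed or when the next record belongs to a different directory.
    """
    files = []
    current_dir, count = None, 0
    ticks = sorted(rotation_times)
    for ts in sorted(record_times):
        tick_passed = bool(ticks) and ts >= ticks[0]
        while ticks and ts >= ticks[0]:
            ticks.pop(0)
        directory = partition_dir(ts)
        if count and (tick_passed or directory != current_dir):
            files.append((current_dir, count))
            count = 0
        current_dir = directory
        count += 1
    if count:
        files.append((current_dir, count))
    return files


start = datetime(2017, 9, 29, 11, 50)
records = [start + timedelta(minutes=i) for i in range(10)]  # one record per minute

# Rotation ticks aligned with the directory boundaries: one file per directory.
aligned = simulate(records, [start + timedelta(minutes=m) for m in (5, 10)])
# Rotation ticks offset by two minutes: every directory is split into two files.
unaligned = simulate(records, [start + timedelta(minutes=m) for m in (2, 7)])

print("aligned:  ", aligned)
print("unaligned:", unaligned)
```

In this model the unaligned schedule doubles the number of files per directory, which is the same pattern as the extra files in the listing.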

@OneCricketeer

The Partitioner has a configuration for which time to use, if that's your concern. However, I don't believe your flush size is being respected. If you look at the file names, the last two numbers are the start and end offsets of the records in the file. The difference between them is far smaller than your flush.size setting.
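
The offset arithmetic can be checked with a short sketch. It assumes both offsets in a file name are inclusive, which is consistent with the listing (one file ends at 388981 and the next one for the same partition starts at 388982). The helper `records_in_file` is a hypothetical name.

```python
FLUSH_SIZE = 10_000_000  # "flush.size" from the configuration in the issue

# (partition, start_offset, end_offset) taken from the 15 file names in the issue.
OFFSETS = [
    (9, 388981, 388981), (2, 388777, 388777), (5, 388826, 388826),
    (1, 388400, 388400), (0, 389094, 389094), (7, 389390, 389390),
    (9, 388982, 388983), (0, 389095, 389096), (2, 388778, 388783),
    (4, 389299, 389303), (6, 388335, 388339), (8, 388930, 388932),
    (1, 388401, 388403), (3, 389927, 389928), (5, 388827, 388828),
]


def records_in_file(start, end):
    """Number of records in a file, assuming both offsets are inclusive."""
    return end - start + 1


counts = [records_in_file(start, end) for _, start, end in OFFSETS]

print("records per file:", counts)
print("largest file:", max(counts), "records; flush.size:", FLUSH_SIZE)
```

No file holds more than a handful of records, so these files were closed by time-based rotation long before flush.size could be reached.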

This depends on the Hive queries you're using, but perhaps you shouldn't use minute accuracy in your partitions. You can roll files in 5-minute intervals in an hourly directory, for example. Plus, this allows you to have larger files on HDFS.
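
A minimal sketch of that suggestion, expressed as overrides to the settings posted in the issue. The two override values are assumptions for illustration (one hour in milliseconds, and the original path.format with the minute component dropped), not a tested configuration.

```python
import json

# Relevant settings from the configuration posted in the issue.
current = {
    "partition.duration.ms": "300000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/",
    "rotate.schedule.interval.ms": "300000",
}

# Assumed overrides: hourly directories, files still rolled every 5 minutes.
hourly = {
    "partition.duration.ms": "3600000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/",
}

# Later keys win, so the rotation interval is kept and the partitioning changes.
suggested = {**current, **hourly}
print(json.dumps(suggested, indent=2))
```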
