TimeBasedPartitioner is inefficient in number of files created in a partition #242

patrikni · 2017-10-09T10:34:40Z

Hi,
I had a 5-minute time-based partition configured. The topic had 10 partitions. The time partitioner created many more files than necessary.
I expected it to create just 10 files, one per topic partition, but it created 15 files.

Files created in one 5-minute partition:
23549 2017-09-29 11:50 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+9+0000388981+0000388981.avro
23588 2017-09-29 11:50 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+2+0000388777+0000388777.avro
23556 2017-09-29 11:51 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+5+0000388826+0000388826.avro
23591 2017-09-29 11:51 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+1+0000388400+0000388400.avro
23676 2017-09-29 11:52 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+0+0000389094+0000389094.avro
23642 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+7+0000389390+0000389390.avro
24045 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+9+0000388982+0000388983.avro
24039 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+0+0000389095+0000389096.avro
25836 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+2+0000388778+0000388783.avro
25453 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+4+0000389299+0000389303.avro
25541 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+6+0000388335+0000388339.avro
24475 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+8+0000388930+0000388932.avro
24545 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+1+0000388401+0000388403.avro
24120 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+3+0000389927+0000389928.avro
24000 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+5+0000388827+0000388828.avro

Configuration:
{
  "config": {
    "connect.hdfs.keytab": "/home/usr_kfk/usr_kfk.keytab",
    "connect.hdfs.principal": "[email protected]",
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "flush.size": "10000000",
    "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
    "hadoop.conf.dir": "/etc/hadoop/conf",
    "hdfs.authentication.kerberos": "true",
    "hdfs.namenode.principal": "hdfs/[email protected]",
    "hdfs.url": "hdfs://HOST-ns",
    "hive.conf.dir": "/etc/hive/conf",
    "hive.database": "kafkadb",
    "hive.integration": "false",
    "hive.metastore.uris": "thrift://srvhdp04:9083",
    "kerberos.ticket.renew.period.ms": "3600000",
    "logs.dir": "/data/speed/logs",
    "name": "sbd-sink",
    "partition.duration.ms": "300000",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/",
    "rotate.schedule.interval.ms": "300000",
    "schema.compatibility": "BACKWARD",
    "tasks.max": "1",
    "topics": "ACCOUNT_CORE",
    "topics.dir": "/data/speed/topics"
  },

This causes unnecessary consumption of HDFS files and also slows down Hive queries.
I made a local fix to test the behavior, and it can be fixed. I can provide it as an example.

kkonstantine · 2017-10-27T21:29:12Z

Can you say which version of the connector you are using?

Also, do you anticipate schema changes in the records that might have resulted in file rotation based on the rules here?

patrikni · 2017-10-31T09:52:04Z

It is kafka-connect-hdfs version 3.2.1.

patrikni · 2017-10-31T09:59:55Z

It is not caused by schema changes. The schema is almost constant. It is caused by the different times used by TopicPartitionWriter and TimeBasedPartitioner.
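
To illustrate the kind of mismatch described here, the following is a minimal Python sketch, not the connector's code. It assumes the partitioner floors a timestamp to the start of its `partition.duration.ms` bucket (UTC assumed), and that a scheduled rotation closes a file at whatever wall-clock moment its timer fires. The function names `bucket_start_ms`, `encoded_partition`, and `ms` are hypothetical.

```python
from datetime import datetime, timezone

PARTITION_DURATION_MS = 300_000  # partition.duration.ms from the posted config


def bucket_start_ms(timestamp_ms, duration_ms=PARTITION_DURATION_MS):
    """Floor a timestamp to the start of its time bucket (assumed behaviour)."""
    return timestamp_ms - (timestamp_ms % duration_ms)


def encoded_partition(timestamp_ms):
    """Render the bucket in the directory layout used in this issue (UTC assumed)."""
    start = datetime.fromtimestamp(bucket_start_ms(timestamp_ms) / 1000, tz=timezone.utc)
    return start.strftime("year=%Y/month=%m/day=%d/hour=%H/minute=%M")


def ms(hour, minute, second=0):
    """Epoch milliseconds for a time of day on 2017-09-29 (UTC)."""
    moment = datetime(2017, 9, 29, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp() * 1000)


# A rotation that fires mid-bucket closes one file; records arriving later in
# the same bucket go into a second file under the same directory.
first_file_closed = ms(11, 51, 30)
second_file_closed = ms(11, 54, 59)
same_bucket = encoded_partition(first_file_closed) == encoded_partition(second_file_closed)

print(encoded_partition(first_file_closed))
print("two files, one directory:", same_bucket)
```

If the rotation clock and the bucket boundaries are not aligned, a topic partition can end up with two files in one 5-minute directory, which would be consistent with 15 files for 10 topic partitions.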

OneCricketeer · 2018-01-12T01:32:21Z

The Partitioner has a configuration for which time to use, if that's your concern. However, I don't believe your flush size is being respected. If you look at the file names, the last two numbers are the first and last offsets of the topic partition contained in the file. The difference is much smaller than your configured flush.size.

This depends on the Hive queries you're using, but perhaps you shouldn't use minute accuracy in your partitions. You can roll files in 5-minute intervals, in an hourly directory, for example. Plus, this allows you to have larger files on HDFS.
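
The offset arithmetic can be checked with a short Python sketch. The `topic+partition+startOffset+endOffset.avro` pattern is inferred from the listing in the first comment, and treating both offsets as inclusive is an assumption; only five of the fifteen file names are copied here, and `parse` is a hypothetical helper.

```python
import re
from collections import Counter

# A few file names copied from the listing in the first comment
FILES = [
    "ACCOUNT_CORE+9+0000388981+0000388981.avro",
    "ACCOUNT_CORE+2+0000388777+0000388777.avro",
    "ACCOUNT_CORE+9+0000388982+0000388983.avro",
    "ACCOUNT_CORE+2+0000388778+0000388783.avro",
    "ACCOUNT_CORE+4+0000389299+0000389303.avro",
]

NAME = re.compile(
    r"^(?P<topic>.+)\+(?P<partition>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.avro$"
)


def parse(name):
    """Split a file name into (topic, partition, start, end, record_count)."""
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    start, end = int(match["start"]), int(match["end"])
    # Assumes both offsets are inclusive, so a file with start == end holds 1 record
    return match["topic"], int(match["partition"]), start, end, end - start + 1


parsed = [parse(name) for name in FILES]
files_per_partition = Counter(partition for _, partition, _, _, _ in parsed)
largest_file = max(count for *_, count in parsed)

print(dict(files_per_partition))
print("largest record count:", largest_file)
```

For these five names the largest file holds 6 records, far below the configured `flush.size` of 10000000, so the files were closed by something other than the record count.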
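
As a rough sketch of that suggestion, the snippet below derives an hourly-directory variant from the three relevant settings in the posted configuration. The values in `hourly` are an illustration of the idea rather than a tested configuration, and the variable names are hypothetical.

```python
import json

# Values taken from the configuration posted in this issue
current = {
    "partition.duration.ms": "300000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/",
    "rotate.schedule.interval.ms": "300000",
}

# Hourly directories, with files still rolled every 5 minutes
hourly = {
    **current,
    "partition.duration.ms": "3600000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/",
}

# Scheduled rotations that fall inside one hourly directory
rolls_per_directory = int(hourly["partition.duration.ms"]) // int(
    hourly["rotate.schedule.interval.ms"]
)

print(json.dumps(hourly, indent=2))
print("scheduled rolls per hourly directory:", rolls_per_directory)
```

This reduces the number of directories Hive has to list. Larger files would additionally require a longer `rotate.schedule.interval.ms`, since each scheduled roll still closes the open file for every topic partition.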
