Hi,
I had a 5-minute time-based partition configured, and the topic had 10 partitions. The time partitioner created many more files than necessary.
I expected it to create just 10 files, one per topic partition, but it created 15.
This causes unnecessary consumption of HDFS resources, and it also slows down the Hive queries.
I made a local fix to test the behavior, and it can be fixed; I can provide it as an example.
It is not caused by schema changes; the schema is almost constant. It is caused by the different times used by TopicPartitionWriter and TimeBasedPartitioner.
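A hedged illustration of the effect (plain Python, not the connector's actual code): when the timeline is carved into fixed 5-minute windows, two clocks that disagree by even a few seconds near a window boundary map to different partitions, so records from the same topic partition get split into extra files.

```python
# Illustration only: map a millisecond timestamp to a 5-minute window start,
# the way a time-based partitioner carves the timeline into fixed buckets.
PARTITION_DURATION_MS = 5 * 60 * 1000

def bucket(ts_ms: int) -> int:
    """Return the start of the 5-minute window containing ts_ms."""
    return ts_ms - ts_ms % PARTITION_DURATION_MS

# Two "clocks" that disagree by two seconds straddle a window boundary:
record_time = 1_506_678_599_000   # just before an exact 5-minute mark
wall_clock  = 1_506_678_601_000   # a moment later, past the mark

print(bucket(record_time) == bucket(wall_clock))  # False -> separate files
```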
The partitioner has a configuration for which time to use, if that's your concern. However, I don't believe your flush size is being respected: if you look at the file names, the last two numbers are the start and end offsets of the topic partition within the file, and the difference is much smaller than your setting.
This depends on the Hive queries you're using, but perhaps you shouldn't use minute accuracy in your partitions. You could roll files in 5-minute intervals within an hourly directory, for example. Plus, this allows you to have larger files on HDFS.
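As a sketch of that layout (property names follow the HDFS sink connector's documented settings; values are illustrative, not a tested config): an hourly directory with 5-minute file rolls could look like:

```json
{
  "partitioner.class": "io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
  "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/",
  "partition.duration.ms": "3600000",
  "rotate.interval.ms": "300000"
}
```

This keeps Hive's partition count 60x smaller while still bounding how long records sit in an open file.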
Files created in one 5-minute partition:
23549 2017-09-29 11:50 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+9+0000388981+0000388981.avro
23588 2017-09-29 11:50 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+2+0000388777+0000388777.avro
23556 2017-09-29 11:51 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+5+0000388826+0000388826.avro
23591 2017-09-29 11:51 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+1+0000388400+0000388400.avro
23676 2017-09-29 11:52 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+0+0000389094+0000389094.avro
23642 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+7+0000389390+0000389390.avro
24045 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+9+0000388982+0000388983.avro
24039 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+0+0000389095+0000389096.avro
25836 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+2+0000388778+0000388783.avro
25453 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+4+0000389299+0000389303.avro
25541 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+6+0000388335+0000388339.avro
24475 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+8+0000388930+0000388932.avro
24545 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+1+0000388401+0000388403.avro
24120 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+3+0000389927+0000389928.avro
24000 2017-09-29 11:55 /year=2017/month=09/day=29/hour=11/minute=50/ACCOUNT_CORE+5+0000388827+0000388828.avro
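The record count per file can be read straight off these names: the last two numbers are the inclusive start and end offsets, so each file holds end − start + 1 records. A small sketch (the filename layout is assumed to be topic+partition+startOffset+endOffset):

```python
# Decode an HDFS connector file name of the assumed form
#   TOPIC+PARTITION+STARTOFFSET+ENDOFFSET.avro
# and report how many records the file holds.
def records_in_file(name: str) -> int:
    stem = name.rsplit(".", 1)[0]            # drop the .avro extension
    parts = stem.split("+")
    start, end = int(parts[-2]), int(parts[-1])
    return end - start + 1                   # offsets are inclusive

print(records_in_file("ACCOUNT_CORE+2+0000388778+0000388783.avro"))  # 6
print(records_in_file("ACCOUNT_CORE+9+0000388981+0000388981.avro"))  # 1
```

Every file above holds between 1 and 6 records, nowhere near the configured flush.size of 10000000.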
Configuration:
{
  "config": {
    "connect.hdfs.keytab": "/home/usr_kfk/usr_kfk.keytab",
    "connect.hdfs.principal": "[email protected]",
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "flush.size": "10000000",
    "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
    "hadoop.conf.dir": "/etc/hadoop/conf",
    "hdfs.authentication.kerberos": "true",
    "hdfs.namenode.principal": "hdfs/[email protected]",
    "hdfs.url": "hdfs://HOST-ns",
    "hive.conf.dir": "/etc/hive/conf",
    "hive.database": "kafkadb",
    "hive.integration": "false",
    "hive.metastore.uris": "thrift://srvhdp04:9083",
    "kerberos.ticket.renew.period.ms": "3600000",
    "logs.dir": "/data/speed/logs",
    "name": "sbd-sink",
    "partition.duration.ms": "300000",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/",
    "rotate.schedule.interval.ms": "300000",
    "schema.compatibility": "BACKWARD",
    "tasks.max": "1",
    "topics": "ACCOUNT_CORE",
    "topics.dir": "/data/speed/topics"
  }
}