
Kafka Connect HDFS produces 4-byte Parquet files on restart #230

Open
n-petrachkov opened this issue Sep 21, 2017 · 1 comment
Kafka Connect HDFS is running in distributed mode (the same problem was observed in standalone mode).
Sometimes after Kafka Connect HDFS is restarted, small (4-byte) Parquet files appear in some landing directories on HDFS.
Kafka Connect HDFS then tries to read those invalid Parquet files and can no longer start.
Kafka Connect HDFS version: 3.3.0

Stacktrace:

ERROR Exception on topic partition table_name-2:  (io.confluent.connect.hdfs.TopicPartitionWriter)
java.io.IOException: Could not read footer: java.lang.RuntimeException: hdfs://staging/staging/raw/kafka_topics/table_name/partition=2/table_name+2+0010255764+0010265763.parquet is not a Parquet file (too small)
    at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
    at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:188)
    at org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:114)
    at org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:47)
    at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:192)
    at io.confluent.connect.hdfs.parquet.ParquetFileReader.getSchema(ParquetFileReader.java:43)
    at io.confluent.connect.hdfs.TopicPartitionWriter.write(TopicPartitionWriter.java:275)
    at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:234)
    at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:435)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:251)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:180)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: hdfs://staging/staging/raw/kafka_topics/table_name/partition=2/table_name+2+0010255764+0010265763.parquet is not a Parquet file (too small)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
    ... 4 more
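For context on the "too small" error: a structurally complete Parquet file begins and ends with the 4-byte magic `PAR1` and carries a 4-byte footer-length field, so nothing under 12 bytes can be valid. A 4-byte file is most likely just the leading magic, written before the connector was killed mid-flush. A minimal sketch for spotting such truncated files (the function names are hypothetical, and a local directory stands in for the HDFS landing path, which would need an HDFS client in practice):

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"
# Minimum valid size: leading magic + 4-byte footer length + trailing magic.
MIN_PARQUET_SIZE = len(PARQUET_MAGIC) * 2 + 4  # 12 bytes


def is_truncated_parquet(path: Path) -> bool:
    """Return True if the file is too small or lacks the Parquet magic bytes."""
    data = path.read_bytes()
    if len(data) < MIN_PARQUET_SIZE:
        return True
    return not (data.startswith(PARQUET_MAGIC) and data.endswith(PARQUET_MAGIC))


def find_truncated(landing_dir: Path) -> list[Path]:
    """Scan a landing directory tree for truncated .parquet files."""
    return [p for p in sorted(landing_dir.rglob("*.parquet"))
            if is_truncated_parquet(p)]
```

Files flagged this way could be moved aside before restarting the connector, so the recovery scan that calls `ParquetFileReader.readFooter` does not trip over them.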

vdesabou commented May 9, 2022

This should be fixed by #614, starting from version 10.1.7.
