[SPARK-24805][SQL] Do not ignore avro files without extensions by default #21769
    @@ -39,6 +39,7 @@ import org.apache.spark.sql.types._
     class AvroSuite extends SparkFunSuite {
       val episodesFile = "src/test/resources/episodes.avro"
       val testFile = "src/test/resources/test.avro"
    +  val episodesWithoutExtension = "src/test/resources/episodesAvro"

       private var spark: SparkSession = _

    @@ -623,7 +624,7 @@ class AvroSuite extends SparkFunSuite {
           spark.read.avro("*/*/*/*/*/*/*/something.avro")
         }

    -    intercept[FileNotFoundException] {
    +    intercept[java.io.IOException] {
           TestUtils.withTempDir { dir =>
             FileUtils.touch(new File(dir, "test"))
             spark.read.avro(dir.toString)

    @@ -809,4 +810,16 @@ class AvroSuite extends SparkFunSuite {
           assert(readDf.collect().sameElements(writeDf.collect()))
         }
       }
    +
    +  test("SPARK-24805: reading files without .avro extension") {
    +    val df1 = spark.read.avro(episodesWithoutExtension)
    +    assert(df1.count == 8)
    +
    +    val schema = new StructType()
    +      .add("title", StringType)
    +      .add("air_date", StringType)
    +      .add("doctor", IntegerType)
    +    val df2 = spark.read.schema(schema).avro(episodesWithoutExtension)
    +    assert(df2.count == 8)
    +  }
     }
I tried running queries. The option `avro.mapred.ignore.inputs.without.extension` is not set in `conf`. This is a bug in `spark-avro`.

Please read the value from `options`. It would be good to have a new test case with `avro.mapred.ignore.inputs.without.extension` set to true.
`avro.mapred.ignore.inputs.without.extension` is a Hadoop parameter. This PR aims to change only the default behavior. I would prefer not to convert the Hadoop parameter into an Avro datasource option here.
Here is how people use the option so far: databricks/spark-avro#71 (comment). We should probably discuss separately from this PR how to fix the "bug" without breaking backward compatibility.
The Hadoop config can be changed like:
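One way to do this, sketched under a few assumptions: a local `SparkSession` is created here for illustration (in `AvroSuite` it would be the existing `spark` field), the Avro data source is on the classpath, and the input path is hypothetical. The property is set on the SparkContext's shared Hadoop configuration, so it applies to every read in the session rather than to a single query.

```scala
import org.apache.spark.sql.SparkSession

object IgnoreAvroInputsWithoutExtension {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("avro-extension-example")
      .getOrCreate()

    // Set the Hadoop property discussed in this thread.
    // "true" asks the reader to skip files that lack the .avro extension.
    spark.sparkContext.hadoopConfiguration
      .set("avro.mapred.ignore.inputs.without.extension", "true")

    // Hypothetical directory; replace with a real location.
    val df = spark.read.format("avro").load("/path/to/avro/dir")
    println(df.count())

    spark.stop()
  }
}
```

Because the setting lives in the session-wide Hadoop configuration, it cannot differ between two reads in the same application, which is the limitation a per-read data source option would address.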
Can we submit a separate PR to add a new option for Avro? We should not rely on `hadoopConf` to control the behavior of Avro.
Here is the PR: #21798. Please have a look at it.