
Conversation

@lxsmnv

@lxsmnv lxsmnv commented Feb 20, 2017

What changes were proposed in this pull request?

The root cause of the problem is that when Spark infers a schema from a CSV file, it resolves the file path pattern more than once, calling DataSource.resolveRelation each time.

So, if we have a file path pattern like:
<...>/test*
and an actual file named test{00-1}.txt, the initial call to DataSource.resolveRelation resolves the pattern to /<...>/test{00-1}.txt. When Spark then infers the schema for the CSV file, it calls DataSource.resolveRelation a second time. This second attempt fails because the already-resolved name /<...>/test{00-1}.txt is now treated as a pattern rather than as an actual file, and since no file matches that pattern, the whole DataSource.resolveRelation call fails.
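The double-resolution failure is not specific to Hadoop globbing; any glob implementation behaves the same way. Here is a minimal Python sketch of the failure mode (Python's glob has no `{}` alternation, so a `[...]` character class stands in for the curly braces, and the file name `data[1].txt` is made up for illustration):

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # A file whose literal name contains glob metacharacters.
    literal = os.path.join(d, "data[1].txt")
    open(literal, "w").close()

    # First resolution: the user's pattern matches the literal file name.
    first = glob.glob(os.path.join(d, "data*"))
    assert first == [literal]

    # Second resolution: the already-resolved name is treated as a NEW
    # pattern. "[1]" now reads as a character class matching "data1.txt",
    # so the resolved file no longer matches anything.
    assert glob.glob(first[0]) == []
```

This mirrors what happens when DataSource.resolveRelation is invoked a second time with a name it already resolved.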

The idea behind the fix is straightforward: the part of DataSource.resolveRelation that creates the Hadoop relation from the resolved (actual) file names is moved into a separate function, createHadoopRelation. CSVFileFormat.createBaseDataset now calls this new function instead of DataSource.resolveRelation, which was causing the unnecessary file path resolution.

How was this patch tested?

manual tests

This contribution is my original work and I license the work to the project under the project’s open source license.

@AmplabJenkins

Can one of the admins verify this patch?

@maropu
Member

maropu commented Feb 20, 2017

Could you add tests for this pr?

* @return Hadoop relation object
*/
def createHadoopRelation(format: FileFormat,
globPaths: Array[Path]): BaseRelation = {
Member

You call getOrInferFileFormatSchema twice. One call is before calling createHadoopRelation.

Author

@viirya I will fix this. Looks like merge issue.
@maropu I will add tests.

@HyukjinKwon
Member

HyukjinKwon commented Feb 20, 2017

@lxsmnv, could you check if this is a more general problem? I suspect it is not only a CSV specific issue. IIRC, I tested several cases with other datasources and it did not work correctly in some cases when I saw this JIRA.

@lxsmnv
Author

lxsmnv commented Feb 21, 2017

@viirya I've removed a duplicate call for getOrInferFileFormatSchema. Thanks for pointing out.
@maropu I've added the test case.

@lxsmnv
Author

lxsmnv commented Feb 21, 2017

@HyukjinKwon the problem I found was in CSVFileFormat, so it is more of a CSV-specific issue. It could also affect some other data source types, but not all of them; that depends on each data source's implementation. If the same problem exists for other data source types, those cases may share a common cause, and fixing it generically could require a significant amount of work and changes.

My fix is quite simple and doesn't introduce many changes, so for now I would suggest merging it. Aiming for a more generic solution right away may end up with neither option being implemented.

If you tried to reproduce this issue with other data source types, could you create a new ticket with the details of the tests you ran? I will have a look and think about a more generic approach.

* @return Hadoop relation object
*/
def createHadoopRelation(format: FileFormat,
globPaths: Array[Path]): BaseRelation = {
Member

Let's make this inlined.

@viirya
Member

viirya commented Feb 21, 2017

Actually I have a simpler fix like this:

--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -213,7 +213,12 @@ class SparkHadoopUtil extends Logging {
   }

   def globPathIfNecessary(pattern: Path): Seq[Path] = {
-    if (isGlobPath(pattern)) globPath(pattern) else Seq(pattern)
+    val fs = pattern.getFileSystem(conf)
+    if (fs.exists(pattern)) {
+      Seq(pattern)
+    } else {
+      if (isGlobPath(pattern)) globPath(pattern) else Seq(pattern)
+    }
   }
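In Python terms, the check-before-glob behaviour of this diff can be sketched as follows (the function names echo the Spark helpers but are illustrative, not the real Spark API, and the metacharacter set is an approximation of Hadoop's glob syntax):

```python
import glob
import os

# Rough approximation of the metacharacters Hadoop-style globs recognize.
GLOB_CHARS = set("{}[]*?")

def is_glob_path(pattern: str) -> bool:
    # Loose analogue of SparkHadoopUtil.isGlobPath: does the path contain
    # any glob metacharacter?
    return any(c in GLOB_CHARS for c in pattern)

def glob_path_if_necessary(pattern: str) -> list:
    # Check-before-glob: if a file with the literal name exists, return it
    # as-is, so an already-resolved name is never re-read as a pattern.
    if os.path.exists(pattern):
        return [pattern]
    if is_glob_path(pattern):
        return glob.glob(pattern)
    return [pattern]
```

The key point is the existence check running first: a literal file name that happens to contain metacharacters short-circuits the glob expansion.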

@viirya
Member

viirya commented Feb 23, 2017

@lxsmnv What do you think?

@lxsmnv
Author

lxsmnv commented Feb 23, 2017

@viirya Looks good to me :) and it is more generic. Do you want me to update my pull request, or do you have another pull request with that fix?

@lxsmnv
Author

lxsmnv commented Feb 23, 2017

However, resolving path patterns and checking file existence multiple times is a bit awkward. That is a more general problem, though: all of this data source code needs refactoring.

@lxsmnv
Author

lxsmnv commented Feb 23, 2017

There is a possible problem with adding a file existence check to globPathIfNecessary that I have just realized. If the user provides a pattern that is exactly the same as an existing file name (e.g. a file named test* exists and the user provides test*, meaning it as a pattern, not an exact file name), then globPathIfNecessary with the proposed modification will resolve it to only that one file, test*. This changes the existing behaviour, but I am not sure it is a big issue in practice.
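The concern can be shown concretely. In this self-contained Python sketch (on a POSIX filesystem, where `*` is legal in a file name), a literal file named `test*` shadows the pattern once an exists-first check is in place:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # One file literally named "test*", plus two files the pattern would match.
    for name in ("test*", "test1.txt", "test2.txt"):
        open(os.path.join(d, name), "w").close()

    pattern = os.path.join(d, "test*")

    # Plain glob expansion: the user means a pattern, and three files match.
    assert len(glob.glob(pattern)) == 3

    # Exists-first check (as in the proposed globPathIfNecessary change):
    # the literal file "test*" exists, so the pattern is never expanded
    # and only the one literal name comes back.
    resolved = [pattern] if os.path.exists(pattern) else glob.glob(pattern)
    assert resolved == [pattern]
```

So the behaviour change only bites when a file name itself contains glob metacharacters and the user intended those characters as a pattern, which is presumably rare in practice.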

@gatorsmile
Member

Since this could cause a behavior change, how about we close this PR for now?

@gatorsmile
Member

We are closing it due to inactivity. Please reopen if you want to push it forward. Thanks!

@asfgit asfgit closed this in b32bd00 Jun 27, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.