[SPARK-18726][SQL] resolveRelation for FileFormat DataSource doesn't need to listFiles twice #17081
Changes from 7 commits
```diff
@@ -86,7 +86,7 @@ case class DataSource(
   lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
   lazy val sourceInfo: SourceInfo = sourceSchema()
   private val caseInsensitiveOptions = CaseInsensitiveMap(options)
+  private lazy val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)

   /**
    * Get the schema of the given FileFormat, if provided by `userSpecifiedSchema`, or try to infer
    * it. In the read path, only managed tables by Hive provide the partition columns properly when
```
```diff
@@ -122,7 +122,7 @@ case class DataSource(
       val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
       SparkHadoopUtil.get.globPathIfNecessary(qualified)
     }.toArray
-    new InMemoryFileIndex(sparkSession, globbedPaths, options, None)
+    new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
```
Member

This also impacts the streaming code path. If it is fine for streaming, the code changes look good to me.

Contributor
Author

I have made it local-only in the non-streaming `FileFormat` match case.
```diff
     }
     val partitionSchema = if (partitionColumns.isEmpty) {
       // Try to infer partitioning, because no DataSource in the read path provides the partitioning
```
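The point of the change above is that `resolveRelation` can end up constructing more than one file index over the same paths (for example, once while inferring the partitioning/schema and once for the relation itself), and without a shared cache each construction re-lists the files. The following is a minimal, hypothetical Python sketch of that idea, not Spark's actual implementation; all class and method names here are illustrative stand-ins for `FileStatusCache` and `InMemoryFileIndex`:

```python
# Hypothetical sketch of the PR's idea: share one file-status cache
# between index instances so the expensive file listing runs only once.
# These classes are illustrative, not Spark's real API.
import os
from typing import Dict, List


class FileStatusCache:
    """Caches the result of listing a directory, keyed by path."""

    def __init__(self) -> None:
        self._listings: Dict[str, List[str]] = {}
        self.miss_count = 0  # counts actual filesystem listings performed

    def list_files(self, path: str) -> List[str]:
        if path not in self._listings:
            self.miss_count += 1
            self._listings[path] = sorted(os.listdir(path))
        return self._listings[path]


class InMemoryFileIndex:
    """Simplified stand-in: lists its root paths eagerly on construction."""

    def __init__(self, paths: List[str], cache: FileStatusCache) -> None:
        self.files = [f for p in paths for f in cache.list_files(p)]


if __name__ == "__main__":
    import tempfile

    root = tempfile.mkdtemp()
    for name in ("part-0", "part-1"):
        open(os.path.join(root, name), "w").close()

    cache = FileStatusCache()
    # First index (e.g. built to infer the schema) performs the listing.
    index_for_schema = InMemoryFileIndex([root], cache)
    # Second index (e.g. built for the relation) reuses the cached listing.
    index_for_relation = InMemoryFileIndex([root], cache)

    print(cache.miss_count)  # the directory was listed only once
```

With a separate cache per index (or no cache), `miss_count` would be 2 here; sharing the cache is what avoids the second listing.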
```diff
@@ -364,7 +364,12 @@ case class DataSource(
         catalogTable.get,
         catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
     } else {
-      new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
+      new InMemoryFileIndex(
```
Contributor

I'd like to create the file status cache as a local variable and pass it to …

Contributor
Author

OK, I think that is more reasonable. Thanks!
```diff
+        sparkSession,
+        globbedPaths,
+        options,
+        Some(partitionSchema),
+        fileStatusCache)
     }

     HadoopFsRelation(
```
What's the lifetime of this cache?