[SPARK-30648][SQL] Support filters pushdown in JSON datasource #27366
Closed
Changes from all commits
102 commits (all by MaxGekk):
a4c6c93  Add SQL config and push filters down to JSON
ac7c730  Add a test to JsonSuite
b0ff6c9  Push filters to JacksonParser
a79dacd  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
bb12fd5  Refactor the test
ccc0940  Add convertRootObject
b7f17b1  Add JsonPredicate to JsonFilters
521a685  Implemented JsonFilters.reset
a8486bf  Implemented allPredicates
b0d6939  Add buildPredicates()
e814986  Simplify buildPredicates()
c05b1e9  Refactoring buildPredicates()
15f0390  Pass StructType to JsonFilters
02aca76  Embed code to indexedPredicates
bd1d093  Simplify skipRow and reset
1c64b37  renaming
f83b93a  Deduplicate code
dd547ce  Bug fix literals
0ada227  Adopt test for complex filters to JsonFilters
f0d6a72  Add JacksonParserSuite
52e65d0  Add a benchmark
c60b332  Check spark.sql.json.filterPushdown.enabled in JsonFilters
617197a  Update benchmark results for JDK 8
03da0b2  Add comments to StructFilters
a122fb7  Add comments to JsonFilters
0aa8499  Add a test for malformed JSON records
ee53875  Add more cases in JsonSuite
3381607  Update benchmark results on jdk 11
144f5a7  Dedup code to toPredicate
94a22e1  Dedup code in JacksonParser
5d0ead1  fix coding style
bd1853c  Bug fix: convert Option to Array explicitly
330aae7  Remove empty line in JsonBenchmark
449a7e5  Fix indentation in convertObject
4527660  Check correct SQL config in JsonScanBuilder
675682b  Add a test for pushed filters to JsonScanBuilder to JsonSuite
23191e9  Set default value for filters in JacksonParser
67a74ad  Fix typo: mep -> map
4a7f0b0  Remove unused import in CSVScanBuilder.scala
d9bb50f  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
1230f60  Fix indentation in filterToExpression
6e0aa47  size -> length
a455977  Add test "case sensitivity of filters references" to JsonSuite
942b9a9  Compute set of schema field names only once
39b4487  Add test "case sensitivity of filters references" to CSVSuite
e53171b  Regen results of CSVBenchmark and JsonBenchmark on JDK 8
f4c63fa  Regen results of CSVBenchmark and JsonBenchmark on JDK 11
bd5b9a9  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
a583247  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
9c76267  Change year pattern for legacy parser: uuuu -> yyyy
3279fcb  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
e78bacc  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
dc66f82  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
02cd63d  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
443992a  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
648c23b  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
1c4f281  Remove duplicate import in JsonSuite
0e6ffb5  Re-gen benchmarks results on JDK 8
01a7ee3  Re-gen JSON and CSV benchmark results on JDK 11
4e623b3  Filter out not-supported filters
262e3c7  Merge remote-tracking branch 'origin/master' into json-filters-pushdown
f2d0cad  Merge remote-tracking branch 'origin/master' into json-filters-pushdown
db1ac35  Re-gen benchmarks on JDK 8
4c37c9a  Re-gen benchmarks on JDK 11
8bfd599  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
e08b6e0  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
9012456  Set version 3.1.0 for the SQL config spark.sql.json.filterPushdown.en…
31ad92c  Add an assert to `skipRow()`
50b9bb2  Merge remote-tracking branch 'origin/master' into json-filters-pushdown
9a8ba45  Replace schema.fieldIndex(attr) by index
90559de  Remove s"
38eb601  Add a comment about benchmarks for filters w/ nested column attributes
0d44c04  Merge remote-tracking branch 'remotes/origin/master' into json-filter…
e57ebd1  Update JsonBenchmark-jdk11-results.txt
36412ca  Update JsonBenchmark-results.txt
b7bdcff  Merge remote-tracking branch 'origin/master' into json-filters-pushdown
eb79544  Update JsonBenchmark-jdk11-results.txt
0a133ad  Update JsonBenchmark-results.txt
d4b88d4  Merge remote-tracking branch 'origin/master' into json-filters-pushdown
6921415  Simplify `if else`
0a1e575  Exit earlier from skipRow
3df60c1  Add a comment for checkFilterRefs
649d187  Make toRef() private
2173343  Add comments for JsonFilters
e55bb50  Use StructFilters.pushedFilters()
60cd07a  Add a comment for toRef
8ecede6  can be places -> can be placed
77bd18e  And -> Or
864ba7d  Move refCount
0155b05  Fix comments
193a57a  Fix comments in StructFilters
35c056e  Fix comments in CSVFilters
43f75a6  Fix comments in JsonFilters
4d5fe2c  `index`` -> `index`
ba7db8b  Refactoring: adding types
0ca1417  Add protected to createFilters() in StructFiltersSuite
18dea26  Fix indentation in comments
3bf3270  Move common assumption from JsonFilters to StructFilters
3f7d338  Add tests back
6938ec5  JIRA for TODO
fc725bc  `non` -> `not`
57524d6  Merge remote-tracking branch 'origin/master' into json-filters-pushdown
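Taken together, the commits above add the SQL config spark.sql.json.filterPushdown.enabled (versioned 3.1.0) and push source filters down into JacksonParser so that non-matching records can be skipped while JSON is being parsed. A minimal sketch of exercising the feature, assuming a local Spark session and a made-up input file and schema (neither is part of this PR):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: the path /tmp/people.json and the id/name schema are made up.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-filter-pushdown-demo")
  .getOrCreate()

// The config added by this PR; when true, supported filters are pushed into JSON parsing.
spark.conf.set("spark.sql.json.filterPushdown.enabled", "true")

val df = spark.read
  .schema("id INT, name STRING")
  .json("/tmp/people.json")
  .filter("id > 100") // a candidate filter for pushdown into the JSON datasource

df.explain() // the scan node should report pushed filters when pushdown applies
df.show()
```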
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/StructFilters.scala (new file: 166 additions, 0 deletions)
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst

import scala.util.Try

import org.apache.spark.sql.catalyst.StructFilters._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.sources
import org.apache.spark.sql.types.{BooleanType, StructType}

/**
 * The class provides API for applying pushed down filters to partially or
 * fully set internal rows that have the struct schema.
 *
 * `StructFilters` assumes that:
 *   - `reset()` is called before any `skipRow()` calls for new row.
 *
 * @param pushedFilters The pushed down source filters. The filters should refer to
 *                      the fields of the provided schema.
 * @param schema The required schema of records from datasource files.
 */
abstract class StructFilters(pushedFilters: Seq[sources.Filter], schema: StructType) {

  protected val filters = StructFilters.pushedFilters(pushedFilters.toArray, schema)

  /**
   * Applies pushed down source filters to the given row assuming that
   * value at `index` has been already set.
   *
   * @param row The row with fully or partially set values.
   * @param index The index of already set value.
   * @return `true` if currently processed row can be skipped otherwise false.
   */
  def skipRow(row: InternalRow, index: Int): Boolean

  /**
   * Resets states of pushed down filters. The method must be called before
   * processing any new row otherwise `skipRow()` may return wrong result.
   */
  def reset(): Unit

  /**
   * Compiles source filters to a predicate.
   */
  def toPredicate(filters: Seq[sources.Filter]): BasePredicate = {
    val reducedExpr = filters
      .sortBy(_.references.length)
      .flatMap(filterToExpression(_, toRef))
      .reduce(And)
    Predicate.create(reducedExpr)
  }

  // Finds a filter attribute in the schema and converts it to a `BoundReference`
  private def toRef(attr: String): Option[BoundReference] = {
    // The names have been normalized and case sensitivity is not a concern here.
    schema.getFieldIndex(attr).map { index =>
      val field = schema(index)
      BoundReference(index, field.dataType, field.nullable)
    }
  }
}

object StructFilters {
  private def checkFilterRefs(filter: sources.Filter, fieldNames: Set[String]): Boolean = {
    // The names have been normalized and case sensitivity is not a concern here.
    filter.references.forall(fieldNames.contains)
  }

  /**
   * Returns the filters currently supported by the datasource.
   * @param filters The filters pushed down to the datasource.
   * @param schema data schema of datasource files.
   * @return a sub-set of `filters` that can be handled by the datasource.
   */
  def pushedFilters(filters: Array[sources.Filter], schema: StructType): Array[sources.Filter] = {
    val fieldNames = schema.fieldNames.toSet
    filters.filter(checkFilterRefs(_, fieldNames))
  }

  private def zip[A, B](a: Option[A], b: Option[B]): Option[(A, B)] = {
    a.zip(b).headOption
  }

  private def toLiteral(value: Any): Option[Literal] = {
    Try(Literal(value)).toOption
  }

  /**
   * Converts a filter to an expression and binds it to row positions.
   *
   * @param filter The filter to convert.
   * @param toRef The function converts a filter attribute to a bound reference.
   * @return some expression with resolved attributes or `None` if the conversion
   *         of the given filter to an expression is impossible.
   */
  def filterToExpression(
      filter: sources.Filter,
      toRef: String => Option[BoundReference]): Option[Expression] = {
    def zipAttributeAndValue(name: String, value: Any): Option[(BoundReference, Literal)] = {
      zip(toRef(name), toLiteral(value))
    }
    def translate(filter: sources.Filter): Option[Expression] = filter match {
      case sources.And(left, right) =>
        zip(translate(left), translate(right)).map(And.tupled)
      case sources.Or(left, right) =>
        zip(translate(left), translate(right)).map(Or.tupled)
      case sources.Not(child) =>
        translate(child).map(Not)
      case sources.EqualTo(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(EqualTo.tupled)
      case sources.EqualNullSafe(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(EqualNullSafe.tupled)
      case sources.IsNull(attribute) =>
        toRef(attribute).map(IsNull)
      case sources.IsNotNull(attribute) =>
        toRef(attribute).map(IsNotNull)
      case sources.In(attribute, values) =>
        val literals = values.toSeq.flatMap(toLiteral)
        if (literals.length == values.length) {
          toRef(attribute).map(In(_, literals))
        } else {
          None
        }
      case sources.GreaterThan(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(GreaterThan.tupled)
      case sources.GreaterThanOrEqual(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(GreaterThanOrEqual.tupled)
      case sources.LessThan(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(LessThan.tupled)
      case sources.LessThanOrEqual(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(LessThanOrEqual.tupled)
      case sources.StringContains(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(Contains.tupled)
      case sources.StringStartsWith(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(StartsWith.tupled)
      case sources.StringEndsWith(attribute, value) =>
        zipAttributeAndValue(attribute, value).map(EndsWith.tupled)
      case sources.AlwaysTrue() =>
        Some(Literal(true, BooleanType))
      case sources.AlwaysFalse() =>
        Some(Literal(false, BooleanType))
    }
    translate(filter)
  }
}

class NoopFilters extends StructFilters(Seq.empty, new StructType()) {
  override def skipRow(row: InternalRow, index: Int): Boolean = false
  override def reset(): Unit = {}
}
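The scaladoc above fixes the calling contract: `reset()` must run before a new row is processed, and `skipRow(row, index)` is consulted after the value at `index` has been set. A purely illustrative sketch of a concrete subclass (the name `WholeRowFilters` is made up here and is not the PR's `JsonFilters`) that evaluates all pushed filters only once the last field of the row has been written:

```scala
import org.apache.spark.sql.catalyst.{InternalRow, StructFilters}
import org.apache.spark.sql.catalyst.expressions.BasePredicate
import org.apache.spark.sql.sources
import org.apache.spark.sql.types.StructType

// Hypothetical subclass: check the pushed filters only after the whole row is set.
class WholeRowFilters(pushedFilters: Seq[sources.Filter], schema: StructType)
  extends StructFilters(pushedFilters, schema) {

  // `filters` (inherited) keeps only filters whose references exist in `schema`;
  // compile them into one combined predicate if any survived.
  private val predicate: Option[BasePredicate] =
    if (filters.isEmpty) None else Some(toPredicate(filters))

  // Skip the row only when the last column has been set and the predicate fails.
  override def skipRow(row: InternalRow, index: Int): Boolean =
    index == schema.length - 1 && predicate.exists(p => !p.eval(row))

  // No per-row state to reset in this sketch.
  override def reset(): Unit = {}
}
```

Judging by the commit messages above ("Embed code to indexedPredicates", "Exit earlier from skipRow"), the PR's `JsonFilters` goes further and groups predicates by column so rows can be rejected before parsing finishes; the sketch only shows the minimum needed to satisfy the `StructFilters` contract.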