
Conversation

@ulysses-you
Contributor

@ulysses-you ulysses-you commented Feb 11, 2022

What changes were proposed in this pull request?

Skip filters with no attribute references when binding metadata-based filters.

Why are the changes needed?

This issue is from #35055.

To reproduce:

CREATE TABLE t (c1 int) USING PARQUET;

SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.BooleanSimplification;

SELECT * FROM t WHERE c1 = 1 AND 2 > 1;

The error message:

java.util.NoSuchElementException: next on empty iterator
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
	at scala.collection.mutable.LinkedHashSet$$anon$1.next(LinkedHashSet.scala:89)
	at scala.collection.IterableLike.head(IterableLike.scala:109)
	at scala.collection.IterableLike.head$(IterableLike.scala:108)
	at org.apache.spark.sql.catalyst.expressions.AttributeSet.head(AttributeSet.scala:69)
	at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.$anonfun$listFiles$3(PartitioningAwareFileIndex.scala:85)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listFiles(PartitioningAwareFileIndex.scala:84)
	at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions$lzycompute(DataSourceScanExec.scala:249)
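The crash happens because the literal-only predicate `2 > 1` (left intact since BooleanSimplification is excluded) has an empty reference set, and PartitioningAwareFileIndex calls `.head` on that empty AttributeSet. A minimal self-contained sketch of the idea behind the fix, using toy case classes that stand in for Spark's real Expression/AttributeSet (they are illustrative only, not Spark's actual classes):

```scala
// Toy model of expressions and their attribute references
// (illustrative only, not Spark's real Expression/AttributeSet).
case class Attribute(name: String)
sealed trait Expr { def references: Set[Attribute] }
case class AttrRef(a: Attribute) extends Expr { def references = Set(a) }
case class Literal(v: Int) extends Expr { def references = Set.empty[Attribute] }
case class EqualTo(l: Expr, r: Expr) extends Expr {
  def references = l.references ++ r.references
}
case class GreaterThan(l: Expr, r: Expr) extends Expr {
  def references = l.references ++ r.references
}

// WHERE c1 = 1 AND 2 > 1, with BooleanSimplification disabled
// so the literal-only predicate `2 > 1` survives optimization.
val filters: Seq[Expr] = Seq(
  EqualTo(AttrRef(Attribute("c1")), Literal(1)),
  GreaterThan(Literal(2), Literal(1)))

// Before the fix, taking `.references.head` of every filter throws
// NoSuchElementException on the literal-only predicate.
// The fix skips filters that reference no attributes:
val bindable = filters.filter(_.references.nonEmpty)
```

Only the `c1 = 1` predicate remains a candidate for metadata-based binding; the reference-free predicate is skipped instead of crashing the scan.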

Does this PR introduce any user-facing change?

Yes, a bug fix.

How was this patch tested?

Add a new test.

@ulysses-you
Contributor Author

cc @Yaohua628 @cloud-fan

@github-actions github-actions bot added the SQL label Feb 11, 2022
Contributor

@Yaohua628 Yaohua628 left a comment


Thanks for the fix! LGTM

Member

@HyukjinKwon HyukjinKwon left a comment


LGTM otherwise.

@ulysses-you
Contributor Author

thank you @HyukjinKwon, addressed the comment

Contributor


If we use rand() > 0.5, can we trigger the bug without disabling the optimizer rule?

Contributor Author


It cannot trigger this bug, since we cannot push down non-deterministic filters. See the code in FileSourceStrategy:

val normalizedFilters = DataSourceStrategy.normalizeExprs(
  filters.filter(_.deterministic), l.output)
val partitionColumns =
  l.resolve(
    fsRelation.partitionSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)
val partitionSet = AttributeSet(partitionColumns)
// this partitionKeyFilters should be the same with the ones being executed in
// PruneFileSourcePartitions
val partitionKeyFilters = DataSourceStrategy.getPushedDownFilters(partitionColumns,
  normalizedFilters)
// subquery expressions are filtered out because they can't be used to prune buckets or pushed
// down as data filters, yet they would be executed
val normalizedFiltersWithoutSubqueries =
  normalizedFilters.filterNot(SubqueryExpression.hasSubquery)
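The first line above is the relevant one: non-deterministic predicates are dropped before push-down ever happens. A toy sketch of just that step (the `Filter` case class here is made up for the example; it only models the `deterministic` flag of a Catalyst expression):

```scala
// Toy stand-in for Catalyst expressions: only the `deterministic`
// flag matters for this illustration.
case class Filter(sql: String, deterministic: Boolean)

val pushDownCandidates = Seq(
  Filter("c1 = 1", deterministic = true),
  Filter("rand() > 0.5", deterministic = false))

// Mirrors `filters.filter(_.deterministic)` in FileSourceStrategy:
// non-deterministic predicates never reach partition pruning,
// so `rand() > 0.5` alone cannot reproduce the bug.
val normalized = pushDownCandidates.filter(_.deterministic)
```

This is why excluding BooleanSimplification is needed in the reproduction: it is the only way to get a deterministic, reference-free predicate past the optimizer and into the push-down path.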

Member

@dongjoon-hyun dongjoon-hyun left a comment


Could you resolve the conflict, @ulysses-you ?

Member


And, if you don't mind, could you add a minimal test coverage at FileIndexSuite.scala instead of SQLQuerySuite, @ulysses-you ?

Contributor Author


thank you @dongjoon-hyun for the reminder!

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @ulysses-you and all.
Merged to master.

@ulysses-you
Contributor Author

thank you all !

@ulysses-you ulysses-you deleted the SPARK-38182 branch February 18, 2022 01:26