Add a script to build the criteo distribution of spark#6

Merged
AnthonyTruchet merged 1 commit into criteo-forks:criteo-1.6 from AnthonyTruchet:criteo-1.6
Oct 13, 2016

Conversation

@AnthonyTruchet

Add a script in dev/criteo to build the criteo distribution of Spark

Tested locally by building the Spark distribution.

@AnthonyTruchet AnthonyTruchet merged commit ec566fa into criteo-forks:criteo-1.6 Oct 13, 2016
Willymontaz pushed a commit to Willymontaz/spark that referenced this pull request Apr 2, 2019
## What changes were proposed in this pull request?

This PR optimizes GroupExpressions by removing repeated expressions. It adds `RemoveRepetitionFromGroupExpressions`.

**Before**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

**After**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```
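The effect visible in the Before and After plans can be illustrated with a toy model. The following is only a sketch in plain Python, not the Catalyst rule itself: the tuple encoding of expressions, the `canonicalize` helper, and the `remove_repetition` function are all hypothetical names introduced for illustration. It assumes that attribute names are case-insensitive and that `+` is commutative, so that `a+1`, `1+a`, `A+1`, and `1+A` all compare equal.

```python
# Toy model only: expressions are either a column name (str), a literal (int),
# or a tuple (operator, left, right). None of this is Spark's actual API.

def canonicalize(expr):
    """Return a normalized form so semantically equal expressions compare equal."""
    if isinstance(expr, str):
        # Assumption: case-insensitive resolution, so "A" and "a" are the same column.
        return expr.lower()
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    left, right = canonicalize(left), canonicalize(right)
    if op == "+":
        # Addition is commutative: order the operands deterministically.
        left, right = sorted((left, right), key=repr)
    return (op, left, right)


def remove_repetition(grouping_exprs):
    """Keep the first occurrence of each semantically distinct grouping expression."""
    seen = set()
    kept = []
    for expr in grouping_exprs:
        key = canonicalize(expr)
        if key not in seen:
            seen.add(key)
            kept.append(expr)
    return kept


# group by a+1, 1+a, A+1, 1+A
grouping = [("+", "a", 1), ("+", 1, "a"), ("+", "A", 1), ("+", 1, "A")]
print(remove_repetition(grouping))  # [('+', 'a', 1)]
```

With a single grouping key left, both the aggregate and the hash partitioning in the exchange only need to evaluate one expression, which matches the shape of the After plan.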

## How was this patch tested?

Pass the Jenkins tests (with a new testcase)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#12590 from dongjoon-hyun/SPARK-14830.

(cherry picked from commit 6e63201)
Signed-off-by: Michael Armbrust <michael@databricks.com>
jetoile pushed a commit that referenced this pull request Mar 13, 2024
backport [apache#31856](apache#31856) for branch-3.1

### What changes were proposed in this pull request?

Skip capturing the Maven repo config for views.

### Why are the changes needed?

Because of poor network connectivity, we always use a third-party Maven repository to run tests, e.g.:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxxxx
```

It fails with the following error message:
```
[info] - show-tblproperties.sql *** FAILED *** (128 milliseconds)
[info] show-tblproperties.sql
[info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][
[info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories xxxxx]" Result did not match for query #6
[info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464)
```

There is no need to capture the Maven config in a view, since it is a session-level config.
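As a rough illustration of the idea, the sketch below filters session configs before they are stored as view properties. It is a toy model in plain Python, not Spark's implementation: `capture_view_configs` and `DENIED_PREFIXES` are hypothetical names, and only the `view.sqlConfig.` property prefix and the `spark.sql.maven.additionalRemoteRepositories` key are taken from the error message above.

```python
# Toy model only; the function and constant names are hypothetical.
VIEW_CONFIG_PREFIX = "view.sqlConfig."    # prefix seen in the failing test output
DENIED_PREFIXES = ("spark.sql.maven.",)   # assumption: session-level configs to skip


def capture_view_configs(session_conf, denied_prefixes=DENIED_PREFIXES):
    """Return the view properties to store, skipping denied session-level configs."""
    return {
        VIEW_CONFIG_PREFIX + key: value
        for key, value in sorted(session_conf.items())
        # str.startswith accepts a tuple of prefixes.
        if not key.startswith(denied_prefixes)
    }


session_conf = {
    "spark.sql.ansi.enabled": "true",
    "spark.sql.maven.additionalRemoteRepositories": "xxxxx",
}
print(capture_view_configs(session_conf))
# {'view.sqlConfig.spark.sql.ansi.enabled': 'true'}
```

With the Maven entry filtered out, the output of `SHOW TBLPROPERTIES` no longer depends on which repository the test run happened to use.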
 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test passes:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxx
```

Closes apache#31856 from ulysses-you/skip-maven-config.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>

Closes apache#31879 from ulysses-you/SPARK-34766-3-1.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jetoile pushed a commit that referenced this pull request Mar 6, 2025
…anRelationPushDown

### What changes were proposed in this pull request?

Add the timezone information to a cast expression when the destination type requires it.

### Why are the changes needed?

When current_timestamp() is materialized as a string, the timezone information is lost (e.g., 2024-12-27 10:26:27.684158), which prevents further optimization rules from being applied to the affected data source.

For example,

```
Project [1735900357973433#10 AS current_timestamp()#6]
+- 'Project [cast(2025-01-03 10:32:37.973433#11 as timestamp) AS 1735900357973433#10]
   +- RelationV2[2025-01-03 10:32:37.973433#11] xxx
```

-> This query fails to execute because the injected cast expression lacks the timezone information.
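To see why the cast needs a time zone, consider the toy model below. It is a sketch in plain Python under stated assumptions, not Spark code: the `Cast` dataclass and the `needs_time_zone`, `with_time_zone`, and `evaluate` helpers are hypothetical. A timestamp string without an offset maps to a concrete instant only once a zone is chosen, so the sketch refuses to evaluate a cast to timestamp until a zone has been attached.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Cast:
    """Hypothetical stand-in for a cast expression over a string literal."""
    value: str
    to_type: str
    tz: Optional[timezone] = None


def needs_time_zone(to_type):
    # Assumption: in this toy model only casts to timestamp are zone-dependent.
    return to_type == "timestamp"


def with_time_zone(cast, session_tz):
    """Attach the session time zone when the destination type requires one."""
    if needs_time_zone(cast.to_type) and cast.tz is None:
        return replace(cast, tz=session_tz)
    return cast


def evaluate(cast):
    """Return microseconds since the epoch for a string-to-timestamp cast."""
    if cast.to_type != "timestamp":
        raise ValueError("this sketch only evaluates casts to timestamp")
    if cast.tz is None:
        raise ValueError("cast to timestamp has no time zone")
    local = datetime.strptime(cast.value, "%Y-%m-%d %H:%M:%S.%f")
    aware = local.replace(tzinfo=cast.tz)
    # Integer division of timedeltas avoids float rounding.
    return (aware - EPOCH) // timedelta(microseconds=1)


cast = Cast("2025-01-03 10:32:37.973433", "timestamp")
print(evaluate(with_time_zone(cast, timezone.utc)))                  # 1735900357973433
print(evaluate(with_time_zone(cast, timezone(timedelta(hours=9)))))  # 1735867957973433
```

Interpreted as UTC, the string from the plan yields 1735900357973433, the same number that appears in the plan's attribute name; a zone nine hours ahead yields a different instant, which is why the injected cast cannot be evaluated without a zone.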

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49549 from changgyoopark-db/SPARK-50870.

Authored-by: changgyoopark-db <changgyoo.park@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 24abb0f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>