Added more types for conversion #9
Conversation
Add more types with String type conversion, basic test is working
@icexelloss, @wesm we have added more types to the conversion. If you could take a look and make sure it looks OK from an Arrow perspective, that would be great - the StringType handling in particular. Good news: with the latest code from Arrow, the conversion of a small DataFrame with longs and doubles works! I'll keep working with more types/data to test. BTW, we are planning on moving the conversion code out of Dataset.scala and into a more suitable location; hopefully that won't affect your code. I'll leave this open for a day or so for discussion.
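For a sense of what the StringType path has to produce, here is a minimal, hypothetical sketch of writing a nullable string column into an Arrow `VarCharVector`. It uses the current Arrow Java API rather than the 0.1.1-SNAPSHOT this PR builds against (whose API differed); the object name and sample data are made up:

```scala
import java.nio.charset.StandardCharsets

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.VarCharVector

object StringColumnToArrow {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val vector = new VarCharVector("value", allocator)
    vector.allocateNew()

    // Write a nullable string column, as a StringType conversion would.
    val values = Seq("a", "bc", null, "def")
    values.zipWithIndex.foreach {
      case (null, i) => vector.setNull(i)
      case (s, i)    => vector.setSafe(i, s.getBytes(StandardCharsets.UTF_8))
    }
    vector.setValueCount(values.length)

    // Read the values back to verify the round trip.
    values.indices.foreach { i =>
      if (vector.isNull(i)) println("null")
      else println(new String(vector.get(i), StandardCharsets.UTF_8))
    }

    vector.close()
    allocator.close()
  }
}
```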
Cool! See #10 -- we should do plenty of unit testing in Scala as well; let me know how I can assist over the next couple of weeks.
Thanks @wesm! I'm having an issue now with testing this out on larger DataFrames, like 500k rows and 10 columns of doubles. It works fine with smaller sizes, but at about this size I start getting a seg fault. Any ideas what could be going on or where to look? I tracked it down to this line in NumPy, https://github.com/numpy/numpy/blob/v1.11.1/numpy/core/src/multiarray/array_assign_array.c#L96, but I can't figure out why it's going there. Here is the stack trace:
It's a memory lifetime issue. Ideally, the underlying PyBytesReader (https://github.com/apache/arrow/blob/master/python/src/pyarrow/io.h#L84) should retain a reference to the underlying PyBytes object so it can't be garbage-collected while still in use. I'll open an Arrow JIRA. In the meantime, I suggest you either convert the Arrow batch immediately to pandas (not letting the batch outlive the PyBytes object) or hold a reference to the PyBytes yourself.
Ahh, I see. Works great after holding a ref to the PyBytes object!
If it were me, I would try to carefully rebase and do fast-forward merges when working on an integration branch like this. It might be worth doing a …
Good point, I'll fix it up |
## What changes were proposed in this pull request?
This PR aims to optimize grouping expressions by removing repeated (semantically equal) expressions; a new optimizer rule, `RemoveRepetitionFromGroupExpressions`, is added.
**Before**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
: +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
: +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
+- WholeStageCodegen
: +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
: +- INPUT
+- LocalTableScan [a#0], [[1],[2]]
```
**After**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
: +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
: +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
+- WholeStageCodegen
: +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
: +- INPUT
+- LocalTableScan [a#0], [[1],[2]]
```
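For context, the deduplication can be expressed as a one-case Catalyst rule. A sketch close in spirit to what this patch adds, relying on Catalyst's `ExpressionSet`, which compares expressions semantically so that `a + 1`, `1 + a`, and `A + 1` collapse into a single grouping key:

```scala
import org.apache.spark.sql.catalyst.expressions.ExpressionSet
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Drops duplicate grouping expressions from Aggregate nodes: ExpressionSet
// deduplicates by canonicalized (semantic) equality, keeping the first
// occurrence of each distinct expression.
object RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case a @ Aggregate(grouping, _, _) =>
      a.copy(groupingExpressions = ExpressionSet(grouping).toSeq)
  }
}
```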
## How was this patch tested?
Passes the Jenkins tests (with a new test case).
Author: Dongjoon Hyun <[email protected]>
Closes apache#12590 from dongjoon-hyun/SPARK-14830.
(cherry picked from commit 6e63201)
Signed-off-by: Michael Armbrust <[email protected]>
### Why are the changes needed?
`EnsureRequirements` adds a `ShuffleExchangeExec` (RangePartitioning) after a Sort if `RoundRobinPartitioning` is behind it. This causes two shuffles, and the number of partitions in the final stage is not the number specified by `RoundRobinPartitioning`.

**Example SQL**
```
SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a
```

**BEFORE**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#11]
   +- Exchange RoundRobinPartitioning(5), false, [id=#9]
      +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```

**AFTER**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 5), true, [id=#11]
   +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Ran the existing test suites and added a new test for this.

Closes apache#26946 from stczwd/RoundRobinPartitioning.

Lead-authored-by: lijunqing <[email protected]>
Co-authored-by: stczwd <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
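As a usage note, the same two-shuffle plan can be reproduced through the DataFrame API. A hypothetical snippet, assuming a Hive table `test` with a sortable column `a` exists:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("repartition-orderby")
  .enableHiveSupport()
  .getOrCreate()

// repartition(5) introduces RoundRobinPartitioning; orderBy then requires
// range partitioning, which previously stacked a second Exchange on top.
val df = spark.table("test").repartition(5).orderBy("a")
df.explain() // with this fix: a single range Exchange with 5 partitions
```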
Using Arrow 0.1.1-SNAPSHOT artifact, added more types for testing