ZEPPELIN-1411. UDF with pyspark not working - object has no attribute 'parseDataType' #1404
Conversation
\cc @Leemoonsoo Please help review, thanks
LGTM
I guess it is because PythonInterpreter depends on the python environment, so there's no test for it yet.
Right. Probably another PR, but I think we could use Travis' addons support to install python via apt-get: https://docs.travis-ci.com/user/installing-dependencies/
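For reference, a sketch of what that could look like in `.travis.yml` (a hypothetical fragment based on the linked Travis docs, not this repo's actual CI config):

```yaml
# Hypothetical .travis.yml fragment: install python via the apt addon
# instead of an explicit before_install apt-get step.
addons:
  apt:
    packages:
      - python
      - python-dev
```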
CI failed because of selenium, I think.
Could you kick off CI again? Let's merge this after.
Thanks @zjffdu for the contribution. Actually, we do have some tests for pyspark already. If it's not too difficult, adding a unit test for this case would be really beneficial.
I tested this branch with the given example, but it doesn't work for me. I'm not sure whether it's a problem of this patch or not.
@Leemoonsoo Can you guide me on how to run this test? I tried to run it using maven, but it fails; it seems to depend on something.
Once you build Zeppelin, you can run this test. Let me know if it does not work.
@Leemoonsoo, I followed the above command, but it doesn't seem to work. I checked
@zjffdu Right, it looks like AbstractTestRestApi needs to be improved for when CI is not defined. And then try running the test cases, so
Force-pushed 632f148 to 68ae3a1
@Leemoonsoo, I updated the unit test, and also made a little change
Thanks @zjffdu. I think the second CI test profile failure is relevant. Could you check?
Force-pushed 68ae3a1 to 4922de1
Force-pushed f3db4f2 to a142d45
Force-pushed a142d45 to 73175da
Force-pushed 32cbff6 to a4fda47
Force-pushed 4384346 to 1ac1233
```java
// set spark home for pyspark
sparkIntpSetting.getProperties().setProperty("spark.home", sparkHome);
sparkIntpSetting.getProperties().setProperty("zeppelin.spark.useHiveContext", "false");
pySpark = true;
```
Disable HiveContext, otherwise we will hit the issue of multiple Derby instances.
```java
} else {
  sparkIntpSetting.getProperties()
      .setProperty("master", "spark://" + getHostname() + ":7071");
}
```
Allow the user to specify SPARK_MASTER, so that the test can run in other modes (like yarn-client).
This is testing code only, but it doesn't seem like we are using this in the tests?
It is for local system testing, when the user wants to run it in other modes (e.g. yarn-client).
```java
  return sparkHome;
}
sparkHome = getSparkHomeRecursively(
    new File(System.getProperty(ZeppelinConfiguration.ConfVars.ZEPPELIN_HOME.getVarName())));
System.out.println("SPARK HOME detected " + sparkHome);
```
Allow the user to specify SPARK_HOME, so that an existing spark cluster can be used.
```java
sc.stop();
sc = null;
sparkSession = null;
if (classServer != null) {
```
Set sparkSession to null, so that it will be created again if the interpreter is scoped.
stop should be called on sparkSession before sc.stop()
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession
(as of now this is ok since sparkSession.stop() simply calls sc.stop() but this could change)
Good catch. When sparkSession is not null (Spark 2.0), sparkSession.stop() should be called first.
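The ordering argument above can be sketched with plain Python stand-in classes (illustrative assumptions, not the actual Zeppelin or Spark API): stopping the session first stays correct even if a future SparkSession.stop() does more than delegate to SparkContext.stop().

```python
# Stand-in classes illustrating the shutdown-ordering argument.
# These are simplified hypothetical stand-ins, not the real Spark classes.

class SparkContext:
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class SparkSession:
    """Today stop() simply delegates to sc.stop(); a later version might add
    session-level cleanup, which is why the session must be stopped first."""

    def __init__(self, sc):
        self.sc = sc
        self.session_state_released = False

    def stop(self):
        self.session_state_released = True  # hypothetical extra cleanup
        self.sc.stop()


def shutdown(sc, spark_session):
    # Correct order: session first (if one exists, i.e. Spark 2.0),
    # then the context (a no-op here if the session already stopped it).
    if spark_session is not None:
        spark_session.stop()
    if not sc.stopped:
        sc.stop()
```

Calling `shutdown(sc, session)` releases the session state and stops the context; calling `shutdown(sc, None)` covers the Spark 1.x path where no session exists.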
@Leemoonsoo Finally got the unit test to pass (the remaining failure is irrelevant). Actually, the test failure was caused by several bugs.
The root cause of this issue is that the SQLContext signature changed in Spark 2.0. The second bug is that we should also set spark.home for pyspark. The third bug is that we should disable HiveContext, otherwise we hit the multiple-Derby-instance issue.
Force-pushed 1ac1233 to ad0b7b0
Force-pushed ad0b7b0 to 40b080a
@zjffdu Thanks for the great work!
ZEPPELIN-1411. UDF with pyspark not working - object has no attribute 'parseDataType'

The root cause is that SQLContext's signature changes in spark 2.0.

Spark 1.6
```
def __init__(self, sparkContext, sqlContext=None):
```
Spark 2.0
```
def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
```
So we need to create SQLContext using named parameters, otherwise it would take intp.getSQLContext() as sparkSession, which causes the issue.

[Bug Fix]

* [ ] - Task
* https://issues.apache.org/jira/browse/ZEPPELIN-1411

Tested using the example code in ZEPPELIN-1411.



* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

Author: Jeff Zhang <[email protected]>

Closes #1404 from zjffdu/ZEPPELIN-1411 and squashes the following commits:

40b080a [Jeff Zhang] retry
4922de1 [Jeff Zhang] log more logging for travis CI diangnose
4fe033d [Jeff Zhang] add unit test
296c63f [Jeff Zhang] ZEPPELIN-1411. UDF with pyspark not working - object has no attribute 'parseDataType'

(cherry picked from commit c61f1fb)
Signed-off-by: Lee moon soo <[email protected]>
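The signature change can be sketched with a minimal plain-Python stand-in (a hypothetical class mimicking the Spark 2.0 parameter order, not the real pyspark `SQLContext`), showing why a positional second argument misbinds and why the keyword-argument fix works:

```python
# Hypothetical stand-in mimicking the Spark 2.0 SQLContext signature;
# not the actual pyspark class.
class SQLContext20:
    def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
        self.sparkContext = sparkContext
        self.sparkSession = sparkSession
        self.jsqlContext = jsqlContext


sc = object()    # stand-in SparkContext
jsql = object()  # stand-in for the Java-side context from intp.getSQLContext()

# Positional call, as the old bootstrap code effectively made: under the
# Spark 2.0 signature the Java SQLContext lands in the sparkSession slot,
# which later produces the "object has no attribute 'parseDataType'" error.
broken = SQLContext20(sc, jsql)
assert broken.sparkSession is jsql  # bound to the wrong parameter
assert broken.jsqlContext is None

# The fix: pass it as a named parameter, so it binds to the intended slot.
fixed = SQLContext20(sc, jsqlContext=jsql)
assert fixed.sparkSession is None
assert fixed.jsqlContext is jsql
```

Under the Spark 1.6 signature (`sparkContext, sqlContext=None`) the positional call happened to bind correctly, which is why the bug only surfaced on Spark 2.0.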

What is this PR for?
The root cause is that SQLContext's signature changes in spark 2.0.
Spark 1.6
Spark 2.0
So we need to create SQLContext using named parameters, otherwise it would take intp.getSQLContext() as sparkSession which cause the issue.
What type of PR is it?
[Bug Fix]
Todos
What is the Jira issue?
How should this be tested?
Tested using the example code in ZEPPELIN-1411.
Screenshots (if appropriate)
Questions: