@rolmovel

Actually, with Zeppelin we can use Spark SQL UDFs perfectly fine.

We developed a custom UDF library that parses absolute and relative dates. Registering this library with Spark SQL through the standard UDF mechanism is suboptimal, because each UDF call is evaluated again for every row of the queried table.

Example:

select * from my_table where agg_date >= parseDate("-5d")

This repeats the call to parseDate(...) for every single row of 'my_table'.

Even worse, if we filter for a date range like in:

select * from my_table where agg_date >= parseDate("-5d") and agg_date <= parseDate("now")

the call to parseDate(...) is performed twice for each row in the table.
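
The cost described above can be illustrated with a small sketch in plain Python (not Spark). The `parse_date` helper, the fixed `now` value and the ten-row table are assumptions made for illustration only; the real parseDate library is not shown in this PR description. The sketch just counts how often the parser runs when it is called inside the per-row predicate versus once before the scan.

```python
from datetime import datetime, timedelta

calls = 0  # counts how often the hypothetical parse_date helper runs


def parse_date(expr, now):
    """Hypothetical stand-in for parseDate: supports only "now" and "-Nd"."""
    global calls
    calls += 1
    if expr == "now":
        return now
    if expr.startswith("-") and expr.endswith("d"):
        return now - timedelta(days=int(expr[1:-1]))
    raise ValueError(f"unsupported date expression: {expr!r}")


now = datetime(2015, 9, 22)
# Ten rows, one per day, ending at `now` (illustrative data).
rows = [now - timedelta(days=i) for i in range(10)]


def in_range_per_row(d):
    # Both bounds are parsed again for every row, as with a plain UDF call.
    lower = parse_date("-5d", now)
    upper = parse_date("now", now)
    return lower <= d <= upper


calls = 0
per_row = [d for d in rows if in_range_per_row(d)]
per_row_calls = calls  # two parser calls per row

# Evaluate once: the bounds are computed before the scan starts.
calls = 0
lower, upper = parse_date("-5d", now), parse_date("now", now)
hoisted = [d for d in rows if lower <= d <= upper]
hoisted_calls = calls  # two parser calls in total

print(per_row_calls, hoisted_calls)
```

Both variants select the same rows; only the number of parser calls differs, and that number grows with the table size in the per-row case.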

Since Spark's UDFs have no concept of an 'execution context', we were not able to work around the problem.

We therefore implemented a mechanism in Zeppelin that evaluates UDF expressions before the query parameters are sent to the interpreter. With queries parameterized as usual in Zeppelin, you can now enter expressions like the following in Zeppelin's input forms:

eval:parseDate("-5d")

or:

eval:com.company.custom.udf.UDFUtility.parseDate("-5d")

This is similar to how standard SQL works, where parameters are evaluated before being sent to the execution engine.

You can find more info in the org.apache.zeppelin.display.Evaluator javadoc.
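
As a rough illustration of the idea (not the Evaluator class from this PR), the sketch below resolves `eval:`-prefixed form values once and substitutes the result into the query text before anything reaches an interpreter. Everything in it is an assumption made for illustration: the `resolve_param` and `render` helpers, the function registry, the simplified `${name}` placeholder syntax, and the `parse_date` stand-in. It handles only a bare function name with one quoted string argument, not the fully qualified class form.

```python
import re
from datetime import date, timedelta

EVAL_PREFIX = "eval:"


def parse_date(expr, today):
    """Hypothetical stand-in for parseDate: supports only "now" and "-Nd"."""
    if expr == "now":
        return today
    if expr.startswith("-") and expr.endswith("d"):
        return today - timedelta(days=int(expr[1:-1]))
    raise ValueError(f"unsupported date expression: {expr!r}")


def resolve_param(value, functions):
    """Evaluate 'eval:name("arg")' once; return other values unchanged."""
    if not value.startswith(EVAL_PREFIX):
        return value
    match = re.fullmatch(r'(\w+)\("([^"]*)"\)', value[len(EVAL_PREFIX):])
    if match is None:
        raise ValueError(f"unsupported expression: {value!r}")
    name, arg = match.groups()
    return str(functions[name](arg))


def render(query, params, functions):
    """Substitute resolved parameters into ${name} placeholders."""
    resolved = {k: resolve_param(v, functions) for k, v in params.items()}
    return re.sub(r"\$\{(\w+)\}", lambda m: resolved[m.group(1)], query)


today = date(2015, 9, 22)  # fixed date so the result is reproducible
functions = {"parseDate": lambda arg: parse_date(arg, today)}
query = "select * from my_table where agg_date >= '${from}'"

rendered = render(query, {"from": 'eval:parseDate("-5d")'}, functions)
print(rendered)
# select * from my_table where agg_date >= '2015-09-17'
```

The interpreter then receives a query containing only a literal, so the date expression is evaluated once regardless of the number of rows.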

The query above takes about 1 minute over a table of 1 million records. With this PR applied, the execution time drops to 15 seconds.

Rodrigo Olmo Velasco and others added 2 commits September 22, 2015 16:56
…ut fields. Expression will be evaluated server-side by Zeppelin before being sent to the interpreter.
@bzz
Member

bzz commented Oct 6, 2015

Looks interesting, thank you for contributing!

Please help me understand: am I right that these changes potentially affect the syntax of all interpreters and, code-wise, are not localised to your particular use case with Spark SQL?

@lucarosellini
Contributor

Hi @bzz,
This feature is unaware of the underlying interpreter the code is being sent to; no interpreter-specific code has been changed.
We've tested it successfully with the Spark SQL, Hive and Markdown interpreters, and it should work with any other interpreter as well.

@bzz
Member

bzz commented Jan 5, 2016

@lucarosellini thanks for the explanation!

@rolmovel Could you merge the latest master to resolve the conflicts, and also update zeppelin-distribution/src/bin_license/LICENSE with the newly added dependencies?

@corneadoug
Contributor

@rolmovel If this PR is still needed, can we try to rebase it?

@felixcheung
Member

This looks to be a unique and important feature; it would be great to have this in Zeppelin.

@asfgit asfgit closed this in c38a0a0 May 9, 2018
asfgit pushed a commit that referenced this pull request May 9, 2018
close #83
close #86
close #125
close #133
close #139
close #146
close #193
close #203
close #246
close #262
close #264
close #273
close #291
close #299
close #320
close #347
close #389
close #413
close #423
close #543
close #560
close #658
close #670
close #728
close #765
close #777
close #782
close #783
close #812
close #822
close #841
close #843
close #878
close #884
close #918
close #989
close #1076
close #1135
close #1187
close #1231
close #1304
close #1316
close #1361
close #1385
close #1390
close #1414
close #1422
close #1425
close #1447
close #1458
close #1466
close #1485
close #1492
close #1495
close #1497
close #1536
close #1545
close #1561
close #1577
close #1600
close #1603
close #1678
close #1695
close #1739
close #1748
close #1765
close #1767
close #1776
close #1783
close #1799