[SPARK-16281][SQL] Implement parse_url SQL function #14008
janplus wants to merge 19 commits into apache:master from janplus:SPARK-16281
Conversation
cc @rxin and @cloud-fan
```scala
expression[StringTrimLeft]("ltrim"),
expression[JsonTuple]("json_tuple"),
expression[FormatString]("printf"),
expression[ParseUrl]("parse_url"),
```
OK, thank you for the review. I'll fix this.
@dongjoon-hyun can you help review this one?

Oh, sure. @rxin
```scala
 */
@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO\n"
```
Hi, @janplus.
There is a limitation in the Scala 2.10 compiler: for `extended`, "+" breaks the build.
Please use one single """ """ string, as SubstringIndex does: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L498

Hi, @dongjoon-hyun.
Thank you for the review. I'll fix this.
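For reference, a sketch of the triple-quoted form suggested above (the HOST example line is an assumption added for illustration; the usage line and the QUERY examples appear elsewhere in this PR's diff):

```scala
@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO
    > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')
    'spark.apache.org'
    > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
    '1'""")
```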
Hi, @janplus.

@rxin and @dongjoon-hyun Thanks for your review.
I have tried not to use varargs, but a separate constructor that accepts two args does not help, as there isn't a magic key that would let us treat parse_url(url, part, magicKey) the same as parse_url(url, part).
```scala
def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
  if (url == null || partToExtract == null) {
    null
  } else {
```
Is this optimization mainly for when the url is literal?

Yes. When the url column has many identical values.

You can follow XPathBoolean to optimize for the literal case.

Though we branch on the url string, the main purpose is to cache the URL object.
As we must handle the exceptions caused by invalid urls, the approach of XPathBoolean does not seem suitable.
```scala
  }
}
```
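The caching point above hinges on a URL-building helper that swallows malformed-url errors. A minimal sketch of such a helper, assuming it lives inside ParseUrl as in this PR (the exact body is my guess, not the PR's code):

```scala
import java.net.{MalformedURLException, URL}

import org.apache.spark.unsafe.types.UTF8String

// Invalid urls must yield null per row rather than fail the whole query,
// so the constructor call is wrapped in a try/catch. This exception handling
// is why the XPathBoolean-style optimization does not transfer directly.
def getUrl(url: UTF8String): URL = {
  try {
    new URL(url.toString)
  } catch {
    case _: MalformedURLException => null
  }
}
```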
```scala
def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
```

Could you make this private?
cc @cloud-fan @rxin @liancheng
```scala
      'query=1'
    > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
      '1'""")
case class ParseUrl(children: Seq[Expression])
```
Again, we should not use Seq[Expression] here. We should just have a 3-arg constructor, and then add a 2-arg constructor.

Then we should think of a good default value for the 3rd argument. We should avoid using null, as we assume in a lot of places that the children of an expression won't be null. How about using the empty string as the default value for key?

As I explained before, I can hardly find a magic key that would let us treat parse_url(url, part, magicKey) as parse_url(url, part). I have doubts about the empty string, e.g.:

```
hive> select parse_url("http://spark/path?=1", "QUERY", "");
1
hive> select parse_url("http://spark/path?=1", "QUERY");
=1
```

Any suggestion on this?

Well, I don't have a strong preference here; Seq[Expression] doesn't look so bad to me. @rxin what do you think?

What if we use # as the default value and check for that? It is not a valid URL key, is it?

Anyway, I don't have a super strong preference here either. It might be clearer to not use a hacky # value.

Yes, # is not a valid URL key. And I agree with you on not using a hacky value.
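To make the collision concrete, here is a small self-contained sketch (illustrative names only, not the PR's code) showing that the empty string is itself a legal, distinguishable query key, so it cannot double as a "no key given" default:

```scala
// For "http://spark/path?=1" the query string is "=1": key "" with value "1".
// Defaulting key to "" would thus return "1" where the 2-arg form returns "=1".
def extractQueryParam(query: String, key: String): Option[String] =
  query.split("&").collectFirst {
    case kv if kv.takeWhile(_ != '=') == key => kv.dropWhile(_ != '=').drop(1)
  }

println(extractQueryParam("=1", ""))  // Some(1)
```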
```scala
    > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
      '1'""")
case class ParseUrl(children: Seq[Expression])
  extends Expression with ImplicitCastInputTypes with CodegenFallback {
```
Here -- I don't think it makes a lot of sense to use ImplicitCastInputTypes here, since we are talking about urls. Why don't we just use ExpectsInputTypes?

I am trying to make Spark's behavior mostly like Hive's. As Hive does an implicit cast for key, e.g.:

```
hive> select parse_url("http://spark/path?1=v", "QUERY", 1);
v
```

Should we keep the same behavior in Spark?

I think it's OK in this case to not follow. This function is so esoteric that I doubt people will complain. If they do, we can always add the implicit casting later.

OK, I'll use ExpectsInputTypes.

Actually let's just keep it. Might as well, since the code is already written.

Well, I had missed this comment and already finished the change...
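For context, the practical difference between the two Catalyst traits under discussion, sketched as the declaration a string-only expression would carry (a fragment, and an assumption about this PR's exact code):

```scala
// ExpectsInputTypes only checks that each child already has the declared type
// and fails analysis otherwise; ImplicitCastInputTypes (its subtrait) also lets
// the analyzer insert casts, so an integer key such as 1 is coerced to '1'.
override def inputTypes: Seq[AbstractDataType] = Seq.fill(children.size)(StringType)
```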
cc @rxin @cloud-fan Thank you for the review.
```scala
// If the url is a constant, cache the URL object so that we don't need to convert url
// from UTF8String to String to URL for every row.
@transient private lazy val cachedUrl = children(0) match {
  case Literal(url: UTF8String, _) => if (url ne null) getUrl(url) else null
```
It can be `case Literal(url: UTF8String, _) if url != null => getUrl(url)`.

Oh yes, that's simpler.
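Applied to the snippet above, the whole cache initializer would read (the fall-through arm is assumed from context; the diff does not show it):

```scala
@transient private lazy val cachedUrl = children(0) match {
  case Literal(url: UTF8String, _) if url != null => getUrl(url)
  case _ => null
}
```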
LGTM except one minor comment, thanks for working on it!

Conflicts:
  sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
  sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
  sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
cc @cloud-fan Thank you.

retest this please

Test build #61983 has finished for PR 14008 at commit

It seems it failed the

retest this please

Test build #61987 has finished for PR 14008 at commit

Test build #3173 has finished for PR 14008 at commit

Thanks - merging in master/2.0.
## What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove the Hive fallback. A new implementation of #13999.

## How was this patch tested?

Pass the existing tests, including new test cases.

Author: wujian <jan.chou.wu@gmail.com>

Closes #14008 from janplus/SPARK-16281.

(cherry picked from commit f5fef69)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Thanks @rxin @dongjoon-hyun @cloud-fan @liancheng

Congratulations on your first commit, @janplus!
What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove the Hive fallback.
A new implementation of #13999.

How was this patch tested?

Pass the existing tests, including new test cases.
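A hypothetical end-to-end check of the new function (a sketch: `spark` is a local SparkSession created here just for illustration; the expected results are taken from the examples in the expression's documentation above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("parse_url-demo").getOrCreate()

// QUERY without a key returns the whole query string.
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY')").show()
// 'query=1'

// QUERY with a key returns just that key's value.
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query')").show()
// '1'
```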