-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support function urlparse #20
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nctl144 based on the testing I did, this looks correct. Do we have tests that check passing scheme?
urlparse4/cgurl.pyx
Outdated
@@ -221,14 +223,14 @@ class SplitResultNamedTuple(tuple): | |||
return stdlib_urlunsplit(self) | |||
|
|||
|
|||
def urlsplit(url): | |||
def urlsplit(url, scheme='', allow_fragments=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is allow_fragments
used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean is it used in Scrapy? I included it because it is used in function urlparse (https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L373). I don't really know if we should include it in this project since allow_fragments' only use is combining fragments with query.
urlparse4/cgurl.pyx
Outdated
@@ -245,3 +247,18 @@ def urljoin(base, url, allow_fragments=True): | |||
return joined_url | |||
|
|||
return stdlib_urljoin(base, url, allow_fragments=allow_fragments) | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean to uncomment it? or add in anotheer PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will work on implementing this function now, Konstantin!
Hey @lopuhin , we do have urlsplit test with scheme passed https://github.com/nctl144/urlparse4/blob/master/tests/test_urlparse.py#L764 |
Hey Konstantin, after checking the performance test, I got the performance of urlparse is 0.19 sec (if it does not have to decode the result) and 0.22sec (if it has to decode the result), compared to 0.13 and 0.15 sec of urlsplit. There might still be a way to further optimize this function 😄 |
I will work on the failed tests now 😄 particularly creating a tuple class for this function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, please have a look at using literal byte strings, I didn't put comment on every instance.
urlparse4/cgurl.pyx
Outdated
|
||
cimport cython | ||
|
||
|
||
uses_params = [scheme.encode('utf-8') for scheme in ['', 'ftp', 'hdl', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be written as [b'', b'ftp', ...
, without the need to call encoding, I think it will be much more clear
ParseStandardURL(url, len(url), parsed) | ||
elif CompareSchemeComponent(url, url_scheme, kMailToScheme): | ||
""" | ||
Discuss: Is this correct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic looks good to me, kMailToScheme -> ParseMailtoURL, what would you like to discuss?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will create an issue to discuss about this. I believe there's a test that urlparse4 did not pass because of this 😄
urlparse4/cgurl.pyx
Outdated
""" | ||
this function can be modified to enhance the performance? | ||
""" | ||
slash, semcol = '/'.encode('utf-8'), ';'.encode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'/'.encode('utf-8')
can be replaced with b'/'
, and same below
Hey @lopuhin , as I am working on adding more funcs to the project as mentioned in #18, I realized to support urlparse, we will need to add some options to the current urlsplit function, particularly option scheme and allow_fragments.
Currently, I know that if the input does not have scheme and the scheme option is specified, then the output will contain the input scheme. That's why I created this PR. Do you think that is correct? Please let me know 😄