Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support function urlparse #20

Merged
merged 54 commits into from
Jun 26, 2018
Merged

Support function urlparse #20

merged 54 commits into from
Jun 26, 2018

Conversation

malloxpb
Copy link
Member

@malloxpb malloxpb commented Jun 15, 2018

Hey @lopuhin , as I am working on adding more funcs to the project as mentioned in #18, I realized to support urlparse, we will need to add some options to the current urlsplit function, particularly option scheme and allow_fragments.

Currently, I know that if the input does not have scheme and the scheme option is specified, then the output will contain the input scheme. That's why I created this PR. Do you think that is correct? Please let me know 😄

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nctl144 based on the testing I did, this looks correct. Do we have tests that check passing scheme?

@@ -221,14 +223,14 @@ class SplitResultNamedTuple(tuple):
return stdlib_urlunsplit(self)


def urlsplit(url):
def urlsplit(url, scheme='', allow_fragments=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is allow_fragments used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean is it used in Scrapy? I included it because it is used in function urlparse (https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L373). I don't really know if we should include it in this project since allow_fragments' only use is combining fragments with query.

@@ -245,3 +247,18 @@ def urljoin(base, url, allow_fragments=True):
return joined_url

return stdlib_urljoin(base, url, allow_fragments=allow_fragments)
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean to uncomment it? or add in anotheer PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will work on implementing this function now, Konstantin!

@malloxpb
Copy link
Member Author

@malloxpb
Copy link
Member Author

malloxpb commented Jun 18, 2018

Hey Konstantin, after checking the performance test, I got the performance of urlparse is 0.19 sec (if it does not have to decode the result) and 0.22sec (if it has to decode the result), compared to 0.13 and 0.15 sec of urlsplit. There might still be a way to further optimize this function 😄

@malloxpb
Copy link
Member Author

malloxpb commented Jun 18, 2018

I will work on the failed tests now 😄 particularly creating a tuple class for this function

@malloxpb malloxpb closed this Jun 25, 2018
@malloxpb malloxpb reopened this Jun 26, 2018
Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, please have a look at using literal byte strings, I didn't put comment on every instance.


cimport cython


uses_params = [scheme.encode('utf-8') for scheme in ['', 'ftp', 'hdl',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be written as [b'', b'ftp', ..., without the need to call encoding, I think it will be much more clear

ParseStandardURL(url, len(url), parsed)
elif CompareSchemeComponent(url, url_scheme, kMailToScheme):
"""
Discuss: Is this correct?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic looks good to me, kMailToScheme -> ParseMailtoURL, what would you like to discuss?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create an issue to discuss about this. I believe there's a test that urlparse4 did not pass because of this 😄

"""
this function can be modified to enhance the performance?
"""
slash, semcol = '/'.encode('utf-8'), ';'.encode('utf-8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'/'.encode('utf-8') can be replaced with b'/', and same below

@malloxpb malloxpb merged commit d3e8eee into master Jun 26, 2018
@malloxpb malloxpb deleted the urlparse branch June 26, 2018 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants