Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize memory consumption of idnadata, take 3 #24

Merged
merged 5 commits into from
Mar 15, 2016

Conversation

slingamn
Copy link
Contributor

@slingamn slingamn commented Mar 1, 2016

I'm sorry I keep changing my mind here. I think I finally have an approach with no significant drawbacks:

  1. As in optimize memory consumption of idnadata, take 2 #23, stop storing the DISALLOWED codepoints, as well as any scripts that aren't special-cased by RFC 5892.
  2. As in the master branch, take advantage of the fact that the script and PVALID codepoint lists have long runs in them --- but instead of expanding those runs into a flat hash-set at runtime, leave them as a sequence of (start, end) tuples. These tuples can be binary-searched for integer membership with the bisect module. The binary search is faster than the one in optimize memory consumption of idnadata, take 2 #23 because it costs O(log(# runs)), not O(log(# codepoints)).

Memory consumption is down under 2 MB, the naive benchmark is not significantly affected (the increase is less than 10%, with overall time still dominated by punycode), and there's no need to package a large opaque binary datafile.

Caveats about the Unicode version still apply (again, the idnadata rebuild is an isolated commit and can be rebased out).

@slingamn
Copy link
Contributor Author

slingamn commented Mar 9, 2016

Let me know if I can clarify, document, or change anything :-)

@kjd
Copy link
Owner

kjd commented Mar 9, 2016

Thanks! I am at the ICANN meeting in Marrakech this week so I won't have an opportunity to work on this until next week. I hope then I can push out a new release with this included.

kjd added a commit that referenced this pull request Mar 15, 2016
Optimizations to reduce memory footprint
@kjd kjd merged commit 0813933 into kjd:master Mar 15, 2016
@slingamn
Copy link
Contributor Author

Thanks very much!

Just wanted to confirm: the outcome of the discussion was to move to Unicode version 8.0?

@kjd
Copy link
Owner

kjd commented Mar 15, 2016

That issue is not resolved, but possible moot with respect to these particular patches, as the classification of code points into scripts is theoretically purely additive, and this won't add any additional PVALID characters which are still derived from 6.3. I'll confirm this assumption before pushing a new version to PyPI.

kjd added a commit that referenced this pull request Mar 20, 2016
Optimizations to reduce memory footprint
@kjd
Copy link
Owner

kjd commented Mar 20, 2016

v2.1 has been pushed that contains this. I made some minor changes that relate to how the data is encapsulated so it is easier to diff and debug, but the functionality is the same.

Thanks again!

@slingamn
Copy link
Contributor Author

Thanks!

The wheel uploaded here is for Python 2 only --- should it be for both 2 and 3? https://pypi.python.org/pypi/idna/2.1

@kjd
Copy link
Owner

kjd commented Mar 21, 2016

There should be a universal one there now. I am new at using wheels....

@slingamn
Copy link
Contributor Author

Thanks, that worked. You might want to take down the 2-only wheel? Not critical, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants