-
Notifications
You must be signed in to change notification settings - Fork 518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pyyaml does not support literals in unicode over codepoint 0xffff #25
Comments
Any updates on this? It's a significant limitation for loading Unicode YAML files (since all emoji are > 0xFFFF and commonly used now) |
+1, had a crash on encountering an emoji someone wrote in a YAML config file |
Going to add my voice to this as wanting to see this addressed as well. |
All,
Okay. I want to write the code to address this defect. I guess I need to
write unit tests as well. Could anyone give me some heads up on how to get
Travis working?
Thanks and best regards,
Peter
…On Sat, May 6, 2017 at 9:18 PM, Samantha Marshall ***@***.***> wrote:
Going to add my voice to this as wanting to see this addressed as well.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA-AtWWVYC7oUON5AgNvhvwRUkfl_4khks5r3FcagaJpZM4HGN9Z>
.
--
Email: [email protected]
WWW: http://www.pkmurphy.com.au/
|
@peterkmurphy is the patch in the original post (abit poorly formatted for display on github) not sufficient for submitting a PR? |
Samantha,
It looks like the code has been done for me! Well, that's good. But I want
to do some testing.
What I've done is make a fork of the code base at:
https://github.com/peterkmurphy/pyyaml
I seem to have got Travis working as well (to my surprise - it was as easy
as pie!). So if the code works with all the environments in store (Python
2.6, 2.7, 3.3 to 3.5 and PyPy), then we can say that the patch should be
good to go. Give me a day or so.
I hope that helps.
Best regards,
Peter
…On Sat, May 6, 2017 at 10:09 PM, Samantha Marshall ***@***.*** > wrote:
@peterkmurphy <https://github.com/peterkmurphy> is the patch in the
original post (abit poorly formatted for display on github) not sufficient
for submitting a PR?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA-AtWgxRC9WswNchXw4PA3EzB47haBUks5r3GMLgaJpZM4HGN9Z>
.
--
Email: [email protected]
WWW: http://www.pkmurphy.com.au/
|
All,
A heads up for everybody. According to Travis, the code passes the
_existing_ tests (all 1264 of them) in Python 2.x, 3.x and PyPy. But since
none of the tests were written with extended Unicode in mind, I think I
better write some, shouldn't I?
That shouldn't stop people trying my fork with "interesting" text - one
containing emojis or Cuneiform or Tengwar. But I wouldn't consider it
completely tested yet.
Best regards,
Peter
…On Sun, May 7, 2017 at 9:05 AM, Peter Murphy ***@***.***> wrote:
Samantha,
It looks like the code has been done for me! Well, that's good. But I want
to do some testing.
What I've done is make a fork of the code base at:
https://github.com/peterkmurphy/pyyaml
I seem to have got Travis working as well (to my surprise - it was as easy
as pie!). So if the code works with all the environments in store (Python
2.6, 2.7, 3.3 to 3.5 and PyPy), then we can say that the patch should be
good to go. Give me a day or so.
I hope that helps.
Best regards,
Peter
On Sat, May 6, 2017 at 10:09 PM, Samantha Marshall <
***@***.***> wrote:
> @peterkmurphy <https://github.com/peterkmurphy> is the patch in the
> original post (abit poorly formatted for display on github) not sufficient
> for submitting a PR?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#25 (comment)>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AA-AtWgxRC9WswNchXw4PA3EzB47haBUks5r3GMLgaJpZM4HGN9Z>
> .
>
--
Email: ***@***.***
WWW: http://www.pkmurphy.com.au/
--
Email: [email protected]
WWW: http://www.pkmurphy.com.au/
|
@peterkmurphy yes more tests! |
|
All,
A heads up on the issue. The patch seems to work with some tests I've run.
For example, if I have the following file:
😀😁😂😃😄😅😆😇
😈😉😊😋😌😍😎😏
😐😑😒😓😔😕😖😗
😘😙😚😛😜😝😞😟
😠😡😢😣😤😥😦😧
😨😩😪😫😬😭😮😯
😰😱😲😳😴😵😶😷
😸😹😺😻😼😽😾😿
🙀🙁🙂🙃🙄🙅🙆🙇
🙈🙉🙊🙋🙌🙍🙎🙏
Then I can use PyYAML to load it and unload it.
The problem I am having is with the testing code, which will need a lot of
unwrangling. There are some frankly ... "bizarre" assumptions in what won't
work in parseable code, except it sometimes will be parseable by accident.
For example:
def test_unicode_input_errors(unicode_filename, verbose=False):
data = open(unicode_filename, 'rb').read().decode('utf-8')
for input in [data.encode('latin1', 'ignore'), # <--- Look at this!
data.encode('utf-16-be'), data.encode('utf-16-le'),
codecs.BOM_UTF8+data.encode('utf-16-be'),
codecs.BOM_UTF16_BE+data.encode('utf-16-le'),
codecs.BOM_UTF16_LE+data.encode('utf-8')+'!']:
try:
yaml.load(input)
except yaml.YAMLError, exc:
if verbose:
print exc
else:
raise AssertionError("expected an exception")
The idea: let's cause some bizarre combinations of byte sequences, attempt
to parse it, and if it _doesn't_ throw a YAMLError, raise an exception.
Except that when one does data.encode('latin1', 'ignore') on data, one
results in ten line breaks, which is happily parseable as YAML. So no
exception raised, so AssertionError.
What should I do in this case - remove test_unicode_input_errors from the
PyYaml testing code? Yes, the number of tests will go down, which is
generally not a good thing, but if the tests are based on dodgy
assumptions...
Peter
…On Mon, May 8, 2017 at 5:12 PM, Adam Johnson ***@***.***> wrote:
yaml.load('👍')
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA-AtenATgNrySHLiA_wZsO-OLVgeMKeks5r3sBNgaJpZM4HGN9Z>
.
--
Email: [email protected]
WWW: http://www.pkmurphy.com.au/
|
Can you open a PR so it's easier to review? |
Adam:
I don't think a patch request is a good idea at this stage, as there are
three errors when you run the tests. If you wish, you can check out the
latest at
https://github.com/peterkmurphy/pyyaml and run python setup.py test to see
the errors yourself.
Best regards,
Peter
…On Tue, May 9, 2017 at 11:16 PM, Adam Johnson ***@***.***> wrote:
Can you open a PR so it's easier to review?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA-AtY0n5J5j3ch4KK8lNIrLHVoFCk2Fks5r4GcxgaJpZM4HGN9Z>
.
--
Email: [email protected]
WWW: http://www.pkmurphy.com.au/
|
With a PR, we can comment on the code directly. Btw I'm in favour of removing the tests, I think you're fixing the behaviour. |
Adam,
Thank you. I'll decide what to do tomorrow.
Best regards,
Peter
…On Tue, May 9, 2017 at 11:46 PM, Adam Johnson ***@***.***> wrote:
With a PR, we can comment on the code directly. Btw I'm in favour of
removing the tests, I think you're fixing the behaviour.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA-AtVX-zkKDfGoU3-jxZAMPv5bTU4goks5r4G5HgaJpZM4HGN9Z>
.
--
Email: [email protected]
WWW: http://www.pkmurphy.com.au/
|
Adam (and others),
A pull request has been created.
#63
Best regards,
Peter
On Tue, May 9, 2017 at 11:54 PM, Peter Murphy <[email protected]>
wrote:
… Adam,
Thank you. I'll decide what to do tomorrow.
Best regards,
Peter
On Tue, May 9, 2017 at 11:46 PM, Adam Johnson ***@***.***>
wrote:
> With a PR, we can comment on the code directly. Btw I'm in favour of
> removing the tests, I think you're fixing the behaviour.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#25 (comment)>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AA-AtVX-zkKDfGoU3-jxZAMPv5bTU4goks5r4G5HgaJpZM4HGN9Z>
> .
>
--
Email: ***@***.***
WWW: http://www.pkmurphy.com.au/
--
Email: [email protected]
WWW: http://www.pkmurphy.com.au/
|
So, is this fix getting pulled in? |
Just thought I'd share, in case others are in the same boat: After patching/working around this for a while, we realized https://pypi.python.org/pypi/ruamel.yaml handles higher Unicodes just fine, and it's worked out well for us. |
this was merged in #63 but still hasn't been released :( |
This took me hours to find the source of our server to server communication issue. What is the reason to dump a unicode control char (unicode: This is just ridiculous. A sometimes one-way-only-converter. Edit: Whats the reason i see unicode in yaml instead of utf-8 (encoding='utf-8) ? This looks also not correct to me. |
Fixed by #63 |
See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=806826
the yaml spec says that
“The allowed character range explicitly excludes the surrogate
block #xD800-#xDFFF, DEL #x7F, the C0 control block #x0-#x1F
(except for #x9, #xA, and #xD), the C1 control block #x80-#x9F,
#xFFFE, and #xFFFF.”
however pyyaml has chosen to negate that check and apply it to only
plane 0. This means that any yaml document that contains unicode
literals in higher planes will fail to parse (and, on output, use the
rather unfriendly \Uxxxxxxxx format).
The attached patch fixes this in a minimally intrusive way, by
extending the checks to cover the additional codepoints where
appropriate. A better fix would be to use the check as the spec
specifies it, but that would be a bigger change.
Index: pyyaml-3.11/lib/yaml/emitter.py
--- pyyaml-3.11.orig/lib/yaml/emitter.py
+++ pyyaml-3.11/lib/yaml/emitter.py
@@ -8,9 +8,13 @@
all = ['Emitter', 'EmitterError']
+import sys
+
from error import YAMLError
from events import *
+has_ucs4 = sys.maxunicode > 0xffff
+
class EmitterError(YAMLError):
pass
@@ -701,7 +705,8 @@ class Emitter(object):
line_breaks = True
if not (ch == u'\n' or u'\x20' <= ch <= u'\x7E'):
if (ch == u'\x85' or u'\xA0' <= ch <= u'\uD7FF'
Index: pyyaml-3.11/lib/yaml/reader.py
--- pyyaml-3.11.orig/lib/yaml/reader.py
+++ pyyaml-3.11/lib/yaml/reader.py
@@ -19,7 +19,9 @@ all = ['Reader', 'ReaderError']
from error import YAMLError, Mark
-import codecs, re
+import codecs, re, sys
+
+has_ucs4 = sys.maxunicode > 0xffff
class ReaderError(YAMLError):
@@ -134,7 +136,10 @@ class Reader(object):
self.encoding = 'utf-8'
self.update(1)
NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
if has_ucs4:
else:
def check_printable(self, data):
match = self.NON_PRINTABLE.search(data)
if match:
Index: pyyaml-3.11/lib3/yaml/emitter.py
--- pyyaml-3.11.orig/lib3/yaml/emitter.py
+++ pyyaml-3.11/lib3/yaml/emitter.py
@@ -698,7 +698,8 @@ class Emitter:
line_breaks = True
if not (ch == '\n' or '\x20' <= ch <= '\x7E'):
if (ch == '\x85' or '\xA0' <= ch <= '\uD7FF'
Index: pyyaml-3.11/lib3/yaml/reader.py
--- pyyaml-3.11.orig/lib3/yaml/reader.py
+++ pyyaml-3.11/lib3/yaml/reader.py
@@ -134,7 +134,7 @@ class Reader(object):
self.encoding = 'utf-8'
self.update(1)
NON_PRINTABLE = re.compile('[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
NON_PRINTABLE = re.compile('[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]')
def check_printable(self, data):
match = self.NON_PRINTABLE.search(data)
if match:
The text was updated successfully, but these errors were encountered: