Unicode bug #2

Open
VladimirAlexiev opened this issue Dec 2, 2016 · 9 comments
@VladimirAlexiev
Contributor

VladimirAlexiev commented Dec 2, 2016

"µg" in the XML is converted to \265g (the octal byte \265 followed by "g"), which is invalid UTF-8 and causes this RIOT exception:

  org.apache.jena.atlas.RuntimeIOException: java.nio.charset.MalformedInputException: Input length = 1
        at org.apache.jena.atlas.io.IO.exception(IO.java:216)
        at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:77)

"µ" (U+00B5, micro sign) in UTF-8 is 2 bytes: \302\265. Somehow the XSPARQL conversion loses the first byte:

>head -16 NCT01047553.xml |tail -1|od -c
0000120   y       D   o   s   e       o   f       1   8     302 265   g
0000140       (   9     302 265   g       T   w   i   c   e       D   a
0000160   i   l   y   )       i   n       J   a   p   a   n   e   s   e

>head -14 NCT01047553.ttl |tail -1|od -c
0000200   r   m   o   t   e   r   o   l       i   n       a       D   a
0000220   i   l   y       D   o   s   e       o   f       1   8     265
0000240   g       (   9     265   g       T   w   i   c   e       D   a
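
The byte patterns above can be reproduced outside XSPARQL. The following is a minimal sketch, not XSPARQL code; the class and method names (`MicroSignBytes`, `octal`, `decodeUtf8Strict`) are made up for illustration. It shows that U+00B5 encodes to \302\265 in UTF-8 but to the single byte \265 in ISO-8859-1, and that a strict UTF-8 decoder rejects a lone \265. That would be consistent with the output being written in a single-byte charset, although nothing in this thread confirms the exact code path.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of XSPARQL.
public class MicroSignBytes {

    // Render bytes in the same octal notation that `od -c` uses for non-ASCII bytes.
    public static String octal(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append('\\').append(Integer.toOctalString(b & 0xFF));
        }
        return sb.toString();
    }

    // Decode bytes as UTF-8, reporting malformed input instead of replacing it.
    // Returns null when the bytes are not valid UTF-8.
    public static String decodeUtf8Strict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // A MalformedInputException (a CharacterCodingException) lands here.
            return null;
        }
    }

    public static void main(String[] args) {
        String microGram = "\u00B5g"; // micro sign followed by "g"

        // prints \302\265\147
        System.out.println(octal(microGram.getBytes(StandardCharsets.UTF_8)));
        // prints \265\147
        System.out.println(octal(microGram.getBytes(StandardCharsets.ISO_8859_1)));

        // The Latin-1 bytes are not valid UTF-8, so strict decoding fails: prints null
        System.out.println(decodeUtf8Strict(microGram.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The strict decoder fails on the first byte because \265 is a UTF-8 continuation byte with no lead byte before it.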
@VladimirAlexiev
Contributor Author

This is a blocker for any use of XSPARQL over Unicode data. I suspect the old built-in Saxon is to blame, so we'll try to retarget it to use BaseX.
Can anyone suggest any better ideas?

@VladimirAlexiev
Contributor Author

A similar thing happens for the Unicode char in Oxis Turbuhaler®:

>head -293 NCT01047553.xml|tail -1|od -c
0000000                   <   o   t   h   e   r   _   n   a   m   e   >
0000020   O   x   i   s       T   u   r   b   u   h   a   l   e   r 302
0000040 256   <   /   o   t   h   e   r   _   n   a   m   e   >  \r  \n
0000060

>head -306 NCT01047553.ttl|tail -1|od -c
0000000   <   h   t   t   p   :   /   /   l   i   n   k   e   d   l   i
0000020   f   e   d   a   t   a   .   c   o   m   /   r   e   s   o   u
0000040   r   c   e   /   c   l   i   n   i   c   a   l   t   r   i   a
0000060   l   s   /   N   C   T   0   1   0   4   7   5   5   3   /   i
0000100   n   t   e   r   v   e   n   t   i   o   n   /   1   >       s
0000120   k   o   s   :   a   l   t   L   a   b   e   l           "   O
0000140   x   i   s       T   u   r   b   u   h   a   l   e   r 256   "
0000160       .  \r  \n

@VladimirAlexiev
Contributor Author

> retarget it to use BaseX

This won't be easy, so we'll first try the most recent SaxonHE: XSPARQL uses 9.4, which dates from Mar 2012; the current one is 9.7.0-14.

@boyan-simeonov

I tried to build XSPARQL with the latest SaxonHE 9.7.0-14, but the API changed after version 9.5, so the project's code will need many changes. For example, after 9.5 ExtensionFunctionCall uses Sequence rather than SequenceIterator.

@VladimirAlexiev
Contributor Author

We tried a simple XQuery on the XML in question, with both the SaxonHE versions listed above: there's no Unicode problem.
So it's confirmed the problem is in the XSPARQL part.

@VladimirAlexiev
Contributor Author

The problem is fixed if you run it with the appropriate java option:

java -Dfile.encoding=UTF-8 -jar c:/prog/xsparql/xsparql-cli-jar-with-dependencies.jar $@

But the input file starts with

<?xml version="1.0" encoding="UTF-8"?>

So this is still a valid bug

@zacharywhitley
Contributor

Updated to saxon 9.9.0-2 8aff57e

@zacharywhitley
Contributor

Kanji test cases are failing, probably for the same reason.

@VladimirAlexiev
Contributor Author

I still experience Unicode problems if xsparql writes to STDOUT.
There are no problems when used with -f file.ttl to write to a file: #45
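
One possible explanation, not confirmed in this thread, is that the STDOUT path encodes text with the JVM's default charset, which `-Dfile.encoding=UTF-8` overrides. As a minimal sketch under that assumption (this is not XSPARQL's actual code; `Utf8Stdout` and `utf8Stream` are hypothetical names), a program can wrap its output stream in a `PrintStream` with an explicit UTF-8 encoding so the result does not depend on the platform default:

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper for illustration only; not part of XSPARQL.
public class Utf8Stdout {

    // Wrap any output stream so that characters are always encoded as UTF-8,
    // regardless of the JVM's file.encoding setting.
    public static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // The registered sign U+00AE is written as the two bytes \302\256.
        out.println("Oxis Turbuhaler\u00AE");
    }
}
```

Piping the output through `od -c`, as in the earlier comments, is one way to check that both bytes arrive.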
