Unicode bug #2

VladimirAlexiev · 2016-12-02T12:38:00Z

"µg" in the XML is converted to \265g (a lone byte with octal code \265, followed by "g"), which is not valid UTF-8 and causes this RIOT exception:

  org.apache.jena.atlas.RuntimeIOException: java.nio.charset.MalformedInputException: Input length = 1
        at org.apache.jena.atlas.io.IO.exception(IO.java:216)
        at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:77)

"µ" in UTF-8 is 2 bytes: \302\265. Somehow the XSPARQL conversion loses the first byte:

>head -16 NCT01047553.xml |tail -1|od -c
0000120   y       D   o   s   e       o   f       1   8     302 265   g
0000140       (   9     302 265   g       T   w   i   c   e       D   a
0000160   i   l   y   )       i   n       J   a   p   a   n   e   s   e

>head -14 NCT01047553.ttl |tail -1|od -c
0000200   r   m   o   t   e   r   o   l       i   n       a       D   a
0000220   i   l   y       D   o   s   e       o   f       1   8     265
0000240   g       (   9     265   g       T   w   i   c   e       D   a
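
The byte pattern in the two dumps can be reproduced outside XSPARQL. The sketch below is only an illustration, assuming the character is the micro sign U+00B5 (which matches the \302\265 in the XML dump). It shows that UTF-8 encodes it as \302\265 while ISO-8859-1 encodes it as the single byte \265, and that a strict UTF-8 decoder rejects a lone \265 with the same kind of MalformedInputException as in the stack trace. The class name `MicroSignBytes` and its helpers are hypothetical and not part of XSPARQL or Jena.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of XSPARQL or Jena.
final class MicroSignBytes {

    // Render every byte as a backslash plus 3-digit octal, similar to how
    // `od -c` prints non-ASCII bytes.
    static String octal(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\%03o", b & 0xFF));
        }
        return sb.toString();
    }

    // Strict UTF-8 decoding: report malformed input instead of silently
    // replacing it with U+FFFD.
    static String decodeStrictUtf8(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        String micro = "\u00B5"; // MICRO SIGN (assumed to be the character in the XML)
        System.out.println(octal(micro.getBytes(StandardCharsets.UTF_8)));
        System.out.println(octal(micro.getBytes(StandardCharsets.ISO_8859_1)));
        try {
            // The bytes found in the .ttl file: a lone 0xB5 followed by "g".
            decodeStrictUtf8(new byte[] {(byte) 0xB5, 'g'});
        } catch (CharacterCodingException e) {
            System.out.println(e);
        }
    }
}
```

If this reading is right, the first byte is not so much "lost" as the text is being written in a single-byte encoding and later read back as UTF-8.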

VladimirAlexiev · 2016-12-02T12:38:59Z

This is a blocker for any use of XSPARQL over Unicode data. I suspect the bundled old Saxon is to blame, and we'll try to retarget it to use BaseX.
Can anyone suggest any better ideas?

VladimirAlexiev · 2016-12-02T13:05:13Z

A similar thing happens for the Unicode char in Oxis Turbuhaler®:

>head -293 NCT01047553.xml|tail -1|od -c
0000000                   <   o   t   h   e   r   _   n   a   m   e   >
0000020   O   x   i   s       T   u   r   b   u   h   a   l   e   r 302
0000040 256   <   /   o   t   h   e   r   _   n   a   m   e   >  \r  \n
0000060

>head -306 NCT01047553.ttl|tail -1|od -c
0000000   <   h   t   t   p   :   /   /   l   i   n   k   e   d   l   i
0000020   f   e   d   a   t   a   .   c   o   m   /   r   e   s   o   u
0000040   r   c   e   /   c   l   i   n   i   c   a   l   t   r   i   a
0000060   l   s   /   N   C   T   0   1   0   4   7   5   5   3   /   i
0000100   n   t   e   r   v   e   n   t   i   o   n   /   1   >       s
0000120   k   o   s   :   a   l   t   L   a   b   e   l           "   O
0000140   x   i   s       T   u   r   b   u   h   a   l   e   r 256   "
0000160       .  \r  \n

VladimirAlexiev · 2017-01-05T12:29:07Z

Re "retarget it to use BaseX": this won't be easy. So we'll first try upgrading to the most recent SaxonHE: XSPARQL currently uses 9.4, which is from March 2012; the current release is 9.7.0-14.

boyan-simeonov · 2017-01-06T13:50:37Z

I tried to build XSPARQL with the latest SaxonHE 9.7.0-14, but the Saxon API changed after version 9.5, so the project code would need many changes. For example, after 9.5 ExtensionFunctionCall uses Sequence rather than SequenceIterator.

VladimirAlexiev · 2017-01-06T13:50:47Z

We tried a simple XQuery on the XML in question, with both the SaxonHE versions listed above: there's no Unicode problem.
So it's confirmed the problem is in the XSPARQL part.

VladimirAlexiev · 2017-01-06T14:32:33Z

The problem is fixed if you run it with the appropriate java option:

java -Dfile.encoding=UTF-8 -jar c:/prog/xsparql/xsparql-cli-jar-with-dependencies.jar "$@"

But the input file starts with

<?xml version="1.0" encoding="UTF-8"?>

So this is still a valid bug: the input declares its encoding, and the result should not depend on the JVM's default encoding.
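
The fact that -Dfile.encoding=UTF-8 helps suggests that some code path creates a reader or writer without naming a charset, in which case Java falls back to the JVM default. Where exactly that happens in XSPARQL is not established in this thread, so the sketch below only illustrates the general remedy: pass the charset explicitly. The class `ExplicitCharsetWriter` and its `write` helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the XSPARQL code base.
final class ExplicitCharsetWriter {

    // Serialize text with an explicitly chosen charset, so the result does
    // not depend on -Dfile.encoding or the platform default.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String label = "18 \u00B5g";
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 bytes:   " + write(label, StandardCharsets.UTF_8).length);
        System.out.println("Latin-1 bytes: " + write(label, StandardCharsets.ISO_8859_1).length);
    }
}
```

A writer created with `new OutputStreamWriter(out)` and no charset argument would instead produce whichever of these byte sequences matches the JVM default, which is why the command-line option changes the outcome.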

zacharywhitley · 2019-02-04T15:49:37Z

Updated to Saxon 9.9.0-2 in 8aff57e.

zacharywhitley · 2019-02-04T15:49:57Z

Kanji test cases are failing, probably for the same reason.

VladimirAlexiev · 2024-01-12T08:40:25Z

I still experience Unicode problems if xsparql writes to STDOUT.
There are no problems when used with -f file.ttl to write to a file: #45
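
One possible explanation for the STDOUT-only failure is that `System.out` is a `PrintStream` whose encoding is chosen by the JVM from platform or console settings, not by the program. A minimal sketch of a workaround, assuming the output is produced through a `PrintStream`, is to wrap the underlying stream with an explicit UTF-8 encoding. The class `Utf8Stdout` is hypothetical, and whether XSPARQL writes to STDOUT this way has not been checked.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; not taken from the XSPARQL code base.
final class Utf8Stdout {

    // Wrap any byte stream (for example System.out) in a PrintStream that
    // always encodes characters as UTF-8, with autoflush enabled.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("Oxis Turbuhaler\u00AE");
    }
}
```

This only fixes the bytes that are written. A terminal that is not set to UTF-8 may still display them incorrectly, but piping the output through `od -c` or redirecting it to a file should then show \302\256 for the ® sign.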

VladimirAlexiev mentioned this issue Dec 6, 2016

rewriter error messages are useless #9

Open

zacharywhitley added the bug label Feb 4, 2019

VladimirAlexiev mentioned this issue Jan 8, 2024

How to capture piece of XML to a literal as is? #38

Closed

VladimirAlexiev mentioned this issue Jan 12, 2024

Future of xsparql #40

Open
