Support for non-UTF `encoding` in xml parser #332

armanbilge · 2022-06-10T23:18:34Z

Very excited about the enhanced XML support in 1.4.0 :) I've been experimenting with it in http4s/http4s-scala-xml#25 and running into trouble with non UTF encodings. FTR I'm no expert in these things :)

For example this request:

Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>

as used in this test:
https://github.com/http4s/http4s-scala-xml/blob/1ca64f2ab7ef500d384d2ec5f8caf88df600e6a6/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala#L198-L209

Furthermore the RFC specifies:

Since the charset parameter is not provided in the Content-Type
header and there is no overriding BOM, conformant XML processors must
treat the "iso-8859-1" encoding as authoritative.  Conformant XML-
unaware MIME processors should make no assumptions about the
character encoding of the XML MIME entity.

https://datatracker.ietf.org/doc/html/rfc7303#section-8.3

I'm not sure if there is a way to support this without an XML parser that operates directly on bytes instead of chars/strings 😕 any thoughts? Thanks!

The text was updated successfully, but these errors were encountered:

satabin · 2022-06-11T15:36:22Z

Thanks for reporting. My parser indeed works on decoded characters, so you need to decode it in the right way before going through the XML event pipe. This can be done with fs2-data by using the appropriate import which brings the decoder in scope from Byte to something that can emit Chars.

The parser operates then on these bytes without changing the decoding if it does not match the one provided.

I remember thinking about implementing this behavior but I decided to stay with an implementation that is not aware of the encoding. It trusts blindly the strings that were decoded for it.

In the case of your issue, I am not entirely sure how the string is decoded from bytes, I should have a closer look at it.

armanbilge · 2022-06-11T15:51:32Z

Thanks for the response! I'm digging into this again since it's all rather confusing.

Currently looking at https://www.w3.org/TR/2008/REC-xml-20081126/#charencoding

Which says:

Each external parsed entity in an XML document may use a different encoding for its characters.

Which again suggests to me, that XML-parsing should operate directly on byte-streams, rather than decoded Strings.

satabin · 2022-06-11T15:54:02Z

Parsed entities are DTD related (you have internal and external ones). What this says is that every externally defined entity (i.e. in a DTD that is physically in another file) might use a different encoding.

satabin · 2022-06-11T15:57:07Z

Basically the approach taken in fs2-data is:

for textual format, you need to decode the bytes into characters first.
the textual parsers get a stream of characters that is assumed to be correct.

In your case, I would expect bodyText to decode text according to the declared charset (which was set to the source).
I am not familiar with http4s types anymore, I need to dig a bit more to see where the problem would lie. If the text is actually latin9 encoded, and the message declares this charset, I would expect the string coming out of bodyText to be properly decoded, but I might be wrong, I need to check.

armanbilge · 2022-06-11T15:59:29Z

In your case, I would expect bodyText to decode text according to the declared charset

Right. I think this situation is testing when there is no declared charset, except for the encoding specified in the XML itself.

Still, I think are you right that this might be a problem to solve in http4s. Further reading:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

https://www.w3.org/TR/2008/REC-xml-20081126/#charencoding

satabin · 2022-06-11T16:05:16Z

In the test, can you try making an implicit ISO-8859-1 charset available in scope of the test and see if it solves it? bodyText takes an implicit charset, which is UTF-8 by default.

armanbilge · 2022-06-11T16:17:12Z

Right, so the test is currently declared like this:

  test("parse omitted charset and 8-Bit MIME Entity") {
    // https://datatracker.ietf.org/doc/html/rfc7303#section-8.3
    encodingTest(
      Chunk.array(
        """<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>""".getBytes(
          StandardCharsets.ISO_8859_1
        )
      ),
      "application/xml",
      "Günther",
    )
  }

That fails.

However, if I specify the charset like this, then it passes:

diff --git a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
index 8739d94..cfd9622 100644
--- a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
+++ b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
@@ -203,7 +203,7 @@ class ScalaXmlSuite extends CatsEffectSuite with ScalaCheckSuite {
           StandardCharsets.ISO_8859_1
         )
       ),
-      "application/xml",
+      "application/xml; charset=iso-8859-1",
       "Günther",
     )
   }

But the whole point of that test is to pass without specifying the charset.

test("parse omitted charset and 8-Bit MIME Entity")

satabin · 2022-06-11T17:52:45Z

No I meant something adding implicit val charset = Charset.forName("iso.8859.15") or similar in the test itself. Since no encoding is provided in the answer, you need to specify how to decode the body.

armanbilge · 2022-06-11T17:57:04Z

Sorry, I guess I'm confused 😕 I understand the code changes you are suggesting (and I expect it will work), but I don't understand what it will demonstrate? Since in practice there would be no way to know what the implicit charset should be.

armanbilge · 2022-06-11T18:13:15Z

@rossabaker if you have a moment to weigh in here it would be appreciated 🙏

ybasket · 2022-10-23T17:45:46Z

@satabin I solved the problem with a small hack in http4s-fs2-data-xml, see https://github.com/http4s/http4s-fs2-data/blob/main/scala-xml/src/main/scala/org/http4s/scalaxml/ElemInstances.scala#L53-L134.

Would you see this as something we could pull into fs2-data (partially, of course we don't have HTTP headers, only the XML prolog)? Maybe as some kind of configurable behaviour? I'm a bit on the fence…it brings us closer to the XML spec, but would be one of the least elegant parts of this project as it goes against our abstractions like CharLikeChunks. Or maybe we find a reasonable way of back-channeling charset info which we can implement in 2.0.

If we don't pull it in or explore alternatives, I would close this issue as http4s has a way forward.

satabin · 2022-10-24T09:56:11Z

I am not a fan of what I did with the CharLikeChunks, it is not flexible enough, and I was thinking about a better approach for 2.0, that would allow for this kind of behavior.

The current approach in fs2-data 1.x is to have characters decoded outside of fs2-data, which does not work well with this kind of of format.

I would rather not integrate it currently and think about a better approach for this problem and abstraction and integrate it in 2.0. WDYT?

armanbilge changed the title ~~Support for non-UTF8 encoding~~ Support for non-UTF8 encoding in xml parser Jun 10, 2022

armanbilge changed the title ~~Support for non-UTF8 encoding in xml parser~~ Support for non-UTF encoding in xml parser Jun 10, 2022

satabin self-assigned this Jun 11, 2022

armanbilge mentioned this issue Jun 12, 2022

fs2-data-xml streaming parser http4s/http4s-scala-xml#25

Closed

ybasket added this to the 2.0.0 milestone Oct 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for non-UTF `encoding` in xml parser #332

Support for non-UTF `encoding` in xml parser #332

armanbilge commented Jun 10, 2022

satabin commented Jun 11, 2022

armanbilge commented Jun 11, 2022

satabin commented Jun 11, 2022

satabin commented Jun 11, 2022

armanbilge commented Jun 11, 2022

satabin commented Jun 11, 2022

armanbilge commented Jun 11, 2022

satabin commented Jun 11, 2022

armanbilge commented Jun 11, 2022

armanbilge commented Jun 11, 2022

ybasket commented Oct 23, 2022

satabin commented Oct 24, 2022

Support for non-UTF encoding in xml parser #332

Support for non-UTF encoding in xml parser #332

Comments

armanbilge commented Jun 10, 2022

satabin commented Jun 11, 2022

armanbilge commented Jun 11, 2022

satabin commented Jun 11, 2022

satabin commented Jun 11, 2022

armanbilge commented Jun 11, 2022

satabin commented Jun 11, 2022

armanbilge commented Jun 11, 2022

satabin commented Jun 11, 2022

armanbilge commented Jun 11, 2022

armanbilge commented Jun 11, 2022

ybasket commented Oct 23, 2022

satabin commented Oct 24, 2022

Support for non-UTF `encoding` in xml parser #332

Support for non-UTF `encoding` in xml parser #332