Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for non-UTF encoding in xml parser #332

Open
armanbilge opened this issue Jun 10, 2022 · 12 comments
Open

Support for non-UTF encoding in xml parser #332

armanbilge opened this issue Jun 10, 2022 · 12 comments
Assignees
Milestone

Comments

@armanbilge
Copy link
Contributor

Very excited about the enhanced XML support in 1.4.0 :) I've been experimenting with it in http4s/http4s-scala-xml#25 and running into trouble with non UTF encodings. FTR I'm no expert in these things :)

For example this request:

Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>

as used in this test:
https://github.com/http4s/http4s-scala-xml/blob/1ca64f2ab7ef500d384d2ec5f8caf88df600e6a6/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala#L198-L209

Furthermore the RFC specifies:

Since the charset parameter is not provided in the Content-Type
header and there is no overriding BOM, conformant XML processors must
treat the "iso-8859-1" encoding as authoritative.  Conformant XML-
unaware MIME processors should make no assumptions about the
character encoding of the XML MIME entity.

https://datatracker.ietf.org/doc/html/rfc7303#section-8.3

I'm not sure if there is a way to support this without an XML parser that operates directly on bytes instead of chars/strings 😕 any thoughts? Thanks!

@armanbilge armanbilge changed the title Support for non-UTF8 encoding Support for non-UTF8 encoding in xml parser Jun 10, 2022
@armanbilge armanbilge changed the title Support for non-UTF8 encoding in xml parser Support for non-UTF encoding in xml parser Jun 10, 2022
@satabin
Copy link
Member

satabin commented Jun 11, 2022

Thanks for reporting. My parser indeed works on decoded characters, so you need to decode it in the right way before going through the XML event pipe. This can be done with fs2-data by using the appropriate import which brings the decoder in scope from Byte to something that can emit Chars.

The parser operates then on these bytes without changing the decoding if it does not match the one provided.

I remember thinking about implementing this behavior but I decided to stay with an implementation that is not aware of the encoding. It trusts blindly the strings that were decoded for it.

In the case of your issue, I am not entirely sure how the string is decoded from bytes, I should have a closer look at it.

@satabin satabin self-assigned this Jun 11, 2022
@armanbilge
Copy link
Contributor Author

Thanks for the response! I'm digging into this again since it's all rather confusing.

Currently looking at https://www.w3.org/TR/2008/REC-xml-20081126/#charencoding

Which says:

Each external parsed entity in an XML document may use a different encoding for its characters.

Which again suggests to me, that XML-parsing should operate directly on byte-streams, rather than decoded Strings.

@satabin
Copy link
Member

satabin commented Jun 11, 2022

Parsed entities are DTD related (you have internal and external ones). What this says is that every externally defined entity (i.e. in a DTD that is physically in another file) might use a different encoding.

@satabin
Copy link
Member

satabin commented Jun 11, 2022

Basically the approach taken in fs2-data is:

  • for textual format, you need to decode the bytes into characters first.
  • the textual parsers get a stream of characters that is assumed to be correct.

In your case, I would expect bodyText to decode text according to the declared charset (which was set to the source).
I am not familiar with http4s types anymore, I need to dig a bit more to see where the problem would lie. If the text is actually latin9 encoded, and the message declares this charset, I would expect the string coming out of bodyText to be properly decoded, but I might be wrong, I need to check.

@armanbilge
Copy link
Contributor Author

In your case, I would expect bodyText to decode text according to the declared charset

Right. I think this situation is testing when there is no declared charset, except for the encoding specified in the XML itself.

Still, I think are you right that this might be a problem to solve in http4s. Further reading:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

https://www.w3.org/TR/2008/REC-xml-20081126/#charencoding

@satabin
Copy link
Member

satabin commented Jun 11, 2022

In the test, can you try making an implicit ISO-8859-1 charset available in scope of the test and see if it solves it? bodyText takes an implicit charset, which is UTF-8 by default.

@armanbilge
Copy link
Contributor Author

Right, so the test is currently declared like this:

  test("parse omitted charset and 8-Bit MIME Entity") {
    // https://datatracker.ietf.org/doc/html/rfc7303#section-8.3
    encodingTest(
      Chunk.array(
        """<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>""".getBytes(
          StandardCharsets.ISO_8859_1
        )
      ),
      "application/xml",
      "Günther",
    )
  }

That fails.

However, if I specify the charset like this, then it passes:

diff --git a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
index 8739d94..cfd9622 100644
--- a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
+++ b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
@@ -203,7 +203,7 @@ class ScalaXmlSuite extends CatsEffectSuite with ScalaCheckSuite {
           StandardCharsets.ISO_8859_1
         )
       ),
-      "application/xml",
+      "application/xml; charset=iso-8859-1",
       "Günther",
     )
   }

But the whole point of that test is to pass without specifying the charset.

test("parse omitted charset and 8-Bit MIME Entity")

@satabin
Copy link
Member

satabin commented Jun 11, 2022

No I meant something adding implicit val charset = Charset.forName("iso.8859.15") or similar in the test itself. Since no encoding is provided in the answer, you need to specify how to decode the body.

@armanbilge
Copy link
Contributor Author

Sorry, I guess I'm confused 😕 I understand the code changes you are suggesting (and I expect it will work), but I don't understand what it will demonstrate? Since in practice there would be no way to know what the implicit charset should be.

@armanbilge
Copy link
Contributor Author

@rossabaker if you have a moment to weigh in here it would be appreciated 🙏

@ybasket
Copy link
Collaborator

ybasket commented Oct 23, 2022

@satabin I solved the problem with a small hack in http4s-fs2-data-xml, see https://github.com/http4s/http4s-fs2-data/blob/main/scala-xml/src/main/scala/org/http4s/scalaxml/ElemInstances.scala#L53-L134.

Would you see this as something we could pull into fs2-data (partially, of course we don't have HTTP headers, only the XML prolog)? Maybe as some kind of configurable behaviour? I'm a bit on the fence…it brings us closer to the XML spec, but would be one of the least elegant parts of this project as it goes against our abstractions like CharLikeChunks. Or maybe we find a reasonable way of back-channeling charset info which we can implement in 2.0.

If we don't pull it in or explore alternatives, I would close this issue as http4s has a way forward.

@satabin
Copy link
Member

satabin commented Oct 24, 2022

I am not a fan of what I did with the CharLikeChunks, it is not flexible enough, and I was thinking about a better approach for 2.0, that would allow for this kind of behavior.

The current approach in fs2-data 1.x is to have characters decoded outside of fs2-data, which does not work well with this kind of of format.

I would rather not integrate it currently and think about a better approach for this problem and abstraction and integrate it in 2.0. WDYT?

@ybasket ybasket added this to the 2.0.0 milestone Oct 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants