Document.to_xml is reformatting tags inconsistently #415

Closed
pragdave opened this issue Feb 7, 2011 · 10 comments

@pragdave

pragdave commented Feb 7, 2011

The code at https://gist.github.com/815162 reads an XML document and then writes it back out. It produces different results with libxml Nokogiri and the pure-Java version. The first and last tags are split onto multiple lines by the Java version, but left intact by the libxml version.

libxml nokogiri

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book SYSTEM "unused.dtd">
<book code="_authorinfo" in-beta="yes">

<processedcode language="ruby" size="normal">
<codeline><cokw>class</cokw> Fred &lt; Preprocessor</codeline>
<codeline>  <cokw>def</cokw> map(line)</codeline>
<codeline>    line.gsub(%r{&lt;ror/&gt;}, <costring>"Ruby on Rails"</costring>)</codeline>
<codeline>  <cokw>end</cokw></codeline>
<codeline><cokw>end</cokw></codeline>
</processedcode>
</book>

Java nokogiri

<?xml version="1.0" encoding="utf-8"?>
<book code="_authorinfo" in-beta="yes">

<processedcode language="ruby" size="normal">
<codeline>
  <cokw>class</cokw> Fred &lt; Preprocessor
</codeline>

<codeline>  <cokw>def</cokw> map(line)</codeline>
<codeline>    line.gsub(%r{&lt;ror/&gt;}, <costring>"Ruby on Rails"</costring>)</codeline>
<codeline>  <cokw>end</cokw></codeline>
<codeline>
  <cokw>end</cokw>
</codeline>

</processedcode>

</book>

Is there any configuration I can use to turn off this behavior? It's currently preventing me from switching our toolchain to JRuby.

Dave

@yokolet
Member

yokolet commented Feb 8, 2011

Hello! Thank you for testing pure Java Nokogiri.

Inconsistent spaces and newlines are very difficult to resolve in the pure Java version. Xerces needs a schema to handle those. However, the difference shown here looks like it comes from a bug. I'll look into what causes it.

@pragdave
Author

pragdave commented Feb 8, 2011

Let me know if I can help.

Dave

@flavorjones
Member

Hi Dave,

As @yokolet mentioned, it is extraordinarily hard to keep formatting behavior consistent between implementations.

So, I'm wondering, can you explain a bit more about why this sort of non-semantic change is a blocker for you?

Generally when people have pointed this out, it's because their test suites are asserting that the serialized document is identical to what they expect. Is this what you're up against?

A more semantic (and thus portable) way to do this sort of testing is to assert against the document structure. The gems lorax or nokogiri-diff may be able to help in that case.

Regardless, it would help us all if you'd give us some insight into what your particular blocker is here.

Thanks for using Nokogirl! (Aaron made me say that. ;))

@pragdave
Author

pragdave commented Feb 8, 2011

This isn't a question of a failing test. The problem is that this generated XML has additional whitespace that, when formatted using FO, results in extra spaces in the printed book.

I'm representing a line of source code to be formatted in a book.

So, the source code

class Fred < Preprocessor

gets converted to

<codeline><cokw>class</cokw> Fred &lt; Preprocessor</codeline>

However, the pure Java version converts it to

<codeline>
   <cokw>class</cokw> Fred &lt; Preprocessor
</codeline>

Now, in a <codeline>, whitespace is significant. The libxml version correctly puts no whitespace before the class keyword, while the Java version inserts it. As a result, the code listings format incorrectly.

Even more confusingly, though, the Java version treats the codelines differently: the first and last are wrapped, while the rest are formatted the same way that libxml formats them.

If we can stop the wrapping of the first and last, I think the problem would be solved.

Dave

@pragdave
Author

pragdave commented Mar 9, 2011

I can fix this by overriding the default formatting:

@doc.to_xml(:save_with => 0)

@flavorjones
Member

Dave - should this be closed? If this is as simple as turning off Node::SaveOptions::FORMAT by default on JRuby then maybe we should make that the default? I'm going to reopen.

@pragdave
Author

pragdave commented Mar 9, 2011

I think it makes more sense to have it off by default.

@flavorjones
Member

Mise en place refactor: fa671aa

@flavorjones
Member

default output of XML on JRuby is no longer formatted due to inconsistent whitespace handling. Closed by 4337005

@yokolet
Member

yokolet commented Mar 9, 2011

Changing the default of Node::SaveOptions::FORMAT makes sense. I had almost fixed this, but the format option was doing something wrong, which confused me.

I've already fixed the problem that a doctype decl was missing.

This issue was closed.