Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tidy does not move <style>-Tag to head #567

Closed
suchafreak opened this issue Jun 2, 2017 · 25 comments
Closed

Tidy does not move <style>-Tag to head #567

suchafreak opened this issue Jun 2, 2017 · 25 comments

Comments

@suchafreak
Copy link

suchafreak commented Jun 2, 2017

We just recently switched from old jTidy to tidy-html5 in our Java application. We rely on tidying to correct user provided html that can be quite messy.

Now, we hit a problem regarding style tags in the body: jTidy and the online tidy at https://infohound.net/tidy/ both move style tags found anywhere in the body to the head correctly. This is exactly what we we want!

Unfortunately, we cannot seem to get tidy-html5 to do the same: Tidying reports no errors even though there are several style tags nested in divs like so:

<body>
<div>
...
<style type="text/css">
.background-grey {
        background-color: #f8f8f8;
        }
</style>
...
</div>
</body>

Unfortunately, they are not moved to the head as I would expect. (Again, jTidy and the online version at https://infohound.net/tidy/ do this).

Has this behavior changed or am I doing something wrong?

Any help is very much appreciated!
MTIA
-sascha

@suchafreak
Copy link
Author

Here a simple test file:

<!DOCTYPE html>
<html lang="de-CH">
<head> 
  <meta charset="UTF-8">
  <title>Test Page with style in body</title>
  <meta content="Tidy should move style to head example" name="description">
  <meta content="width=device-width, initial-scale=1.0" name="viewport">
</head>
<body class="l-center">
  <style type="text/css">

  .content {
    max-width: inherit;
  }

  #centerContent {
    padding: 0;
  }

  </style>
  <div id="layout">
    <div id="main">
      <div class="content pure-g pure-u-md-24-24 pure-u-1" id="centerContent">
        <a id="a28" name="a28"></a> <a id="a30" name="a30"></a> A simple test.
      </div>
    </div>
  </div>
</body>
</html>

@suchafreak suchafreak reopened this Jun 2, 2017
@geoffmcl
Copy link
Contributor

geoffmcl commented Jun 2, 2017

@suchafreak thank you for the issue...

I know very little about jTidy, but, for sure, current 5.5.31 Tidy refuses to move <style> elements to the <head>, and unless convinced otherwise, marking this as a big BUG!

Current tidy still has the message { TAG_NOT_ALLOWED_IN,0,"%s isn't allowed in <%s> elements" }, but as far as I can see the message is now never used...

Going back in Tidy versions, my earliest, Tidy4aug00, which I call Tidy2000 moved it to the <head>, and issued the warning, as did the last cvs code release circa 2009... even a Jan 2012, tidydev fixed this...

But the fix was gone by our first release 5.0.0, circa 2015... the first release since 2009... but that is all just history...

It will take more digging into the code, to find out exactly when and why this was changed, and hopefully change it back, unless, as stated, there is some valid reason for this change...

Will look deeper into this soonest, and report... and would appreciate any further feedback... thanks

@geoffmcl geoffmcl added the Bug label Jun 2, 2017
@geoffmcl geoffmcl added this to the 5.5 milestone Jun 2, 2017
@suchafreak
Copy link
Author

@geoffmcl

Thank you for your answer.

As of now, I consider this to be a bug as well. The W3C validator reports an error for each style-tag in the body of a page unmistakably .

Please let me know, if I can be of any assistance. (Unfortunately, I am versed too well in C as I had to learn the hard way trying to access libtidy via JNA from Java to no avail. )

Thanks for your hard work!

@geoffmcl
Copy link
Contributor

geoffmcl commented Jun 2, 2017

@suchafreak, ok, found the exact line that causes this...

It changed from -

        if ( nodeIsINPUT(node) ||
             (!TY_(nodeHasCM)(node, CM_BLOCK) && !TY_(nodeHasCM)(node, CM_INLINE))
           )
        {

to -

        if (( nodeIsINPUT(node) ||
             (!TY_(nodeHasCM)(node, CM_BLOCK) && !TY_(nodeHasCM)(node, CM_INLINE))
           ) && !TY_(IsHTML5Mode)(doc) )
        {

But, at some point CM_BLOCK got added to the "style" tag, line 259, so now it does not enter this block of code... still to track the commit that did this, and examine why...

And even if that CM_BLOCK was removed from the <style> tag, my commit 685f7a6 also means the code does not enter this block of code, if still in HTML5 mode...

And thus never gets to the code, which does exactly what it says -

            if (node->tag->model & CM_HEAD)
            {
                MoveToHead(doc, body, node);
                continue;
            }

Would be relatively easy to add code like -

    if (nodeIsSTYLE(node)) 
    {
        MoveToHead(doc, body, node);
        continue;
    }

Full patch -

diff --git a/src/parser.c b/src/parser.c
index a037280..41ff2d0 100644
--- a/src/parser.c
+++ b/src/parser.c
@@ -4121,6 +4121,14 @@ void TY_(ParseBody)(TidyDocImpl* doc, Node *body, GetTokenMode mode)
             }
         }
 
+        /* Issue #567 - <style> tags found in this <body> parsing
+           should be moved to the <head>, always... */
+        if (nodeIsSTYLE(node))
+        {
+            MoveToHead(doc, body, node);
+            continue;
+        }
+
         if (node->type == EndTag)
         {
             if ( nodeIsBR(node) )

Is there any reason this would not be true?

Need to give this a little more thought... testing and feedback very welcome... thanks...

@suchafreak
Copy link
Author

&& !TY_(IsHTML5Mode)(doc)

The only reason I see for this check is that at some point there were plans to support scoped style elements in HTML5. But as far as I can tell, this seems to be the case no longer, see: http://w3c.github.io/html/document-metadata.html#the-style-element
http://caniuse.com/#search=scoped

No current browser supports this anyway.
HTH

@geoffmcl
Copy link
Contributor

geoffmcl commented Jun 3, 2017

@suchafreak thank for the links and feedback... wow, there has certainly been a lot said about allowing <style> in the <body>, and the <style scoped> issue that leads on from that... certainly some heated debate...

And while few browsers, other than Firefox, support the scoped style, all do work with a <style> in the body... but that has little weight in that Tidy tries hard to follow the W3C specs...

One can read the clear comments from @sideshowbarker, who worked on tidy a while back, that he added the error to W3C nu validator... and likewise the legacy document validator also flags an error...

So if the aim of Tidy is to produce valid html, then I too think it should move it to the <head>... simple...

No, the addition of && !TY_(IsHTML5Mode)(doc) was to only address issue #428, namely the change in the <input> tag between html5 and legacy documents... and missed the fact other tags come through here...

As mentioned, the moving of the <style> to the head has been broken ever since the CM_BLOCK bit was added to the <style> tag table entry, but I still can not trace the origin, and reason for that change!

So removing that bit would be one solution, but could have other consequences... although none seen in the 235 regression tests... but sadly there seems no tests covering this <style> in the <body>... obviously one should be added to check this in the future...

But on balance, it seems the proposed patch is a clean, very specific, and a clear solution, once it is agreed Tidy should issue this warning, and make this document fix...

Will ponder this some more... as always feedback very welcome... thanks..

@geoffmcl
Copy link
Contributor

geoffmcl commented Jun 3, 2017

@suchafreak to facilitate easy testing have pushed the changes to an issue-567 branch...

Have also pushed some sample in_567*.html to my test repo...

@suchafreak
Copy link
Author

Hi

Thanks a lot for the prompt response.

Is there any way to get hold of a binary of this fixed version for Windows? Unfortunately, I lack the necessary build environment to build tidy.exe from the sources. If not, what is the estimated time frame for the release of this fix?

MTIA
-sascha

@geoffmcl
Copy link
Contributor

geoffmcl commented Jun 6, 2017

@suchafreak well I do like working on bugs ;=)) It's my kind of fun...

get hold of a binary of this fixed version for Windows?

Well, we have no present place to officially publish such an intermediate, test version of Tidy, but maybe we should...

I could copy one to a temporary place, but you would probably still need some other installs to get it working... What version of Windows are you using, and bitness, 32 or 64?

But just because it is easy, I have copied a zip - http://geoffair.org/tmp/tidy-5.5.31.I567-w64-vc14-md.zip - but this assumes a Windows 64-bits, and you may have to install the appropriate vc_redist.x64, if not already done...

lack the necessary build environment

But this is so easy to set up... you just need 3 things git, cmake, and MSVCxx... all free.. and voila...

  1. git - https://git-scm.com/download/win
  2. cmake - https://cmake.org/download/
  3. MSVCxx - https://www.visualstudio.com/vs/community/

This is documented in our BUILD.md... but repeated here, once the above 3 tools are installed...

Start in <someroot>, like say C:\Projects or even C:\Users\<user>\Projects, etc...

  1. git clone https://github.com/htacg/tidy-html5.git # clone repo
  2. cd tidy-html5 # get to source tidy root
  3. git checkout issue-567 # checkout the desired branch - git branch -r will show all
  4. cd build\cmake # get into a build directory...
  5. cmake ..\.. # do configure and generation
  6. cmake --build . --config Release # actually build the project...
  7. release\tidy -v # run the built tidy, and show version...

This may default to a 32-bit build, but if in a 64-bit system, change 4. to cd build\win64, and run the MSVCxx x64 command prompt before continuing with 5.... Both these build directories contain a build-me.bat that may just require a few version tweaks, or none, to be used...

That does not seem too hard... Why not try it? It is fun!

Look forward to further feedback... thanks...

@suchafreak
Copy link
Author

Hi

I tested the patched version with this input:

<!doctype html>
<html lang="de-CH">
<head>
  <meta charset="UTF-8"/>
  <title>A title</title>
</head>
<body>
<style type="text/css">
  #someId {
    right: 0;
  }
</style>
<div id="layout">
  <style type="text/css">
    #top-bar {
      left: 0;
    }
  </style>
</div>
</body>
</html>

The good: The style-Tag directly within body gets moved to the head correctly.
The bad: The style-Tag nested within the div is not.

This is the output:

<!DOCTYPE html>
<html lang="de-CH">
<head>
<meta charset="UTF-8" />
<title>A title</title>

<style type="text/css">
/*<![CDATA[*/
  #someId {
    right: 0;
  }
/*]]>*/
</style>
</head>
<body>
<div id="layout">
<style type="text/css">
/*<![CDATA[*/
    #top-bar {
      left: 0;
    }
/*]]>*/
</style></div>
</body>
</html>

Hope this helps to nail down the problem further.
And many thanks for the build steps - worked like a charm.

Please let me know if I can be of any assistance.
Cheers,
-sascha

@geoffmcl
Copy link
Contributor

@suchafreak yup, the patch I proposed will only deal with moving the <style> if it occurs as a direct descendant of <body>... at that moment Tidy is in ParseBody, where the patch lives...

But when the token <div> is decoded, tidy will enter ParseBlock to handle that, and so on for about 20 other ParseXXXXX elements... my patch would then need to be replicated in these many parsers... not the way to go!

So if we want to catch any and every <style> element, this has to be done after the initial token decodes, where they are added to a DOM like tree of nodes... and the file stream is closed...

Then tidy runs status = tidyCleanAndRepair( tdoc );, which in turn runs tidyDocCleanAndRepair( impl );...

Here the complete node tree can be processed again, node, by node...

This is where, for example, inline style attributes, like <span style='font ...'> are converted to CSS in the <head>, and replaced with <span class="c1"> in the <body>, if clean is on...

Reading tidyDocCleanAndRepair( impl ); will show there are many little services, like NestedEmphasis, List2BQ, BQ2Div, CleanDocument, etc, etc, some dependant on options, but in essence each will traverse the node tree, looking to fix what they are intended for...

So as indicted, to move all <style> out of body, we would need a new service like say TY_(CleanStyle)( TidyDocImpl* doc ), or something...

Will give this some thought, and cycle back to it as time permits, but if you, or others, have some ideas, then feedback, patches, PR very welcome... thanks...

Glad to hear you were able to clone and build tidy. The next step would be to fork tidy-html5 repo to your own github space, then clone and build that fork...

Then you can make code changes, usually in a branch, push these to your fork, and present a PR... the process is simple... after you have forked the repo -

  1. git clone [email protected]:suchafreak/tidy-html5.git tidy-html5-fork # clone fork to dir
  2. cd tidy-html5-fork # get to source tidy fork root
  3. git checkout -b test1 # checkout the new branch - important
  4. make desired code changes now or later...
  5. cd build\cmake # get into a build directory...
  6. cmake ..\.. # do configure and generation
  7. cmake --build . --config Debug # actually build the project...
  8. cmake --build . --config Release # actually build the project...
  9. in the MSVCxx IDE, test that the built tidyd.exe does what you want
  10. git commit -m "Changes for issue Tidy does not move <style>-Tag to head #567" -a # commit the code changes
  11. git push -u origin test1 # add branch and changes to fork
  12. setup a PR

I usually leave step 4. until after 8., and load the MSVCxx IDE, and make code changes using its very helpful IDE editor... Then as in 9. you can build and test in the IDE...

Advise if you need any help with this... thanks...

@suchafreak
Copy link
Author

@geoffmcl

Thank you for your reply.
Unfortunately, neither me nor anyone on my team is versed enough in C to be of any help here.

Regards,
-sascha

@geoffmcl
Copy link
Contributor

@suchafreak ok, took another look at this, but not pushed anything yet... but as before have copied an experimental build to http://geoffair.org/tmp/tidy-5.5.31.I567-2-w64-vc14-md.zip

Note it is experimental, and have not yet tested all cases... It would really help if you could find a sample where it fails...

And if I get the chance will push this to an issue-567-2 branch, so you can build it yourself..

But now two things trouble me -

  1. Should this be a option?
  2. What about a warning, or at least info message?

If I load my in_567-3.html test file in a browser that supports scoped, like firefox 43.0.1...54.0... the Hello is in red, and the World is in green, and I would be very upset with tidy if it unconditionally moved these two <style> blocks to the <head>, when browsers will only use the last style, green...

So there would need to be a option like --move-style-to-head yes/no... or --fix-style-tag, or something, so I could turn this off if I am using a browser that supported scoped... Maybe you are right - there are not many - but the moment there is one, then this option becomes important...

Of course, it seems Chome and IE, and maybe others, show both words in green...

And at the moment, this is a silent fix! I think I would want to know, at least as an Info: type message, that tidy has made this quite drastic change... even if it is correct html...

SOOO, really seek further feedback on this... thanks...

@suchafreak
Copy link
Author

Hi @geoffmcl

Thank you for your work.

We have a large number of pages we can use to test your changes, some of them contain rather "nasty" markup. I will get back to you as soon as I have results.

As far as I understand it, the support for scoped style definitions has been dropped from HTML5, probably because web components offer better isolation of CSS. So, from where I stand, the only correct behavior is to move any style tag found in the body to the head. A relaxed implementation could leave scope styles untouched until Mozilla drops style support, see: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/style:

The scoped attribute is non-standard; it was being considered for standardization at one point but was dropped. Firefox and some other browsers still support it but you should avoid using it if at all possible since it could be removed from any or all of the supporting browsers at any time.

Hope this helps. I will report my findings asap.

@suchafreak
Copy link
Author

Hi

Unfortunately, the experimental version found at http://geoffair.org/tmp/tidy-5.5.31.I567-2-w64-vc14-md.zip does not move any style tags to the head anymore.

What am I missing?
Regards

@geoffmcl
Copy link
Contributor

@suchafreak thanks for the feedback and links... I understand the scoped attribute is non-standard, and support could be removed at any time from the few browsers that support it at this time...

And if I was into browser development I too would be against it, since you would have to attach rendering attributes to just about every level of the DOM (like) tree... rather ugly and quite difficult... aside from consuming more memory, to hold rendering attributes at every context level... YIKES!!!

But to play it safe, I have proceeded to implement a --fix-style-tags which defaults to yes, but not pushed yet... mainly because you reported my tidy-5.5.31.I567-2-w64-vc14-md.zip exe FAILED! UGH!!!

It seems to work on the five samples I have created in my test repo, see in_567*.html... maybe you can confirm the above tidy -v = 5.5.31.I567-2 also works for you on these 5 samples... just to be sure I have not fallen out of my tree...

So it is my turn to say What am I missing? ;=))

Can you provide at least one little sample, hopefully the smallest, where it fails... please try to trim it down to the basics...

Maybe I need to do more to fully traverse the tidy DOM like tree... I guess what I have now will miss if the <style> is more that 2 child levels deep into a parent node tree... or something...

Am sure I can fix that by fully iterating every node in the <body> tree... if that is the problem...

So, like I say, still considering an option, but it would default to yes, so would only be needed by a users that specifically did not want this action... could be forgotten by most of us... never used... and maybe removed again when all browsers drop support...

But the first thing is to get it working fully, and hope you can help with a small sample that fails... thanks...

@suchafreak
Copy link
Author

Hi @geoffmcl

Sorry for not getting back to you sooner.

The experimental version found at http://geoffair.org/tmp/tidy-5.5.31.I567-2-w64-vc14-md.zip does not move any style tags to the head anymore, not even in this simple case:

<!doctype html>
<html lang="de-CH">
<head>
  <meta charset="UTF-8"/>
  <title>A title</title>
</head>
<body>
<style type="text/css">
  #someId {
    right: 0;
  }
</style>
</body>
</html>

Cheers,
-sascha

@geoffmcl
Copy link
Contributor

@suchafreak really do not understand why http://geoffair.org/tmp/tidy-5.5.31.I567-2-w64-vc14-md.zip works fine for me. Given your above input, I get the following output -

C:\Users\user\Downloads\tidy-5.5.31.I567-2-w64-vc14-md>tidy-5.5.31.I567-2-w64-vc14-md in_567-simple.html
Info: Document content looks like HTML5
No warnings or errors were found.

<!DOCTYPE html>
<html lang="de-CH">
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.5.31.I567-2">
<meta charset="UTF-8">
<title>A title</title>

<style type="text/css">
  #someId {
    right: 0;
  }
</style>
</head>
<body>
</body>
</html>

That is the <style> has been moved to the <head>, silently...

But as stated, I do recognize there may be cases where the fix in 5.5.31.I567-2 could fail, and I tried to construct such a sample, and had some surprises...

Current tidy 5.5.31 will already fix the following sample -

<table>
<tr>
<td>
<style  type="text/css">
 p { color: red }
</style>
<p>red text</p>

With the output:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: inserting implicit <body>
line 4 column 1 - Warning: <style> isn't allowed in <td> elements
line 3 column 1 - Info: <td> previously mentioned
line 1 column 1 - Warning: missing </table>
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML5
Tidy found 5 warnings and 0 errors!

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.5.31">
<style type="text/css">
 p { color: red }
</style>
<title></title>
</head>
<body>
<table>
<tr>
<td>
<p>red text</p>
</td>
</tr>
</table>
</body>
</html>

So am still searching for an example that breaks my current patch... so I can work on it further...

But still really puzzled why version 5.5.31.I567-2 fails for you, and seems to work ok for me... that is a big mystery...

Anyway, will work on this as time permits... thanks...

PS: Found a test case that FAILED! in_567-7.html, and did some more fixes... Still testing, but copied a test http://geoffair.org/tmp/tidy-5.5.31.I567-3-w64-vc14-md.zip... hope you get the chance to test... hopefully reporting success... thanks...

@suchafreak
Copy link
Author

Hi @geoffmcl

I will test the new version. Maybe the problem lies in the settings we use:

drop-empty-elements: false
drop-empty-paras: false
fix-uri: false
fix-backslash: false
join-styles: false
output-html: true
output-xhtml: true
output-xml: false
quiet: true
tidy-mark: false
wrap-php: false
wrap-script-literals: false

I will get back to you asap.
Many thanks for your ongoing work and support.

@suchafreak
Copy link
Author

Good news: the new version works 👍

<!doctype html>
<html lang="de-CH">
<head>
  <meta charset="UTF-8"/>
  <title>A title</title>
</head>
<body>
<style type="text/css">
  #someId {
    right: 0;
  }
</style>
<div id="layout">
  <style type="text/css">
    #top-bar {
      left: 0;
    }
  </style>
</div>
</body>
</html>

Next, I will test more tricky scenarios and report my findings here asap.

@suchafreak
Copy link
Author

Me again:

Run some more tests on files that contain style tags in divs nested up to five levels deep. Tidy moved all the encountered style tags to the head.

@geoffmcl : Should we also tests files containing even deeper nested structures?

Many thanks!
Cheers,
-sascha

@geoffmcl
Copy link
Contributor

@suchafreak thank you for testing the version 3, 5.5.31.I567-3, and reporting success...

I had tested with your given config and found no problems... The only time this fix will be skipped is if you tell tidy to treat the input as xml, when this, and a lot of other [x]html cleaning is skipped...

See tidyXmlTags, where tidy will return from tidyDocCleanAndRepair immediately, if this is set... ie if --input-xml: yes is given in the config...

Reading around it seems xhtml1, was effectively the same as html4, but then the spec moved on to html5, which seems to have made xhtml1 somewhat redundant... so not sure why you are choosing output-xhtml: true? Why not just use html5?

But many blogs I read sometimes seems somewhat contradictory, or at least unclean, so I guess users are free to choose what they want... the only difference I saw was self closing tags, like <meta ... will end in />, and the style tag will be wrapped in <![CDATA[ ... ]]>... But I do not think browsers care about such changes...

Should we also tests files containing even deeper nested structures?

I do not think this is necessary. The way I have re-arranged to fix, it iterates through every tag in the <body>, so the actual depth should now not matter... but it never hurts to try and be sure...

I will shortly get around to pushing my latest changes to the issue-567-2 branch, and set up a PR for further testing...

But in general, starting to feel good about this fix... thanks...

geoffmcl added a commit that referenced this issue Jun 28, 2017
Add option TidyStyleTags, --fix-style-tags, Bool, to turn off
this action.

Add warning messages MOVED_STYLE_TO_HEAD, and FOUND_STYLE_IN_BODY.

Fully iterate ALL nodes in the body, in search of style tags...

Changes to be committed:
	modified:   include/tidyenum.h
	modified:   src/clean.c
	modified:   src/config.c
	modified:   src/language_en.h
	modified:   src/message.c
@geoffmcl
Copy link
Contributor

All changes pushed to issue-567-2 branch... hope others get a chance to checkout, pull, and test this branch... thanks...

@suchafreak
Copy link
Author

@geoffmcl

All tests on our side show success! This is very good news. Many thanks for your work.

@ferdnyc
Copy link

ferdnyc commented Jul 16, 2017

@geoffmcl

@suchafreak thanks for the feedback and links... I understand the scoped attribute is non-standard, and support could be removed at any time from the few browsers that support it at this time...

If it helps cement this decision any, it's not even "could be" anymore: Firefox devs announced plans a month ago to unship <style scoped> support, as they'd like to avoid having to deal with it as they ramp up the "Quantum CSS" project to integrate the new Stylo parallelized CSS engine into Gecko. Stylo doesn't support <style scoped>, and now it will never have to. There was much rejoicing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants