Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unescaping already escaped xml URLs #460

Closed

Conversation

wblachowski
Copy link
Contributor

I made a change fixing unescaped URLs in report page. For details see #441.

Description

Current implementation expects unescaped URLs in suite XML (e.g (...)?a=b&c=d ) which are then escaped before serializing XML to TestSuite. If you put correctly escaped XML in URL (e.g (...)?a=b&c=d) then the query params are escaped to (...)?a=b&c=d, parsed back (while serializing) to (...)a=b&c=d and this version is then displayed in report page.

This redundancy in escaping is what causes the issue. If the URLs are already escaped they shouldn't be tampered with. My fix uses the simpliest solution, i.e it tries to unescape before escaping, thus handling correctly both unescaped (unescape has no effect here) and escaped URLs.

I also added a unit test for escaped URLs.

Motivation and Context

Closes #441

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • I have reviewed (and updated if needed) the documentation regarding this change

I hereby agree to the terms of the AET Contributor License Agreement.

Copy link
Contributor

@wiiitek wiiitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wblachowski i think you observed something important in AET suite. The fact, that we are escaping URLs: special XML characters are escaped - i.e. &, " are escaped... to &, ".

But something is not right here. Isn't it that:

StringEscapeUtils.escapeXml11(StringEscapeUtils.unescapeXml( <something> ) === <something>

??

Motivation:
I believe someone might want to test proper URLs decoding on site with:

  • https://www.google.pl/search?q=ąę&param=12
  • https://www.google.pl/search?param=12&q="ąę ćń"

What should be put into suite XML as URL? Currently we are escaping special chars like & before parsing content from XML into TestSuite object isn't it? I believe this is done because we wanted to be kind to end-users. In other words we were afraid, that they will not be able to correctly escape special entities for XML attribute.

But what should we do with this URL:

        <url href="https://www.google.pl/search?param=12&q=&quot;" />

If something like that is placed inside suite - what page should AET test? Currently it will do search for &quot;, but after the proposed change it will search for "?

So maybe we could do something different?

Proposition:
Let's agree, that suite XML is a XML file. So content of this file needs to be properly escaped. This is also correct for XML attributes for regular expressions.

In my opinion we should NOT have any escaping utils. User needs to properly escape an URL with XML entities:

  • https://www.google.pl/search?param=12&q=&quot;

or query string (URI component) encodings:

  • https://www.google.pl/search?param=12&q=%22

EscapeUtils class could be removed and we can introduce UrlValidator class, that looks into XML content, finds values of href attributes and validates if the URL is properly escaped - returning nice error message for users that provide invalid URLs.

@wblachowski
Copy link
Contributor Author

Hello @wiiitek, you've made a few good points and I feel obliged to clear up some misunderstandings.
First of all:

But something is not right here. Isn't it that:
StringEscapeUtils.escapeXml11(StringEscapeUtils.unescapeXml( something )) === something
??

That's true only for escaped strings. Take for example this string: &. The above statement evaluates to escape(unescape(&)) == escape(&) == &amp; =/= &.
As you can see unescape() has no effect on not-escaped strings. This little workaroud of unescaping and escaping ensures that at the end we get an escaped string regardless of initial input being escaped or not.

Secondly:

But what should we do with this URL:
<url href="https://www.google.pl/search?param=12&q=&quot;" />
If something like that is placed inside suite - what page should AET test? Currently it will do search for >", but after the proposed change it will search for "?

You are right that this url will in the end be serialized to (...)?param=12&q=", however I have some problems with this url. To me, it would never be necessary to use XML escaping on query strings. If you have a valid URL, all XML special characters (&<>"') are already escaped. The exception is & delimiting query parameters

In conclusion:
To my mind the only enemy among XML special characters (&<>"') is the ampersand because I don't see any possibility of any of those <>"' occurring unescaped in a valid URL, although I might be wrong. With a valid URL user has to decide whether to escape all & chars with &amp; or keep them unescaped (thus making suite.xml invalid XML file). Both of those options will be correctly serialized but I agree that it might be misleading and sticking to one option would be nice.

@wiiitek
Copy link
Contributor

wiiitek commented Jan 3, 2019

Thank you, yes I agree :

To me, it would never be necessary to use XML escaping on query strings. If you have a valid URL, all XML special characters (&<>"') are already escaped. The exception is & delimiting query parameters

But it might be confusing what one should do in case of testing escaped special characters on search pages. Maybe we should update documentation? ;p
What should i put into XML if i want to test something like that:

escaped query string

?

@wblachowski
Copy link
Contributor Author

Dealing with <>"
Hmm, I delved deeper into the subject and according to RFC1738 characters such as <>" are regarded unsafe and should ALWAYS be encoded.

In fact, in your screenshot they are escaped but it appears that chrome tries to be user friendly and displays prettified version in the url bar. Fortunately, if you copy the address in the url bar and paste it in a text editor you get escaped version.
Example:
the link you provided:
https://www.google.pl/search?source=hp&q="%26amp%3B+vs+%2526"
copied from the url bar into notepad looks like this:
https://www.google.pl/search?source=hp&q=%22%26amp%3B+vs+%2526%22

I carried out a small experiment and pasted
http://nonexistingpage.mock/?checkmark=%E2%9C%93&q=%22heh%22
in url bars of different browsers. Results:

browsers
If you copy it back from the url bar to the notepad you get the original (escaped) URL.
Concluding, i think that unless you type the url by hand you shouldn't have to deal with escaping query parameters.

Dealing with ;

The only problem left is ; character. Similarly to & it is a reserved character and thus can occur unescaped in URLs. We wouldn't want strings such as &amp;, &lt; existing in valid URLs because they would get wrongly interpreted.
W3 reccommends (here) the possiblity of using ; as a query string separator instead of &.
So i guess a set of query strings like this: "a","amp","b" can IN THEORY be represented as ?a&amp&b or ?a;amp;c or even ?a&amp;c (this is a a strange mix-up but i guess it's valid?). This remains a problem but it seems like an edge case almost impossible to occur.

I have to admit this is more complicated than it seemed :p I wonder what are your thoughts

@wiiitek
Copy link
Contributor

wiiitek commented Jan 4, 2019

Thanks @wblachowski for your answer! I see you put an effort to look into this.

Please make a decision on this topic after some consultations with @Skejven and @mchrominski.

Personally I like the approach with with "valid XML" and making no escaping/unescaping but i understand that this could be not acceptable because of end-users. My other thought is that the XML we are using is not friendly format - we don't have XML schema (I remember there were attempts to prepare one). So any other format would be better ;p

Thank you again for your work. I believe you will make good decision. Cheers!

@dzuma
Copy link
Contributor

dzuma commented Nov 15, 2019

Whole this topic is resolved here: #524
I think this PR can be closed.

@dzuma dzuma closed this Nov 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XML attribute values in suite.xml are not un-escaped
4 participants