layout | title | date | disqus |
---|---|---|---|
post | Documentation | 2014-08-11 16:21:31 -0700 | y |
###### Pre-GSoC State of the Art:
AChecker validates the static HTML of a web page given as input against the
WCAG 2.0 guidelines: it takes the source code of the input URL and checks it
against the specified WCAG accessibility norms.
However, some accessibility issues cannot be identified by static validation
alone. For example, much of the HTML in a web page is generated only when
Document Object Model (DOM) events fire, and that content is never checked.
###### Purpose & Scope of the Project:
The goal is to improve the checks done by AChecker by treating the currently
implemented WCAG 2.0 guidelines as a black box and extending the AChecker
implementation so that it also validates dynamically generated web content.
Aim: Fetch the HTML that a web page generates dynamically in response to a
certain set of events (triggers), and then validate it.
Basic steps carried out:
- Write JavaScript and jQuery scripts to detect the common events that manipulate the DOM and generate new HTML content.
- Use the PhantomJS and CasperJS libraries (discussed in detail below) to push all such events into an array and evaluate each of them.
- Fetch the new HTML content after triggering each of these dynamic events.
- Merge the new HTML content obtained from each trigger with the original source code (without duplication).
- Pass the merged HTML to AChecker for validation and take the union of the results. The final AChecker result thus covers all the dynamic events as well as the default static HTML source (integration with AChecker).
Crux: Fetching the dynamically generated HTML requires the site to be opened and an event to be triggered, and the JavaScript that the page attaches to that event must actually run.
To accomplish this, we need a headless WebKit browser, i.e. a browser without
the actual UI that executes JavaScript and is able to modify the DOM =>
PhantomJS
PhantomJS by itself has DOM handling and jQuery selector functionality.
However, to make PhantomJS easier to use and to get better control over
navigation scripting, we use CasperJS too. It is basically a utility written
for PhantomJS that provides high-level functions for manipulating the DOM and
accessing it remotely, which makes it the most apt library for this project.
Setting up the environment
For installing PhantomJS:
Use the native package manager (apt-get for Ubuntu and Debian, pacman for Arch Linux, pkg_add for OpenBSD, etc.).
E.g., for Ubuntu:

    $ sudo apt-get install phantomjs

See the PhantomJS documentation for more information.
For installing CasperJS:
Using npm:

    $ npm install -g casperjs

See the CasperJS installation documentation for detailed instructions and
alternative methods.
Developer Manual
Since CasperJS is a utility framework written in JavaScript that allows access
to the remote DOM, the events that manipulate the DOM are detected using
standard jQuery selectors, and the matching HTML entities are returned to the
CasperJS environment.
In brief, the process works as follows.
The evaluate() function acts as a gate between the CasperJS environment and the opened web page. Every time a closure is passed to evaluate(), we enter the page and execute code as if using the browser console.
For example, let there be a function __getOnClickTriggerElements()__:

    function getOnClickTriggerElements(){
        var onClickTriggerElements = [];
        /* .. JavaScript/jQuery HTML entity selector code here .. */
        return onClickTriggerElements;
    }

This function returns the set of HTML elements that are capable of manipulating the DOM and generating new HTML content.
The function is then evaluated as:

    onClickTriggerElements = casper.evaluate(getOnClickTriggerElements);

Here, __onClickTriggerElements__ is the array of HTML entities returned from the __evaluate()__ function to the CasperJS environment. It lists the HTML elements which, when triggered, manipulate the DOM and generate new HTML content. Similar functions are written for the other trigger types (onClick, input forms, mouseover and related mouse events, button triggers, etc.); each fetches the DOM-manipulating elements of its type and returns them as an array.
Once all the trigger elements have been returned in an array, they need to be
triggered one by one so that the newly generated HTML content can be fetched.
Following the CasperJS evaluation architecture described above, we trigger each
of these elements inside the evaluate() function (using casper.each(), since
each element of the array is to be triggered separately) and render the
newly generated HTML content.
Assumption: Currently, a wait() function is used on the assumption that the
DOM takes at most 1 second to load after an element is triggered. The source
code at that instant is then captured and sent back to the CasperJS environment
from the evaluate() function. For each trigger, an HTML file named
data<counter>.html is generated (counter is the running number of such data
files), containing the source code of the web page as it stands after that
element was triggered.
A sample code snippet for this might look like:

    // wait for approx. 1000 ms to load the DOM
    casper.wait(1000, function(){
        HTMLSource = this.evaluate(function(){
            return document.getElementsByTagName('html')[0].outerHTML;
        });
        // save HTMLSource contents into separate files for each of the triggers
    });

Crux: Because each DOM-manipulating element is triggered iteratively on the
same page, the HTML source grows incrementally and the snapshots duplicate one
another. Consider the following illustration:
Say there are 4 trigger elements on a web page.
- The data0.html file contains the static source code of the web page being validated.
- After the 1st trigger is processed, say some new HTML content is generated; data1.html then contains a snapshot of the page source after triggering the 1st dynamic element.
- After the 2nd trigger is processed, data2.html contains the source after triggering the 2nd element. However, it also includes the HTML content generated by the 1st trigger, since the process is iterative and the triggers keep acting on the same DOM.
- A fresh reload is not done after every trigger, because reloading the web page each time would be costly in terms of time.

It is also incorrect to assume that the last HTML file generated (say _dataN.html_) contains all the newly generated HTML along with the original source code, since some triggers may overwrite content written by others.
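To see why the last snapshot alone is not enough, here is a minimal sketch in plain JavaScript (runnable under Node.js). The snapshot contents and the collectAddedLines() helper are made up for illustration and are not part of the project's code; the helper uses a simple line-set comparison rather than a real diff.

```javascript
// Hypothetical snapshots: index 0 stands for data0.html (static page), each
// later entry for the page source captured after one more trigger has fired.
const snapshots = [
  ['<h1>Demo</h1>', '<div id="status"></div>'],
  ['<h1>Demo</h1>', '<div id="status">Saved</div>'],
  // The 2nd trigger overwrites what the 1st one wrote.
  ['<h1>Demo</h1>', '<div id="status">Error</div>'],
  ['<h1>Demo</h1>', '<div id="status">Error</div>', '<ul><li>New</li></ul>'],
];

// Collect every line that appears in a snapshot but not in the one before it,
// keeping each distinct line only once.
function collectAddedLines(snaps) {
  const seen = new Set();
  const added = [];
  for (let i = 1; i < snaps.length; i++) {
    const previous = new Set(snaps[i - 1]);
    for (const line of snaps[i]) {
      if (!previous.has(line) && !seen.has(line)) {
        seen.add(line);
        added.push(line);
      }
    }
  }
  return added;
}

const added = collectAddedLines(snapshots);
const last = snapshots[snapshots.length - 1];

console.log(added);
console.log(last.includes('<div id="status">Saved</div>'));
```

Here the 2nd trigger overwrites the text written by the 1st, so the last snapshot no longer contains it, while the lines collected from adjacent snapshots still do.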
**Aim**: Get all the new HTML content generated by each trigger (maximization), taking even the slightest trigger into account, and merge it.
**Implementation**: Following an iterative approach, a __diff__ of adjacent HTML files is taken using

    diff -u data0.html data1.html

which produces output in the same form as git diff. Iteratively collecting all
the '+' lines from the diff output gives the dynamic HTML content generated by
all the triggers, which is scattered across the different HTML files.
All such positive diffs are merged into one file, say dynamicDOMElements.html.
(The file mergeFiles.py does the work described.)
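As an illustration of collecting the '+' lines, the sketch below parses the text produced by diff -u. It is written in JavaScript for Node.js and is not the project's mergeFiles.py; the function name extractAddedLines() and the sample diff are assumptions made for this example, and it assumes the diff covers a single pair of files.

```javascript
// Return the lines that a unified diff marks as added, without the '+' marker.
// Hypothetical helper; assumes the diff text covers one pair of files.
function extractAddedLines(diffText) {
  const added = [];
  let inHunk = false;
  for (const line of diffText.split("\n")) {
    if (line.startsWith("@@")) {
      // Hunk header: the '---' / '+++' file headers are behind us now.
      inHunk = true;
    } else if (inHunk && line.startsWith("+")) {
      added.push(line.slice(1));
    }
  }
  return added;
}

// Made-up diff between a static page and the page after one trigger.
const sampleDiff = [
  "--- data0.html",
  "+++ data1.html",
  "@@ -1,3 +1,4 @@",
  " <body>",
  ' <button id="more">More</button>',
  '+<p class="extra">Loaded on click</p>',
  " </body>",
].join("\n");

console.log(extractAddedLines(sampleDiff));
```

Tracking the hunk header rather than filtering on a '+++' prefix keeps the file header out of the result without dropping added lines whose own content happens to start with '++'.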
Our objective is to take the union of the merged dynamicDOMElements file and
the original source code of the web page, and then pass the whole HTML content
to AChecker for validation. However, the dynamic content needs HTML headers
associated with it; otherwise AChecker reports header-related problems for the
dynamically obtained HTML even though the headers are fine in the original
source code. In a nutshell, AChecker must not report spurious problems about
HTML headers (the <!DOCTYPE> declaration and <html> attributes) while
validating the new content.
Now that the dynamicDOMElements file is created, instead of joining the page
source and the dynamic content as separate documents and manually adding
DOCTYPE and HTML headers, we perform a selective merge of this file's contents
with the main source code. Selective merge here means that all the dynamically
generated content is placed just above the closing </body></html> tags, which
preserves the original DOCTYPE and HTML headers of the web page. The resulting
file, mergedSourceContent.html (say), contains the source code of the original
web page selectively merged with the dynamically generated content, with the
headers preserved (avoiding the spurious header problems).
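The selective merge can be sketched as follows. The project does this step with bash commands, so the JavaScript below (for Node.js) is only an illustration; selectiveMerge() is a hypothetical helper, and falling back to appending when no closing body tag is found is an assumption of this sketch.

```javascript
// Insert the dynamic content just above the last closing </body> tag so the
// original DOCTYPE and <html> attributes stay untouched.
function selectiveMerge(sourceHtml, dynamicHtml) {
  const pattern = /<\/body\s*>/gi;
  let closeBody = -1;
  let match;
  // Remember the position of the last closing body tag.
  while ((match = pattern.exec(sourceHtml)) !== null) {
    closeBody = match.index;
  }
  if (closeBody === -1) {
    // Assumption of this sketch: with no </body>, append at the end.
    return sourceHtml + "\n" + dynamicHtml;
  }
  return (
    sourceHtml.slice(0, closeBody) +
    dynamicHtml + "\n" +
    sourceHtml.slice(closeBody)
  );
}

const staticSource = [
  "<!DOCTYPE html>",
  '<html lang="en">',
  "<body>",
  "<p>Static</p>",
  "</body>",
  "</html>",
].join("\n");

console.log(selectiveMerge(staticSource, "<p>Dynamic</p>"));
```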
Result:
After the above steps we have the mergedSourceContent.html file, which contains
the original source merged with the content generated by triggering the
DOM-manipulating elements. This file also carries the proper DOCTYPE headers
and html attributes (the same as those of the original web page), so AChecker
reports no ambiguous warnings or problems. With mergedSourceContent.html
generated, the task of integration breaks down into the following:
On getting the URL from the input form, if the URL contains no errors, an
execute.sh script is called, which runs the following sequence of steps:
- It calls the CasperJS script with the URL as a parameter and stores the dynamic content fetched from each trigger in the HTMLSourceFiles folder, with filenames data0.html, data1.html, data2.html and so on.
- The Python script mergeFiles.py is then called, and dynamicDOMElements.html is generated with all the dynamic content merged into one file.
- Next, a selective merge of this dynamic content and the static source code of the web page is done (using some bash commands). This generates the mergedSourceContent.html file, which contains the page source merged with the dynamic content.
- Replacing the $validate_content variable: the content to be validated is then read from the generated file, instead of taking the static source code directly from the web, and is loaded into the $validate_content variable.
- With the help of some switches, we fetch the contents of the mergedSourceContent.html file if the "Show Source" option is enabled in the options menu while validating the URL (i.e. the source code of the web page is to be shown along with the dynamic content).
The implementation above has been tested thoroughly on a sample site built for reference and debugging purposes, hosted here. The site is simple but contains the minimal features required for triggering: 4 DOM-manipulating events that generate new HTML content. The codebase has been made robust enough to handle such elements on other sites. Once complete, it was tested against several sites, for which it reports additional known, likely and potential problems according to the HTML content they generate. The results have been discussed and found satisfactory.
Figure: SampleHTML validation comparison.
Figure: Google.com validation comparison.
- Currently, the integrated dynamic-validating AChecker does not provide a separate option for choosing whether to validate dynamic content. Since dynamic validation takes considerably more time, and since the user should be told where exactly a problem reported by AChecker lies, the dynamic results should be presented in a separate section that differentiates them. This was planned as a to-do for the project and discussed at length, but it was not accomplished due to time constraints.
Also, some problems that were noticed recently are:
- Say a site under test has 10 DOM-manipulating elements, and one of them is an input type="submit" (i.e. a form). The codebase currently assumes that the site will not navigate to another web page and will instead report warnings about blank fields then and there. However, if it does navigate to another page, say on the 6th trigger, the remaining triggers will not run successfully, because the web page was assumed not to change (we do not refresh the page on every trigger => costly). Some content would therefore be missed.
- Solution: The verbose log of PhantomJS reports something like this for every navigation:
```
[phantom] Navigation requested: url=<some-url-here>, type=Other,
willNavigate=true, isMainFrame=true
```
The proposed solution is to place a check on every requested navigation: if
the URL to be navigated to is the same as the given input URL, the navigation
is allowed to proceed; otherwise it is blocked. This seems workable and can be
implemented.
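A minimal sketch of such a check is given below. shouldAllowNavigation() and normalizeUrl() are hypothetical helpers, and treating URLs that differ only in a fragment or trailing slash as the same page is an assumption of this sketch. CasperJS emits a navigation.requested event carrying the requested URL, so a handler for that event would be one possible place to call the check; how the navigation is then blocked is left open here.

```javascript
// Strip the fragment and any trailing slashes so that trivially different
// spellings of the same page compare equal (assumption of this sketch).
function normalizeUrl(url) {
  return url.split("#")[0].replace(/\/+$/, "");
}

// True when the requested navigation stays on the page given as input.
function shouldAllowNavigation(requestedUrl, inputUrl) {
  return normalizeUrl(requestedUrl) === normalizeUrl(inputUrl);
}

// Example URLs are placeholders.
console.log(shouldAllowNavigation("http://example.com/page#top", "http://example.com/page/"));
console.log(shouldAllowNavigation("http://example.com/other", "http://example.com/page"));
```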
For standalone work, a local GitHub repo is maintained.
Link-to-local-repo-used
Since the implementation discussed above requires reading, writing and
modifying files through the Apache-hosted server, it is necessary to give the
required permissions on the 'AChecker/checker' folder, i.e. to give the Apache
server ownership of the files (the 'apache' user on Fedora and similar systems,
'www-data' on Ubuntu and similar).
The following commands should do the work:
In Fedora:

    sudo chown -R apache:apache <folder-name>

In Ubuntu:

    sudo chown -R www-data:www-data <folder-name>

For any queries or just to get in touch:
Tejas Shah
Email ID: [email protected]
github : tejasshah93
IRC nick: jash4/carver404