\section{Empirical Evaluation} \label{Sec:evaluation}
To quantitatively assess the efficacy of our test generation approach, we conducted an empirical study in which we address the following research questions:
\input{objectsChar-table}
\begin{description}%[noitemsep]
%\item [RQ1] How effective is our \emph{function coverage maximization} technique?
\item [RQ1] How effective is \tool in generating test cases with high coverage?
\item [RQ2] How capable is \tool of generating test oracles that detect regression faults?
\item [RQ3] How effective is the state reduction technique in reducing the number of function states?
\item [RQ4] How does \tool compare to existing automated \javascript testing frameworks?
\end{description}
%in which we compare \tool with an existing \javascript testing tool \artemis
\tool and all our experimental data in this paper are available for download \cite{jseft-dl}.
\subsection{Objects}
Our study includes thirteen \javascript-based applications in total.
\tabref{objectsChar-table} presents each application's ID, name, lines of custom \javascript code (LOC, excluding \javascript libraries), and its resource.
The first five are web-based games. AjaxTabs is a \jquery plugin for creating tabs. NarrowDesign and JointLondon are websites. FractalViewer is a fractal tree zoom application. SimpleCart is a shopping cart library, WymEditor is a web-based HTML editor, Tudu\-List is a web-based task management application, and Tiny\-MCE is a \javascript-based WYSIWYG editor control. The applications range from 206 to 27K lines of \javascript code.
%which has been used in other studies \cite{artzi:icse11}.
The experimental objects are open-source and cover different application types. All the applications are interactive in nature and extensively use \javascript on the client-side. %Since we require automated access and modification of the source code (\ie for instrumentation), we were not able to use applications such as FaceBook, where automated access is forbidden. %Moreover, since \tool does not support server-side testing, applications which their computations are mostly performed on the server side do not benefit from our approach.
\subsection{Setup} \label{Sec:setup}
To address our research questions, we provide the URL of each experimental object to \tool, which then automatically generates test cases.
%It is believed that \cite{humble:2010} testers dedicate no more than 10 minutes to test execution. Therefore,
We give \tool 10 minutes in total for each application, of which five minutes are designated for the dynamic exploration step.
%We outline the setup and methodology used in our empirical study to address our research questions.
%\subsubsection{Function Coverage Maximization (RQ1)}
%To measure the effectiveness of the function coverage maximization technique, we provide the URL of each experimental object to the first component of \tool as depicted in \figref{approach-view}. We compare our state/event selection strategy with a random exploration method, in which the next state is chosen uniformly at random for the expansion.
%We limit the dynamic exploration time to five minutes \cite{humble:2010} for each technique and report the average results over five runs. We generate event sequences from the two state-flow graphs obtained from each method.
%
%\jscover \cite{jscover}, an open-source tool for measuring \javascript code coverage, is used to measure the statement coverage. We collect the traces of the executed statements after each event is triggered.
%Finally, we compare the statement coverage achieved by running the generated event sequences separately.
\headbf{Test Case Generation (RQ1)} \label{test-gen-setup}
To measure client-side code coverage, we use \jscover \cite{jscover}, an open-source \javascript code coverage tool. We report the average results over five runs to account for the non-deterministic behaviour that stems from crawling the application.
In addition, we assess each step of our approach separately as follows:
(1) we compare the statement coverage achieved by our function coverage maximization technique against a method that chooses the next state/event for expansion uniformly at random, and
(2) we evaluate the effectiveness of applying mutation techniques (\algref{oracleGenAlgo}) to reduce the number of generated assertions.
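For concreteness, the random baseline in (1) amounts to picking both the next state and the next event uniformly at random; a minimal \javascript sketch is shown below, where \texttt{candidateStates} and \texttt{enabledEvents} are hypothetical placeholders rather than part of \tool's implementation:
\begin{verbatim}
// Minimal sketch of the random exploration baseline.
// candidateStates and enabledEvents are hypothetical placeholders.
function pickRandom(items) {
  return items[Math.floor(Math.random() * items.length)];
}

function exploreRandomly(candidateStates, enabledEvents, budgetMs) {
  var deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    var state = pickRandom(candidateStates);      // uniform state choice
    var event = pickRandom(enabledEvents(state)); // uniform event choice
    // fire `event` on `state` and record newly covered statements ...
  }
}
\end{verbatim}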
% we provide the URL of each experimental object to the first component of \tool as depicted in \figref{approach-view}. We compare our state/event selection strategy with a random exploration method, in which the next state is chosen uniformly at random for the expansion.
%We generate event sequences from the two state-flow graphs obtained from each method.
%We collect the traces of the executed statements after each event is triggered.
%Finally, we compare the statement coverage achieved by running the generated event sequences separately.
%Of course, no unit test generation technique can test functions that are not directly accessible (\eg nested functions, anonymous functions).
%Therefore, in addition to measuring the statement coverage of the generated test suite, we define a unit testability metric, which measures the testability degree of individual functions of an application. We call a function $f$ \emph{testable} if it is possible to call $f$ directly from a test case --- regardless of whether the test case is written manually or generated automatically. The testability metric of a given web application $A$ is calculated as follows:
%
%\begin{equation}
%testability(A)=\frac{\sum _{i\in A(f_i)}^{n} {testable(f_i)}}{n},
%\label{testabilityFormula}
%\end{equation}
%\noindent
%where $n$ is the total number of functions, and $testable$ decides whether a function ($f_i$) is testable according to the definition. We then measure the percentage of functions that \tool can generate test cases for in $A$ as:
%
%\begin{equation}
%testGenRate(A)=\frac{\sum _{i\in A(f_i)}^{n} {tested(f_i)}}{\sum _{i\in A(f_i)}^{n} {testable(f_i)}},
%\label{pythiaTestabilityFormula}
%\end{equation}
%\noindent
%where the numerator is the total number of functions that are directly tested by \tool and the denominator is the total number of testable functions in $A$.
\headbf{Test Oracles (RQ2)} \label{test-oracle-setup}
%To generate test oracles at function-level, we configure \tool to inject 50 \javascript code-level faults in each application.
%To produce DOM event-level oracles, we configure the DOM mutation module of \tool to inject 20 DOM-level faults per application. \ali{why 50? why 20? why these numbers? why are they not the same? Motivate! Do we need to include this information?} %We then run \tool on each web application to obtain the test cases with oracles.
%
To evaluate the fault finding capability of \tool (RQ2), we simulate web application faults by automatically seeding each application with 50 random faults drawn from the following fault categories:
\begin{enumerate}%[noitemsep, nolistsep]
\item Changing conditional statements by modifying the upper/lower bounds of loop statements, changing the condition itself, as well as swapping consecutive conditional statements;
\item Modifying the values of global/local variables, and removing or changing their names, as well as modifying arithmetic operations;
\item Changing function parameters or function call arguments by swapping, removing,
or renaming parameters/arguments, and changing the sequence of function
calls within a given function where applicable;
\item Modifying DOM related properties.
\end{enumerate}
The first three categories target \javascript code, while the last one targets both the \javascript and HTML code levels.
We automatically pick a random program point and seed a fault at that point according to our fault categories.
While the mutations used for oracle generation are selected as discussed in \secref{oracleGen},
the mutations used for evaluation are generated at random from the entire application. If a mutation used for evaluation happens to coincide with a mutation used for generating oracles, we remove that mutant from the evaluation set.
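As an illustration of the first category, a seeded fault may tighten the upper bound of a loop; the function below is hypothetical and only serves to show the kind of mutation we inject:
\begin{verbatim}
// Original (hypothetical) function:
function computeTotal(items) {
  var total = 0;
  for (var i = 0; i < items.length; i++) {
    total += items[i].price;
  }
  return total;
}
// Seeded fault (category 1): the loop's upper bound is changed,
// silently skipping the last item:
//   for (var i = 0; i < items.length - 1; i++) { ... }
\end{verbatim}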
%in the \javascript code as well as the HTML code of each application.
%Note that we decided to manually perform fault seeding instead of using \mutandis, which automates \javascript mutation testing. The main reason is to mitigate bias since \mutandis is used by \tool to generate mutants automatically during the test oracle generation phase.
%One challenge with generating assertions is their stability, \ie the assertions may fail on the original version of the program. To filter unstable assertions, we run the test suite on the original program and discard any assertions that fail.
%
Next, we run the entire generated test suite (including both function-level and event-based test cases) on the faulty version of the application. A fault is considered detected if an assertion generated by \tool fails and our manual examination confirms that the failing assertion indeed detects the seeded fault.
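To make this detection criterion concrete, the sketch below shows a simplified assertion over the hypothetical \texttt{computeTotal} function from the earlier example; it is illustrative only and not the exact form of the test cases \tool emits:
\begin{verbatim}
// Hypothetical function-level check; names and values are
// illustrative.
var items = [{price: 2}, {price: 3}, {price: 5}];
var expected = 10;                // captured on the original program
var actual = computeTotal(items); // recomputed on the mutated program
if (actual !== expected) {
  throw new Error("assertion failed: expected " + expected +
                  ", got " + actual);
}
// On the category-1 mutant above, actual is 5, so the assertion
// fails and the seeded fault is counted as detected.
\end{verbatim}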
%\ali{how do we know we have detected the right fault?}
We measure the precision and recall as follows:
\begin{description}%[noitemsep, nolistsep]
\item[Precision] is the fraction of faults reported by the tool that correspond to actual (seeded) faults: $\frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}}$
\item[Recall] is the fraction of seeded faults that the tool detects: $\frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}$
\end{description}
where $\textit{TP}$ (true positives), $\textit{FP}$ (false positives), and $\textit{FN}$ (false negatives) respectively represent the number of faults that are correctly detected, falsely reported, and missed.
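As a purely hypothetical illustration of these measures (not an actual result): if the generated assertions flagged 45 of the 50 seeded faults and additionally produced 5 spurious failures, we would have $\mathit{TP}=45$, $\mathit{FP}=5$, and $\mathit{FN}=5$, yielding a precision of $\frac{45}{45+5}=0.9$ and a recall of $\frac{45}{45+5}=0.9$.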
\headbf{Function State Reduction (RQ3)} \label{reduction-setup}
To assess the efficacy of the function state reduction method (\algref{stateAbstractionAlgo}), we compare the statement coverage and fault finding capability before and after applying the state reduction technique. In the former case, every captured function state is converted into a test case; in the latter, the test cases are reduced by the state reduction technique. To measure fault finding capability, the same set of 50 mutations is used in both cases, following the fault seeding procedure described for RQ2. Since the function state reduction method is only used for generating function-level test cases, we run only the function-level test suite on the mutated versions for this research question.
\headbf{Comparison (RQ4)} \label{comparison-setup}
To assess how \tool performs with respect to existing \javascript test automation tools, we compare its coverage and fault finding capability to that of \artemis \cite{artzi:icse11}.
As with \tool, we give \artemis 10 minutes in total for each application; we observed no improvement in the results when running \artemis for longer periods of time.
We run \artemis from the command line by setting the iteration option to 100 and enabling the coverage priority strategy, as described in \cite{artzi:icse11}. %\karthik{Earlier you said 5 minutes}.
Similarly, \jscover is used to measure the coverage of \artemis (averaged over five runs).
We use the output provided by \artemis to determine whether the seeded mutations are detected by the tool, following the same procedure described above for \tool.
%We measure the coverage using the HTML output of \artemis, which shows the covered \javascript code. %As mentioned before, we measure the coverage of \tool using \blanket.
\input{results}
%\input{threatsToValidity}
%\input{discussion}