Comparative Analysis of Text Mining and Clustering Techniques for Assessing Functional Dependency between Manual Test Cases
This appendix provides the supplementary materials for the article "Performance comparison of Different Text Mining and Clustering Techniques for Functional Dependency".
Figures 1 and 2 illustrate the results of using 7 different string distance algorithms for text mining, with the Agglomerative algorithm used for clustering in Figure 1 and the Affinity algorithm in Figure 2. The Agglomerative clustering algorithm yielded a total of 5 clusters, as shown in Figures 1a, 1b, 1c and 1d.
Figures 3 and 4 present the results of combining several normalized compression distance algorithms for text mining with the DBSCAN and HDBSCAN clustering algorithms. As emphasized before, the HDBSCAN algorithm can group the non-clusterable data points into a separate cluster, which can be interpreted as independent test cases in this study. In general, HDBSCAN produces more clusters than any of the other clustering algorithms used. As Figures 3a, 3b, 3c, 4a and 4b show, more than 200 clusters are generated, where each color represents a unique cluster. However, combining the same text mining method with DBSCAN places all test cases in a single cluster, as shown in Figure 3d.
Figure 5 visualizes the results of two machine learning approaches: Figure 5a shows the combination of Doc2Vec with Agglomerative, and Figure 5b shows the combination of SBERT with Affinity.
Figure 1 - (a) Overlap coefficient with Agglomerative. (b) Ratcliff-Obershelp with Agglomerative. (c) Jaro with Agglomerative. (d) Levenshtein with Agglomerative.
Figure 2 - (a) Jaccard with Affinity. (b) Sorensen–Dice coefficient with Affinity. (c)
Figure 3 - (a) bzip with HDBSCAN. (b) Deflate with HDBSCAN. (c) gzip with HDBSCAN. (d) XZ with DBSCAN.
Figure 4 - (a) Zlib with HDBSCAN. (b) Zstd with HDBSCAN.
Figure 5 - (a) Doc2Vec with Agglomerative. (b) SBERT with Affinity.
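To make the normalized compression distances behind Figures 3 and 4 more concrete, the following is a minimal sketch and not the implementation used in the study. It assumes the commonly used definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. It uses Python's standard `zlib` module as a stand-in for the compressors listed in the captions. The test case texts, the function names and the averaging step that makes the matrix symmetric are all illustrative assumptions.

```python
import zlib


def compressed_size(text: str) -> int:
    """Return the size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(texts):
    """Build a symmetric pairwise NCD matrix with a zero diagonal.

    NCD is not exactly symmetric for real compressors, so the two
    orderings are averaged here (an assumption of this sketch).
    """
    n = len(texts)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (ncd(texts[i], texts[j]) + ncd(texts[j], texts[i]))
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical manual test case descriptions (not taken from the study).
test_cases = [
    "Set the brake pedal to fifty percent and verify that the brake "
    "pressure signal rises within two seconds.",
    "Set the brake pedal to eighty percent and verify that the brake "
    "pressure signal rises within two seconds.",
    "Open the driver door while the train is stationary and check that "
    "the interior lights switch on.",
]

if __name__ == "__main__":
    for row in distance_matrix(test_cases):
        print(["%.3f" % value for value in row])
```

A matrix of this form could then be passed to any clustering algorithm that accepts precomputed distances. In this toy input, the two near-identical brake test cases end up much closer to each other than either is to the door test case.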
The matrix of Mantel correlations between the employed text mining algorithms is presented in Figure 6.
Figure 6 - Matrix of Mantel correlations between the distances that the employed text mining algorithms (both tokenized and non-tokenized versions) assign to all pairs of 784 source points. The rows and columns of the matrix represent each of the 28 text mining algorithms. The color of each cell corresponds to the magnitude of the Mantel $r_M$ correlation between the distances of the two algorithms indicated by the intersection of the row and column.
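As an illustration of what each cell in Figure 6 measures, the Mantel statistic $r_M$ is commonly computed as the Pearson correlation between the corresponding off-diagonal entries of two distance matrices. The sketch below assumes that definition and uses small made-up matrices; it is not the code used in the study, and the function names are hypothetical. For 784 points, each matrix contributes 784 × 783 / 2 = 306,936 pairwise distances to the correlation.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mantel_r(dist_a, dist_b):
    """Mantel r_M between two symmetric distance matrices of equal size."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Toy 4 x 4 distance matrices standing in for two text mining algorithms.
dist_a = [
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0],
]
# A linear rescaling of dist_a: the ranking of pairs is unchanged.
dist_b = [
    [0 if i == j else 2 * dist_a[i][j] + 1 for j in range(4)]
    for i in range(4)
]

if __name__ == "__main__":
    print("r_M =", mantel_r(dist_a, dist_b))
```

Because `dist_b` is a linear rescaling of `dist_a`, the two toy matrices are perfectly correlated. Two algorithms that rank test case pairs differently would give a value closer to zero.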