Supplemental Materials for "Duet: Helping Data Analysis Novices Conduct Pairwise Comparisons by Minimal Specification"
Links to Videos Introducing Duet’s User Interface
Explanations for the “Logistic Regression” Folder
Explanations for the “User Study” Folder
Clarification of Literature Review
Tutorial Video Used in the User Study: Link
Analyzing US College Scorecard Data Using Duet: Link
Duet’s prototype: Link
README:
1. Use Google Chrome for a better experience.
2. Some datasets are provided in the “Datasets” folder for trying out the system.
In the following, we explain the materials in the “Logistic Regression” folder. The “Logistic Regression” folder provides details for the model in Sec. 5.3.2 (Multinomial Logistic Regression for Classification) of the paper.
The “520 Distribution Pairs” folder contains the 520 distribution pairs we collected from the 83 R datasets. Each row in a csv file is a data point. There are three important columns in each csv file: “newGroupName”, “newAttributeName”, and “attributeValue”. “newGroupName” is the name of the group to which a data point belongs, “newAttributeName” is the name of an attribute, and “attributeValue” is the value that the data point has for that attribute.
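As a minimal sketch of how one distribution-pair file could be read, the following Python groups the attribute values by group name. The three column names come from the description above; the function name `load_distribution_pair`, the sample rows, and the assumption that every “attributeValue” is numeric are illustrative only and are not taken from the supplemental files.

```python
import csv
import io
from collections import defaultdict


def load_distribution_pair(csv_file):
    """Group the attribute values of one distribution-pair file by group name."""
    groups = defaultdict(list)
    attribute_names = set()
    for row in csv.DictReader(csv_file):
        # Assumption: attribute values are numeric and parse as floats.
        groups[row["newGroupName"]].append(float(row["attributeValue"]))
        attribute_names.add(row["newAttributeName"])
    return dict(groups), attribute_names


# Made-up rows in the documented column layout (not real data from the folder).
sample = io.StringIO(
    "newGroupName,newAttributeName,attributeValue\n"
    "A,weight,1.5\n"
    "A,weight,2.0\n"
    "B,weight,3.25\n"
)
groups, attribute_names = load_distribution_pair(sample)
print(groups)
print(attribute_names)
```

With a real file, `open(path, newline="")` would take the place of the in-memory sample.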
We used SPSS to model the data. “Bh Coefficient + Labels.csv” is the input to SPSS for modelling. The “fileName” column is the file name of a distribution pair inside the “520 Distribution Pairs” folder, “BhCoefficicent” is the Bhattacharyya coefficient for the distribution pair, and “class” is the label of the distribution pair that we collected from people.
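For readers unfamiliar with the Bhattacharyya coefficient, the sketch below shows one common way to estimate it from two samples: bin both samples on a shared grid, normalize the counts to probabilities, and sum the square roots of the bin-wise products. The equal-width binning and the default of 10 bins are assumptions made for illustration; the exact procedure used to produce the values in the csv file may differ.

```python
import math


def bhattacharyya_coefficient(xs, ys, bins=10):
    """Histogram-based estimate of the Bhattacharyya coefficient of two samples.

    Assumption: equal-width bins over the combined range of both samples.
    """
    lo = min(min(xs), min(ys))
    hi = max(max(xs), max(ys))
    if hi == lo:
        return 1.0  # every value is identical, so the histograms coincide
    width = (hi - lo) / bins

    def histogram(values):
        counts = [0] * bins
        for v in values:
            # Clamp so the maximum value falls into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    p, q = histogram(xs), histogram(ys)
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


same = bhattacharyya_coefficient([1, 2, 3, 4], [1, 2, 3, 4])
disjoint = bhattacharyya_coefficient([0, 1], [10, 11])
print(same, disjoint)
```

The coefficient ranges from 0 (no overlap between the two histograms) to 1 (identical histograms), which is why it can serve as a single predictor of how similar two distributions look.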
This is the code for the interface we used to ask 10 subjects to relabel 150 marginal cases. You need Python 3 and Flask to run the interface. To run the code, open a console (for example, Terminal on a Mac), go to the directory that contains the code, and enter “python server.py”. For those who have difficulties running the tool, we provide screenshots of the labelling tool as follows:
Each csv file in the folder contains two columns: “filename”, which is the file name of a distribution pair in the “520 Distribution Pairs” folder, and “class”, which is the label provided by a subject.
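As a rough illustration of how the per-subject files could be inspected together, the following sketch counts how often each label was given to each distribution pair. The column names “filename” and “class” come from the description above; the function name `tally_labels`, the sample file names, and the label values are hypothetical, and the sketch makes no claim about how the labels were actually combined for the paper.

```python
import csv
import io
from collections import Counter, defaultdict


def tally_labels(subject_files):
    """Count, for each distribution pair, how many subjects gave each label."""
    tallies = defaultdict(Counter)
    for f in subject_files:
        for row in csv.DictReader(f):
            tallies[row["filename"]][row["class"]] += 1
    return tallies


# Two made-up subject files in the documented two-column layout.
subject_1 = io.StringIO("filename,class\npair_a.csv,1\npair_b.csv,2\n")
subject_2 = io.StringIO("filename,class\npair_a.csv,1\npair_b.csv,3\n")
tallies = tally_labels([subject_1, subject_2])
print(dict(tallies))
```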
This is a screenshot of the output generated by SPSS. “Bh Coefficient + Labels.csv” is used as the input to SPSS. The following explains how the model in Sec. 5.3.2 corresponds to the SPSS output. Formally, our logistic regression model is
This text file contains the R code for computing the accuracy of our logistic regression model using 10-fold cross-validation. The input file is “Bh Coefficient + Labels.csv”. The cross-validation accuracy is around 78.1%. We expect that this accuracy can be improved by using more advanced machine learning models and more predictor variables.
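For readers who do not use R, the sketch below outlines the general k-fold cross-validation procedure in Python. It is not a port of the R code in the text file. To stay self-contained it uses a nearest-class-mean classifier as a plainly labeled stand-in for the multinomial logistic regression, and the predictor values and class names are made up. Averaging the per-fold accuracies is one common convention; the R code may aggregate differently.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k folds of near-equal size."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_val_accuracy(xs, labels, fit, predict, k=10, seed=0):
    """Mean held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, xs[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Stand-in classifier (NOT logistic regression): predict the class whose
# training mean is closest to the single predictor value.
def fit_nearest_mean(xs, labels):
    sums, counts = {}, {}
    for x, y in zip(xs, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(model[y] - x))


# Made-up, well-separated predictor values with hypothetical class names.
xs = [i / 100 for i in range(20)] + [0.8 + i / 100 for i in range(20)]
labels = ["different"] * 20 + ["similar"] * 20
accuracy = cross_val_accuracy(xs, labels, fit_nearest_mean, predict_nearest_mean)
print(accuracy)
```

Any classifier that exposes a `fit` and a `predict` function with these shapes can be passed in instead of the stand-in.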
The “User Study” folder contains all the materials for the qualitative user study in Sec. 6 (Evaluation) of the paper. The materials inside are described as follows:
This folder contains the car dataset “cars.csv” we used for the training session, a link to the tutorial video we showed to the participants, and the training tasks we used to familiarize participants with Duet’s interface.
During each analysis session, we first showed the participants “Task Description.pdf”. We then gave them some time to review either “Description for College Dataset.pdf” or “Description for City Dataset.pdf” to get them familiar with the dataset they were about to analyze. The “Datasets” folder contains the city dataset and the college dataset we used for the analysis session.
At the end of the study, we first showed the participants “Three Main Features of the Tool.pdf” to ensure that they knew the terminology, such as “minimal specification”, that we were going to use in the interview. This folder contains the questions for the semi-structured interview (“Interview Questions.pdf”) and the survey questions (“Survey Questions.pdf”).
This is a summary of the survey results.
We drew inspiration from the literature to develop the idea of minimal specification. As described in the paper, there are two high-level considerations in designing minimal specification:
- To address execution barriers, minimal specification allows users to focus on what they know (the objects of interest in answering a pairwise comparison question) rather than what they might not know (system operations).
- To address interpretation barriers, the recommendations offered should be explained so that users understand them better and trust them more.
The following two sections describe the basis of these two components.
Addressing Execution Barrier
This idea of allowing users to focus on what they know by shielding them from what they might not know is grounded in the following three ideas that have been explored by the HCI community:
Addressing Interpretation Barrier
Explaining the recommendations helps users understand why they are recommended and inspires users’ trust in the system. This idea is grounded in the explainable artificial intelligence (XAI) movement.