This repo holds the projects I worked on for my M.S. in Statistics at Stanford.
Class: CS 224N - Natural Language Processing with Deep Learning
Code: https://github.com/qqlabs/cs224n-project
Question Answering (QA), the task of asking a model to answer a question correctly given a passage, is one of the most promising areas in NLP. However, state-of-the-art QA models tend to overfit to their training data and do not generalize well to new domains, requiring additional training on domain-specific datasets to adapt. In this project, we aim to design a QA system that is robust to domain shifts and performs well on out-of-domain (OOD) few-shot data.
We implement a variety of techniques that boost the robustness of a QA model trained with domain adversarial learning and evaluated on out-of-domain data, yielding a 16% increase in F1 score on the development set and a 10% increase on the test set. We find that the following innovations boost model performance: 1) finetuning the model on augmented out-of-domain data, 2) aggregating Wikipedia-style datasets during adversarial training to simplify the domain discriminator’s task, and 3) supplementing the training data with synthetic QA pairs generated with roundtrip consistency. We also ensemble the best-performing models on each dataset and find that ensembling yields further performance gains.
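The core robustness mechanism here is domain adversarial training. Below is a minimal PyTorch sketch of the gradient-reversal idea behind it; the module names, layer sizes, and the adversarial weight `lam` are illustrative assumptions, not the project’s actual code.

```python
# Minimal sketch of domain adversarial training with a gradient reversal
# layer (DANN-style). Names, sizes, and weights are illustrative only.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    """Predicts which source domain a hidden representation came from."""
    def __init__(self, hidden_dim: int, num_domains: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_domains),
        )

    def forward(self, hidden, lam=1.0):
        # Reversing the gradient pushes the QA encoder toward domain-invariant features.
        return self.net(GradReverse.apply(hidden, lam))

# Hypothetical use inside a training step:
# dom_logits = discriminator(cls_hidden, lam=0.1)
# loss = qa_span_loss + nn.functional.cross_entropy(dom_logits, domain_labels)
# loss.backward()
```

In terms of this sketch, aggregating Wikipedia-style datasets (point 2 above) corresponds to collapsing several domain labels into one, shrinking `num_domains` and simplifying the discriminator’s task.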
Class: STATS 207 - Time Series Analysis
Code (in Python): Google Colab
We sought to build a 3-day-ahead forecast of the air quality index (AQI) for Santa Clara County. We approached the problem with increasingly complex models (ARMA, VARIMA, LSTM) and evaluated the gains in performance with a sliding-window cross-validation strategy. We specifically included AQI and meteorological features from surrounding counties and found that these relevant features improved performance.
Our presentation slides can be found here.
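As an illustration of the evaluation strategy, here is a Python sketch of sliding-window cross-validation wrapped around a simple ARIMA baseline; the window length, step size, ARIMA order, and column name are placeholders, not the settings we actually used.

```python
# Sketch of sliding-window cross-validation for a 3-day-ahead AQI forecast.
# Window, step, ARIMA order, and column names are placeholders.
import pandas as pd
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.arima.model import ARIMA

def sliding_window_cv(aqi: pd.Series, window: int = 365, horizon: int = 3, step: int = 7) -> float:
    """Refit on each training window and score the 3-day-ahead forecast."""
    errors = []
    for start in range(0, len(aqi) - window - horizon, step):
        train = aqi.iloc[start:start + window]
        test = aqi.iloc[start + window:start + window + horizon]
        fit = ARIMA(train, order=(2, 0, 1)).fit()
        forecast = fit.forecast(steps=horizon)
        errors.append(mean_absolute_error(test.values, forecast.values))
    return sum(errors) / len(errors)

# mean_mae = sliding_window_cv(df["santa_clara_aqi"])
```

The same loop can be reused to compare ARMA, VARIMA, and LSTM variants on identical train/test splits, which is what keeps the performance comparison fair.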
Impact of COVID Misinformation on Vaccination - A Reanalysis Identifying and Addressing Covariate Imbalances
Class: STATS 209 - Causal Inference
Code (in R): Google Colab
We reanalyzed a randomized controlled trial (Original Paper, Original Github) that exposed participants to COVID misinformation and measured its impact on vaccination intent. We evaluated the study’s randomization and showed that it is significantly imbalanced (p-value < 0.0001) using a Monte Carlo simulation of the Mahalanobis distance between Treatment and Control. We then reduced the bias of the estimates by applying matching estimators and performing regression adjustment with Lin’s estimator. We also explored heterogeneous treatment effects and offer some intuitive insights.
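For illustration, here is a Python sketch of the Monte Carlo randomization check described above (the project’s actual code is in R); the covariate matrix, treatment vector, and simulation count are placeholders.

```python
# Python sketch of the Mahalanobis-distance balance check via Monte Carlo
# re-randomization. Variable names and the simulation count are placeholders.
import numpy as np

def mahalanobis_balance(X: np.ndarray, treat: np.ndarray) -> float:
    """Mahalanobis distance between treatment and control covariate means."""
    diff = X[treat == 1].mean(axis=0) - X[treat == 0].mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    return float(diff @ cov_inv @ diff)

def balance_pvalue(X: np.ndarray, treat: np.ndarray, n_sim: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo p-value: re-randomize labels and compare distances to the observed one."""
    rng = np.random.default_rng(seed)
    observed = mahalanobis_balance(X, treat)
    sims = np.array([mahalanobis_balance(X, rng.permutation(treat)) for _ in range(n_sim)])
    return float((sims >= observed).mean())

# p = balance_pvalue(covariates.to_numpy(), df["treated"].to_numpy())
```

A small p-value here means the observed covariate imbalance is larger than almost all imbalances produced by genuinely random assignment, which is what motivated the matching and regression adjustments.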
Class: STATS 263 - Design of Experiments
Code (in R): Google Colab
We were interested in understanding which factors can improve an average person’s skill in shooting games and whether investing in better equipment (a gaming mouse, a high-refresh-rate monitor) really has a significant impact on shooting skill. In addition, we wanted to see whether a stimulant like coffee could further enhance a player’s performance.
We structured our experiment as a combination of a strip-plot and stepped-wedge design. There were many logistical details that we considered during the design, ranging from how to parallelize the runs to whether we should serve hot or cold coffee.
This project focused on the design of the experiment and the data collection process. A proper experimental design drastically simplified what we needed to analyze.
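For a sense of what the stepped-wedge component of the design looks like, here is an illustrative Python sketch of a crossover schedule; the number of players, periods, and crossover pattern are invented for illustration, and the strip-plot factor is omitted.

```python
# Illustrative stepped-wedge schedule: each player switches from control (0)
# to treatment (1) at a staggered period. Dimensions are made up.
import numpy as np
import pandas as pd

def stepped_wedge(n_players: int = 6, n_periods: int = 7) -> pd.DataFrame:
    """Build a players x periods matrix of treatment indicators."""
    schedule = np.zeros((n_players, n_periods), dtype=int)
    for i in range(n_players):
        schedule[i, i + 1:] = 1  # player i crosses over after period i
    return pd.DataFrame(
        schedule,
        index=[f"player_{i + 1}" for i in range(n_players)],
        columns=[f"period_{t + 1}" for t in range(n_periods)],
    )

print(stepped_wedge())
```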
Class: STATS 202 - Data Mining and Analysis
Code (in R): Google Colab
We were tasked with making a 9-day forecast at 5-second granularity for 9 anonymized stock tickers. After exploring baseline and ARIMA models, we ultimately structured the problem as a direct forecasting problem and used the forecast-ml package to train models and make predictions. Note that this class did not actually cover time series analysis, so we had to convert the problem into a structure we were familiar with.
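As an illustration of the direct forecasting setup, here is a Python sketch that builds lagged features and fits one model per forecast horizon (the project’s actual code is in R and uses forecast-ml); the lag count, horizons, and choice of regressor are assumptions, not what we used.

```python
# Sketch of direct multi-step forecasting: lagged features plus one regressor
# per horizon. Lag count, horizons, and model choice are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def make_lagged_features(series: pd.Series, n_lags: int = 12) -> pd.DataFrame:
    """Design matrix of the previous n_lags observations."""
    return pd.concat({f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)}, axis=1)

def fit_direct_models(series: pd.Series, horizons=range(1, 4), n_lags: int = 12) -> dict:
    """Fit a separate regressor for each step-ahead horizon (the 'direct' strategy)."""
    X = make_lagged_features(series, n_lags)
    models = {}
    for h in horizons:
        y = series.shift(-h)                      # target h steps ahead
        mask = X.notna().all(axis=1) & y.notna()  # drop rows lost to shifting
        models[h] = GradientBoostingRegressor().fit(X[mask], y[mask])
    return models

# models = fit_direct_models(prices["ticker_1"])
```

Unlike a recursive forecaster, each horizon gets its own model here, so errors from one step are not fed back into the next.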