This project uses unsupervised learning techniques to explore a psychological survey of young adults. The survey has many sections, covering the attitudes, behaviors, and beliefs surrounding adulthood.
Hypotheses
Technology
Data
Preprocessing
Execution
Conclusion
- The features (survey questions) will reduce to an interpretable set of topics.
- Holding out the pre-defined "Subjective Well-being" section, the data will still cluster meaningfully around that topic.
- The clusters will contain different demographic distributions.
This project leveraged the Python Data Science stack:
- scikit-learn
- Pandas
- NumPy
- Matplotlib
- Jupyter
The data comes from the Open Science Framework, a free and open platform to support research and enable collaboration.
The EAMMi2 is a large-scale collaborative project with 32 primary contributors. The initial data collection ended in December 2016. 90% of the entries come from the US, with the remaining 10% coming from England, Greece, and Grenada.
Grahe, J. E., Faas, C., Chalk, H. M., Skulborstad, H. M., Barlett, C., Peer, J. W., … Reifman, A. (2019, February 21). Emerging Adulthood Measured at Multiple Institutions 2: The Next Generation (EAMMi2). https://doi.org/10.17605/OSF.IO/TE54B
This was a survey given primarily to young adults ages 18-25 regarding their attitudes, behaviors, and beliefs related to Emerging Adulthood. On average it took about 30 minutes to complete, and contained around 200 questions spanning categories such as:
- Markers of Adulthood
- Idea
- Subjective Well-being
- Mindful
- Belonging
- Efficacy
- Support
- Transgressions
- Stress
- Marriage
- Narcissism
Most of the answers were ordinal (e.g. "on a scale of 1-7, how strongly do you agree with this statement?").
Example of Subjective Well-being questions:
The collaborators cleaned the data before releasing it, dropping observations that met the following conditions:
- < 10 minutes to complete
- < 80% completed
- Missed the "attention" prompts
- High-bias responders
My preprocessing can be found in main/EAMMI_1_processing.ipynb and included:
- Renaming columns for readability
- Dropping open-ended questions
- Remapping answers to retain ordinality
- Filling missing values with the median
- Binning sparse categories
- Creating target variables for use with supervised learning
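A few of the steps above (remapping, median fill, binning) can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual code: the column names, the agreement scale, and the `min_count` threshold are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy responses; column names and answer labels are illustrative only.
raw = pd.DataFrame({
    "swb_1": ["Agree", "Strongly agree", None, "Disagree", "Agree"],
    "stress_1": [3, np.nan, 5, 1, 3],
    "school": ["A", "A", "B", "C", "A"],
})

# Remap text answers to integers so that their order is preserved.
agreement_scale = {
    "Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
    "Agree": 4, "Strongly agree": 5,
}
df = raw.copy()
df["swb_1"] = df["swb_1"].map(agreement_scale)

# Fill missing ordinal answers with each column's median.
ordinal_cols = ["swb_1", "stress_1"]
df[ordinal_cols] = df[ordinal_cols].fillna(df[ordinal_cols].median())

# Bin sparse categories: levels seen fewer than `min_count` times become "Other".
min_count = 2  # assumed threshold
level_counts = df["school"].value_counts()
sparse_levels = level_counts[level_counts < min_count].index
df["school"] = df["school"].where(~df["school"].isin(sparse_levels), "Other")
```

The median is used here because it is always a value that respects the ordering of an ordinal scale and is less sensitive to skew than the mean.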
Here is a quick snapshot of the demographics of the cleaned dataset:
The steps below can be found in main/EAMMI_2_final.ipynb, which includes docstrings and explanatory comments.
Note: The Subjective Well-being section, as well as demographics, have been held out so that the resulting clusters can be examined with regard to these attributes.
Using Non-negative Matrix Factorization for topic extraction.
I found that reducing the features (survey questions) down to seven topics maintained interpretability. Below are the seven topics (the labels are my interpretation) along with a few survey questions associated with each topic. A more comprehensive list of questions can be found in the notebook.
- Self-worth / Confidence
  - I can solve most problems if I invest the necessary effort.
  - I can remain calm when facing difficulties because I can rely on my coping abilities.
  - I make independent decisions.
- Mindfulness
  - It seems I am running on automatic, without much awareness of what I’m doing.
  - I break or spill things because of carelessness or not paying attention.
  - I tend not to notice feelings of physical tension until they really grab my attention.
- Achievement
  - I am capable of supporting a family financially.
  - I am no longer living in parents' household.
  - I am settled into a long-term career.
- Family
  - Marriage is an important aspect of adulthood.
  - Being capable of caring for children is an important aspect of adulthood.
  - Being capable of supporting parents financially is an important aspect of adulthood.
- Support
  - I get the emotional help and support I need from my family.
  - There is a special person in my life who cares about my feelings.
  - I can count on my friends when things go wrong.
- Self-control / Responsibility
  - I avoid becoming drunk.
  - I accept responsibility for my actions.
  - I use contraception if sexually active and not trying to conceive a child.
- Neuroticism
  - My feelings are easily hurt when I feel that others do not accept me.
  - I feel that I am unable to control the important things in my life.
  - Is this period of your life a time of feeling stressed out?
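For readers unfamiliar with the technique, here is a minimal sketch of topic extraction with scikit-learn's `NMF`. The random matrix is a hypothetical stand-in for the survey responses, and the `init`, `max_iter`, and `top_k` settings are assumptions; the notebook's actual parameters may differ.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical stand-in for the survey: 200 respondents x 30 ordinal items (1-7).
# NMF requires non-negative input, which ordinal answers satisfy.
X = rng.integers(1, 8, size=(200, 30)).astype(float)

# Factor X into W (respondent-by-topic) and H (topic-by-question).
nmf = NMF(n_components=7, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

# For each topic, list the indices of the questions with the largest loadings.
top_k = 3
top_questions = np.argsort(H, axis=1)[:, ::-1][:, :top_k]
for topic, question_ids in enumerate(top_questions):
    print(f"Topic {topic}: questions {question_ids.tolist()}")
```

Reading off the highest-loading questions per row of `H` is what makes each topic interpretable, and `W` is the reduced feature set that can be passed on to clustering.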
Using Hierarchical Clustering on the reduced feature set
Using Ward's linkage method, which merges clusters so as to minimize the increase in within-cluster variance
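A minimal sketch of this step, using scikit-learn's `AgglomerativeClustering` with Ward linkage. The synthetic topic weights and the choice of five clusters (taken from the cluster numbers mentioned in the results) are assumptions for illustration, not the notebook's actual data or code.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Hypothetical reduced features: five well-separated groups of 40 respondents
# in a 7-dimensional topic space (one row per respondent).
centers = 20 * np.eye(5, 7)
W = np.vstack([center + rng.normal(size=(40, 7)) for center in centers])

# Ward linkage merges, at each step, the pair of clusters whose union
# gives the smallest increase in total within-cluster variance.
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = model.fit_predict(W)

print(np.bincount(labels))  # number of respondents in each cluster
```

Ward linkage tends to produce compact clusters of comparable size, which makes the later cluster-by-cluster comparisons easier to read.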
Using a chi-squared test to check whether the clusters differ with regard to the held-out "Subjective Well-being" questions.
The young adults were binned by their cumulative SWB scores (low, neutral, high), and the distribution of each cluster was tested against that of every other cluster. The p-value, shown in the yellow triangle, is the probability of observing distributions at least this different if the two clusters actually shared the same SWB distribution (the null hypothesis).
Therefore, lower p-values support the claim that two clusters differ with regard to Subjective Well-being.
As shown by the p-values regarding Subjective Well-being, Clusters 2 and 3 are similar, as are Clusters 4 and 5.
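The pairwise testing described above can be sketched with `scipy.stats.chi2_contingency`. SciPy is not listed in the stack above (it is installed as a dependency of scikit-learn), so treat its use here as an assumption; the counts are invented for illustration.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts of respondents per cluster and SWB bin.
counts = pd.DataFrame(
    {"low": [40, 5, 5], "neutral": [15, 15, 15], "high": [5, 40, 40]},
    index=pd.Index([1, 2, 3], name="cluster"),
)

# Test every pair of clusters. The null hypothesis is that both clusters
# share the same distribution over the SWB bins.
p_values = {}
for a, b in combinations(counts.index, 2):
    chi2, p, dof, expected = chi2_contingency(counts.loc[[a, b]])
    p_values[(int(a), int(b))] = p

for pair, p in p_values.items():
    print(pair, f"p = {p:.3g}")
```

In this toy table, clusters 2 and 3 have identical counts, so their p-value is high, while cluster 1 differs sharply from both and yields a very low p-value. Because many pairs are tested, a multiple-comparison correction would be a sensible addition.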
Note: The following are quick visual snapshots of some demographic distributions. No statistical tests have been run to confirm the differences.
Cluster 1 contains a larger share of respondents in the Other gender category.
Clusters 2, 3, and 4 show different age distributions.
Cluster 3 shows a different education distribution.
- The features reduced to an interpretable set of topics.
- The data clustered meaningfully around the held-out "Subjective Well-being" questions.
- At first glance, the clusters appear to contain different demographic distributions, though this has not yet been confirmed with statistical tests.
Next steps:
- Dig into the differences between the clusters and confirm them with statistical tests
- Include open-ended text sections
  - NLP / Sentiment Analysis
- Include the "duration" feature, which measures the time spent answering each section
- Use supervised learning to predict classification labels
  - e.g. Subjective Well-being - low, neutral, high
- Measure relationship between family/upbringing and belonging/support