Last night I had the pleasure of presenting a talk for the PyData Boston Meetup. I presented a project I started earlier this summer, using data from the General Social Survey to measure and predict trends in religious affiliation and belief in the U.S.
The slides, which include the results so far and an overview of the methodology, are here:
And the code and data are all in this Jupyter notebook. I'll post additional results and discussion over the next few weeks.
Thanks to Milos Miljkovic, organizer of the PyData Boston Meetup, for inviting me, and to O'Reilly Media for hosting the meeting.
Wednesday, June 14, 2017
Thursday, June 1, 2017
Spring 2017 Data Science reports
In my Data Science class this semester, students worked on a series of reports where they explore a freely-available dataset, use data to answer questions, and present their findings. After each batch of reports, I will publish the abstracts here; you can follow the links below to see what they found.
One topic that enters popular discussion every four years is "who votes?" Every presidential election we see many discussions on which groups are more likely to vote, and which important voter groups each candidate needs to capture. One theme that is often part of this discussion is whether or not a candidate's biggest support is among groups likely to turn out. This analysis of the General Social Survey uses a number of different demographic variables to try and answer that question. Report
Using a dataset published by Medium on Kaggle, I explored the relationship between an employee's working conditions and the likelihood that they will quit their job. There were some expected trends, like lower salary leading to a higher attrition rate, but also some surprising ones, like having an accident at work leading to a lower likelihood of quitting! This observed information can be used by employers to determine the quitting probability of a specific individual, or to calculate the attrition rate of a larger group, like a department, and adjust their conditions accordingly.
Report
Politics has often been a polarizing subject amongst Americans, and in today's increasingly partisan political environment, that has not changed. Using data from the General Social Survey (GSS), an annual study designed and conducted by the National Opinion Research Center (NORC) at the University of Chicago, we identify variables that are correlated with a person's political views. We find that while marital status has a statistically significant apparent effect on political views, that apparent effect is drastically reduced when including confounding variables, particularly religion. Report
In the 1990s, the USDA put out the image of a Food Guide Pyramid to help direct dietary choices. It grouped foods into six categories: grains, proteins (meats, fish, eggs, etc), vegetables, fruits, dairy, and fats and oils. Since then, the pyramid has been revamped in 2005, and then pushed towards a plate with five categories (oils were dropped) in the 2010s. The general population has learned of these basic food groups since grade school, and over time either fully adopts them into their lifestyles, or abandons them to pursue their own balanced diet. In light of the controversy surrounding the Food Pyramid, we decided to ask whether the food categories found in the Food Pyramid truly represent the correct groupings for food, and if not, just how far off are they? Using K-Means clustering on an extensive food databank, we created 6 groupings of food based on their macronutrient composition, which was the primary criteria the original Food Pyramid used in its categorization. We found that the K-Means groups only overlapped with existing food groups from the Food Pyramid 50% of the time, potentially suggesting that the idea of the basic food groups could be outdated. Report
It is well known fact that excessive amount of default from subprime mortgages, which are mortgages normally issued to a borrower of low credit, was a leading cause of subprime mortgage crisis that led to a global financial meltdown in 2007. Because of this nightmarish experience, it seems plausible to assume that current home mortgages are much harder to get and much more conservative (in terms of risks the lender is taking, shown mainly as an interest rate) than pre-2007 mortgages. Using a dataset containing all home mortgages purchased or guaranteed from The Federal Home Loan Mortgage Corporation, more commonly known as Freddie Mac, I investigate whether there is any noticeable difference between the interest rates before and after subprime mortgage crisis.
Report
Players in the NBA are often compared to others, both active and retired, based on similar play styles. For example, it is common to hear statements such as “Russell Westbrook is the new Derrick Rose”. The purpose of our project is to apply machine learning in the form of clustering to see which players are actually similar based on 22 variables. We successfully generated clusters of players that are very similar quantitatively. It is up to the reader to decide whether this is qualitatively true. Report
We can tell where a food is from - at least, culturally - from just a few bites. There are palettes of ingredients and spices which are strongly associated with each other - giving cajun cooking its kick, and french cuisine its "je ne sais quoi." But, what exactly these palettes and pairings are varies - ask ten different chefs, and you'll get six different answers. We look for a statistical way to identify "trinities" like "onion, carrot, celery" or "garlic, sesame oil, soy sauce," in the process both finding several associations not typically reflected in culinary literature and creating a tool which extends recipes based on their already-known ingredients, in a manner akin to a food version of a cell phone's autocomplete. Report
I examined the Pew News Coverage Index dataset from the years 2010 and 2012 to see how the different topics and stories were covered across media sectors and sources. The combined dataset had over 70,000 stories from all media sectors: print, online, cable tv, network tv, and broadcast radio. From the data, topics have less variance in word count and duration than sources. Report
How Do You Predict Who Will Vote?
Sean CarterOne topic that enters popular discussion every four years is "who votes?" Every presidential election we see many discussions on which groups are more likely to vote, and which important voter groups each candidate needs to capture. One theme that is often part of this discussion is whether or not a candidate's biggest support is among groups likely to turn out. This analysis of the General Social Survey uses a number of different demographic variables to try and answer that question. Report
Designing the Optimal Employee Experience... For Employers
Joey MaaloufUsing a dataset published by Medium on Kaggle, I explored the relationship between an employee's working conditions and the likelihood that they will quit their job. There were some expected trends, like lower salary leading to a higher attrition rate, but also some surprising ones, like having an accident at work leading to a lower likelihood of quitting! This observed information can be used by employers to determine the quitting probability of a specific individual, or to calculate the attrition rate of a larger group, like a department, and adjust their conditions accordingly.
Report
Does being married have an effect on your political views?
Apurva Raman and William LuPolitics has often been a polarizing subject amongst Americans, and in today's increasingly partisan political environment, that has not changed. Using data from the General Social Survey (GSS), an annual study designed and conducted by the National Opinion Research Center (NORC) at the University of Chicago, we identify variables that are correlated with a person's political views. We find that while marital status has a statistically significant apparent effect on political views, that apparent effect is drastically reduced when including confounding variables, particularly religion. Report
Should you Follow the Food Groups for Dietary Advice?
Kaitlyn Keil and Kevin ZhangIn the 1990s, the USDA put out the image of a Food Guide Pyramid to help direct dietary choices. It grouped foods into six categories: grains, proteins (meats, fish, eggs, etc), vegetables, fruits, dairy, and fats and oils. Since then, the pyramid has been revamped in 2005, and then pushed towards a plate with five categories (oils were dropped) in the 2010s. The general population has learned of these basic food groups since grade school, and over time either fully adopts them into their lifestyles, or abandons them to pursue their own balanced diet. In light of the controversy surrounding the Food Pyramid, we decided to ask whether the food categories found in the Food Pyramid truly represent the correct groupings for food, and if not, just how far off are they? Using K-Means clustering on an extensive food databank, we created 6 groupings of food based on their macronutrient composition, which was the primary criteria the original Food Pyramid used in its categorization. We found that the K-Means groups only overlapped with existing food groups from the Food Pyramid 50% of the time, potentially suggesting that the idea of the basic food groups could be outdated. Report
Are Terms of Home Mortgage Less Favorable Now Compared to Pre Mortgage Crisis?
Sungwoo ParkIt is well known fact that excessive amount of default from subprime mortgages, which are mortgages normally issued to a borrower of low credit, was a leading cause of subprime mortgage crisis that led to a global financial meltdown in 2007. Because of this nightmarish experience, it seems plausible to assume that current home mortgages are much harder to get and much more conservative (in terms of risks the lender is taking, shown mainly as an interest rate) than pre-2007 mortgages. Using a dataset containing all home mortgages purchased or guaranteed from The Federal Home Loan Mortgage Corporation, more commonly known as Freddie Mac, I investigate whether there is any noticeable difference between the interest rates before and after subprime mortgage crisis.
Report
Finding NBA Players with Similar Styles
Willem Thorbecke and David PappPlayers in the NBA are often compared to others, both active and retired, based on similar play styles. For example, it is common to hear statements such as “Russell Westbrook is the new Derrick Rose”. The purpose of our project is to apply machine learning in the form of clustering to see which players are actually similar based on 22 variables. We successfully generated clusters of players that are very similar quantitatively. It is up to the reader to decide whether this is qualitatively true. Report
Food Trinities and Recipe Completion
Matt RuehleWe can tell where a food is from - at least, culturally - from just a few bites. There are palettes of ingredients and spices which are strongly associated with each other - giving cajun cooking its kick, and french cuisine its "je ne sais quoi." But, what exactly these palettes and pairings are varies - ask ten different chefs, and you'll get six different answers. We look for a statistical way to identify "trinities" like "onion, carrot, celery" or "garlic, sesame oil, soy sauce," in the process both finding several associations not typically reflected in culinary literature and creating a tool which extends recipes based on their already-known ingredients, in a manner akin to a food version of a cell phone's autocomplete. Report
All the News in 2010 and 2012
Radmer van der HeydeI examined the Pew News Coverage Index dataset from the years 2010 and 2012 to see how the different topics and stories were covered across media sectors and sources. The combined dataset had over 70,000 stories from all media sectors: print, online, cable tv, network tv, and broadcast radio. From the data, topics have less variance in word count and duration than sources. Report