Wednesday, June 14, 2017

Religion in the United States

Last night I had the pleasure of presenting a talk for the PyData Boston Meetup.  I presented a project I started earlier this summer, using data from the General Social Survey to measure and predict trends in religious affiliation and belief in the U.S.

The slides, which include the results so far and an overview of the methodology, are here:

And the code and data are all in this Jupyter notebook.  I'll post additional results and discussion over the next few weeks.

Thanks to Milos Miljkovic, organizer of the PyData Boston Meetup, for inviting me, and to O'Reilly Media for hosting the meeting.

Thursday, June 1, 2017

Spring 2017 Data Science reports

In my Data Science class this semester, students worked on a series of reports where they explore a freely-available dataset, use data to answer questions, and present their findings.  After each batch of reports, I will publish the abstracts here; you can follow the links below to see what they found.

How Do You Predict Who Will Vote?

Sean Carter

One topic that enters popular discussion every four years is "who votes?" Every presidential election we see many discussions on which groups are more likely to vote, and which important voter groups each candidate needs to capture. One theme that is often part of this discussion is whether or not a candidate's biggest support is among groups likely to turn out. This analysis of the General Social Survey uses a number of different demographic variables to try and answer that question. Report

Designing the Optimal Employee Experience... For Employers

Joey Maalouf

Using a dataset published by Medium on Kaggle, I explored the relationship between an employee's working conditions and the likelihood that they will quit their job. There were some expected trends, like lower salary leading to a higher attrition rate, but also some surprising ones, like having an accident at work leading to a lower likelihood of quitting! This observed information can be used by employers to determine the quitting probability of a specific individual, or to calculate the attrition rate of a larger group, like a department, and adjust their conditions accordingly.

Does being married have an effect on your political views?

Apurva Raman and William Lu

Politics has often been a polarizing subject amongst Americans, and in today's increasingly partisan political environment, that has not changed. Using data from the General Social Survey (GSS), an annual study designed and conducted by the National Opinion Research Center (NORC) at the University of Chicago, we identify variables that are correlated with a person's political views. We find that while marital status has a statistically significant apparent effect on political views, that apparent effect is drastically reduced when including confounding variables, particularly religion. Report

Should you Follow the Food Groups for Dietary Advice?

Kaitlyn Keil and Kevin Zhang

In the 1990s, the USDA put out the image of a Food Guide Pyramid to help direct dietary choices. It grouped foods into six categories: grains, proteins (meats, fish, eggs, etc), vegetables, fruits, dairy, and fats and oils. Since then, the pyramid has been revamped in 2005, and then pushed towards a plate with five categories (oils were dropped) in the 2010s. The general population has learned of these basic food groups since grade school, and over time either fully adopts them into their lifestyles, or abandons them to pursue their own balanced diet. In light of the controversy surrounding the Food Pyramid, we decided to ask whether the food categories found in the Food Pyramid truly represent the correct groupings for food, and if not, just how far off are they? Using K-Means clustering on an extensive food databank, we created 6 groupings of food based on their macronutrient composition, which was the primary criteria the original Food Pyramid used in its categorization. We found that the K-Means groups only overlapped with existing food groups from the Food Pyramid 50% of the time, potentially suggesting that the idea of the basic food groups could be outdated. Report

Are Terms of Home Mortgage Less Favorable Now Compared to Pre Mortgage Crisis?

Sungwoo Park

It is well known fact that excessive amount of default from subprime mortgages, which are mortgages normally issued to a borrower of low credit, was a leading cause of subprime mortgage crisis that led to a global financial meltdown in 2007. Because of this nightmarish experience, it seems plausible to assume that current home mortgages are much harder to get and much more conservative (in terms of risks the lender is taking, shown mainly as an interest rate) than pre-2007 mortgages. Using a dataset containing all home mortgages purchased or guaranteed from The Federal Home Loan Mortgage Corporation, more commonly known as Freddie Mac, I investigate whether there is any noticeable difference between the interest rates before and after subprime mortgage crisis.

Finding NBA Players with Similar Styles

Willem Thorbecke and David Papp

Players in the NBA are often compared to others, both active and retired, based on similar play styles. For example, it is common to hear statements such as “Russell Westbrook is the new Derrick Rose”. The purpose of our project is to apply machine learning in the form of clustering to see which players are actually similar based on 22 variables. We successfully generated clusters of players that are very similar quantitatively. It is up to the reader to decide whether this is qualitatively true. Report

Food Trinities and Recipe Completion

Matt Ruehle

We can tell where a food is from - at least, culturally - from just a few bites. There are palettes of ingredients and spices which are strongly associated with each other - giving cajun cooking its kick, and french cuisine its "je ne sais quoi." But, what exactly these palettes and pairings are varies - ask ten different chefs, and you'll get six different answers. We look for a statistical way to identify "trinities" like "onion, carrot, celery" or "garlic, sesame oil, soy sauce," in the process both finding several associations not typically reflected in culinary literature and creating a tool which extends recipes based on their already-known ingredients, in a manner akin to a food version of a cell phone's autocomplete. Report

All the News in 2010 and 2012

Radmer van der Heyde

I examined the Pew News Coverage Index dataset from the years 2010 and 2012 to see how the different topics and stories were covered across media sectors and sources. The combined dataset had over 70,000 stories from all media sectors: print, online, cable tv, network tv, and broadcast radio. From the data, topics have less variance in word count and duration than sources. Report