Cleaning and resampling
Here are the steps of the data cleaning pipeline:
- I replace sentinel values with NaNs.
- I recode some of the explanatory variables, and shift "year born" and "interview year" so their mean is 0.
- Within each data collection round and each country, I resample the respondents using their post-stratification weights (pspwght).
- In Rounds 1 and 2, there are a few countries that were not asked about Internet use. I remove these countries from those rounds.
- For the variables eduyrs and hinctnta, I replace each value with its rank (from 0-1) among respondents in the same round and country.
- I replace missing values with random samples from the same round and country.
- Finally, I merge rounds 1-5 into a single DataFrame and then group by country.
One other difference compared to the previous notebook/article: I've added the variable inwyr07, which is the year the respondent was interviewed (between 2002 and 2012), shifted by 2007 so the mean is near 0.
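The resampling, ranking, and filling steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code from the notebook: the column name cntry, the sentinel code 88, and the helper functions are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)

# Synthetic stand-in for one round of data (all values are made up).
n = 200
df = pd.DataFrame({
    "cntry": np.repeat(["AT", "BE"], n),           # hypothetical country column
    "pspwght": rng.uniform(0.5, 2.0, size=2 * n),  # post-stratification weights
    "eduyrs": rng.integers(8, 20, size=2 * n),
})
# Mark roughly 10% of values with a made-up sentinel code.
df.loc[rng.random(2 * n) < 0.1, "eduyrs"] = 88


def replace_sentinels(df, col, sentinels):
    """Return a copy of df with sentinel codes in col replaced by NaN."""
    return df.assign(**{col: df[col].mask(df[col].isin(sentinels))})


def resample_by_weight(group, rng):
    """Draw len(group) rows with replacement, proportional to pspwght."""
    p = (group["pspwght"] / group["pspwght"].sum()).to_numpy()
    idx = rng.choice(len(group), size=len(group), replace=True, p=p)
    return group.iloc[idx].reset_index(drop=True)


def rank_0_1(series):
    """Percentile rank within the series; NaNs stay NaN."""
    return series.rank(pct=True)


def fill_random(series, rng):
    """Fill NaNs with random draws from the observed values."""
    out = series.copy()
    missing = out.isna()
    observed = out.dropna().to_numpy()
    if missing.any() and len(observed) > 0:
        out[missing] = rng.choice(observed, size=missing.sum(), replace=True)
    return out


df = replace_sentinels(df, "eduyrs", [88])
parts = []
for _, group in df.groupby("cntry"):
    sample = resample_by_weight(group, rng)
    sample["edurank"] = rank_0_1(sample["eduyrs"])
    sample["edurank_f"] = fill_random(sample["edurank"], rng)
    parts.append(sample)
clean = pd.concat(parts, ignore_index=True)
```

In a full pipeline the same loop would run over each round as well as each country, before the rounds are merged.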
Aggregate results
Using variables with missing values filled (as indicated by the _f suffix), I get the following results from logistic regression on rlgblg_f (belonging to a religion), with sample size 233,856:
Variable | coef | std err | z | P>|z| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 1.1096 | 0.017 | 66.381 | 0.000 | 1.077 1.142 |
inwyr07_f | 0.0477 | 0.002 | 30.782 | 0.000 | 0.045 0.051 |
yrbrn60_f | -0.0090 | 0.000 | -32.682 | 0.000 | -0.010 -0.008 |
edurank_f | 0.0132 | 0.017 | 0.775 | 0.438 | -0.020 0.046 |
hincrank_f | 0.0787 | 0.016 | 4.950 | 0.000 | 0.048 0.110 |
tvtot_f | -0.0176 | 0.002 | -7.888 | 0.000 | -0.022 -0.013 |
rdtot_f | -0.0130 | 0.002 | -7.710 | 0.000 | -0.016 -0.010 |
nwsptot_f | -0.0356 | 0.004 | -9.986 | 0.000 | -0.043 -0.029 |
netuse_f | -0.1082 | 0.002 | -61.571 | 0.000 | -0.112 -0.105 |
Compared to the results from last time, there are a few changes:
- Interview year has a substantial effect, but probably should not be taken too seriously in this model, since the set of countries included in each round varies. The apparent effect of time might reflect the changing mix of countries. I expect this variable to be more useful after we group by country.
- The effect of "year born" is similar to what we saw before. Younger people are less likely to be affiliated.
- The effect of education, now expressed in relative terms within each country, is no longer statistically significant. The apparent effect we saw before might have been due to variation across countries.
- The effect of income, now expressed in relative terms within each country, is now positive, which is more consistent with results in other studies. Again, the apparent negative effect in the previous analysis might have been due to variation across countries (see Simpson's paradox).
- The effect of the media variables is similar to what we saw before: Internet use has the strongest effect, 2-3 times bigger than newspapers, which are 2-3 times bigger than television or radio. And all are negative.
The inconsistent behavior of education and income as control variables is a minor concern, but I think the symptoms are most likely the result of combining countries, possibly made worse because I am not weighting countries by population, so smaller countries are overrepresented.
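One way to read the media coefficients: in a logistic regression each coefficient is a change in log-odds, so exponentiating it gives an odds ratio per one-unit increase in the variable. The sketch below uses the coefficients from the table above; the units of the media variables are whatever scale the ESS uses, which I am not assuming here.

```python
import math

# Coefficients copied from the logistic regression table above.
coefs = {
    "tvtot_f": -0.0176,
    "rdtot_f": -0.0130,
    "nwsptot_f": -0.0356,
    "netuse_f": -0.1082,
}

# exp(coef) is the multiplicative change in the odds of affiliation
# for a one-unit increase in the variable, other variables held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
```

An odds ratio below 1 means the odds of affiliation go down as the variable goes up; for netuse_f the ratio is about 0.90 per unit.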
Here are the results from linear regression with rlgdgr_f (degree of religiosity) as the dependent variable:
Variable | coef | std err | t | P>|t| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 6.0140 | 0.022 | 270.407 | 0.000 | 5.970 6.058 |
inwyr07_f | 0.0253 | 0.002 | 12.089 | 0.000 | 0.021 0.029 |
yrbrn60_f | -0.0172 | 0.000 | -46.121 | 0.000 | -0.018 -0.016 |
edurank_f | -0.2429 | 0.023 | -10.545 | 0.000 | -0.288 -0.198 |
hincrank_f | -0.1541 | 0.022 | -7.128 | 0.000 | -0.196 -0.112 |
tvtot_f | -0.0734 | 0.003 | -24.399 | 0.000 | -0.079 -0.067 |
rdtot_f | -0.0199 | 0.002 | -8.760 | 0.000 | -0.024 -0.015 |
nwsptot_f | -0.0762 | 0.005 | -15.673 | 0.000 | -0.086 -0.067 |
netuse_f | -0.1374 | 0.002 | -57.557 | 0.000 | -0.142 -0.133 |
In this model, all parameters are statistically significant. The effect of the media variables, including Internet use, is similar to what we saw before.
The effect of education and income is negative in this model, but I am not inclined to take it too seriously, again because we are combining countries in a way that doesn't mean much.
Breakdown by country
The following table shows results for logistic regression, with rlgblg_f as the dependent variable, broken down by country; the columns are country code, number of observations, and the estimated parameter associated with Internet use:
Country Num Coef of
code obs. netuse_f
------- ---- --------
AT 6918 -0.0795 **
BE 8939 -0.0299 **
BG 6064 0.0145
CH 9310 -0.0668 **
CY 3293 -0.229 **
CZ 8790 -0.0364 **
DE 11568 -0.0195 *
DK 7684 -0.0406 **
EE 6960 -0.0205
ES 9729 -0.0741 **
FI 7969 -0.0228
FR 5787 -0.0185
GB 11117 -0.0262 **
GR 9759 -0.0245
HR 3133 -0.0375
HU 7806 -0.0175
IE 10472 -0.0276 *
IL 7283 -0.0636 **
IS 579 0.0333
IT 1207 -0.107 **
LT 1677 -0.0576 *
LU 3187 -0.0789 **
LV 1980 -0.00623
NL 9741 -0.0589 **
NO 8643 -0.0304 **
PL 8917 -0.108 **
PT 10302 -0.103 **
RO 2146 0.00855
RU 7544 0.00437
SE 9201 -0.0374 **
SI 7126 -0.0336 **
SK 6944 -0.0635 **
TR 4272 -0.0857 *
UA 7809 -0.0422 **
** p < 0.01, * p < 0.05
In the majority of countries (23 of the 34 in the table), there is a statistically significant relationship between Internet use and religious affiliation. In all of those countries the relationship is negative, with the magnitude of most coefficients between 0.03 and 0.11 (with one exceptionally large value in Cyprus).
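A breakdown like the one above can be produced by grouping on country and fitting the same model to each group. Here is a minimal sketch using the StatsModels formula API on synthetic data; the country codes AA and BB, the column name cntry, the 0-7 range for netuse_f, and the reduced formula are all assumptions made for the example, not the model from the notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Synthetic stand-in data: two made-up "countries" with a binary outcome.
n = 500
df = pd.DataFrame({
    "cntry": np.repeat(["AA", "BB"], n),
    "netuse_f": rng.integers(0, 8, size=2 * n),
    "yrbrn60_f": rng.normal(0, 15, size=2 * n),
})
# Build the outcome so that more Internet use lowers the odds of affiliation.
log_odds = 1.0 - 0.1 * df["netuse_f"] - 0.01 * df["yrbrn60_f"]
prob = 1 / (1 + np.exp(-log_odds))
df["rlgblg_f"] = (rng.random(2 * n) < prob).astype(int)


def stars(p):
    """Significance marker matching the legend: ** p < 0.01, * p < 0.05."""
    return "**" if p < 0.01 else "*" if p < 0.05 else ""


rows = []
for code, group in df.groupby("cntry"):
    results = smf.logit("rlgblg_f ~ yrbrn60_f + netuse_f", data=group).fit(disp=False)
    coef = results.params["netuse_f"]
    rows.append((code, len(group), coef, stars(results.pvalues["netuse_f"])))

for code, n_obs, coef, mark in rows:
    print(f"{code}  {n_obs:5d}  {coef:.3g} {mark}")
```

Swapping smf.logit for smf.ols and the outcome for rlgdgr_f would give the corresponding linear regression breakdown.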
Degree of religiosity
And here are the results of linear regression, with rlgdgr_f as the dependent variable:
Country Num Coef of
code obs. netuse_f
------- ---- --------
AT 6918 -0.0151 **
BE 8939 -0.0072 **
BG 6064 0.0023
CH 9310 -0.0132 **
CY 3293 -0.00221 **
CZ 8790 -0.005 **
DE 11568 -0.0045 *
DK 7684 -0.00909 **
EE 6960 -0.00363
ES 9729 -0.0165 **
FI 7969 -0.00501
FR 5787 -0.00429
GB 11117 -0.0061 **
GR 9759 -0.00362 **
HR 3133 -0.00559
HU 7806 -0.00478 *
IE 10472 -0.00412 *
IL 7283 -0.00419 **
IS 579 0.00752
IT 1207 -0.0212 **
LT 1677 -0.00746
LU 3187 -0.0152 **
LV 1980 -0.00147
NL 9741 -0.014 **
NO 8643 -0.00721 **
PL 8917 -0.00919 **
PT 10302 -0.0149 **
RO 2146 0.000303
RU 7544 0.00102
SE 9201 -0.00815 **
SI 7126 -0.00835 **
SK 6944 -0.0119 **
TR 4272 -0.00331 **
UA 7809 -0.00947 **
In most countries (24 of 34) there is a negative and statistically significant relationship between Internet use and degree of religiosity.
In this model the effect of education is consistent: in most countries it is negative and statistically significant. In the two countries where it is positive, it is not statistically significant.
The effect of income is less consistent: in most countries it is not statistically significant; when it is, it is positive as often as negative.
But education and income are in the model primarily as control variables; they are not the focus of this study. If they are actually associated with religious affiliation, these variables should be effective controls; if not, they contribute some noise, but otherwise do no harm.
Next steps
For now I am using StatsModels to estimate parameters and compute confidence intervals, but that's not quite right because I am using resampled data and filling missing values with random samples. To account correctly for these sources of random error, I have to run the whole process repeatedly:
- Resample the data.
- Fill missing values.
- Estimate parameters.
Collecting the estimated parameters from multiple runs, I can estimate the sampling distribution of the parameters and compute confidence intervals.
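The loop described above might look something like this. It is a sketch under several assumptions: the data are synthetic, the helper names are hypothetical, ordinary least squares via NumPy stands in for the StatsModels fit, and a percentile interval is just one way to summarize the sampling distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in data (made up): outcome y, predictor x with
# missing values, and sampling weights w.
n = 400
x = rng.integers(0, 8, size=n).astype(float)
y = 6.0 - 0.14 * x + rng.normal(0, 2, size=n)
x[rng.random(n) < 0.1] = np.nan
df = pd.DataFrame({"x": x, "y": y, "w": rng.uniform(0.5, 2.0, size=n)})


def resample(df, rng):
    """Step 1: draw rows with replacement, proportional to the weights."""
    p = (df["w"] / df["w"].sum()).to_numpy()
    idx = rng.choice(len(df), size=len(df), replace=True, p=p)
    return df.iloc[idx].reset_index(drop=True)


def fill_missing(df, rng):
    """Step 2: fill missing x with random draws from the observed x."""
    out = df.copy()
    missing = out["x"].isna()
    observed = out["x"].dropna().to_numpy()
    if missing.any():
        out.loc[missing, "x"] = rng.choice(observed, size=missing.sum(), replace=True)
    return out


def estimate_slope(df):
    """Step 3: least-squares slope of y on x (stand-in for a StatsModels fit)."""
    X = np.column_stack([np.ones(len(df)), df["x"].to_numpy()])
    beta, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
    return beta[1]


# Repeat the whole process and collect the estimates.
slopes = np.array([
    estimate_slope(fill_missing(resample(df, rng), rng))
    for _ in range(201)
])
low, high = np.percentile(slopes, [2.5, 97.5])
print(f"slope: median {np.median(slopes):.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

Because resampling and filling happen inside the loop, the spread of the collected slopes reflects both sources of random error, which a single fit cannot capture.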
Once I have implemented that, I plan to translate the results into a form that is easier to interpret (rather than just estimated coefficients), and generate visualizations to make the results easier to explore.
I would also like to relate the effect of Internet use in each country with the average level of religiosity, to see whether, for example, the effect is bigger in more religious countries.
While I am working on that, I am open to suggestions for additional explorations people might be interested in. You can explore the variables in the ESS using their "Cumulative Data Wizard"; let me know what you find!
Nice blog Allen! Why did you replace each value of `eduyrs` and `hinctnta` with its rank (from 0-1)? Is this a particular technique used when controlling variables in a regression model?
Good question. I did that partly to deal with data issues (there are some suspiciously large values in eduyrs), partly to standardize the variables (which makes it easier to interpret the estimated parameters), and partly to make these variables relative to other respondents from the same country, to avoid comparisons across countries with different levels of income and education.