Probably Overthinking It: Internet use and religion, part two

In the previous article, I posted a preliminary exploration of the relationship between Internet use and religious affiliation in Europe. In this article I clean up some data issues and present results broken down by country.

Cleaning and resampling

Here are the steps of the data cleaning pipeline:

I replace sentinel values with NaNs.
I recode some of the explanatory variables, and shift "year born" and "interview year" so their mean is 0.
Within each data collection round and each country, I resample the respondents using their post-stratification weights (pspwght).
In Rounds 1 and 2, there are a few countries that were not asked about Internet use. I remove these countries from those rounds.
For the variables eduyrs and hinctnta, I replace each value with its rank (from 0-1) among respondents in the same round and country.
I replace missing values with random samples from the same round and country.
Finally, I merge rounds 1-5 into a single DataFrame and then group by country.
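As a rough illustration of the resampling, ranking, and filling steps above (this is not the code from the notebook), here is a minimal sketch using pandas and NumPy. The column names essround and cntry, the sentinel codes 77, 88, and 99, and the synthetic data are all assumptions made for the example; only pspwght and eduyrs come from the text.

```python
import numpy as np
import pandas as pd

KEYS = ["essround", "cntry"]  # assumed names for the round and country columns


def resample_by_weight(group, rng, weight_col="pspwght"):
    """Draw len(group) rows with replacement, with probability proportional to weight."""
    weights = group[weight_col].to_numpy(dtype=float)
    idx = rng.choice(len(group), size=len(group), replace=True, p=weights / weights.sum())
    return group.iloc[idx]


def fill_with_random_samples(series, rng):
    """Replace NaNs with random draws from the observed values of the same series."""
    missing = series.isna()
    observed = series[~missing].to_numpy()
    if not missing.any() or len(observed) == 0:
        return series
    filled = series.copy()
    filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled


# Synthetic stand-in data: two rounds, two countries, some sentinel values.
rng = np.random.default_rng(17)
n = 200
df = pd.DataFrame({
    "essround": rng.integers(1, 3, size=n),
    "cntry": rng.choice(["AT", "BE"], size=n),
    "pspwght": rng.uniform(0.5, 2.0, size=n),
    "eduyrs": rng.integers(8, 20, size=n).astype(float),
})
df.loc[rng.choice(n, size=20, replace=False), "eduyrs"] = 88  # assumed sentinel code

# Replace sentinel values with NaN.
df["eduyrs"] = df["eduyrs"].replace([77, 88, 99], np.nan)

# Resample within each round and country using the weights.
resampled = pd.concat(
    [resample_by_weight(group, rng) for _, group in df.groupby(KEYS)],
    ignore_index=True,
)

# Replace each value with its rank within round and country.
# rank(pct=True) yields values in (0, 1]; NaNs stay NaN.
resampled["edurank"] = resampled.groupby(KEYS)["eduyrs"].rank(pct=True)

# Fill missing values with random samples from the same round and country.
resampled["edurank_f"] = resampled.groupby(KEYS)["edurank"].transform(
    lambda s: fill_with_random_samples(s, rng)
)
```

Looping over the groups explicitly, rather than using groupby().apply(), keeps the round and country columns in the result regardless of pandas version.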

This IPython notebook has the details, and summaries of the variables after processing.

One other difference compared to the previous notebook/article: I've added the variable inwyr07, which is the year the respondent was interviewed (between 2002 and 2012), shifted by 2007 so the mean is near 0.

Aggregate results

Using variables with missing values filled (as indicated by the _f suffix), I get the following results from logistic regression on rlgblg_f (belonging to a religion), with sample size 233,856:

	coef	std err	z	P>|z|	[95.0% Conf. Int.]
Intercept	1.1096	0.017	66.381	0.000	1.077 1.142
inwyr07_f	0.0477	0.002	30.782	0.000	0.045 0.051
yrbrn60_f	-0.0090	0.000	-32.682	0.000	-0.010 -0.008
edurank_f	0.0132	0.017	0.775	0.438	-0.020 0.046
hincrank_f	0.0787	0.016	4.950	0.000	0.048 0.110
tvtot_f	-0.0176	0.002	-7.888	0.000	-0.022 -0.013
rdtot_f	-0.0130	0.002	-7.710	0.000	-0.016 -0.010
nwsptot_f	-0.0356	0.004	-9.986	0.000	-0.043 -0.029
netuse_f	-0.1082	0.002	-61.571	0.000	-0.112 -0.105
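For readers who want to fit a model of the same shape, the following is a minimal sketch using the StatsModels formula interface. The variable names match the table above, but the data here is synthetic and the coefficients used to generate it are made up, so the output will not reproduce the numbers in the table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the cleaned ESS data (assumption: real data would
# come from the cleaning pipeline described earlier).
rng = np.random.default_rng(0)
n = 5000
toy = pd.DataFrame({
    "inwyr07_f": rng.integers(-5, 6, size=n),
    "yrbrn60_f": rng.integers(-40, 35, size=n),
    "edurank_f": rng.uniform(0, 1, size=n),
    "hincrank_f": rng.uniform(0, 1, size=n),
    "tvtot_f": rng.integers(0, 8, size=n),
    "rdtot_f": rng.integers(0, 8, size=n),
    "nwsptot_f": rng.integers(0, 8, size=n),
    "netuse_f": rng.integers(0, 8, size=n),
})

# Made-up data-generating process, only so the fit has something to find.
log_odds = 1.0 - 0.1 * toy["netuse_f"] - 0.01 * toy["yrbrn60_f"]
prob = 1 / (1 + np.exp(-log_odds))
toy["rlgblg_f"] = (rng.uniform(size=n) < prob).astype(int)

formula = (
    "rlgblg_f ~ inwyr07_f + yrbrn60_f + edurank_f + hincrank_f + "
    "tvtot_f + rdtot_f + nwsptot_f + netuse_f"
)
results = smf.logit(formula, data=toy).fit(disp=False)

# results.summary() prints a table with the same columns as the one above.
print(results.params)
```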

Compared to the results from last time, there are a few changes:

Interview year has a substantial effect, but probably should not be taken too seriously in this model, since the set of countries included in each round varies. The apparent effect of time might reflect the changing mix of countries. I expect this variable to be more useful after we group by country.
The effect of "year born" is similar to what we saw before. Younger people are less likely to be affiliated.
The effect of education, now expressed in relative terms within each country, is no longer statistically significant. The apparent effect we saw before might have been due to variation across countries.
The effect of income, now expressed in relative terms within each country, is now positive, which is more consistent with results in other studies. Again, the apparent negative effect in the previous analysis might have been due to variation across countries (see Simpson's paradox).
The effect of the media variables is similar to what we saw before: Internet use has the strongest effect, 2-3 times bigger than newspapers, which are 2-3 times bigger than television or radio. And all are negative.

The inconsistent behavior of education and income as control variables is a minor concern, but I think the symptoms are most likely the result of combining countries, possibly made worse because I am not weighting countries by population, so smaller countries are overrepresented.
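To make the Simpson's paradox point concrete, here is a small made-up example with two hypothetical countries. Within each country, the outcome rises with income, but the richer country has a lower outcome overall, so the pooled slope comes out negative. Ranking income within each country, as in the cleaning step, recovers the positive sign. The numbers are invented, and a straight-line fit stands in for the logistic model.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]


def rank_within(x):
    """Rank of each value within its own group, scaled to (0, 1]."""
    return (np.argsort(np.argsort(x)) + 1) / len(x)


# Hypothetical country A: lower incomes, higher religiosity.
income_a = np.array([1.0, 2.0, 3.0])
relig_a = np.array([7.0, 7.5, 8.0])

# Hypothetical country B: higher incomes, lower religiosity.
income_b = np.array([6.0, 7.0, 8.0])
relig_b = np.array([3.0, 3.5, 4.0])

slope_a = slope(income_a, relig_a)
slope_b = slope(income_b, relig_b)

# Pooling raw incomes mixes the within-country and between-country patterns.
pooled_income = np.concatenate([income_a, income_b])
pooled_relig = np.concatenate([relig_a, relig_b])
slope_pooled = slope(pooled_income, pooled_relig)

# Pooling within-country ranks removes the between-country difference in income.
pooled_rank = np.concatenate([rank_within(income_a), rank_within(income_b)])
slope_ranked = slope(pooled_rank, pooled_relig)

print(slope_a, slope_b, slope_pooled, slope_ranked)
```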

Here are the results from linear regression with rlgdgr_f (degree of religiosity) as the dependent variable:

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	6.0140	0.022	270.407	0.000	5.970 6.058
inwyr07_f	0.0253	0.002	12.089	0.000	0.021 0.029
yrbrn60_f	-0.0172	0.000	-46.121	0.000	-0.018 -0.016
edurank_f	-0.2429	0.023	-10.545	0.000	-0.288 -0.198
hincrank_f	-0.1541	0.022	-7.128	0.000	-0.196 -0.112
tvtot_f	-0.0734	0.003	-24.399	0.000	-0.079 -0.067
rdtot_f	-0.0199	0.002	-8.760	0.000	-0.024 -0.015
nwsptot_f	-0.0762	0.005	-15.673	0.000	-0.086 -0.067
netuse_f	-0.1374	0.002	-57.557	0.000	-0.142 -0.133

In this model, all parameters are statistically significant. The effect of the media variables, including Internet use, is similar to what we saw before.

The effect of education and income is negative in this model, but I am not inclined to take it too seriously, again because we are combining countries in a way that doesn't mean much.

Breakdown by country

The following table shows results for logistic regression, with rlgblg_f as the dependent variable, broken down by country; the columns are country code, number of observations, and the estimated parameter associated with Internet use:

Country  Num    Coef of
code     obs.   netuse_f
-------  -----  --------
AT        6918  -0.0795 **
BE        8939  -0.0299 **
BG        6064  0.0145
CH        9310  -0.0668 **
CY        3293  -0.229 **
CZ        8790  -0.0364 **
DE       11568  -0.0195 *
DK        7684  -0.0406 **
EE        6960  -0.0205
ES        9729  -0.0741 **
FI        7969  -0.0228
FR        5787  -0.0185
GB       11117  -0.0262 **
GR        9759  -0.0245
HR        3133  -0.0375
HU        7806  -0.0175
IE       10472  -0.0276 *
IL        7283  -0.0636 **
IS         579  0.0333
IT        1207  -0.107 **
LT        1677  -0.0576 *
LU        3187  -0.0789 **
LV        1980  -0.00623
NL        9741  -0.0589 **
NO        8643  -0.0304 **
PL        8917  -0.108 **
PT       10302  -0.103 **
RO        2146  0.00855
RU        7544  0.00437
SE        9201  -0.0374 **
SI        7126  -0.0336 **
SK        6944  -0.0635 **
TR        4272  -0.0857 *
UA        7809  -0.0422 **

** p < 0.01, * p < 0.05

In the majority of countries, there is a statistically significant relationship between Internet use and religious affiliation. In all of those countries the relationship is negative, with the magnitude of most coefficients between 0.03 and 0.11 (with one exceptionally large value in Cyprus).
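One common way to read logistic regression coefficients, under the standard interpretation of the model, is as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of affiliation for a one-unit increase in the explanatory variable. The sketch below applies that to the ends of the range quoted above and to the value for Cyprus. What one unit of netuse means depends on how the variable is coded.

```python
import math

# Coefficients taken from the text and table above.
coefs = {
    "low end of typical range": -0.03,
    "high end of typical range": -0.11,
    "CY": -0.229,
}

odds_ratios = {label: math.exp(beta) for label, beta in coefs.items()}

for label, ratio in odds_ratios.items():
    # A ratio below 1 means the odds of affiliation fall as netuse rises.
    print(f"{label}: odds ratio per unit of netuse = {ratio:.3f}")
```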

Degree of religiosity

And here are the results of linear regression, with rlgdgr_f as the dependent variable:

Country  Num    Coef of
code     obs.   netuse_f
-------  -----  --------
AT        6918  -0.0151 **
BE        8939  -0.0072 **
BG        6064  0.0023
CH        9310  -0.0132 **
CY        3293  -0.00221 **
CZ        8790  -0.005 **
DE       11568  -0.0045 *
DK        7684  -0.00909 **
EE        6960  -0.00363
ES        9729  -0.0165 **
FI        7969  -0.00501
FR        5787  -0.00429
GB       11117  -0.0061 **
GR        9759  -0.00362 **
HR        3133  -0.00559
HU        7806  -0.00478 *
IE       10472  -0.00412 *
IL        7283  -0.00419 **
IS         579  0.00752
IT        1207  -0.0212 **
LT        1677  -0.00746
LU        3187  -0.0152 **
LV        1980  -0.00147
NL        9741  -0.014 **
NO        8643  -0.00721 **
PL        8917  -0.00919 **
PT       10302  -0.0149 **
RO        2146  0.000303
RU        7544  0.00102
SE        9201  -0.00815 **
SI        7126  -0.00835 **
SK        6944  -0.0119 **
TR        4272  -0.00331 **
UA        7809  -0.00947 **

In most countries there is a negative and statistically significant relationship between Internet use and degree of religiosity.
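A per-country table like the ones above can be assembled by fitting the same model to each country's rows and collecting one coefficient and its p-value. The sketch below does that with StatsModels on synthetic data. The country codes AA, BB, and CC, the column name cntry, the helper names, and the single-predictor formula are all assumptions made to keep the example short.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def stars(p):
    """Significance marker using the same thresholds as the tables above."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


def netuse_by_country(df, formula, param="netuse_f", country_col="cntry"):
    """Fit an OLS model per country and collect the coefficient of `param`."""
    rows = []
    for code, group in df.groupby(country_col):
        res = smf.ols(formula, data=group).fit()
        rows.append((code, len(group), res.params[param], stars(res.pvalues[param])))
    return pd.DataFrame(rows, columns=["cntry", "n_obs", "coef", "sig"])


# Synthetic data with made-up country codes and made-up true slopes.
rng = np.random.default_rng(1)
true_slopes = {"AA": -0.2, "BB": 0.0, "CC": -0.1}
frames = []
for code, true_slope in true_slopes.items():
    netuse = rng.integers(0, 8, size=500)
    rlgdgr = 5 + true_slope * netuse + rng.normal(0, 2, size=500)
    frames.append(pd.DataFrame({"cntry": code, "netuse_f": netuse, "rlgdgr_f": rlgdgr}))
toy = pd.concat(frames, ignore_index=True)

table = netuse_by_country(toy, "rlgdgr_f ~ netuse_f")
print(table.to_string(index=False))
```

For the logistic version, smf.logit could be swapped in for smf.ols, with disp=False passed to fit() to suppress the convergence messages.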

In this model the effect of education is consistent: in most countries it is negative and statistically significant. In the two countries where it is positive, it is not statistically significant.

The effect of income is less consistent: in most countries it is not statistically significant; when it is, it is positive as often as negative.

But education and income are in the model primarily as control variables; they are not the focus of this study. If they are actually associated with religious affiliation, these variables should be effective controls; if not, they contribute some noise, but otherwise do no harm.

Next steps

For now I am using StatsModels to estimate parameters and compute confidence intervals, but that's not quite right because I am using resampled data and filling missing values with random samples. To account correctly for these sources of random error, I have to run the whole process repeatedly:

Resample the data.
Fill missing values.
Estimate parameters.

Collecting the estimated parameters from multiple runs, I can estimate the sampling distribution of the parameters and compute confidence intervals.
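The loop described above might look something like the following sketch. It is an illustration under simplifying assumptions, not the planned implementation: the data is synthetic, a single-predictor least-squares slope stands in for the full model, there is only one group rather than one per round and country, and the percentile interval is just one of several ways to turn the collected estimates into a confidence interval.

```python
import numpy as np
import pandas as pd


def one_run(df, rng):
    """Resample, fill missing values, and estimate one parameter."""
    # 1. Resample rows with probability proportional to the weights.
    weights = df["pspwght"].to_numpy(dtype=float)
    idx = rng.choice(len(df), size=len(df), replace=True, p=weights / weights.sum())
    sample = df.iloc[idx].reset_index(drop=True)

    # 2. Fill missing values with random draws from the observed values.
    for col in ["netuse", "rlgdgr"]:
        missing = sample[col].isna()
        if missing.any():
            observed = sample.loc[~missing, col].to_numpy()
            sample.loc[missing, col] = rng.choice(observed, size=int(missing.sum()))

    # 3. Estimate the parameter (here, a simple least-squares slope).
    return np.polyfit(sample["netuse"], sample["rlgdgr"], 1)[0]


# Synthetic data with a made-up negative relationship and 10% missing values.
rng = np.random.default_rng(2)
n = 500
netuse = rng.integers(0, 8, size=n).astype(float)
rlgdgr = 6 - 0.15 * netuse + rng.normal(0, 1, size=n)
df = pd.DataFrame({
    "pspwght": rng.uniform(0.5, 2.0, size=n),
    "netuse": netuse,
    "rlgdgr": rlgdgr,
})
df.loc[rng.choice(n, size=50, replace=False), "netuse"] = np.nan
df.loc[rng.choice(n, size=50, replace=False), "rlgdgr"] = np.nan

# Collect estimates from many runs to approximate the sampling distribution.
estimates = np.array([one_run(df, rng) for _ in range(201)])
low, high = np.percentile(estimates, [2.5, 97.5])
print(f"median {np.median(estimates):.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

Because each run repeats both the resampling and the filling, the spread of the estimates reflects both sources of random error.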

Once I have implemented that, I plan to translate the results into a form that is easier to interpret (rather than just estimated coefficients), and generate visualizations to make the results easier to explore.

I would also like to relate the effect of Internet use in each country with the average level of religiosity, to see whether, for example, the effect is bigger in more religious countries.

While I am working on that, I am open to suggestions for additional explorations people might be interested in. You can explore the variables in the ESS using their "Cumulative Data Wizard"; let me know what you find!

Tuesday, November 17, 2015