Tuesday, November 17, 2015

Internet use and religion, part two

In the previous article, I posted a preliminary exploration of the relationship between Internet use and religious affiliation in Europe.  In this article I clean up some data issues and present results broken down by country.

Cleaning and resampling

Here are the steps of the data cleaning pipeline:

  1. I replace sentinel values with NaNs.
  2. I recode some of the explanatory variables, and shift "year born" and "interview year" so their mean is 0.
  3. Within each data collection round and each country, I resample the respondents using their post-stratification weights (pspwght).
  4. In Rounds 1 and 2, there are a few countries that were not asked about Internet use.  I remove these countries from those rounds.
  5. For the variables eduyrs and hinctnta, I replace each value with its rank (from 0-1) among respondents in the same round and country.
  6. I replace missing values with random samples from the same round and country.
  7. Finally, I merge rounds 1-5 into a single DataFrame and then group by country.
This IPython notebook has the details, and summaries of the variables after processing.
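
As a rough sketch of steps 1, 2, 3, 5 and 6 above (not the code from the notebook), the pipeline might look like the following.  The sentinel codes 77, 88 and 99, the toy data, and the helper names resample_by_weight, fill_missing and clean are assumptions made for illustration; note also that rank(pct=True) yields ranks in (0, 1] rather than exactly 0-1.

```python
import numpy as np
import pandas as pd

KEYS = ["essround", "cntry"]  # one group per round and country


def resample_by_weight(group, rng):
    """Draw len(group) rows with replacement, with probability
    proportional to the post-stratification weight."""
    probs = (group["pspwght"] / group["pspwght"].sum()).to_numpy()
    picks = rng.choice(len(group), size=len(group), replace=True, p=probs)
    return group.iloc[picks]


def fill_missing(series, rng):
    """Replace NaNs with random draws from the observed values."""
    missing = series.isna()
    observed = series.dropna().to_numpy()
    if len(observed) == 0 or not missing.any():
        return series
    filled = series.copy()
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled


def clean(df, rng):
    df = df.copy()
    # Step 1: sentinel codes become NaN (77, 88, 99 are assumed codes).
    df["eduyrs"] = df["eduyrs"].replace([77, 88, 99], np.nan)
    # Step 2: shift year born so its mean is near zero.
    df["yrbrn60"] = df["yrbrn"] - 1960
    # Step 3: resample within each round and country.
    parts = [resample_by_weight(g, rng) for _, g in df.groupby(KEYS)]
    sample = pd.concat(parts, ignore_index=True)
    # Step 5: percentile rank within round and country, in (0, 1].
    sample["edurank"] = sample.groupby(KEYS)["eduyrs"].rank(pct=True)
    # Step 6: fill missing values from the same round and country.
    sample["edurank_f"] = sample.groupby(KEYS)["edurank"].transform(
        lambda s: fill_missing(s, rng))
    return sample


# Toy data standing in for one ESS round with two countries.
rng = np.random.default_rng(17)
n = 400
raw = pd.DataFrame({
    "essround": 1,
    "cntry": np.repeat(["AT", "BE"], n // 2),
    "pspwght": rng.uniform(0.5, 2.0, n),
    "yrbrn": rng.integers(1920, 1995, n),
    "eduyrs": rng.choice([9, 12, 16, 88], size=n, p=[0.3, 0.4, 0.2, 0.1]),
})
cleaned = clean(raw, rng)
print(cleaned[["cntry", "eduyrs", "edurank", "edurank_f"]].head())
```

Resampling before ranking and filling means the ranks and the filled values are computed on the weighted sample rather than the raw respondents.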

One other difference compared to the previous notebook/article: I've added the variable inwyr07, which is the year the respondent was interviewed (between 2002 and 2012), shifted by 2007 so the mean is near 0.

Aggregate results

Using variables with missing values filled (as indicated by the _f suffix), I get the following results from logistic regression on rlgblg_f (belonging to a religion), with sample size 233,856:

               coef  std err         z  P>|z|  [95.0% Conf. Int.]
----------  -------  -------  --------  -----  ------------------
Intercept    1.1096    0.017    66.381  0.000     1.077     1.142
inwyr07_f    0.0477    0.002    30.782  0.000     0.045     0.051
yrbrn60_f   -0.0090    0.000   -32.682  0.000    -0.010    -0.008
edurank_f    0.0132    0.017     0.775  0.438    -0.020     0.046
hincrank_f   0.0787    0.016     4.950  0.000     0.048     0.110
tvtot_f     -0.0176    0.002    -7.888  0.000    -0.022    -0.013
rdtot_f     -0.0130    0.002    -7.710  0.000    -0.016    -0.010
nwsptot_f   -0.0356    0.004    -9.986  0.000    -0.043    -0.029
netuse_f    -0.1082    0.002   -61.571  0.000    -0.112    -0.105
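
A model of this form can be fit with the StatsModels formula API.  The sketch below uses synthetic data with the same column names; the value ranges and the coefficients used to generate the outcome are assumptions, so the printed numbers will not match the results reported here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for the cleaned ESS data; the scales are assumptions.
df = pd.DataFrame({
    "inwyr07_f": rng.integers(-5, 6, n),
    "yrbrn60_f": rng.integers(-40, 35, n),
    "edurank_f": rng.random(n),
    "hincrank_f": rng.random(n),
    "tvtot_f": rng.integers(0, 8, n),
    "rdtot_f": rng.integers(0, 8, n),
    "nwsptot_f": rng.integers(0, 8, n),
    "netuse_f": rng.integers(0, 8, n),
})

# Generate the outcome from a known model so the fit can be sanity-checked.
log_odds = 1.1 - 0.009 * df["yrbrn60_f"] - 0.1 * df["netuse_f"]
df["rlgblg_f"] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

formula = ("rlgblg_f ~ inwyr07_f + yrbrn60_f + edurank_f + hincrank_f + "
           "tvtot_f + rdtot_f + nwsptot_f + netuse_f")
results = smf.logit(formula, data=df).fit(disp=False)
print(results.summary())

# exp(coef) is the factor by which the odds change per unit of the variable.
odds_ratios = np.exp(results.params)
print(odds_ratios["netuse_f"])
```

Applied to the estimate reported for netuse_f, exp(-0.1082) is about 0.90, so each additional unit of Internet use multiplies the odds of affiliation by roughly 0.90 in this model.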

Compared to the results from last time, there are a few changes:
  1. Interview year has a substantial effect, but probably should not be taken too seriously in this model, since the set of countries included in each round varies.  The apparent effect of time might reflect the changing mix of countries.  I expect this variable to be more useful after we group by country.
  2. The effect of "year born" is similar to what we saw before.  Younger people are less likely to be affiliated.
  3. The effect of education, now expressed in relative terms within each country, is no longer statistically significant.  The apparent effect we saw before might have been due to variation across countries.
  4. The effect of income, now expressed in relative terms within each country, is now positive, which is more consistent with results in other studies.  Again, the apparent negative effect in the previous analysis might have been due to variation across countries (see Simpson's paradox).
  5. The effect of the media variables is similar to what we saw before:  Internet use has the strongest effect, 2-3 times bigger than newspapers, which are 2-3 times bigger than television or radio.  And all are negative.
The inconsistent behavior of education and income as control variables is a minor concern, but I think the symptoms are most likely the result of combining countries, possibly made worse because I am not weighting countries by population, so smaller countries are overrepresented.
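
To make the reasoning about income concrete, here is a toy example with made-up numbers (two hypothetical countries, A and B) in which income is positively related to affiliation within each country but negatively related when the countries are pooled.  Replacing income with its rank within each country recovers the positive relationship.

```python
import numpy as np
import pandas as pd

# Made-up data: country B is richer and less religious than country A,
# but within each country higher income goes with higher affiliation.
df = pd.DataFrame({
    "cntry": ["A"] * 5 + ["B"] * 5,
    "income": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "affiliated": [0.82, 0.84, 0.86, 0.88, 0.90,
                   0.22, 0.24, 0.26, 0.28, 0.30],
})


def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(np.asarray(x, dtype=float),
                      np.asarray(y, dtype=float), 1)[0]


# Pooled across countries, the slope is negative.
pooled = slope(df["income"], df["affiliated"])

# Within each country, the slope is positive.
within = {code: slope(g["income"], g["affiliated"])
          for code, g in df.groupby("cntry")}

# Ranking income within each country removes the between-country difference.
df["incrank"] = df.groupby("cntry")["income"].rank(pct=True)
ranked = slope(df["incrank"], df["affiliated"])

print("pooled:", pooled)
print("within:", within)
print("ranked:", ranked)
```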

Here are the results from linear regression with rlgdgr_f (degree of religiosity) as the dependent variable:

               coef  std err         t  P>|t|  [95.0% Conf. Int.]
----------  -------  -------  --------  -----  ------------------
Intercept    6.0140    0.022   270.407  0.000     5.970     6.058
inwyr07_f    0.0253    0.002    12.089  0.000     0.021     0.029
yrbrn60_f   -0.0172    0.000   -46.121  0.000    -0.018    -0.016
edurank_f   -0.2429    0.023   -10.545  0.000    -0.288    -0.198
hincrank_f  -0.1541    0.022    -7.128  0.000    -0.196    -0.112
tvtot_f     -0.0734    0.003   -24.399  0.000    -0.079    -0.067
rdtot_f     -0.0199    0.002    -8.760  0.000    -0.024    -0.015
nwsptot_f   -0.0762    0.005   -15.673  0.000    -0.086    -0.067
netuse_f    -0.1374    0.002   -57.557  0.000    -0.142    -0.133

In this model, all parameters are statistically significant.  The effect of the media variables, including Internet use, is similar to what we saw before.

The effect of education and income is negative in this model, but I am not inclined to take it too seriously, again because we are combining countries in a way that doesn't mean much.

Breakdown by country

The following table shows results for logistic regression, with rlgblg_f as the dependent variable, broken down by country; the columns are country code, number of observations, and the estimated parameter associated with Internet use:

Country    Num   Coef of
code      obs.   netuse_f
-------  -----   --------
AT        6918   -0.0795   **
BE        8939   -0.0299   **
BG        6064    0.0145
CH        9310   -0.0668   **
CY        3293   -0.229    **
CZ        8790   -0.0364   **
DE       11568   -0.0195   *
DK        7684   -0.0406   **
EE        6960   -0.0205
ES        9729   -0.0741   **
FI        7969   -0.0228
FR        5787   -0.0185
GB       11117   -0.0262   **
GR        9759   -0.0245
HR        3133   -0.0375
HU        7806   -0.0175
IE       10472   -0.0276   *
IL        7283   -0.0636   **
IS         579    0.0333
IT        1207   -0.107    **
LT        1677   -0.0576   *
LU        3187   -0.0789   **
LV        1980   -0.00623
NL        9741   -0.0589   **
NO        8643   -0.0304   **
PL        8917   -0.108    **
PT       10302   -0.103    **
RO        2146    0.00855
RU        7544    0.00437
SE        9201   -0.0374   **
SI        7126   -0.0336   **
SK        6944   -0.0635   **
TR        4272   -0.0857   *
UA        7809   -0.0422   **

** p < 0.01, * p < 0.05
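
A per-country table like this one could be produced by grouping the merged DataFrame by country and fitting one model per group.  The following is a minimal sketch on synthetic data, not the notebook's code: the country codes XA, XB and XC, the reduced formula, and the helper names stars and netuse_by_country are all made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def stars(p):
    """Significance markers: ** for p < 0.01, * for p < 0.05."""
    return "**" if p < 0.01 else "*" if p < 0.05 else ""


def netuse_by_country(df, formula):
    """Fit one logistic regression per country and collect the
    number of observations and the netuse_f coefficient."""
    rows = []
    for code, group in df.groupby("cntry"):
        results = smf.logit(formula, data=group).fit(disp=False)
        rows.append({
            "cntry": code,
            "nobs": int(results.nobs),
            "coef": results.params["netuse_f"],
            "sig": stars(results.pvalues["netuse_f"]),
        })
    return pd.DataFrame(rows)


# Synthetic data: three hypothetical countries with different true effects.
rng = np.random.default_rng(1)
true_coefs = {"XA": -0.2, "XB": -0.05, "XC": 0.0}
frames = []
for code, coef in true_coefs.items():
    n = 3000
    netuse = rng.integers(0, 8, n)
    yrbrn60 = rng.integers(-40, 35, n)
    log_odds = 1.0 - 0.01 * yrbrn60 + coef * netuse
    rlgblg = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)
    frames.append(pd.DataFrame({
        "cntry": code,
        "netuse_f": netuse,
        "yrbrn60_f": yrbrn60,
        "rlgblg_f": rlgblg,
    }))
df = pd.concat(frames, ignore_index=True)

table = netuse_by_country(df, "rlgblg_f ~ yrbrn60_f + netuse_f")
print(table.to_string(index=False))
```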

In the majority of countries, there is a statistically significant relationship between Internet use and religious affiliation.  In all of those countries the relationship is negative, with the magnitude of most coefficients between 0.03 and 0.11 (with one exceptionally large value in Cyprus).

Degree of religiosity

And here are the results of linear regression, with rlgdgr_f as the dependent variable:

Country    Num   Coef of
code      obs.   netuse_f
-------  -----   ---------
AT        6918   -0.0151    **
BE        8939   -0.0072    **
BG        6064    0.0023
CH        9310   -0.0132    **
CY        3293   -0.00221   **
CZ        8790   -0.005     **
DE       11568   -0.0045    *
DK        7684   -0.00909   **
EE        6960   -0.00363
ES        9729   -0.0165    **
FI        7969   -0.00501
FR        5787   -0.00429
GB       11117   -0.0061    **
GR        9759   -0.00362   **
HR        3133   -0.00559
HU        7806   -0.00478   *
IE       10472   -0.00412   *
IL        7283   -0.00419   **
IS         579    0.00752
IT        1207   -0.0212    **
LT        1677   -0.00746
LU        3187   -0.0152    **
LV        1980   -0.00147
NL        9741   -0.014     **
NO        8643   -0.00721   **
PL        8917   -0.00919   **
PT       10302   -0.0149    **
RO        2146    0.000303
RU        7544    0.00102
SE        9201   -0.00815   **
SI        7126   -0.00835   **
SK        6944   -0.0119    **
TR        4272   -0.00331   **
UA        7809   -0.00947   **

In most countries there is a negative and statistically significant relationship between Internet use and degree of religiosity.

In this model the effect of education is consistent: in most countries it is negative and statistically significant.  In the two countries where it is positive, it is not statistically significant.

The effect of income is less consistent: in most countries it is not statistically significant; when it is, it is positive as often as negative.

But education and income are in the model primarily as control variables; they are not the focus of this study.  If they are actually associated with religious affiliation, these variables should be effective controls; if not, they contribute some noise, but otherwise do no harm.

Next steps

For now I am using StatsModels to estimate parameters and compute confidence intervals, but that's not quite right because I am using resampled data and filling missing values with random samples.  To account correctly for these sources of random error, I have to run the whole process repeatedly:
  1. Resample the data.
  2. Fill missing values.
  3. Estimate parameters.
Collecting the estimated parameters from multiple runs, I can estimate the sampling distribution of the parameters and compute confidence intervals.
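
The loop described above could be sketched as follows.  This is an illustration under stated assumptions, not the planned implementation: the data are synthetic, the model is a one-variable least-squares fit with np.polyfit rather than the full StatsModels regression, the interval is a simple percentile interval, and the helper names resample, fill_missing and run_once are hypothetical.

```python
import numpy as np
import pandas as pd


def resample(df, rng):
    """Step 1: draw rows with replacement, weighted by pspwght."""
    probs = (df["pspwght"] / df["pspwght"].sum()).to_numpy()
    picks = rng.choice(len(df), size=len(df), replace=True, p=probs)
    return df.iloc[picks].reset_index(drop=True)


def fill_missing(df, col, rng):
    """Step 2: fill NaNs with random draws from the observed values.
    Assumes the column has at least one observed value."""
    values = df[col].to_numpy(dtype=float, copy=True)
    missing = np.isnan(values)
    observed = values[~missing]
    values[missing] = rng.choice(observed, size=int(missing.sum()))
    return values


def run_once(df, rng):
    """Steps 1-3: resample, fill, and estimate the slope for netuse."""
    sample = resample(df, rng)
    x = fill_missing(sample, "netuse", rng)
    y = fill_missing(sample, "rlgdgr", rng)
    slope, intercept = np.polyfit(x, y, 1)
    return slope


# Synthetic data with a known negative slope and 10% missing netuse.
rng = np.random.default_rng(42)
n = 2000
netuse = rng.integers(0, 8, n).astype(float)
rlgdgr = 6 - 0.14 * netuse + rng.normal(0, 2, n)
netuse[rng.random(n) < 0.1] = np.nan
df = pd.DataFrame({
    "netuse": netuse,
    "rlgdgr": rlgdgr,
    "pspwght": rng.uniform(0.5, 1.5, n),
})

# Repeat the whole process and summarize the distribution of estimates.
estimates = np.array([run_once(df, rng) for _ in range(201)])
low, median, high = np.percentile(estimates, [2.5, 50, 97.5])
print(f"slope: {median:.3f}  95% interval: ({low:.3f}, {high:.3f})")
```

Because each run repeats both the resampling and the filling, the spread of the estimates reflects both sources of random error.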

Once I have implemented that, I plan to translate the results into a form that is easier to interpret (rather than just estimated coefficients), and generate visualizations to make the results easier to explore.

I would also like to relate the effect of Internet use in each country to the average level of religiosity, to see whether, for example, the effect is bigger in more religious countries.

While I am working on that, I am open to suggestions for additional explorations people might be interested in.  You can explore the variables in the ESS using their "Cumulative Data Wizard"; let me know what you find!


  1. Nice blog Allen! Why did you replace each value of `eduyrs` and `hinctnta` with its rank (from 0-1)? Is this a particular technique used when controlling variables in a regression model?

    1. Good question. I did that partly to deal with data issues (there are some suspiciously large values in eduyrs), partly to standardize the variables (which makes it easier to interpret the estimated parameters), and partly to make these variables relative to other respondents from the same country, to avoid comparisons across countries with different levels of income and education.