-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathEU-voting-preferences.Rmd
478 lines (361 loc) · 16.5 KB
/
EU-voting-preferences.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
---
title: "EU-voter-preferences"
author: "Joseph Arber"
date: "21/06/2020"
output:
html_document:
theme: cosmo
highlight: tango
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# What determines support for the European Union?
## Introduction
In the aftermath of the UK public’s vote to leave the EU in the 2016 referendum, much attention has been paid to whether support for the EU varies predictably across different types of individuals. In this question, you will use an appropriate binary dependent variable model to improve our understanding of which types of citizens are more or less likely to vote to leave the European Union if a referendum on membership were to be held in their country.
The data for this question comes from the 2016 European Social Survey (ESS) and includes information on the political attitudes and demographics of European citizens.
The question given to survey particpants was: "Imagine there were a referendum in your country tomorrow about membership of the European Union. Would you vote for your country to remain a member of the European Union or to leave the European Union?"
```{r Package dependencies}
#Important packages for analysis and modelling
library(tidyverse)
library(tidyr)
library(dplyr)
library(texreg)
library(foreign)
library(ggplot2)
#library(glmnet)
#library(hrbrthemes)
library(kableExtra)
options(knitr.table.format = "html")
library(patchwork)
```
### Loading the Data
Above we have loaded the required packages for this analysis. We can now load in the data:
```{r load data, message=FALSE}
library(readr)
ess <- read_csv("data/ess.csv")
```
### Data Manipulation
We need to complete some data wrangling. There are several variables that should be converted into factor variables, this will aid the regression modelling later on, but will also provide clearer labels for the categories within the features.
```{r Data manipulation}
#Variable coercion
str(ess$trade_union)
table(ess$trade_union)
#Turn trade union into a factor variable
ess$trade_union <-factor(ess$trade_union, levels = c(0,1), labels = c("Non-Member", "Member"))
summary(ess$trade_union)
class(ess$trade_union)
#Variable coercion
str(ess$unemployed)
table(ess$unemployed)
#Turn unemployed into a factory variable
ess$unemployed <-factor(ess$unemployed, levels = c(FALSE,TRUE), labels =c("Employed", "Unemployed"))
summary(ess$unemployed)
class(ess$unemployed)
```
There are a total of **12557** respondents who are employed, whilst there are only **500** who are unemployed. On the other hand there are around 50000 trade union members compared to 80000 non-members. We should remember these insights for the following analysis.
```{r More data manipulation}
#Take a look at the level of country attachment
str(ess$country_attach)
summary(ess$country_attach)
#Let's sequence the country attachment variable
attach_country<-seq(0,10, length.out = 100)
summary(ess$country_attach)
str(ess$country_attach)
#Take a look at the leave variable
str(ess$leave)
table(ess$leave)
#Coerce leave into a factor variable
ess$leave <- factor(ess$leave, levels =c(0,1), labels = c("no","yes"))
summary(ess$leave)
table(ess$leave)
str(ess$leave)
```
We used the 'seq' generator function in R, it is useful for creating proportional sequences with a given length. The rationale for doing this is that we will be able to draw more insighful conclusions by spreading the variable over a length of 100 rather than 10. The package is referenced here:
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/seq
By coercing the leave variable into a factor variable we can see that amount of people that would vote to leave was around **2308** whilst **10767** would vote to remain.
### Exploratory Analysis & Visualization
What is the vote split in the dataset?
```{r Visualization 1}
library(ggplot2)
ggplot(data = ess, aes(x = leave)) +
geom_bar()
```
Let's try to segment the main demographic groups in the dataset. This should help us in the later modelling phases.
#### Religion and the European Union
We will look at the categories in the religion feature.
**Religious segmentations**
```{r}
table(ess$religion)
```
Including an 'Other' category, there are 5 major religions. Lets now examine which religious groups across Europe are more opposed to the EU as an institution.
```{r Segmenting by Voter Preference}
leave_vote <- ess %>%
filter(leave == "yes")
leave_vote
#plot to see which religions are more opposed to the EU.
library(ggplot2)
ggplot(data = leave_vote, aes(x = religion, fill = religion)) +
geom_bar()
```
Generally **Muslims** and **Jews** are more supportive of the EU, whilst **Roman Catholics** are more opposed. Let's look at voting preferences within a religious segmentation. To do this we have to create a bucket that contains the values for all Muslims who particpated in the survey. We use the pipe operator to do this,
```{r Exploratory Analysis: Islam}
Islamic_view <- ess %>%
filter(religion == "Islamic")
Islamic_view
```
Now lets visualise the split in voter preferences for Muslims.
```{r Visualization of Islam}
library(ggplot2)
ggplot(data = Islamic_view, aes(x = leave, y = "Islamic")) +
geom_col() + ggtitle("Voter Preferences for Muslims") + xlab("Vote to Leave") + ylab("Count")
```
We are going to do exactly the same process as did above for the Catholic segementation.
```{r Exploratory Analysis: Catholic}
Catholic_view <- ess %>%
filter(religion == "Roman Catholic")
Catholic_view
```
```{r Visualization: Catholic}
library(ggplot2)
ggplot(data = Catholic_view, aes(x = leave, y = "Roman Catholic")) +
geom_col() + ggtitle("Voter Preferences for Catholics") + xlab("Vote to Leave") + ylab("Count")
```
#### Gender and the European Union
After exploring the relgious groups views on Europe, it is worth considering if there is any variation between genders.
```{r}
ggplot(data = ess, aes(x = leave, y = gender, col = gender)) +
geom_col() + ggtitle("Voter Preferences by Gender") + xlab("Vote to Leave") + ylab("Count")
```
Not much variation.
We will now analyse some of the other featurs in this dataset. Lets take a quick look at the dataset to remind ourselves of these features.
```{r}
colnames(ess)
```
```{r}
grey_theme <- theme(axis.text.x = element_text(colour="grey20", size = 12,
angle = 90, hjust = 0.5,
vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text=element_text(size = 16))
ggplot(ess, aes(x = years_education, y = news_consumption)) + geom_point() + grey_theme + geom_jitter(alpha = 0.3)
```
#### More Visualization
Let's do some more plotting. We should be able to compare leave votes by trade union membership and employment status.
```{r Exploratory Visualisation}
#Leave by employment status
ggplot(data = ess, aes(x = unemployed, y = leave)) +
geom_col()
#Visualization 2
library(ggplot2)
ggplot(ess,aes(x = years_education, y = leave, color = unemployed)) +
geom_jitter(width = 0, height = 0.09, alpha = 0.7)
```
```{r}
#Member of trade union?
ggplot(data = ess, aes(x = trade_union, y = leave)) +
geom_col()
```
### Data Partition
Using the caret package we will split the data into training and test sets.
```{r}
#load caret package
library(caret)
#set the random seed
set.seed(123)
#perform train test split
training.samples <- ess$leave %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- ess[training.samples, ]
test.data <- ess[-training.samples, ]
head(train.data)
```
## Modelling
Lets see what features are the best predictors of EU voter preferences. We will first test trade union membership as a predictor in voting leave in EU elections.
Throughout the modelling we will specify the ```family``` argument as ```binomial```. This is because we are trying to predict the odds of an event taking place. Binomial logistic regression is a particular type of logistic regression in which the dependent variable y is a discrete random variable that takes on values such as 0, 1, 5, 67 etc. Each value represents the number of ‘successes’ observed in m trials. Thus y follows the binomial distribution.
### Model 1: Voter preferences by membership of trade union
```{r}
#Logistic Regression Modelliing
library(aod)
#Model 1
logit_M1 <- glm(leave ~ trade_union + years_education + country_attach + eu_integration, data = train.data, family = binomial(link = "logit"))
screenreg(logit_M1)
summary(logit_M1)
```
```{r}
probabilities <- logit_M1 %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
#predicted.classes
#mean(predicted.classes==test.data$leave)
```
#### Model 1: Results and Evaluation
The first logistic regression model produced some interesting results. We can conclude that trade union membership is positively correlated with a decision to vote to leave in the EU survey. This would support Coulter (2016) who argues that trade unions as interest groups, particuarly in the UK have tended to be more sceptical of EU integration.
Paper: http://eprints.lse.ac.uk/68929/1/LEQSPaper121Coulter.pdf
We now run confidence interval tests on the model. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function. We can also get CIs based on just the standard errors by using the default method.
```{r}
#Confidence intervals
confint(logit_M1)
#Confidence intervals with standard error
confint.default(logit_M1)
```
We can also test for an overall effect of rank using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. We use the wald.test function. b supplies the coefficients, while Sigma supplies the variance covariance matrix of the error terms.
```{r}
#Wald test
library(car)
Anova(logit_M1, type="II", test="Wald")
```
The results from the wald/chi-square tests would suggest that predictor feature variables are indeed significant. We can now proceed to the next stages of the modelling.
### Model 2: Voter preferences by employment status
For this model we will play particular attention to news consumption levels and how this interacts with trust for politicians and emotional attachment to a country.
```{r}
#Model 2
logit_M2 <- glm(leave ~ unemployed + country_attach + news_consumption + trust_politicians, data = train.data, family = binomial(link = "logit"))
summary(logit_M2)
screenreg(logit_M2)
```
The feature variable unemployed is associated with a 0.10% increase in the likelihood of voting leave. However, the variable is not significant. Lets take a look at the confidence intervals for the model.
#### Model 2: Results and Evaluation
```{r}
#Confidence intervals
confint(logit_M2)
#Confidence intervals with standard error
confint.default(logit_M2)
```
```{r}
#Wald test
library(aod)
wald.test(b = coef(logit_M2), Sigma = vcov(logit_M1), Terms = 1:4)
```
### Full Logistic Regression Model
Let's programme a full logistic regression that tests all the predictors.
```{r}
full.model <- glm(leave ~., data = train.data, family = binomial)
coef(full.model)
```
Country code seems to be highly correlated. Lets remove it from the dataset and try again.
```{r}
train.data1 = train.data[!grepl("^country_code",names(train.data))]
colnames(train.data1)
```
We run the full model again.
```{r}
full.model <- glm(leave ~., data = train.data1, family = binomial)
coef(full.model)
```
#### Perform stepwise variable selection
Select the most contributive variables:
```{r}
library(MASS)
step.model <- full.model %>% stepAIC(trace = FALSE)
coef(step.model)
```
There are a number of variables that seem to be highly correlated with the voting preferences. In particular, years education, concerns about the economic impact of immigration and concerns about the cultural impact of immigration. Indeed, an ongoing academic discussion focusses on whether cultural or economic concerns about immigration are more important as predictors of support for the European Union.
### Model 3: Likelihood to vote leave by attitudes towards immigration
```{r}
logit_M3 <- glm(leave ~ immig_econ + immig_culture, data = train.data, family = binomial(link = "logit"))
summary(logit_M3)
screenreg(logit_M3)
```
#### Model 3: Results and Evaluation
```{r}
#Confidence intervals
confint(logit_M3)
#Confidence intervals with standard error
confint.default(logit_M3)
```
### Model 4: Voter preferences by number of years of education and EU integration level
```{r}
logit_M4 <- glm(leave ~ years_education + eu_integration, data = train.data, family = binomial(link = "logit"))
summary(logit_M4)
screenreg(logit_M4)
```
#### Model 4: Results and Evaluation
```{r}
#Confidence intervals
confint(logit_M4)
#Confidence intervals with standard error
confint.default(logit_M4)
```
### Model 5: Voter preferences by number of years of education and attachment to the country
```{r}
logit_M5 <- glm(leave ~ years_education + country_attach, data = train.data, family = binomial(link = "logit"))
summary(logit_M5)
screenreg(logit_M5)
```
## Predicted Probablities
```{r}
#FitStatistics
mean(train.data$leave)
summary(train.data$leave)
#Fitted Values and Predicted Probabilities
train.data$pps1 <- predict(logit_M1, newdata = train.data, type = "response")
train.data$evs1 <- ifelse(train.data$pps1 > 0.5, yes = 1, no = 0)
```
### Confusion Matrix: Model Accuracy
```{r}
#Confusion matrix to find model fit - actual outcomes
confusion <- table(actual = train.data$leave, expected.value = train.data$evs1)
confusion #Expected values for leave and remain
sum(diag(confusion)) / sum(confusion)
```
Our model succesfully predicted with 82% accuracy.
```{r}
#Likelihood to vote 'leave'; EU integration and 13 years of education
eu_integration_0<- predict(
logit_M4,
newdata = data.frame(years_education = 10, eu_integration = 5),
type = "response"
)
eu_integration_0
#ikelihood to vote 'leave'; EU integration and 20 years of education
eu_integration_10<- predict(
logit_M4,
newdata = data.frame(years_education = 20, eu_integration = 5),
type = "response"
)
eu_integration_10
```
Liklihood to vote leave is 10% given education at university level compared to around 17% with 10 years of education. However, it is clear the feature, eu_integration is far a more signficant predictor in to vote leave.
**Attachment to country by years of education**
```{r}
#Predicted probabilities of voting leave for those who a strongly attatched to their country
country_attatchment1 <- predict(logit_M5,
newdata = data.frame(country_attach = 10, years_education = 20),
type = "response"
)
country_attatchment1
```
There is a 10% chance of voting to leave with 20 years of education and high attachment to the country.
```{r}
country_attatchment2 <- predict(logit_M5,
newdata = data.frame(country_attach = 10, years_education = 13),
type = "response"
)
country_attatchment2#Those less emotionally attatched more likley to vote to remain
#Difference between the predicted probabilities
country_attatchment1 - country_attatchment2
```
On the other hand for those who have had less education but a strongly attached the country there is a 15% chance of voting to leave.
### Plots
```{r}
#Sequence years education
years_education_profiles <- data.frame(years_education = seq(from = 0, to = 54, by = .5),eu_integration = 0)
head(years_education_profiles)
#create a new dataframe for years education profiles
years_education_profiles$predicted_probs <- predict(logit_M4, newdata = years_education_profiles, type = "response")
```
Lets now plot the relationship between years education and voter preferences.
```{r}
#Plot 1: Voting Leave by Years of Education:
ggplot(years_education_profiles, aes(x = years_education, y = predicted_probs)) +
geom_line(alpha = 0.5) + ylab("Probability of Voting Leave") + xlab("Number of Year of Education") + ggtitle("Voting Leave by Years of Education")
```
Plot the standardized residuals for the full model.
```{r}
plot(fitted(full.model),
rstandard(full.model))
```
### Conclusion: Findings and Future Work
The analysis above shows the complexity in predicting attitudes towards politics in general. With that said we can make some kind of conclusion that years of education and attitudes toward immigration are strong predictors of attitudes towards the EU. Going forward it would be interesting to see if we could predict which way a respondent would vote based on one or two features. We could use classification algorithms such as MNB or SVM.