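library(tidyverse) # dplyr, tidyr, stringr, forcats and ggplot2 (lubridate too, from tidyverse 2.0.0)
library(rvest) # html_table()
web_page_url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2024_United_Kingdom_general_election"
webpage <- xml2::read_html(web_page_url) # Download the webpage and parse as XML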
In which I try and figure out whether there are house effects in UK opinion polls and what that tells me.
There’s quite a lot of interest in opinion polls at present in the UK - rather unexpectedly we’re having an election in a few weeks’ time, and although reading the news does me no good at all, looking at the polls is less likely to lead me into doom scrolling.
When I look at the polls there seems to be lots of variation - but when I look at polls by the same pollster there seems to be less change. So I wondered if I could be a bit more systematic about it. There are lots of sites that aim to bring the polls together, but they don’t say much about how they do it, so I wondered if a DIY approach was possible.
This post is very much a work in progress, it's here in the interests of getting suggestions and comments, rather than because it is anything like finished (and because I haven't posted anything in ages).
Conveniently, Wikipedia has a page that collates all the polls - less conveniently, it’s only presented as a webpage (as far as I know).
Getting the data
I used the rvest and xml2 packages to pull down the data and parse it.
I then used html_table from the rvest package to get a list of all the tables on the page. Then I manually inspected the tables and compared them to the webpage to work out which table I wanted (and hoped that the structure of the table didn’t change…)
It turned out that there is more than one table - there is one for each year: 2024, 2023, 2022, 2021, 2020. So I actually need 5 tables, numbers 2 - 6.
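Once the cleaning works for one year, the per-year tables would eventually need stacking into a single data set. A minimal sketch of that step, using made-up toy tables in place of the scraped ones. `bind_poll_tables` is a hypothetical helper (not part of rvest), and it assumes the year for each table is known from its position on the page:

```r
library(dplyr)
library(purrr)

# Hypothetical helper: stack per-year tables, tagging each row with its year.
# Every column is converted to character first so that tables whose columns
# were guessed as different types can still be bound together.
bind_poll_tables <- function(tables, years) {
  map2(tables, years, function(tbl, yr) {
    tbl %>%
      tibble::as_tibble(.name_repair = "unique") %>%
      mutate(across(everything(), as.character), year = yr)
  }) %>%
    bind_rows()
}

# Toy stand-ins for two of the scraped tables (made-up values).
toy_tables <- list(
  tibble::tibble(Pollster = "A", Con = "20%"),
  tibble::tibble(Pollster = "B", Con = "25%")
)

all_polls <- bind_poll_tables(toy_tables, c(2024, 2023))
all_polls
```

Keeping everything as character until after binding means the percentage and footnote clean-up only has to be written once, for the stacked table.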
Let’s start by working out what to do with one of the tables, the one for 2024.
# test <- html_table(webpage)
polls_2024_raw <- html_table(webpage)[[3]] %>%
  tibble::as_tibble(.name_repair = "unique")
This is great, but there are a few problems when you look at the data.
- Good news - the spanning rows just get the content repeated in each cell, so get parsed in a way that is easy to deal with
- Bad news - some cells contain footnotes that are included as text in the cell: “4%[a]”
- Bad news - the cells have ‘%’ after all the percentages rather than being a number
- Bad news - there are ‘in cell’ comments in the ‘Other’ category that generally follow the % sign
- Bad news - the dates are really non-standard - as each is a range of dates
- One cell in the ‘Other’ category is not formatted correctly because of a parsing error, so doesn’t get cleaned
- Dates that are 31st don’t seem to be parsed correctly and end up as NAs
The next bit of code tries to tidy these things up.
polls_2024 <- polls_2024_raw %>%
  mutate(across(Con:Others, ~ str_remove(.x, pattern = "\\%.*"))) %>% # getting rid of anything after %, so also gets rid of footnotes
  mutate(across(Con:Others, ~ str_remove(.x, pattern = "\\s.*"))) %>% # getting rid of anything after " " gets rid of comments in 'Other' category
  mutate(across(Con:Others, as.numeric)) %>% # Turn all polling data into numbers from characters - also has the advantage of making all the 'spanning' commentary columns into 'NA' cells
  drop_na(Con) %>% # Drop any row in which the percentage support for 'Con' is NA which drops the rows which were 'spanning' rows about political events
  mutate(day = str_replace(Datesconducted, "^(\\d+).*", "\\1")) %>% # Extract the number at the start of the string
  mutate(month = str_replace(Datesconducted, "(.*)(\\b\\S+)$", "\\2")) %>% # Extract the month at the end of the string
  mutate(date = dmy(paste0(day, " ", month, " ", 2024)))
Warning: There were 7 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(Con:Others, as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = dmy(paste0(day, " ", month, " ", 2024))`.
Caused by warning:
! 9 failed to parse.
This is still producing warnings suggesting it needs more work, but let’s push on and see what we can see with imperfect data wrangling.
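A plausible cause of the failed date parses is that a range spanning two months pairs the first day with the last month, so something like "31 May – 2 Jun" becomes the impossible "31 Jun". One possible fix, not applied in the rest of this post, is to keep only the final day-and-month pair, which dates each poll by its last day of fieldwork. This is a sketch on made-up strings, assuming the year is 2024 and that month names parse in the current locale:

```r
library(stringr)
library(lubridate)

# Made-up examples of the date-range formats seen in the table.
dates_conducted <- c("3–5 Jan", "31 May – 2 Jun", "12 Jan")

# Keep only the final "day month" pair, i.e. the last day of fieldwork.
end_day_month <- str_extract(dates_conducted, "\\d+\\s+\\S+$")

# The year is assumed to be 2024, as in the rest of this section.
end_date <- dmy(paste(end_day_month, 2024))
end_date
```

Using the end date rather than the start date is a choice, not a requirement; the point is only that the day and month come from the same end of the range.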
Let’s take a look and see what the data looks like:
ggplot(polls_2024, aes(x = date, y = Con, colour = Pollster)) +
geom_point() +
ylim(0, 30) +
theme(legend.position="bottom")
So we have lots of polls, lots of pollsters - and some of our polls seem to be in the future, and some pollsters have footnote labels attached to their names suggesting I haven’t got the parsing quite right yet. Let’s facet by pollster to see if we can see more about what is going on.
ggplot(polls_2024, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
geom_line() +
ylim(0, 30) +
facet_wrap(~Pollster)
There are some pollsters with lots of polls and some with far fewer - let’s look at how many polls each pollster has, as I think I only care about pollsters with lots of polls.
polls_2024 %>%
  group_by(Pollster) %>%
  summarise(num_polls = n()) %>%
  ggplot(aes(y = num_polls, x = fct_reorder(Pollster, num_polls))) + # Puts them in order of most polls
  geom_bar(stat = "identity") +
  coord_flip()
Let’s cut off at 10 polls.
polls_2024_sel <- polls_2024 %>%
  group_by(Pollster) %>%
  mutate(num_pols = n()) %>%
  filter(num_pols >= 10)
# What does the data look like by pollster
ggplot(polls_2024_sel, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
geom_line() +
ylim(0, 30)
There do seem to be house effects - putting different pollsters higher or lower, but there also seem to be house effects in how the support is shifting - this is clearer with a faceted smoothed plot.
ggplot(polls_2024_sel, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
geom_smooth() +
ylim(0, 30) +
facet_wrap(~Pollster)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
So it’s pretty clear there are house effects - the next challenge is to work out how we take that into account. I feel a multi-level model coming on.
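Before reaching for a multi-level model, a much cruder estimate gives a feel for the idea: fit one pooled trend to all polls, then take each pollster's average residual from it. This is a simpler stand-in, not a multi-level model, and the sketch below runs on simulated data with made-up numbers (`true_offsets`, `sim_polls` and the linear trend are all assumptions for illustration):

```r
library(dplyr)

set.seed(1)

# Simulated polls: a shared downward trend plus a fixed offset per pollster,
# which plays the role of the house effect.
true_offsets <- c(A = -2, B = 0, C = 2)
sim_polls <- tibble::tibble(
  Pollster = rep(names(true_offsets), each = 30),
  day      = rep(1:30, times = 3),
  Con      = 25 - 0.1 * day + unname(true_offsets[Pollster]) + rnorm(90, sd = 1)
)

# Fit one pooled linear trend to all polls, ignoring who ran them.
pooled_trend <- lm(Con ~ day, data = sim_polls)

# A pollster's average residual from that trend is a crude house-effect estimate.
house_effects <- sim_polls %>%
  mutate(resid = unname(residuals(pooled_trend))) %>%
  group_by(Pollster) %>%
  summarise(house_effect = mean(resid), n_polls = n())

house_effects
```

On real data the trend is unlikely to be linear and pollsters poll at different times, which is part of why a proper multi-level model still looks like the right tool.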
At which point I think the appropriate phrase is: “Tune in next time”
Multi-level models are still looking appropriate, but also complicated. But what I realised today is that the most sensible thing to test is not an individual party’s vote, but the poll lead - which should capture the house effects on both the Conservative and Labour figures.
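The lead can also be derived from the cleaned party columns rather than read from the table's own text column. A small sketch on made-up numbers, assuming the cleaned data has a numeric `Lab` column alongside `Con`:

```r
library(dplyr)

# Made-up cleaned poll shares; `Lab` is assumed to be one of the party columns.
toy_polls <- tibble::tibble(
  Pollster = c("A", "A", "B"),
  Con = c(22, 20, 25),
  Lab = c(44, 45, 43)
)

toy_polls <- toy_polls %>%
  mutate(lead = Lab - Con) # Labour lead over the Conservatives, in percentage points

toy_polls
```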
ggplot(polls_2024_sel, aes(x = date, y = as.numeric(Lead), group = Pollster, colour = Pollster)) +
geom_smooth() +
ylim(0, 30) +
facet_wrap(~Pollster)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
So it looks like everyone thinks the lead has been steadily increasing - except for ‘More in Common’ and, more dramatically, YouGov.
ggplot(polls_2024_sel, aes(x = date, y = as.numeric(Lead), group = Pollster, colour = Pollster)) +
geom_smooth(se = FALSE) +
ylim(0, 30)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
A bit clearer in the non-faceted plot. Well, not long ’til we find out who was right.