Mix your own opinion poll – Thinking it through

In which I try and figure out whether there are house effects in UK opinion polls and what that tells me.

There’s quite a lot of interest in opinion polls in the UK at present. Rather unexpectedly we’re having an election in a few weeks’ time, and although reading the news does me no good at all, looking at the polls is less likely to lead me into doom scrolling.

When I do look at the polls there seems to be lots of variation - but when I look at polls by the same pollster it looks like there might be less change. So I wondered if I could be a bit more systematic about it. There are lots of sites that aim to bring the polls together, but they don’t say much about how they do it, so I wondered if a DIY approach was possible.

This post is very much a work in progress; it's here in the interests of getting suggestions and comments, rather than because it is anything like finished (and because I haven't posted anything in ages).

Conveniently, Wikipedia has a page that collates all the polls - less conveniently, it’s only presented as a webpage (as far as I know).

Getting the data

I used the rvest and xml2 packages to pull down the data and parse it.

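library(rvest)     # html_table()
library(tidyverse) # dplyr, tidyr, stringr, forcats, ggplot2 and (from tidyverse 2.0.0) lubridate
web_page_url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2024_United_Kingdom_general_election"
webpage <- xml2::read_html(web_page_url) # Download the webpage and parse as XML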

I then used html_table from the rvest package to get a list of all the tables on the page. Then I manually inspected the tables and compared them to the webpage to work out which table I wanted (and hoped that the structure of the table didn’t change…)

It turned out that there is more than one table - there is one for each year: 2024, 2023, 2022, 2021, 2020. So I actually need 5 tables, numbers 2 - 6.
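The “which table do I want?” step can be made a little less manual. As a minimal sketch, using a tiny made-up page built with rvest's minimal_html rather than the real Wikipedia page, this lists the column names of every table and picks out the ones that look like poll tables. The names page, table_headers and poll_table_index are just for illustration, and the column names on the live page may well differ.

```r
library(rvest)

# Tiny stand-in page with two tables (made-up data, NOT the Wikipedia page)
page <- minimal_html('
  <table>
    <tr><th>Party</th><th>Seats</th></tr>
    <tr><td>A</td><td>10</td></tr>
  </table>
  <table>
    <tr><th>Pollster</th><th>Con</th><th>Lab</th></tr>
    <tr><td>X</td><td>20%</td><td>44%</td></tr>
  </table>
')

tables <- html_table(page)              # one tibble per <table> on the page
table_headers <- lapply(tables, names)  # column names, in table order

# Assumption: a poll table is one that has both of these columns
is_poll_table <- vapply(
  table_headers,
  function(h) all(c("Pollster", "Con") %in% h),
  logical(1)
)
poll_table_index <- which(is_poll_table)
print(poll_table_index)
```

Checking the headers like this still relies on the page keeping its column names, but it is a bit more robust than hard-coding a position in the list.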

Let’s start by working out what to do with one of the tables, the one for 2024.

# test <- html_table(webpage)

polls_2024_raw <- html_table(webpage)[[3]] %>% 
  tibble::as_tibble(.name_repair = "unique")

This is great, but there are a few problems when you look at the data.

Good news - the spanning rows just get the content repeated in each cell, so get parsed in a way that is easy to deal with
Bad news - some cells contain footnotes that are included as text in the cell: “4%[a]”
Bad news - the cells have ‘%’ after all the percentages rather than being a number
Bad news - there are ‘in cell’ comments in the ‘Other’ category that generally follow the % sign
Bad news - the dates are really non-standard - as each is a range of dates
Bad news - one cell in the ‘Other’ category is not formatted correctly because of a parsing error, so doesn’t get cleaned
Bad news - dates that are 31st don’t seem to be parsed correctly and end up as NAs
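To make the string problems above concrete, here is a small sketch on a few made-up cell values (they are illustrative, not copied from the table). It strips everything from the first % onwards, then everything after the first space, then converts to a number, so that footnotes and in-cell comments disappear and pure-text rows become NA.

```r
library(stringr)

# Made-up examples of the kinds of cell content described above
raw <- c("24%", "4%[a]", "7%SNP on 3%", "General election announced")

step1 <- str_remove(raw, "%.*")      # drop the % and anything after it
step2 <- str_remove(step1, "\\s.*")  # drop anything after the first whitespace

# Text that is not a number becomes NA (as.numeric warns about this, hence suppressWarnings)
cleaned <- suppressWarnings(as.numeric(step2))
print(cleaned)
```

The NA for the last value is what lets the ‘spanning’ commentary rows be dropped later with drop_na.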

The next bit of code tries to tidy these things up.

polls_2024 <- polls_2024_raw %>% 
  mutate(across(Con:Others, ~ str_remove(.x, pattern = "\\%.*"))) %>%  # getting rid of anything after %, so also gets rid of footnotes
  mutate(across(Con:Others, ~ str_remove(.x, pattern = "\\s.*"))) %>% # getting rid of anything after " " gets rid of comments in 'Other' category
  mutate(across(Con:Others, as.numeric)) %>% # Turn all polling data into numbers from characters - also has the advantage of making all the 'spanning' commentary columns into 'NA' cells
  drop_na(Con) %>%  # Drop any row in which the percentage support for 'Con' is NA which drops the rows which were 'spanning' rows about political events
  mutate(day = str_replace(Datesconducted, "^(\\d+).*", "\\1")) %>% # Extract the number at the start of the string
  mutate(month = str_replace(Datesconducted, "(.*)(\\b\\S+)$", "\\2")) %>% # Extract the month at the end of the string
  mutate(date = dmy(paste0(day, " ", month, " ", 2024)))

Warning: There were 7 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(Con:Others, as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = dmy(paste0(day, " ", month, " ", 2024))`.
Caused by warning:
!  9 failed to parse.

This is still producing warnings suggesting it needs more work, but let’s push on and see what we can see with imperfect data wrangling.
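One likely cause of the date failures is that the code above takes the day from the start of the range and the month from the end, so a range that crosses a month boundary gets a mismatched pair. A minimal sketch of one possible fix, on made-up date strings, is to take the day that sits directly before the final month, giving the end date of the fieldwork. The regular expression and the names end_day, end_month and naive_date are my own illustration; it assumes every range ends in "day month", an English locale for month names, and that all dates fall in 2024.

```r
library(stringr)
library(lubridate)

# Made-up examples in the same style as the 'Datesconducted' column
dates_raw <- c("3–4 Jun", "31 May – 2 Jun", "28 Jun – 1 Jul", "30 May")

# Start day + end month: the mismatch that goes wrong across month boundaries
start_day  <- str_replace(dates_raw, "^(\\d+).*", "\\1")
end_month  <- str_extract(dates_raw, "\\S+$")
naive_date <- dmy(paste(start_day, end_month, 2024), quiet = TRUE)

# End day + end month: the digits immediately before the final word
end_day  <- str_extract(dates_raw, "\\d+(?=\\s+\\S+$)")
end_date <- dmy(paste(end_day, end_month, 2024))

print(data.frame(dates_raw, naive_date, end_date))
```

In this toy data the naive version turns "31 May – 2 Jun" into the non-existent 31 June (an NA) and "28 Jun – 1 Jul" into 28 July, which would also explain polls appearing to be dated in the future. Ranges that cross a year boundary would need more care than this.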

Let’s take a look and see what the data looks like:

ggplot(polls_2024, aes(x = date, y = Con, colour = Pollster)) +
  geom_point() +
  ylim(0, 30) +
  theme(legend.position="bottom")

So we have lots of polls, lots of pollsters - and some of our polls seem to be in the future, and some pollsters have footnote labels attached to their names suggesting I haven’t got the parsing quite right yet. Let’s facet by pollster to see if we can see more about what is going on.

ggplot(polls_2024, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
  geom_line() +
  ylim(0, 30) +
  facet_wrap(~Pollster)

There are some pollsters with lots of polls and some with far fewer - let’s look at how many polls each pollster has, as I think I only care about pollsters with lots of polls.

polls_2024 %>% 
  group_by(Pollster) %>% 
  summarise(num_polls = n()) %>% 
  ggplot(aes(y = num_polls, x = fct_reorder(Pollster, num_polls))) + # Puts them in order of most polls
  geom_bar(stat = "identity") +
  coord_flip()

Let’s cut off at 10 polls.

polls_2024_sel <- polls_2024 %>% 
  group_by(Pollster) %>% 
  mutate(num_pols = n()) %>% 
  filter(num_pols >= 10)

# What does the data look like by pollster

ggplot(polls_2024_sel, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
  geom_line() +
  ylim(0, 30)

There do seem to be house effects - putting different pollsters higher or lower, but there also seem to be house effects in how the support is shifting - this is clearer with a faceted smoothed plot.

ggplot(polls_2024_sel, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
  geom_smooth() +
  ylim(0, 30) +
  facet_wrap(~Pollster)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

So it’s pretty clear there are house effects - the next challenge is to work out how we take that into account. I feel a multi-level model coming on.
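Short of a full multi-level model, a much cruder first pass is an ordinary linear model with a shared trend over time and a fixed offset for each pollster. The sketch below uses made-up, noise-free data (the pollsters A, B and C and their offsets are hypothetical), so it only shows the mechanics. It also assumes every pollster sees the same trend, which the plots above suggest is not quite true.

```r
# Made-up polls: a shared downward trend plus a fixed offset per pollster
toy <- expand.grid(
  day = seq(0, 60, by = 10),
  Pollster = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
true_offset <- c(A = 0, B = 2, C = -1.5)  # hypothetical house effects
toy$Con <- 25 - 0.05 * toy$day + unname(true_offset[toy$Pollster])

# One common slope for day, plus an intercept shift for each pollster
fit <- lm(Con ~ day + Pollster, data = toy)

# Offsets are reported relative to the first pollster (A)
house_effects <- coef(fit)[c("PollsterB", "PollsterC")]
print(round(house_effects, 2))
```

With real polls the offsets would come with uncertainty, and pollsters with few polls would be estimated poorly, which is part of the appeal of a multi-level model.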

At which point I think the appropriate phrase is: “Tune in next time”

Multi-level models are still looking appropriate, but also complicated. But what I realised today is that the most sensible thing to test is not an individual party’s vote, but the poll lead - which should capture the house effects that influence the polling for both the Conservative and Labour parties.

ggplot(polls_2024_sel, aes(x = date, y = as.numeric(Lead), group = Pollster, colour = Pollster)) +
  geom_smooth() +
  ylim(0, 30) +
  facet_wrap(~Pollster)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

So it looks like everyone thinks the lead has been steadily increasing - except for ‘More in Common’ and, more dramatically, YouGov.

ggplot(polls_2024_sel, aes(x = date, y = as.numeric(Lead), group = Pollster, colour = Pollster)) +
  geom_smooth(se = FALSE) +
  ylim(0, 30)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

A bit clearer in the non-faceted plot. Well, not long ’til we find out who was right.