June 10, 2024

In which I try and figure out whether there are house effects in UK opinion polls and what that tells me.

There’s quite a lot of interest in opinion polls in the UK at present - rather unexpectedly we’re having an election in a few weeks’ time, and although reading the news does me no good at all, looking at the polls is less likely to lead me into doomscrolling.

When I look at the polls there seems to be lots of variation - but when I look at polls by the same pollster there seems to be less change. So I wondered if I could be a bit more systematic about it. There are lots of sites that aim to bring the polls together, but they don’t say much about how they do it, so I wondered if a DIY approach was possible.

This post is very much a work in progress. It's here in the interests of getting suggestions and comments, rather than because it is anything like finished (and because I haven't posted anything in ages).

Conveniently Wikipedia has a page that collates all the polls - less conveniently it’s only presented as a webpage (as far as I know).

Getting the data

I used the rvest and xml2 packages to pull down the data and parse it.

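library(tidyverse) # dplyr, tidyr, stringr, forcats, ggplot2
library(rvest); library(lubridate) # html_table() and dmy()
web_page_url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2024_United_Kingdom_general_election"
webpage <- xml2::read_html(web_page_url) # Download the webpage and parse as XML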

I then used html_table from the rvest package to get a list of all the tables on the page. Then I manually inspected the tables and compared them to the webpage to work out which table I wanted (and hoped that the structure of the table didn’t change…)

It turned out that there is more than one table - there is one for each year: 2024, 2023, 2022, 2021, 2020. So I actually need 5 tables, numbers 2 - 6.
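Since the same tidying will be needed for every year, one option is to stack the yearly tables into a single data frame with a year column first. This is only a sketch on two made-up tables: the pollster names and values are invented, and on the real page the list would come from html_table(webpage), with the table indices checked by hand.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for two of the yearly tables; all values are made up.
# Columns are character because the raw cells contain "%" signs.
tables_by_year <- list(
  "2024" = tibble(
    Pollster = c("A", "B"),
    Datesconducted = c("7–9 Jun", "5–6 Jun"),
    Con = c("20%", "23%")
  ),
  "2023" = tibble(
    Pollster = "A",
    Datesconducted = "28–29 Dec",
    Con = "25%"
  )
)

# bind_rows() stacks the tables and stores each list name in a `year` column
polls_all_raw <- bind_rows(tables_by_year, .id = "year") %>%
  mutate(year = as.integer(year))

polls_all_raw
```

Keeping the year alongside each row matters later, because the date column on the page only gives day and month.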

Let’s start by working out what to do with one of the tables, the one for 2024.

# test <- html_table(webpage)

polls_2024_raw <- html_table(webpage)[[3]] %>% 
  tibble::as_tibble(.name_repair = "unique")

This is great, but there are a few problems when you look at the data.

  • Good news - the spanning rows just get the content repeated in each cell, so they get parsed in a way that is easy to deal with
  • Bad news - some cells contain footnotes that are included as text in the cell: “4%[a]”
  • Bad news - the cells have ‘%’ after all the percentages rather than being a number
  • Bad news - there are ‘in cell’ comments in the ‘Other’ category that generally follow the % sign
  • Bad news - the dates are really non-standard - as each is a range of dates
  • One cell in the ‘Other’ category is not formatted correctly because of a parsing error, so doesn’t get cleaned
  • Dates that are 31st don’t seem to be parsed correctly and end up as NAs
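To make the footnote and percent-sign problems in the list above concrete, here is a small sketch on a few made-up cells (the cell contents are invented, not taken from the page). It strips the first "%" and everything after it, then converts what is left to a number.

```r
library(stringr)

# Made-up cells of the kinds described in the list above
cells <- c("21%", "4%[a]", "5% Plaid Cymru on 1%", "–")

# Drop the first "%" and everything after it, which also removes
# footnote markers and in-cell comments that follow the percentage
cleaned <- str_remove(cells, "%.*")

# Convert to numbers; anything that is not a number (such as "–") becomes NA
values <- suppressWarnings(as.numeric(cleaned))

values
```

The NA for the last cell is the same "NAs introduced by coercion" behaviour that as.numeric warns about when it meets text it cannot convert.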

The next bit of code tries to tidy these things up.

polls_2024 <- polls_2024_raw %>% 
  mutate(across(Con:Others, ~ str_remove(.x, pattern = "\\%.*"))) %>%  # getting rid of anything after %, so also gets rid of footnotes
  mutate(across(Con:Others, ~ str_remove(.x, pattern = "\\s.*"))) %>% # getting rid of anything after " " gets rid of comments in 'Other' category
  mutate(across(Con:Others, as.numeric)) %>% # Turn all polling data into numbers from characters - also has the advantage of making all the 'spanning' commentary columns into 'NA' cells
  drop_na(Con) %>%  # Drop any row in which the percentage support for 'Con' is NA which drops the rows which were 'spanning' rows about political events
  mutate(day = str_replace(Datesconducted, "^(\\d+).*", "\\1")) %>% # Extract the number at the start of the string
  mutate(month = str_replace(Datesconducted, "(.*)(\\b\\S+)$", "\\2")) %>% # Extract the month at the end of the string
  mutate(date = dmy(paste0(day, " ", month, " ", 2024)))
Warning: There were 7 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(Con:Others, as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = dmy(paste0(day, " ", month, " ", 2024))`.
Caused by warning:
!  9 failed to parse.

This is still producing warnings suggesting it needs more work, but let’s push on and see what we can see with imperfect data wrangling.
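One likely cause of the date failures: the code above pairs the first day in the range with the last month, so a range that crosses a month boundary such as "31 May – 2 Jun" becomes "31 Jun", which is not a real date. The same pairing would put a range like "28 May – 2 Jun" on 28 June, which could account for polls appearing to be in the future. A minimal sketch of one alternative, on made-up date strings, takes the final "day month" pair so that both parts come from the end date of the fieldwork:

```r
library(stringr)
library(lubridate)

# Made-up date ranges in the style of the Wikipedia table
dates_conducted <- c("7–9 Jun", "31 May – 2 Jun", "10 Jun")

# First day + last month: the second range turns into "31 Jun 2024",
# which dmy() cannot parse, so it comes back as NA
naive <- suppressWarnings(dmy(paste(
  str_extract(dates_conducted, "^\\d+"),
  str_extract(dates_conducted, "\\S+$"),
  2024
)))

# Final "day month" pair: day and month now belong to the same date
end_date <- dmy(paste(str_extract(dates_conducted, "\\d+\\s+\\S+$"), 2024))

naive
end_date
```

Using the end of fieldwork is a choice, not the only option; the start date or the midpoint would work too, as long as day and month are taken from the same end of the range.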

Let’s take a look and see what the data looks like:

ggplot(polls_2024, aes(x = date, y = Con, colour = Pollster)) +
  geom_point() +
  ylim(0, 30)

So we have lots of polls, lots of pollsters - and some of our polls seem to be in the future, and some pollsters have footnote labels attached to their names suggesting I haven’t got the parsing quite right yet. Let’s facet by pollster to see if we can see more about what is going on.
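The footnote labels on pollster names could be removed in the same spirit as the percentage cells. A small sketch, using made-up names (which pollsters actually carry a footnote on the page is not something I have checked here):

```r
library(stringr)

# Made-up pollster names, one with a footnote marker attached
pollsters <- c("YouGov", "Savanta[b]", "More in Common")

# Remove any bracketed footnote marker such as "[b]", then tidy whitespace
pollsters_clean <- str_squish(str_remove_all(pollsters, "\\[[^\\]]*\\]"))

pollsters_clean
```

Without this step a pollster with and without a footnote is treated as two different pollsters, which splits its polls across two groups.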

ggplot(polls_2024, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
  geom_line() +
  ylim(0, 30) +
  facet_wrap(~ Pollster)

There are some pollsters with lots of polls and some with far fewer - let’s look at how many polls each pollster has, as I think I only care about pollsters with lots of polls.

polls_2024 %>% 
  group_by(Pollster) %>% 
  summarise(num_polls = n()) %>% 
  ggplot(aes(y = num_polls, x = fct_reorder(Pollster, num_polls))) + # Puts them in order of most polls
  geom_bar(stat = "identity")

Let’s cut off at 10 polls.

polls_2024_sel <- polls_2024 %>% 
  group_by(Pollster) %>% 
  mutate(num_pols = n()) %>% 
  filter(num_pols >= 10)
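An equivalent way to apply a cut-off like this is dplyr's add_count(), which appends the per-pollster count without leaving the data grouped. A sketch on toy data (the values are made up, and the cut-off is lowered to 3 so the toy example has something to keep):

```r
library(dplyr)
library(tibble)

# Toy data: pollster "A" has three polls, "B" has one (values are made up)
polls <- tibble(
  Pollster = c("A", "A", "A", "B"),
  Con = c(20, 21, 19, 24)
)

# add_count() adds a num_polls column holding each pollster's row count
polls_sel <- polls %>%
  add_count(Pollster, name = "num_polls") %>%
  filter(num_polls >= 3) # the post uses a cut-off of 10

polls_sel
```

The result is an ordinary ungrouped tibble, so later summaries are not silently computed per pollster.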

# What does the data look like by pollster

ggplot(polls_2024_sel, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
  geom_line() +
  ylim(0, 30)

There do seem to be house effects - putting different pollsters higher or lower, but there also seem to be house effects in how the support is shifting - this is clearer with a faceted smoothed plot.

ggplot(polls_2024_sel, aes(x = date, y = Con, group = Pollster, colour = Pollster)) +
  geom_smooth() +
  ylim(0, 30) +
  facet_wrap(~ Pollster)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

So it’s pretty clear there are house effects - the next challenge is to work out how we take that into account. I feel a multi-level model coming on.
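As a rough idea of what such a model might look like, here is a sketch using the lme4 package on simulated data. Everything in it is an assumption for illustration: the data are simulated rather than real polls, the pollster labels A to E are placeholders, and a random intercept per pollster is only the simplest form a house effect could take.

```r
library(lme4)

set.seed(1)

# Simulated polls, not real data: five pollsters, each with a fixed offset
true_offset <- c(A = -3, B = -1, C = 0, D = 1, E = 3)
sim <- expand.grid(
  pollster = names(true_offset),
  day = 1:30,
  stringsAsFactors = FALSE
)
sim$con <- 22 - 0.05 * sim$day +
  unname(true_offset[sim$pollster]) +
  rnorm(nrow(sim), sd = 1)

# Shared linear trend over time plus a random intercept for each pollster
fit <- lmer(con ~ day + (1 | pollster), data = sim)

# Estimated house effects: one row per pollster, relative to the average
house_effects <- ranef(fit)$pollster
house_effects
```

A random intercept only captures pollsters sitting consistently higher or lower. House effects in how support shifts over time would need a random slope as well, for example (1 + day | pollster), which is where it starts to get complicated.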

At which point I think the appropriate phrase is: “Tune in next time”

Multi-level models are still looking appropriate, but also complicated. But what I realised today is that the most sensible thing to test is not an individual party’s vote, but the poll lead - which should capture the house effects on the polling for both the Conservative and Labour parties.

ggplot(polls_2024_sel, aes(x = date, y = as.numeric(Lead), group = Pollster, colour = Pollster)) +
  geom_smooth() +
  ylim(0, 30) +
  facet_wrap(~ Pollster)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

So it looks like everyone thinks the lead has been steadily increasing - except for ‘More in Common’ and, more dramatically, YouGov.

ggplot(polls_2024_sel, aes(x = date, y = as.numeric(Lead), group = Pollster, colour = Pollster)) +
  geom_smooth(se = FALSE) +
  ylim(0, 30) 
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

A bit clearer in the non-faceted plot. Well, not long ’til we find out who was right.