---
title: "UCL Psychology NSS results - draft 1"
knit: (function(input_file, encoding) {
out_dir <- 'docs';
rmarkdown::render(input_file,
encoding=encoding,
output_file=file.path(dirname(input_file), out_dir, 'version1.html'))})
output: html_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
out.width = "90%",
fig.width = 6,
fig.height = 4,
comment = "#>"
)
```
This is a page analysing NSS data from 2014-2023, based on open-source data from the NSS website. The code, data and markdown doc that generated this page are all in an open GitHub repository: <https://github.com/dcr-eyethink/NSS>.
Here's how UCL Psychology has placed over the past few years in the league tables for the Complete University Guide, the Guardian and the Times.
![UCL psychology league tables](docs/assets/league_year.png)
So there are two messages here: (a) we have been pretty consistently in the second half of the top ten across the years, and (b) it looks like we've dipped in the past couple of years. So we're going to dive into the archives of NSS data to figure out whether these things are true and what we could do about them.
Note: since making this page, I've found that UCL has a whole data analysis team that has compiled and benchmarked NSS and league table data. So in future drafts I can use that data, which has the advantage of splitting by psychology programmes within UCL.
The headline result, that we'll underline below, is simply that we do very poorly on assessment and feedback. From UCL's analysis:
![](docs/assets/theme23.png){width="600"}
You can look over the processing that I did on data below, or you can skip to the plotting of [UCL results across the years] or [UCL vs all other psych depts], or [UCL versus competitors].
## Preprocessing
If you don't have eyethinkdata tools, install from github
```{r, eval=FALSE}
devtools::install_github("dcr-eyethink/eyethinkdata")
```
Load in the package and the raw data. Note that the data has been filtered to just psychology courses, and to students on first degrees only.
```{r}
library(eyethinkdata)
library(plotly)
full_data <- fread("NSS_2014-23.csv")
qkey <- data.table(read.csv("qkey3.csv"))
```
First we need to translate the question numbers into the items and labels. There are three slightly different question sets (annoying), so we have to merge info from the question keys into three different sections of the data. For each, we will have columns for the theme, the full question, and a one-word question key q.
```{r}
d <- rbind( pid_merge(full_data[year<=2016], qkey[,.(QuestionNumber=qnum,theme=theme2014,q=item2014,question=set_full2014)],link="QuestionNumber"),
pid_merge(full_data[year>2016 & year<2023], qkey[,.(QuestionNumber=qnum,theme=theme2017,q=item2017, question=set_full2017)],link="QuestionNumber"),
pid_merge(full_data[year==2023], qkey[,.(QuestionNumber=qnum,theme=theme2023,q=item2023, question=set_full2023)],link="QuestionNumber"))
```
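If you don't have the eyethinkdata package to hand, the key lookup that pid_merge is doing here can be sketched with a plain data.table join. This is a toy illustration with made-up question numbers and labels, not the real key:

```{r}
library(data.table)

# Toy versions of the raw data and one year's question key (hypothetical values)
toy_data <- data.table(QuestionNumber = c("Q01", "Q02"), r = c(0.5, 0.2))
toy_key  <- data.table(QuestionNumber = c("Q01", "Q02"),
                       theme = c("teaching", "assessment"),
                       q = c("explain", "fair"))

# Left-join the key onto the responses by question number
toy <- merge(toy_data, toy_key, by = "QuestionNumber", all.x = TRUE)
toy
```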
Now we want one number that summarises responses to each question; right now we have the distribution of responses for each. All the items are framed positively, i.e. if a student agrees with the statement, they are saying something good about the university. So we're going to code responses into a variable between -1 and 1 that is positive if they are agreeing and negative if they are disagreeing. This is different from how the NSS processes the data: they have a positivity score, which I assume just calculates the % of responses that agree in any way. That seems to ignore some of the graded information we have in this dataset, though.
For pre-2023 data, each question has 5 columns giving the % of responses for the 5 options from strongly agree to strongly disagree. Convert these first to numbers, and then combine them into a -1 to 1 scale in a variable r.
```{r}
# strip the "%" sign, convert to numeric, and treat blanks as 0
d[year<2023, ans_sd := ifelse(A1=="", 0, as.numeric(gsub("%", "", A1)))]
d[year<2023, ans_d  := ifelse(A2=="", 0, as.numeric(gsub("%", "", A2)))]
d[year<2023, ans_n  := ifelse(A3=="", 0, as.numeric(gsub("%", "", A3)))]
d[year<2023, ans_a  := ifelse(A4=="", 0, as.numeric(gsub("%", "", A4)))]
d[year<2023, ans_sa := ifelse(A5=="", 0, as.numeric(gsub("%", "", A5)))]
# weight the graded responses and combine into a -1 to 1 score
d[year<2023, r := (ans_sa + ans_a*.5 + ans_d*-.5 + ans_sd*-1)/100]
```
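As a quick sanity check on this coding, here is the same arithmetic on a made-up distribution (the numbers are hypothetical, not from the data):

```{r}
# Hypothetical percentages: 50% strongly agree, 30% agree, 10% neutral,
# 5% disagree, 5% strongly disagree
ans_sa <- 50; ans_a <- 30; ans_n <- 10; ans_d <- 5; ans_sd <- 5
r <- (ans_sa + ans_a*.5 + ans_d*-.5 + ans_sd*-1)/100
r  # 0.575, i.e. leaning towards "agree"
```

Note that the neutral responses contribute nothing, and a uniform split across the five options would come out at exactly 0.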
For 2023 onwards, we have the total number of responses for 4 options from strongly agree to strongly disagree, i.e. there is no 'neither agree nor disagree' option. Convert these to a -1 to 1 scale as above, this time dividing by the total number of responses.
```{r}
# responses are counts here, so just convert to numeric and treat blanks as 0
d[year>=2023, ans_sa := ifelse(A1=="", 0, as.numeric(A1))]
d[year>=2023, ans_a  := ifelse(A2=="", 0, as.numeric(A2))]
d[year>=2023, ans_d  := ifelse(A3=="", 0, as.numeric(A3))]
d[year>=2023, ans_sd := ifelse(A4=="", 0, as.numeric(A4))]
# same weighting as above, but divide by the total response count
d[year>=2023, r := (ans_sa + ans_a*.5 + ans_d*-.5 + ans_sd*-1)/(ans_sa+ans_a+ans_d+ans_sd)]
```
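And the same sanity check for the counts-based 2023 version (again with made-up numbers):

```{r}
# Hypothetical response counts out of 400 students
ans_sa <- 200; ans_a <- 100; ans_d <- 50; ans_sd <- 50
r <- (ans_sa + ans_a*.5 + ans_d*-.5 + ans_sd*-1)/(ans_sa+ans_a+ans_d+ans_sd)
r  # 0.4375
```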
I'm going to exclude the questions that aren't in the main list and those that are about the student union. Then let's check the distribution of those r values over all questions for each year.
```{r}
d <- d[!theme=="union" & !is.na(theme)]
ggplot(d,aes(x=r,colour=year,group=year))+geom_density()
```
So broadly it looks as though agreement with these statements peaks around the positive, "agree" response, but over the years responses have been slipping down for everyone.
# UCL results across the years
Now let's plot UCL's last 10 years for each theme. The vertical grey bars denote when the questionnaire changed, making comparisons difficult. I am going to exclude the themes mental_health, personal, overall satisfaction and freedom, as they each have only a single question that was only asked in a handful of years. We're going to make a few of these interactive plots, so I will make a function for them.
```{r warning=FALSE}
# tm flags the themes we keep for plotting
d[,tm := !theme %in% c("mental_health", "personal", "freedom", "overall")]
yearplotly <- function(data,y,colour_condition,title=NULL,cols=NULL){
p <- ggplot(data,aes(x=year,y=.data[[y]],colour=.data[[colour_condition]],text=paste(colour_condition,q)))+
geom_vline(data=data.table(year=c(2016.5,2022.5)),alpha=0.1,size=3,aes(xintercept=year))+
geom_line(stat='summary', fun='mean', aes(colour=.data[[colour_condition]],y=.data[[y]],x=year,text=.data[[colour_condition]]),
inherit.aes = F,size=1.5,alpha=.4)+
geom_point(position = position_jitter(width = 0.3, height = 0.1),alpha=.6,size=2)+theme_bw()+ggtitle(title)
if (!is.null(cols)){p <- p+ scale_color_manual(values=cols)}
ggplotly(p,tooltip = "text")
}
yearplotly(data=d[Institution=="University College London" & tm],y = "r",colour_condition = "theme")
```
This is an interactive plot, so you can hover over the dots to see the questions, or zoom into regions. Click/double click on the themes in the legend to hide or isolate them.
If you want to see the full text of the questions, you can [look at the full question key](https://github.com/dcr-eyethink/NSS/blob/main/qkey3.csv)
So people love our resources! The rankings seem pretty stable over time. Scores seem higher pre-2017 and there is a drop-off in 2023, but as the grey lines show, these changes are confounded with a different set of questions (and response options). What does seem clear is that our weak point is our assessments: they are rated low and, if anything, have been getting worse.
# UCL vs all other psych depts
How do we match up with the average psych dept? Here's all rankings over the years, comparing UCL against all other psych depts.
```{r}
yp <- geom_vline(data=data.table(year=c(2016.5,2022.5)),alpha=0.1,size=3,aes(xintercept=year))
d[,ucl := Institution=="University College London"]
d[,ucl_all := ifelse(Institution=="University College London","UCL","Others")]
pirateye(d[ (tm) ],x_condition = "year",colour="ucl_all",dodgewidth = 0,
line = T,dv="r",violin = F,error_bars = T,dots=F, cols=c("#00BD5C" , "#ff0000"))+yp
```
So in general terms: we had a relatively good pandemic! At least up until 2023 (the cohort who were in their first year during the pandemic). But again the picture is complicated by the change in survey. We can split those responses by theme and compare against other depts. It's a bit messy comparing all the individual themes for both, so let's try to simplify this by using a baseline: we can z-score the r values for each question, each year.
```{r}
d[, rzall:=scale(r),by=.(year,question)]
pirateye( d[ucl & tm],x_condition = "year",colour_condition = "theme",dodgewidth = 0,
line = T,dv="rzall",violin = F,error_bars = F,dots=F)+yp+geom_hline(aes(yintercept=0))
```
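To see what the per-group z-scoring is doing, here is scale() applied within groups on some toy data (the values are illustrative, not NSS scores):

```{r}
library(data.table)

toy <- data.table(year = rep(c(2022, 2023), each = 3),
                  r = c(.2, .4, .6, .1, .2, .9))
# z-score within each year, as done for rzall above;
# scale() returns a one-column matrix, so flatten it back to a vector
toy[, rz := as.vector(scale(r)), by = year]
# each year's rz now has mean 0 and sd 1
toy[, .(m = mean(rz), s = sd(rz)), by = year]
```

This is why a department sitting exactly at the sector average for its year lands on 0 in the benchmarked plots.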
So now the psychology sector average for each year is 0, shown by the heavy black line. Benchmarked like this, it looks like we had a good period of growth from 2016 onwards, and in the pandemic years we were well above the mean on almost everything. But again, our assessments are rated below the mean and perhaps trending down. Let's try an interactive plot of the same info.
```{r warning=FALSE}
yearplotly(data=d[Institution=="University College London" & tm],y = "rzall",colour_condition = "theme")
```
Hover on dots to see what individual questions are and click/double click on the themes in the legend to hide/isolate them.
# UCL versus competitors
Let's focus on a smaller set of competitor departments. Reasonably arbitrarily, we have Leeds, Glasgow, York, Royal Holloway, Queen Mary, Bath, St Andrews, and Exeter. Oxford and Cambridge are missing a lot of NSS data due to a boycott over several years, so there is a lot of variance in their results. We can identify these departments in the data by their UKPRN numbers. Since places have changed their institutional names and punctuation over the past decade, we'll rename them with shorter names, and assign set colours to them that will repeat across plots. First let's see how all the results stacked up over the past decade.
```{r}
namecode <- data.table(
  UKPRN = c("10005553", "10007167", "10007774", "10007775", "10007784", "10007788",
            "10007792", "10007794", "10007795", "10007803", "10007850"),
  uni = c("Royal Holloway", "York", "Oxford", "QMU", "UCL", "Cambridge",
          "Exeter", "Glasgow", "Leeds", "St Andrews", "Bath"),
  unicol = c("#DB8E00", "#AEA200", "#64B200", "#00BD5C", "#ff0000", "#00C1A7",
             "#00BADE", "#00A6FF", "#B385FF", "#EF67EB", "#FF63B6"))
d <- pid_merge(d,namecode,link = "UKPRN")
d$ucl_comp <- !is.na(d$uni)
pirateye( d[ucl_comp & tm ],x_condition = "uni",
dv="rzall",violin = T,error_bars = T,dots=F,reorder = T, cols = namecode[,setNames(unicol,uni)],cflip = T)
```
Here is UCL against all these competitors over time. This is still z-scored against the whole country by year:
```{r}
pirateye( d[ucl_comp & tm],x_condition = "year",colour_condition = "ucl_all",dodgewidth = 0,
line = T,dv="rzall",violin = F,error_bars = F,dots=F, cols=c("#00BD5C" , "#ff0000"))+yp+geom_hline(aes(yintercept=0))
```
So we were behind the pack a bit 10 years ago, then had a bit of growth from 2017 onwards. We did well during the pandemic; our competitors did too, though maybe not quite as well as us. But since then we have all dipped (but again, a different survey is being used).
Now let's look at each of the themes in turn, plotting us against the competitors. The full set of competitors is a bit too many for this plot, so we're going to narrow it down to a smaller set of seven.
```{r warning=FALSE}
d[,ucl_compsm := uni %in% c("Royal Holloway", "York", "QMU", "UCL", "Glasgow", "Leeds", "Bath")]
themeplot <- "academic"
yearplotly(data=d[ucl_compsm & tm & theme==themeplot],y = "rzall",colour_condition = "uni",title = themeplot,cols=namecode[,setNames(unicol,uni)])
```
```{r warning=FALSE}
themeplot <- "assessment"
yearplotly(data=d[ucl_compsm & tm & theme==themeplot],y = "rzall",colour_condition = "uni",title = themeplot,cols=namecode[,setNames(unicol,uni)])
```
```{r warning=FALSE}
themeplot <- "learnopp"
yearplotly(data=d[ucl_compsm & tm & theme==themeplot],y = "rzall",colour_condition = "uni",title = themeplot,cols=namecode[,setNames(unicol,uni)])
```
```{r warning=FALSE}
themeplot <- "teaching"
yearplotly(data=d[ucl_compsm & tm & theme==themeplot],y = "rzall",colour_condition = "uni",title = themeplot,cols=namecode[,setNames(unicol,uni)])
```