New York Times Covid Datasource:

https://www.kaggle.com/datasets/kalilurrahman/new-york-times-covid19-dataset

There are multiple ways of going about forecasting data. I decided to use an autoregressive integrated moving average (ARIMA) model in two different styles, a manual and an automatic ARIMA, to forecast COVID-19 cases in the US. In this instance, I used the NYT COVID dataset from Kaggle and looked at daily COVID cases across the United States from the beginning of the pandemic to the end of January 2023. The data was split into training and test sets for modelling: the training set was every day except the last 7 days, while the test set was the last 7 days of the dataset (for validation). I then used the same forecasting model to predict cases for the first 7 days of February.
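(The code in this post assumes the usual packages are already attached. Here is a minimal setup sketch based on the functions used below; it reflects my reading of the code rather than the exact original session.)

# packages assumed throughout: tidyverse for wrangling and plotting,
# lubridate for dates, naniar/mice for missing data, tsibble/fable/feasts
# for time-series modelling, and forecast for auto.arima()
library(tidyverse)   # tibble, dplyr, purrr, ggplot2
library(lubridate)   # as_date(), ymd()
library(naniar)      # vis_miss()
library(mice)        # mice(), md.pattern(), xyplot()
library(tsibble)     # as_tsibble(), filter_index()
library(fable)       # ARIMA() (also attaches fabletools)
library(feasts)      # time-series graphics and diagnostics
library(forecast)    # auto.arima()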

# Convert data to tibble and clean dataset 
US <- tibble(US_all)

# view data
US[1:5, ]

So here is what the data looks like: we have a very long dataset with each row representing a day, starting from Jan 21, 2020. The columns contain data on geoid (location), cases, deaths, and their respective rolling averages. In order to start playing around with this time-series data, we need to convert the dates (currently in <char> format) to <date> so they are recognized by the time-series operators.

# convert date column from character to Date
US$date <- as_date(US$date, format = "%Y-%m-%d")

# sort by date
US <- US %>% arrange(date)

Just to be thorough with the data and pick up any unrealistic entries, we should check whether there are any negative cases or deaths reported in this table. These values can be really detrimental to our models if they're not caught and cleaned out of the dataset.

# dataset contains negative values for cases for three dates
neg_values_cases <- filter(US, cases < 0 | deaths < 0)
neg_values_cases

If we check the source we see that certain dates have no cases reported due to an anomaly in that day's reporting. So we can replace all the negative and missing values with NA and use multiple imputation to fill in the gaps. There is no systematic structure to the missingness and it is assumed that the data is missing at random (MAR) or missing completely at random (MCAR).

# convert negative values to NA for cases and deaths

US_cleaned <- US
US_cleaned$cases[US_cleaned$cases < 0] <- NA
US_cleaned$deaths[US_cleaned$deaths < 0] <- NA

# check for no remaining negative values for cases and deaths
neg_values_cases_cleaned <-
  filter(US_cleaned, cases < 0 | deaths < 0)
neg_values_cases_cleaned
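Before imputing, one way to probe the missing-at-random assumption a bit more formally is Little's MCAR test from the naniar package. A minimal sketch; with only a handful of missing values the test has very little power, so treat it as a rough sanity check rather than proof:

# Little's MCAR test on the numeric columns; a large p-value is consistent
# with the data being missing completely at random
US_cleaned %>%
  select(cases, cases_avg, deaths, deaths_avg) %>%
  mcar_test()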

Now let's visualize the missing data, first with simple tables and then with plots, just to see if there is any pattern to the missingness we might have missed.

# count missing values in each column
US_cleaned %>% map(is.na) %>% map(sum)
## $date
## [1] 0
## 
## $geoid
## [1] 0
## 
## $cases
## [1] 3
## 
## $cases_avg
## [1] 0
## 
## $cases_avg_per_100k
## [1] 0
## 
## $deaths
## [1] 2
## 
## $deaths_avg
## [1] 0
## 
## $deaths_avg_per_100k
## [1] 0
vis_miss(US_cleaned)

md.pattern(US_cleaned, rotate.names = TRUE)

##      date geoid cases_avg cases_avg_per_100k deaths_avg deaths_avg_per_100k
## 1103    1     1         1                  1          1                   1
## 2       1     1         1                  1          1                   1
## 1       1     1         1                  1          1                   1
## 1       1     1         1                  1          1                   1
##         0     0         0                  0          0                   0
##      deaths cases  
## 1103      1     1 0
## 2         1     0 1
## 1         0     1 1
## 1         0     0 2
##           2     3 5

It looks like there are very few data points missing, and overall the data quality is really strong (as expected). Only a few dates have a negative or missing value for deaths and/or cases. Regardless, it's still a good idea not to omit these data points from the dataset but to fill them in using multiple imputation. Below, we use the standard predictive mean matching (pmm) method to impute these NA values.

# impute NA's using the standard pmm method
imp_US_cleaned <-
  mice(
    US_cleaned,
    m = 5,
    maxit = 10,
    method = 'pmm',
    seed = 500
  )
## Warning: Number of logged events: 3
summary(imp_US_cleaned)
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##                date               geoid               cases           cases_avg 
##                  ""                  ""               "pmm"                  "" 
##  cases_avg_per_100k              deaths          deaths_avg deaths_avg_per_100k 
##                  ""               "pmm"                  ""                  "" 
## PredictorMatrix:
##                    date geoid cases cases_avg cases_avg_per_100k deaths
## date                  0     0     1         1                  0      1
## geoid                 1     0     1         1                  0      1
## cases                 1     0     0         1                  0      1
## cases_avg             1     0     1         0                  0      1
## cases_avg_per_100k    1     0     1         1                  0      1
## deaths                1     0     1         1                  0      0
##                    deaths_avg deaths_avg_per_100k
## date                        1                   0
## geoid                       1                   0
## cases                       1                   0
## cases_avg                   1                   0
## cases_avg_per_100k          1                   0
## deaths                      1                   0
## Number of logged events:  3 
##   it im dep      meth                 out
## 1  0  0      constant               geoid
## 2  0  0     collinear  cases_avg_per_100k
## 3  0  0     collinear deaths_avg_per_100k

So we've got 5 imputed datasets (m = 5), covering the 3 missing values for cases and the 2 for deaths (one date was missing both).

head(imp_US_cleaned$imp$cases)
head(imp_US_cleaned$imp$deaths)
complete_US_cleaned <- complete(imp_US_cleaned, 1)
summary(complete_US_cleaned)
##       date               geoid               cases           cases_avg       
##  Min.   :2020-01-21   Length:1107        Min.   :      0   Min.   :     0.1  
##  1st Qu.:2020-10-23   Class :character   1st Qu.:  25663   1st Qu.: 37577.8  
##  Median :2021-07-27   Mode  :character   Median :  55180   Median : 62809.4  
##  Mean   :2021-07-27                      Mean   :  92364   Mean   : 91571.0  
##  3rd Qu.:2022-04-29                      3rd Qu.: 121256   3rd Qu.:108138.1  
##  Max.   :2023-01-31                      Max.   :1433977   Max.   :806794.6  
##  cases_avg_per_100k     deaths          deaths_avg     deaths_avg_per_100k
##  Min.   :  0.00     Min.   :    0.0   Min.   :   0.0   Min.   :0.0000     
##  1st Qu.: 11.32     1st Qu.:  323.5   1st Qu.: 387.3   1st Qu.:0.1200     
##  Median : 18.93     Median :  701.0   Median : 734.5   Median :0.2200     
##  Mean   : 27.60     Mean   : 1010.7   Mean   : 980.6   Mean   :0.2956     
##  3rd Qu.: 32.59     3rd Qu.: 1427.0   3rd Qu.:1373.4   3rd Qu.:0.4150     
##  Max.   :243.15     Max.   :12718.0   Max.   :3342.2   Max.   :1.0100

A plot tells a better story of these imputed values and also shows whether our imputation passes the "smell test", so to speak. Below are the full case and death time-series plots with the imputed values shown in red. Overall, they look to be in line with the trends in the data.

# visualize imputed values
xyplot(imp_US_cleaned, cases ~ date, pch = 5, cex = 2) # looks about right

# Nov 11, 2022 has a massive spike in deaths, but that was not an anomaly;
# it is also reported by Google, so it is retained in this dataset
xyplot(imp_US_cleaned, deaths ~ date, pch = 5, cex = 2)


So, now that our dataset setup is complete, we can move on to the fun part: forecasting future cases using this data for training. We'll start by converting this dataframe to a time-series tibble, a tsibble if you will.

## Modelling NYT COVID Data
# convert to time series tibble (tsibble)
tsibble_US <- complete_US_cleaned %>% mutate(date = ymd(date)) %>%
  as_tsibble(index = date)
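ARIMA modelling also expects a regularly spaced series with no implicit gaps in the index, which tsibble can check directly. A quick sanity-check sketch:

# confirm the daily index is complete before modelling
has_gaps(tsibble_US)   # expect .gaps = FALSE
scan_gaps(tsibble_US)  # lists any missing dates (should be empty)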



tsibble_US %>% ggplot(aes(x = date, y = cases)) +
  geom_line() +
  xlab("Date") + ylab("Cases") +
  ggtitle("United States Covid Cases by Date")

# There is SOME seasonality here, as cases rise in the winter months.
# The caveat is that we don't see a similar rise going into 2023, but this may
# just be due to changes in how cases are reported compared to 2020 - 2022.
# Overall, from medical professionals we know that cases usually rise in the
# colder months and drop in the summer months.

# use a seasonal ARIMA model and a univariate tibble for modelling purposes
ts_Univ_US_cases <- tsibble_US[, c(1, 3)]


# separate into test and train sets
train_set_US_cases <-
  filter_index(ts_Univ_US_cases, "2020-01-21" ~ "2023-01-24")

test_set_US_cases <-
  filter_index(ts_Univ_US_cases, "2023-01-25" ~ .)

To get a better sense of the seasonality, let's log-transform the dataset.

# log transform cases to get a better idea of seasonality within the data
train_set_US_cases %>% autoplot(log(cases)) +
  xlab("Date") + ylab("log(Cases)") +
  ggtitle("United States Covid Cases")


For forecasting, an AutoRegressive Integrated Moving Average (ARIMA) model is commonly used for non-seasonal data (but it can also be used for seasonal data provided you transform the time series in some way). In R, there is an auto.arima() model and a manual ARIMA model; let's test both of these out to see which gives us the best results (lowest AIC score).

#### Test out different ARIMA model auto.arima vs ARIMA (manual)

# auto arima
auto.arima(log((train_set_US_cases$cases) + 1), seasonal = TRUE,
           trace = TRUE)
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,1,2) with drift         : 2079.858
##  ARIMA(0,1,0) with drift         : 2717.327
##  ARIMA(1,1,0) with drift         : 2705.796
##  ARIMA(0,1,1) with drift         : 2627.628
##  ARIMA(0,1,0)                    : 2715.443
##  ARIMA(1,1,2) with drift         : 2443.759
##  ARIMA(2,1,1) with drift         : 2401.474
##  ARIMA(3,1,2) with drift         : 1974.891
##  ARIMA(3,1,1) with drift         : 2347.431
##  ARIMA(4,1,2) with drift         : Inf
##  ARIMA(3,1,3) with drift         : 2070.411
##  ARIMA(2,1,3) with drift         : 2337.418
##  ARIMA(4,1,1) with drift         : 2228.334
##  ARIMA(4,1,3) with drift         : Inf
##  ARIMA(3,1,2)                    : 1974.174
##  ARIMA(2,1,2)                    : 2079.121
##  ARIMA(3,1,1)                    : 2348.258
##  ARIMA(4,1,2)                    : Inf
##  ARIMA(3,1,3)                    : 2069.753
##  ARIMA(2,1,1)                    : 2402.045
##  ARIMA(2,1,3)                    : 2337.658
##  ARIMA(4,1,1)                    : 2230.202
##  ARIMA(4,1,3)                    : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(3,1,2)                    : 1965.779
## 
##  Best model: ARIMA(3,1,2)
## Series: log((train_set_US_cases$cases) + 1) 
## ARIMA(3,1,2) 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1     ma2
##       0.8145  -0.4582  -0.3651  -1.4649  0.8581
## s.e.  0.0307   0.0359   0.0315   0.0189  0.0147
## 
## sigma^2 = 0.3467:  log likelihood = -976.85
## AIC=1965.7   AICc=1965.78   BIC=1995.72

The auto ARIMA model had an AIC of 1965.7 with (p,d,q) = (3,1,2).

Below is a manual ARIMA with seasonal terms included. In a non-seasonal model the coefficients are just (p,d,q), whereas in a seasonal model additional coefficients (P,D,Q)m are introduced. This seasonal part of the model (uppercase coefficients) consists of terms similar to the non-seasonal ones, but calculated using backshifts of the seasonal period. The 'm' is the seasonal period, i.e. the number of observations per seasonal cycle; with daily data, m = 7 corresponds to a weekly cycle.

#Test out manual ARIMA with seasonal coefficients

train_set_US_cases %>% model(arima = ARIMA(log((cases + 1)))) %>% report()
## Series: cases 
## Model: ARIMA(4,1,0)(0,1,1)[7] 
## Transformation: log((cases + 1)) 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4     sma1
##       -0.6987  -0.4621  -0.2846  -0.1532  -0.6995
## s.e.   0.0309   0.0364   0.0363   0.0304   0.0277
## 
## sigma^2 estimated as 0.1353:  log likelihood=-457.49
## AIC=926.99   AICc=927.07   BIC=956.96

Therefore, with a lower AIC of 926.99 compared to the 1965.7 of auto.arima(), the manual seasonal ARIMA was used for model prediction.
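For reference, the selected orders can also be written out explicitly using fable's pdq() and PDQ() specials rather than relying on the automatic search; a minimal sketch (the arima_manual model name is just illustrative):

# pin the selected ARIMA(4,1,0)(0,1,1)[7] orders on log(cases + 1);
# period = 7 encodes the weekly seasonal cycle in the daily data
train_set_US_cases %>%
  model(arima_manual = ARIMA(log(cases + 1) ~ pdq(4, 1, 0) + PDQ(0, 1, 1, period = 7))) %>%
  report()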

# returns best seasonal arima model (4,1,0)(0,1,1)[7] with an AIC of 926.99



# add 1 to 0 cases to avoid -INF error, arima(4,1,0)(0,1,1)[7] was used
model_fit <-
  train_set_US_cases %>% model(arima = ARIMA(log((cases + 1)))) %>%
  report()
## Series: cases 
## Model: ARIMA(4,1,0)(0,1,1)[7] 
## Transformation: log((cases + 1)) 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4     sma1
##       -0.6987  -0.4621  -0.2846  -0.1532  -0.6995
## s.e.   0.0309   0.0364   0.0363   0.0304   0.0277
## 
## sigma^2 estimated as 0.1353:  log likelihood=-457.49
## AIC=926.99   AICc=927.07   BIC=956.96

So, using the seasonal ARIMA, let's forecast over the test period and compare it with the actual values. In the plot below, the dark purple band represents the 80% prediction interval and the lighter purple the 95% interval. The blue line represents forecasted data while the black line represents actual data.

# compare trained forecast to test set
model_plot <- train_set_US_cases %>%
  model(arima = ARIMA(log((cases + 1)))) %>%
  forecast(h = "7 days") %>%
  autoplot(filter_index(ts_Univ_US_cases, "2023-01-01" ~ .)) +
  xlab("Date") + ylab("Cases") +
  ggtitle("Forecasted Covid Cases with Train Set")
model_plot

# get predicted values from forecast model
forecast_train_data <- forecast(model_fit, h = 7)
predicted <- forecast_train_data$.mean
actual <- test_set_US_cases$cases

# compare predicted to actual values 
compare_train_test <- as.data.frame(cbind(predicted, actual))
compare_train_test$accuracy <-
  compare_train_test$predicted / compare_train_test$actual

compare_train_test

The model seems reasonably accurate against the actual data and follows the day-to-day increases and decreases, so we will use this ARIMA to forecast the next 7 days. The average ratio of predicted to actual is ~1.5x, so our model over-predicts actual cases most of the time. However, the actual number of cases on the 3rd day of the test week is 14171, which is substantially lower than all previous and subsequent dates and is the biggest source of error. Omitting this outlier date, our model's average predicted-to-actual ratio is ~1.2x.
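The predicted/actual ratio above is a fairly rough yardstick; fabletools also provides standard point-forecast accuracy measures against the held-out week. A sketch:

# RMSE, MAE, MAPE etc. of the 7-day forecast against the held-out test week
forecast_train_data %>%
  accuracy(ts_Univ_US_cases)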



forecast_plot <- ts_Univ_US_cases %>%
  model(arima = ARIMA(log((cases + 1)))) %>%
  forecast(h = "7 days") %>%
  autoplot(filter_index(ts_Univ_US_cases, "2023-01-01" ~ .)) +
  xlab("Date") + ylab("Cases") +
  ggtitle("United States Covid Cases Forecast 7 Days")
forecast_plot

next_7_days_forecast <-
  ts_Univ_US_cases %>% model(arima = ARIMA(log((cases + 1))))

next_7_cases <- forecast(next_7_days_forecast, h = 7)


# forecast for next 7 days, Feb 1st to Feb 7th, .mean represents mean cases
# forecasted by the arima model
next_7_cases[, c(2, 4)]

The forecast for the next 7 days can be found in the table next_7_cases. The reported values are simply the mean values of the forecast, while compare_train_test contains both predicted and actual case counts for the test week (Jan 25th - Jan 31st). The prediction plots with 80% and 95% prediction intervals are also shown: comparing the training-set forecast to the test set in model_plot, and forecasting the next 7 days past Jan 31st in forecast_plot.

An ARIMA model was used as the main method for this time-series forecasting, as it has been widely used in the past to model and forecast similar data. Looking at the data, the log-transformed cases plot showed some signs of seasonality, with increasing and decreasing periods of case numbers across the United States. To justify a seasonal ARIMA model, I ran auto.arima and a manual ARIMA, compared their AIC scores, and used the model with the lowest AIC as the more accurate one. Looking at the literature, Alabdulrazzaq et al. (2021) used a similar ARIMA-based model to predict COVID-19 spread and found positive results for case prediction in Kuwait. The AIC values of their best-fit ARIMA model were roughly 900 - 1200, similar to the manual ARIMA values calculated here (as opposed to the auto.arima values), which further justified using the manual ARIMA to forecast cases.

Finally, from the accuracy table we see that the forecast model on average over-predicts cases compared to the actual values, with a large discrepancy on day 3 where it predicted almost three times the reported cases. The average predicted-to-actual ratio is ~1.5x, which drops to ~1.2x when the day-3 outlier is omitted. Although this method of forecasting COVID cases in the US is somewhat accurate, more tuning of the model could be done for more accurate forecasting.

Alabdulrazzaq et al. (2021): https://www.sciencedirect.com/science/article/pii/S2211379721006197
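As a first step toward that tuning, it is worth checking the training-set residuals of the seasonal ARIMA for leftover autocorrelation; a sketch, assuming the feasts package is attached:

# residual diagnostics for the fitted seasonal ARIMA: time plot, ACF and
# histogram of the innovation residuals
model_fit %>% gg_tsresiduals()

# Ljung-Box test for remaining autocorrelation in the residuals
model_fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)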

Predict Number of Cases for the next 92 days

current_date <- Sys.Date()

# number of days between the last date in the dataset and today
difference_days <-
  as.numeric(current_date - max(ts_Univ_US_cases$date))

forecast_plot_2 <- ts_Univ_US_cases %>%
  model(arima = ARIMA(log((cases + 1)))) %>%
  forecast(h = difference_days) %>%
  autoplot(filter_index(ts_Univ_US_cases, "2023-01-01" ~ .)) +
  xlab("Date") + ylab("Cases") +
  ggtitle(paste0("Forecast US Covid Cases for next ", difference_days, " days"))
forecast_plot_2

current_day_forecast <-
  ts_Univ_US_cases %>% model(arima = ARIMA(log((cases + 1))))

current_day_cases <-
  forecast(current_day_forecast, h = difference_days)
current_day_cases
current_day_prediction <- tail(current_day_cases, n = 1)
print(
  paste0(
    "The current day prediction from the ARIMA model is ",
    format(round(current_day_prediction$.mean, 0), nsmall = 0),
    " cases for ",
    current_date
  )
)
## [1] "The current day prediction from the ARIMA model is 343079 cases for 2023-05-03"

Using February as an example: Feb 21st prediction: 39283 cases (29117 actual); Feb 22nd prediction: 119581 cases (127499 actual); Feb 23rd prediction: 81149 cases (64478 actual).

The NYT cases timeline shows a spike in cases every Wednesday: for example, Feb 15th showed 102914 cases, Feb 8th showed 112912 cases, etc. The prediction of 119581 cases for Feb 22nd from this model is high compared to the 39283 cases predicted the day before, but it is in line with trends in the actual data. Overall, about a month into the forecast, we still see that the model over-predicts the data, but it is not making consistently large over-predictions or under-predictions.

The manual ARIMA model was used to predict the cases for the current day. We see that whilst the model was deemed somewhat accurate in predicting cases in the immediate 7 days past the training dataset, it shows this cyclic behaviour when forecasting over the longer timeline. We also see that as the forecast horizon increases, the prediction range widens massively, and the point forecasts may not be accurate at all considering the large variability within the 95% prediction interval. It is also important to note that data for March onwards is being predicted using data that ends in January, which may not give accurate results considering that the entire month of February is missing from the forecast model.
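To put numbers on that widening uncertainty, the prediction intervals can be pulled straight out of the forecast table; a sketch:

# extract the 95% prediction intervals; the interval width grows rapidly
# as the horizon extends further past the training data
current_day_cases %>%
  hilo(level = 95) %>%
  tail()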

Trips and Covid Cases by State

The lockdowns that followed the onset of COVID were effective in reducing transmission of the disease. Although the guidance was generally followed by the population, many individuals in the US did not follow strict lockdown procedures. The following analysis basically asks: if you travelled more (indicated by the number of trips), were you more likely to get or transmit COVID?

States <- as_tibble(
  read.csv("filepath"))

# Doesn't make sense to keep the time-series for merging datasets, group
# by state and take average of each variable

States_avg <- States[, !names(States) %in% c("date", "geoid")]

# Summarize each state by the mean of each variable. Mortality rate was
# determined by mean(deaths_avg_per_100k) * 1000 to match the units of data
# from external resources:
# https://www.statista.com/statistics/1109011/coronavirus-covid19-death-rates-us-by-state/
# I just didn't want to work with decimals, and converting to an actual number
# of deaths per 100K made more sense from an interpretation perspective

States_2 <-
  States_avg %>% group_by(state) %>% summarise(
    Mean_Cases = mean(cases),
    Mean_Cases_avg = mean(cases_avg),
    Mean_Cases_avg_per_100k = mean(cases_avg_per_100k),
    Mean_Deaths = mean(deaths),
    Mean_Deaths_avg = mean(deaths_avg),
    Mean_Deaths_avg_per_100K = mean(deaths_avg_per_100k),
    Mortality_Rate = mean(deaths_avg_per_100k) * 1000,
    .groups = 'drop'
  )

Import the external dataset, which contains data on trips made by the US population.

# this file is massive so it may take a while to import
Trips_all <- read.csv("filepath")

# use only the state level data, filter dates to match COVID database
Trips_state <- filter(Trips_all, Level == "State")

# convert dates from character to Date
Trips_state$Date <- as_date(Trips_state$Date, format = "%Y-%m-%d")
# convert to tibble 
Trips_state <- as_tibble(Trips_state)

# filter dates between Jan 21 2020 - Jan 31 2023 to match the COVID dataset
Trips_state <- Trips_state %>% filter(Date > "2020-01-20" & Date <= "2023-01-31")

# summarize by states 
Trips_2 <-
  Trips_state %>% group_by(State.Postal.Code) %>% summarise(
    Pop_at_home = mean(Population.Staying.at.Home),
    Pop_not_home = mean(Population.Not.Staying.at.Home),
    Avg_trips = mean(Number.of.Trips),
    .groups = 'drop'
  )

# merge datasets, dropping non-states in the COVID database
Trips_2 <- filter(Trips_2, State.Postal.Code != "DC")

States_3 <-
  filter(
    States_2,
    !state %in% c(
      "District of Columbia",
      "American Samoa",
      "Northern Mariana Islands",
      "Puerto Rico",
      "Virgin Islands",
      "Guam"
    )
  )

# Change state abbreviations to state names
State_Names <- state.name[match(Trips_2$State.Postal.Code, state.abb)]

# arrange by state postal codes to attach state names 
Trips_2 <- Trips_2 %>% arrange(State.Postal.Code)
Trips_3 <- cbind(Trips_2, State_Names)

# arrange by state names now in the final dataframe to attach to states database
Trips_3 <- Trips_3 %>% relocate(State_Names, .before = Pop_at_home)
Trips_3 <- Trips_3 %>% arrange(State_Names)

# arrange states database by state names
States_3 <-  States_3 %>% arrange(state)


# merge dataset, called Mortality_by_trips

Mortality_by_trips <- cbind(States_3, Trips_3[,c(2:5)])

# check to see if all state names and corresponding data are in the right place
# this should be 50, otherwise a state was misplaced 
sum(Mortality_by_trips$state == Mortality_by_trips$State_Names)
## [1] 50
head(Mortality_by_trips)
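Merging by sorting both tables and binding columns works here because the check above confirms every row lines up, but a keyed join is less fragile; a roughly equivalent sketch (the Mortality_by_trips_joined name is just illustrative):

# the same merge expressed as an explicit join on state name; a missing or
# mismatched state shows up as NA instead of silently misaligning rows
Mortality_by_trips_joined <- States_3 %>%
  left_join(Trips_3, by = c("state" = "State_Names"))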

I picked a dataset from the US government that tracks overall trips made in the United States to merge into the dataset, as I wanted to see how COVID cases were affected by the number of trips Americans made over the duration of the pandemic. Essentially, I was looking to see whether the number of trips made during the pandemic could be correlated with the number of cases in each state, as a proxy measurement of whether people followed or ignored the "stay at home when possible" guidelines set by healthcare providers and professionals. To be clear, I was not looking specifically at the lockdown periods but rather making the general assumption that those who did not take the pandemic seriously would travel more frequently, and that this could show up in the state COVID case and mortality records.

Plot state-by-state mortality rates (y) by the number of trips (x)

#ggplot  
library(plotly)
library(ggplot2)
library(ggrepel)
library(ggpubr)
ggplot(data = Mortality_by_trips, aes(x = Avg_trips, y = Mortality_Rate)) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(Mortality_Rate>420 | Mortality_Rate <160 | Avg_trips> 60000000, as.character(state),'')),
                  box.padding = 1.5,
                  point.padding = 0.5,
                  force = 100,
                  segment.size = 0.2) +
  # geom_label_repel(aes(label = ifelse(Mortality_Rate>420 | Mortality_Rate <160 | Avg_trips> 60000000, as.character(state),'')),
  #                  box.padding = 0.35,
  #                  point.padding = 0.5,
  #                  segment.color = 'black') +
  geom_smooth(method = 'lm') +
  stat_cor(label.y= 500) + 
  ylab("Mortality Rate (deaths/100k)") +
  xlab("Average Number of Trips") +
  ggtitle("US State Mortality Rates across Average Number of Trips") +
  theme_classic()

# The vast difference in total population is probably influencing too much of 
# this data, need to normalize by population

# normalize by total population of the state
# get total population by adding those at home and those not at home
# this is a hack-y way to get total pop but I'll double check with
# official data to see if this is somewhat accurate 


totalpop <-  c(Mortality_by_trips$Pop_at_home + Mortality_by_trips$Pop_not_home)

# total pop is close enough to actual total pop as of 2021 data, based on
# google results 

Mortality_by_trips$totalpop <-  totalpop

ggplot(data = Mortality_by_trips, aes(x = (Avg_trips/totalpop), y = Mortality_Rate)) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(Mortality_Rate>420 | Mortality_Rate <160 | Avg_trips> 60000000, as.character(state),'')),
                  box.padding = 1.5,
                  point.padding = 0.5,
                  force = 100,
                  segment.size = 0.2) +
  # geom_label_repel(aes(label = ifelse(Mortality_Rate>420 | Mortality_Rate <160 | Avg_trips> 60000000, as.character(state),'')),
  #                  box.padding = 0.35,
  #                  point.padding = 0.5,
  #                  segment.color = 'black') +
  geom_smooth(method = 'lm') +
  stat_cor(label.y= 450) +
  ylab("Mortality Rate (deaths/100k)") +
  xlab("Average Number of Trips (normalized by State population)") +
  ggtitle("US State Mortality Rates across Normalized Avg Number of Trips") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Looking at both graphs of state mortality rate by average trips, raw and normalized by state population, it is clear that the relationship between these two variables is not proportional. However, we can use these graphs to infer several things about how states reacted to the pandemic. California drops from being the state with the highest number of trips taken to the lowest when controlling for its population, although interestingly this is not seen in New York. Vermont shows the opposite effect to California: its absolute number of trips is low but becomes high when accounting for population. Finally, in terms of mortality rates, we see that even states with a high average number of trips, or a high number of trips per capita, do not necessarily have more people dying. This is probably indicative of the behaviour of the population when they do go out of their homes and travel (i.e. large gatherings and not wearing masks or taking proper safety precautions, behaviour that was more common in the "red" states of Arizona, Mississippi and Florida).

What if I looked only at the first year of covid and trips?

# filter by first covid year and mortality and trips 
Trips_state_2020 <-
  Trips_state %>% filter(Date > "2020-01-20" & Date < "2020-12-31")


# summarize by states 
Trips_2020 <-
  Trips_state_2020 %>% group_by(State.Postal.Code) %>% summarise(
    Pop_at_home = mean(Population.Staying.at.Home),
    Pop_not_home = mean(Population.Not.Staying.at.Home),
    Avg_trips = mean(Number.of.Trips),
    .groups = 'drop'
  )

# merge datasets, dropping non-states in the COVID database
Trips_2020 <- filter(Trips_2020, State.Postal.Code != "DC")

# manage states for the first covid year 
States_2020 <- States

# convert to dates 
States_2020$date <- as_date(States_2020$date)

# filter to the first year, matching the trips date range 
States_2020 <- States_2020 %>% filter(date > "2020-01-20" & date < "2020-12-31")

# remove unnecessary columns 
States_2020 <- States_2020[, !names(States_2020) %in% c("date", "geoid")]

# average by year 
States_2020 <-
  States_2020 %>% group_by(state) %>% summarise(
    Mean_Cases = mean(cases),
    Mean_Cases_avg = mean(cases_avg),
    Mean_Cases_avg_per_100k = mean(cases_avg_per_100k),
    Mean_Deaths = mean(deaths),
    Mean_Deaths_avg = mean(deaths_avg),
    Mean_Deaths_avg_per_100K = mean(deaths_avg_per_100k),
    Mortality_Rate = mean(deaths_avg_per_100k) * 1000,
    .groups = 'drop'
  )


States_2020 <-
  filter(
    States_2020,
    !state %in% c(
      "District of Columbia",
      "American Samoa",
      "Northern Mariana Islands",
      "Puerto Rico",
      "Virgin Islands",
      "Guam"
    )
  )

# Change state abbreviations to state names
State_Names_2020 <- state.name[match(Trips_2020$State.Postal.Code, state.abb)]

# arrange by state postal codes to attach state names 
Trips_2020 <- Trips_2020 %>% arrange(State.Postal.Code)
Trips_covid1 <- cbind(Trips_2020, State_Names_2020)

# arrange by state names now in the final dataframe to attach to states database
Trips_covid1 <- Trips_covid1 %>% relocate(State_Names_2020, .before = Pop_at_home)
Trips_covid1 <- Trips_covid1 %>% arrange(State_Names_2020)


States_2020 <-  States_2020 %>% arrange(state)

# merge dataset 
Mortality_by_trips_2020 <- cbind(States_2020, Trips_covid1[,c(2:5)])

# add in total population data 
totalpop_2020 <-  c(Mortality_by_trips_2020$Pop_at_home + Mortality_by_trips_2020$Pop_not_home)

Mortality_by_trips_2020$totalpop <-  totalpop_2020
ggplot(data = Mortality_by_trips_2020, aes(x = (Avg_trips / totalpop), y = Mortality_Rate)) +
  geom_point(color = dplyr::case_when(Mortality_by_trips_2020$Mortality_Rate > 420 ~ "red", 
                                      Mortality_by_trips_2020$Mortality_Rate < 200 ~ "green",
                                      TRUE ~ "#7570b3")) +
  geom_text_repel(aes(label = ifelse(Mortality_Rate>555 | Mortality_Rate <90 | Avg_trips/totalpop> 3.40, as.character(state),'')),
                  box.padding = 1.5,
                  point.padding = 0.5,
                  force = 50,
                  segment.size = 0.2) +
  # geom_label_repel(aes(label = ifelse(Mortality_Rate>420 | Mortality_Rate <160 | Avg_trips> 60000000, as.character(state),'')),
  #                  box.padding = 0.35,
  #                  point.padding = 0.5,
  #                  segment.color = 'black') +
  geom_smooth(method = 'lm', se = FALSE) +
  stat_cor(label.y= 500) +
  ylab("Mortality Rate (deaths/100k)") +
  xlab("Average Number of Trips (normalized by State population)") +
  ggtitle("2020 US State Mortality Rates by Normalized Average Trips") + 
  theme_pubclean()


Now, looking at just the first year and normalizing by state population, we do see a relationship between the number of trips taken and mortality rate. With R = 0.34, p = 0.015, there was a small positive correlation between trips taken and mortality rate within a state. In red are states with a mortality rate > 420 deaths per 100K, and in green are states with a mortality rate < 200 deaths per 100K. The mortality rate of COVID itself was probably diminished by mid-2021 due to the availability of the vaccine, so this relationship between trips and mortality probably only exists within the first year of the pandemic.
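For reference, the R and p values reported by stat_cor() (Pearson by default) can be reproduced directly with cor.test(); a sketch:

# Pearson correlation between population-normalized trips and mortality rate
# for the first pandemic year; should match the R and p shown on the plot
cor.test(Mortality_by_trips_2020$Avg_trips / Mortality_by_trips_2020$totalpop,
         Mortality_by_trips_2020$Mortality_Rate)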

All this work was done as part of my STAT 847: Exploratory Data Analysis course, which ended up being one of my favorite courses I've taken throughout all my undergraduate and graduate work.

You can check out more of my work using the menu on the top right or by going back home.