Regression, Dimension Reduction & US Mass Shooting Victims

The goal of this project was to look at the factors behind the 'deadliness' of a mass shooting event across the United States. We ask the questions:

What factors make a mass shooting incident more or less deadly than another?

Is there a difference in this deadliness between 'red' and 'blue' states

Does gun ownership rates and strictness of gun laws impact deadliness of mass shootings?

I want to make it clear, this analysis does not predict the number of mass shooting events, it only looks at what factors make it more or less deadly once they occur. This analysis also only looks at mass shooting events as defined by The Violence Project, which follows the Congressional Research Service definition as:

…a multiple homicide incident in which four or more victims are murdered with firearms—not including the offender(s)—within one event, and at least some of the murders occurred in a public location or locations in close geographical proximity (e.g., a workplace, school, restaurant, or other public settings), and the murders are not attributable to any other underlying criminal activity or commonplace circumstance (armed robbery, criminal competition, insurance fraud, argument, or romantic triangle).


Reference


Peterson, J., & Densley, J. (2022). The Violence Project database of mass shootings in the United States (Version 6). https://www.theviolenceproject.org

This work was only possible through the help of The Violence Project and their permission for access to their database. I have in this project used a subset version of the database and have omitted publishing of the entire set.



Jump to analysis



To get started with this question, we first we need to define the criterion for evaluation of the violent measure of an event, or it's deadliness. One way to do this is to look at how many casualties were reported in a certain event. We can calculate this by adding the sum of total injured and total killed in an event. As such, this will be considered our response variable. To gather the explanatory variables, I used the “Full Dataset” and “Community Dataset” from the Violence Project Mass Shooter Databse to initialize the first few sets of explanatory variables. Then I looked at appending other sources of data to this base set such as gun ownership rates, gun law strictness (by state), and the party affiliation of the state in the last federal election. The statistical analyses performed were linear correlation, stepwise regression and PCA of each of the explanatory variables and how they correlate with the response variable and to see which combination of variables best explained the deadliness of a mass shooting event.

It is important to note here that the variables of gun law strictness, gun ownership and party affiliation used the most modern data and were not strictly date matched to the shooting events that occured. For example I look at shootings from 1995 - 2023 but not the specific gun law ranking for each state from 1995 - 2023 (and the year/date matched data). This is because some data was simply not existent in the first few decades (gun law strictness was only started in 2010 with the initial values being the actual letter grades (A,B,C etc.) instead of numbers which is used in the CSV currently, so the most modern values were used as a default. Federal election data also used the 2020 election numbers as I wanted a general sense of how violent shootings differed from a “red” state vs a “blue” state.

# import data from xlsx (not csv)

# the FullDB and CommunityDB was a specialized dataset that was acquired by
# requesting access to it from the Violence Project 
# (https://www.theviolenceproject.org/mass-shooter-database/)
FullDB <- read_excel("filepath/mass-shooter-database.xslx", 
     sheet = "Full Database", col_names = TRUE, 
     skip = 1)

CommunityDB <- read_excel("filepath/communityDB.xlsx", 
    sheet = "Community Data")


# import gun law data from:
# (https://worldpopulationreview.com/state-rankings/strictest-gun-laws-by-state)
GunLawScorecard <- read_csv("filepath/gunlawdata.csv")
# import gun density data from: 
# (https://worldpopulationreview.com/state-rankings/gun-ownership-by-state)

GunDensity <- read_csv("filepath/gunDensity.csv")
# import election data from: 
# (https://dataverse.harvard.edu/file.xhtml?fileId=4299753&version=6.0)
PresElectnData <- read_csv("filepath/1976-2020-president.csv")
head(GunLawScorecard)
## # A tibble: 6 × 14
##    fips state   densi…¹ pop2023 pop2022 pop2020 pop2019 pop2010 growth…²  growth
##   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
## 1     6 Califo…    250.  3.89e7  3.90e7  3.95e7  3.93e7  3.73e7 -0.00291 -113649
## 2    34 New Je…   1259.  9.26e6  9.26e6  9.27e6  9.22e6  8.79e6 -0.00068   -6262
## 3     9 Connec…    749.  3.63e6  3.63e6  3.60e6  3.60e6  3.57e6  0.00079    2850
## 4    36 New Yo…    414.  1.95e7  1.97e7  2.01e7  2.00e7  1.94e7 -0.00916 -180341
## 5    15 Hawaii     223.  1.43e6  1.44e6  1.45e6  1.44e6  1.36e6 -0.00483   -6958
## 6    24 Maryla…    634.  6.15e6  6.16e6  6.17e6  6.13e6  5.77e6 -0.00161   -9950
## # … with 4 more variables: growthSince2010 <dbl>, lawsRank <dbl>,
## #   grade2019 <dbl>, gunDeathRate <dbl>, and abbreviated variable names
## #   ¹​densityMi, ²​growthRate
head(GunDensity)
## # A tibble: 6 × 14
##    fips state    densi…¹ pop2023 pop2022 pop2020 pop2019 pop2010 growth…² growth
##   <dbl> <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>  <dbl>
## 1    30 Montana     7.83 1139507 1122867 1087075  1.08e6  989415  0.0148   16640
## 2    56 Wyoming     6.01  583279  581381  577605  5.76e5  563626  0.00326   1898
## 3     2 Alaska      1.28  732984  733583  732923  7.31e5  710231 -0.00082   -599
## 4    16 Idaho      23.9  1973752 1939033 1849202  1.82e6 1567582  0.0179   34719
## 5    54 West Vi…   73.4  1764786 1775156 1791420  1.80e6 1852994 -0.00584 -10370
## 6     5 Arkansas   58.9  3063152 3045637 3014195  3.00e6 2915918  0.00575  17515
## # … with 4 more variables: growthSince2010 <dbl>, gunOwnership <dbl>,
## #   totalGuns <dbl>, rank <dbl>, and abbreviated variable names ¹​densityMi,
## #   ²​growthRate
head(PresElectnData)
## # A tibble: 6 × 15
##    year state   state_po state_…¹ state…² state…³ office candi…⁴ party…⁵ writein
##   <dbl> <chr>   <chr>       <dbl>   <dbl>   <dbl> <chr>  <chr>   <chr>   <lgl>  
## 1  1976 ALABAMA AL              1      63      41 US PR… "CARTE… DEMOCR… FALSE  
## 2  1976 ALABAMA AL              1      63      41 US PR… "FORD,… REPUBL… FALSE  
## 3  1976 ALABAMA AL              1      63      41 US PR… "MADDO… AMERIC… FALSE  
## 4  1976 ALABAMA AL              1      63      41 US PR… "BUBAR… PROHIB… FALSE  
## 5  1976 ALABAMA AL              1      63      41 US PR… "HALL,… COMMUN… FALSE  
## 6  1976 ALABAMA AL              1      63      41 US PR… "MACBR… LIBERT… FALSE  
## # … with 5 more variables: candidatevotes <dbl>, totalvotes <dbl>,
## #   version <dbl>, notes <lgl>, party_simplified <chr>, and abbreviated
## #   variable names ¹​state_fips, ²​state_cen, ³​state_ic, ⁴​candidate,
## #   ⁵​party_detailed
# trim full DB dataset and append GunLawScorecard, GunOwnership & PresElectnData

# convert to dataframes, it'll take care of the spaces in column names
FullDB <- data.frame(FullDB)
GunLawScorecard <- data.frame(GunLawScorecard)
CommunityDB <- data.frame(CommunityDB) 

I've removed the code and tables that display the Mass Shooter Database aquired from the Violence Project as I require explicit permission from the org to display any data that reveals their database. Upon permission I will re-upload the code used to clean and merge the datasets.


The data was collected using the download links provided in the comments above each dataset, with the exception of The Violence Project dataset which was specifically requested. The Violence Project excel file contained two important sheets, the FullDatabase and the CommunityDatabase. The FullDatabase contains variables which outlines each event and the specific details of the shooter (State, Year, Day, Location, Age, Sex, Gender, Race, History of Domestic Abuse, Affiliation with Hate Groups) all which were selected to be in the final database for EDA. The Community Database contained data relating to the community of impact such as employment rate, college graduate and high school graduate rate, total population (of the community), size of police department etc. These two datsets were joined using innerjoin() using the specific case numbers (which was the common link between these two datasets). Then the gun strictness and gun ownership rates were appended using the innerjoin() with the specific state as their common link (i.e. all shootings that occured in California had the same values for grade2019 (strictness) and gunOwnership). Renaming of the column names also occured here to simplify variable names.

Outlier removal occurs sporadically here as I visualized the explanatory and response variables. The two main culprits are total victims and community population that both have a few values that are far outside the range of the median and 75%ile which were all removed from the dataset as to not skew the results. The justification of removal for each outliers is explained in the comments.

Finally, the presidential election data was filtered to just the 2020 results and the maximum of the votes for each candidate in the states were used as the “party-affiliation” which was used as the indication of the alignment of each state to either the republican candidate or the democratic candidate. Here, I just wanted to look at “generally, do shootings get deadlier in a red state vs a blue state”.



Analysis

library(party)

# Give a ggpairs plot of 6 most important variables, one of them categorical,
# one of them continuous 

# 6 most variables chosen are: Age, History.of.Domestic.Abuse, College.Grad,
# Community.Population, grade2019, hate group association &total.victims 

init_pairs <-
  ggpairs(
    FinalDB4,
    columns = c(
      "Total.Victims",
      "Age",
      "History.of.Domestic.Abuse",
      "College.Grad",
      "Community.Population",
      "grade2019",
      "Hate.Group.Link"
    )
  )

init_pairs

# closer statistical look at party affiliation of the state and total victims 

# test for normality 
Party_Norm_test_REP <-
  shapiro.test(FinalDB4$Total.Victims[which(FinalDB4$party_simplified == "REPUBLICAN")])
Party_Norm_test_REP
## 
##  Shapiro-Wilk normality test
## 
## data:  FinalDB4$Total.Victims[which(FinalDB4$party_simplified == "REPUBLICAN")]
## W = 0.68461, p-value = 9.162e-09
Party_Norm_test_DEM <-
  shapiro.test(FinalDB4$Total.Victims[which(FinalDB4$party_simplified == "DEMOCRAT")])
Party_Norm_test_DEM
## 
##  Shapiro-Wilk normality test
## 
## data:  FinalDB4$Total.Victims[which(FinalDB4$party_simplified == "DEMOCRAT")]
## W = 0.59001, p-value = 8.884e-13
# both party shapiro-wilks test fail normality, use non-parametric 
# (Mann-Whitney) tests to look at differences between these two groups 

Party_Victims <-  ggbetweenstats(FinalDB4,
               y = Total.Victims,
               x = party_simplified,
               type = "np",
               ylab = "Victims of Mass Shooting",
               xlab = "Party",
               title = "Party Affiliation and Total Victims of a Mass Shooting Event",
               bf.message = FALSE)
Party_Victims

# closer look at the history of domestic abuse and total vicitms
# normality tests

shapiro.test(
  FinalDB4$Total.Victims[which(FinalDB4$History.of.Domestic.Abuse == 0)])
## 
##  Shapiro-Wilk normality test
## 
## data:  FinalDB4$Total.Victims[which(FinalDB4$History.of.Domestic.Abuse == 0)]
## W = 0.58669, p-value = 3.287e-13
shapiro.test(
  FinalDB4$Total.Victims[which(FinalDB4$History.of.Domestic.Abuse == 1)])
## 
##  Shapiro-Wilk normality test
## 
## data:  FinalDB4$Total.Victims[which(FinalDB4$History.of.Domestic.Abuse == 1)]
## W = 0.62278, p-value = 2.359e-06
shapiro.test(
  FinalDB4$Total.Victims[which(FinalDB4$History.of.Domestic.Abuse == 2)])
## 
##  Shapiro-Wilk normality test
## 
## data:  FinalDB4$Total.Victims[which(FinalDB4$History.of.Domestic.Abuse == 2)]
## W = 0.90308, p-value = 0.2015
shapiro.test(
  FinalDB4$Total.Victims[which(FinalDB4$History.of.Domestic.Abuse == 3)])
## 
##  Shapiro-Wilk normality test
## 
## data:  FinalDB4$Total.Victims[which(FinalDB4$History.of.Domestic.Abuse == 3)]
## W = 0.7524, p-value = 0.003817
# at least 1 group also fails the shapiro-wilks test, thus a non-parametric 
# Kruskal Wallis ANOVA will be used 
History_DA_Victims <- ggbetweenstats(FinalDB4,
               y = Total.Victims,
               x = History.of.Domestic.Abuse,
               type = "np",
               ylab = "Victims of Mass Shooting",
               xlab = "History of Domestic Abuse",
               title = "History of Domestic Abuse and Total.Victims",
               bf.message = FALSE)
History_DA_Victims

Results

The choice of the 6 most important variables to explain the response variable of Total.Victims were Age, History of Domestic Abuse, College Graduate %, Community Population, Gun Law Strictness (grade2019) & Association with Hate Groups. These variables were chosen as I was curious to see whether data of the community where mass shootings took place were correlated with the deadliness of the shooting, i.e. does the education rate (in terms of college graduates) and size of the community correlate with deadlier shootings? And does the gun strictness laws of the state correlate with the number of total victims. What about whether the shooter has a history of domestic abuse or a link to a hate group? These could be important as they might shed light as to the extraneous variables of what motivates a person to shoot and kill more people rather than internal variables such as race, gender, ethnicity etc. Overall from the ggpairs, we see that the response variable Total.Victims is significantly correlated with Age ( r = -0.300, p < 0.001) and Hate Group Association (r = 0.228, p <0.05). There is also a significant relationship between Age and Hate.Group.Link (r = -0.257, p < 0.01). What the ggpairs() plot shows is that as age increases, the total victims decrease, meaning that younger perpetrators of mass shooting events are deadlier than older ones. It also shows that being associated with a hate group is also correlated with a deadlier event. There is also colinearity between our two significant explanatory variables as age and hate group links, possibly showing that the effects of these variables on the explanatory variable may not be independent. College graduate levels (r = 0.167) has a p value of 0.069, which is almost significant showing that maybe there might be some small influence of increased education levels correlating with less violent shooting events. No other variable showed a significant correlation with Total Victims indicating that neither community population not the strictness of the gun laws in the state influenced the violent outcomes of a mass shooting event.

Finally, comparing history of domestic abuse and party affiliation to total victims was done in the between group comparisons to find whether a difference in party affiliation or history of domestic abuse resulted in a difference in total victims of a mass shooting event. Both tests used a non-parametric method (Mann Whitney & Kruskal Wallis) as both explanatory variables failed the assumption of normality as shown by the shapiro wilks test. ( I could have log transformed or rank transformed this data but I decided not to for no particular reason). The between groups comparisons showed no effect of group on total victims in either comparison (p > 0.05 for both the Mann Whitney and KW test), indicating that neither a history of party affiliation nor a history of prior domestic abuse was indicative of a more violent mass shooting event (again, we’re not investigating the frequency here).


Classification Trees and Random Forest

Lets try to imagine this question as a decision tree, what factors and at what levels best describes the quantity of victims of a mass shooting?

# Two classificaiton trees
# use random forest first classificaiton tree: 
rfit <-
  randomForest(
    Total.Victims ~ 
      Age + 
      College.Grad + 
      Hate.Group.Link + 
      party_simplified + 
      grade2019 + 
      History.of.Domestic.Abuse,
    data = FinalDB4
  )
print(rfit)
## 
## Call:
##  randomForest(formula = Total.Victims ~ Age + College.Grad + Hate.Group.Link +      party_simplified + grade2019 + History.of.Domestic.Abuse,      data = FinalDB4) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 167.1573
##                     % Var explained: 0.92
importance_rfit <- importance(rfit)
importance_rfit
##                           IncNodePurity
## Age                           5142.2965
## College.Grad                  4964.5875
## Hate.Group.Link               1634.2994
## party_simplified               457.1165
## grade2019                     1658.3498
## History.of.Domestic.Abuse     1220.2150
fit <- rpart(Total.Victims ~ 
               Age + 
               College.Grad 
             + Hate.Group.Link 
             + party_simplified
             + grade2019
             + History.of.Domestic.Abuse,
             data = FinalDB4)
fitplot <- rpart.plot(fit)

fitplot
## $obj
## n= 118 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 118 19906.7800 12.898310  
##    2) Age>=26.5 76  3883.6320  9.289474  
##      4) History.of.Domestic.Abuse=0,1,3 68  2439.2210  8.338235  
##        8) Hate.Group.Link< 0.5 60   982.6000  7.700000 *
##        9) Hate.Group.Link>=0.5 8  1248.8750 13.125000 *
##      5) History.of.Domestic.Abuse=2 8   859.8750 17.375000 *
##    3) Age< 26.5 42 13242.2900 19.428570  
##      6) College.Grad< 18.05 20  1566.8000 12.400000  
##       12) Age>=19.5 12   247.6667  8.166667 *
##       13) Age< 19.5 8   781.5000 18.750000 *
##      7) College.Grad>=18.05 22  9789.2730 25.818180  
##       14) College.Grad>=35.9 14  3869.7140 19.857140 *
##       15) College.Grad< 35.9 8  4551.5000 36.250000 *
## 
## $snipped.nodes
## NULL
## 
## $xlim
## [1] 0 1
## 
## $ylim
## [1] 0 1
## 
## $x
##  [1] 0.48343971 0.24086640 0.12890949 0.05427155 0.20354743 0.35282331
##  [7] 0.72601301 0.57673713 0.50209919 0.65137507 0.87528890 0.80065096
## [13] 0.94992684
## 
## $y
##  [1] 0.93276038 0.66049330 0.38822622 0.03427902 0.03427902 0.03427902
##  [7] 0.66049330 0.38822622 0.03427902 0.03427902 0.38822622 0.03427902
## [13] 0.03427902
## 
## $branch.x
##        [,1]      [,2]      [,3]       [,4]      [,5]      [,6]      [,7]
## x 0.4834397 0.2408664 0.1289095 0.05427155 0.2035474 0.3528233 0.7260130
##          NA 0.2408664 0.1289095 0.05427155 0.2035474 0.3528233 0.7260130
##          NA 0.4834397 0.2408664 0.12890949 0.1289095 0.2408664 0.4834397
##        [,8]      [,9]     [,10]     [,11]     [,12]     [,13]
## x 0.5767371 0.5020992 0.6513751 0.8752889 0.8006510 0.9499268
##   0.5767371 0.5020992 0.6513751 0.8752889 0.8006510 0.9499268
##   0.7260130 0.5767371 0.5767371 0.7260130 0.8752889 0.8752889
## 
## $branch.y
##      [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]
## y 1.00026 0.7279933 0.4557262 0.1017790 0.1017790 0.1017790 0.7279933 0.4557262
##        NA 0.8564560 0.5841889 0.3119219 0.3119219 0.5841889 0.8564560 0.5841889
##        NA 0.8564560 0.5841889 0.3119219 0.3119219 0.5841889 0.8564560 0.5841889
##        [,9]     [,10]     [,11]     [,12]     [,13]
## y 0.1017790 0.1017790 0.4557262 0.1017790 0.1017790
##   0.3119219 0.3119219 0.5841889 0.3119219 0.3119219
##   0.3119219 0.3119219 0.5841889 0.3119219 0.3119219
## 
## $labs
##  [1] "13\n100%" "9.3\n64%" "8.3\n58%" "7.7\n51%" "13\n7%"   "17\n7%"  
##  [7] "19\n36%"  "12\n17%"  "8.2\n10%" "19\n7%"   "26\n19%"  "20\n12%" 
## [13] "36\n7%"  
## 
## $cex
## [1] 1
## 
## $boxes
## $boxes$x1
##  [1] 0.43699084 0.20208799 0.09013108 0.01549313 0.17243947 0.32171535
##  [7] 0.68723460 0.53795872 0.46332078 0.62026712 0.83651048 0.76187254
## [13] 0.91881888
## 
## $boxes$y1
##  [1]  0.894608204  0.622341125  0.350074046 -0.003873157 -0.003873157
##  [6] -0.003873157  0.622341125  0.350074046 -0.003873157 -0.003873157
## [11]  0.350074046 -0.003873157 -0.003873157
## 
## $boxes$x2
##  [1] 0.52988857 0.27964481 0.16768790 0.09304996 0.23465539 0.38393127
##  [7] 0.76479143 0.61551555 0.54087760 0.68248303 0.91406731 0.83942937
## [13] 0.98103479
## 
## $boxes$y2
##  [1] 1.0002604 0.7279933 0.4557262 0.1017790 0.1017790 0.1017790 0.7279933
##  [8] 0.4557262 0.1017790 0.1017790 0.4557262 0.1017790 0.1017790
## 
## 
## $split.labs
## [1] ""
## 
## $split.cex
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## $split.box
## $split.box$x1
##  [1]  0.407161292  0.008622063 -0.012141660           NA           NA
##  [6]           NA  0.597745959  0.500458718           NA           NA
## [11]  0.738925249           NA           NA
## 
## $split.box$y1
##  [1] 0.8271082 0.5548411 0.2825740        NA        NA        NA 0.5548411
##  [8] 0.2825740        NA        NA 0.2825740        NA        NA
## 
## $split.box$x2
##  [1] 0.5597181 0.4731107 0.2699606        NA        NA        NA 0.8542801
##  [8] 0.6530155        NA        NA 1.0116525        NA        NA
## 
## $split.box$y2
##  [1] 0.8858039 0.6135368 0.3412697        NA        NA        NA 0.6135368
##  [8] 0.3412697        NA        NA 0.3412697        NA        NA

A random forest was used to predict the total victims using Age + College.Grad + Hate.Group.Link + party_simplified + grade2019 + History.of.Domestic.Abuse. A sample of the classification tree is shown as fitplot. We see that party affiliation and strictness of the gun laws did not show a difference in total victims in the trees. Following this tree, we see the first split is the most correlative split with Age where age >= 27 (left side of the tree) is generally associated with less total victims compared to the right side of the tree. Following this left tree, a history of domestic abuse classified as 0,1,3 resulted in fewer victims than having a history of domestic abuse classified as 2. Looking specifically at the coding of domestic abuse, 0 = no evidence, 1 = abused romantic partner, 2 = abused other family and 3 is a combination of 1 and 2. This shows that having no history, history of abusing romantic partner and both history of abusing partner and family were separate from only abusing family (coded by the 2 value). Finally, finishing up the left side of the tree we see that with a hate group link, there were more victims than those without hate group links. It is important to note here that this final branch where age > 27, history of domestic abuse = 0,1,3 and hate group link = 0 had the least amount of victims compared to all other scenarios.

Following the right side of the tree we see that around 36% of all shooting events had the perpetrator who was under 27 yrs of age. Of these perpetrators, if the community had less than 18% college graduates, there were often less deadly mass shootings. Going into the left branch, if the age was < 27, and college graduate% was < 18, we see another split at age >= 20, where the perpetrators of age > 20 had fewer total victims than those younger than 20 yrs of age. Finally, the highest victim count is shown when the college graduation rate is above 18 but below 36%.

Looking at node importance, we see that Age and College Graduate levels had the highest IncNodePurity as shown by the importance_rfit table.

## second classificaiton tree 

rfit2 <-
  randomForest(
    party_simplified ~ 
      Age + 
      Total.Victims + 
      College.Grad + 
      Hate.Group.Link + 
      grade2019 + 
      Employed + 
      Community.Population,
    method = "class",
    data = FinalDB3
  )

fit2 = rpart(
  party_simplified ~ 
    Age + 
    Total.Victims + 
    College.Grad + 
    Hate.Group.Link + 
    grade2019 + 
    Employed,
  method = "class",
  data = FinalDB3
)
fit2 <- rpart.plot(fit2)

fit2
## $obj
## n= 120 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 120 48 DEMOCRAT (0.60000000 0.40000000)  
##   2) grade2019>=45 60  1 DEMOCRAT (0.98333333 0.01666667) *
##   3) grade2019< 45 60 13 REPUBLICAN (0.21666667 0.78333333) *
## 
## $snipped.nodes
## NULL
## 
## $xlim
## [1] -0.2  1.2
## 
## $ylim
## [1] -0.6  1.6
## 
## $x
## [1] 0.49624888 0.07707652 0.91542125
## 
## $y
## [1] 0.91524152 0.05313218 0.05313218
## 
## $branch.x
##        [,1]       [,2]      [,3]
## x 0.4962489 0.07707652 0.9154213
##          NA 0.07707652 0.9154213
##          NA 0.49624888 0.4962489
## 
## $branch.y
##       [,1]      [,2]      [,3]
## y 1.115394 0.2532844 0.2532844
##         NA 0.6892632 0.6892632
##         NA 0.6892632 0.6892632
## 
## $labs
## [1] "DEMOCRAT\n0.40\n100%"  "DEMOCRAT\n0.02\n50%"   "REPUBLICAN\n0.78\n50%"
## 
## $cex
## [1] 1
## 
## $boxes
## $boxes$x1
## [1]  0.37156137 -0.04761099  0.77820533
## 
## $boxes$y1
## [1]  0.77965455 -0.08245479 -0.08245479
## 
## $boxes$x2
## [1] 0.6209364 0.2017640 1.0526372
## 
## $boxes$y2
## [1] 1.1153937 0.2532844 0.2532844
## 
## 
## $split.labs
## [1] ""
## 
## $split.cex
## [1] 1 1 1
## 
## $split.box
## $split.box$x1
## [1] 0.3303966        NA        NA
## 
## $split.box$y1
## [1] 0.624698       NA       NA
## 
## $split.box$x2
## [1] 0.6621012        NA        NA
## 
## $split.box$y2
## [1] 0.7538285        NA        NA
print(rfit2)
## 
## Call:
##  randomForest(formula = party_simplified ~ Age + Total.Victims +      College.Grad + Hate.Group.Link + grade2019 + Employed + Community.Population,      data = FinalDB3, method = "class") 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 14.17%
## Confusion matrix:
##            DEMOCRAT REPUBLICAN class.error
## DEMOCRAT         61         11   0.1527778
## REPUBLICAN        6         42   0.1250000
importance(rfit2)
##                      MeanDecreaseGini
## Age                         4.9128404
## Total.Victims               4.2408617
## College.Grad                8.7979169
## Hate.Group.Link             0.8047618
## grade2019                  27.4892556
## Employed                    4.6945872
## Community.Population        6.0458177
rfit2_confusion <- rfit2$confusion
rfit2_confusion
##            DEMOCRAT REPUBLICAN class.error
## DEMOCRAT         61         11   0.1527778
## REPUBLICAN        6         42   0.1250000

Here, I wanted to look at what variables I could use to predict the state’s party affiliation in the last election, using Age + Total.Victims + College.Grad + Hate.Group.Link + grade2019 + Employed + Community.Population. Confusion matrix is shown with rfit2_confusion table. Although this tree is very simple, it does show important information that dividing states down party lines can be difficult using general data such as age, college graduate % in the community & employment % in the community etc. Really the best predictor (and one that I expected to see) was the gun grade laxity scores that best showed a difference between Republican and Democratic states. Typically, democratic states tend to have higher and more strict gun laws (grade > 45) than republican states (grade < 45) but that does not necessarily correlate with fewer victims of mass shootings. Although, here it is important to also mention that a gun grade of 45 (out of a maximum of 100) isn’t a great grade either. Perhaps this is even more indicative that even the most strict gun laws in the united states don’t do a great job of lowering total victims of a mass shooting when it occurs, hence why we see no difference in mass shooting victims in the states with the most strict gun laws and with the least strict gun laws (generally speaking).

# model building, setpwise regression to identify which variables best explain Total.Victims  
library(leaps)
library(FactoMineR)

# non-linear variable = Total.Victims / community population
FinalDB4$Prop.Victims <-
  FinalDB4$Total.Victims / FinalDB4$Community.Population

n = nrow(FinalDB4)

Tot.Victims_all_mod = lm(
  Total.Victims ~ 
    Age + 
    Hate.Group.Link + 
    Employed + 
    College.Grad  + 
    gunOwnership + 
    grade2019 ,
  data = FinalDB4
)
summary(Tot.Victims_all_mod)
## 
## Call:
## lm(formula = Total.Victims ~ Age + Hate.Group.Link + Employed + 
##     College.Grad + gunOwnership + grade2019, data = FinalDB4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.014  -6.281  -3.328   2.076  69.291 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      33.24774   12.07818   2.753  0.00691 **
## Age              -0.29779    0.09760  -3.051  0.00285 **
## Hate.Group.Link   1.93927    1.21051   1.602  0.11199   
## Employed          0.08001    0.09645   0.830  0.40858   
## College.Grad      0.07301    0.06587   1.108  0.27008   
## gunOwnership    -29.94971   18.92510  -1.583  0.11637   
## grade2019        -0.10785    0.05558  -1.940  0.05486 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.25 on 111 degrees of freedom
## Multiple R-squared:  0.1635, Adjusted R-squared:  0.1183 
## F-statistic: 3.615 on 6 and 111 DF,  p-value: 0.002589
stepwise_TotalVic <- step(Tot.Victims_all_mod, k = log(n))
## Start:  AIC=617.45
## Total.Victims ~ Age + Hate.Group.Link + Employed + College.Grad + 
##     gunOwnership + grade2019
## 
##                   Df Sum of Sq   RSS    AIC
## - Employed         1    103.23 16756 613.41
## - College.Grad     1    184.31 16837 613.98
## - gunOwnership     1    375.72 17028 615.31
## - Hate.Group.Link  1    385.03 17037 615.38
## - grade2019        1    564.90 17217 616.62
## <none>                         16652 617.45
## - Age              1   1396.55 18049 622.18
## 
## Step:  AIC=613.41
## Total.Victims ~ Age + Hate.Group.Link + College.Grad + gunOwnership + 
##     grade2019
## 
##                   Df Sum of Sq   RSS    AIC
## - gunOwnership     1    356.01 17112 611.12
## - Hate.Group.Link  1    359.18 17115 611.14
## - College.Grad     1    366.05 17122 611.19
## - grade2019        1    520.91 17277 612.25
## <none>                         16756 613.41
## - Age              1   1363.99 18120 617.87
## 
## Step:  AIC=611.12
## Total.Victims ~ Age + Hate.Group.Link + College.Grad + grade2019
## 
##                   Df Sum of Sq   RSS    AIC
## - grade2019        1    165.20 17277 607.48
## - Hate.Group.Link  1    378.49 17490 608.93
## - College.Grad     1    403.63 17515 609.10
## <none>                         17112 611.12
## - Age              1   1131.18 18243 613.90
## 
## Step:  AIC=607.48
## Total.Victims ~ Age + Hate.Group.Link + College.Grad
## 
##                   Df Sum of Sq   RSS    AIC
## - College.Grad     1    347.62 17625 605.06
## - Hate.Group.Link  1    388.39 17665 605.34
## <none>                         17277 607.48
## - Age              1   1206.38 18483 610.68
## 
## Step:  AIC=605.06
## Total.Victims ~ Age + Hate.Group.Link
## 
##                   Df Sum of Sq   RSS    AIC
## - Hate.Group.Link  1    485.46 18110 603.50
## <none>                         17625 605.06
## - Age              1   1245.17 18870 608.35
## 
## Step:  AIC=603.5
## Total.Victims ~ Age
## 
##        Df Sum of Sq   RSS    AIC
## <none>              18110 603.50
## - Age   1    1796.8 19907 609.89
stepwise_TotalVic
## 
## Call:
## lm(formula = Total.Victims ~ Age, data = FinalDB4)
## 
## Coefficients:
## (Intercept)          Age  
##     23.6595      -0.3171
# what about victims/community population
Tot.Victims_all_mod2 = lm(
  Prop.Victims ~ 
    Age + 
    Employed + 
    College.Grad + 
    gunOwnership + 
    grade2019 + 
    Hate.Group.Link,
  data = FinalDB4
)

summary(Tot.Victims_all_mod2)
## 
## Call:
## lm(formula = Prop.Victims ~ Age + Employed + College.Grad + gunOwnership + 
##     grade2019 + Hate.Group.Link, data = FinalDB4)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.003784 -0.001610 -0.000659  0.000238  0.067562 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      8.239e-03  6.602e-03   1.248    0.215
## Age             -6.419e-05  5.335e-05  -1.203    0.231
## Employed         4.152e-05  5.272e-05   0.788    0.433
## College.Grad    -1.543e-05  3.601e-05  -0.428    0.669
## gunOwnership    -9.740e-03  1.034e-02  -0.942    0.348
## grade2019       -5.022e-05  3.038e-05  -1.653    0.101
## Hate.Group.Link -3.259e-04  6.617e-04  -0.493    0.623
## 
## Residual standard error: 0.006695 on 111 degrees of freedom
## Multiple R-squared:  0.04225,    Adjusted R-squared:  -0.009521 
## F-statistic: 0.8161 on 6 and 111 DF,  p-value: 0.5597
# hm it doesn't seem to be a better response variable to predict

A backwards stepwise regression was used to predict Total.Victims using the variables of Age + Hate.Group.Link + Employed + College.Grad + gunOwnership + grade2019 to see which combination of explanatory variables best predicted the response variable. The stepwise reg. showed the lowest AIC value (AIC = 603.5) in the model that only contained Age with the next best model containing both Age and Hate Group Link (AIC = 605.06). Again, this indicates that age is the best predictor of how deadly a mass shooting event may be with hate group links being the next best predictor.

To account and normalize across the different populations I tried to get a proportional victims to community by dividing total.victims / community population and creating a “non-linear” variable to predict. However, this did not result in any values being significantly correlated to this new predictor variable (Tot.Victims_all_mod2) and was left out of another stepwise regression test.

## PCA 
col_order <- c("Total.Victims", "Age", "Employed", "College.Grad", "Community.Population","grade2019", "gunOwnership", "Hate.Group.Link")

PCAOrder <- FinalDB4[,col_order]

Tot.Victims_PCA <- PCA(PCAOrder, quali.sup = 8)

summary(Tot.Victims_PCA)
## 
## Call:
## PCA(X = PCAOrder, quali.sup = 8) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               1.993   1.497   1.086   0.985   0.691   0.578   0.169
## % of var.             28.469  21.385  15.515  14.076   9.876   8.261   2.417
## Cumulative % of var.  28.469  49.854  65.370  79.445  89.321  97.583 100.000
## 
## Individuals (the 10 first)
##                          Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                    |  3.293 |  0.851  0.308  0.067 | -2.428  3.339  0.544 |
## 2                    |  3.674 |  2.010  1.717  0.299 | -0.609  0.210  0.027 |
## 3                    |  1.637 |  0.250  0.027  0.023 |  0.031  0.001  0.000 |
## 4                    |  4.531 | -3.147  4.212  0.483 | -1.810  1.856  0.160 |
## 5                    |  3.060 | -0.763  0.247  0.062 | -1.380  1.079  0.203 |
## 6                    |  1.846 | -1.397  0.830  0.573 | -0.695  0.273  0.142 |
## 7                    |  2.334 |  0.200  0.017  0.007 | -1.740  1.714  0.556 |
## 8                    |  2.279 |  1.471  0.920  0.416 | -1.032  0.603  0.205 |
## 9                    |  2.113 |  1.960  1.634  0.861 | -0.351  0.070  0.028 |
## 10                   |  2.394 | -0.933  0.370  0.152 |  1.307  0.967  0.298 |
##                       Dim.3    ctr   cos2  
## 1                    -1.463  1.669  0.197 |
## 2                    -1.904  2.830  0.269 |
## 3                     1.104  0.951  0.455 |
## 4                    -0.962  0.722  0.045 |
## 5                     1.826  2.603  0.356 |
## 6                     0.879  0.603  0.227 |
## 7                     1.067  0.888  0.209 |
## 8                    -0.096  0.007  0.002 |
## 9                    -0.043  0.001  0.000 |
## 10                   -1.180  1.086  0.243 |
## 
## Variables
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## Total.Victims        | -0.105  0.556  0.011 |  0.667 29.745  0.445 | -0.434
## Age                  |  0.274  3.759  0.075 | -0.513 17.608  0.264 |  0.584
## Employed             |  0.383  7.358  0.147 |  0.555 20.585  0.308 |  0.504
## College.Grad         |  0.322  5.193  0.103 |  0.670 29.985  0.449 |  0.367
## Community.Population |  0.255  3.266  0.065 |  0.013  0.011  0.000 | -0.153
## grade2019            |  0.888 39.569  0.789 | -0.109  0.800  0.012 | -0.274
## gunOwnership         | -0.896 40.298  0.803 |  0.138  1.266  0.019 |  0.262
##                         ctr   cos2  
## Total.Victims        17.381  0.189 |
## Age                  31.435  0.341 |
## Employed             23.408  0.254 |
## College.Grad         12.406  0.135 |
## Community.Population  2.152  0.023 |
## grade2019             6.897  0.075 |
## gunOwnership          6.321  0.069 |
## 
## Supplementary categories
##                          Dist    Dim.1   cos2 v.test    Dim.2   cos2 v.test  
## 0                    |  0.158 | -0.007  0.002 -0.117 | -0.139  0.769 -2.566 |
## 1                    |  0.657 |  0.382  0.338  1.029 |  0.502  0.584  1.562 |
## 3                    |  1.127 | -0.558  0.245 -0.801 | -0.038  0.001 -0.062 |
## 4                    |  1.712 | -0.406  0.056 -0.655 |  1.393  0.662  2.590 |
##                       Dim.3   cos2 v.test  
## 0                     0.066  0.171  1.422 |
## 1                    -0.011  0.000 -0.042 |
## 3                    -0.699  0.384 -1.358 |
## 4                    -0.670  0.153 -1.463 |
head(Tot.Victims_PCA$ind$coord)
##        Dim.1       Dim.2      Dim.3      Dim.4      Dim.5
## 1  0.8511016 -2.42847890 -1.4626542 -0.4859203  0.3390529
## 2  2.0095871 -0.60912653 -1.9044597  2.0089383 -1.0363564
## 3  0.2497381  0.03056528  1.1042196 -0.9771108  0.1127041
## 4 -3.1470891 -1.81048134 -0.9617541 -1.5199614  0.3547356
## 5 -0.7625083 -1.38029977  1.8263951 -0.6589522  1.5259096
## 6 -1.3972124 -0.69474944  0.8788370  0.2599342  0.2978147
PCAOrder_Dim <- Tot.Victims_PCA$ind$coord

FinalDB5 <- cbind(PCAOrder, PCAOrder_Dim)

# lm() based on the dimensions... 

Model_PCA_Victim_Dim <- lm(Total.Victims ~ 
                             Dim.1 + 
                             Dim.2 + 
                             Dim.3, 
                           data = FinalDB5)

summary(Model_PCA_Victim_Dim)
## 
## Call:
## lm(formula = Total.Victims ~ Dim.1 + Dim.2 + Dim.3, data = FinalDB5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4957  -6.2279  -0.2768   4.6222  30.6881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.8983     0.7247  17.799  < 2e-16 ***
## Dim.1        -0.9687     0.5133  -1.887   0.0617 .  
## Dim.2         7.0838     0.5923  11.960  < 2e-16 ***
## Dim.3        -5.4150     0.6954  -7.787 3.44e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.872 on 114 degrees of freedom
## Multiple R-squared:  0.6451, Adjusted R-squared:  0.6358 
## F-statistic: 69.08 on 3 and 114 DF,  p-value: < 2.2e-16
# a second lm() based on 5 dimensions 

Model_PCA_Victim_Dim_2 <- lm(Total.Victims ~ 
                               Dim.1 + 
                               Dim.2 + 
                               Dim.3 + 
                               Dim.4 + 
                               Dim.5, 
                             data = FinalDB5)

summary(Model_PCA_Victim_Dim_2)
## 
## Call:
## lm(formula = Total.Victims ~ Dim.1 + Dim.2 + Dim.3 + Dim.4 + 
##     Dim.5, data = FinalDB5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9655 -0.7545  0.0003  0.7683  4.0681 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.89831    0.12867 100.244  < 2e-16 ***
## Dim.1       -0.96866    0.09115 -10.628  < 2e-16 ***
## Dim.2        7.08381    0.10516  67.359  < 2e-16 ***
## Dim.3       -5.41505    0.12346 -43.859  < 2e-16 ***
## Dim.4        0.58386    0.12962   4.504 1.64e-05 ***
## Dim.5        9.13416    0.15475  59.024  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.398 on 112 degrees of freedom
## Multiple R-squared:  0.989,  Adjusted R-squared:  0.9885 
## F-statistic:  2016 on 5 and 112 DF,  p-value: < 2.2e-16

A PCA was used to predict which dimensions of a subest of the mass shooting data best described total victims. Using Hate Group Link as a qualitative supplementary variable, we see from the summary that the first 3 dimensions only explain ~65.3% of the variance in the whole dataset and most of the data is explained by using 6/7 dimensions ( ~97.58% variance). This is again indicative of how much variance there is in the data mass shooter database and the difficulty of determining predictor variables for how many victims there will be in any given mass shooting. Using these dimensions in a linear model we see that the best r squared value comes from Model_PCA_Victim_Dim_2 (r squared = 0.988, p < 2.2e-16) which contains 5 dimensions compared to the 3 dimension model which has a fairly lower r squared value of 0.635.

Ethical Concerns

As is the case with sensitive material such as a mass shooting database, there are several ethical concerns with the usage of this data. First and foremost, this dataset was obtained via request to the Violence Project who have restricted the publishing and publicaly presenting of any analysis undertaken without their explicit permission. From an analysis perspective, it was also an ethical (and moral consideration in my opinion) that the names of both the perpetrators and victims be unmentioned as they are all available in the main database. Though I omitted including the specific city of where each event occurred I did use the community population, the year, month and day of each event which can be traced back to each location. This was done deliberately as I was curious to see what variables best explained the violent nature of each mass shooting event.

Finally, there is a final point about gun control and how results of this analysis may impact how people view this hugely contested issue. Looking at the face value of the results of this analysis, it shows that gun laws did not correlate with a change in the number of victims of a mass shooting event. This result may make someone inclined to think that gun laws are therefore, not effective. This kind of analysis may be sound (ish), but not valid. Gun laws do work. This analysis did not look at frequency of gun violence and gun law strictness because that work is already established and has shown to work in reducing gun violence. This work ONLY looked at events once they had already occurred. To reduce overall violent nature of mass shootings, reducing the overall frequency of shootings is an incredibly effective method and this analysis should not be viewed as something that goes against gun laws and creating new gun laws.



Thank you for reading, if you want to check out more of my work click here to go back to the homepage.