Mass shootings have become a hot topic recently, so I found some data to explore on Shooting Tracker, a website that has recorded mass shootings from 2013 to the present. Each incident is cited by a news source and is defined as having four or more victims, either killed or injured.

I also want to start off by saying that I am not advocating for or against gun control; I simply wanted to explore the data.

library(RCurl)
library(data.table)
library(reshape2) # install newer data.table package to get melt function
library(ggmap)
library(dismo)
## Warning: no function found corresponding to methods exports from 'raster'
## for: 'overlay'
library(rgeos)
library(scales)

There are two options for loading the data from Shooting Tracker; the code below shows both:

  1. Download yearly CSV files directly from the website
  2. Download the data onto your computer, then load it into R

I will be using data I downloaded on December 2, 2015, after the shooting in San Bernardino, CA, and I’ve uploaded the file onto a GitHub repo so that these results can be reproduced.

Loading the data

# Option 1: Via website, direct download
link2013 <- "http://shootingtracker.com/tracker/2013MASTER.csv"
link2014 <- "http://shootingtracker.com/tracker/2014MASTER.csv"
link2015 <- "http://shootingtracker.com/tracker/2015CURRENT.csv"

x <- getURL(link2013)
shooting2013 <- setDT(read.csv(textConnection(x)))

x <- getURL(link2014)
shooting2014 <- setDT(read.csv(textConnection(x)))

x <- getURL(link2015)
shooting2015 <- setDT(read.csv(textConnection(x)))

# The column names aren't uniform across the files; we need to change
# 2013's and remove the columns referencing the supporting news article
# for each observation
dtList = vector("list", 3)
dtList[[1]] = shooting2013[, .(date, shooter, killed, wounded, location)]
setnames(dtList[[1]], names(dtList[[1]]), c("Date", "Shooter", "Dead", "Injured", "Location"))
dtList[[2]] = shooting2014[, .(Date, Shooter, Dead, Injured, Location)]
dtList[[3]] = shooting2015[, .(Date, Shooter, Dead, Injured, Location)]
DT <- rbindlist(dtList)

# Change column classes
DT[, c("Date","Shooter","Location") := 
     list(as.Date(Date, format="%m/%d/%Y"), as.character(Shooter), as.character(Location))]
# Option 2: Download Files and Load via HD
setwd('~/Downloads/') 
# File located on my GitHub page
DT <- fread("MassShootings_20151202.csv")
head(DT)
##          Date                   Shooter Dead Injured       Location
## 1: 2013-01-01           Carlito Montoya    4       0 Sacramento, CA
## 2: 2013-01-01                   Unknown    1       3  Hawthorne, CA
## 3: 2013-01-01               Julian Sims    0       4 McKeesport, PA
## 4: 2013-01-01 Desmen Noble, Damian Bell    1       4     Lorain, OH
## 5: 2013-01-05           Sonny Archuleta    4       0     Aurora, CO
## 6: 2013-01-07    Cedric and James Poore    5       0      Tulsa, OK

Once the data is loaded we'll need to go through the Location column and manually clean up misspellings or slight variations of the same location. I've cleaned the data through 2015-12-02, but I don't show the code here. If you download the .Rmd file that produced this page from my GitHub repo, you'll see all the code in a hidden code chunk.
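
To give a flavor of that cleanup, here's a minimal sketch; the particular corrections shown are hypothetical examples, not the full list from the hidden chunk.

# A sketch of the location cleanup; these specific fixes are
# hypothetical examples, not the full set of corrections
fixes <- c("Washington DC"  = "Washington, DC",
           "Sacremento, CA" = "Sacramento, CA")
DT[Location %in% names(fixes), Location := fixes[Location]]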

Once the data is cleaned we can start creating summary information to explore. First let's break the Date column out into year, month, and monthYear. In the same line of code we'll calculate the total number of victims of each incident, nVictims.

DT[, c("year", "month", "monthYear", "nVictims") := 
     list(year(Date), month(Date), paste0(month(Date), "-", year(Date)),Dead+Injured)]

Next we can break out which state each incident occurred in, as well as calculate the number of incidents that have occurred in each state, StateFreq, and the aggregate number of victims per state, nVictimsPerState.

DT[, State := substr(DT$Location, start = nchar(DT$Location)-1, stop = nchar(DT$Location))]

DT[, c("StateFreq", "nVictimsPerState") := list(.N, sum(nVictims)), by = State ]

Summary Info & Plots

Now we can start creating some summary statistics and basic plots to explore the data. First let's take a look at the number of incidents and the total number of victims per year:

DT[, .N, by = year]
##    year   N
## 1: 2013 363
## 2: 2014 336
## 3: 2015 353

DT[, sum(nVictims), by = year]
##    year   V1
## 1: 2013 1768
## 2: 2014 1622
## 3: 2015 1774

Now we can create some plots of the data. Let's take a look at the distribution of victims for each month of each year. The number of shooting victims tends to rise in the months between late spring and early fall, and the number of injured (but not killed) victims far outweighs the number killed.

DT[, c("nMYDead", "nMYInjured") := list(sum(Dead),sum(Injured)), by = monthYear ]

p1 = DT[,.(year, month, monthYear, nMYDead, nMYInjured)]
setnames(p1, c("nMYDead", "nMYInjured"), c("nDead", "nInjured"))

p1 <- unique(melt(p1, id = c("year", "month", "monthYear"), 
                  measure = c("nDead", "nInjured")))
setnames(p1, "variable", "victimType")

ggplot(p1, aes(x=factor(month), y=value, fill = victimType)) + 
  geom_bar(stat="identity") + 
  facet_wrap(~year) + 
  ggtitle("Number of Victims Per Month By Year") +
  xlab("Month") +
  ylab("Victims")

Another type of plot to look at is a scatterplot, where the x-axis is monthYear, the y-axis is the number of incidents, nIncidentsMoYr, and the size of each point represents avgVictimsMoYr.

DT[, c("nIncidentsMoYr", "nVictimsMoYr", "avgVictimsMoYr") := 
     list(.N, sum(nVictims), mean(nVictims)), by = monthYear]

p2 = unique(DT[,.(monthYear, nIncidentsMoYr, avgVictimsMoYr)])
p2[, monthYear := as.Date(paste0("1-",p2$monthYear), format="%d-%m-%Y")]

# Exclude December 2015, as there are only 2 incidents
ggplot(p2[-nrow(p2)], aes(x = monthYear, y = nIncidentsMoYr, size = avgVictimsMoYr)) +
  geom_point() + scale_size(range = c(1,13)) +
  ggtitle("Number of Victims per Month-Year \n Point size = Avg Victims Per Month-Year")

Now let's take a look at where some of these incidents are occurring, starting with state-level data. We've already calculated nVictimsPerState, so here we just sort and plot the data, showing the top 10 states with the most victims, stacked by year.

top10States = unique(DT[, .(State, nVictimsPerState)])[
                    order(-nVictimsPerState)][1:10][
                      ,State := factor(State, levels = State[1:10])]

p3 = unique(DT[State %in% top10States$State,
                           .(year, State, nVictims)][
                             ,nVictByYear := sum(nVictims), by=.(year, State) ][
                               ,nVictims:=NULL][
                                 ,State := factor(State, levels = top10States$State)] )
p3[, year := factor(year)]

ggplot(p3, aes(x=factor(State), y=nVictByYear, fill = year)) +
  geom_bar(stat="identity") + 
  ggtitle("Top 10 States with Most Victims") +
  xlab("State") +
  ylab("Victims")

If we'd like, we can also dig deeper into the data by looking at individual locations. Below are the top 10 locations with the most victims, though I won't explore them further here.

DT[, c("LocationFreq", "nVictimsLocation") := list(.N, sum(nVictims)), by = Location]

# Find the locations with the largest number of mass shootings
locationFreqDT = unique(DT[,.(State, Location, LocationFreq, nVictimsLocation)],
                        by = "Location")[order(-nVictimsLocation)][1:10]
locationFreqDT
##     State         Location LocationFreq nVictimsLocation
##  1:    IL      Chicago, IL           49              243
##  2:    NY     New York, NY           29              136
##  3:    LA  New Orleans, LA           20              123
##  4:    MI      Detroit, MI           21              110
##  5:    DC   Washington, DC           14               88
##  6:    FL        Miami, FL           16               83
##  7:    MD    Baltimore, MD           16               80
##  8:    MO    St. Louis, MO           18               79
##  9:    PA Philadelphia, PA           15               72
## 10:    TX      Houston, TX           14               71

Forecasting the number of victims

We saw above that there seems to be a bit of a trend, so let's try breaking the total number of victims down into weeks instead of months. We'll then put the data in a time series object to see if there is any autocorrelation.

weeklyIncidents = DT[,.(Date, year, nVictims)][,week := week(Date)][order(Date)]
weeklyIncidents[,weeklyVictims := sum(nVictims), by = c("year", "week")] 
weeklyIncidents <- unique(weeklyIncidents[,.(year, week, weeklyVictims)], by = c("year", "week"))
weeklyTS <- ts(weeklyIncidents[,.(weeklyVictims)], start = c(2013,1), freq = 52)
acf(weeklyTS)

We can see that there's some autocorrelation; it's not very significant, but it is nonetheless present.
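
If you'd like a number to go with the ACF plot, a Ljung-Box test is one quick way to check whether the autocorrelation is jointly significant. This test isn't part of the original analysis, just a quick check.

# Ljung-Box test over the first 10 lags; a small p-value would
# indicate the weekly series is not simply white noise
Box.test(weeklyTS, lag = 10, type = "Ljung-Box")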

Now let's forecast the total number of weekly victims over the next year. To do this, we'll use the Holt-Winters method. With this method, the extrapolated values are based on the trends in the period over which the model was fitted, and would be a sensible prediction assuming those trends continue. The extrapolation looks fairly appropriate; however, unforeseen events could lead to future values completely different from those shown here.

library(knitr)
weeklyHW <- HoltWinters(weeklyTS, seasonal = "multi")
weeklyPredict <- predict(weeklyHW, n.ahead = 52)
ts.plot(weeklyTS, weeklyPredict, lty = 1:2, main = "Holt-Winters fit for weekly shooting victims \n 52 weeks")

data.table(yearWeek = c(paste0("2015-",start(weeklyPredict)[2]:52), paste0("2016-",1:23)),
           nVictims_Prediction = weeklyPredict[1:26],
           yearWeek = paste0("2016-",c(24:end(weeklyPredict)[2])),
           nVictims_Prediction = weeklyPredict[27:52])
##     yearWeek nVictims_Prediction yearWeek nVictims_Prediction
##  1:  2015-50            26.55206  2016-24            65.69150
##  2:  2015-51            28.16228  2016-25            44.63000
##  3:  2015-52            50.44726  2016-26            35.89538
##  4:   2016-1            24.87902  2016-27            67.18090
##  5:   2016-2            23.84165  2016-28            41.20263
##  6:   2016-3            34.96725  2016-29            75.54498
##  7:   2016-4            19.89960  2016-30            48.98417
##  8:   2016-5            25.44636  2016-31            47.86119
##  9:   2016-6            22.52320  2016-32            54.49418
## 10:   2016-7            26.79052  2016-33            50.92211
## 11:   2016-8            28.50432  2016-34            56.89886
## 12:   2016-9            20.10224  2016-35            39.31355
## 13:  2016-10            34.55163  2016-36            40.62397
## 14:  2016-11            20.09817  2016-37            49.34495
## 15:  2016-12            26.64073  2016-38            52.65025
## 16:  2016-13            47.24012  2016-39            51.39803
## 17:  2016-14            16.86422  2016-40            37.79078
## 18:  2016-15            38.02083  2016-41            27.80265
## 19:  2016-16            40.00142  2016-42            27.58412
## 20:  2016-17            24.37004  2016-43            42.86138
## 21:  2016-18            31.24427  2016-44            22.41134
## 22:  2016-19            34.31264  2016-45            32.47545
## 23:  2016-20            56.69916  2016-46            32.73432
## 24:  2016-21            36.98326  2016-47            58.35132
## 25:  2016-22            52.07376  2016-48            31.81461
## 26:  2016-23            33.67889  2016-49            40.92262
##     yearWeek nVictims_Prediction yearWeek nVictims_Prediction

Plotting data on maps

Let's now take a look at an actual map. The first thing we'll do is download the lat and lon coordinates for each location. This will take a few minutes, depending on your internet connection. For convenience, I've also uploaded a CSV file with all the coordinates to GitHub, so instead of geocoding them all now you can simply load that file.

Once we have the coordinates loaded, we’ll merge them into the main data.table.

uniqueLocations <- unique(DT$Location)
coords <- suppressMessages(geocode(uniqueLocations))
locationCoords = data.table(Location = uniqueLocations,
                            lon = coords$lon,
                            lat = coords$lat, key="Location")
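
If you'd rather skip the geocoding, you can load the pre-geocoded file instead; the sketch below assumes a placeholder filename for the CSV from GitHub.

# Alternative: load the pre-geocoded coordinates
# ("locationCoords.csv" is a placeholder filename)
locationCoords <- fread("locationCoords.csv")
setkey(locationCoords, Location)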

# Bring in the lat-lon coordinates
setkey(DT, Location)
setkey(locationCoords, Location)
DT = locationCoords[DT]

Now that we have the coordinates loaded we can plot a map. First let’s plot the average number of victims per incident in each location, avgVictimsPerLocation. Once we have that we can load the map and plot the data.

DT[, avgVictimsPerLocation := mean(nVictims), by = Location]

getTheMap <- suppressMessages(get_map(location = 'united states', 
                                      zoom = 4, source = 'google'))

mapDT <- unique( DT[, .(lon, lat, avgVictimsPerLocation)] )
p4 <- suppressMessages( ggmap(getTheMap) + 
                          geom_point(aes(x=lon, y=lat), 
                                     data = mapDT,
                                     size=sqrt(mapDT$avgVictimsPerLocation),
                                     colour="red", 
                                     alpha=.5) + 
                          ggtitle("Avg Number of Victims Per Incident & Location"))
print(p4)

We can also home in on a specific area and find all the incidents within a certain radius of a city. To do this, we'll need a few spatial packages. I'll focus on Los Angeles, CA, because it is the largest city near San Bernardino, and I'll show all the incidents that occurred within a 75-mile radius of the city.

First let's create a function that will help us select the other cities based on their lat/lon coordinates. The distance calculation below is the great-circle distance, computed with the spherical law of cosines.

Note: I found the two functions below here: Great-circle distance calculations in R

deg2rad <- function(deg) return(deg*pi/180)

gcd.slc <- function(long1, lat1, long2, lat2) {
  long1 <- deg2rad(long1)
  lat1 <- deg2rad(lat1)
  long2 <- deg2rad(long2)
  lat2 <- deg2rad(lat2)
  R <- 6371 # Earth mean radius [km]
  d <- acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
  d <- d*0.621371 #in miles
  return(d) # Distance in miles
}
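
As a quick sanity check on the formula: one degree of longitude along the equator spans 6371 * pi/180, roughly 111.19 km, which is about 69.09 miles.

# Sanity check: one degree of longitude at the equator
gcd.slc(0, 0, 1, 0)  # should return roughly 69.09 (miles)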

Now we'll create data.tables for our city of interest and for all other cities, then we'll use the function created above to find all cities within 75 miles of Los Angeles.

cityOfInterest = unique(DT[Location == "Los Angeles, CA",.(Location, lon, lat, LocationFreq, nVictimsLocation)])
otherCities = DT[Location != "Los Angeles, CA",.(Location, lon, lat, LocationFreq, nVictimsLocation)]

dist <- gcd.slc(cityOfInterest$lon, cityOfInterest$lat, otherCities$lon, otherCities$lat)
distIx <- which( dist <= 75 )
metroAreaDT = rbind(unique(otherCities[distIx]), cityOfInterest)
metroAreaDT[,.(Location, LocationFreq, nVictimsLocation)][order(-nVictimsLocation)]
##                    Location LocationFreq nVictimsLocation
##  1:         Los Angeles, CA           14               63
##  2:      San Bernardino, CA            4               47
##  3:          Long Beach, CA            5               21
##  4:              Pomona, CA            3               13
##  5:       Moreno Valley, CA            2               11
##  6:        Santa Monica, CA            1               10
##  7:           Riverside, CA            1                9
##  8:              Tustin, CA            1                7
##  9: San Fernando Valley, CA            1                6
## 10:            Highland, CA            1                5
## 11:            Pasadena, CA            1                5
## 12:              Perris, CA            1                5
## 13:           Santa Ana, CA            1                5
## 14:          Bellflower, CA            1                4
## 15:             Compton, CA            1                4
## 16:             Fontana, CA            1                4
## 17:           Hawthorne, CA            1                4
## 18:           Lancaster, CA            1                4
## 19:       Mission Viejo, CA            1                4
## 20:             Norwalk, CA            1                4
## 21:          Wilmington, CA            1                4
##                    Location LocationFreq nVictimsLocation

Finally, we're going to convert our data into spatial objects. The packages here work with distances in meters, so we'll use a small helper function to convert our 75-mile radius into meters.

Note: The code below was found on GIS StackExchange

cityMap <- gmap("Los Angeles, CA", zoom = 8, scale = 2)

coordinates(cityOfInterest) <- ~ lon + lat
projection(cityOfInterest) <- "+init=epsg:4326"
centralLocation <- spTransform(cityOfInterest, CRS = CRS(projection(cityMap)))

mile2meter <- function(x) {
  x * 1609.344
}
distLayer <- rgeos::gBuffer(centralLocation, width = mile2meter(75) )


plot(cityMap)
plot(distLayer, col = alpha("blue", .35), add = TRUE)

# now plot the incident locations
coordinates(metroAreaDT) <- ~ lon + lat
projection(metroAreaDT) <- "+init=epsg:4326"
allIncidents <- spTransform(metroAreaDT, CRS = CRS(projection(cityMap)))
points(allIncidents, cex = 2, pch = 20)