Mass shootings have become a hot topic recently, so I found some data to explore on the website Shooting Tracker, which has recorded Mass Shootings from 2013 to present. Each incident is cited by a news source, and is defined as having 4 or more victims, either killed or injured.
I also want to start off by saying I am not advocating for or against gun-control, I simply wanted to explore the data.
library(RCurl)
library(data.table)
library(reshape2) # install newer data.table package to get melt function
library(ggmap)
library(dismo)
## Warning: no function found corresponding to methods exports from 'raster'
## for: 'overlay'
library(rgeos)
library(scales)
There are two options to load the data from Shooting Tracker, the code below shows both methods.
I will be using data I downloaded on December 2, 2015
, after the shooting in San Bernardino, CA, and I’ve uploaded the file onto a GitHub repo so that these results can be reproduced.
# Option 1: Via website, direct download
link2013 <- "http://shootingtracker.com/tracker/2013MASTER.csv"
link2014 <- "http://shootingtracker.com/tracker/2014MASTER.csv"
link2015 <- "http://shootingtracker.com/tracker/2015CURRENT.csv"
x <- getURL(link2013)
shooting2013 <- setDT(read.csv(textConnection(x)))
x <- getURL(link2014)
shooting2014 <- setDT(read.csv(textConnection(x)))
x <- getURL(link2015)
shooting2015 <- setDT(read.csv(textConnection(x)))
# The names in the files aren't uniform, need to chnage 2013's and remove
# the columns referencing an article for support of each observation
dtList = vector("list", 3)
dtList[[1]] = shooting2013[, .(date, shooter, killed, wounded, location)]
setnames(dtList[[1]], names(dtList[[1]]), c("Date", "Shooter", "Dead", "Injured", "Location"))
dtList[[2]] = shooting2014[, .(Date, Shooter, Dead, Injured, Location)]
dtList[[3]] = shooting2015[, .(Date, Shooter, Dead, Injured, Location)]
DT <- rbindlist(dtList)
# Change column classes
DT[, c("Date","Shooter","Location") :=
list(as.Date(Date, format="%m/%d/%Y"), as.character(Shooter), as.character(Location))]
# Option 2: Download Files and Load via HD
setwd('~/Downloads/')
# File located on my GitHub page
DT <- fread("MassShootings_20151202.csv")
head(DT)
## Date Shooter Dead Injured Location
## 1: 2013-01-01 Carlito Montoya 4 0 Sacramento, CA
## 2: 2013-01-01 Unknown 1 3 Hawthorne, CA
## 3: 2013-01-01 Julian Sims 0 4 McKeesport, PA
## 4: 2013-01-01 Desmen Noble, Damian Bell 1 4 Lorain, OH
## 5: 2013-01-05 Sonny Archuleta 4 0 Aurora, CO
## 6: 2013-01-07 Cedric and James Poore 5 0 Tulsa, OK
Once the data is loaded we’ll need to go through the Location
column and manually clean up misspellings or slight variations of the same location. I’ve cleaned the data up through 2015-12-02, however I don’t show the code here. If you download the .Rmd file that produced this page on my GitHub repo you will see all the code in a hidden code chunk.
Once the data is cleaned we can start creating summary information to explore the data. First let’s break out the Date
column into Year
, Month
, and monthYear
. In the same line of code we’ll calculate the total numnber of victims of each incident, nVictims
.
DT[, c("year", "month", "monthYear", "nVictims") :=
list(year(Date), month(Date), paste0(month(Date), "-", year(Date)),Dead+Injured)]
Next we can break out which State
the incident occured in, as well as calculate the number of incidents have occured in each state, StateFreq
, and aggregate nVictimsPerState
.
DT[, State := substr(DT$Location, start = nchar(DT$Location)-1, stop = nchar(DT$Location))]
DT[, c("StateFreq", "nVictimsPerState") := list(.N, sum(nVictims)), by = State ]
Now we can start creating some summary statistics and basic plots to explore the data. First let’s take a look at:
DT[, .N, by = year]
## year N
## 1: 2013 363
## 2: 2014 336
## 3: 2015 353
DT[, sum(nVictims), by = year]
## year V1
## 1: 2013 1768
## 2: 2014 1622
## 3: 2015 1774
Now we can create some plots of the data. Let’s take a look at the distribution of victims for each month of each year. We can see there tends to be an increasing number of shooting victims in the months between late-Spring and early-Fall, and the number of injured (not killed) victims far outweighs the number of victims killed.
DT[, c("nMYDead", "nMYInjured") := list(sum(Dead),sum(Injured)), by = monthYear ]
p1 = DT[,.(year, month, monthYear, nMYDead, nMYInjured)]
setnames(p1, c("nMYDead", "nMYInjured"), c("nDead", "nInjured"))
p1 <- unique(melt(p1, id = c("year", "month", "monthYear"),
measure = c("nDead", "nInjured")))
setnames(p1, "variable", "victimType")
ggplot(p1, aes(x=factor(month), y=value, fill = victimType)) +
geom_bar(stat="identity") +
facet_wrap(~year) +
ggtitle("Number of Victims Per Month By Year") +
xlab("Month") +
ylab("Victims")
Another type of plot to look at is a scatterplot, where the x-axis is the monthYear
, the y-axis is the number of incidents, nIncidentsMoYr
, and the size of the points are the avgVictimsMoYr
.
DT[, c("nIncidentsMoYr", "nVictimsMoYr", "avgVictimsMoYr") :=
list(.N, sum(nVictims), mean(nVictims)), by = monthYear]
p2 = unique(DT[,.(monthYear, nIncidentsMoYr, avgVictimsMoYr)])
p2[, monthYear := as.Date(paste0("1-",p2$monthYear), format="%d-%m-%Y")]
# Excluded December 2015 as there is only 2 incidents
ggplot(p2[-nrow(p2)], aes(x = monthYear, y = nIncidentsMoYr, size = avgVictimsMoYr)) +
geom_point() + scale_size(range = c(1,13)) +
ggtitle("Number of Victims per Month-Year \n Point size = Avg Victims Per Month-Year")
Now let’s start to take a look at where some of these incidents are occuring. Let’s start off by looking at state-level data. We’ve already calculated the nVictimsPerState
, so here we’re just going to sort and plot the data. Here we plot the top 10 states with the most victims, stacked by year.
top10States = unique(DT[, .(State, nVictimsPerState)])[
order(-nVictimsPerState)][1:10][
,State := factor(State, levels = State[1:10])]
p3 = unique(DT[State %in% top10States$State,
.(year, State, nVictims)][
,nVictByYear := sum(nVictims), by=.(year, State) ][
,nVictims:=NULL][
,State := factor(State, levels = top10States$State)] )
p3[, year := factor(year)]
ggplot(p3, aes(x=factor(State), y=nVictByYear, fill = year)) +
geom_bar(stat="identity") +
ggtitle("Top 10 States with Most Victims") +
xlab("State") +
ylab("Victims")
If we would like, we can also dig deeper into the data by looking at the most victimous locations. I show the top 10 locations with the most victims, however I will not explore it further now.
DT[, c("LocationFreq", "nVictimsLocation") := list(.N, sum(nVictims)), by = Location]
# Find the locations with the largest number of mass shootings
locationFreqDT = unique(DT[,.(State, Location, LocationFreq, nVictimsLocation)],
by = "Location")[order(-nVictimsLocation)][1:10]
locationFreqDT
## State Location LocationFreq nVictimsLocation
## 1: IL Chicago, IL 49 243
## 2: NY New York, NY 29 136
## 3: LA New Orleans, LA 20 123
## 4: MI Detroit, MI 21 110
## 5: DC Washington, DC 14 88
## 6: FL Miami, FL 16 83
## 7: MD Baltimore, MD 16 80
## 8: MO St. Louis, MO 18 79
## 9: PA Philadelphia, PA 15 72
## 10: TX Houston, TX 14 71
We saw above that there seems to be a bit of a trend, so let’s try breaking down the number of total vitims into weeks instead of months. We’ll then put the data in a time series object to see if there is any autocorrelcation.
weeklyIncidents = DT[,.(Date, year, nVictims)][,week := week(Date)][order(Date)]
weeklyIncidents[,weeklyVictims := sum(nVictims), by = c("year", "week")]
weeklyIncidents <- unique(weeklyIncidents[,.(year, week, weeklyVictims)], by = c("year", "week"))
weeklyTS <- ts(weeklyIncidents[,.(weeklyVictims)], start = c(2013,1), freq = 52)
acf(weeklyTS)
We can see that there’s some autocorrelation, it’s not very significant, but it is nonetheless still present.
Now let’s forecast the total number of weekly victims over the next year. To do this, we’ll use the Holt-Winters method. For this method, the extrapolated values are based on the trends in the period which the model was fitted, and would be a sensible prediction assuming the trends continue. The extrapoloation looks fairly appropriate, however unforeseen events could lead to completely different future values than those shown here.
library(knitr)
weeklyHW <- HoltWinters(weeklyTS, seasonal = "multi")
weeklyPredict <- predict(weeklyHW, n.ahead = 52)
ts.plot(weeklyTS, weeklyPredict, lty = 1:2, main = "Holt-Winters fit for weekly shooting victims \n 52 weeks")
data.table(yearWeek = c(paste0("2015-",start(weeklyPredict)[2]:52), paste0("2016-",1:23)),
nVictims_Prediction = weeklyPredict[1:26],
yearWeek = paste0("2016-",c(24:end(weeklyPredict)[2])),
nVictims_Prediction = weeklyPredict[27:52])
## yearWeek nVictims_Prediction yearWeek nVictims_Prediction
## 1: 2015-50 26.55206 2016-24 65.69150
## 2: 2015-51 28.16228 2016-25 44.63000
## 3: 2015-52 50.44726 2016-26 35.89538
## 4: 2016-1 24.87902 2016-27 67.18090
## 5: 2016-2 23.84165 2016-28 41.20263
## 6: 2016-3 34.96725 2016-29 75.54498
## 7: 2016-4 19.89960 2016-30 48.98417
## 8: 2016-5 25.44636 2016-31 47.86119
## 9: 2016-6 22.52320 2016-32 54.49418
## 10: 2016-7 26.79052 2016-33 50.92211
## 11: 2016-8 28.50432 2016-34 56.89886
## 12: 2016-9 20.10224 2016-35 39.31355
## 13: 2016-10 34.55163 2016-36 40.62397
## 14: 2016-11 20.09817 2016-37 49.34495
## 15: 2016-12 26.64073 2016-38 52.65025
## 16: 2016-13 47.24012 2016-39 51.39803
## 17: 2016-14 16.86422 2016-40 37.79078
## 18: 2016-15 38.02083 2016-41 27.80265
## 19: 2016-16 40.00142 2016-42 27.58412
## 20: 2016-17 24.37004 2016-43 42.86138
## 21: 2016-18 31.24427 2016-44 22.41134
## 22: 2016-19 34.31264 2016-45 32.47545
## 23: 2016-20 56.69916 2016-46 32.73432
## 24: 2016-21 36.98326 2016-47 58.35132
## 25: 2016-22 52.07376 2016-48 31.81461
## 26: 2016-23 33.67889 2016-49 40.92262
## yearWeek nVictims_Prediction yearWeek nVictims_Prediction
Let’s now take a look at an actual map. The first thing we’ll do is download all the lat
and lon
coordinates for each location. This will take a few minutes, depending on your internet connection. For convenience, I’ve also uploaded a csv file with all the coordinates to GitHub, so instead of downloading them all now you can simply upload the file.
Once we have the coordinates loaded, we’ll merge them into the main data.table.
uniqueLocations <- unique(DT$Location)
coords <- suppressMessages(geocode(uniqueLocations))
locationCoords = data.table(Location = uniqueLocations,
lon = coords$lon,
lat = coords$lat, key="Location")
# Bring in the lat-lon coordinates
setkey(DT, Location)
setkey(locationCoords, Location)
DT = locationCoords[DT]
Now that we have the coordinates loaded we can plot a map. First let’s plot the average number of victims per incident in each location, avgVictimsPerLocation
. Once we have that we can load the map and plot the data.
DT[, avgVictimsPerLocation := mean(nVictims), by = Location]
getTheMap <- suppressMessages(get_map(location = 'united states',
zoom = 4, source = 'google'))
mapDT <- unique( DT[, .(lon, lat, avgVictimsPerLocation)] )
p4 <- suppressMessages( ggmap(getTheMap) +
geom_point(aes(x=lon, y=lat),
data = mapDT,
size=sqrt(mapDT$avgVictimsPerLocation),
colour="red",
alpha=.5) +
ggtitle("Avg Number of Victims Per Incident & Location"))
print(p4)
We can also hone in on a specific area and find all the incidents within a certain radius of a city. To do this, we’ll need to use a few spatial packages. I’ll focus in on Los Angeles, CA, because that is the largest city near San Bernardino, and I’ll show all the incidents that occur within a 75 miles radius of the city.
First let’s create a function that will help us to select the other cities based on their lat/lon coordinates. The distance calculation below is based on the Great-Circle.
Note: I found the two functions below here: Great-circle distance calculations in R
deg2rad <- function(deg) return(deg*pi/180)
gcd.slc <- function(long1, lat1, long2, lat2) {
long1 <- deg2rad(long1)
lat1 <- deg2rad(lat1)
long2 <- deg2rad(long2)
lat2 <- deg2rad(lat2)
R <- 6371 # Earth mean radius [km]
d <- acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
d <- d*0.621371 #in miles
return(d) # Distance in miles
}
Now we’ll create data.tables for our city of interest, and all other cities, then we’ll use the fuction created above to find all cities within 75 miles of Los Angeles.
cityOfInterest = unique(DT[Location == "Los Angeles, CA",.(Location, lon, lat, LocationFreq, nVictimsLocation)])
otherCities = DT[Location != "Los Angeles, CA",.(Location, lon, lat, LocationFreq, nVictimsLocation)]
dist <- gcd.slc(cityOfInterest$lon, cityOfInterest$lat, otherCities$lon, otherCities$lat)
distIx <- which( dist <= 75 )
metroAreaDT = rbind(unique(otherCities[distIx]), cityOfInterest)
metroAreaDT[,.(Location, LocationFreq, nVictimsLocation)][order(-nVictimsLocation)]
## Location LocationFreq nVictimsLocation
## 1: Los Angeles, CA 14 63
## 2: San Bernardino, CA 4 47
## 3: Long Beach, CA 5 21
## 4: Pomona, CA 3 13
## 5: Moreno Valley, CA 2 11
## 6: Santa Monica, CA 1 10
## 7: Riverside, CA 1 9
## 8: Tustin, CA 1 7
## 9: San Fernando Valley, CA 1 6
## 10: Highland, CA 1 5
## 11: Pasadena, CA 1 5
## 12: Perris, CA 1 5
## 13: Santa Ana, CA 1 5
## 14: Bellflower, CA 1 4
## 15: Compton, CA 1 4
## 16: Fontana, CA 1 4
## 17: Hawthorne, CA 1 4
## 18: Lancaster, CA 1 4
## 19: Mission Viejo, CA 1 4
## 20: Norwalk, CA 1 4
## 21: Wilmington, CA 1 4
## Location LocationFreq nVictimsLocation
Finally, we’re going to convert our data into spatial objects. The package here uses distance in meters, so we’ll use a function to find the maximum meters from the cities we selected above.
Note: The code below was found on GIS StackExchange
cityMap <- gmap("Los Angeles, CA", zoom = 8, scale = 2)
coordinates(cityOfInterest) <- ~ lon + lat
projection(cityOfInterest) <- "+init=epsg:4326"
centralLocation <- spTransform(cityOfInterest, CRS = CRS(projection(cityMap)))
mile2meter <- function(x) {
x * 1609.344
}
distLayer <- rgeos::gBuffer(centralLocation, width = mile2meter(75) )
plot(cityMap)
plot(distLayer, col = alpha("blue", .35), add = TRUE)
# now plot the incident locations
coordinates(metroAreaDT) <- ~ lon + lat
projection(metroAreaDT) <- "+init=epsg:4326"
allIncidents <- spTransform(metroAreaDT, CRS = CRS(projection(cityMap)))
points(allIncidents, cex = 2, pch = 20)