An In-Depth Analysis of Gun Violence in America¶

Will M, Ethan B, Zichao L¶

Part 1: Introduction¶

Gun violence has become a significant problem in America today. We are constantly reminded by news reports and social media that gun violence is a part of our lives - as a result, our lives are being disrupted by this threat. Schools are enforcing shooting drills, products like bulletproof vests are becoming ever more common, and our politics are being divided over what the right thing to do is.

In 2020, gun violence was the most common cause of death among people younger than 19. Between 1968 and 2011, an estimated 1.4 million Americans died from gun violence. The gun-related homicide rate in the United States is 25 times higher than in other developed countries. Because of these statistics, it makes sense that the general public be informed about this issue.

In this tutorial, we will do an in-depth analysis of the history, causes and effects of gun violence. The data we will be using can be found here. The ultimate goal is to understand the factors that contribute the most to gun violence.

Part 2: Data¶

We will start by importing the necessary packages.

In [1]:
import pandas as pd
import numpy as np

The first thing we need to do is to read in our data. This can be done with pandas, and here is the result:

In [2]:
data = pd.read_csv("stage3.csv")
data.head()
Out[2]:
incident_id date state city_or_county address n_killed n_injured incident_url source_url incident_url_fields_missing ... participant_age participant_age_group participant_gender participant_name participant_relationship participant_status participant_type sources state_house_district state_senate_district
0 461105 2013-01-01 Pennsylvania Mckeesport 1506 Versailles Avenue and Coursin Street 0 4 http://www.gunviolencearchive.org/incident/461105 http://www.post-gazette.com/local/south/2013/0... False ... 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female 0::Julian Sims NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:... http://pittsburgh.cbslocal.com/2013/01/01/4-pe... NaN NaN
1 460726 2013-01-01 California Hawthorne 13500 block of Cerise Avenue 1 3 http://www.gunviolencearchive.org/incident/460726 http://www.dailybulletin.com/article/zz/201301... False ... 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male 0::Bernard Gillis NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:... http://losangeles.cbslocal.com/2013/01/01/man-... 62.0 35.0
2 478855 2013-01-01 Ohio Lorain 1776 East 28th Street 1 3 http://www.gunviolencearchive.org/incident/478855 http://chronicle.northcoastnow.com/2013/02/14/... False ... 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male 0::Damien Bell||1::Desmen Noble||2::Herman Sea... NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic... http://www.morningjournal.com/general-news/201... 56.0 13.0
3 478925 2013-01-05 Colorado Aurora 16000 block of East Ithaca Place 4 0 http://www.gunviolencearchive.org/incident/478925 http://www.dailydemocrat.com/20130106/aurora-s... False ... 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male 0::Stacie Philbrook||1::Christopher Ratliffe||... NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su... http://denver.cbslocal.com/2013/01/06/officer-... 40.0 28.0
4 478959 2013-01-07 North Carolina Greensboro 307 Mourning Dove Terrace 2 2 http://www.gunviolencearchive.org/incident/478959 http://www.journalnow.com/news/local/article_d... False ... 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 0::Danielle Imani Jameison||1::Maurice Eugene ... 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su... http://myfox8.com/2013/01/08/update-mother-sho... 62.0 27.0

5 rows × 29 columns

This table is rather big, so we will need to do some cleaning and tidying before we can start our analysis.

Firstly, we won't need all the data in this table. According to the dataset, some of the columns are not required - and thus, may contain NaN values. We don't want this as it will make our analysis more difficult than it needs to be. Out of the 29 columns, only 9 are required. That being said, we don't want to remove all of these unnecessary columns, as some also contain value information we will need. The columns we will be removing are those that are not required and necessary for this analysis.

The following columns will be removed:

  • source_url
  • congressional_district
  • location_description
  • notes
  • participant_name
  • sources
  • state_house_district
  • state_senate_district

Here is the result:

In [3]:
columns_to_remove = [
    "source_url",
    "congressional_district",
    "location_description",
    "notes",
    "participant_name",
    "sources",
    "state_house_district",
    "state_senate_district",
]
data = data.drop(columns=columns_to_remove)
data.head()
Out[3]:
incident_id date state city_or_county address n_killed n_injured incident_url incident_url_fields_missing gun_stolen ... incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
0 461105 2013-01-01 Pennsylvania Mckeesport 1506 Versailles Avenue and Coursin Street 0 4 http://www.gunviolencearchive.org/incident/461105 False NaN ... Shot - Wounded/Injured||Mass Shooting (4+ vict... 40.3467 -79.8559 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:...
1 460726 2013-01-01 California Hawthorne 13500 block of Cerise Avenue 1 3 http://www.gunviolencearchive.org/incident/460726 False NaN ... Shot - Wounded/Injured||Shot - Dead (murder, a... 33.9090 -118.3330 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:...
2 478855 2013-01-01 Ohio Lorain 1776 East 28th Street 1 3 http://www.gunviolencearchive.org/incident/478855 False 0::Unknown||1::Unknown ... Shot - Wounded/Injured||Shot - Dead (murder, a... 41.4455 -82.1377 2.0 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic...
3 478925 2013-01-05 Colorado Aurora 16000 block of East Ithaca Place 4 0 http://www.gunviolencearchive.org/incident/478925 False NaN ... Shot - Dead (murder, accidental, suicide)||Off... 39.6518 -104.8020 NaN 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...
4 478959 2013-01-07 North Carolina Greensboro 307 Mourning Dove Terrace 2 2 http://www.gunviolencearchive.org/incident/478959 False 0::Unknown||1::Unknown ... Shot - Wounded/Injured||Shot - Dead (murder, a... 36.1140 -79.9569 2.0 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...

5 rows × 21 columns

Secondly, we need to remove columns that were well-formed but are either unnecessary or contain sensitive information, like an address. We want this analysis to remain as anonymous as possible, and we want to respect those who were affected by these incidents.

We will handle NaN values on a per-situation basis. Pandas allows us to deal with these situations by offering functions like isnull() which checks if a row of data contains any NaNs. With this, we can continue our analysis without much trouble.

In [4]:
labels = ["address", "incident_url", "incident_url_fields_missing"]
data = data.drop(columns=labels)
data.head()
Out[4]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
0 461105 2013-01-01 Pennsylvania Mckeesport 0 4 NaN NaN Shot - Wounded/Injured||Mass Shooting (4+ vict... 40.3467 -79.8559 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:...
1 460726 2013-01-01 California Hawthorne 1 3 NaN NaN Shot - Wounded/Injured||Shot - Dead (murder, a... 33.9090 -118.3330 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:...
2 478855 2013-01-01 Ohio Lorain 1 3 0::Unknown||1::Unknown 0::Unknown||1::Unknown Shot - Wounded/Injured||Shot - Dead (murder, a... 41.4455 -82.1377 2.0 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic...
3 478925 2013-01-05 Colorado Aurora 4 0 NaN NaN Shot - Dead (murder, accidental, suicide)||Off... 39.6518 -104.8020 NaN 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...
4 478959 2013-01-07 North Carolina Greensboro 2 2 0::Unknown||1::Unknown 0::Handgun||1::Handgun Shot - Wounded/Injured||Shot - Dead (murder, a... 36.1140 -79.9569 2.0 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...

Since 2013 was when data collection it is not exhaustive (as stated in the dataset) so it doesn't give an accurate representation on the year. We decided to remove it due to this.

In [5]:
data = data[data["date"].str.contains("2013") == False]
data.head()
Out[5]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
278 95289 2014-01-01 Michigan Muskegon 0 0 NaN NaN Shots Fired - No Injuries 43.2301 -86.2514 NaN NaN 0::Adult 18+ 0::Female NaN 0::Unharmed 0::Victim
279 92401 2014-01-01 New Jersey Newark 0 0 NaN NaN Officer Involved Incident 40.7417 -74.1695 NaN NaN NaN NaN NaN NaN NaN
280 92383 2014-01-01 New York Queens 1 0 NaN NaN Shot - Dead (murder, accidental, suicide) 40.7034 -73.7474 NaN 0::22||1::26 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Killed||1::Unharmed 0::Victim||1::Subject-Suspect
281 92142 2014-01-01 New York Brooklyn 0 1 NaN NaN Shot - Wounded/Injured 40.6715 -73.9476 NaN 0::34 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Injured 0::Victim||1::Subject-Suspect
282 95261 2014-01-01 Missouri Springfield 0 1 NaN NaN Shot - Wounded/Injured 37.2646 -93.3007 NaN 0::6||1::12 0::Child 0-11||1::Teen 12-17 0::Female NaN 0::Injured||1::Unharmed 0::Victim||1::Subject-Suspect

Now we will convert date to datetime objects so we can use it later.

In [6]:
data["date"] = pd.to_datetime(data["date"])
data.head()
Out[6]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
278 95289 2014-01-01 Michigan Muskegon 0 0 NaN NaN Shots Fired - No Injuries 43.2301 -86.2514 NaN NaN 0::Adult 18+ 0::Female NaN 0::Unharmed 0::Victim
279 92401 2014-01-01 New Jersey Newark 0 0 NaN NaN Officer Involved Incident 40.7417 -74.1695 NaN NaN NaN NaN NaN NaN NaN
280 92383 2014-01-01 New York Queens 1 0 NaN NaN Shot - Dead (murder, accidental, suicide) 40.7034 -73.7474 NaN 0::22||1::26 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Killed||1::Unharmed 0::Victim||1::Subject-Suspect
281 92142 2014-01-01 New York Brooklyn 0 1 NaN NaN Shot - Wounded/Injured 40.6715 -73.9476 NaN 0::34 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Injured 0::Victim||1::Subject-Suspect
282 95261 2014-01-01 Missouri Springfield 0 1 NaN NaN Shot - Wounded/Injured 37.2646 -93.3007 NaN 0::6||1::12 0::Child 0-11||1::Teen 12-17 0::Female NaN 0::Injured||1::Unharmed 0::Victim||1::Subject-Suspect

Now we will create columns for each part of the date.
Here is the final result, and the data we will be using in the rest of the analysis:

In [7]:
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day
data["month_year"] = data["date"].dt.to_period("M")
data.head()
Out[7]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude ... participant_age participant_age_group participant_gender participant_relationship participant_status participant_type year month day month_year
278 95289 2014-01-01 Michigan Muskegon 0 0 NaN NaN Shots Fired - No Injuries 43.2301 ... NaN 0::Adult 18+ 0::Female NaN 0::Unharmed 0::Victim 2014 1 1 2014-01
279 92401 2014-01-01 New Jersey Newark 0 0 NaN NaN Officer Involved Incident 40.7417 ... NaN NaN NaN NaN NaN NaN 2014 1 1 2014-01
280 92383 2014-01-01 New York Queens 1 0 NaN NaN Shot - Dead (murder, accidental, suicide) 40.7034 ... 0::22||1::26 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Killed||1::Unharmed 0::Victim||1::Subject-Suspect 2014 1 1 2014-01
281 92142 2014-01-01 New York Brooklyn 0 1 NaN NaN Shot - Wounded/Injured 40.6715 ... 0::34 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Injured 0::Victim||1::Subject-Suspect 2014 1 1 2014-01
282 95261 2014-01-01 Missouri Springfield 0 1 NaN NaN Shot - Wounded/Injured 37.2646 ... 0::6||1::12 0::Child 0-11||1::Teen 12-17 0::Female NaN 0::Injured||1::Unharmed 0::Victim||1::Subject-Suspect 2014 1 1 2014-01

5 rows × 22 columns

Now that our data has been cleaned up, it's time to explain what we are looking at. This dataset tracked every since recorded incident of gun violence between early 2013 and early 2018 in the United States. It contains all the critical information we need to understand each incident that occurred, such as where and when it happened, who was involved, and what the outcome was. Below is a summary of each column and what it tells us about the incident.

  • date: when the incident occurred
  • state: what state the incident occurred in
  • city_or_county: what city or county the incident occurred in
  • n_killed: how many people were killed in the incident
  • n_injured: how many people were injured in the incident
  • gun_stolen: whether or not the gun/guns used were stolen
  • gun_type: what type of gun/guns were used
  • incident_characteristics: specific details about the incident
  • latitude: geographic latitude of the incident
  • longitude: geographic longitude of the incident
  • n_guns_involved: how many guns involved in the incident
  • participant_age: a breakdown of each participant's age
  • participant_age_group: a breakdown of each participant's age group
  • participant_gender: a breakdown of each participant's gender
  • participant_relationship: a breakdown of each participant's relationship to other participants
  • participant_status: a breakdown of the outcome of each participant
  • participant_type: a breakdown of each participant's role in the incident
  • The extra date columns are just to make later analysis easier.

Part 3 - Analysis¶

Graphs¶

To begin our analysis, we want to get a good understanding of the data.

In [8]:
import matplotlib.pyplot as plt
import seaborn as sns

Distribution of Fatalities in Mass Shootings¶

Firstly, we want to examine the distribution of fatalities in Mass Shootings. A mass shooting is defined as 4 or more fatalities. By doing this, it will help us along in our furthur analysis.

In [9]:
# Collect the data
frequencies = {}
for _, row in data.iterrows():
    if row["n_killed"] not in frequencies:
        frequencies[row["n_killed"]] = 1
    else:
        frequencies[row["n_killed"]] += 1
for i in range(4):
    frequencies.pop(i, None)
In [10]:
# Plot the data
plt.bar(frequencies.keys(), frequencies.values(), width=0.7)
plt.xlim([0, 27])
plt.xlabel("Number of Fatalities")
plt.ylabel("Frequency")
plt.title("Distribution of Fatalities in Mass Shootings")
plt.show()

We can see that frequency is inversely proportional to number of fatalities.

Distribution of Fatalities Normalized¶

Another interesting visual would be a distribution of the number of fatalities. By normalizing their values we can get a better idea of the true distribution. This time we will inlude values less than 4.

In [11]:
# Plotting normalized value counts of fatalities
k_freq = data["n_killed"].value_counts(normalize=True).iloc[0:6]
sns.barplot(x=k_freq.index, y=k_freq.values)
plt.xlabel("Number of fatalities")
plt.ylabel("Percentage")
plt.title("Distribution of Fatalities Normalized")
plt.show()

As clearly seen the majority of gun violence has no fatalities. Only around 23% of incidents had one or more victims killed.

Frequency of Different Gun Types Used in Shootings¶

The goal of this analysis is to see the different gun types used in shootings. This will help us understand if there is a relationship between gun types and mass shootings.

In [12]:
# Collecting the data
gun_types = {"Handgun": 0, "Rifle": 0, "Shotgun": 0}
gun_type_df = data.dropna(subset=["gun_type"])
for _, row in gun_type_df.iterrows():
    gun_types["Handgun"] += row["gun_type"].count("Handgun")
    gun_types["Rifle"] += row["gun_type"].count("Rifle")
    gun_types["Shotgun"] += row["gun_type"].count("Shotgun")
In [13]:
# Plotting the data
plt.bar(gun_types.keys(), gun_types.values())
plt.xlabel("Gun Type")
plt.ylabel("Frequency")
plt.title("Frequency of Different Gun Types Used in Shootings")
plt.show()

Handguns are the most common type of gun used in shootings. Rifles and shotguns are less common.

Male Verses Deaths Involvement in Gun Violence¶

This is the first in-depth analysis of how different factors affect gun violence. We want to see if gender is a significant factor in gun violence. We broke the data into 2 groups - male and female, as well as by age group - adult, teen and child, and looked at their amount of involement(deaths, injuries, etc).

In [14]:
# Create catagories
male_vs_female = {"Child 0-11": [0, 0], "Teen 12-17": [0, 0], "Adult 18+": [0, 0]}
gender_age_df = data.dropna(subset=["participant_gender", "participant_age_group"])

# Collect data
for _, row in gender_age_df.iterrows():
    tokens_gender = row["participant_gender"].split("||")
    tokens_gender = [e[3:] for e in tokens_gender]
    tokens_age_grp = row["participant_age_group"].split("||")
    tokens_age_grp = [e[3:] for e in tokens_age_grp]
    result = list(zip(tokens_gender, tokens_age_grp))
    for pair in result:
        if pair[0] == "Male":
            if pair[1] == "Child 0-11":
                male_vs_female["Child 0-11"][0] += 1
            elif pair[1] == "Teen 12-17":
                male_vs_female["Teen 12-17"][0] += 1
            elif pair[1] == "Adult 18+":
                male_vs_female["Adult 18+"][0] += 1
        elif pair[0] == "Female":
            if pair[1] == "Child 0-11":
                male_vs_female["Child 0-11"][1] += 1
            elif pair[1] == "Teen 12-17":
                male_vs_female["Teen 12-17"][1] += 1
            elif pair[1] == "Adult 18+":
                male_vs_female["Adult 18+"][1] += 1
In [15]:
# Create labels
labels = ["Adult", "Teen", "Child"]
male_data = [
    male_vs_female["Adult 18+"][0],
    male_vs_female["Teen 12-17"][0],
    male_vs_female["Child 0-11"][0],
]
female_data = [
    male_vs_female["Adult 18+"][1],
    male_vs_female["Teen 12-17"][1],
    male_vs_female["Child 0-11"][1],
]

# Plot data
x_axis = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots()
fig.set_figwidth(10)
fig.set_figheight(8)
rects1 = ax.bar(x_axis - width / 2, male_data, width, label="Male")
rects2 = ax.bar(x_axis + width / 2, female_data, width, label="Female")

ax.set_xlabel("Age Group")
ax.set_ylabel("Amount of Involvement")
ax.set_ylim([0, 275000])
ax.set_title("Male Verses Female Involvement in Gun Violence")
ax.set_xticks(x_axis, labels)
ax.legend()
ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)
plt.show()

Gender is a signifigant factor according to this analysis. Adult males especially are disproportionately involved in gun violence compared to other groups.

Mean Age of Participants Between 15 and 75 Verses Lethality¶

We want to see if age is a signifigant factor in gun violence. We took the average age of all participants involved in a shooting and plotted against the measure of lethality.

Lethality is calculated using the following formula:

$ 2* Participants\ Killed + 1.5 * Participants\ Injured $

We then wanted to see if these variables are correlated with each other by fitting a regression line,

In [16]:
def mean_age_of_participants(row):
    ages = {k: 0 for k in range(15, 75)}
    for age in ages.keys():
        count = row.count(str(age))
        ages[age] += count
    lst = []
    for key, value in ages.items():
        if key * value != 0:
            lst.append(key * value)
    sum_of_ages, num_of_ages = float(sum(lst)), float(len(lst))
    if sum_of_ages == 0:
        return "Invalid"
    else:
        return sum_of_ages / num_of_ages


age_df = data.dropna(subset=["participant_age"])
raw, filtered = [], []
for _, row in age_df.iterrows():
    [mean_age, lethality] = mean_age_of_participants(row["participant_age"]), float(
        ((2 * row["n_killed"]) + (1.5 * row["n_injured"]))
    )
    raw.append([mean_age, lethality])
for entry in raw:
    if entry[0] != "Invalid":
        filtered.append(entry)
In [17]:
x_data, y_data = [], []
for entry in filtered:
    if entry[0] < 75 and entry[1] < 100:
        x_data.append(entry[0])
        y_data.append(entry[1])
[slope, intercept] = np.polyfit(x_data, y_data, 1)
plt.figure(figsize=(10, 8))
plt.scatter(x_data, y_data, s=30, edgecolor="black")
plt.xlabel("Mean Age of Participants")
plt.ylabel("Measure of Lethality")
plt.title(
    "Mean Age of Participants Between 15 and 75 Years Old In Shootings Verses Lethality"
)
plt.plot(np.asarray(x_data), slope * np.asarray(x_data) + intercept, color="orange")
plt.show()

From this we can observe that age is not a significant factor of gun violence. The regression line that was fitted has 0 slope and the dots are not correlated with each other.

Maps¶

One good way to visualize this data set is by generating maps. To do this first we get a geojson file containing the relevant information for each state. Then we count all entries by state and add it. This way we can graph both together.

In [18]:
# Getting GeoJson of US states from the folium and saving as geopandas(so we can add GeoJson tooltips)
# Source: https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/us-states.json
import geopandas as gpd

state_geo = gpd.read_file("data/us-states.json")
In [19]:
# Summing up incidents per state
incident_count = data["state"].value_counts().reset_index()
incident_count.columns = ["name", "count"]
# Then merging since folium only does one data source for GeoJson
state_geo_count = state_geo.merge(incident_count, on="name")

Now that we have a valid dataframe we need to create our maps. To do this we will make a choropleth object, which will allow us to highlight each state based on gun violence volume. It uses our geojson to define state borders and the counts for their color.

In [20]:
from folium import Map, Choropleth
from folium.features import GeoJson, GeoJsonTooltip

# Creating map and choropleth
total_shootings_by_state_map = Map(location=[43, -102], zoom_start=4)

Choropleth(
    geo_data=state_geo,
    data=incident_count,
    bins=9,
    columns=["name", "count"],
    key_on="feature.properties.name",
    legend_name="Total shootings in state from 2014-2018",
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.5,
    reset=True,
).add_to(total_shootings_by_state_map)
Out[20]:
<folium.features.Choropleth at 0x1a414bfd0>

So now we have a map object and a choropleth object. To make the map more interactive we can add tool tips. This will essentially allow viewers to hover over states and view whatever we set. So we set some functions to control the colors of highlighting, then we specify what information will be showed. Then we add it to the map object.

In [21]:
# Styling functions and gjson tooltips
style = lambda x: {
    "fillColor": "#ffffff",
    "color": "#000000",
    "fillOpacity": 0.1,
    "weight": 0.1,
}

highlight = lambda x: {
    "fillColor": "#000000",
    "color": "#000000",
    "fillOpacity": 0.30,
    "weight": 0.1,
}

gjson = GeoJson(
    data=state_geo_count,
    style_function=style,
    highlight_function=highlight,
    control=False,
    tooltip=GeoJsonTooltip(
        fields=["name", "count"],
        aliases=["State", "Shootings"],
    ),
)
total_shootings_by_state_map.add_child(gjson)
total_shootings_by_state_map.keep_in_front(gjson)

Finally we display the map. This creates an javascript element storing all the relevant info.

In [22]:
# Showing the map
total_shootings_by_state_map
Out[22]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Next we will make a time based heatmap.
First we must group all latitude/longitude pairs that occurred within each month and make the time index.
This approach is complicated but faster than doing it by for loop for some reason.

In [23]:
# Making heatmap dataframe and parsing it to fit constraints
heatmap_df = data.dropna(subset=["month_year", "latitude", "longitude"])
heat_data = (
    heatmap_df[["month_year", "latitude", "longitude"]]
    .groupby("month_year")
    .apply(lambda row: [list(tup) for tup in zip(row["latitude"], row["longitude"])])
    .tolist()
)
In [24]:
# Getting list of all time values sorted
time_index = list(heatmap_df["month_year"].astype("str").sort_values().unique())

Now we must make our actual map. We specify all the parameters we generated. Then we set a bunch of play back options.

In [25]:
from folium.plugins import HeatMapWithTime

# Making a heatmap object and inputting data
heatmap = Map(location=[43, -102], zoom_start=4)

HeatMapWithTime(
    heat_data,
    index=time_index,
    radius=10,
    auto_play=False,
    speed_step=1,
    min_speed=1,
).add_to(heatmap)
Out[25]:
<folium.plugins.heat_map_withtime.HeatMapWithTime at 0x1adf256c0>

MAP:¶

(Zoom in to see specific areas)

In [26]:
heatmap
Out[26]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Another interesting question is how did the 2016 presidential election effect gun violence?

First let's sum up every month in the dataset. To do this we use the value counts function then reorder it to a dataframe with relevant info, including date objects.

In [27]:
# Getting total count of incidents for every month
cpm = data["month_year"].value_counts().sort_index().to_frame()
cpm.columns = ["count"]
cpm["year"] = cpm.index.year
cpm["month"] = cpm.index.month
# months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
cpm.head()
Out[27]:
count year month
2014-01 4395 2014 1
2014-02 3045 2014 2
2014-03 3669 2014 3
2014-04 3891 2014 4
2014-05 4320 2014 5

Then we can select 2015-2017 using the pandas query function and visualize it to see how the election impacted gun violence.
This gives us the following:

In [28]:
sns.set_theme()
election_df = cpm.query("year >= 2015 and year <= 2017")
sns.relplot(election_df, x="month", y="count", col="year", kind="line")
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x106c94fd0>

We can view this overlaid as well. Using seaborn's hue parameter we can display multiple lines on the same graph like so.

Overlaid:¶

In [29]:
sns.lineplot(election_df, x="month", y="count", hue="year").set(
    title="Total shootings per month over 2015-2017"
)
Out[29]:
[Text(0.5, 1.0, 'Total shootings per month over 2015-2017')]

So it does appear gun violence spiked starting around November 2016. However in the overlay we see that there is a spike around that time every year. Total gun violence seems to be increasing. To confirm this we could plot all available years using the same hue parameter:

In [30]:
sns.lineplot(cpm, x="month", y="count", hue="year").set(
    title="Total gun violence per month over 2014-2018"
)
Out[30]:
[Text(0.5, 1.0, 'Total gun violence per month over 2014-2018')]

We could also average the months over the recorded years. To do this we group by month then take the mean of their counts. After we get the following:

In [31]:
# Getting average incidents per month
mc = cpm.groupby("month")["count"].mean()
sns.lineplot(x=mc.index, y=mc.values)
plt.xlabel("Month")
plt.ylabel("Average volume")
plt.title("Average volume of gun violence per month")
plt.show()

They clearly all follow the same trends. The only difference it appears is each year the volume of gun violence increased. To confirm we could sum up the whole year and check. So using the same groupby technique:

In [32]:
yc = cpm.groupby("year")["count"].sum()
yc
Out[32]:
year
2014    51854
2015    53579
2016    58763
2017    61401
2018    13802
Name: count, dtype: int64

As we can see 2018 only had a few months of recorded data in the set, so we should probably remove it when comparing years:

In [33]:
yc = yc.drop(index=yc.index[-1])
yc
Out[33]:
year
2014    51854
2015    53579
2016    58763
2017    61401
Name: count, dtype: int64

Then we can plot the counts as a barplot to give a good idea the trend.

In [34]:
sns.barplot(x=yc.index, y=yc.values)
plt.xlabel("Year")
plt.ylabel("Total volume of gun violence")
plt.title("Total volume of gun violence per year")
plt.show()

So overall gun violence has been increasing each year.

Part 4: Hypothesis Testing and ML¶

Hypothesis Testing¶

ANOVA¶

We can furthur support our claims made in the previous section with some hypothesis testing. We want to reinforce the claim that gender is a significant factor of gun violence, so we will perform a 1-Way ANOVA on the populations. Below is more information about the test.

Test: 1-Way ANOVA Test

Null Hypothesis: The variations are due to differences within the groups
Alternate Hypothesis: The variations are due to differences between the groups

We will test at a significance level of 0.05.

In [113]:
import scipy.stats as stats
fvalue, pvalue = stats.f_oneway(male_vs_female["Adult 18+"], male_vs_female["Teen 12-17"], male_vs_female["Child 0-11"])
print(f"The test produced the following:\nThe F-Score was {fvalue} and the p-value was {pvalue}")
The test produced the following:
The F-Score was 1.6177923487325807 and the p-value was 0.33370760815857897

Since the p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that the varations are due to differences within the groups, which supports the claim that gender is the factor at play.

T-Test¶

We can also perform a Non-Pooled T-Test to support our claim about the relationship between age and lethality. Below is more information about the test:

Test: Non-Pooled T-Test

Null Hypothesis: The populations means of age and lethality are equal Alternate Hypothesis: The population means of age and lethality are not equal

We will test at a significance level of 0.05.

In [114]:
stats.ttest_ind(a=x_data, b=y_data, equal_var=True)
Out[114]:
Ttest_indResult(statistic=945.8095355323632, pvalue=0.0)

Since our p-value is very small (almost 0), we reject the null hypothesis and conclude that the population means are not equal, and thus, no correlation between age and lethality.

Machine Learning¶

Can we predict the amount of gun violence events based on their date?
We will try to use two types of regression and see.
First we need to get the total count for every day. We use value counts the transform the result into a dataframe.

In [35]:
cpd = data["date"].value_counts().sort_index().to_frame().copy()
cpd.index = pd.to_datetime(cpd.index)

cpd.columns = ["count"]
cpd["year"] = cpd.index.year
cpd["month"] = cpd.index.month
cpd["day"] = cpd.index.day
cpd.head()
Out[35]:
count year month day
2014-01-01 216 2014 1 1
2014-01-02 119 2014 1 2
2014-01-03 124 2014 1 3
2014-01-04 140 2014 1 4
2014-01-05 130 2014 1 5

First we will try linear regression. We are going to try and predict gun violence in 2018, and use the rest of the dates to train. So we get all the years before 2018 using the query. Then we format them into input and output data sklearn will accept. Finally we add the predictions to the dataframe and drop irrelevant columns so we can easily graph it.

In [36]:
# Fit training data with linear regression
from sklearn.linear_model import LinearRegression

train_df = cpd.query("year < 2018").copy()

X_train = train_df.iloc[:, 1:4].values
y_train = train_df.iloc[:, 0].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X_train, y_train)

train_df["pred"] = train_df.apply(
    lambda row: float(reg.predict([[row["year"], row["month"], row["day"]]])), axis=1
)
train_df = train_df.drop(columns=["year", "day", "month"])
# Checking the score
reg.score(X_train, y_train)
Out[36]:
0.18705909256762032

The regression score on the training data is not very promising. To visualize we can graph the dataframe and overlay the predictions.

In [37]:
plt.xticks(rotation=45)
sns.lineplot(data=train_df)
plt.xlabel("Date")
plt.ylabel("Total volume")
plt.title("Total volume of gun violence per day with linear regression")
plt.show()

The first thing is while the line seems to capture the general trend, we would need a more complex line to accurately predict this line.
Any way lets see how it generalizes to 2018. So we do the same thing but with the test data (aka data from 2018). Then output the score.

In [38]:
# Run predictions on 2018 data
test_df = cpd.query("year == 2018").copy()
X_test = test_df.iloc[:, 1:4].values
y_test = test_df.iloc[:, 0].values.reshape(-1, 1)
test_df["pred"] = test_df.apply(
    lambda row: float(reg.predict([[row["year"], row["month"], row["day"]]])), axis=1
)
test_df = test_df.drop(columns=["year", "day", "month"])
reg.score(X_test, y_test)
Out[38]:
-0.8194541412845129

The test scores even worse. This was predictable as it is unseen data. Visualizing this with seaborn shows us:

In [39]:
plt.xticks(rotation=45)
sns.lineplot(data=test_df)
plt.xlabel("Date")
plt.ylabel("Total volume")
plt.title("Total volume of gun violence per day with regression prediction")
plt.show()

It seems gun violence actually dropped off, while the the model predicted it would continue to slowly increase.
Maybe trying a more complicated line will yield better results, so polynomial regression is next. We do the same filter on the years and format it. Then we fit our model and put its predictions back into the dataframe for graphing. Then we drop unneeded columns.

In [40]:
# Polynomial regression training
from sklearn.preprocessing import PolynomialFeatures

train_df = cpd.query("year < 2018").copy()

X_train = train_df.iloc[:, 1:4].values
y_train = train_df.iloc[:, 0].values.reshape(-1, 1)
poly = PolynomialFeatures(5)
poly_X_train = poly.fit_transform(X_train)

clf = LinearRegression()
clf.fit(poly_X_train, y_train)

train_df["pred"] = train_df.apply(
    lambda row: float(
        clf.predict(poly.fit_transform([[row["year"], row["month"], row["day"]]]))
    ),
    axis=1,
)
train_df = train_df.drop(columns=["year", "day", "month"])
clf.score(poly_X_train, y_train)
Out[40]:
0.11193099911558801

This score is similarly bad. Its likely the line is simply to complicated to be fit with any function like this. Visualizing the dataframe with seaborn shows us:

In [41]:
plt.xticks(rotation=45)
sns.lineplot(data=train_df)
plt.xlabel("Date")
plt.ylabel("Total volume")
plt.title("Total volume of gun violence per day with polynomial regression")
plt.show()

It does seem to be following the curve better, it just seems the noise is too much for it.
Lets try to predict 2018. So we get the 2018 data and use the model to predict the out come, then record the results.

In [42]:
# Polynomial regression on 2018 data
test_df = cpd.query("year == 2018").copy()
X_test = test_df.iloc[:, 1:4].values
y_test = test_df.iloc[:, 0].values.reshape(-1, 1)

poly_X_test = poly.fit_transform(X_test)

test_df["pred"] = test_df.apply(
    lambda row: float(
        clf.predict(
            poly.fit_transform(
                [
                    [
                        row["year"],
                        row["month"],
                        row["day"],
                    ]
                ]
            )
        )
    ),
    axis=1,
)
test_df = test_df.drop(columns=["year", "day", "month"])
clf.score(poly_X_test, y_test)
Out[42]:
-1.4608151418296078

Our worst score yet. This is sort of surprising as it seemed to fit better. However polynomial regression is known to not generalize well to unseen data. Visualizing the dataframe with seaborn shows:

In [43]:
plt.xticks(rotation=45)
sns.lineplot(data=test_df)
plt.xlabel("Date")
plt.ylabel("Total volume")
plt.title("Total volume of gun violence per day with polynomial regression prediction")
plt.show()

We can see this does resemble the curve closer, but seems to have random spikes and overall isn't a great fit. Maybe we can try to do linear regression by year since increase seemed consistent. So we will get the count from earlier since it is already filtered and convert it to a dataframe.

In [44]:
yc = yc.to_frame().reset_index()
yc
Out[44]:
year count
0 2014 51854
1 2015 53579
2 2016 58763
3 2017 61401

Now we must fit the data into out regression model. So we format the columns and call fit.

In [45]:
# Fit year count with linear regression
reg = LinearRegression()
X = yc["year"].values.reshape(-1, 1)
y = yc["count"].values.reshape(-1, 1)
reg.fit(X, y)
Out[45]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()

Now to check the score:

In [46]:
reg.score(X, y)
Out[46]:
0.966034042758312

That is much better, it seems to have fit well. To visualize we can add the predictions to the dataframe.

In [47]:
# Add predictions to dataframe
yc["pred"] = yc.apply(
    lambda row: float(reg.predict(row["year"].reshape(-1, 1))), axis=1
)
yc = yc.set_index("year")

Now we can graph it to get the following.

In [48]:
sns.lineplot(data=yc)
plt.xlabel("Year")
plt.ylabel("Total volume")
plt.title("Total volume of gun violence per year with linear regression line")
plt.show()

So it is pretty clear there is a strong positive linear correlation between the increase of gun violence from 2014-2017 and our fit line. This is unlikely to predict future data points as there is no guarantee gun violence will continue increasing, especially not at the same rate.

Part 5: Conclusion¶

As stated in the introduction, gun violence has become a major conflict that has occurred in our everyday lives. Many of us live in large population major cities in the US that leak frequent news about gun violence in the common suburban and rural areas we hear about that include many schools, hoods, and large neighborhoods. Also as mentioned before, we saw how the US has the highest rate in homocides over developed countries across the world, as this makes it a great place to start our research and experiments, thus we did our best to analyze gun violence rates and distributions with respect to the likes of the types of guns, gender, age, and time. The purpose of this project was to discover more about Gun Violence and its causes and use our findings to inform the later generations of how this large destruction of humanity shaped over the years.

We used a dataset originating from the Gun Violence Archive (GVA) downloaded by James Qo found on his Github page: https://github.com/jamesqo/gun-violence-data. Also as mentioned, there was so much data that there needed to be proper preprocessing done to clean up the data. What we discovered was that a majority of NaN values inside the dataset were very much unrequired. However, some of the required data was still necessary for our experiments. We removed 8 total obvious unrequired columns from the dataset, included personalized info columns that had sensitive info to keep things anonymous and used 9 columns for our experiments. The extra columns that were explained in bullets in part 2 were columns we needed that still had important information we needed. One year, 2013 had to be removed because it was non-exhaustive. For the analysis, the Distribution in Fatalities of Mass Shootings histogram showed a right skewed curve and we found that the frequency of the distribution to be inversely proportional to the number of fatalities. That means a lower number of fatalities has a high frequency.

We also analyzed the types of guns that were used in any shootings and were involved in the fatalities, particularly hand guns, rifle gun types, and shotguns. The results turned out that handguns were most frequently used, followed by rifles and shotguns. The latter two types of guns had quite close numbers compared to handguns which had a 25000 frequency, nearly 2.5 times as much as those of rifles and handguns combined.

Then we analyzed the gun violence trends in correspondence to gender. As we looked at both male and female sexes, we split the data into three groups based on age, adults(18+), teens (12-17), and children (under 12). We found that there was a huge trend within the age group but not between different age groups. Also, there was a huge trend we found in the gender involved that showed a huge differentiation in the adult male and female sector, but not so much difference in the other two groups involving teens and children. There were nearly 7 times as many adult males than adult females involved in Gun Violence, with male adults having a frequency of more than 250000. The ratio of adults to any younger age group was tremendous and we concluded that males were significantly more involved than females.

Since the gender data didn’t show much trend between different ages, we decided to go more in depth with this concept and find the relationship between age and lethality, analyze the lethality of damage per mean age. How we got the mean age was that per shooting, many people were involved, so we took the average age of those people involved in that shooting and plotted a scatter plot against each shooting’s lethality measurement, so it’s a direct one to one mapping. Much of the data points were spread across the bottom of the graph so the majority of the lethality either showed pretty much no correlation between the mean age in a shooting, or each shooting wasn’t very lethal, which couldn’t possibly be the case because there were still data points that were considered outliers in the 40-50 year old range, so we went with pretty much no correlation between different ages. This was on track to what we had found before in the gender histogram where there wasn’t much correlation between different age groups.

Mapping was an interesting concept where we visualized gun violence over each state. We also plotted a heap map which showed the changes of gun violence in each state over time. The density of the gun violences within a radius of space in the US can be shown through fuller red variations of shade. We see that much of the eastern half of the US seems to be more impacted by gun-violence. As we move closer to the Western half of the US, it gets less and less dense. It may be more impacted because the population is more dense. Living in cities where there is a larger population seems to be more of a factor in determining where most gun violence occurs.

We also did a fun experiment where we tried to determine a relationship between political factors and gun violence. Over the period of time, it seemed to increase, but the root cause was not because of the 2016 election. The patterns were fairly similar and we couldn’t really find anything that was a beneficial trend or conclusion as the previous few experiments.

As a result it seemed that the two biggest factors and causes of gun violence that we found were the male gender and a larger population towards the eastern half of the US. Also, within an age group, adults were significantly more involved. We also saw many one dimensional graphs such as the types of guns that showed overwhelming differences between handguns vs rifles and shotguns. Overall, this seemed to show why larger cities are more preferable and common for gun violence fatalities to happen. As more gun-violence occurs, other parts of the world may show increases, so future work may include a look into how other regions of the world showed increases in gun-violence, as well as monitoring the increases here in the US.