Analyzing Traffic Fatalities by Vehicle Weight and Size with Tidyverse


I bought a car recently. I’m cheap and practical and I’m going to be a father soon, so the only things I really cared about were:

  • will the car keep my wife and child safe if they’re in an accident
  • will the car keep running reliably
  • how do I get 👆 those two things on the cheap

I’ve also been getting into machine learning and data science recently, so I thought it’d be useful to answer them using some data science tools. This post sums up what I learned about vehicle safety based on the Insurance Institute for Highway Safety’s data on traffic fatalities by make and model.

The tldr: drivers are less likely to die in a traffic accident if the vehicle they drive is heavier and larger. However, interestingly, there does seem to be a point of diminishing returns where additional weight and size doesn’t seem to make as much of a difference in whether drivers die in accidents. In fact, there may actually be a point of negative returns where additional weight and size is associated with more driver deaths. We’ll be able to see all of this with a few ggplot2 plots.

(“seem” and “may” are emphasized because the analysis I’ve done here merely suggests the above conclusions. A more complete sample of IIHS’s data would increase confidence in these conclusions.)

Data prep

The IIHS’s report only groups by fatalities by “vehicle size” and “vehicle type” (along with make and model). Becuase I didn’t know whether the “vehicle size” factors generalized across car types, I just googled the weights and sizes of the vehicles in the report.1

The data looks like this:

fatalities_by_weight_and_size <- read_csv('./vehicle-fatalities.csv', 
                                          locale = locale(grouping_mark = ",")
rand_rows <- sample(nrow(fatalities_by_weight_and_size))
## # A tibble: 6 x 6
##   `Make and Model`                         Class Size   Weight Volume Fatalities
##   <chr>                                    <chr> <chr>   <dbl>  <dbl>      <dbl>
## 1 Kia Forte                                car   small    2804 705600         89
## 2 Lexus NX 200t 4WD Luxury                 SUV   Midsi…   4050 875420          0
## 3 Chevrolet Trax 2WD                       suv   small    3124 764400         73
## 4 Acura RDX 2WD Luxury                     SUV   Midsi…   3946 885040          4
## 5 Chevrolet Sonic                          car   small    2733 641920         98
## 6 Land Rover Range Rover Evoque 4WD Luxury SUV   Small    3615 825600          0

We don’t ultimately care about the size and class columns, but let’s quickly normalize them to all be lower-case, just in case we’ll care later:

to_lower_factor <- function(f) {
fatalities_by_weight_and_size <- fatalities_by_weight_and_size %>%
  mutate(Class = to_lower_factor(Class), Size = to_lower_factor(Size))
## # A tibble: 6 x 6
##   `Make and Model`                     Class Size       Weight Volume Fatalities
##   <chr>                                <fct> <fct>       <dbl>  <dbl>      <dbl>
## 1 GMC Yukon XL 1500 4WD                suv   very large   5800 1.34e6          0
## 2 Infiniti QX60 2WD Luxury             suv   midsize      4383 1.05e6          0
## 3 Land Rover Range Rover Evoque 4WD L… suv   small        3615 8.26e5          0
## 4 Lexus NX 200t 4WD Luxury             suv   midsize      4050 8.75e5          0
## 5 Mercedes-Benz C-Class sedan 4WD Lux… car   midsize      3594 7.49e5          0
## 6 Porsche Cayenne 4WD Luxury           suv   large        4488 1.01e6          0


Now we can easily graph fatality by weight:

ggplot(fatalities_by_weight_and_size, aes(x=Weight, y=Fatalities)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Weight's vs fatalities per 100 million miles driven")

We can also graph fatality by car size:

ggplot(fatalities_by_weight_and_size, aes(x=Volume, y=Fatalities)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Volume vs fatalities per 100 million miles driven")


Like I said at the outset, the analysis here is only suggestive and needs more work. Still, if we accept the conclusion suggested by this analysis, there’s an interesting puzzle here: Why would fatalities go up for very large and very heavy cars? One possible explanation for the data is that people who tend to drive very large vehicles tend to drive less safely because they think they cannot be harmed.

  1. Actually, what I did was focus on the best and worst vehicles from the report, which I realize is problematic for making inferences about all vehicles.↩︎

statisticsdata science

Should we refactor files we're working on?