Digging into NJOAG Law Enforcement Officer Diversity data - Part II - Assembling a combined table

This is Part II of a series exploring the Law Enforcement Officer Diversity dataset released by the NJ OAG. Part I is here.

The target of this post is to assemble a large combined_table with each row representing a municipal police department, containing data from the NJ OAG Use Of Force and Law Enforcement Officer Diversity datasets, plus demographic and geographic data from the US Census. The combined table will look like this:

Table 1: combined_table – All Fields
Field	Description	Year	Source
GEOID	US Census Geographical Identifier		US Census
officer_count	Number of full-time officers	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_mean_age	Mean age of officers	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_r_male	Proportion of officers that are male	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_r_white	Proportion of officers that are White	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_r_black	Proportion of officers that are Black	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_r_asian	Proportion of officers that are Asian	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_r_hispanic_or_latino	Proportion of officers that are Hispanic or Latino	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_r_race_other	Proportion of officers of other races	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_r_two_or_more_races	Proportion of officers of two or more races	2021	NJ OAG Law Enforcement Officer Diversity Dataset
officer_r_race_na	Proportion of officers whose race is not available	2021	NJ OAG Law Enforcement Officer Diversity Dataset
population	Population	2019	US Census estimates
density	Population density	2019	US Census estimates
r_white	Proportion White (P003002 / P001001)	2010	US Census Decennial, SF1
r_black	Proportion Black or African American (P003003 / P001001)	2010	US Census Decennial, SF1
r_native_american	Proportion American Indian or Alaska Native (P003004 / P001001)	2010	US Census Decennial, SF1
r_asian	Proportion Asian (P003005 / P001001)	2010	US Census Decennial, SF1
r_pacific_islander	Proportion Native Hawaiian and Other Pacific Islander (P003006 / P001001)	2010	US Census Decennial, SF1
r_other_races	Proportion other races (P003007 / P001001)	2010	US Census Decennial, SF1
r_two_or_more_races	Proportion two or more races (P003008 / P001001)	2010	US Census Decennial, SF1
r_non_hispanic_or_latino	Proportion Non-Hispanic or Latino (P004002 / P001001)	2010	US Census Decennial, SF1
r_hispanic_or_latino	Proportion Hispanic or Latino (P004003 / P001001)	2010	US Census Decennial, SF1
household_median_income	Household median income (B19013_001)	2015-2019	US Census ACS
shore_town	`TRUE` if municipality shares a border with the Atlantic Ocean	2020	US Census
incident_count	Number of use of force incidents	2021	NJ OAG Use of Force Dataset
partial year	Estimated portion of year that use of force incidents were reported	2021	NJ OAG Use of Force Dataset
incident_rate_est	Estimate of yearly rate of incidents of use of force	2021	NJ OAG Use of Force Dataset

Packages used in this post

In this post, I’ll be using the following packages.

Package	Title	Version
njoagleod	NJ OAG Law Enforcement Officer Diversity data package	1.1.3
njoaguof	NJ OAG Use of Force data package	1.4.0
tidyverse	Easily Install and Load the ‘Tidyverse’	1.3.1
lubridate	Make Dealing with Dates a Little Easier	1.8.0
scales	Scale Functions for Visualization	1.2.0
tidycensus	Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames	1.2.2
tigris	Load Census TIGER/Line Shapefiles	1.6.1
sf	Simple Features for R	1.0-7

This is not a tutorial for using these packages, but all the R code used to generate the plots and tables will be included.

The OAG packages can be installed from github:

install_github("tor-gu/njoaguof")
install_github("tor-gu/njoagleod")

Gathering the data

We are going to gather and combine data from several sources:

2021 municipal agency officer count and diversity data from njoagleod
2021 use of force incident counts from njoaguof
2019 population estimates from the US Census
Race and ethnicity data from the 2010 US Census
Household median income from the 2015-2019 ACS
A “jersey shore” flag, derived from shapefiles from the US Census (see Part I)

Municipal agency diversity data

We start by getting a summary of the officer diversity data in njoagleod.

library(tidyverse)
library(njoagleod)
officer_info <- officer %>% group_by(agency_name, agency_county) %>%
  summarize(
    officer_count = n(),
    officer_r_male = sum(officer_gender == "Male", na.rm = TRUE) / n(),
    officer_mean_age = mean(officer_age, na.rm = TRUE),
    officer_r_race_na = sum(is.na(officer_race) / n()),
    officer_r_white = sum(officer_race == "White", na.rm = TRUE) / n(),
    officer_r_black = sum(officer_race == "Black", na.rm = TRUE) / n(),
    officer_r_asian = sum(officer_race == "Asian", na.rm = TRUE) / n(),
    officer_r_hispanic_or_latino =
      sum(officer_race == "Hispanic", na.rm = TRUE) / n(),
    officer_r_race_other = 1 - sum(c_across(
      officer_r_race_na:officer_r_hispanic_or_latino
    )),
    .groups = "drop"
  ) %>%
  left_join(agency, by = c("agency_name", "agency_county")) %>%
  filter(!is.na(municipality)) %>%
  left_join(municipality, by = c("agency_county"="county", "municipality")) %>%
  select(
    GEOID,
    officer_count,
    officer_mean_age,
    officer_r_male,
    officer_r_white:officer_r_hispanic_or_latino,
    officer_r_race_other,
    officer_r_race_na
  )

Table 2: head(officer_info) (first several columns)
GEOID	officer_count	officer_mean_age	officer_r_male	officer_r_white	officer_r_black
3402500070	41	37	0.95	0.88	0.02
3400100100	29	38	0.93	0.93	0.03
3400300700	15	43	0.87	1.00	0.00
3402500730	9	40	1.00	1.00	0.00
3402500760	6	35	1.00	1.00	0.00
3400301090	12	42	0.75	0.92	0.08

Note that the municipalites are identified here by US Census GEOID. We will repeat this pattern for each of the tables, joining them together at the end using the GEOID as the key.

Jersey Shore flag

We create a table with a shore_town flag, as in Part I. The flag is TRUE if the municipality has a border with the Atlantic Ocean.

library(tigris)
library(sf)
options(tigris_use_cache = TRUE)

counties <- municipality %>% pull(county) %>% unique() %>% sort()

# Get the map from tigris
nj_municipality_map <- counties %>%
  map_df(~ county_subdivisions("NJ", county = ., class = "sf"))

# Get the towns that border the Atlantic Ocean
jersey_shore_water <- nj_municipality_map %>%
  filter(ALAND == 0) %>%
  filter(GEOID != "3401100000")
jersey_shore_indices <- st_intersects(jersey_shore_water,
                                      nj_municipality_map) %>%
  unlist() %>% unique()

# Now build a table of shore towns
municipality_shore <- nj_municipality_map %>%
  as_tibble() %>%
  mutate(shore_town = FALSE)
municipality_shore$shore_town[jersey_shore_indices] = TRUE
municipality_shore <- municipality_shore %>%
  filter(ALAND != 0) %>%
  select(GEOID, shore_town)

Table 3: head(municipality_shore)
GEOID	shore_town
3400107810	TRUE
3400120290	FALSE
3400129430	FALSE
3400149410	FALSE
3400160600	FALSE
3400168430	FALSE

Population estimates and race and ethnicity data

Next we gather the race and ethnicity data from the US Census. We are using the tidycensus package to get the data from the 2010 decennial census.

library(tidycensus)
# Get race/enthnicity ratios (2010 / get_decennial)
municipality_race_ethnicity_ratios <- counties %>%
  map_df(
    ~ get_decennial(
      geography = "county subdivision",
      state = "NJ",
      county = .,
      year = 2010,
      output = "wide",
      variables = c(
        population = "P001001",
        white = "P003002",
        black = "P003003",
        native_american = "P003004",
        asian = "P003005",
        pacific_islander = "P003006",
        other_races = "P003007",
        two_or_more_races = "P003008",
        non_hispanic_or_latino = "P004002",
        hispanic_or_latino = "P004003"
      )
    )
  ) %>%
  filter(population != 0) %>%
  # Convert totals to ratios
  mutate(across(white:hispanic_or_latino, ~ .x / population)) %>%
  rename_with( ~ paste0("r_", .), white:hispanic_or_latino) %>%
  select(GEOID, r_white:r_hispanic_or_latino)

Table 4: head(municipality_race_ethnicity_ratios) (first several columns)
GEOID	r_white	r_black	r_native_american	r_asian	r_other_races
3400100100	0.76	0.10	0.00	0.08	0.03
3400102080	0.27	0.38	0.01	0.16	0.14
3400120290	0.70	0.10	0.00	0.12	0.05
3400108680	0.73	0.09	0.01	0.02	0.12
3400108710	0.78	0.13	0.00	0.01	0.04
3400115160	0.98	0.00	0.00	0.01	0.01

Next we gather the the population and population density from the 2019 estimates, again using the tidycensus package.

# Get population and density  (2019 / get_estimates)
municipality_population <- counties %>%
  map_df(
    ~ get_estimates(
      geography = "county subdivision",
      state = "NJ",
      county = .,
      year = 2019,
      output = "wide",
      product = "population"
    )
  ) %>%
  select(GEOID, population = POP, density = DENSITY)

Table 5: head(municipality_population)
GEOID	population	density
3400108680	4284	566
3400120350	4052	373
3400129280	25746	232
3400149410	5856	104
3400100100	8818	1613
3400102080	37743	3508

Household median income

Next, we get figures for the household median income from the 2015-2019 ACS.

municipality_income <- counties %>%
  map_df(
    ~ get_acs(
      geography = "county subdivision",
      state = "NJ",
      county = .,
      year = 2019,
      output = "wide",
      variables = c(household_median_income = "B19013_001")
    )
  ) %>%
  select(GEOID, household_median_income = household_median_incomeE) %>%
  filter(!is.na(household_median_income))

Table 6: head(municipality_income)
GEOID	household_median_income
3400100100	70393
3400160600	112917
3400115160	82500
3400149410	82663
3400168430	61366
3400143890	83045

Use of force incident counts

Finally, we gather 2021 counts of use of force incidents from the njoaguof package. This task is somewhat complicated by the fact that we (probably) do not have the full year of data from every municipality. The earliest incidents in the dataset are from October 2020, but many individual agencies do not have incidents until 2021. For example, here are all the incidents reported by the Watchung PD, from October 2020 through May 2022.

library(njoaguof)
incident %>%
  filter(
      agency_county == "Somerset County" &
      agency_name == "Watchung PD"
  ) %>% 
  ggplot() + geom_histogram(aes(x=incident_date_1), binwidth=30) +
  scale_x_date(limits=lubridate::ymd(c("2020-10-01","2022-05-31"))) +
  xlab("Date") + ylab("Number of incidents") +
  labs(title=str_wrap("Watchung PD use of force incidents"),
       subtitle = str_wrap("Oct 2020 - May 2022"),
       caption = "Source: NJ OAG")

One suspects that there may have been use of force incidents between October 2020 and July 2021 that went unreported by the Watchung PD, simply because they had not yet started reporting such incidents at the time. It is not clear the earliest date at which the Watchung PD would have reported a use of force incident, but it may have been well into 2021.

For this reason, we will attempt to estimate when each department started reporting use of force incidents. We will make the simplifying assumption that for each department, the incidents represent a homogeneous Poisson process with an unknown starting point and continuing through the end of 2021. ¹ For agencies which reported their first incident in 2021, we estimate the rate \(\lambda\) of incidents by looking at the period after the first incident to the end of the year: \[\lambda_{est}=n/d\] where \(n\) is the number of incidents after the first one, and \(d\) is the number of days from the first incident to the end of 2021. For a Poisson process with rate \(\lambda\), the expected value of the time until the first incident is \(e^{-\lambda}/(1-e^{-\lambda})\). Thus we estimate the start of the reporting period \(t_0\) as \[{t_0}_{est} = t_1 - e^{-\lambda_{est}}/(1-e^{-\lambda_{est}})\] where \(t_1\) is the date of the first incident. ² Note that in order to use this estimate, there must be at least two incidents in 2021.

Let us now calculate \(\lambda_{est}\) and \({t_0}_{est}\) and the estimated proportion of the year 2021 this represents for every municipal agency who reported their first incident in 2021:

initial_incident_2021_estimates <- incident %>%
  # Find the first incident and the number of incidents after
  group_by(agency_county, agency_name) %>%
  summarize(
    first_incident = min(incident_date_1),
    incidents_after_first = sum(lubridate::year(incident_date_1) == 2021) - 1,
    .groups = "drop"
  ) %>%
  filter(lubridate::year(first_incident) == 2021) %>%
  mutate(
    # Estimate lambda
    lambda_est_period = lubridate::ymd("2022-01-01") - first_incident,
    lambda_est = incidents_after_first / as.double(lambda_est_period),
    # Estimate t_0
    t_0_est = case_when(
      incidents_after_first == 0 ~ as.Date(NA),
      TRUE ~ first_incident - exp(-lambda_est) / (1 - exp(-lambda_est)),
    ),
    # Estimate the partial year
    reporting_days_est = lubridate::ymd("2022-01-01") -
      pmax(t_0_est, lubridate::ymd("2021-01-01")),
    partial_year = as.double(reporting_days_est) /
      as.double(lubridate::ymd("2022-01-01") - lubridate::ymd("2021-01-01"))
  ) %>%
  # Filter out non-municipal agencies
  left_join(agency, by = c("agency_county", "agency_name")) %>%
  filter(!is.na(municipality)) %>%
  select(agency_county, municipality, lambda_est, t_0_est, partial_year)

Table 7: head(initial_incident_2021_estimates)
agency_county	municipality	lambda_est	t_0_est	partial_year
Atlantic County	Linwood city	0.02	2021-02-11	0.89
Atlantic County	Longport borough	0.03	2021-10-15	0.21
Atlantic County	Margate City city	0.02	2021-04-06	0.74
Atlantic County	Mullica township	0.03	2021-01-06	0.98
Bergen County	Allendale borough	0.01	2020-12-20	1.00
Bergen County	Carlstadt borough	0.02	2020-12-01	1.00

There are 117 municipal agencies that reported their first incident in 2021. Of these, only 13 reported a single incident in 2021; we cannot estimate the the start of the reporting period for these. Of the remainder, we reckon that we have a full year of reporting for 49 and a partial year for 55:

initial_incident_2021_estimates %>% count(partial_year< 1)

## # A tibble: 3 × 2
##   `partial_year < 1`     n
##   <lgl>              <int>
## 1 FALSE                 49
## 2 TRUE                  55
## 3 NA                    13

Now we can build a table of all municipal agencies with an incident in either 2020 or 2021 and estimate the yearly rate of incidents for 2021.

For municipalities with an incident in 2020, this rate is exactly the number of incidents in 2021.
For municipalities that reported their first incident in 2021, the estimated rate is the number of incidents reported 2021 divided by partial_year.

incident_rate_2021 <- incident %>%
  # Get the 2021 incident count
  filter(lubridate::year(incident_date_1) <= 2021) %>%
  group_by(agency_county, agency_name) %>%
  summarize(incident_count = sum(lubridate::year(incident_date_1) == 2021),
            .groups = "drop") %>%
  # Filter out the non-municipal agencies
  left_join(agency, by = c("agency_county", "agency_name")) %>%
  filter(!is.na(municipality)) %>%
  select(agency_county, municipality, incident_count) %>%
  # Add a partial_year field, initialed to 1.0, and then update
  # with the initial_incident_2021_estimates
  mutate(partial_year = 1.0) %>%
  rows_update(
    select(
      initial_incident_2021_estimates,
      agency_county,
      municipality,
      partial_year
    ),
    by = c("agency_county", "municipality")
  ) %>%
  # Estimate the 2021 yearly rate
  mutate(incident_rate_est = incident_count / partial_year)

Finally, let’s switch to using the GEOID as the primary key.

municipality_incident_rate <- incident_rate_2021 %>%
  left_join(municipality, 
            by = c("agency_county" = "county", "municipality")) %>%
  select(GEOID, incident_count, partial_year, incident_rate_est)

Table 8: head(municipality_incident_rate)
GEOID	incident_count	partial_year	incident_rate_est
3400100100	8	1	8
3400102080	119	1	119
3400107810	26	1	26
3400120350	19	1	19
3400120290	63	1	63
3400125560	83	1	83

Combining the tables

Before combining the tables, we have to take into account the fact that the Township of Princeton and Princeton Borough merged in 2013.

Princeton Borough and Princeton Township

The former Princeton Township is present in the municipality_race_ethnicity_ratios table, since it is derived from the 2010 census, but absent from the other tables. We will handle this by updating municipality_race_ethnicity_ratios, combining the two Princetons, weighted by their 2010 populations.

# Combine the two Princetons.
#   Township:  Population 16265,  GEOID 3402160915
#   Borough:   Population 12307   GEOID 3402160900
combined_princeton <- municipality_race_ethnicity_ratios %>%
  filter(GEOID %in% c("3402160900", "3402160915")) %>%
  arrange(GEOID) %>%
  mutate(weight = c(12307, 16265)) %>%
  summarize(across(r_white:r_hispanic_or_latino,
                   ~ sum(.x * weight) / sum(weight))) %>%
  mutate(GEOID = "3402160900") %>%
  relocate(GEOID)

# Update the table with new ratios for Princeton Borough
municipality_race_ethnicity_ratios <- municipality_race_ethnicity_ratios %>% 
  rows_update(combined_princeton, by="GEOID")

US Census and Officer Diversity Race And Ethnicity Differences

Before combining the data, we should note here that the US Census and NJ OAG treat Hispanic and Latino origin differently. In the NJ OAG Law Enforcement Officer Diversity dataset, Hispanic and Latino origin is considered a race, so for officers we will have \[ r_{male} + r_{white} + \cdots + r_{non\_hispanic\_or\_latino} + \cdots + r_{two\_or\_more\_races} + r_{race\_na} = 1 \]

On the other hand, the US Census treats Hispanic and Latino origin as an ethnicity independent of race. Thus, for the general population, we will have \[ r_{white} + r_{black} + \cdots + r_{two\_or\_more\_races} = 1 \] and

\[ r_{non\_hispanic\_or\_latino} + r_{hispanic\_or\_latino} = 1 \]

The combined table

Now we can build the combined table.

combined_table <- officer_info %>% 
  inner_join(municipality_population, by="GEOID") %>%
  inner_join(municipality_race_ethnicity_ratios, by="GEOID") %>%
  inner_join(municipality_income, by="GEOID") %>%
  inner_join(municipality_shore, by="GEOID") %>%
  inner_join(municipality_incident_rate, by="GEOID")

This table has 407 rows, of which 395 have an estimate of the incident rate. These 395 rows cover 90% of the population of New Jersey.

sum(combined_table[!is.na(combined_table$incident_rate_est),]$population) /
  sum(municipality_population$population)

## [1] 0.9

Use of force incidents by population and officer count

Unsurprisingly, the incident rate estimate is strongly correlated with both the population and the officer count.

combined_table %>%
  select(incident_rate_est,
         officer_count,
         population,
         partial_year,
         shore_town) %>%
  pivot_longer(officer_count:population) %>%
  ggplot(aes(x = incident_rate_est, y = value)) +
  geom_point(aes(alpha = partial_year, color = shore_town)) +
  geom_smooth(aes(weight = partial_year), method = "lm", se = FALSE) +
  scale_x_log10() + scale_y_log10(labels=scales::comma) +
  facet_wrap(facets = "name", scales = "free") +
  labs(
    title = str_wrap("2021 use of force incidents vs population and officer count"),
    caption = "Source: US Census, NJ OAG"
  ) +
  theme(legend.position = "bottom",
        axis.title.y=element_blank())

However, in a joint linear model, the officer_count is far more significant than population.

lm(incident_rate_est ~ officer_count + population, 
   combined_table , weights=partial_year) %>% 
  summary() %>% 
  magrittr::use_series("coefficients")

##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -4.18760    1.66373    -2.5  1.2e-02
## officer_count  0.92022    0.04508    20.4  1.3e-63
## population    -0.00034    0.00014    -2.5  1.3e-02

As we observed in Part I, many of the towns on the Jersey Shore (blue dots in the chart above) have much larger summer populations (and correspondingly larger police departments) not reflected in the US Census population. If we exclude the shore towns, the population becomes more significant in the joint model, though still less so than the officer_count.

combined_table_core <- combined_table %>% filter(!shore_town)
lm(incident_rate_est ~ officer_count + population, 
   combined_table_core , weights=partial_year) %>% 
  summary() %>% 
  magrittr::use_series("coefficients")

##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -4.45728    1.70075    -2.6  9.2e-03
## officer_count  0.98138    0.04755    20.6  1.4e-62
## population    -0.00052    0.00015    -3.6  3.6e-04

With this in mind, we will begin to focus on the per-officer incident rate, and try to see how variations in this rate between municipalities are explained by the other variables in our combined table.

In the next post, we will use the combined_table generated in this post and begin exploring the relationship between these variables and number of use of force incidents per officer.

This is certainly a simplifying assumption, since we would expect the frequency of incidents to vary seasonally, especially for shore towns. It is also possible that some departments stopped reporting incidents some time before the end of 2021, or reported incidents sporadically. Additionally, the force incidents frequently appear in bunches because multiple officers or multiple subjects may be involved in a single event. We will not attempt account for these factors.↩︎
Note that \[1/\lambda < e^{-\lambda}/(1-e^{-\lambda}) < 1/\lambda + 1/2\] when \(\lambda > 0\), so the naive estimate \({t_0}_{est} = t_1 - d/n\) would work nearly as well.↩︎