City: |
Washington D.C., United States |
Organization: |
The Lab @ D.C. |
Project Start Date: |
10 October 2017 |
Project End Date: |
20 November 2017 |
Reference: |
Casey, P., Wilson, K., & Yokum, D. (2018). A Cautionary Tail: A Framework for Testing Predictive Model Validity. |
Problem: |
As of 2018, Washington D.C. conducted rodent abatement inspections only when it received a 311 complaint from the general public, and it responded to requests in the order they were received. This practice presented two primary problems that might be resolved using data analytics. First, only 46% of 311-initiated inspections found evidence of significant rodent activity, while the city faced a growing workload: the number of 311 rodent complaints had recently more than doubled, rising from 2,123 in fiscal year 2015 to 5,015 in fiscal year 2017. The city needed to prioritize 311 responses in the areas most likely to show infestations. Second, bias in a 311-reporting system likely left gaps in the city's rodent response coverage; the city believed this bias had two primary sources.
To compensate for this bias, the city wanted to develop a model that predicts which areas are likely to have infestations regardless of whether a 311 complaint was received. |
Technical Solution: |
Supervised class probability estimation: The city used a Random Forest model to estimate, for each census block in the city, the probability that a rat infestation would be found in the next three months. A Random Forest aggregates a large number of decision trees, taking the mode of their classifications (or the average of their class probabilities) as its output.
Model Development and Selection: To avoid bias, the city developed model features based on environmental causes of rodent infestation. Using these environmental factors, the city created three different models: 1. Logistic Regression, 2. Gradient Boosting, 3. Random Forest. Each model was tested by predicting the outcomes of historic 311 inspections from August 2016 to August 2017. For each month in that period, the models predicted the 100 census blocks most likely to have an infestation during the following three months; 100 was chosen because it was the number of additional inspections the city had capacity to conduct. Random Forest was selected as the highest-performing model: on average, 74% of the census blocks it selected had infestations.
Model Training and Implementation: The team then trained the Random Forest model on every census block with a 311 request between August 2015 and August 2017, and used the trained model to estimate the probability that each census block in the city would have a rat infestation within the next three months. The 100 highest-probability census blocks were to be inspected each month regardless of whether a 311 complaint had been received. The probabilities were also used to prioritize competing 311 requests. The sketch below illustrates the overall workflow. |
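This pipeline amounts to standard supervised class probability estimation. Below is a minimal sketch of that workflow in Python with scikit-learn; the file paths, column names, feature list, and hyperparameters are hypothetical placeholders for illustration, not the city's actual schema or code.

```python
# Minimal sketch of the city's described workflow (hypothetical schema).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training table: one row per (census block, month) with
# environmental features and a binary label marking whether an inspection
# in the following three months found significant rodent activity.
train = pd.read_csv("blocks_train.csv")          # hypothetical path
features = ["alley_area", "construction_permits", "building_age",
            "park_area", "sewer_density", "soil_type_code", "zoning_code"]

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(train[features], train["infestation_found"])

# Score every census block in the city: probability of an infestation
# being found within the next three months.
blocks = pd.read_csv("blocks_current.csv")       # hypothetical path
blocks["p_infestation"] = model.predict_proba(blocks[features])[:, 1]

# The 100 highest-probability blocks get proactive inspections,
# matching the city's stated monthly spare inspection capacity.
top_100 = blocks.nlargest(100, "p_infestation")
print(top_100[["census_block", "p_infestation"]])
```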
Datasets Used: |
Dataset 1: Alleys, Open Data D.C., 2017
Dataset 2: Census Block Groups 2010, Open Data D.C., 2017
Dataset 3: Construction Permits, Open Data D.C., 2017
Dataset 4: Historic Data on D.C. Buildings, Open Data D.C., 2017
Dataset 5: Parks and Recreation Areas, Open Data D.C., 2017
Dataset 6: Rodent Inspection and Treatment, Open Data D.C., 2017
Dataset 7: Sewer and Manhole Covers, Open Data D.C., 2017
Dataset 8: Soil Type, Open Data D.C., 2017
Dataset 9: Zoning Map of D.C., Open Data D.C., 2017 |
Outcome: |
Performance against new 311 calls: The model performed well at predicting the outcomes of new 311 responses, although it slightly underpredicted rodent activity in densely populated wards and slightly overpredicted activity in less densely populated areas.
Performance against Field Assessment Inspections: The city also tested 100 non-311 locations with predicted probabilities between .5 and .9. The model did not perform well at predicting the outcomes of these inspections: inspectors found rodents in 48% of census blocks with predicted probabilities between .5 and .6, and in 46% of census blocks with predicted probabilities between .8 and .9, even though a well-calibrated model's hit rates should track the predicted probabilities (see the sketch below). |
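The field-assessment failure is essentially a calibration problem: predicted probabilities did not track observed hit rates. A minimal sketch of that check, assuming a results table with hypothetical column names:

```python
# Hedged sketch of the calibration check implied by the field test:
# bin blocks by predicted probability, then compare each bin's observed
# inspection hit rate to its predicted range. Columns are hypothetical.
import pandas as pd

results = pd.read_csv("field_inspections.csv")   # hypothetical path
bins = [0.5, 0.6, 0.7, 0.8, 0.9]
results["bin"] = pd.cut(results["p_infestation"], bins)

# A well-calibrated model's observed hit rate should rise with the bin;
# the city instead saw roughly 46-48% across the range.
print(results.groupby("bin", observed=True)["infestation_found"].mean())
```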
Issues that arose: |
Training Data: The city only had 311-dependent data to train the model with, which likely introduced bias into the decision tree outcomes.
Feature Weight: The city relied on urban rodentology research to develop the model features. However, there was significant ambiguity about how to weight the various environmental variables in the model, which made the model more susceptible to bias in testing (see the sketch below). |
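One way to probe that weighting ambiguity, which the write-up does not say the city used: a fitted Random Forest reports impurity-based feature importances that can sanity-check the rodentology-derived features. A brief illustrative sketch, reusing `model` and `features` from the earlier example:

```python
# Inspect learned feature importances rather than assigning weights by
# hand. Illustrative only; not the city's actual procedure.
import pandas as pd

importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```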
Status: |
Terminated |
Entered by: |
Matthew Pitlock, matt.pitlock@mail.utoronto.ca |
CEM1002, Civil Engineering, University of Toronto
Contact: msf@eil.utoronto.ca