Predictive Modeling For Public Health: Preventing Childhood Lead Poisoning

City:	Chicago, Illinois, United States
Organization:	The University of Chicago (Data Science for Social Good), Chicago Department of Public Health
Project Start Date:	2014
Project End Date:	Ongoing
Reference:	Potash, E., Brew, J., Loewi, A., Majumdar, S., Reece, A., Walsh, J., Rozier, E., Jorgensen, E., Mansour, R., Ghani, R. (2015). Predictive Modeling for Public Health: Preventing Childhood Lead Poisoning. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 2039-2047. https://dssg.uchicago.edu/wp-content/uploads/2016/01/p2039-potash.pdf http://dsapp.uchicago.edu/research-areas/health/lead-prevention/
Problem:	Use data to predict the likelihood that a child will suffer lead poisoning, shifting the response from a reactive approach to a predictive one and reducing the burden on the Chicago Department of Public Health (CDPH) and its limited resources.
Technical Solution:	Predictive analytics using supervised learning: logistic regression, random forests, and support vector machines. Evaluation focused on each model's precision among the cases it predicts to be most at risk, which reflects the limited share of cases that CDPH can address with its resources. Performance was tested by updating the model with new information over time, to account for how it adapts as the available information shrinks over the years and the number of lead poisoning cases decreases. Validation used different years and periods for the training and test data, and determined the optimum number of years of training data.
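The evaluation described above (precision among the highest-risk cases, validated on later years than those used for training) can be sketched as follows. This is a minimal illustration, not the project's code: the function names `precision_at_k` and `temporal_splits`, the 20% cutoff, and the sample scores and years are all assumptions made for the example.

```python
def precision_at_k(scores, labels, fraction):
    """Precision among the top `fraction` of cases, ranked by predicted risk.

    scores:   predicted risk per child (higher means more at risk)
    labels:   1 if the child was actually lead poisoned, else 0
    fraction: share of cases the agency can act on (hypothetical parameter)
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    k = max(1, int(len(scores) * fraction))
    # Rank cases from highest to lowest predicted risk and keep the top k.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def temporal_splits(years, train_window):
    """Yield (training_years, test_year) pairs that never train on the future."""
    ordered = sorted(set(years))
    for i in range(train_window, len(ordered)):
        yield ordered[i - train_window:i], ordered[i]


# Illustrative data only: ten children, three of whom were poisoned.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

print("precision in top 20%:", precision_at_k(scores, labels, 0.2))
print("base rate:", sum(labels) / len(labels))

# Varying train_window is one way to compare how many years of training
# data are needed; the years here are made up.
for train_years, test_year in temporal_splits([2008, 2009, 2010, 2011], 2):
    print("train on", train_years, "and test on", test_year)
```

Comparing the top-k precision with the base rate gives the same kind of contrast as a model versus random classification.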
Datasets Used:	Data Set 1: Chicago Department of Public Health (1993-2003); Data Set 2: City records of building footprints; Data Set 3: City property value assessments; Data Set 4: Census data and ward boundaries; Data Set 5: American Community Survey five-year estimates
Outcome:	Pre-Solution Performance: reactive, very slow, lethal to children, and costly. Post-Solution Performance: predictive and very good. Precision was 20% for logistic regression, compared with 4% for the baseline (random classification); the other models had similar precision. The attributes derived through logistic regression made sense. The solution is expected to speed up inspection rates and lower costs significantly.
Issues that arose:	Aggregating different blood tests for the same child was difficult because names and birthdates were entered in error-prone and inconsistent formats. Home addresses on blood tests were prone to typographic errors: only 20% matched the address dataset exactly, 75% matched after cleaning and applying regular expressions, and roughly 4% could not be resolved. Missing information, such as gender and ethnicity, was inferred from other information and models. Addresses had to be spatially aggregated to the same building in order to use the building's structural data properly. Older records were in PDF format at best and on printed paper at worst. The model's potential was held back because the State withheld about 40,000 records of annual births occurring in Chicago.
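The address-matching issue above (few exact matches before cleaning, many more after) can be illustrated with a small normalization routine. This is a sketch under stated assumptions, not the project's actual cleaning rules: the abbreviation tables, the regular expressions, and the names `normalize_address` and `match_rate` are all hypothetical.

```python
import re

# Hypothetical abbreviation tables; a real cleaning pass would need many more.
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD",
            "ROAD": "RD", "DRIVE": "DR"}


def normalize_address(raw):
    """Reduce a free-text street address to a canonical upper-case form."""
    text = raw.upper().strip()
    # Drop trailing unit designators such as ", APT. 3B" or " #2".
    text = re.sub(r"[\s,]+(?:APT|UNIT|STE)\b\.?\s*#?\s*[\w-]+$", "", text)
    text = re.sub(r"[\s,]*#\s*[\w-]+$", "", text)
    # Replace remaining punctuation with spaces, then abbreviate each token.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [DIRECTIONS.get(t, SUFFIXES.get(t, t)) for t in text.split()]
    return " ".join(tokens)


def match_rate(raw_addresses, reference):
    """Share of raw addresses found in the reference set after normalization."""
    if not raw_addresses:
        return 0.0
    hits = sum(1 for a in raw_addresses if normalize_address(a) in reference)
    return hits / len(raw_addresses)


# Illustrative reference set and inputs only.
reference = {"1234 N MICHIGAN AVE"}
tests = ["1234 north  Michigan Avenue, Apt. 3B", "1234 N. Michigan Ave #2",
         "99 Unknown Rd"]
print(match_rate(tests, reference))
```

Addresses that still fail to match after a pass like this would need fuzzy matching or manual review, which is one way a small unresolved remainder can persist.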
Status:	Operational
Entered by:	Pablo Orozco 1003678647 pablo.orozco@mail.utoronto.ca November 20, 2017

CEM1002, Civil Engineering, University of Toronto

Contact: msf@eil.utoronto.ca