Understanding Urban Gentrification through Machine Learning

City:	Greater London, England, United Kingdom
Organization:	Urban Studies - King’s College London, UK
Project Start Date:	Not specified (possibly August 2017, since references note that datasets were accessed then)
Project End Date:	September 25, 2018 (First published)
Reference:	Reades, Jon & Souza, Jordan & Hubbard, Phil. (2018). Understanding urban gentrification through machine learning. Urban Studies. ISBN:004209801878905. DOI: 10.1177/0042098018789054.
Problem:	Challenges of Qualitative Assessments for Urban Gentrification: The study of neighbourhood change, and of gentrification in particular, is largely done through rigorous post-mortem qualitative assessments based on observational data collection methods such as media analysis, interviews, and ethnography. A current challenge of these assessments is that they tend to focus on a few signifying locations that are widely recognized, so areas experiencing similar changes are often overlooked. To provide a more extensive analysis of gentrification and neighbourhood change, both spatially and temporally, case studies must explore quantitative methodologies, including applications of machine learning.
Challenges of Quantitative Assessments for Urban Gentrification: Within the British urban studies literature, two perceived challenges have prevented the application of quantitative assessments to date: the limitations of secondary data in capturing the dynamics of urban processes that occur at the local level; and the suspicion that official statistics on neighbourhood change may describe patterns but leave the underlying processes of class change unclear.
Objective of Study: This study challenges the above notions and explores the application of machine learning techniques to demonstrate their potential for analyzing existing patterns and processes of neighbourhood change, and for predicting which areas are likely to change.
Technical Solution:	The study used supervised machine learning to build a model on 2001 census data that ‘predicts’ the 2011 scores (i.e. neighbourhood status). The same model was then applied to 2011 census data to forecast gentrification for 2021. The model used two sets of variables:
Scoring: variables that measure neighbourhood status (i.e. household income, property sale value, occupational share, and qualifications).
Prediction: 166 variables that help predict future change (some of which overlap with the scoring variables).
Data science methods used include:
Principal Components Analysis (PCA): combined the four scoring variables into a single measure of socio-economic status, which the algorithm was then trained to predict. PCA was applied to both 2001 and 2011 census data to allow for comparison. Results showed that property prices and incomes have a higher influence on the score than skills and occupational mix.
Standardization: variables were scaled using the median and Inter-Quartile Range (IQR) to prevent one dimension from dominating because of its magnitude. This method preserved outliers while producing comparable scales for the majority of the data. The same transformation was applied to both 2001 and 2011 census data.
Random Forests (RF): The study’s model employs “extremely randomized trees” (i.e. beyond randomly selecting dimensions, it also uses random ‘cut points’ for each split).
Performance measures: Mean Squared Error, Mean Absolute Error and R^2 were used to compare the performance of each configuration.
Linear regression: Simple and multiple linear regression were run to compare RF against traditional methods.
K-fold cross-validation: used to train and test the algorithm, and to test every combination of the hyper-parameters that govern RF.
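The scaling and scoring steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the study's code: the function names, the toy values, the use of only two scoring variables, and the choice to fit the scaler and the component on 2001 values and reuse them for 2011 are all assumptions made for the example. The first principal component is estimated here by simple power iteration.

```python
from statistics import median, quantiles


def fit_robust_scaler(column):
    """Return the (median, IQR) pair that defines the scaling for one variable."""
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    return median(column), q3 - q1


def apply_robust_scaler(column, centre, spread):
    """Centre on the median and divide by the IQR; outliers remain outliers."""
    return [(x - centre) / spread for x in column]


def first_component(rows, iterations=200):
    """Estimate the first principal component by power iteration.

    rows: one sequence of numbers per area. Returns (column means, unit direction).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred variables.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting guess; repeatedly multiply by cov and renormalise
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v


def project(rows, means, direction):
    """Score each area as its centred position along the component."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]


# Hypothetical toy values for five areas (not taken from the study's data).
income_2001 = [20.0, 22.0, 24.0, 26.0, 90.0]
price_2001 = [100.0, 120.0, 130.0, 150.0, 600.0]
income_2011 = [21.0, 25.0, 24.0, 30.0, 95.0]
price_2011 = [110.0, 150.0, 135.0, 190.0, 650.0]

# Assumption: the scaling is fitted once on 2001 and reused unchanged for 2011.
scalers = [fit_robust_scaler(col) for col in (income_2001, price_2001)]


def scale_year(columns):
    scaled = [apply_robust_scaler(col, centre, spread)
              for col, (centre, spread) in zip(columns, scalers)]
    return list(zip(*scaled))  # one row per area


rows_2001 = scale_year((income_2001, price_2001))
rows_2011 = scale_year((income_2011, price_2011))

means, direction = first_component(rows_2001)
score_2001 = project(rows_2001, means, direction)
score_2011 = project(rows_2011, means, direction)
```

Because the 2011 values pass through the 2001 scaling and component, the two sets of scores sit on one scale and can be differenced to give a relative measure of change. Libraries such as scikit-learn provide equivalents (`RobustScaler`, `PCA`) that would normally be preferred over hand-written versions.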
Datasets Used:	The study made only a high-level reference to two data sources: the 2001 and 2011 UK Census of Population and the London Datastore, with Office for National Statistics (ONS) data at the Lower Layer Super Output Area (LSOA) level. Note that 166 variables were considered in the study’s analysis. The datasets used are assumed to include at least:
Dataset 1: Population and Household Minimum and Maximum Thresholds for SOAs in England and Wales, Office for National Statistics, 2011
Dataset 2: 2001 & 2011 Household and Families data, London Datastore, February 5, 2013
Dataset 3: 2001 & 2011 Housing Data, London Datastore, February 5, 2013
Dataset 4: 2001 & 2011 Labor Market Data, London Datastore, February 5, 2013
Dataset 5: 2001 & 2011 Migrant Population Data, London Datastore, February 5, 2013
Dataset 6: 2001 & 2011 Qualifications Data, London Datastore, February 5, 2013
Potentially up to 78 other datasets sourced from: 2001 & 2011 Census Information Scheme Commissioned Tables, Office for National Statistics, 2013-2017
Outcome:	Hyper-parameter tuning (optimizing for Mean Squared Error) yielded an RF with the following configuration: 1400 trees, 85% of features considered by each tree, no maximum tree depth, and a minimum leaf size of two. RF was found to outperform multiple linear regression by 10% across measures such as Mean Squared Error and Mean Absolute Error. A Pearson’s r of 0.99 between predicted and observed 2011 scores indicates that the model predicted 2011 neighbourhood status well. Projecting 2021 scores with this model is therefore promising (despite the issues discussed later). RF also generated a feature importance measure (based on each variable’s contribution to the model, out of a maximum value of 1). Of the 166 variables, changes in occupation and skills were found to be drivers of neighbourhood change.
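The measures named in this comparison (Mean Squared Error, Mean Absolute Error, R^2 and Pearson's r) are simple to compute by hand. The sketch below shows one way to do so and how a relative improvement of one model over another could be expressed. All function names and all numbers are hypothetical and chosen only for illustration; none are results from the study.

```python
def mse(actual, predicted):
    """Mean Squared Error: average of squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean Absolute Error: average size of prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def relative_improvement(baseline_error, candidate_error):
    """Fraction by which the candidate's error is lower than the baseline's."""
    return (baseline_error - candidate_error) / baseline_error


# Hypothetical scores for five areas (illustrative values only).
actual_2011 = [0.2, -0.4, 1.1, 0.7, -0.9]
forest_pred = [0.25, -0.35, 1.0, 0.75, -0.85]   # stand-in for RF output
linear_pred = [0.4, -0.1, 0.8, 0.9, -0.6]       # stand-in for regression output

improvement = relative_improvement(mse(actual_2011, linear_pred),
                                   mse(actual_2011, forest_pred))
fit_quality = pearson_r(actual_2011, forest_pred)
```

If scikit-learn were used, its `ExtraTreesRegressor` accepts `n_estimators`, `max_features`, `max_depth` and `min_samples_leaf`, so the reported configuration might be expressed as `n_estimators=1400, max_features=0.85, max_depth=None, min_samples_leaf=2`. That mapping is an assumption: in scikit-learn, `max_features` is the share of features considered at each split, which may not match the study's wording exactly.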
Issues that arose:	Selecting an appropriate geographical scale: Fine-scale data (i.e. block-by-block analysis) is considered highly sensitive and is suppressed from census outputs, and it is also subject to significant fluctuations (i.e. noise). Larger areas lack a sense of cohesion and shared identity, which makes change harder to identify. Working with intermediate or meso-scale data (i.e. ONS LSOA data) was found to be the most appropriate.
Improving the selection of predictor variables: The study aligned its variable selection with the previous urban studies work it cited, focusing primarily on housing, households, work, travel and amenities. Using more built environment and amenity features (e.g. schools) and accounting for immigration (i.e. from the Americas, EU and Oceania) are recommended for consideration in future studies.
Establishing relative measures: Neighbourhood analysis required relative measures of change to contextualize and scale the gentrification research, and to capture the varying magnitudes of neighbourhood change.
Changes to characteristics of gentrifiers: It is a limitation to assume that the gentrifiers of 2001-2011 are the same as those of 2011-2021. The model does not adjust for this.
Predictive model cannot capture significant changes: For example, major property developments and the transfer of residents during redevelopment can transform individual neighbourhoods, but the model was not able to capture them.
Sourcing timely data: The model used census data. The study noted that an area for improvement could be using sources such as Zoopla (a property price website) or Twitter to provide more real-time results.
Influence of neighbouring zones and ‘edge effects’: Census tracts adjacent to the geographical scope of Greater London may have affected the model, but because they were out of scope this was not determined.
Status:	Terminated. No indication that the recommendations for improvement mentioned in the study are currently in development. The study used open data and open source code, which was also released on GitHub to enable replication.
Entered by:	September 28, 2019: Larissa Sequeira, larissa.sequeira@mail.utoronto.ca

CEM1002, Civil Engineering, University of Toronto

Contact: msf@eil.utoronto.ca