Application Title

City:	United States of America, Global.
Organization:	National Science Foundation grant CCF-1522054 (COMPUSTNET: Expanding Horizons of Computational Sustainability).
Project Start Date:	December 15, 2015
Project End Date:	November 30, 2020 (Estimated)
Reference:	C. Robinson and B. Dilkina, ‘A Machine Learning Approach to Modeling Human Migration’, in Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS) - COMPASS ’18, Menlo Park and San Jose, CA, USA, 2018, pp. 1–8, doi: 10.1145/3209811.3209868.
Problem:	Human migration has a major impact on cities. Accurate prediction of migration flows into cities is essential for city planning, infectious disease control, public policy development, and international trade. This creates a need for models that are more accurate than those currently in use (i.e., gravity and radiation models). Conventional migration models use only population and distance as predictors, and their fixed functional form prevents them from capturing more complicated migration dynamics. These shortcomings motivated the development of new models that draw on more variables to improve accuracy. The study developed models to predict human migration between US counties and between countries on a global scale.
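For context, the two conventional baselines named above can be sketched as follows. This is a minimal illustration of the textbook gravity and radiation forms, not the exact variants fitted in the study. The function names, the power-law distance decay, and the parameters alpha, beta, gamma, and out_total are assumptions made for illustration. Both functions take only zone populations and pairwise distances as inputs, which is the limitation described in the Problem field.

```python
import numpy as np

def gravity_flows(pop, dist, alpha=1.0, beta=1.0, gamma=2.0):
    """Textbook gravity form: T_ij = m_i^alpha * m_j^beta / d_ij^gamma.

    alpha, beta, gamma are illustrative defaults (assumptions), not fitted values.
    """
    pop = np.asarray(pop, dtype=float)
    dist = np.asarray(dist, dtype=float)
    # The diagonal has d_ii = 0; suppress the warning and zero it afterwards.
    with np.errstate(divide="ignore", invalid="ignore"):
        flows = np.outer(pop ** alpha, pop ** beta) / dist ** gamma
    np.fill_diagonal(flows, 0.0)
    return flows

def radiation_flows(pop, dist, out_total):
    """Textbook radiation form:
    T_ij = T_i * m_i * m_j / ((m_i + s_ij) * (m_i + m_j + s_ij)),
    where s_ij is the population within distance d_ij of zone i,
    excluding zones i and j themselves.
    """
    pop = np.asarray(pop, dtype=float)
    dist = np.asarray(dist, dtype=float)
    out_total = np.asarray(out_total, dtype=float)
    n = len(pop)
    flows = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Zones no farther from i than j is (this mask includes i and j).
            within = dist[i] <= dist[i, j]
            s_ij = pop[within].sum() - pop[i] - pop[j]
            flows[i, j] = (out_total[i] * pop[i] * pop[j]
                           / ((pop[i] + s_ij) * (pop[i] + pop[j] + s_ij)))
    return flows

# Toy example with three zones on a line (made-up numbers).
pop = [100, 200, 300]
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
g = gravity_flows(pop, dist)
r = radiation_flows(pop, dist, out_total=[10, 10, 10])
```

Neither function has a place for additional zone features, which is why richer models need a different functional form.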
Technical Solution:	The study used two techniques to predict human migration between US counties and between countries on a global scale: (1) “Extreme” Gradient Boosting regression (XGBoost model); (2) an Artificial Neural Network model (ANN model). Five evaluation methods were used: (1) Common Part of Commuters (CPC), which is identical to the Bray-Curtis similarity score; (2) Common Part of Commuters Distance Variant (CPCd); (3) root mean squared error (RMSE); (4) coefficient of determination (r^2); (5) a comparison of the ground-truth and predicted numbers of incoming migrants per zone, using mean absolute error (MAE) and r^2. The datasets were divided into segments of three years; within each segment, one year was used for training, one for validation, and one for testing. Before the final models were trained, hyperparameters were selected using the training and validation datasets.
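Several of the evaluation metrics listed above can be sketched in a few lines. This is a minimal sketch, assuming observed and predicted flows are stored as equally shaped origin-by-destination arrays (rows are origins, columns are destinations). The function names are illustrative. CPC is written in its standard form, 2 * sum(min(T, P)) / (sum(T) + sum(P)), and r^2 is computed as 1 - SS_res / SS_tot, which may differ from the exact variant used in the study. CPCd is omitted because it additionally requires distance bins.

```python
import numpy as np

def cpc(true, pred):
    """Common Part of Commuters: 1.0 for identical matrices, 0.0 for no overlap."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return 2.0 * np.minimum(true, pred).sum() / (true.sum() + pred.sum())

def rmse(true, pred):
    """Root mean squared error over all origin-destination entries."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((true - pred) ** 2)))

def r2(true, pred):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ss_res = ((true - pred) ** 2).sum()
    ss_tot = ((true - true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

def incoming_mae(true, pred):
    """MAE between observed and predicted incoming migrants per zone.

    Incoming migrants for a zone are the column sums, assuming rows are origins.
    """
    true_in = np.asarray(true, dtype=float).sum(axis=0)
    pred_in = np.asarray(pred, dtype=float).sum(axis=0)
    return float(np.mean(np.abs(true_in - pred_in)))

# Toy two-zone example (made-up numbers).
T = [[0, 10], [5, 0]]
P = [[0, 8], [7, 0]]
scores = {
    "cpc": cpc(T, P),
    "rmse": rmse(T, P),
    "r2": r2(T, P),
    "incoming_mae": incoming_mae(T, P),
}
```

With r^2 defined this way, a value near zero means the model explains little more than predicting the mean flow would.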
Datasets Used:	Dataset 1: USA Migration dataset; source: IRS Tax-Stats data; date: 2004 to 2014. Dataset 2: County features; source: US Census estimates and values calculated from the Census TIGER/Line maps of US county boundaries; date: not stated, but probably 2004 to 2014. Dataset 3: Between-county features; source: calculated based on the idea of “intervening opportunities”; date: not stated, but probably 2004 to 2014. Dataset 4: Global Migration dataset; source: World Bank Global Bilateral Migration Database; date: five timesteps, one every 10 years from 1960 to 2000. Dataset 5: Country features; source: World Bank World Development Indicators data; date: not stated, but probably the same period as Dataset 4. Dataset 6: Between-country features; source: NA; date: not stated, but probably the same period as Dataset 4.
Outcome:	Pre-Solution Performance: For the USA Migration dataset, the most accurate traditional model was the Extended Radiation model. For the Global Migration dataset, however, all traditional models had a coefficient of determination around zero, indicating a poor fit. Post-Solution Performance: The machine learning (ML) models outperformed the traditional models when constrained to the same conditions; the ANN performed best on the USA Migration dataset, and XGBoost performed best on the Global Migration dataset. Under the extended conditions, the ML models performed even better. The results indicate that more features than those used in the traditional approach must be considered to predict migration accurately. Accordingly, the study identified the ten factors most strongly correlated with human migration, which extend far beyond the traditional approach and align well with intuition. The ANN model was also very accurate in predicting rural migration, in contrast to the overestimation usually found in the traditional models.
Issues that arose:	Hyperparameter optimization: For XGBoost, the maximum tree depth, number of estimators, and learning rate were tuned. For the ANN, the loss function, number of layers, layer width, number of training epochs, and training mini-batch size were tuned. Zero-inflated data: Fewer than one percent of US county pairs had recorded migration flows; the remaining values were zeros, which would cause problems for model training. The issue was addressed by introducing a hyperparameter k into both models. Poor performance of common loss functions with the ANN models: This was partly because some could not penalize large errors and others could not cope with the many zeros. It was resolved by introducing a custom loss function.
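This record does not say how the hyperparameter k enters the models. One common way to handle zero-inflated pairwise data is to downsample the zero-flow pairs during training, keeping at most k zero pairs for every nonzero pair. The sketch below assumes that interpretation; the function name, the seed argument, and the sampling scheme are illustrative assumptions, not the study's confirmed method.

```python
import numpy as np

def downsample_zero_pairs(flows, k, seed=0):
    """Keep every nonzero origin-destination pair plus at most k zero pairs
    per nonzero pair (assumed meaning of k; see the note above).

    flows: square origin-by-destination matrix of migrant counts.
    Returns (pairs, targets): an array of (origin, destination) indices
    and the matching flow values.
    """
    rng = np.random.default_rng(seed)
    flows = np.asarray(flows)
    # Ignore the diagonal: a zone does not migrate to itself.
    off_diag = ~np.eye(flows.shape[0], dtype=bool)
    pos = np.argwhere((flows > 0) & off_diag)
    neg = np.argwhere((flows == 0) & off_diag)
    # Sample zero pairs without replacement, capped by what is available.
    n_keep = min(len(neg), k * len(pos))
    keep = rng.choice(len(neg), size=n_keep, replace=False)
    pairs = np.vstack([pos, neg[keep]])
    targets = flows[pairs[:, 0], pairs[:, 1]]
    return pairs, targets

# Toy 4-zone example: 2 nonzero flows among 12 off-diagonal pairs (made-up numbers).
flows = np.zeros((4, 4), dtype=int)
flows[0, 1] = 12
flows[2, 3] = 7
pairs, targets = downsample_zero_pairs(flows, k=2)
```

Under this reading, k would be tuned on the validation year like any other hyperparameter: a small k trains faster but shows the model fewer examples of empty pairs.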
Status:	In Development.
Entered by:	30-October-2020: Ahmad Al-Musa, a.almusa@mail.utoronto.ca

CEM1002,
Civil Engineering, University of Toronto
Contact: msf@eil.utoronto.ca