Inferring Urban Land Use Using Large-Scale Social Media Check-In Data

City:	New York, NY, U.S.A.
Organization:	Lyles School of Civil Engineering, Purdue University
Project Start Date:	Unknown (Probably 2014)
Project End Date:	18 September 2014 (Published)
Reference:	Zhan, X., Ukkusuri, S. V., & Zhu, F. (2014). Inferring Urban Land Use Using Large-Scale Social Media Check-in Data. Networks and Spatial Economics, 14(3-4), 647-667. doi:10.1007/s11067-014-9264-4
Problem:	How can the use of plots of land in an urban setting be categorized more efficiently? The study seeks a tool for inferring urban land use types as an alternative to the traditional approach of checking building permit data, on-site investigations, survey data, and questionnaires, which is labour-intensive, time-consuming, and expensive. Another approach, based on remote sensing with high-spatial-resolution satellite sensors, might not be applicable to urban areas.
Technical Solution:	Data Preparation:
Raw check-in data were processed for geographic coordinates, and the activity categories were grouped into home, work, eating, recreation, shopping, social service, travel related, entertainment, and education. The last two were excluded because they were found to be indecisive in land use inference.
The data were broken up into 8 time periods a day and 7 days a week, with the days grouped into weekdays and weekends.
The city map was divided into 200m x 200m cells, and the activity category with the highest % of check-ins in a cell governed that cell.
Each cell had a data “tuple” (a finite ordered list of elements). Each cell’s input vector was normalized over its total number of check-ins, giving 112 features per cell (7 categories x 8 time periods x 2 weekday/weekend classes).
A Laplacian Score was used to select the top 50 features (out of 112) to avoid over-fitting.
Results of the algorithms were compared with the NYC Department of City Planning MapPLUTO Data (Tax Mapping/Land Use Data in GIS), used as “ground truth data”.
Unsupervised Learning Algorithms:
Standard Partitioning Methods (Hard Partitioning): K-means, K-medoid, and Dynamic Time Warping based K-means. K-means had the best results among the unsupervised learning approaches.
Fuzzy Partitioning Methods: FCM, Gustafson-Kessel Algorithm, Gath-Geva Algorithm. These do not work well with data sets of high dimensionality.
Clustering Algorithms: PROCLUS & SUBCLU. These are good for high dimensionality but resulted in a large number of un-clustered cells and computational complexity due to the large number of features.
Supervised Learning Algorithms:
Random Forest (best results, highest relative accuracy): an ensemble method “which uses a combination of randomly generated decision tree classifiers to increase accuracy.”
Support Vector Machine (SVM): classifies data by maximizing the perpendicular distance between the decision boundary and the closest data points, known as support vectors.
Naïve Bayes Method: a probabilistic classifier based on Bayes' Theorem.
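The feature construction described above (200m x 200m cells, 7 activity categories, 8 time periods, weekday/weekend split, normalization by each cell's total check-ins) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names, the input record layout, the use of projected coordinates in metres, the equal 3-hour time periods, and the ordering of the 112 features are all assumptions made for the example.

```python
import math
from collections import Counter

# The seven activity categories kept for land use inference.
CATEGORIES = ["home", "work", "eating", "recreation",
              "shopping", "social service", "travel related"]
N_PERIODS = 8        # assumption: eight equal 3-hour periods per day
DAY_TYPES = 2        # 0 = weekday, 1 = weekend
CELL_SIZE_M = 200.0  # cell edge length in metres


def cell_index(x_m, y_m):
    """Map projected coordinates (metres) to a 200m x 200m grid cell."""
    return (math.floor(x_m / CELL_SIZE_M), math.floor(y_m / CELL_SIZE_M))


def feature_index(category, hour, is_weekend):
    """Position of a (category, period, day type) triple among the 112 features.

    The ordering is an arbitrary choice for this sketch.
    """
    period = hour // (24 // N_PERIODS)
    cat = CATEGORIES.index(category)
    return (cat * N_PERIODS + period) * DAY_TYPES + int(is_weekend)


def build_features(checkins):
    """Build one normalized 112-element vector per cell.

    checkins: iterable of (x_m, y_m, category, hour, is_weekend) records
    (a hypothetical layout chosen for illustration).
    """
    counts = {}
    for x_m, y_m, category, hour, is_weekend in checkins:
        cell = cell_index(x_m, y_m)
        idx = feature_index(category, hour, is_weekend)
        counts.setdefault(cell, Counter())[idx] += 1

    n_features = len(CATEGORIES) * N_PERIODS * DAY_TYPES  # 7 x 8 x 2 = 112
    features = {}
    for cell, counter in counts.items():
        total = sum(counter.values())
        # Normalize by the cell's total number of check-ins.
        features[cell] = [counter[i] / total for i in range(n_features)]
    return features


# Made-up check-ins for illustration only.
sample = [
    (10, 10, "home", 1, False),
    (150, 199, "home", 1, False),
    (50, 50, "eating", 13, True),
    (450, 10, "work", 9, False),
]
vectors = build_features(sample)
```

Each resulting vector sums to 1, so cells with very different check-in volumes become comparable before feature selection and clustering or classification.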
Datasets Used:	Dataset 1: Large-scale NYC check-in dataset (460,000 check-ins from 18,440 users) collected via Twitter and Foursquare, from: Cheng Z et al. (2011) Exploring millions of footprints in location sharing services. AAAI ICWSM. It was used in conjunction with additional venue category information collected for the check-ins.
Dataset 2: New York City Department of City Planning (NYCDCP) MapPLUTO Data (Tax Mapping and Land Use), from: New York City Department of City Planning (NYCDCP) (2013) MapPluto. http://www.nyc.gov/html/dcp/html/bytes/dwn_pluto_mappluto.shtml#mappluto
Outcome:	Supervised Learning: Accuracy was ranked by the F-measure (F1 = 2(precision*recall)/(precision+recall)), which is the harmonic mean of precision (% of tuples classified as positive that are actually positive) and recall (% of positive tuples that are classified as positive).
F-Measure (Relative Accuracy): Naive Bayes = 31.39%, SVM = 60.92%, Random Forest = 64.14%.
When 50% of the ground truth data was added to the Random Forest algorithm, it produced 78.69% accuracy (94% for the residential land use type, 60% for commercial/open space/recreation land use, and more than 44% for everything else).
Unsupervised Learning: K-means clustering resulted in 65.6% accuracy (78.7% for residential land use types, 37% for commercial, and 42% for transportation and utility). No ground truth information is needed.
Overall: The study indicates that activity-related information from social media check-in data contributes to the prediction of urban land use.
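As a small worked illustration of the F-measure defined above, the sketch below computes precision, recall, and F1 for one land use class from paired true and predicted labels. The label values and helper names are hypothetical and chosen only for the example; they are not taken from the study.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class treated as the positive label."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


def f_measure(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels for five cells (illustration only).
y_true = ["residential", "residential", "commercial", "residential", "commercial"]
y_pred = ["residential", "commercial", "commercial", "residential", "residential"]

p, r = precision_recall(y_true, y_pred, positive="residential")
f1 = f_measure(p, r)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which makes it stricter than a simple average of the two.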
Issues that arose:	Unsupervised Clustering:
Extra effort is required to identify land use types from the clustering results, the method may not infer all the desired land use types, and accuracy is relatively lower than that of the supervised learning algorithm (when extra ground truth information is provided to that algorithm).
K-means had a tendency to over-predict transportation & utility.
Mixed residential/commercial use could not be inferred (this category could not be compared easily with the MapPLUTO Data).
Supervised Learning:
Ground truth information is required, and it may not exist in small urban areas.
The Random Forest approach’s accuracy is “inflated” because, due to limited data, the entire data set (training, testing, and unlabeled data) was used for testing, and ground truth data was also added. Accuracies of the supervised and unsupervised approaches are therefore not directly comparable.
General:
Approximation errors can come from using the dominant land use type for each 200m x 200m cell.
Only 1 year's worth of data was available.
Industrial/manufacturing land use types were not inferable due to a lack of social media check-in data.
Social media check-in data is highly diverse and very concentrated in big cities, which makes land use types harder to separate.
Status:	Terminated (Unknown if the NYCDCP uses this method in conjunction with their MapPLUTO Data)
Entered by:	Kevin Jeswani, kevin.jeswani@mail.utoronto.ca

CEM1002,

Civil Engineering, University of Toronto

Contact: msf@eil.utoronto.ca