City: | Syria |
Organization: | Faculty of Information Technology, Higher Institute for Applied Sciences and Technology, Damascus, Syria |
Project Start Date: | October 2018 |
Project End Date: | January 2019 |
Reference: | Al-Zuabi, I.M., Jafar, A. & Aljoumaa, K. Predicting customer’s gender and age depending on mobile phone data. J Big Data 6, 18 (2019). https://doi.org/10.1186/s40537-019-0180-9 |
Problem: | In the new age of digital marketing, customer demographics play a major role in enabling companies to tailor their service offerings to target customers. The problem is that telecom operators often hold unreliable demographic data about their users. Ideally, marketing campaigns would target the actual user of a GSM (Global System for Mobile Communications) line rather than the line owner, as these are not always the same individual.
This study uses data-driven approaches to analyze telecom customer behaviour, contract information and subscribed services in order to determine the age and gender of the true user of the GSM line. The findings allow companies to identify network performance improvements and to increase the effectiveness of marketing campaigns. With data analytics techniques, operators can identify problems and determine their root causes, improve the quality of the user experience and perform real-time troubleshooting to fix network performance issues. Furthermore, with more accurate information on customers' gender and age, intelligent marketing campaigns can be built for identified customer profiles and segments. |
Technical Solution: | The model was built using a reliable dataset of 18,000 users, provided by SyriaTel Telecom Company, for model training and testing.
Supervised Learning:
The following Unsupervised Learning Methods were applied for dimensionality reduction and outlier detection:
|
Datasets Used: |
|
Outcome: |
Ensemble learning algorithms such as GBM, XGBoost and Random Forest were the most successful at predicting age and gender. These models achieved the best accuracy, AUC and F1-measure in the performance evaluation.
For age prediction, XGBoost performed best with 65.5% accuracy, followed by Random Forest with 64.3% and GBM with 62.6%. For gender prediction, XGBoost again performed best with 85.6% accuracy, followed by GBM with 84.2% and Random Forest with 83.9%. |
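As a rough illustration of how the accuracy and F1-measure reported above are derived from a model's predictions, the following minimal Python sketch computes both for a binary task such as gender prediction. The labels are made up for the example and are not taken from the SyriaTel dataset, and the function name `binary_metrics` is hypothetical; the study's own evaluation code is not part of this record.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for two equal-length label lists."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy alone can look favourable when one class dominates, which is one reason a record like this reports F1 and AUC alongside it.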
Issues that arose: |
The first limitation of this work was the difficulty of collecting a reliable dataset for training and testing from random customers. Only about 18,000 customers were available for analysis, and data collection was time-intensive because it required direct methods and human resources were limited. The collection process took about 6 months.
A second limitation was that the model ignored two major age groups. Predictions could be improved by introducing two new age groups: one for individuals under 18 years old and another for individuals over 60 years old. This would allow for a more detailed and balanced customer split regarding gender.
A third limitation was that the gender dataset was significantly unbalanced. The sample data used to build the model contained 64% males and 36% females. Consequently, the model's predictions are biased towards the male class.
A fourth limitation was that the model was built on only two types of call detail records (CDRs): calls CDR and SMS CDR. The modelers could not handle other types of CDR due to storage and processing limitations. For future experiments, internet usage CDR would be an extremely valuable data source for feature extraction. Its inclusion would effectively expand the reliable dataset, making it more suitable for deep learning algorithms, and subsequent models would likely be more robust and accurate.
The final limitation is that the model was trained strictly on Syrian user data. The ways in which individuals use their phones likely vary from society to society, so the informative features in this study may be more relevant to some societies and less relevant to others. |
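To make the class-imbalance limitation above concrete, the sketch below shows one common mitigation, inverse-frequency ("balanced") class weights, together with the accuracy a trivial always-predict-majority classifier would reach. This is not described as something the authors did. The per-class counts are an assumption obtained by applying the reported 64% / 36% split to roughly 18,000 users, and the function name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Assumed counts: 64% and 36% of about 18,000 users (illustrative only).
counts = {"male": 11520, "female": 6480}

weights = balanced_class_weights(counts)

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / sum(counts.values())

for label, weight in weights.items():
    print(f"{label}: weight={weight:.3f}")
print(f"majority-class baseline accuracy={majority_baseline:.2f}")
```

With these weights each class contributes the same total weight to a weighted loss, so errors on the minority class count for more. The baseline figure also gives context for the reported gender accuracy: a model has to beat 64% to be doing more than predicting the majority class.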
Status: | Terminated |
Entered by: | 28th October, Armin Safari, armin.safari@mail.utoronto.ca |