Barcelona Tourism Analytics

City: Barcelona, Catalonia, Spain
Organization: Department of Business Administration and Economic Management of Natural Resources, University of Lleida and Rovira i Virgili University, Spain.
Project Start Date: 2015
Project End Date: Published April 2015
Reference: Marine-Roig, E., & Clavé, S. A. (2015). Tourism analytics with massive user-generated content: A case study of Barcelona. Journal of Destination Marketing & Management, 4(3), 162–172. https://www.sciencedirect.com/science/article/pii/S2212571X15000359?via%3Dihub
Problem: Understanding the image of Barcelona presented online by English-speaking tourists, in order to improve destination management.
Technical Solution: Analyzed the text of user-generated content from online travel blogs and travel reviews relating to Barcelona to determine the most commonly used words. These words describe the image of Barcelona that tourists present online, broken down by tourism sector and location.

This was done in four main steps:

1. Selected appropriate websites for analysis
  • Searched for websites with specified attributes: post title, destination and trip date.
  • Searched for "travel blog" and "travel review" sites which had at least 100 user generated entries related to Barcelona, ending up with 11 travel websites (tripadvisor.com, for example).
  • Excluded websites specific to accommodation and dining.
  • Ranked websites by visibility (inbound links), popularity (visits and traffic) and size (number of entries relating to Barcelona), and kept the top four websites.
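
The ranking in the last bullet could be sketched as a rank sum over the three criteria. This is only an illustration: the record does not state how the three rankings were combined, so the equal-weight sum is an assumption, and the site names and figures below are invented placeholders rather than data from the study.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    inbound_links: int   # visibility
    monthly_visits: int  # popularity
    entries: int         # size: entries relating to Barcelona


def rank_positions(sites, key):
    """Map site name -> rank (1 = best) for one criterion."""
    ordered = sorted(sites, key=key, reverse=True)
    return {s.name: i + 1 for i, s in enumerate(ordered)}


def top_sites(sites, n=4, min_entries=100):
    # Keep only sites with at least `min_entries` Barcelona entries.
    eligible = [s for s in sites if s.entries >= min_entries]
    vis = rank_positions(eligible, lambda s: s.inbound_links)
    pop = rank_positions(eligible, lambda s: s.monthly_visits)
    size = rank_positions(eligible, lambda s: s.entries)
    # Assumption: equal-weight sum of the three ranks, lower is better.
    score = {s.name: vis[s.name] + pop[s.name] + size[s.name] for s in eligible}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(score, key=lambda name: (score[name], name))[:n]


# Placeholder data, not taken from the study.
sites = [
    Site("site-a.example", 900, 5000, 40000),
    Site("site-b.example", 700, 8000, 30000),
    Site("site-c.example", 300, 2000, 500),
    Site("site-d.example", 100, 1000, 150),
    Site("site-e.example", 50, 500, 120),
    Site("site-f.example", 2000, 9000, 60),  # dropped: fewer than 100 entries
]
print(top_sites(sites))
```

A rank sum avoids having to put links, visits and entry counts on a common scale, which is why it is used here instead of a weighted average of the raw numbers.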

2. Data collection: filtered the top four websites by content to select specific posts or web pages for download. The following filters were applied:
  • Level filter: number of clicks required to get from the website's home page to the destination page.
  • File type filter: filename extension.
  • URL filter: protocol, server, domain, directories, filenames and keywords.
  • Content filter: presence of keywords in the page.
Roughly 100,000 web pages were downloaded.
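
The four filters above can be pictured as a chain of checks applied to each candidate page. The sketch below is hypothetical: the study does not report its thresholds or its download tool, so every setting (MAX_LEVEL, the allowed extensions, the host name and the keywords) is an assumed placeholder.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical settings; the study does not report its actual thresholds.
MAX_LEVEL = 3                                # max clicks from the home page
ALLOWED_EXTENSIONS = {"", ".html", ".htm"}   # "" covers directory-style URLs
ALLOWED_HOSTS = {"www.example-travel.com"}   # placeholder host
URL_KEYWORDS = ("barcelona",)
CONTENT_KEYWORDS = ("barcelona",)


def passes_filters(url, level, text):
    """Return True if a page survives the level, file type, URL and content filters."""
    parts = urlparse(url)
    extension = posixpath.splitext(parts.path)[1].lower()

    if level > MAX_LEVEL:                          # level filter
        return False
    if extension not in ALLOWED_EXTENSIONS:        # file type filter
        return False
    if parts.scheme not in ("http", "https"):      # URL filter: protocol
        return False
    if parts.netloc.lower() not in ALLOWED_HOSTS:  # URL filter: server/domain
        return False
    if not any(k in parts.path.lower() for k in URL_KEYWORDS):  # URL filter: keywords
        return False
    return any(k in text.lower() for k in CONTENT_KEYWORDS)     # content filter


url = "https://www.example-travel.com/spain/barcelona/trip-1.html"
print(passes_filters(url, level=2, text="Three days in Barcelona"))
```

The cheap checks (level, extension, URL) run before the content check, so in a real crawler most pages could be rejected without being downloaded.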

3. Data pre-processing
  • Web content mining
    • Finding the user's hometown: the sites' hyperlinks are geographically structured, so the country of origin was extracted from them.
    • User language: detected with the specialized software "Language Detection Library".
  • Data arranging: pages were organized by attributes (web host, language, poster's country, topic).
  • Data cleaning: HTML tags and negligible text were removed with a separate program, aiming to strip advertisements, copyright notices and navigation menus without affecting user-generated content. This reduced the webpage content by a factor of 25.
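
The data cleaning step can be illustrated with Python's standard html.parser module. This is a minimal sketch and not the program used in the study, which is not named: it assumes that boilerplate lives in script, style, nav, header, footer and aside elements, which real travel sites will not always respect.

```python
from html.parser import HTMLParser

# Assumption: boilerplate sits inside these elements.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Block-level tags that should separate words in the extracted text.
BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "td"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def clean_html(html):
    """Drop tags and skipped elements, then collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Invented sample page.
page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<p>We loved the <b>Sagrada Familia</b>.</p>"
    "<footer>Copyright notice</footer></body></html>"
)
text = clean_html(page)
print(text)
print(f"size reduction: {len(page) / len(text):.1f}x")
```

The printed ratio is the same kind of measure as the factor of 25 reported above, computed here on a toy page only.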

4. Content analysis
  • Used the content parser "Site Content Analyzer", which suggests the most used keywords and reports the frequency, density and weight of each keyword.
  • Parser settings used:
    • Ignored inconsequential words.
    • Added composite words to the library such as "Sagrada Familia".
    • Added weight to words mentioned closer to the beginning of a page due to prominence and visibility.
  • Keywords were generated.
  • Classification of keywords:
    • Based on previous work and a preliminary frequency analysis, nine tourism sectors were defined: Food and wine; Intangible heritage; Leisure and recreation; Nature and active tourism; Sports; Sun, sea and sand; Tangible heritage; Urban environment; Smart city.
    • Words were classified into the appropriate tourism sector.
    • Words were also split into three categories according to whether they, or the webpage they came from, related to one of the following sub-locations: Barcelona, Barcelona Coast and Barcelona Landscapes.
  • The number of keywords in each tourism sector was counted for each location.
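
The three keyword measures and the parser settings can be sketched as follows. This is not Site Content Analyzer's algorithm, whose formulas are not given here. It assumes that density is a keyword's share of the non-ignored words, and that weight decays linearly from 1.0 for the first word to near 0 for the last. The stop-word list and the underscore convention for composite words are also illustrative choices.

```python
import re
from collections import Counter

# Illustrative settings, not the ones used in the study.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "we", "it"}
COMPOSITES = {"sagrada familia": "sagrada_familia"}  # treated as a single keyword


def tokenize(text):
    """Lowercase, merge composite words, drop inconsequential words."""
    text = text.lower()
    for phrase, token in COMPOSITES.items():
        text = text.replace(phrase, token)
    return [w for w in re.findall(r"[a-z_]+", text) if w not in STOP_WORDS]


def keyword_stats(text):
    tokens = tokenize(text)
    total = len(tokens)
    frequency = Counter(tokens)
    weight = Counter()
    for position, word in enumerate(tokens):
        # Assumed linear decay: earlier mentions count for more.
        weight[word] += 1.0 - position / total
    return {
        word: {
            "frequency": frequency[word],
            "density": frequency[word] / total,
            "weight": weight[word],
        }
        for word in frequency
    }


stats = keyword_stats(
    "The Sagrada Familia is amazing. We visited the Sagrada Familia and the beach."
)
for word, values in sorted(stats.items(), key=lambda kv: -kv[1]["weight"]):
    print(word, values)
```

Under these assumptions two keywords with the same frequency can still differ in weight, which is the effect the prominence setting above is meant to capture.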
Datasets Used:
  • English text from roughly 100,000 web pages: user-generated content from travel blogs and online travel reviews relating to Barcelona, posted between 2005 and 2015.
Outcome: The analysis yielded 84,945 unique keywords, which can be used for further analysis.
  • These words were arranged by frequency and weight to give the top 25 words. Six of the top 25 were positive words, indicating a positive perception of the destination. Some of Barcelona's UNESCO World Heritage Sites also appeared in the top 25, which reinforces the importance of these attractions to the city's image.
  • The words were split into three location "brands" and then classified into the nine tourism sectors. The number of times a keyword was mentioned in each sector was counted for the three locations, giving an image of each location and showing the relative importance of the identified tourism sectors there.
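
The per-location sector counts described in the last bullet amount to a simple cross-tabulation. The sketch below assumes a keyword-to-sector lookup table. The study's actual classification method is not described, so the table and all counts are invented for illustration.

```python
from collections import defaultdict

# Illustrative lookup only; not the study's keyword classification.
SECTOR_OF = {
    "tapas": "Food and wine",
    "beach": "Sun, sea and sand",
    "sagrada_familia": "Tangible heritage",
    "hiking": "Nature and active tourism",
}


def sector_counts(keyword_counts_by_location):
    """Sum keyword mentions per tourism sector for each location."""
    table = defaultdict(lambda: defaultdict(int))
    for location, counts in keyword_counts_by_location.items():
        for keyword, mentions in counts.items():
            sector = SECTOR_OF.get(keyword)
            if sector is not None:  # unclassified keywords are skipped
                table[location][sector] += mentions
    return {location: dict(sectors) for location, sectors in table.items()}


# Invented counts for the three location "brands".
counts = {
    "Barcelona": {"sagrada_familia": 12, "tapas": 7, "metro": 3},
    "Barcelona Coast": {"beach": 9, "tapas": 2},
    "Barcelona Landscapes": {"hiking": 4},
}
result = sector_counts(counts)
for location, sectors in result.items():
    print(location, sectors)
```

Comparing the rows of such a table is what allows one location to be characterized as, say, heritage-led and another as beach-led.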
The results provide useful insight into destination management. These can be used to direct tourism flow, complement or strengthen the tourism offers in different areas, assess branding strategies for Barcelona and for each location, and overall identify strengths and weaknesses in the tourism sector for the city.
Issues that arose: The study did not report any issues. The issues identified by the reader mostly concern missing detail in several steps:
  • The specific classification method was not described.
  • The data analysis software or program was not named.
  • The webpage filtering tool was not described.
  • For the data collection phase, the study describes the filters used to download webpages but not the filter thresholds.
Status: Terminated - Study completed and published in 2015.
Entered by: September 27, 2019 : Lisa MacTavish, l.mactavish@mail.utoronto.ca


CEM1002,
Civil Engineering, University of Toronto
Contact: msf@eil.utoronto.ca