  • João Ataide

Airbnb Data Analytics - Toronto, Canada

Updated: Feb 15, 2023


Airbnb is already considered to be the largest hotel company today, yet it does not own a single hotel! Its main differentiator is connecting people who want to travel (and stay somewhere) with hosts who want to rent out their properties in a practical way. Airbnb provides an innovative platform for this alternative form of lodging. By the end of 2018, when it was still a startup, it had hosted more than 300 million people around the world, challenging traditional hotel chains. Listing data for the main tourist cities in the world is openly available: through the Inside Airbnb portal, it is possible to download a large amount of data to develop Data Science projects and solutions.

Toronto is the capital of the province of Ontario and one of the most important Canadian cities. It is known for its winter tourism and for being a dynamic metropolis with a core of skyscrapers, as well as many green spaces, from the orderly oval of Queen's Park to the 400-acre High Park and its trails.

Against this background, the present project carries out an exploratory analysis of the data in order to extract insights about the services offered through Airbnb in the city of Toronto, Canada. This project was inspired by an exercise from the Data Science in Practice course by Calor Melo. The complete work can be seen in the Notebook.

The dataset is dated May 7, 2020, and is a summarized version. Since it has several variables that can be analyzed, it was first necessary to study its variable dictionary and understand the meaning of each column name:

  • id - generated id number to identify the property

  • name - advertised property name

  • host_id - id number of the owner (host) of the property

  • host_name - name of the host

  • neighborhood_group - this column does not contain any valid values

  • neighborhood - neighborhood name

  • latitude - coordinate of the property's latitude

  • longitude - coordinate of the property's longitude

  • room_type - informs the type of room offered

  • price - price to rent the property

  • minimum_nights - minimum number of nights to book

  • number_of_reviews - number of reviews the property has

  • last_review - date of the last review

  • reviews_per_month - number of reviews per month

  • calculated_host_listings_count - number of properties from the same host

  • availability_365 - number of days of availability within 365 days

Before starting any analysis, we check the "face" of our dataset by looking at its first 5 entries, as we can see below:
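A first look like this could be sketched in pandas as follows. This is only an illustration: the inline rows are made up to stand in for the real listings file, and the file name mentioned in the comment is an assumption, not something stated in this post.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the real listings file; all values are made up.
SAMPLE_CSV = """id,name,host_id,host_name,neighborhood,room_type,price,minimum_nights
1,Cozy loft,10,Ana,Downtown,Entire home/apt,120,2
2,Quiet room,11,Ben,Annex,Private room,60,1
3,Shared space,12,Cai,Downtown,Shared room,35,1
4,Lake view condo,10,Ana,Waterfront,Entire home/apt,200,3
5,Basement suite,13,Dee,Annex,Private room,75,30
6,Garden studio,14,Eli,Riverdale,Entire home/apt,95,2
"""

# With the real data this would read the downloaded CSV instead,
# e.g. pd.read_csv("listings.csv") (assumed file name).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Keep only the first 5 entries to get a feel for the columns and values.
first_five = df.head()
print(first_five)
```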

With this first look at the data, I formulated some questions to get to know the dataset:

1. How many attributes (variables) and how many entries does our dataset have?

2. What percentage of values are missing from the dataset?

3. What type of distribution do the variables have?

4. What type of property is most rented on Airbnb?

5. What is the most expensive location in the City of Toronto?
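The first question can be answered directly from the shape of the DataFrame. The sketch below uses a tiny made-up DataFrame (not the Toronto data) just to show the idea:

```python
import pandas as pd

# Made-up miniature dataset; the real one has far more rows and columns.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "price": [120, 60, 35],
    "reviews_per_month": [1.5, None, 0.2],
})

# shape returns (number of rows, number of columns).
n_entries, n_attributes = df.shape
print(f"Entries: {n_entries}")
print(f"Attributes: {n_attributes}")

# The dtype of each column tells us which variables are numeric or text.
print(df.dtypes)
```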

Regarding missing values (the second question), most variables have little missing data; however, the neighborhood_group column has 100% of its values missing, as we can see below:
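One way to compute the percentage of missing values per column is sketched here. The DataFrame is a made-up example in which neighborhood_group is entirely empty, mirroring the situation described; the other values are illustrative only.

```python
import pandas as pd

# Made-up example: neighborhood_group has no valid values at all.
df = pd.DataFrame({
    "neighborhood_group": [None, None, None, None],
    "reviews_per_month": [1.5, None, 0.2, 0.8],
    "price": [120, 60, 35, 200],
})

# Count missing values per column, divide by the number of rows,
# and sort so the most incomplete columns come first.
missing_pct = (df.isnull().sum() / df.shape[0] * 100).sort_values(ascending=False)
print(missing_pct)
```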

In addition, I computed a statistical summary of the variables, which led me to raise some hypotheses. The price and minimum_nights variables show signs of outliers, that is, values far outside the usual range:

  • The price variable has 75% of its values below 155.00, but its maximum value is 14058.00, indicating the presence of outliers.

  • The minimum_nights variable reaches a maximum of 1125.00, well above the 365 days in a year; the summary also shows a value of 30.92 days for this variable.
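A statistical summary of this kind can be obtained with describe(). In the sketch below only the two maximum values (14058 and 1125) come from the analysis above; every other number is made up, so the quartiles will not match the real dataset.

```python
import pandas as pd

# Mostly made-up values; only the maxima 14058 and 1125 come from the text.
df = pd.DataFrame({
    "price": [50, 80, 100, 120, 150, 14058],
    "minimum_nights": [1, 1, 2, 3, 30, 1125],
})

# describe() reports count, mean, std, min, quartiles and max per column.
summary = df[["price", "minimum_nights"]].describe()
print(summary)

# A maximum many times larger than the 75% quartile is a hint of outliers.
ratio_max_to_q3 = summary.loc["max"] / summary.loc["75%"]
print(ratio_max_to_q3)
```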

To be more certain, we plot a boxplot for each of these variables. For price, 100 entries are above $2000.00, which is approximately 0.4593% of the dataset.

For minimum_nights, 560 entries are above 30 days, which is approximately 2.58% of the dataset.
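The counts and percentages above could be computed along these lines. The helper share_above is a hypothetical name introduced for illustration, and the data is made up; only the thresholds (2000 for price, 30 for minimum nights) come from the text.

```python
import pandas as pd


def share_above(series, threshold):
    """Return how many values exceed the threshold and their percentage of the column."""
    count = int((series > threshold).sum())
    pct = count / len(series) * 100
    return count, pct


# Made-up values, chosen so that a few entries exceed each threshold.
df = pd.DataFrame({
    "price": [50, 80, 100, 120, 150, 2500, 90, 14058],
    "minimum_nights": [1, 1, 2, 3, 30, 45, 1125, 2],
})

price_count, price_pct = share_above(df["price"], 2000)
nights_count, nights_pct = share_above(df["minimum_nights"], 30)

print(f"price > 2000: {price_count} entries ({price_pct:.2f}%)")
print(f"minimum_nights > 30: {nights_count} entries ({nights_pct:.2f}%)")

# With matplotlib installed, df["price"].plot(kind="box") would draw the boxplot.
```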

We therefore carried out the necessary cleaning: removing the neighborhood_group column and filtering out the outliers in price and minimum_nights. We then plotted histograms, which illustrate the distribution of each attribute.
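A minimal sketch of this cleaning step is shown below. It assumes the same cut-offs used to count the outliers (price above 2000 and minimum nights above 30); the exact filter used in the Notebook may differ, and the data here is made up.

```python
import pandas as pd

# Made-up data with one empty column and two outlier rows.
df = pd.DataFrame({
    "neighborhood_group": [None, None, None, None, None],
    "price": [50, 120, 2500, 90, 75],
    "minimum_nights": [1, 45, 2, 3, 2],
})

# Work on a copy so the original DataFrame stays untouched.
df_clean = df.copy()

# Drop the column that carries no information.
df_clean = df_clean.drop(columns=["neighborhood_group"])

# Keep only rows within the assumed limits for both variables.
within_limits = (df_clean["price"] <= 2000) & (df_clean["minimum_nights"] <= 30)
df_clean = df_clean[within_limits]

print(df_clean)
```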

I then noticed that most of the distributions are not normal. The next question was: "What type of property is most rented on Airbnb?" The company offers entire apartments/houses, the rental of just a private room, or even a room shared with other people. We therefore count the number of occurrences of each rental category and the percentage of listings available for each type.
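Counting the categories and their shares can be sketched with value_counts(). The rows below are made up and do not reflect the real proportions in Toronto; the category labels are assumed to follow the usual room_type values.

```python
import pandas as pd

# Made-up listings; the proportions are illustrative only.
df = pd.DataFrame({
    "room_type": [
        "Entire home/apt", "Private room", "Entire home/apt", "Shared room",
        "Private room", "Entire home/apt", "Private room", "Private room",
    ]
})

# Number of listings for each room type.
counts = df["room_type"].value_counts()

# Same count expressed as a percentage of all listings.
percentages = df["room_type"].value_counts(normalize=True) * 100

print(counts)
print(percentages)
```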

I noticed that the majority prefer private apartments, followed by an entire house/apartment. I assumed that, because the minimum rental period is around 30 days, guests prefer comfort. Last but not least: "What is the most expensive location in the City of Toronto?" To answer it, the data was grouped by neighborhood and the average price was calculated for each one.
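The average price per neighborhood can be sketched with a groupby. The neighborhood names and prices below are hypothetical placeholders, not results from the dataset.

```python
import pandas as pd

# Hypothetical neighborhoods and prices, for illustration only.
df = pd.DataFrame({
    "neighborhood": ["Downtown", "Annex", "Downtown", "Riverdale", "Annex", "Riverdale"],
    "price": [200, 80, 150, 90, 100, 110],
})

# Group the listings by neighborhood, average the price,
# and sort so the most expensive neighborhood comes first.
mean_price = (
    df.groupby("neighborhood")["price"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_price)
```

With few listings in a neighborhood, a single expensive property can dominate the average, so it is worth checking the number of listings per neighborhood alongside the mean.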

The result shows that the average prices are very close to one another, with an almost normal distribution across neighborhoods, as we can see in the histogram below:

However, when I plotted the spatial distribution of the listings, it showed a concentration in the central region of the city, where most tourist attractions and universities are located, as we can see below:

The present work was only a superficial exploratory analysis of the Airbnb dataset for the city of Toronto, Canada, in which some outliers were noticed in its variables. There was also an uneven spatial distribution of properties, with listings concentrated in the center of the city, and an almost normal distribution of average prices across neighborhoods.

Finally, it is important to remember that this dataset is a summarized version, which makes it ideal for an initial analysis. As a future objective, I intend to carry out a more thorough analysis of the complete dataset.

