
Combining and cleaning data from multiple file types (.csv, .tsv, and .xlsx) to extract valuable insights about Airbnb listing prices in NYC

WuCandice/Airbnb-NYC

1. Importing the Data


Welcome to New York City (NYC), one of the most-visited cities in the world. As a result, there are many Airbnb listings to meet the high demand for temporary lodging, for stays ranging from a few nights to many months. In this notebook, we will take a look at the NYC Airbnb market by combining data from multiple file types: .csv, .tsv, and .xlsx.



We will be working with three datasets:

  1. "datasets/airbnb_price.csv"

  2. "datasets/airbnb_room_type.xlsx"

  3. "datasets/airbnb_last_review.tsv"
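As a rough sketch of the import step, the three file types could be read with pandas as shown here. The helper name `load_datasets` and the sample column names (`listing_id`, `price`, `nbhood_full`, `last_review`) are assumptions for illustration, and the in-memory rows are made up so the CSV and TSV calls can be tried without the real files.

```python
import io
import pandas as pd

def load_datasets(price_path, room_type_path, review_path):
    """Hypothetical helper: load the three source files into DataFrames."""
    prices = pd.read_csv(price_path)               # comma-separated values
    room_types = pd.read_excel(room_type_path)     # needs an Excel engine such as openpyxl
    reviews = pd.read_csv(review_path, sep="\t")   # tab-separated values
    return prices, room_types, reviews

# Made-up stand-in data; column names are assumptions, not taken from the real files
csv_text = '''listing_id,price,nbhood_full
1,225 dollars,"Manhattan, Midtown"
2,89 dollars,"Brooklyn, Clinton Hill"
'''
tsv_text = "listing_id\tlast_review\n1\tMay 21 2019\n2\tJuly 05 2019\n"

prices_demo = pd.read_csv(io.StringIO(csv_text))
reviews_demo = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(prices_demo.head())
print(reviews_demo.head())
```

With the real files in place, the call would presumably look like `load_datasets("datasets/airbnb_price.csv", "datasets/airbnb_room_type.xlsx", "datasets/airbnb_last_review.tsv")`.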



Our goals are to convert untidy data into formats suitable for analysis and to answer key questions, including:

  • What is the average price, per night, of an Airbnb listing in NYC?
  • How does the average price of an Airbnb listing, per month, compare to the private rental market?
  • How many adverts are for private rooms?
  • How do Airbnb listing prices compare across the five NYC boroughs?

    2. Cleaning the price column

    Now that the DataFrames have been loaded, the first step is to calculate the average price per listing by room_type.

    You may have noticed that the price column in the prices DataFrame currently stores each value as a string, with the number followed by the currency (dollars), e.g.,

    price
    225 dollars
    89 dollars
    200 dollars

    We will need to clean the column in order to calculate the average price.
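One way to do this cleaning, sketched on the three sample values shown above: strip the " dollars" suffix and cast the remainder to integers. This assumes every value follows the same "<number> dollars" pattern.

```python
import pandas as pd

# The three sample values from the text above
prices = pd.DataFrame({"price": ["225 dollars", "89 dollars", "200 dollars"]})

# Remove the literal suffix, then convert the remaining digits to integers
prices["price"] = (
    prices["price"]
    .str.replace(" dollars", "", regex=False)
    .astype(int)
)
print(prices["price"].mean())
```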

    3. Calculating average price

    We can see three quarters of listings cost \$175 per night or less.

    However, there are some outliers including a maximum price of \$7,500 per night!

    Some listings are actually showing as free. Let's remove these from the DataFrame, and calculate the average price.
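A minimal sketch of this step, using made-up prices rather than the real data: summarize the column, drop the rows priced at 0, and take the mean of what remains.

```python
import pandas as pd

# Made-up prices for illustration; two listings are "free"
prices = pd.DataFrame({"price": [0, 89, 200, 225, 0, 150]})

# Summary statistics (quartiles, max) show the spread and any outliers
print(prices["price"].describe())

# Keep only listings with a non-zero price, then average
free_listings = prices["price"] == 0
prices = prices.loc[~free_listings]
avg_price = round(prices["price"].mean(), 2)
print(f"The average price per night is ${avg_price}")
```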

    4. Comparing costs to the private rental market

    Now that we know how much a listing costs, on average, per night, it would be useful to have a benchmark for comparison. According to Zumper, a one-bedroom apartment in New York City costs, on average, \$3,100 per month. Let's convert the per-night prices of our listings into monthly costs, so we can compare them to the private market.
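The conversion could be sketched as follows. Using 365 / 12 as the number of days in an average month is an assumption of this sketch (the text does not say which factor to use), and the prices are illustrative.

```python
import pandas as pd

# Illustrative nightly prices
prices = pd.DataFrame({"price": [89, 200, 225]})

# Assumption: an average month has 365 / 12 days
DAYS_PER_MONTH = 365 / 12
PRIVATE_MARKET_MONTHLY = 3100  # Zumper benchmark quoted in the text

prices["price_per_month"] = prices["price"] * DAYS_PER_MONTH
avg_price_per_month = round(prices["price_per_month"].mean(), 2)
difference = round(avg_price_per_month - PRIVATE_MARKET_MONTHLY, 2)
print(f"Average monthly Airbnb cost: ${avg_price_per_month}")
print(f"Difference vs. private market: ${difference}")
```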

    5. Cleaning the room type column

    Unsurprisingly, using Airbnb appears to be substantially more expensive than the private rental market. We should, however, consider that these Airbnb listings include single private rooms or even rooms to share, as well as entire homes/apartments.

    Let's dive deeper into the room_type column to find out the breakdown of listings by type of room. The room_type column has several variations for private room listings, specifically:

    • "Private room"
    • "private room"
    • "PRIVATE ROOM"

    We can solve this by converting all string characters to lower case (upper case would also work just fine).
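A short sketch of the normalization, on made-up rows that include the three private-room variants listed above:

```python
import pandas as pd

# Made-up rows containing the inconsistent capitalizations described above
room_types = pd.DataFrame({
    "room_type": [
        "Private room", "private room", "PRIVATE ROOM",
        "Entire home/apt", "entire home/apt", "Shared room",
    ]
})

# Lower-case every value so the variants collapse into one label
room_types["room_type"] = room_types["room_type"].str.lower()

# Count listings per room type
room_frequencies = room_types["room_type"].value_counts()
print(room_frequencies)
```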

    6. What timeframe are we working with?

    It seems there is a similarly sized market opportunity for both private rooms (45% of listings) and entire homes/apartments (52%) on the Airbnb platform in NYC.

    Now let's turn our attention to the reviews DataFrame. The last_review column contains the date of the last review in the format "Month Day Year", e.g., May 21 2019. We've been asked to find the earliest and latest review dates in the DataFrame, and to ensure the format allows this analysis to be conducted easily going forward.
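One possible approach, sketched on made-up dates: parse the strings with an explicit format so they become real datetimes, after which the minimum and maximum are straightforward. The format string assumes full English month names, as in the "May 21 2019" example.

```python
import pandas as pd

# Made-up review dates in the "Month Day Year" format described above
reviews = pd.DataFrame({
    "last_review": ["May 21 2019", "July 05 2019", "January 01 2019"]
})

# %B = full month name, %d = day of month, %Y = four-digit year
reviews["last_review"] = pd.to_datetime(reviews["last_review"], format="%B %d %Y")

first_reviewed = reviews["last_review"].min()
last_reviewed = reviews["last_review"].max()
print(f"Earliest review: {first_reviewed.date()}, latest review: {last_reviewed.date()}")
```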

    7. Joining the DataFrames

    Now that we've extracted the information needed, we will merge the three DataFrames to make any future analysis easier to conduct. Once we have joined the data, we will remove any observations with missing values and check for duplicates.
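The join could look roughly like the sketch below. It assumes the three DataFrames share a `listing_id` key column, which the text does not state, and it uses small made-up frames with one missing room type.

```python
import pandas as pd

# Made-up frames; "listing_id" as the shared key is an assumption
prices = pd.DataFrame({"listing_id": [1, 2, 3], "price": [225, 89, 200]})
room_types = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "room_type": ["entire home/apt", "private room", None],
})
reviews = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "last_review": pd.to_datetime(["2019-05-21", "2019-07-05", "2019-06-23"]),
})

# Merge in two steps; an outer join keeps listings missing from either side
rooms_and_prices = prices.merge(room_types, how="outer", on="listing_id")
airbnb_merged = rooms_and_prices.merge(reviews, how="outer", on="listing_id")

# Drop rows with any missing value, then count fully duplicated rows
airbnb_merged = airbnb_merged.dropna()
n_duplicates = airbnb_merged.duplicated().sum()
print(f"{len(airbnb_merged)} rows, {n_duplicates} duplicates")
```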

    8. Analyzing listing prices by NYC borough

    Now that we have combined all the data into a single DataFrame, we will turn our attention to understanding the difference in listing prices between New York City boroughs. The borough currently appears as the first part of a string in the nbhood_full column, e.g.,

    Manhattan, Midtown
    Brooklyn, Clinton Hill
    Manhattan, Murray Hill
    Manhattan, Hell's Kitchen
    Manhattan, Chinatown

    We will therefore need to extract this information from the string and store it in a new column, borough, for analysis.
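A sketch of the extraction using the five nbhood_full values shown above. The prices attached to them are illustrative only, and the DataFrame name `airbnb` is a placeholder.

```python
import pandas as pd

# nbhood_full values from the text; prices are illustrative only
airbnb = pd.DataFrame({
    "nbhood_full": [
        "Manhattan, Midtown",
        "Brooklyn, Clinton Hill",
        "Manhattan, Murray Hill",
        "Manhattan, Hell's Kitchen",
        "Manhattan, Chinatown",
    ],
    "price": [225, 89, 200, 79, 150],
})

# Everything before the first comma is the borough
airbnb["borough"] = airbnb["nbhood_full"].str.partition(",")[0]

# Summarize prices per borough
boro_summary = airbnb.groupby("borough")["price"].agg(["sum", "mean", "median", "count"])
print(boro_summary.round(2).sort_values("mean", ascending=False))
```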

    9. Price range by borough

    The above output gives us a summary of prices for listings across the 5 boroughs. In this final task we would like to categorize listings based on whether they fall into specific price ranges, and view this by borough.

    We can do this using percentiles and labels to create a new column, price_range, in the DataFrame. Once we have created the labels, we can then group the data and count frequencies for listings in each price range by borough.

    We will assign the following categories and price ranges:

    label          price
    Budget         \$0-69
    Average        \$70-175
    Expensive      \$176-350
    Extravagant    > \$350
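The binning could be sketched with `pd.cut`, using bin edges taken from the ranges above. The rows are made up, and the sketch assumes free listings have already been removed, since a price of exactly 0 falls outside the first right-closed bin.

```python
import numpy as np
import pandas as pd

# Made-up listings chosen to sit on the boundaries of each range
airbnb = pd.DataFrame({
    "borough": ["Bronx", "Bronx", "Queens", "Queens",
                "Brooklyn", "Brooklyn", "Manhattan", "Manhattan"],
    "price": [50, 69, 70, 175, 176, 350, 351, 7500],
})

labels = ["Budget", "Average", "Expensive", "Extravagant"]
# Right-closed bins: (0, 69], (69, 175], (175, 350], (350, inf)
bins = [0, 69, 175, 350, np.inf]

airbnb["price_range"] = pd.cut(airbnb["price"], bins=bins, labels=labels)

# Count listings in each price range, per borough
prices_by_borough = airbnb.groupby(["borough", "price_range"], observed=True).size()
print(prices_by_borough)
```

Because the bins are closed on the right, a whole-dollar price of 69 is labeled Budget and 70 is labeled Average, which matches the table.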
