Broward County Rental Analysis & Price-Range Prediction


Are nightly rentals on Airbnb listed at the correct price? What features in the data can improve pricing to offer customers a tool to improve their rental rate? How heavily does text in the description help predict pricing?

Compared with randomly selecting rentals for price improvement, the model used in this analysis shows about a 3x improvement in some price ranges.

The classification analysis reviews Broward County, Florida. All of these questions are answered in the classification prediction.


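# packages inferred from the functions called in this analysis
library(tidyverse); library(tidytext); library(tidymodels)
library(finetune); library(vip); library(ggrepel)

calendar_filtered_dt <- read_rds("01_data_prep/calendar_filtered_dt.rds")
listings_filtered_dt <- read_rds("01_data_prep/listings_filtered_dt.rds")

listings_filtered_dt %>% head(10)

Data Preparation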

The first step is to get all data in the correct format. More features will be added later in the recipe.

listings_prepared_tbl <- listings_filtered_dt %>%
  as_tibble() %>%
  select(price, id, listing_url, 
         name, description, amenities, neighbourhood_cleansed,
         latitude, longitude, accommodates, 
         bathrooms_text, bedrooms, beds, 
         number_of_reviews, review_scores_rating) %>%
  # strip "$" and "," from price, then use ntile for equal proportion binning
  mutate(price = price %>% str_remove_all("\\$|,") %>% as.double(),
         price_rank = ntile(price, 10)) %>%
  relocate(price_rank) %>%
  # find min/max of each price_rank group
  group_by(price_rank) %>%
  mutate(
    min_price_by_rank = min(price, na.rm = TRUE),
    max_price_by_rank = max(price, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  # build the priceRange label ("min-max") automatically
  unite("priceRange", min_price_by_rank:max_price_by_rank, sep = "-", remove = TRUE) %>%
  relocate(priceRange) %>%
  # bathrooms & bedrooms clean up; values that are still not numeric become NA and are dropped
  mutate(bathrooms = gsub("[ baths]", "", bathrooms_text) %>% as.double() %>% round(., 0)) %>%
  drop_na(bathrooms) %>%
  select(-bathrooms_text) %>%
  # assumption: the condition was lost in the source; missing bedrooms are treated as 0
  mutate(bedrooms = ifelse(is.na(bedrooms), 0, bedrooms))

listings_prepared_tbl %>% glimpse()

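
The equal-proportion binning used for priceRange can be seen on a small scale in the sketch below. The prices are invented for illustration and `binned_toy` is a hypothetical name; it only shows how `ntile()` and a per-group min/max produce the range labels.

```r
library(dplyr)

# Invented prices, only to show the mechanics of the binning
prices_toy <- tibble(price = c(45, 60, 80, 120, 150, 200, 310, 500))

binned_toy <- prices_toy %>%
  # ntile() assigns ranks so that each bin holds the same number of rows
  mutate(price_rank = ntile(price, 4)) %>%
  group_by(price_rank) %>%
  # label each bin with the min and max price it contains
  mutate(priceRange = paste(min(price), max(price), sep = "-")) %>%
  ungroup()

binned_toy
```

Because the bins are equal in size rather than equal in width, the expensive ranges span many more dollars than the cheap ones.
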
Broward County is on the east coast of Florida, where Fort Lauderdale is located. It is just north of Miami.

What do we see?

  • Most locations close to the beach are lighter in color (light blue, light green), meaning they are more expensive.
  • Some very expensive (yellow) locations are dotted close to the coast, but also inland along the inlet.
  • Listings average 1.7 bedrooms and accommodate 5.1 people.
  • Most locations receive 4- or 5-star reviews, and very few are given 1 star.

Text Analysis

I want to see how much the text contributes to the classification model. If it matters, it will show up in the variable importance plot.

First, let’s tokenize the description. We need to remove the stopwords like “a,” “the,” “with,” etc.

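airbnb_text <- listings_prepared_tbl %>%
  mutate(priceRange = parse_number(priceRange)) %>%
  unnest_tokens(word, description) %>%
  # assumption: the stopword step was cut off in the source; tidytext's stop_words is used here
  anti_join(stop_words, by = "word")

airbnb_text %>% count(word, sort = TRUE)

Word Frequency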

Let’s find the frequency of the top 100 words in each priceRange.
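
The code for this step is not shown, so here is a minimal sketch of how per-range word proportions could be computed from a tokenized table shaped like airbnb_text (one row per word, with a numeric priceRange). The toy rows and the names `airbnb_text_toy` and `word_frequency_toy` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for airbnb_text: one row per word occurrence (invented values)
airbnb_text_toy <- tibble(
  priceRange = c(50, 50, 50, 50, 200, 200, 200, 200),
  word       = c("airport", "wifi", "airport", "pool", "ocean", "pool", "ocean", "wifi")
)

# Most frequent words overall (the toy data has fewer than 100, so all are kept)
top_words <- airbnb_text_toy %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 100) %>%
  pull(word)

word_frequency_toy <- airbnb_text_toy %>%
  count(word, priceRange) %>%
  # add explicit zero counts for words that never appear in a price range
  complete(word, priceRange, fill = list(n = 0L)) %>%
  group_by(priceRange) %>%
  # proportion of all words used in that price range
  mutate(price_total = sum(n), proportion = n / price_total) %>%
  ungroup() %>%
  filter(word %in% top_words)

word_frequency_toy
```

The result has one row per word and price range, with the `proportion` column that the charts further down plot against price.
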

Text Modeling

Find word frequency in rentals that are increasing with price, and those that are decreasing.

Let’s look at the frequency of words in higher-priced rentals vs. lower-priced
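
One way to estimate whether a word is used more or less as price rises is to fit a binomial GLM per word, with price as the only predictor, and read the slope and its p-value. This is a sketch under assumptions: the fitting code is not shown in this write-up, the counts below are invented, and `word_mods_toy` is a hypothetical name. It produces the `estimate` and `p.value` columns that word_mods is used with here.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Invented counts: "ocean" gets more common with price, "airport" less common
word_frequency_toy <- tibble(
  word        = rep(c("ocean", "airport"), each = 4),
  priceRange  = rep(c(50, 100, 200, 400), times = 2),
  n           = c(5, 10, 20, 40, 40, 20, 10, 5),
  price_total = 1000
)

word_mods_toy <- word_frequency_toy %>%
  # one nested data frame per word
  nest(data = c(priceRange, n, price_total)) %>%
  mutate(
    # successes = uses of the word, failures = all other words in that range
    model = map(data, ~ glm(cbind(n, price_total - n) ~ priceRange,
                            data = .x, family = "binomial")),
    model = map(model, tidy)
  ) %>%
  unnest(model) %>%
  # keep only the slope on price
  filter(term == "priceRange") %>%
  select(-data) %>%
  arrange(-estimate)

word_mods_toy
```

A positive `estimate` means the word becomes more frequent as price increases, and a negative one means it becomes less frequent. With around 100 words tested, a multiple-comparison correction such as `p.adjust()` on the p-values would be a reasonable addition.
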

Words like ocean, views, pool, resort, family, house, and king increase with price.

word_mods %>% arrange(-estimate)

Words like studio, restaurants, wifi, coffee, parking, shopping, mall, minutes, and airport decrease with price.

word_mods %>% arrange(estimate)

Visualize with Volcano plot

The chart below compares words that become more frequent in higher-priced rentals (right of the vertical line) with words that become more frequent in lower-priced rentals (left of the line).

word_mods %>%
  mutate(p.value = log10(p.value)) %>%
  ggplot(aes(estimate, p.value)) +
  geom_vline(xintercept = 0, lty = 2, alpha = 0.7, color = "gray50") +
  geom_point(color = "midnightblue", alpha = 0.8, size = 2.5) +
  geom_text_repel(aes(label = word), max.overlaps = 40)

It’s not much of a surprise that lower-priced rentals feel the need to describe their proximity to amenities like restaurants, shopping, and the mall, as well as included, or free, items like coffee and wifi.

Filter upper/lower p.value

higher_words <- word_mods %>%
  filter(p.value < 0.05) %>%
  slice_max(estimate, n = 12) %>%
  # keep a character vector so it works with %in% and glue_collapse() later
  pull(word)

lower_words <- word_mods %>%
  filter(p.value < 0.05) %>%
  slice_max(-estimate, n = 12) %>%
  pull(word)

The charts below show the top 12 words associated with price increase & decrease.

  • Words associated with higher prices trend upward in frequency as the price increases.
  • For lower-priced rentals, the curve starts high on words like “airport” and “shopping,” and then decreases as the price increases.

# higher words
word_frequency %>%
  filter(word %in% higher_words) %>%
  ggplot(aes(priceRange, proportion, color = word)) +
  geom_line(size = 1.5, alpha = 0.7, show.legend = F) +
  facet_wrap(vars(word), scales = "free_y") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
  labs(x = "rental price", y = "proportion of total words used for rentals in that price")

# lower words
word_frequency %>%
  filter(word %in% lower_words) %>%
  ggplot(aes(priceRange, proportion, color = word)) +
  geom_line(size = 1.5, alpha = 0.7, show.legend = F) +
  facet_wrap(vars(word), scales = "free_y") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
  labs(x = "rental price", y = "proportion of total words used for rentals in that price")

Maybe this text analysis helps the model. Let’s find out.


Train/Test Split

Because the dataset is small, the initial split uses 90% for training and the remaining 10% for testing. With more data (which might also add noise) I would use an 80/20 split, which is what I normally prefer. This is easy to change later.


split <- listings_prepared_tbl %>%
  select(-listing_url, -name, -price, -price_rank) %>%
  mutate(description = str_to_lower(description)) %>%
  initial_split(strata = priceRange, prop = 90/100)

train_tbl <- training(split)
test_tbl  <- testing(split)
metrics <- metric_set(accuracy, roc_auc, mn_log_loss)

# resamples: 5-fold cross-validation using stratification
v_folds <- vfold_cv(train_tbl, v = 5, strata = priceRange)

Recipe + Regex Pattern

ML models can be finicky, and recipes are key to preparing the data. I create a quick regex pattern for those 24 words of the high-/low-priced rentals.

higher_pat <- glue::glue_collapse(higher_words, sep = "|")
lower_pat  <- glue::glue_collapse(lower_words, sep = "|")

recipe_spec <- recipe(priceRange ~ ., data = train_tbl) %>%
  update_role(id, new_role = "id") %>%
  # create a new indicator variable based on pattern using regex
  step_regex(description, pattern = higher_pat, result = "high_price_words") %>%
  step_regex(description, pattern = lower_pat, result = "low_price_words") %>%
  step_rm(description) %>%
  # drop beds if it has zero variance (step_zv does not handle missing values)
  step_zv(beds) %>%
  step_novel(neighbourhood_cleansed) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)



XGBoost Model

Boosted Tree Model Specification (classification)

Custom Grid

I create a custom grid to control which hyperparameter combinations are tried, which allows more complex tuning.

Tune parameters with Racing methods

5 folds (resamples) × 20 candidate parameter sets = 100 possible XGBoost fits.
Racing won’t train all 100, because it drops candidates that perform poorly on the early folds.
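
The model specification, grid, and racing code are not shown in this write-up, so the following is only a sketch of how a 20-candidate grid and an ANOVA race over 5 folds can be wired together with tidymodels and finetune. Everything here is an assumption made for illustration: `iris` stands in for the rental data, the `_toy` object names are hypothetical, and `grid_random()` is just one way to build a grid by hand.

```r
library(tidymodels)
library(finetune)   # tune_race_anova(), plot_race()

set.seed(123)
folds_toy <- vfold_cv(iris, v = 5, strata = Species)

# Boosted tree spec with three tunable parameters (requires the xgboost package)
xgb_spec_toy <- boost_tree(
  trees = 50, tree_depth = tune(), learn_rate = tune(), min_n = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_wflw_toy <- workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(xgb_spec_toy)

# Custom grid: 20 random candidates within hand-picked ranges
xgb_grid_toy <- grid_random(
  tree_depth(),
  learn_rate(range = c(-2, -0.5)),   # log10 scale
  min_n(range = c(2L, 10L)),
  size = 20
)

# Racing: candidates that are clearly worse after the first folds are dropped
xgb_rs_toy <- tune_race_anova(
  xgb_wflw_toy,
  resamples = folds_toy,
  grid      = xgb_grid_toy,
  metrics   = metric_set(mn_log_loss),
  control   = control_race(verbose_elim = TRUE)
)

plot_race(xgb_rs_toy)
select_best(xgb_rs_toy, metric = "mn_log_loss")
```

In the race plot, each line is one candidate. Lines that stop before the last fold were eliminated early, which is where the saving over fitting all 100 models comes from.
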

# Tuning results
There were issues with some computations:

Visualize Race Results


## Finalize on Best
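xgb_last <- xgb_word_wflw %>%
  finalize_workflow(select_best(xgb_word_rs, metric = "mn_log_loss")) %>%
  # assumption: the final step was cut off in the source; last_fit() matches the output below
  last_fit(split)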
# Resampling results
There were issues with some computations:

Variable Importance

Latitude and longitude are the most important features, followed by the number of people a rental accommodates. The text features were used by the model but were less important than the review features.

extract_workflow(xgb_last) %>%
  extract_fit_parsnip() %>%
  vip(geom = "point", num_features = 15)


Confusion Matrix

More price adjustment suggestions are picked up in the very expensive and very cheap price ranges.

ROC Curve

Receiver Operating Characteristic
The model identifies the most expensive and cheapest locations (0-99 and 373+) more easily than mid-priced ones. This suggests the mid-priced rentals are probably priced the best.


Area Under the Curve
An AUC of 1 is a perfect model, and 0.50 is no better than random guessing. Our predictions are okay. Another model, an ensemble, or another round of tuning might do better.
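
To make the AUC scale concrete, the sketch below computes it for a single class (one price range versus the rest) as the share of positive/negative pairs in which the positive case gets the higher score. The labels and scores are invented, and `auc_pairwise` is a hypothetical helper written for this illustration, not part of the analysis.

```r
# AUC as the probability that a random positive outranks a random negative
auc_pairwise <- function(truth, score) {
  pos  <- score[truth == 1]
  neg  <- score[truth == 0]
  wins <- outer(pos, neg, ">")    # positive scored above negative
  ties <- outer(pos, neg, "==")   # ties count as half a win
  mean(wins + 0.5 * ties)
}

# Invented example: 4 positives, 4 negatives
truth_toy <- c(1, 1, 0, 1, 0, 1, 0, 0)
score_toy <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

auc_pairwise(truth_toy, score_toy)        # 13 of 16 pairs ranked correctly
auc_pairwise(truth_toy, rev(score_toy))   # same labels with the ranking reversed
```

A model that ranks every positive above every negative scores 1, and reversing a ranking turns an AUC of a into 1 - a, which is why 0.50 is the point where the ranking carries no information.
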

In the $500+ price group, the model captures 89% of the listings where a price adjustment is recommended within the first 30% of listings it ranks.


The lift quantifies the gain. Using the example above:

89% / 30% ≈ 2.97x improvement

Using the model gives almost a 3x improvement over random selection in finding rentals whose list price could be improved, based on the features in the model.
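
The gain and lift arithmetic can be reproduced on a small scale with the sketch below. The labels and scores are invented, and `gain_and_lift` is a hypothetical helper for illustration; the last line simply repeats the 89% / 30% calculation from above.

```r
# Gain: share of all positives found in the top-scored fraction of cases.
# Lift: gain divided by the fraction examined (1 = no better than random).
gain_and_lift <- function(truth, score, top_frac) {
  ord   <- order(score, decreasing = TRUE)
  n_top <- round(top_frac * length(truth))
  gain  <- sum(truth[ord][seq_len(n_top)]) / sum(truth)
  c(gain = gain, lift = gain / top_frac)
}

# Invented example: 10 listings, 4 of which truly belong to the class
truth_toy <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
score_toy <- (10:1) / 10

gain_and_lift(truth_toy, score_toy, top_frac = 0.3)

# The figures quoted in the text
round(0.89 / 0.30, 2)
```

In the toy data the top 30% of scores contain 3 of the 4 true cases, so the gain is 75% and the lift is 2.5.
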

Precision Recall

The better results are closer to (1, 1), the upper-right corner of the plot.

False Negatives are typically more important. Recall indicates susceptibility to FN’s (lower recall, more susceptible).

… in other words, we want to accurately predict the rentals that should adjust their price (lower FN’s) at the expense of over-predicting rentals that should not (False Positives).

The precision vs. recall curve shows us which models will give up fewer FP’s as we optimize the threshold for FN’s.
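
The trade-off described above can be seen by moving the classification threshold on a toy example. The labels and probabilities below are invented, and `precision_recall` is a hypothetical helper for illustration only.

```r
# Precision and recall for one class at a given probability threshold
precision_recall <- function(truth, prob, threshold) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  c(precision = tp / (tp + fp), recall = tp / (tp + fn), fp = fp, fn = fn)
}

# Invented example: 4 rentals that should adjust their price, 4 that should not
truth_toy <- c(1, 1, 0, 1, 0, 1, 0, 0)
prob_toy  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

precision_recall(truth_toy, prob_toy, threshold = 0.50)
precision_recall(truth_toy, prob_toy, threshold = 0.25)
```

Lowering the threshold from 0.50 to 0.25 removes the one false negative (recall goes from 0.75 to 1) but adds a false positive (precision drops from 0.75 to about 0.67).
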