Final Project - Group B_1_2

Author

Meghan, Maya, and Mary

Background Data

The hotel dataset was created using data from the hotels’ Property Management System in Portugal. It contains detailed information about hotel reservations, including cancellations, booking dates, the time between booking and arrival, special requests, and more. While this dataset offers numerous opportunities for modeling, our focus is on identifying factors associated with cancellations and predicting when guests are likely to cancel their bookings.

Cleaning Steps

We performed several data cleaning steps to improve our models and analysis. For example, we combined the “babies” and “children” columns into a single “children” column and created a “RoomChange” dummy variable to indicate where the requested room differed from the assigned room. We removed variables such as MarketSegment, ArrivalDateYear, ArrivalDateMonth, and ArrivalDateDayOfMonth because their information could be derived from other columns. Similarly, we excluded Country, Agent, and Company due to the large number of specific levels that would not contribute meaningfully to the models. Lastly, we removed columns that contained a lot of missing values.
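The two derived columns described above can be sketched as follows. This is a minimal Python illustration, not the code used for our analysis; the field names `Babies`, `ReservedRoomType`, and `AssignedRoomType` are assumptions about the raw column names, and the sample rows are hypothetical.

```python
def clean_row(row):
    """Return a copy of one reservation with the two derived columns applied."""
    cleaned = dict(row)
    # Fold babies into the children count, then drop the separate column.
    cleaned["Children"] = row["Children"] + row["Babies"]
    del cleaned["Babies"]
    # RoomChange is 1 when the assigned room differs from the requested one.
    cleaned["RoomChange"] = int(row["ReservedRoomType"] != row["AssignedRoomType"])
    return cleaned


# Hypothetical rows for illustration only (not taken from the dataset).
rows = [
    {"Children": 1, "Babies": 1, "ReservedRoomType": "A", "AssignedRoomType": "A"},
    {"Children": 0, "Babies": 0, "ReservedRoomType": "A", "AssignedRoomType": "D"},
]
cleaned_rows = [clean_row(r) for r in rows]
print(cleaned_rows)
```

The first row keeps its room (RoomChange of 0) and ends up with two children, while the second row is flagged as a room change.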

Question: Can we predict whether a reservation will be cancelled based on booking and guest characteristics?

Variable                     Type       Response/Explanatory  Number of Levels
IsCanceled                   numeric    Response              NA
LeadTime                     numeric    Explanatory           NA
ArrivalDateWeekNumber        numeric    Explanatory           NA
StaysInWeekendNights         numeric    Explanatory           NA
StaysInWeekNights            numeric    Explanatory           NA
Adults                       numeric    Explanatory           NA
Children                     numeric    Explanatory           NA
Meal                         character  Explanatory           5
DistributionChannel          character  Explanatory           3
IsRepeatedGuest              numeric    Explanatory           NA
PreviousCancellations        numeric    Explanatory           NA
PreviousBookingsNotCanceled  numeric    Explanatory           NA
BookingChanges               numeric    Explanatory           NA
DepositType                  character  Explanatory           3
DaysInWaitingList            numeric    Explanatory           NA
CustomerType                 character  Explanatory           4
ADR                          numeric    Explanatory           NA
RequiredCarParkingSpaces     numeric    Explanatory           NA
TotalOfSpecialRequests       numeric    Explanatory           NA
ReservationStatus            character  Explanatory           3
RoomChange                   numeric    Explanatory           NA

To answer our question, we use IsCanceled, a dummy variable equal to 0 for bookings that were not cancelled and 1 for cancelled bookings. This is our response, and the remaining variables are candidate explanatory variables.

Variable             Min Length  Max Length  Unique Values
Meal                 2           9           5
DistributionChannel  5           9           3
DepositType          10          10          3
CustomerType         5           15          4
ReservationStatus    7           9           3

Variable                     Mean   SD     Min    Max
IsCanceled                   0.28   0.45   0.00   1
LeadTime                     92.68  97.29  0.00   737
ArrivalDateWeekNumber        27.14  14.01  1.00   53
StaysInWeekendNights         1.19   1.15   0.00   19
StaysInWeekNights            3.13   2.46   0.00   50
Adults                       1.87   0.70   0.00   55
Children                     0.14   0.46   0.00   10
IsRepeatedGuest              0.04   0.21   0.00   1
PreviousCancellations        0.10   1.34   0.00   26
PreviousBookingsNotCanceled  0.15   1.00   0.00   30
BookingChanges               0.29   0.73   0.00   17
DaysInWaitingList            0.53   7.43   0.00   185
ADR                          94.95  61.44  -6.38  508
RequiredCarParkingSpaces     0.14   0.35   0.00   8
TotalOfSpecialRequests       0.62   0.81   0.00   5
RoomChange                   0.19   0.39   0.00   1

This graph shows the number of cancelled versus non-cancelled reservations in our dataset. We include it to show the class imbalance in the data, which informed our later decision to optimize area under the ROC curve rather than accuracy.
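To see why accuracy alone can mislead on imbalanced data, consider a baseline that always predicts "not cancelled". The short Python sketch below uses the 0.28 mean of IsCanceled from the summary table; the variable names are ours, for illustration only.

```python
cancel_rate = 0.28  # mean of IsCanceled from the summary table

# A model that always predicts "not cancelled" is right for every non-cancelled booking...
baseline_accuracy = 1 - cancel_rate
# ...but it never identifies a single cancellation.
baseline_sensitivity = 0.0

print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"baseline sensitivity: {baseline_sensitivity:.2f}")
```

A model has to beat roughly 72% accuracy before accuracy says anything useful, which is why a metric that accounts for both classes is a more informative target here.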

We examine whether lead time (the time between booking and check-in) is related to cancellations, which could help inform our employers whether to limit how far in advance bookings open. The values span a wide range, and the histogram is strongly right-skewed. Most lead times are relatively short, with bookings made close to the arrival date. However, there are a few notable outliers where reservations were made more than a year in advance.

This graph shows how previous cancellations and lead time relate to cancellation status. When the number of previous cancellations is 0, the majority of reservations are not cancelled, whereas for higher values cancellations become much more common. There does not seem to be a strong relationship between cancellation status and lead time in this graph, although all the bookings made more than 500 days in advance were not cancelled.

These boxplots show that lead times are typically higher for cancelled reservations than for not-cancelled reservations. This may reflect the large number of reservations with little to no lead time that we saw in the histogram: if the hotel allows same-day booking, those bookings would almost certainly not be cancelled. The graph is relevant because it suggests that last-minute bookings are cancelled less often than bookings made further in advance.

We can also look at the booking channel. Most cancellations come from travel agents/tour operators. This makes sense because they often book in large quantities and may make provisional bookings for clients who are not yet committed. Furthermore, a booking made through a third party might be easier to cancel than one made directly with the hotel.

Lasso Model

We used a lasso model with a tuned penalty to find the best predictors of booking cancellations. After splitting the data into training and testing sets, we set up a lasso workflow, used 10-fold cross-validation to find the optimal lambda, and refit the workflow with that value. Our original model contained ReservationStatus, which records whether a reservation was cancelled and therefore reveals the response directly, so we excluded it from our recipe. We evaluated candidate penalties with the roc_auc metric.

Because of the class imbalance in our dataset, we chose the penalty at which specificity and sensitivity were closest to equal. If the hotel predicts too many cancellations that don’t actually happen (false positives), it might double-book rooms to compensate, which could leave it overfilled without enough rooms for all guests. On the other hand, if the hotel fails to predict cancellations (false negatives), it could end up with empty rooms and lose money as a result. Choosing the point where the two rates are closest to equal balances these risks.

We also used step_smote in our recipe, which handles class imbalance by generating new examples of the minority class from their nearest neighbors. This helped the model weigh non-cancellations and cancellations equally and resulted in a model accuracy of 73.2%.
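The idea behind SMOTE can be illustrated with a small standalone sketch. The Python below is not the step_smote implementation; it is a simplified version of the interpolation step under our own assumptions (Euclidean distance, two numeric features, and hypothetical points), with `smote_sample` being a name chosen for this example.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point by interpolating toward a nearby minority point."""
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Move a random fraction of the way from the base point to the neighbour.
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Hypothetical minority-class points with two numeric features.
minority_points = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (8.0, 9.0)]
rng = random.Random(0)
synthetic = [smote_sample(minority_points, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Each synthetic point lies on the line segment between an existing minority point and one of its nearest minority neighbours, so the new examples stay within the region the minority class already occupies.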

.metric .estimator .estimate
accuracy binary 0.7315277

The confusion matrix seen here shows the extra emphasis placed on the cancelled class to help combat the unbalanced data. True cancellations are predicted correctly about 81.4% of the time, and true non-cancellations are predicted correctly about 70% of the time.
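These two class-wise rates tie back to the overall accuracy: accuracy is the prevalence-weighted average of the rate for cancellations and the rate for non-cancellations. The Python sketch below checks this, assuming the test split has roughly the same 0.28 cancellation rate as the full dataset; the function name is ours.

```python
def accuracy_from_rates(prevalence, sensitivity, specificity):
    """Overall accuracy as a prevalence-weighted mix of the two class-wise rates."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


# 0.28 is the overall cancellation rate; the test split is assumed to be similar.
lasso_accuracy = accuracy_from_rates(prevalence=0.28, sensitivity=0.814, specificity=0.70)
print(round(lasso_accuracy, 3))
```

Under that assumption the result agrees with the reported accuracy of about 73.2%.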

The five most important variables in this model are seen above. Previous cancellations, transient customer type and non-refundable deposits have a positive relationship with cancellation status, and required car parking spaces and room change have a negative relationship with cancellations.

Tree Model

For this model, we built a decision tree and tuned it using bootstrap resamples, resulting in a final tree with a depth of 15. We used the same recipe as before so that the imbalanced dataset was accounted for. We again optimized roc_auc because of the imbalanced nature of our dataset, as this metric takes into account both false positives and false negatives. The final tree yields an accuracy of about 78.6%, which is noticeably better than our lasso model.
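Bootstrap resampling, as used for tuning here, can be sketched in a few lines. The Python below is a minimal illustration rather than our tuning code; the number of rows and the number of resamples are hypothetical, and `bootstrap_split` is a name chosen for this example.

```python
import random


def bootstrap_split(n_rows, rng):
    """Return (in_bag, out_of_bag) row indices for one bootstrap resample."""
    # Sample n_rows indices with replacement to form the set the model is fit on.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows that were never drawn are held out to assess that fit.
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)
resamples = [bootstrap_split(1000, rng) for _ in range(5)]
for in_bag, out_of_bag in resamples:
    print(len(in_bag), len(out_of_bag))
```

Each candidate tree depth would be fit on the in-bag rows and scored on the out-of-bag rows of every resample, and the depth with the best average score would be kept.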

.metric .estimator .estimate
accuracy binary 0.7855716

This model correctly predicts about 80.5% of true cancellations and about 77.8% of true non-cancellations.

This model shares 3 of its 5 most important variables with the lasso model: room change, required car parking spaces, and customer type. This suggests that these variables have a fairly strong relationship with cancellations. The other 2 important variables in the tree model, which were not identified as important in the lasso model, are lead time and transient-party customer type. Lead time has a positive relationship with cancellations, while transient-party customers have a negative relationship with cancellations.

Forests

Our last model was a random forest, which combines many decision trees to improve accuracy and reduce overfitting.
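How a forest combines its trees can be shown with a toy example. In the Python sketch below, the three "trees" are hand-written rules with invented thresholds, and majority voting is one common way to aggregate their predictions; none of this is taken from our fitted model.

```python
from collections import Counter


# Three hypothetical "trees", written as hand-made rules (1 = cancelled, 0 = not cancelled).
# The thresholds are invented for illustration; a real forest learns them from resampled data.
def tree_a(booking):
    return 1 if booking["PreviousCancellations"] > 0 else 0


def tree_b(booking):
    return 1 if booking["LeadTime"] > 100 else 0


def tree_c(booking):
    return 0 if booking["RequiredCarParkingSpaces"] > 0 else 1


def forest_predict(trees, booking):
    """Return the class predicted by the majority of trees."""
    votes = Counter(tree(booking) for tree in trees)
    return votes.most_common(1)[0][0]


# A hypothetical booking: no previous cancellations, long lead time, no parking request.
booking = {"PreviousCancellations": 0, "LeadTime": 150, "RequiredCarParkingSpaces": 0}
prediction = forest_predict([tree_a, tree_b, tree_c], booking)
print(prediction)
```

Here two of the three rules vote for a cancellation, so the forest predicts a cancellation even though one tree disagrees. Averaging over many such trees is what makes the ensemble less sensitive to the quirks of any single tree.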

.metric .estimator .estimate
accuracy binary 0.847978

Using a random forest, we achieved a higher accuracy than with our other models. The confusion matrix shows a clear reduction in false positives, although this model misses somewhat more true cancellations than the lasso and tree models. It predicts true cancellations correctly about 75.5% of the time and true non-cancellations correctly about 88.4% of the time.

Three of the top five variables seen in the other models (lead time, car parking spaces, and room change) also appear here. This model additionally includes ADR (average daily rate) and arrival date week (the time of year). ADR has a positive relationship with cancellations, and cancellations are more frequent for arrivals in the summer and early fall than at other times of the year.

Conclusion

In conclusion, we found that we can predict with about 85% accuracy whether or not someone will cancel their hotel reservation based on customer characteristics and booking history. Figures summarizing the fit of each model and its variable importance are provided and described above. One limitation of our findings is that all of our data is from Portugal, so these trends may not generalize to hotels elsewhere in the world and our scope is limited. Many of our variables are also highly skewed rather than normally distributed, which can affect our models. We did try to combat this, but it is still important to keep in mind.

We hope that this modeling could be useful to hotels. Although it is currently limited to select hotels in Portugal, further research could help expand the scope. Ideally, these models could be very helpful for hotels around the world in predicting cancellations based on consumer and reservation characteristics.