Kaggle: Zillow Home Value Prediction

This was the second Kaggle competition I participated in, following up on my experience predicting Moscow housing prices in the Sberbank competition.

Instead of predicting the house valuations themselves, Zillow asked participants to predict the error between its estimates and the actual sale prices. The data set consisted of about 90,000 transactions in the Southern California housing market, with 58 features (e.g. square footage, number of floors, build year).
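To make the target concrete, here is a minimal sketch. It assumes the competition's log-error definition, i.e. log(estimate) minus log(sale price); the function name and the dollar figures are made up for illustration.

```python
import math

def log_error(zestimate, sale_price):
    """Log error of an estimate: positive when the estimate is too high."""
    return math.log(zestimate) - math.log(sale_price)

# Hypothetical numbers: a home that sold for $500,000
over = log_error(525_000, 500_000)   # estimate 5% too high
under = log_error(475_000, 500_000)  # estimate 5% too low
exact = log_error(500_000, 500_000)  # perfect estimate

print(over, under, exact)
```

Working on the log scale means a 5% miss counts the same on a cheap condo as on an expensive house, which is convenient when prices span a wide range.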

An exploratory data analysis revealed several important pieces of information:

  • Inspection of the correlation matrix showed that the variables most strongly correlated with the error were basement area, shed area, and finished living area. As many SoCal residents know, most homes in California don’t have basements, which may explain their contribution to the variability. Likewise, the presence and size of a shed typically isn’t a home buyer’s primary concern. Finished living area alone may not be a useful data point: it may be that the finished percentage matters more, with the unfinished remainder turning into a repair cost after purchase.
  • Heteroskedastic features (those where the size of the error varies with the feature value) were of particular interest for this competition. Observations with extreme feature values had larger errors than those with values closer to the norm.
  • Numerous features naively appeared to be ordinal (e.g. filled in with integers 1-30) but weren’t, so their numeric order carried no meaning.
  • There were significant differences between train and test distributions, which may have caused substantial differences between local cross-validation and leaderboard performance.
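Two of the checks above, error spread across feature values and train/test drift, can be sketched roughly as follows. This is not the code I ran for the competition: the data is synthetic, and the helper names `abs_error_by_bin` and `ks_statistic` are hypothetical.

```python
import numpy as np

def abs_error_by_bin(feature, error, n_bins=5):
    """Mean absolute error within equal-count bins of a feature."""
    feature = np.asarray(feature, dtype=float)
    error = np.asarray(error, dtype=float)
    order = np.argsort(feature)
    # Split the sorted row indices into roughly equal-sized groups
    groups = np.array_split(order, n_bins)
    return np.array([np.abs(error[g]).mean() for g in groups])

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)

# Synthetic heteroskedastic example: noise grows with distance from the norm
feature = rng.normal(0.0, 1.0, 5000)           # a standardized feature
error = rng.normal(0.0, 0.02 * (1.0 + np.abs(feature)))
profile = abs_error_by_bin(feature, error)     # lowest bin first, highest last

# Synthetic drift example: one "test" sample is shifted, the other is not
train = rng.normal(0.0, 1.0, 2000)
test_same = rng.normal(0.0, 1.0, 2000)
test_shifted = rng.normal(0.5, 1.0, 2000)
ks_same = ks_statistic(train, test_same)
ks_shifted = ks_statistic(train, test_shifted)

print(profile)
print(ks_same, ks_shifted)
```

A U-shaped profile (larger values in the first and last bins) is the pattern described above, and a feature whose train/test KS statistic stands out is a candidate explanation for a gap between local cross-validation and the leaderboard.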

To fill in missing data, I used the MICE (Multiple Imputation by Chained Equations) algorithm, which is particularly useful for complex datasets where simple imputation approaches are not sufficient.
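The core idea of chained equations can be sketched in a few lines. This is a simplified, deterministic illustration rather than the implementation I used: real MICE draws imputations from a predictive distribution and produces several completed data sets, whereas this version runs a single chain with plain linear regression. The function name `chained_imputation` and the toy data are assumptions.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Single-chain imputation: regress each column on the others, repeatedly."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Start by filling every gap with its column mean
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where column j was observed, predict where it was not
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            filled[rows, j] = A[rows] @ coef
    return filled

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=n)   # strongly tied to x1
x3 = rng.normal(size=n)                         # unrelated column
X_true = np.column_stack([x1, x2, x3])

# Knock out about 20% of the values in the first two columns
X_missing = X_true.copy()
X_missing[rng.random(n) < 0.2, 0] = np.nan
X_missing[rng.random(n) < 0.2, 1] = np.nan
gaps = np.isnan(X_missing)

imputed = chained_imputation(X_missing)
mean_filled = np.where(gaps, np.nanmean(X_missing, axis=0), X_missing)

def rmse(estimate):
    return np.sqrt(np.mean((estimate[gaps] - X_true[gaps]) ** 2))

chained_rmse = rmse(imputed)
mean_rmse = rmse(mean_filled)
print(chained_rmse, mean_rmse)
```

Because each column borrows information from the others, correlated features are recovered much more closely than with mean filling. In practice a maintained implementation, such as the `mice` package in R or scikit-learn's experimental `IterativeImputer`, is a better choice than a hand-rolled loop.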

In the end, I managed to make it into the Top 15%!

The lessons about price and price variability that I learned in this and the Sberbank competition have wide applicability in data science. They can be used any time we have to predict the price of a complex object, such as a vehicle, an office building, real estate more generally, or fine art.