Sberbank’s competition asks participants to predict prices in Moscow’s volatile housing market.
The dataset consists of about 30,000 transactions with almost 300 features (e.g. square footage, number of floors, proximity to schools, workplaces, and factories), as well as macroeconomic information such as currency exchange rates, income per capita, and average rental values.
An exploratory data analysis revealed that the correlation matrix contained several blocks of redundant, non-orthogonal features, which could be merged to reduce dimensionality without substantial loss of information. Likewise, an inspection of the macroeconomic data showed that very few of those features were relevant to price prediction.
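One common way to act on such correlation blocks is to drop one member of every highly correlated pair. A minimal sketch of that idea, using an invented toy frame (the column names are illustrative, not the competition's actual features):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing features; "life_sq" is built to be a
# near-duplicate of "full_sq", mimicking a redundant correlation block.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "full_sq": base,
    "life_sq": base + rng.normal(scale=0.05, size=200),
    "floor": rng.normal(size=200),
})

# Keep only the upper triangle of the absolute correlation matrix, then
# drop any column that correlates above the threshold with an earlier one.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
```

The 0.95 threshold is a placeholder; in practice it would be tuned against validation performance.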
A close examination of the housing price distribution showed that although the majority of prices followed a Gaussian-like distribution, there were prominent spikes at 1, 2, and 3 million rubles. The reasons for this were unclear. A deeper inspection of other features revealed severe data quality issues: numerous missing values, inconsistencies, and cases of either errors or outright fraud in the market.
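Spikes like these are easy to surface with exact-value counts: a genuinely continuous price distribution rarely repeats a single value hundreds of times. A sketch on synthetic data (the prices below are made up to mimic the anomaly, not drawn from the competition data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# A log-normal bulk of prices, rounded to the nearest 10,000 rubles,
# plus artificial spikes at round values mimicking the 1M/2M/3M anomaly.
bulk = rng.lognormal(mean=15.7, sigma=0.5, size=9000).round(-4)
spikes = np.repeat([1_000_000, 2_000_000, 3_000_000], [400, 300, 200])
prices = pd.Series(np.concatenate([bulk, spikes]))

# The most frequent exact values jump out immediately.
top = prices.value_counts().head(3)
```

In the real data, a follow-up step would be to inspect the other features of those spike transactions to decide whether they are legitimate sales.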
My initial approach involved various models (such as regularized linear regression, random forests, and gradient-boosted decision trees) with preprocessing in the form of data imputation, dimensionality reduction, standardization, and scaling.
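A skeleton of that model comparison, sketched with scikit-learn on stand-in data (the real work used the preprocessed Sberbank features; the hyperparameters here are defaults, not tuned choices):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the housing features.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gbdt": GradientBoostingRegressor(random_state=0),
}

# Cross-validated R^2 for each candidate model.
scores = {name: cross_val_score(m, X, y, cv=3, scoring="r2").mean()
          for name, m in models.items()}
```

Scaling is bundled into the linear model's pipeline because tree-based models are insensitive to monotone feature scaling.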
I removed extreme outliers, dropped features with mostly missing values, and filled in the remaining missing values with the corresponding median or modal values. Text-based features were converted to ordinal features or non-ordinal dummy variables. Variables with little correlation to the target were removed, and highly skewed variables were log-transformed.
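The imputation, dummy-encoding, and log-transform steps can be sketched in pandas. The frame below is invented for illustration; the column names are not the competition's actual feature names:

```python
import numpy as np
import pandas as pd

# Illustrative frame with a numeric gap, a categorical gap, and a skewed target.
df = pd.DataFrame({
    "full_sq": [38.0, np.nan, 54.0, 120.0],
    "material": ["brick", "panel", np.nan, "brick"],
    "price": [4.1e6, 5.0e6, 6.2e6, 1.5e7],
})

# Numeric gaps -> median; categorical gaps -> mode.
df["full_sq"] = df["full_sq"].fillna(df["full_sq"].median())
df["material"] = df["material"].fillna(df["material"].mode()[0])

# Text-based feature -> non-ordinal dummy variables.
df = pd.get_dummies(df, columns=["material"])

# Skewed variable -> log transform (log1p stays defined at zero).
df["log_price"] = np.log1p(df["price"])
```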
Examination of predicted-vs.-measured plots and residual distribution plots showed that the 1M, 2M, and 3M spikes were causing serious problems. To alleviate the issue I tried several solutions. In one approach, I dropped the problematic prices and replaced their values with those determined by the MICE algorithm. In another, I separated the problematic data points into their own data set, trained an initial model on the clean data, predicted target values for the “dirty” data set, and then retrained a model on the recombined data.
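The second approach, train on clean data, relabel the suspicious points, retrain on everything, can be sketched as follows. The data here are synthetic, with corrupted labels standing in for the ruble spikes:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=500)

# Flag ~10% of rows as suspicious and corrupt their labels,
# mimicking the anomalous 1M/2M/3M prices.
dirty = rng.random(500) < 0.1
y_dirty = y.copy()
y_dirty[dirty] = 0.0

# 1. Train an initial model on the clean subset only.
model = GradientBoostingRegressor(random_state=0)
model.fit(X[~dirty], y_dirty[~dirty])

# 2. Replace the suspicious labels with the clean model's predictions.
y_relabelled = y_dirty.copy()
y_relabelled[dirty] = model.predict(X[dirty])

# 3. Retrain on the full, relabelled data set.
final = GradientBoostingRegressor(random_state=0).fit(X, y_relabelled)
```

On this toy setup the relabelled targets land much closer to the true values than the corrupted ones, which is the whole point of the detour.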
This approach, in combination with an ensemble of several other models, was enough to place me 302nd out of 3274: a bronze medal!