In this competition Jigsaw (formerly Google Ideas) asks us to detect the presence of toxicity in Wikipedia comments.
Exploratory Data Analysis
The key takeaways from the exploratory data analysis were:
- The data set consists of about 160,000 comments with six binary labels (toxic, severe toxic, obscene, threat, insult, identity hate).
- The data is highly imbalanced: 22% of comments carry at least one label, 10% are toxic, and 10% have multiple tags.
- Although the train and test sets are of similar size, their distributions differ substantially.
- There are no missing values (thankfully).
- There are minor data quality issues and inconsistencies, which are likely due to the subjectivity of the data labeling process.
My solution consisted of blending together a multitude of models with varying pre-processing steps, architectures, and hyper-parameters. The primary models were:
- Baseline logistic regression, with word-level and character-level n-gram TFIDFs and engineered features. Models were independently trained on each class.
- Recurrent neural nets with various embeddings (FastText, Word2Vec, GloVe) and architectural settings (sentence length, number of recurrent units, dense layer size, dropout rate). The architecture typically consisted of two bi-directional recurrent layers (either GRU or LSTM), followed by a dense layer with a sigmoid output for each class. Since all six classes were predicted simultaneously under a binary cross-entropy loss, performance gains came not only from increased model complexity but also from capturing inter-label interactions.
- Factorization machines optimized with a follow-the-regularized-leader algorithm.
- Character-level convolutional neural nets. Although this model did not achieve ROC-AUC scores as high as the other approaches, its predictions were remarkably uncorrelated with the others', which made it great for blending.
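The logistic-regression baseline above can be sketched roughly as follows. This is a minimal illustration with scikit-learn: the word/char TF-IDF union and the per-class loop mirror the description, but the exact n-gram ranges, feature counts, and regularization strength are assumptions, and the engineered features are omitted.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def fit_baseline(train_texts, train_labels, test_texts):
    """Word- and char-level TF-IDF features, one logistic regression per class."""
    word_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
    char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=50000)
    # Concatenate the two sparse feature blocks.
    X_train = hstack([word_vec.fit_transform(train_texts),
                      char_vec.fit_transform(train_texts)]).tocsr()
    X_test = hstack([word_vec.transform(test_texts),
                     char_vec.transform(test_texts)]).tocsr()
    preds = {}
    for label in LABELS:
        # Each class gets its own independently trained classifier.
        clf = LogisticRegression(C=4.0, solver="liblinear")
        clf.fit(X_train, train_labels[label])
        preds[label] = clf.predict_proba(X_test)[:, 1]
    return preds
```

Per-class models keep the baseline simple; the inter-label interactions mentioned above are only exploited by the neural nets.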
These four model families achieved high performance (ROC-AUC > 0.98) with low inter-model correlations. Prior to the inter-model blending step, I took a simple average of intra-model predictions across different hyper-parameter settings, which substantially improved local CV scores.
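Whether a set of models is diverse enough to blend can be checked by correlating their prediction vectors; a small sketch (the model names in the usage are illustrative, not my actual runs):

```python
import numpy as np

def prediction_correlations(preds_by_model):
    """Pairwise Pearson correlations between each model's prediction vector.

    preds_by_model: dict mapping model name -> 1-D array of predicted
    probabilities on the same examples, in the same order.
    Returns the sorted model names and the matching correlation matrix.
    """
    names = sorted(preds_by_model)
    mat = np.corrcoef(np.vstack([preds_by_model[n] for n in names]))
    return names, mat
```

Low off-diagonal values suggest a model adds information to the blend even when its standalone score is weaker, which is exactly what made the character-level CNN valuable.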
An Aside on Evaluation Metrics
Halfway through the competition, competitors discovered substantial differences between the train and test distributions, which could be exploited by multiplying the final output by a “magic number”. To avoid rewarding this unprincipled “hacking”, Kaggle and Jigsaw switched the evaluation metric from log-loss to ROC-AUC, which depends only on the ranking of predictions and is therefore unaffected by such rescaling.
An alternative metric could have been the F1 score (the harmonic mean of precision and recall). I found that while the ROC-AUC and precision were consistently high, recall was terrible. In other words, although there were few false positives, there were many false negatives.
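That gap between ranking quality and recall is easy to reproduce with scikit-learn's metrics. The toy numbers below are purely illustrative, not competition results: the scores rank every positive above every negative (perfect AUC), yet only one crosses the 0.5 threshold, so recall and F1 collapse.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

# Scores that order positives perfectly but rarely exceed the 0.5 threshold.
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.45, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1, 0.05, 0.0])
y_pred  = (y_score >= 0.5).astype(int)

auc  = roc_auc_score(y_true, y_score)   # ranking is perfect -> 1.0
prec = precision_score(y_true, y_pred)  # the single positive call is correct -> 1.0
rec  = recall_score(y_true, y_pred)     # but 3 of 4 positives are missed -> 0.25
f1   = f1_score(y_true, y_pred)         # the harmonic mean punishes the low recall
```

ROC-AUC never looks at a threshold, which is why it stayed high while F1 would have exposed the missed positives.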
Blending vs. Stacking
For me, a simple arithmetic-mean blend turned out to be superior to stacking, possibly because the stacker over-fit the relatively small hold-out set. Further gains were achieved through weighted-average blending.
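For two models, weighted-average blending reduces to searching a single weight on the hold-out set. A minimal sketch, assuming ROC-AUC as the selection criterion and a plain grid search (my actual weights were partly hand-picked):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def best_two_model_blend(p1, p2, y_holdout, steps=101):
    """Grid-search the weight of a two-model weighted average on hold-out AUC.

    p1, p2: predicted probabilities from two models on the hold-out set.
    Returns the best weight w (applied to p1) and the resulting AUC.
    """
    best_w, best_auc = 0.0, -np.inf
    for w in np.linspace(0.0, 1.0, steps):
        auc = roc_auc_score(y_holdout, w * p1 + (1.0 - w) * p2)
        if auc > best_auc:
            best_w, best_auc = w, auc
    return best_w, best_auc
```

Because w = 0 and w = 1 are on the grid, the blend can never score below the better single model on the hold-out set; the over-fitting risk lies in trusting that hold-out too much, which is exactly why stacking lost out here.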
The Finish Line
In the end, I submitted two entries: the highest locally performing model and, as insurance, a hand-picked weighted-average blend guided by the public leaderboard. The highest scoring entry turned out to be the hand-picked blend, somewhat to my disappointment. Although the more principled, entirely local-CV-based model did well enough to earn a bronze medal, the hand-picked entry performed substantially better. I suppose the lesson is that although in principle you should always trust local CV, when there are substantial differences between train and test distributions the Kaggle public LB can be a useful signal.
There were several things I wanted to do but ran out of time for. For some of the recurrent neural nets I augmented the data through a novel method proposed by another Kaggle user: translating comments into another language and back into English. This strategy ended up being key to the competition’s 1st place solution. Presumably it acted as a strong regularizer, preventing over-fitting to irrelevant grammatical errors.
Feature engineering could have been performed more carefully, with more time spent assessing the impact of individual features on the evaluation metric.
A more thorough investigation of the VDCNN’s behavior may have been very useful, considering its incredibly low correlation (around 70%) with the other models’ predictions.
Bayesian optimization, as opposed to random search, may have sped things up further.
Developing a deeper understanding of the FTRL model would’ve been useful as well.
And then there were the things other users did, which seem obvious in hindsight: concatenating embedding vectors, pseudo-labeling, blending TFIDF-LR with varying word and char n-gram levels, and including more interaction features.