Thursday, October 23, 2014

Takeaways from Kaggle Criteo winners

The Criteo competition is about predicting whether a display ad gets clicked or not. It is a very useful scenario. Some winners of this competition shared their solutions in the forum. Below is a quick review and the takeaways from those solutions.


  • 3 Idiots' Solution (1st)

    • Code, docs and discussion thread can be found online.
    • Basically it is a single FM (factorization machine) model's result. (!!!)
    • "Empirically we observe using categorical features is always better than using numerical features." (!!!)
    • "instance-wise data normalization makes the optimization problem easier to be solved." (!!!)
    • Field-aware FM (FFM): each feature keeps a separate latent vector per field. A toy scoring function is sketched after this list.
    • Use GBDT results as new features; the common form of this trick, sketched after this list, feeds each tree's leaf index to the next model as a categorical feature.
    • A per-coordinate learning rate schedule looks very useful for SGD. Related paper. I think this method is also used in Vowpal Wabbit.
G = G + g*g                # accumulate squared gradient per coordinate
w = w - eta*(1/sqrt(G))*g  # shrink each coordinate's step size by 1/sqrt(G)
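
Expanding the two lines above into a runnable toy: a minimal sketch of per-coordinate AdaGrad SGD for logistic loss on one-hot features. The function name and toy data are my own, not from the winning code.

import numpy as np

def adagrad_logreg(rows, n_features, eta=0.1, eps=1e-8):
    # rows: iterable of (active feature indices, label in {0, 1}); one-hot features
    w = np.zeros(n_features)
    G = np.zeros(n_features)                       # per-coordinate sum of squared gradients
    for idx, y in rows:
        p = 1.0 / (1.0 + np.exp(-w[idx].sum()))    # predicted click probability
        g = p - y                                  # gradient wrt each active weight (x_j = 1)
        G[idx] += g * g                            # G = G + g*g
        w[idx] -= eta * g / np.sqrt(G[idx] + eps)  # w = w - eta*(1/sqrt(G))*g
    return w

w = adagrad_logreg([([0, 3], 1), ([1, 3], 0), ([0, 2], 1)], n_features=4)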
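
On the instance-wise normalization quote: the bullet does not spell out the exact scheme, but one common form (an assumption here) is scaling every instance's feature vector to unit L2 norm so each example contributes a comparably sized gradient.

import numpy as np

def normalize_instance(x):
    # scale one instance's feature vector to unit L2 norm
    return x / max(np.linalg.norm(x), 1e-12)  # guard against all-zero instances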
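
For the model itself, a toy sketch of a field-aware FM score under the usual FFM formulation (feature i interacts with feature j through the latent vector it keeps for j's field, and vice versa). This is my own minimal version, not the winners' implementation; the score would be passed through a sigmoid to get a click probability.

import numpy as np

def ffm_score(x, fields, V):
    # x: active feature indices (one-hot values = 1)
    # fields: fields[a] is the field id of feature x[a]
    # V: (n_features, n_fields, k) array of latent vectors
    s = 0.0
    for a in range(len(x)):
        for b in range(a + 1, len(x)):
            i, j = x[a], x[b]
            s += V[i, fields[b]] @ V[j, fields[a]]  # field-aware pairwise interaction
    return s

V = np.random.rand(4, 2, 3)  # toy: 4 features, 2 fields, k = 3
print(ffm_score([0, 3], fields=[0, 1], V=V))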
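
And the GBDT-features trick in its common form: train a boosted-tree model, then use each tree's leaf index as a new categorical feature for the downstream model. The scikit-learn calls and random toy data below are my assumption, not the winners' toolchain.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X = np.random.rand(1000, 10)       # hypothetical dense numeric features
y = np.random.randint(0, 2, 1000)  # hypothetical click labels

gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=7).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]    # (n_samples, n_trees) leaf indices
leaf_feats = OneHotEncoder().fit_transform(leaves)  # sparse one-hot features for the FM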

  • Guocong Song's Solution (3rd)

    • Code, docs and discussion thread can be found online.
    • Linear combination of 4 models. All 4 models are trained with VW. (!!!)
    • Grouping features before generating quadratic/polynomial features.
    • Two tricks for using VW.
      • -C [ --constant ]  Set initial value of constant (Useful for faster convergence on data-sets where the label isn't centered around zero)
      • --feature_mask lets you specify, from a model file, the set of parameters that are allowed to update. This is useful in combination with --l1: use --l1 to discover which features should have a nonzero weight and save the model with -f, then use --feature_mask with that model and without --l1 to learn a better regressor (sketched below).
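
A minimal sketch of that two-pass --l1 / --feature_mask trick, driving VW from Python; the file names are hypothetical, the VW flags are real.

import subprocess

# Pass 1: L1 regularization discovers a sparse set of nonzero weights.
subprocess.run(["vw", "-d", "train.vw", "--loss_function", "logistic",
                "--l1", "1e-6", "-f", "l1.model"], check=True)
# Pass 2: drop --l1, but only let the surviving features update.
subprocess.run(["vw", "-d", "train.vw", "--loss_function", "logistic",
                "--feature_mask", "l1.model", "-f", "final.model"], check=True)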

  • Julian de Wit's Solution (4th)

    • Code, docs and discussion thread can be found online.
    • Ensemble of deep neural networks (!!!)
    • Training data contained roughly 128K-200K features. (!!!)
    • Rare and unseen test-set categorical values were all encoded as one category (almost all winners mentioned this). This is almost the only feature engineering of this winner; a toy version is sketched after this list.
    • Numeric features need to be standardized. A log transform is applied to long-tail features (count-based features); see the second sketch after this list.
    • 2 hidden layers with 128 and 256 units respectively (a rough stand-in, including the averaging ensemble from the next bullet, is sketched after this list).
    • "Other challenges also reported simply averaged ensembles with neural networks are quite good already." (link)
Hopefully I will get more thoughts after reproducing those results and reading the code.
