Thursday, October 23, 2014

Takeaways from Kaggle Criteo winners

The Criteo competition is about predicting whether a display ad gets clicked or not. It is a very useful scenario. Some winners of this competition shared their solutions in the forum. Below is a quick review and the takeaways from those solutions.


  • 3 Idiots' Solution (1st)

    • Code, docs and discussion thread can be found online.
    • Basically it is a single FM (factorization machine) model's result. (!!!)
    • "Empirically we observe using categorical features is always better than using numerical features." (!!!)
    • "instance-wise data normalization makes the optimization problem easier to be solved." (!!!)
    • Field-aware FM (FFM): each feature keeps a separate latent vector per field. A toy scoring function is sketched after this list.
    • Use GBDT results as new features; the common form of this trick, sketched after this list, feeds each tree's leaf index to the next model as a categorical feature.
    • A per-coordinate learning rate schedule looks very useful for SGD. Related paper. I think this method is also used in Vowpal Wabbit.
G = G + g*g                # accumulate squared gradient per coordinate
w = w - eta*(1/sqrt(G))*g  # shrink each coordinate's step size by 1/sqrt(G)
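
Expanding the two lines above into a runnable toy: a minimal sketch of per-coordinate AdaGrad SGD for logistic loss on one-hot features. The function name and toy data are my own, not from the winning code.

import numpy as np

def adagrad_logreg(rows, n_features, eta=0.1, eps=1e-8):
    # rows: iterable of (active feature indices, label in {0, 1}); one-hot features
    w = np.zeros(n_features)
    G = np.zeros(n_features)                       # per-coordinate sum of squared gradients
    for idx, y in rows:
        p = 1.0 / (1.0 + np.exp(-w[idx].sum()))    # predicted click probability
        g = p - y                                  # gradient wrt each active weight (x_j = 1)
        G[idx] += g * g                            # G = G + g*g
        w[idx] -= eta * g / np.sqrt(G[idx] + eps)  # w = w - eta*(1/sqrt(G))*g
    return w

w = adagrad_logreg([([0, 3], 1), ([1, 3], 0), ([0, 2], 1)], n_features=4)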
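
On the instance-wise normalization quote: the bullet does not spell out the exact scheme, but one common form (an assumption here) is scaling every instance's feature vector to unit L2 norm so each example contributes a comparably sized gradient.

import numpy as np

def normalize_instance(x):
    # scale one instance's feature vector to unit L2 norm
    return x / max(np.linalg.norm(x), 1e-12)  # guard against all-zero instances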
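
For the model itself, a toy sketch of a field-aware FM score under the usual FFM formulation (feature i interacts with feature j through the latent vector it keeps for j's field, and vice versa). This is my own minimal version, not the winners' implementation; the score would be passed through a sigmoid to get a click probability.

import numpy as np

def ffm_score(x, fields, V):
    # x: active feature indices (one-hot values = 1)
    # fields: fields[a] is the field id of feature x[a]
    # V: (n_features, n_fields, k) array of latent vectors
    s = 0.0
    for a in range(len(x)):
        for b in range(a + 1, len(x)):
            i, j = x[a], x[b]
            s += V[i, fields[b]] @ V[j, fields[a]]  # field-aware pairwise interaction
    return s

V = np.random.rand(4, 2, 3)  # toy: 4 features, 2 fields, k = 3
print(ffm_score([0, 3], fields=[0, 1], V=V))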
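
And the GBDT-features trick in its common form: train a boosted-tree model, then use each tree's leaf index as a new categorical feature for the downstream model. The scikit-learn calls and random toy data below are my assumption, not the winners' toolchain.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X = np.random.rand(1000, 10)       # hypothetical dense numeric features
y = np.random.randint(0, 2, 1000)  # hypothetical click labels

gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=7).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]    # (n_samples, n_trees) leaf indices
leaf_feats = OneHotEncoder().fit_transform(leaves)  # sparse one-hot features for the FM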

  • Guocong Song's Solution (3rd)

    • Code, docs and discussion thread can be found online.
    • Linear combination of 4 models. All 4 models are trained with VW. (!!!)
    • Grouping features before generating quadratic/polynomial features.
    • Two tricks for using VW.
      • -C [ --constant ]  Set initial value of constant (Useful for faster convergence on data-sets where the label isn't centered around zero)
      • --feature_mask lets you specify, from a model file, the set of parameters that are allowed to update. This is useful in combination with --l1: use --l1 to discover which features should have a nonzero weight and save the model with -f, then use --feature_mask with that model and without --l1 to learn a better regressor (sketched below).
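
A minimal sketch of that two-pass --l1 / --feature_mask trick, driving VW from Python; the file names are hypothetical, the VW flags are real.

import subprocess

# Pass 1: L1 regularization discovers a sparse set of nonzero weights.
subprocess.run(["vw", "-d", "train.vw", "--loss_function", "logistic",
                "--l1", "1e-6", "-f", "l1.model"], check=True)
# Pass 2: drop --l1, but only let the surviving features update.
subprocess.run(["vw", "-d", "train.vw", "--loss_function", "logistic",
                "--feature_mask", "l1.model", "-f", "final.model"], check=True)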

  • Julian de Wit's Solution (4th)

    • Code, docs and discussion thread can be found online.
    • Ensemble of deep neural networks (!!!)
    • Training data contained roughly 128K-200K features. (!!!)
    • Rare and unseen test-set categorical values were all encoded as one category (almost all winners mentioned this). This is almost the only feature engineering of this winner; a toy version is sketched after this list.
    • Numeric features need to be standardized. A log transform is applied to long-tail features (count-based features); see the second sketch after this list.
    • 2 hidden layers with 128 and 256 units respectively (a rough stand-in, including the averaging ensemble from the next bullet, is sketched after this list).
    • "Other challenges also reported simply averaged ensembles with neural networks are quite good already." (link)
Hopefully I will get more thoughts after reproducing those results and reading the code.
