Learning Curve: Takeaways from the Kaggle Tradeshift winner

Tuesday, March 17, 2015

Takeaways from the Kaggle Tradeshift winner

The Tradeshift competition asks for the probability that a piece of text belongs to each of 33 classes. The winning solutions are described in the forum thread, and the code is also available in git.

The best solution is a weighted average of 14 two-stage models, 13 online models and 2 simple one-stage models. (Blending!!!)
Predictions for 32 labels are used as features for the 2nd half of the data. (The labels have strong inter-dependence!)
XGBoost is chosen as the only meta-stage classifier. (XGBoost wins again!)
Label analysis is needed, not only feature analysis.
Feature selection is done for the online models.
The solution relies heavily on CV and grid search to fine-tune hyper-parameters.
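
To make the two-stage idea and the final blend concrete, here is a minimal sketch on synthetic data. It is not the winners' code. The data, the 3 labels standing in for 33, the choice to keep the original features next to the stage-1 predictions, the blend weights, and the use of scikit-learn's GradientBoostingClassifier in place of XGBoost are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic stand-in data: 3 labels instead of 33, 8 numeric features.
n = 900
X = rng.normal(size=(n, 8))
Y = np.zeros((n, 3), dtype=int)
Y[:, 0] = (X[:, 0] + 0.5 * rng.normal(size=n)) > 0
Y[:, 1] = (X[:, 1] + 0.5 * rng.normal(size=n)) > 0
Y[:, 2] = Y[:, 0] & Y[:, 1]  # the target label depends on the other two

# Three disjoint parts: first half, second half, and a held-out set.
X_a, Y_a = X[:300], Y[:300]
X_b, Y_b = X[300:600], Y[300:600]
X_test = X[600:]


def stage_one_predictions(models, features):
    """Return one column of positive-class probabilities per stage-1 model."""
    return np.column_stack([m.predict_proba(features)[:, 1] for m in models])


# Stage 1: one model per non-target label, trained on the first half only.
stage_one = [LogisticRegression().fit(X_a, Y_a[:, j]) for j in (0, 1)]

# Stage 2: original features plus stage-1 label predictions (an assumption),
# trained on the second half. GradientBoostingClassifier stands in for XGBoost.
meta_b = np.hstack([X_b, stage_one_predictions(stage_one, X_b)])
meta_test = np.hstack([X_test, stage_one_predictions(stage_one, X_test)])
stage_two = GradientBoostingClassifier(n_estimators=50, random_state=0)
stage_two.fit(meta_b, Y_b[:, 2])
two_stage_pred = stage_two.predict_proba(meta_test)[:, 1]

# A simple one-stage model for the same label, trained on both halves.
one_stage = LogisticRegression().fit(X[:600], Y[:600, 2])
one_stage_pred = one_stage.predict_proba(X_test)[:, 1]

# Blend: weighted average with hypothetical weights (the post gives none).
weights = np.array([0.7, 0.3])
blended = weights[0] * two_stage_pred + weights[1] * one_stage_pred
```

Training stage 1 on one half and stage 2 on the other keeps the meta-stage from seeing predictions made on rows the stage-1 models were fitted on. In practice the blend weights would presumably be tuned with CV rather than fixed by hand.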

Some other solutions shared in the forum:

An additional 100-300 decision-tree features, based on the winning solution of the Criteo competition.
Post-processing on some labels. (o_O)
Even a 3-stage solution!
"sklearn RandomForestClassifier active paths or ended nodes" should be useful for generating tree-based features.
Semi-supervised learning in blending.
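
The "ended nodes" idea for tree-based features can be sketched as follows. This is an illustration under assumptions, not code from any of the forum solutions: the data is synthetic, the forest size and depth are arbitrary, and the downstream logistic regression is just one possible consumer of the features. It uses RandomForestClassifier.apply, which returns the index of the leaf each sample ends in for every tree, and one-hot encodes those indices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)

# Synthetic data with an interaction that a linear model cannot capture directly.
X = rng.normal(size=(400, 6))
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)

# Fit the forest and the downstream model on different rows.
X_forest, y_forest = X[:200], y[:200]
X_linear, y_linear = X[200:], y[200:]

forest = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
forest.fit(X_forest, y_forest)

# "Ended nodes": the leaf index each sample lands in, one column per tree.
leaves = forest.apply(X_linear)  # shape (n_samples, n_estimators)

# One-hot encode the leaf indices so each (tree, leaf) pair becomes a feature.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)

# The sparse leaf features can then feed any downstream model.
linear = LogisticRegression(max_iter=1000)
linear.fit(leaf_features, y_linear)
leaf_pred = linear.predict_proba(leaf_features)[:, 1]
```

Each row of leaf_features has exactly one active column per tree, so the downstream model effectively learns a weight for every leaf. For the "active paths" variant, decision_path on the same forest returns an indicator of all nodes a sample passes through rather than only the final leaf.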
