Thursday, October 23, 2014

Takeaways from Kaggle Criteo winners

The Criteo competition is about predicting click or no-click for display ads, which is a very practical scenario. Several winners of this competition shared their solutions in the forum. Here is a quick review of, and some takeaways from, those solutions.


  • 3 Idiots' Solution (1st)

    • Code, docs and discussion thread can be found online.
    • Basically it is a single FM model result. (!!!)
    • "Empirically we observe using categorical features is always better than using numerical features." (!!!)
    • "instance-wise data normalization makes the optimization problem easier to be solved." (!!!)
    • Field-aware FM (FFM).
    • Use GBDT outputs as new features (see the sketch after this list).
    • Per-coordinate learning rate schedule looks very useful for SGD. Related paper. I think this method is also used in Vowpal Wabbit.
# AdaGrad-style per-coordinate update: accumulate the squared gradient for
# each weight, then scale that weight's step size by 1/sqrt(accumulated sum).
G = G + g * g
w = w - eta * (1.0 / sqrt(G)) * g
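
The "GBDT outputs as new features" trick a few bullets up is commonly implemented by running every sample through the trained trees and one-hot encoding the indices of the leaves it lands in; the resulting sparse features are then fed to the downstream model (the FM here). A minimal sketch with scikit-learn; the data and the tree count/depth are made up for illustration, not the winners' actual settings.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X = np.random.rand(1000, 13)               # stand-in for the numeric features
y = np.random.randint(0, 2, 1000)          # stand-in for the click labels

# Fit a small GBDT; hypothetical settings, chosen only for the example.
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=7)
gbdt.fit(X, y)

# For each sample, record the index of the leaf it reaches in every tree,
# then one-hot encode those indices to obtain new sparse categorical features.
leaf_idx = gbdt.apply(X)[:, :, 0]          # shape: (n_samples, n_trees)
leaf_features = OneHotEncoder().fit_transform(leaf_idx)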

  • Guocong Song's Solution (3rd)

    • Code, docs and discussion thread can be found online.
    • A linear combination of 4 models; all 4 models are trained with vw. (!!!)
    • Group features before generating quadratic/polynomial features.
    • Two tricks for using vw:
      • -C [ --constant ]  Set initial value of constant (Useful for faster convergence on data-sets where the label isn't centered around zero)
      • --feature_mask allows specifying, from a model file, the set of parameters that can be updated. This is useful in combination with --l1: use --l1 to discover which features should have a nonzero weight and save the model with -f, then use --feature_mask with that model (without --l1) to learn a better regressor. (A sketch follows this list.)
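
A minimal sketch of that two-pass --l1 / --feature_mask workflow, driven from Python; the file names (train.vw, sparse.model, final.model) and the L1 value are made up for illustration.

import subprocess

# Pass 1: L1 regularization to discover which features deserve nonzero weights.
subprocess.run(["vw", "-d", "train.vw", "--l1", "1e-6", "-f", "sparse.model"],
               check=True)

# Pass 2: drop --l1 and only allow updates to the features kept by pass 1.
subprocess.run(["vw", "-d", "train.vw", "--feature_mask", "sparse.model",
                "-f", "final.model"], check=True)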

  • Julian de Wit's Solution (4th)

    • Code, docs and discussion thread can be found online.
    • Ensemble of deep neural networks (!!!)
    • Training data contained roughly 128K-200K features. (!!!)
    • Rare and unseen test-set categorical values were all encoded as one category (almost every winner mentioned this). This is almost the only feature engineering in this solution (a sketch follows this list).
    • Numeric features need to be standardized; a log transform is applied to long-tailed features (count-based features).
    • 2 hidden layers with 128 and 256 units respectively.
    • "Other challenges also reported simply averaged ensembles with neural networks are quite good already." (link)
Hopefully I will have more thoughts after reproducing these results and reading the code.

Wednesday, October 22, 2014

How to tune parameters of random forest and gradient boosting trees?

Tuning a model usually means trying different parameter settings and checking the performance of each. Tree-based models are easier to tune because they do not have many parameters to change.


  • Random forest model


A random forest model has two main parameters: tree depth and tree count.

My current thoughts.
Increasing depth will decrease bias and increase variance (each tree fits the training data more closely).
Increasing tree count will decrease variance (averaging more trees) without increasing bias.

Basically you can fix a small tree count (e.g. 100) and tune depth first: increase depth until bias is low (variance may be high). Then increase the tree count to reduce variance.
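
A minimal sketch of that two-step tuning loop with scikit-learn; the stand-in data, the depth grid, and the tree counts are made up for illustration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(2000, 20)             # stand-in features
y = np.random.randint(0, 2, 2000)        # stand-in labels

# Step 1: fix a small forest and search over depth.
for depth in [4, 8, 16, None]:
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, n_jobs=-1)
    print("depth", depth, cross_val_score(rf, X, y, cv=3).mean())

# Step 2: keep the best depth and grow the forest to reduce variance.
for n_trees in [100, 300, 1000]:
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=16, n_jobs=-1)
    print("trees", n_trees, cross_val_score(rf, X, y, cv=3).mean())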


  • Gradient boosting tree

Similar to random forest, GBT mainly has three parameters: tree depth, number of iterations (tree count), and learning rate.
My current thoughts.
Increasing depth makes learning faster (easier to converge in fewer iterations), but the model may jump around when it is close to convergence; a smaller learning rate with more iterations usually gives a smoother fit.
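
A minimal sketch of sweeping those three parameters with scikit-learn; again the data and the grid values are made up for illustration.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(2000, 20)             # stand-in features
y = np.random.randint(0, 2, 2000)        # stand-in labels

# A smaller learning rate typically needs more iterations (trees); depth
# controls how much each individual tree can learn per iteration.
for learning_rate, n_trees in [(0.1, 200), (0.05, 400)]:
    for depth in [3, 5, 7]:
        gbt = GradientBoostingClassifier(learning_rate=learning_rate,
                                         n_estimators=n_trees, max_depth=depth)
        score = cross_val_score(gbt, X, y, cv=3).mean()
        print(learning_rate, n_trees, depth, score)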

Thursday, August 28, 2014

Some Useful Machine Learning Tools

A list of useful machine learning tools (in alphabetical order). Will continuously update this list. :)

  • FEST, short for Fast Ensembles of Sparse Trees, is a piece of software for learning various types of decision tree committees from high-dimensional sparse data.
  • A machine learning library written in Go. Currently it can be used to solve binary classification problems.
  • Factorization machines (FM) are a generic approach that makes it possible to mimic most factorization models by feature engineering.
  • LIBLINEAR is a linear classifier for data with millions of instances and features.
  • LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification.
  • The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research.
  • An optimized general-purpose gradient boosting (tree) library.