Saturday, November 21, 2015
Quick links
What has caught my attention lately:
- Difference between Machine Learning & Statistical Modeling... ([1])
- Semi-supervised learning frameworks for Python. Interesting and worth a try. ([2])
- The Present and the Future of the KDD Cup Competition. "Three main take-aways from the KDD Cup workshop presentations: XGBoost, Feature Engineering is the king and Team work is crucial." ([3])
- Comparing 7 Python data visualization tools, with comparisons and examples. :) ([4])
- The Three Cultures of Machine Learning. ([5])
- Slides and videos from MLconf 2015, San Francisco. ([6])
Sunday, March 22, 2015
Quick links
What has caught my attention lately:
- Distance and similarity in machine learning (in Chinese). ([1])
- "How to Choose a Neural Network" from DL4J. ([2])
- Winning solution at the BCI Challenge @ NER 2015. ([3] [4])
- Winning solution of The National Data Science Bowl. Convolutional neural networks win again! A lot of techniques to prevent overfitting! ([5] [6])
- Scikit-image is a collection of algorithms for image processing. ([7])
Tuesday, March 17, 2015
Takeaways from the Kaggle Tradeshift winners
The Tradeshift competition is about predicting the probability that a piece of text belongs to each of 33 classes. The winning solutions can be found in the forum thread, and the code is also available in git.
- The best solution is a weighted average of 14 two-stage models, 13 online models and 2 simple one-stage models. (Blending!!!)
- Predictions of 32 labels are used as features for the 2nd half of the data. (The labels have strong inter-dependence!)
- XGBoost is chosen as the single meta-stage classifier. (XGBoost wins again!)
- Not only feature analysis but also label analysis is needed.
- Feature selection for the online models.
- Heavy reliance on CV and grid search to fine-tune hyper-parameters.
- Additional (100-300) decision-tree features, based on Criteo's winning solution.
- Post-processing on some labels. (o_O)
- Even a 3-stage solution!
- "sklearn RandomForestClassifier active paths or ended nodes" should be useful for generating tree-based features. (See the sketch after this list.)
- Semi-supervised learning in blending.
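A minimal sketch of the tree-based feature idea above, assuming scikit-learn and toy data (the dataset shape and parameters here are made up for illustration, not taken from the competition): apply() returns the leaf ("ended node") each sample falls into, and one-hot encoding those leaf ids gives sparse tree-based features for a second-stage model.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OneHotEncoder

    # toy stand-in data; in the competition X would be the Tradeshift features
    X = np.random.rand(1000, 20)
    y = np.random.randint(0, 2, 1000)

    # apply() returns the index of the leaf each sample ends up in, one column per tree
    rf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
    rf.fit(X, y)
    leaf_ids = rf.apply(X)                     # shape (n_samples, n_trees)

    # one-hot encode the leaf indices to get sparse tree-based features
    tree_features = OneHotEncoder(handle_unknown="ignore").fit_transform(leaf_ids)
    # tree_features can be stacked with the original features for a 2nd-stage model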
Sunday, March 8, 2015
Quick links
What has caught my attention lately:
- LIBFFM: A Library for Field-aware Factorization Machines. It has been used to win two recent click-through rate prediction competitions (Criteo's and Avazu's). ([1])
- Anscombe's quartet hmm~~ :) ([2])
- Timeseries Classification: KNN & DTW. (A toy DTW + 1-NN sketch follows this list.) ([3] [4])
- "Outlier and Anomaly Detection In Server Instances With Machine Learning At Netflix: Cody Rioux". DBSCAN + MCMC. ([5])
- Introduction to Python decorators and context managers. ([6])
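Since DTW-based nearest-neighbour classification comes up above, here is a minimal, illustrative sketch (plain NumPy, toy series; not taken from the linked posts): a classic dynamic-programming DTW distance used inside a 1-NN classifier.

    import numpy as np

    def dtw_distance(a, b):
        # classic O(len(a)*len(b)) dynamic-programming DTW with squared point cost
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = (a[i - 1] - b[j - 1]) ** 2
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return np.sqrt(cost[n, m])

    def knn1_predict(train_series, train_labels, query):
        # 1-nearest-neighbour under the DTW distance
        dists = [dtw_distance(query, s) for s in train_series]
        return train_labels[int(np.argmin(dists))]

    # toy usage: two "classes" of short series
    train = [np.sin(np.linspace(0, 6, 30)), np.cos(np.linspace(0, 6, 30))]
    labels = ["sine", "cosine"]
    print(knn1_predict(train, labels, np.sin(np.linspace(0.2, 6.2, 30))))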
Sunday, March 1, 2015
Quick links
What has caught my attention lately:
- Dimensionality reduction techniques: "sparse random projections". ... "Again, random projections are not suitable for all datasets. There is no 'silver bullet' approach to dimensionality reduction." (A minimal sketch follows this list.) ([1] [2])
- "Ten Lessons Learned from Building (real-life impactful) Machine Learning Systems" ([3])
- NLP tool list. ([4] [5] (can't open now...))
- "Optimizing Python in the Real World" ([6])
- HBase v1.0! ([7])
- Data Science At Zillow ([8])
- "The G-means algorithm takes a hierarchical approach to detecting the number of clusters." ([9] [10])
- Comparing supervised learning algorithms ([11])
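For the sparse random projections item above, a minimal scikit-learn sketch (the data and dimensions are arbitrary toy values chosen for illustration):

    import numpy as np
    from sklearn.random_projection import SparseRandomProjection

    # toy high-dimensional data
    X = np.random.rand(500, 10000)

    # project down to a much smaller space; with eps given,
    # n_components="auto" picks a size via the Johnson-Lindenstrauss lemma
    srp = SparseRandomProjection(n_components="auto", eps=0.25, random_state=0)
    X_small = srp.fit_transform(X)
    print(X.shape, "->", X_small.shape)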
Thursday, October 23, 2014
Takeaways from the Kaggle Criteo winners
The Criteo competition is about predicting whether a display ad will be clicked or not. It is a very practical scenario. Some winners of this competition shared their solutions in the forum. Below is a quick review of, and the takeaways from, those solutions.
- 3 Idiots' Solution (1st)
- Code, docs and discussion thread can be found online.
- Basically it is a single FM model result. (!!!)
- "Empirically we observe using categorical features is always better than using numerical features." (!!!)
- "instance-wise data normalization makes the optimization problem easier to be solved." (!!!)
- Field-wise FM.
- Use GBDT results as new features.
- A per-coordinate learning rate schedule looks very useful for SGD (a runnable sketch follows this list). Related paper. I think this method is also used in Vowpal Wabbit.
G = G + g*g                      # accumulate the squared gradients per coordinate
w = w - eta*(1/sqrt(G))*g        # scale each coordinate's step by 1/sqrt(G)
- Calibrate the final result based on local predictions.
- Some ideas of this solution come from paper "Practical Lessons from Predicting Clicks on Ads at Facebook".
- Guocong Song's Solution (3rd)
- Code, docs and discussion thread can be found online.
- Linear combination of 4 models. All 4 models are trained with vw. (!!!)
- Grouping features before generating quadratic/polynomial features.
- Two tricks for using vw:
- -C [ --constant ] Set initial value of constant (Useful for faster convergence on data-sets where the label isn't centered around zero)
- --feature_mask allows to specify directly a set of parameters which can update, from a model file. This is useful in combination with --l1. One can use --l1 to discover which features should have a nonzero weight and do -f model, then use --feature_mask model without --l1 to learn a better regressor.
- Julian de Wit's Solution (4th)
- Code, docs and discussion thread can be found online.
- Ensemble of deep neural networks (!!!)
- The training data contained roughly 128K-200K features. (!!!)
- Rare and unseen test-set categorical values were all encoded as one category. (Almost all winners mentioned this.) This is almost the only feature engineering of this winner.
- Numeric features need to be standardized. A log-transform is applied to long-tail features (count-based features).
- 2 hidden layers with 128 and 256 units respectively.
- "Other challenges also reported simply averaged ensembles with neural networks are quite good already." (link)
Hopefully I will get more thoughts after reproducing those results and reading the code.
Wednesday, October 22, 2014
How to tune parameters of random forest and gradient boosting trees?
Tuning a model often means changing different parameters and checking the performance. Tree-based models are easier to tune, because there are not many parameters to change.
- Random forest model
There are two main parameters of a random forest model: depth and tree count.
My current thoughts:
Increasing depth will decrease bias and increase variance.
Increasing tree count will decrease variance and change bias very little.
Basically you can use a small tree count (e.g. 100) to tune depth first: increase depth to get low bias (maybe at the cost of higher variance), then increase tree count to reduce variance.
- Gradient boosting tree
Similar to random forest, GBT mainly has three parameters: tree depth, number of iterations and learning rate.
My current thoughts:
Increasing depth makes learning faster (easier to converge), but the model may jump around when it is close to convergence. A grid-search sketch covering both models is shown below.
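A minimal sketch of this kind of tuning with scikit-learn's GridSearchCV (the parameter grids and toy data are only illustrative, not recommendations):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # random forest: fix a small tree count while searching depth, then grow the forest
    rf_grid = GridSearchCV(
        RandomForestClassifier(n_estimators=100, random_state=0),
        param_grid={"max_depth": [4, 8, 16, None]},
        cv=5, scoring="roc_auc")
    rf_grid.fit(X, y)
    print("best RF depth:", rf_grid.best_params_, rf_grid.best_score_)

    # gradient boosting: depth, number of iterations (n_estimators) and learning rate
    gbt_grid = GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        param_grid={"max_depth": [2, 3, 4],
                    "n_estimators": [100, 300],
                    "learning_rate": [0.05, 0.1]},
        cv=5, scoring="roc_auc")
    gbt_grid.fit(X, y)
    print("best GBT params:", gbt_grid.best_params_, gbt_grid.best_score_)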
Thursday, August 28, 2014
Some Useful Machine Learning Tools
A list of useful machine learning tools (in alphabetical order). Will continuously update this list. :)
- fest
FEST, short for Fast Ensembles of Sparse Trees, is a piece of software for learning various types of decision tree committees from high dimensional sparse data.
- hector (https://github.com/xlvector/hector)
Golang machine learning lib. Currently, it can be used to solve binary classification problems.
- libfm (http://www.libfm.org/)
Factorization machines (FM) are a generic approach that makes it possible to mimic most factorization models by feature engineering.
- liblinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/)
LIBLINEAR is a linear classifier for data with millions of instances and features.
- libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification.
- vowpal_wabbit (https://github.com/JohnLangford/vowpal_wabbit)
The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research.
- xgboost (https://github.com/tqchen/xgboost)
An optimized general purpose gradient boosting (tree) library.