Saturday, November 21, 2015

Quick links

What has caught my attention lately:
  • Difference between Machine Learning & Statistical Modeling... ([1])
  • Semi-supervised learning frameworks for Python. Interesting and worth a try. ([2])
  • The Present and the Future of the KDD Cup Competition. "Three main take-aways from the KDD Cup workshop presentations: XGBoost, Feature Engineering is the king and Team work is crucial." ([3])
  • Comparing 7 Python data visualization tools. A comparison with examples. :) ([4])
  • The Three Cultures of Machine Learning. ([5])
  • Slides and videos from MLconf 2015, San Francisco ([6])

Monday, June 22, 2015

Quick links

What has caught my attention lately:
  • Machine Learning Table of Elements Decoded. No xgboost?! ([1])
  • 6 tricks from the Otto challenge! ([2])
  • Machine learning evaluation metrics. ([3])
  • Ensembling! ([4])
  • Owen (Kaggle #1) talk! ([5])
  • Data science IPython notebooks. ([6])

Monday, May 25, 2015

Quick links

What has caught my attention lately:
  • A Benchmark Dataset for Time Series Anomaly Detection. ([1])
  • Python image processing libraries performance: OpenCV vs Scipy vs Scikit-Image. ([2])
  • Exploring Spark MLlib. ([3])
  • 7 Python libraries you should know about. ([4])
  • Benchmarking random forest implementations. ([5])
  • Statistical inference is only mostly wrong. (really?!) ([6])

Sunday, March 22, 2015

Quick links

What has caught my attention lately:
  • Distance and similarity in machine learning (in Chinese). ([1])
  • "How to Choose a Neural Network" from DL4J. ([2])
  • Winning solution at the BCI Challenge @ NER 2015. ([3] [4])
  • Winning solution of The National Data Science Bowl. Convolutional neural networks win again! A lot of techniques to prevent overfitting! ([5] [6])
  • Scikit-image is a collection of algorithms for image processing. ([7])

Tuesday, March 17, 2015

Take away from kaggle tradeshift winner

The Tradeshift competition was to predict the probability that a piece of text belongs to each of 33 classes. The winning solution is described in the forum thread; the code is also in git.

  • The best solution is a weighted average of 14 two-stage models, 13 online models, and 2 simple one-stage models. (Blending!!!)
  • Predictions of 32 labels are used as features for the 2nd half of the data. (Labels have strong inter-dependence!)
  • XGBoost is chosen as the single meta-stage classifier. (XGBoost wins again!)
  • Not only feature analysis but also label analysis is needed.
  • Feature selection for the online model.
  • Rely heavily on CV and grid search to fine-tune hyper-parameters.
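The two-stage and blending ideas can be sketched in a few lines. This is only an illustrative toy, not the winners' actual pipeline: logistic regression stands in for XGBoost and the online models, the data is synthetic, and the 0.4/0.6 blend weights are made up.

```python
# Toy sketch: (1) out-of-fold label predictions as extra second-stage
# features, (2) a weighted-average blend of the two stages.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: out-of-fold predicted probabilities (so the second stage
# never sees predictions made on its own training rows).
stage1 = LogisticRegression(max_iter=1000)
oof = cross_val_predict(stage1, X_tr, y_tr, cv=5,
                        method="predict_proba")[:, 1]
stage1.fit(X_tr, y_tr)
te_meta = stage1.predict_proba(X_te)[:, 1]

# Stage 2: original features plus the stage-1 prediction as a feature.
stage2 = LogisticRegression(max_iter=1000)
stage2.fit(np.column_stack([X_tr, oof]), y_tr)
p2 = stage2.predict_proba(np.column_stack([X_te, te_meta]))[:, 1]

# Blend: weighted average of stage-1 and stage-2 probabilities.
blend = 0.4 * te_meta + 0.6 * p2
print("blended accuracy:", ((blend > 0.5) == y_te).mean())
```

In the real solution the 14 two-stage models and 13 online models would each contribute a column to the blend, with weights tuned by CV.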
Some other solutions are shared in the forum.

Sunday, March 8, 2015

Quick links

What has caught my attention lately:
  • LIBFFM: A Library for Field-aware Factorization Machines. It has been used to win two recent click-through rate prediction competitions (Criteo's and Avazu's). ([1])
  • Anscombe's quartet hmm~~ :)  ([2])
  • Timeseries Classification: KNN & DTW ([3] [4])
  • "Outlier and Anomaly Detection In Server Instances With Machine Learning At Netflix: Cody Rioux" DBSCAN + MCMC ([5])
  • Introduction to Python decorators and context managers. ([6])
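The KNN & DTW links above pair nearest-neighbour classification with a dynamic-time-warping distance. A minimal pure-Python DTW (my own sketch, not taken from the linked code) looks like:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = min accumulated cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


# Warping absorbs the repeated 0, so these align perfectly:
print(dtw_distance([0, 0, 1], [0, 1]))  # 0.0
```

A 1-NN classifier then just labels a query series with the class of its nearest training series under this distance.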