Last year I followed a Coursera course on machine learning and several courses on statistics and data analysis. Since then I've been participating in Kaggle machine learning competitions and have picked up data analysis and machine learning tasks at work.
In this repository I share an overview of the subject and link to code snippets that you can reuse. I hope this is interesting and helpful to others who are learning about data analysis and/or machine learning.
To run the notebooks and/or extend them, you can start them in a Docker container (requires Docker):

```
docker run -it --rm --user root -e NB_UID=1000 -e NB_GID=1000 -v `realpath .`:/home/jovyan/work -p 8888:8888 jupyter/datascience-notebook
```
Alternatively, you can set up a Python virtual environment, install the required packages there, and run:

```
virtualenv .
bin/pip install -r requirements.txt
bin/jupyter notebook
```
I find there are several important activities in a machine learning effort. I often start with exploratory data analysis, followed by data preparation, feature engineering and preprocessing. Then comes selecting, tuning and evaluating models. This is not a linear process, though: I often revisit earlier steps to improve the final model.
I've tried to illustrate the process in the chart below:
The chart is meant to show that there is a back-and-forth between the various activities. It usually starts with exploratory analysis and a basic first model, leading to a first score. From there the result is improved by revisiting each of the steps: adding more features, tuning the model further, picking a stronger model, and so on.
Data analysis and preparation serve to build an understanding of the data and to prepare it for modelling. A good understanding of the data can be very advantageous, for example by allowing you to engineer better features or to understand why your model performs badly.
During exploratory analysis you can, for example, plot distributions, look at correlations between numeric features and examine the influence of categorical variables to help you understand the data.
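A minimal sketch of such an exploration with pandas; the tiny dataset and column names here are invented for illustration, any DataFrame with numeric and categorical columns works the same way:

```python
import pandas as pd

# Toy dataset with numeric and categorical columns (made-up values)
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 28],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "sex": ["male", "female", "female", "female", "male"],
    "survived": [0, 1, 1, 1, 0],
})

# Summary statistics give a quick view of the distributions
print(df.describe())

# Correlations between numeric features (and the target)
print(df[["age", "fare", "survived"]].corr())

# Influence of a categorical variable on the target
print(df.groupby("sex")["survived"].mean())
```

From here you would typically plot the same quantities (histograms, correlation heatmaps, bar charts per category) to spot patterns visually.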
Data preparation mostly means filling in missing values and/or converting string values to numeric ones.
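A short sketch of both steps with pandas (the data and column names are hypothetical):

```python
import pandas as pd

# Toy data with a missing value and a string column (invented example)
df = pd.DataFrame({
    "age": [22.0, None, 26.0],
    "embarked": ["S", "C", "S"],
})

# Fill missing numeric values, here with the median
df["age"] = df["age"].fillna(df["age"].median())

# Convert string values to numeric codes
df["embarked"] = df["embarked"].astype("category").cat.codes

print(df)
```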
See here for examples:
- Exploratory Data Analysis Titanic Survivors
- Data Analysis New York Green Taxi data
- Loans Acceptance Data Analysis & Modelling
- Black Friday
I would describe feature engineering as presenting data in such a way that a machine learning algorithm can make the best use of it for predictions. This can involve smart combinations of existing features, parsing textual data, calculated features, and so on.
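A small sketch of both ideas, combining existing features and parsing text, on invented passenger-style data (the column names and regex are just for illustration):

```python
import pandas as pd

# Made-up passenger data
df = pd.DataFrame({
    "name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "sibsp": [1, 1, 0],   # siblings/spouses aboard
    "parch": [0, 0, 0],   # parents/children aboard
})

# Combine existing features: total family size aboard
df["family_size"] = df["sibsp"] + df["parch"] + 1

# Parse textual data: extract the title from the name
df["title"] = df["name"].str.extract(r",\s*([A-Za-z]+)\.")

print(df[["family_size", "title"]])
```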
See here for some examples and inspiration:
Preprocessing means transforming data to make it suitable for machine learning algorithms. For example, some algorithms require all features to be on a similar scale, such as (-1, 1). Many algorithms only work with numerical data, so categorical or string features need to be encoded.
I've prepared some notebooks with handy code snippets and explanations:
- Scaling/normalization using RobustScaler
- Encoding categorical data (One-hot / Label encoding)
- Feature selection
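As a small illustration of scaling and encoding (assuming scikit-learn is available, as in the notebooks above):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# Numeric feature with an outlier; RobustScaler centers on the median and
# scales by the IQR, so the outlier has less influence than with mean/std scaling
X_num = np.array([[1.0], [2.0], [3.0], [100.0]])
X_scaled = RobustScaler().fit_transform(X_num)
print(X_scaled)

# Categorical feature encoded as one-hot columns
X_cat = np.array([["red"], ["blue"], ["red"]])
X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()
print(X_onehot)
```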
Evaluation means training a model and then measuring how well it performs. Generally you train on one dataset and evaluate on another, to ensure your model generalizes to new data and doesn't just work on the training data.
There are many approaches to evaluating models; here are some notebooks with examples:
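A minimal sketch of two common approaches, a hold-out split and cross-validation, using scikit-learn's built-in iris dataset as a stand-in for real data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Hold-out evaluation: train on one part, evaluate on the other
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: repeat the split several times for a more stable estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("cv accuracy:", scores.mean())
```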
Most machine learning algorithms have parameters that can (and should) be tuned for optimal results. Parameters influence, for example, regularization strength, tree depth and many other aspects. Tuning can have a strong influence on results with some algorithms, so it pays off to spend some energy here!
I've prepared a few notebooks with examples of parameter tuning approaches:
- Exhaustive search using GridSearchCV
- Randomized search using RandomizedSearchCV
- Hyperopt library
- PySMAC also works well; I'll add a notebook when I find time.
- Black Friday - Tuning model params with validation plots
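As a quick taste of the first approach, an exhaustive grid search over the regularization strength of a linear model (the grid values here are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try every value in the grid with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print("best cv score:", search.best_score_)
```

GridSearchCV is exhaustive, so it gets expensive fast as the grid grows; that is where RandomizedSearchCV and libraries like Hyperopt come in.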
Different models can perform better or worse on a given problem, as each model is more or less suited to certain types of problems. It often makes sense to train and evaluate multiple models to see which work best and/or where to spend tuning effort. This is not necessarily the case, though: different algorithms can also perform similarly well or badly.
I've prepared this notebook to show how you can compare models.
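The core of such a comparison can be sketched in a few lines: evaluate each candidate model with the same cross-validation setup so the scores are comparable (iris again as stand-in data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models, all evaluated with identical 5-fold cross-validation
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in models.items()}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```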
Overfitting and underfitting are the most common ways for models to fail. In the notebooks below I'll show what they can look like, how to diagnose them, and some solutions.
... TODO ...
Working with textual data is a special area, because machine learning models normally only work on numbers. That means that before any modelling happens, we need to convert the text into numbers. There are also various modelling approaches, from linear models to neural networks, that have proven successful and work in different scenarios.
In these notebooks you can find some ideas on processing textual data:
- NLP with NLTK
- Hate speech - Logistic Regression and TfIdf features
- Hate speech - Deep learning
- Twitter sentiment - Airlines Data
- NLP with Neural Networks
- NLP Textual Similarity / Search Ranking
- NLP Part of Speech Tagging
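As a minimal sketch of the text-to-numbers step, TF-IDF features feeding a logistic regression, the combination used in the notebooks above (the sentences and labels here are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented dataset: 1 = positive sentiment, 0 = negative
texts = [
    "great flight, friendly crew",
    "terrible delay, lost luggage",
    "loved the service",
    "worst airline ever",
]
labels = [1, 0, 1, 0]

# Convert the text into numeric TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train a linear model on the numeric representation
model = LogisticRegression().fit(X, labels)
pred = model.predict(vectorizer.transform(["friendly service"]))
print(pred)
```

On a real dataset the vectorizer would typically be tuned too (n-grams, stop words, vocabulary size), as in the hate speech and airline sentiment notebooks.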
Since I work with search technology frequently, a logical machine learning subject is Learning to Rank. We've done a few of these projects at work, and I will explain some of the basics here.
Spark also comes with a good machine learning library, Spark MLlib. I'll show some examples of its usage below:
It's been really interesting to learn all these new things! Writing it down in a structured way has also been very helpful to me. I hope it is interesting to others as well...