Essential Machine Learning Libraries - II
03 Jul 2017
Hello guys,
As promised earlier, this post will cover machine learning libraries used for creating models and training them. I will be covering scikit-learn, TensorFlow and Keras in detail - guiding you through some basic model creation and preparing you to build any other model easily.
First of all, you need to set up a Python environment to get started with this tutorial, so follow these steps.
All the steps given below assume that you have already imported the dataset using the pandas library. To limit the amount of content in a single post, I have decided to break down the Essential Machine Learning Libraries series into parts II, III and IV. Such a subdivision ensures easy-to-absorb content and less confusion.
Let's not wait any longer and jump directly into the actual learning.
Here X refers to the input parameters, and y refers to the target or output values.
Scikit-learn
1. Importing the library and dataset
import pandas as pd
dataset = pd.read_csv('Dataset.csv')        # modify the file name as per need
X = dataset.iloc[:, input_indexes].values   # input_indexes: column indices of the input features
y = dataset.iloc[:, label_indexes].values   # label_indexes: column indices of the target
2. Preprocessing the data
So the first step in solving any machine learning problem is to divide the dataset into a training set and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
Feature Scaling
If the values of the input parameters vary a lot, it is useful to scale them to a common range. Doing so speeds up and improves the training process. Some ML models handle feature scaling internally without being told explicitly, but it is essential to know how to do it yourself. Any one of the 3 methods mentioned below can be used (a small worked example follows the list):
- Standard Scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on the training set only
X_test = scaler.transform(X_test)         # reuse the same scaling on the test set
- Normalization of the dataset
from sklearn.preprocessing import Normalizer
scaler = Normalizer()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
- Binarization of the dataset
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold = 0.0).fit(X)
binary_X = binarizer.transform(X)
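To see what standard scaling actually does, here is a minimal sketch on a hand-made array (X_demo and its values are purely illustrative): after scaling, each column has zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])   # two features on very different scales

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
print(X_scaled.mean(axis = 0))   # ~[0. 0.] -> each column now has zero mean
print(X_scaled.std(axis = 0))    # ~[1. 1.] -> and unit variance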
Encoding categorical features
When you have an input parameter with categories, such as country, or any other non-numeric feature, it is required to encode it into numbers first.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y)
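LabelEncoder maps each category to an integer, which works well for target labels. For nominal input features such as country, where the integer ordering has no meaning, one-hot encoding is usually preferred. A minimal sketch using pandas (df_demo and the 'country' column are only illustrative):
import pandas as pd
df_demo = pd.DataFrame({'country': ['France', 'Spain', 'Germany', 'Spain']})
print(pd.get_dummies(df_demo, columns = ['country']))   # each category becomes its own 0/1 column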
Imputing Missing Values
Sometimes a few values in a feature or output label column are missing. We cannot simply eliminate that feature; instead we can use some strategy, e.g. the mean or median of the column, to fill the dataset with an approximate value in place of each missing one.
from sklearn.preprocessing import Imputer
imput = Imputer(missing_values = 0, strategy = 'mean', axis = 0)   # treat 0 as the missing-value marker
X_train = imput.fit_transform(X_train)
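Note that on scikit-learn 0.20 and later, Imputer has been replaced by SimpleImputer from sklearn.impute. A minimal sketch of the newer API, using NaN as the missing-value marker (X_demo is only illustrative):
import numpy as np
from sklearn.impute import SimpleImputer

X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
print(imputer.fit_transform(X_demo))   # the NaN is replaced by the column mean (1 + 7) / 2 = 4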
Generation of Polynomial Features
It can often happen that you have very few features for training your machine learning model, yet you want a model that captures the underlying relationship well. So instead of looking for new features to include in your dataset, you create polynomial features of a suitable order from the existing features. This lets the model capture non-linear relationships and can possibly improve its accuracy, though very high orders increase the risk of overfitting, so choose the degree carefully.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree = 5)
X_poly = poly.fit_transform(X)
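To get a feel for what is generated, here is a tiny sketch with two features and degree 2 (X_demo is only illustrative): the output contains the bias term, the original features, and every product up to the chosen degree.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2, 3]])         # one sample with features a = 2, b = 3
poly = PolynomialFeatures(degree = 2)
print(poly.fit_transform(X_demo))   # [[1. 2. 3. 4. 6. 9.]] -> 1, a, b, a^2, ab, b^2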
3. Model creation
In this step an appropriate machine learning model is created as per the need. It is important to understand that not all models are suitable for every variety of ML or DL problem.
Understanding which model should be used requires a basic knowledge of how they work. I suggest you have a quick run through this article first.
Supervised Learning Models
- Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression(normalize = True)
- Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
- Support Vector Machines (SVM)
from sklearn.svm import SVC
svc = SVC(kernel = 'linear')
- K-Nearest Neighbors (KNN)
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors = 5)
Unsupervised Learning Models
- Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.95)
- K Means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters = 3, random_state = 0)
4. Fitting the model to training data
Irrespective of the learning model used, the methods mentioned below are valid for all.
Supervised Learning requires input parameters as well as target labels/values:
lr.fit(X_train, y_train)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)
Unsupervised models, on the other hand, only require the inputs, as the name suggests:
k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)
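Since pca was created with n_components = 0.95, it keeps just enough components to explain 95% of the variance. After fitting, the object exposes this information, which is handy for checking how much the dimensionality was actually reduced:
print(pca.n_components_)                     # number of components actually kept
print(pca.explained_variance_ratio_.sum())   # total variance explained, >= 0.95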
5. Predicting test-set results
Supervised Learning
y_pred = lr.predict(X_test)
y_pred = svc.predict(X_test)
y_prob = knn.predict_proba(X_test)   # class probabilities; use knn.predict(X_test) for class labels
Unsupervised Learning
y_pred = k_means.predict(X_test)
6. Evaluating your model’s performance
Once you have predicted the test set results, one important step still remains: evaluating the performance of your model. Accuracy alone cannot guarantee that your model is close to the target results, so there are various scores and indexes which quantify your model's performance.
For any kind of model, it is useful to first simply compare the y_pred and y_test values to get a rough idea; here y_test is the actual target to be achieved by the model. But once the number of examples being considered is large (≫ 100), it is beyond human effort to compare them manually, hence these metrics come in.
These evaluation metrics vary with the type of ML problem being considered.
Regression Metrics
- Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
- Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)
- R squared Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
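To see what these metrics report, here is a tiny worked example on hand-made arrays (the values are chosen only for illustration):
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

print(mean_absolute_error(y_true_demo, y_pred_demo))   # (0.5 + 0 + 1.5 + 1) / 4 = 0.75
print(mean_squared_error(y_true_demo, y_pred_demo))    # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
print(r2_score(y_true_demo, y_pred_demo))              # 1 - SS_res / SS_tot; 1.0 is a perfect fit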
Classification Metrics
- Accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
- Classification Report
from sklearn.metrics import classification_report
cr = classification_report(y_test, y_pred)
- Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
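The confusion matrix is easiest to read on a tiny hand-made example (the labels below are only illustrative); rows correspond to the actual classes and columns to the predicted ones:
from sklearn.metrics import confusion_matrix

y_test_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]
print(confusion_matrix(y_test_demo, y_pred_demo))
# [[1 1]
#  [1 2]] -> 1 true negative, 1 false positive, 1 false negative, 2 true positives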
Clustering Metrics
- Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
rand_score = adjusted_rand_score(y_test, y_pred)
- Homogeneity
from sklearn.metrics import homogeneity_score
hs = homogeneity_score(y_test, y_pred)
- V-measure
from sklearn.metrics import v_measure_score
v_measure = v_measure_score(y_test, y_pred)
Cross-Validation
from sklearn.model_selection import cross_val_score
cross_val_knn = cross_val_score(knn, X_train, y_train, cv = 4)
cross_val_lr = cross_val_score(lr, X, y, cv = 2)
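cross_val_score returns one score per fold, so a quick way to summarise the result is the mean and spread of those scores:
print(cross_val_knn.mean(), cross_val_knn.std())   # average fold score and its variability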
7. Fine-tuning your trained model
One might think that once an ML model has been created, trained and tested, the job is done. Well, it is not so easy to make state-of-the-art models without first fine-tuning them. Fine-tuning simply means modifying the model's properties, by selecting the best set of hyperparameter values, to make it more and more robust with each round of training and testing.
It may be considered a meta-heuristic procedure, which may or may not result in the globally best solution.
Grid Search
import numpy as np
from sklearn.model_selection import GridSearchCV
parameters = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid = parameters)
grid.fit(X_train,y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
So based on the value of n_neighbors giving the best_score_, our KNN model can be fine-tuned to give better results.
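With the default refit = True, GridSearchCV also retrains the best configuration on the whole training set, so the tuned model can be used for prediction directly (a minimal sketch):
best_knn = grid.best_estimator_        # KNN refitted with the best n_neighbors and metric
y_pred = best_knn.predict(X_test)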
Randomized Parameter Optimization
from sklearn.model_selection import RandomizedSearchCV
parameters = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator = knn, param_distributions = parameters, cv = 4, n_iter = 8, random_state = 5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)
If you want to know more about the scikit-learn library implementation, please follow the official documentation. More info about the evaluation metrics can be found here, if needed.
So here we come to the end of post II of the Essential Machine Learning Libraries series. Scikit-learn has been one of the most widely used ML libraries among researchers and developers alike, and congrats to you for joining the league!
In the upcoming Part III, I will be covering the TensorFlow library - basics, model creation and testing. So stay tuned!