Essential Machine Learning Libraries - I
23 Jun 2017
This post is part 1 of the Essential Machine Learning Libraries series.
Hello guys,
Now that you have covered some basics of the theory behind machine learning in the previous posts (1, 2), it's time to jump into some essential Python libraries used for implementing ML and data science projects.
Python is a popular choice among developers, and most of the major data science and ML/deep learning libraries in use today have been developed for it. Though R is also used by many industry researchers for analytics and data science, my experience in this domain is limited to Python.
This post covers only the libraries mainly used for importing, preprocessing and visualizing data. The libraries used for creating ML models will be covered in the next post.
1. NumPy
The NumPy library adds support for multidimensional arrays and matrices in a manner similar to MATLAB, along with a large collection of high-level mathematical functions to operate on these arrays. The core of NumPy is the “ndarray” (n-dimensional array) data structure.
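To make the idea of an ndarray a little more concrete, here is a minimal sketch that inspects a small array. The values and variable names are made up for illustration only.

```python
import numpy as np

# A 2x3 array built from a nested list (values chosen only for illustration)
m = np.array([[1, 2, 3], [4, 5, 6]])

print(m.ndim)   # number of dimensions
print(m.shape)  # size along each dimension
print(m.dtype)  # element type, inferred from the input (platform dependent)

# reshape keeps the same elements but arranges them in a different shape
flat = m.reshape(6)
print(flat)
```

Every ndarray carries its shape and element type with it, which is what lets NumPy apply operations to whole arrays at once.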
Sample Operations:
Array creation
import numpy as np
x = np.array([1, 2, 3])
x
Output - array([1, 2, 3])
y = np.arange(10) # like Python's range, but returns an array
y
Output - array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Basic operations
a = np.array([1, 2, 3, 6])
b = np.linspace(0, 2, 4) # create an array with four equally spaced points starting with 0 and ending with 2.
c = a - b
c
Output - array([ 1., 1.33333333, 1.66666667, 4.])
a**2
Output - array([ 1, 4, 9, 36])
Universal Functions
a = np.linspace(-np.pi, np.pi, 100)
b = np.sin(a)
c = np.cos(a)
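Universal functions work element by element, so the result has the same shape as the input. A small sketch of how you might check that, using the identity sin² + cos² = 1 as a sanity test (the name identity_holds is just for illustration):

```python
import numpy as np

a = np.linspace(-np.pi, np.pi, 100)
b = np.sin(a)  # applied to each of the 100 samples
c = np.cos(a)

# The output has the same shape as the input
print(b.shape)

# sin^2 + cos^2 should equal 1 at every sample, up to floating-point error
identity_holds = np.allclose(b**2 + c**2, 1.0)
print(identity_holds)
```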
Linear Algebra
from numpy.random import rand
from numpy.linalg import solve, inv
a = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])
a.transpose()
Output - array([[ 1. , 3. , 5. ],
[ 2. , 4. , 9. ],
[ 3. , 6.7, 5. ]])
inv(a)
Output - array([[-2.27683616, 0.96045198, 0.07909605],
[ 1.04519774, -0.56497175, 0.1299435 ],
[ 0.39548023, 0.05649718, -0.11299435]])
b = np.array([3, 2, 1])
solve(a, b) # solve the equation ax = b
Output - array([-4.83050847, 2.13559322, 1.18644068])
c = rand(3, 3) * 20 # create a 3x3 random matrix with values in [0, 1), scaled by 20
c
Output - array([[ 3.98732789, 2.47702609, 4.71167924],
[ 9.24410671, 5.5240412 , 10.6468792 ],
[ 10.38136661, 8.44968437, 15.17639591]])
np.dot(a, c) # matrix multiplication
Output - array([[ 53.61964114, 38.8741616 , 71.53462537],
[ 118.4935668 , 86.14012835, 158.40440712],
[ 155.04043289, 104.3499231 , 195.26228855]])
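A quick way to convince yourself that solve and inv did what you expect is to plug the results back in. The sketch below reuses the matrix a and vector b from the example above; it also uses the @ operator, which is equivalent to np.dot for 2-D arrays on Python 3.5 and later.

```python
import numpy as np
from numpy.linalg import solve, inv

a = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])
b = np.array([3, 2, 1])

x = solve(a, b)

# Multiplying a by the solution should give back b (up to rounding)
residual_ok = np.allclose(a @ x, b)

# A matrix times its inverse should give the identity matrix
inverse_ok = np.allclose(a @ inv(a), np.eye(3))

print(residual_ok, inverse_ok)
```

Comparing with np.allclose rather than == is the safer habit here, since floating-point results are rarely exact.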
For more detailed examples and tutorials follow this link.
2. SciPy
SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. It has many packages for scientific computing, including (but not limited to) optimization, integration, fast Fourier transforms, ODE solvers, and signal and image processing.
SciPy is organized into subpackages covering different scientific computing domains. These are summarized in the following table:
Subpackage | Description |
---|---|
cluster | Clustering algorithms |
constants | Physical and mathematical constants |
fftpack | Fast Fourier Transform routines |
integrate | Integration and ordinary differential equation solvers |
interpolate | Interpolation and smoothing splines |
io | Input and Output |
linalg | Linear algebra |
ndimage | N-dimensional image processing |
odr | Orthogonal distance regression |
optimize | Optimization and root-finding routines |
signal | Signal processing |
sparse | Sparse matrices and associated routines |
spatial | Spatial data structures and algorithms |
special | Special functions |
stats | Statistical distributions and functions |
Since there are a lot of examples to cover under SciPy, I will write a separate post after I have tried working with most of them. Till then, you can check out the following link.
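In the meantime, here is a small taste of two of the subpackages from the table, integrate and optimize. This is only a sketch: the functions being integrated and minimized are made up for illustration.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate exp(-x^2) over the whole real line.
# The exact answer is sqrt(pi), which gives us something to compare against.
value, abs_error = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(value, np.sqrt(np.pi))

# Find the minimum of a simple one-variable function, (x - 2)^2 + 1.
# Its minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0)**2 + 1.0)
print(result.x)
```

quad returns both the estimate and an estimate of the absolute error, and minimize_scalar returns a result object whose x attribute holds the location of the minimum.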
3. Matplotlib
Matplotlib is a Python plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. There is also a procedural interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB. It was historically known as “pylab”, and the same style is available through the matplotlib.pyplot module used in the samples in this post.
Line Plot
import matplotlib.pyplot as plt
import numpy as np
a = np.linspace(0,10,100)
b = np.exp(-a)
plt.plot(a,b)
plt.show()
Histogram
import matplotlib.pyplot as plt
from numpy.random import normal
x = normal(size=200)
plt.hist(x,bins=30)
plt.show()
Scatter Plot
import matplotlib.pyplot as plt
from numpy.random import rand
a = rand(100)
b = rand(100)
plt.scatter(a,b)
plt.show()
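The samples above all use the procedural pyplot style. For comparison, here is a rough sketch of the same line plot written with the object-oriented API mentioned earlier. The Agg backend and the file name decay.png are my own assumptions so that the script runs without a display; in a notebook or desktop session you can drop them and call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

a = np.linspace(0, 10, 100)

# Create a figure and a single set of axes, then draw on the axes object
fig, ax = plt.subplots()
ax.plot(a, np.exp(-a), label="exp(-a)")
ax.set_xlabel("a")
ax.set_ylabel("value")
ax.set_title("Exponential decay")
ax.legend()

fig.savefig("decay.png")  # hypothetical output file name
```

Working with explicit fig and ax objects tends to pay off once a figure has more than one subplot, because each call says which axes it applies to.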
For more detailed examples and tutorials follow this link.
4. Pandas
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
A few of the important features of pandas are:
- DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, fancy indexing, and subsetting of large data sets.
- Data structure column insertion and deletion.
- Group by engine allowing split-apply-combine operations on data sets.
- Data set merging and joining.
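A few of these features (missing-data handling, group by, and merging) can be sketched in a handful of lines. The records, the Region column and all the variable names below are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up records; one value is missing on purpose
births = pd.DataFrame({
    "Names": ["Bob", "Jessica", "Bob", "Mary", "Jessica"],
    "Births": [968, 155, 32, np.nan, 45],
})

# Missing data: replace NaN with 0 (one of several possible strategies)
births["Births"] = births["Births"].fillna(0)

# Group by: split the rows by name, then sum the births within each group
totals = births.groupby("Names")["Births"].sum()
print(totals)

# Merging: attach a second (hypothetical) table on the shared Names column
regions = pd.DataFrame({
    "Names": ["Bob", "Jessica", "Mary"],
    "Region": ["East", "West", "North"],
})
merged = births.merge(regions, on="Names", how="left")
print(merged)
```

The how="left" argument keeps every row of births even if a name had no match in regions, much like a left join in SQL.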
Some examples:
Creating data
from pandas import DataFrame, read_csv
import pandas as pd
from numpy.random import rand
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
Merging the two lists using the zip function
BabyDataSet = list(zip(names,births))
BabyDataSet
Output - [('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
Creating a DataFrame object to store the data in a tabular form similar to a SQL table
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
Importing and dividing dataset
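dataset = pd.read_csv('Dataset_name.csv') # replace with the path to your own CSV file
X = dataset.iloc[:, :-1].values # input values; assumes the target is the last column, adjust the indices as needed
y = dataset.iloc[:, -1].values # target/output values; same assumption as above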
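If you want to try the import-and-split step without a file on disk, here is a self-contained sketch. It reads a tiny made-up CSV from a string and assumes the target is the last column; the column names and values are invented for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical columns and values)
csv_text = """age,salary,purchased
25,40000,0
32,52000,1
47,81000,1
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Assumption: every column except the last is an input, the last is the target
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print(X.shape, y.shape)
```

The .values attribute hands back plain NumPy arrays, which is the form most ML libraries expect as input.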
For more detailed examples and tutorials follow this link. Also this is a great article for a quick walkthrough.
So far we have covered the major Python libraries required for importing and preprocessing data before actual machine learning model creation.
In the next post I will cover the ML and deep learning libraries essential for creating, training and testing various models. So stay tuned!