maplearn.ml package
Machine Learning
What is Machine Learning?
From Wikipedia: “Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task.”
So, we use Machine Learning to predict results about unknown data:
- Is this new email spam?
- Is this an image of a cat or a dog?
- How many people are going to buy my new product?
- Applications are infinite…
To answer these questions, we use mathematical models (the cloud in the above figure) that need to be trained (or fitted) before they can make predictions.
What to predict?
Depending on the nature of the values to be predicted, we will talk about:
- classification when the values are discrete (also called categorical)
- regression when the values are continuous
Classification and regression both need samples for training: they belong to supervised learning. If you do not have samples, consider unsupervised classification, also called clustering.
Note
On the other hand, a regression can’t be made without samples.
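The difference between the two supervised tasks can be sketched without any library. The block below is a minimal illustration, not maplearn code: the helper names (`nearest`, `classify`, `regress`) and the toy data are assumptions made up for this example. Both predictions rely on labelled samples; only the type of the predicted value differs.

```python
from collections import Counter


def nearest(samples, x, k=3):
    """Return the k labelled samples whose feature is closest to x."""
    return sorted(samples, key=lambda s: abs(s[0] - x))[:k]


def classify(samples, x, k=3):
    """Classification: majority vote among the discrete labels of the neighbours."""
    votes = Counter(label for _, label in nearest(samples, x, k))
    return votes.most_common(1)[0][0]


def regress(samples, x, k=3):
    """Regression: mean of the continuous values of the neighbours."""
    values = [value for _, value in nearest(samples, x, k)]
    return sum(values) / len(values)


# (feature, label) pairs: labels are categories
labelled = [(1.0, "cat"), (1.2, "cat"), (1.1, "cat"), (5.0, "dog"), (5.3, "dog")]
# (feature, value) pairs: values are continuous
measured = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (10.0, 100.0)]

print(classify(labelled, 1.3))  # a discrete class
print(regress(measured, 2.1))   # a continuous value
```

Without the labelled pairs, neither function could produce an answer, which is why both tasks are called supervised.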
Maplearn: machine Learning modules
In maplearn, machine learning is powered by scikit-learn. One reason is its great documentation: have a look at it to go further.
Maplearn provides 3 modules corresponding to each of these tasks:
- Classification
- Clustering
- Regression
Two other modules are linked to these tasks:
- Confusion: confusion matrix (used to evaluate classifications)
- Distance: computes distance using different formulas
Another task that machine learning can accomplish is reducing the number of dimensions (also called features):
- Reduction: dimensionality reduction
The last submodule is needed internally but should not be used directly:
- Machine: abstract class of a machine learning processor; one or more algorithms can be applied
Submodules
maplearn.ml.classification module
Classification
Classification methods are used to generate a map with each pixel assigned to a class based on its multispectral composition. The classes are determined based on the spectral composition of training areas defined by the user.
Classification is supervised and needs samples to fit on. The output will be a matrix with integer values.
- Example:
>>> import os
>>> from maplearn.datahandler.loader import Loader
>>> from maplearn.datahandler.packdata import PackData
>>> from maplearn.ml.classification import Classification
>>> loader = Loader('iris')
>>> data = PackData(X=loader.X, Y=loader.Y, data=loader.aData)
>>> lst_algos = ['knn', 'lda', 'rforest']
>>> dir_out = os.path.join('maplearn_path', 'tmp')
>>> clf = Classification(data=data, dirOut=dir_out, algorithm=lst_algos)
>>> clf.run()
-
class
maplearn.ml.classification.
Classification
(data=None, algorithm=None, **kwargs)¶ Bases:
maplearn.ml.machine.Machine
Apply supervised classification onto a dataset:
- samples needed for fitting
- data to predict
- Args:
- data (PackData): data to play with
- algorithm (list or str): name of an algorithm or list of algorithm(s)
- **kwargs: other parameters like kfold
-
export_tree
(out_file=None)¶ Exports a decision tree
- Args:
- out_file (str): path to the output file
-
fit_1
(algo, verbose=True)¶ Fits a classifier using cross-validation
- Arg:
- algo (str): name of the classifier
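Cross-validation repeatedly splits the samples into a training part and a validation part. The block below is a minimal sketch of k-fold index splitting in plain Python, assuming contiguous, unshuffled folds. The function name `kfold_indices` is hypothetical and not part of maplearn; how maplearn actually builds its folds is not documented here.

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first (n_samples % k) folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in kfold_indices(7, 3):
    print(train, test)
```

Each sample lands in the test part exactly once, so every sample contributes to the evaluation of the fitted classifier.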
-
load
(data)¶ Loads necessary data for supervised classification:
- samples (X and Y): necessary for fitting
- other (unknown) data to predict, after fitting
- Args:
- data (PackData)
-
optimize
(algo)¶ Optimize parameters of a classifier
- Args:
- algo (str): name of the classifier to use
-
predict_1
(algo, proba=True, verbose=True)¶ Predict classes using a fitted algorithm applied to unknown data
- Args:
- algo (str): name of the algorithm to apply
- proba (bool): should probabilities be added to result
-
run
(predict=False, verbose=True)¶ Applies every classifier specified in the ‘algorithm’ property
- Args:
- predict (bool): should the classifier only be fitted, or also used to predict?
-
maplearn.ml.classification.
lcs_kernel
(x, y)¶ Custom kernel based on LCS (Longest Common Subsequence)
- Args:
- x and y (matrices)
- Returns:
- matrix of float values
-
maplearn.ml.classification.
skreport_md
(report)¶ Converts a classification report given by scikit-learn into a markdown table. TODO: to be replaced by a pandas dataframe
- Arg:
- report (str): classification report
- Returns:
- str_table: a table formatted as markdown
-
maplearn.ml.classification.
svm_kernel
(x, y)¶ Custom kernel based on DTW (Dynamic Time Warping)
- Args:
- x and y (matrices)
- Returns:
- matrix of float values
maplearn.ml.clustering module
Clustering (unsupervised classification)
A clustering algorithm groups the given samples, each represented as a vector x in the N-dimensional feature space, into a set of clusters according to their spatial distribution in the N-D space. Clustering is an unsupervised classification as no a priori knowledge (such as samples of known classes) is assumed to be available.
Clustering is unsupervised and does not need samples for fitting. The output will be a matrix with integer values.
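The grouping described above can be illustrated with a bare-bones k-means loop. This is a sketch under stated assumptions, not the algorithm maplearn runs: the function name `kmeans` is hypothetical, the points are 2-D tuples, and the first k points seed the centroids so that the result is reproducible.

```python
import math


def kmeans(points, k, n_iter=20):
    """Group 2-D points into k clusters; returns one integer label per point."""
    # Assumption for reproducibility: the first k points seed the centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (9.5, 10.2), (0.1, 0.4)]
print(kmeans(points, k=2))  # [0, 1, 0, 1, 0]
```

No labels are supplied: the integer values in the output are cluster identifiers derived only from the distribution of the points.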
- Example:
>>> import os
>>> from maplearn.datahandler.loader import Loader
>>> from maplearn.datahandler.packdata import PackData
>>> from maplearn.ml.clustering import Clustering
>>> loader = Loader('iris')
>>> data = PackData(X=loader.X, Y=loader.Y, data=loader.aData)
>>> lst_algos = ['mkmeans', 'birch']
>>> dir_out = os.path.join('maplearn_path', 'tmp')
>>> cls = Clustering(data=data, dirOut=dir_out, algorithm='mkmeans')
>>> cls.run()
-
class
maplearn.ml.clustering.
Clustering
(data=None, algorithm=None, **kwargs)¶ Bases:
maplearn.ml.machine.Machine
Apply one or several methods of clustering onto a dataset
- Args:
- data (PackData): dataset to play with
- algorithm (str or list): name of algorithm(s) to use
- **kwargs: more parameters about clustering, such as the ‘metric’ to use or the number of clusters expected (‘n_clusters’)
-
fit_1
(algo, verbose=True)¶ Fits one clustering algorithm
- Arg:
- algo (str): name of the algorithm to fit
-
load
(data)¶ Loads necessary data for clustering: no samples are needed.
- Arg:
- data (PackData): data to play with
-
predict_1
(algo, export=False, verbose=True)¶ Makes clustering prediction using one algorithm
- Args:
- algo (str): name of the algorithm to use
- export (bool): should the result be exported?
maplearn.ml.confusion module
Confusion matrix
A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a classification algorithm (see the ‘Classification’ class).
Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes.
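The row and column convention described above can be made concrete with a small, library-free sketch. The function name `confusion_matrix` and the toy labels below are illustrative assumptions; this is not the code of the Confusion class.

```python
def confusion_matrix(y_true, y_pred):
    """Count samples per (actual, predicted) pair.

    Rows are actual classes, columns are predicted classes.
    """
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix


y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]
labels, cm = confusion_matrix(y_true, y_pred)
print(labels)
for row in cm:
    print(row)
```

Off-diagonal cells show which classes are confused: here one actual cat was predicted as a dog, and one actual dog as a cat.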
- Example:
>>> import numpy as np
>>> from maplearn.ml.confusion import Confusion
>>> # creates 2 vectors representing labels
>>> y_true = np.random.randint(0, 15, 100)
>>> y_pred = np.random.randint(0, 15, 100)
>>> cm = Confusion(y_true, y_pred)
>>> cm.calcul_matrice()
>>> cm.calcul_kappa()
>>> print(cm)
-
class
maplearn.ml.confusion.
Confusion
(y_sample, y_predit, fTxt=None, fPlot=None)¶ Bases:
object
Computes confusion matrix based on 2 vectors of labels:
- labels of known samples
- predicted labels
- Args:
- y_sample (vector): vector with known labels
- y_predit (vector): vector with predicted labels
- fTxt (str): path to the text file to write confusion matrix into
- fPlot (str): path to the graphic file to plot the confusion matrix into
- Attributes:
- y_sample (vector): true labels (ground data)
- y_predit (vector): corresponding predicted labels
- cm (matrix): confusion matrix filled with integer values
- kappa (float): kappa index
- score (float): precision score
- TODO:
- y_sample and y_predit should be renamed y_true and y_pred
-
calcul_matrice
()¶ Computes a confusion matrix and displays the result
- Returns:
- matrix (integer): confusion matrix
- float: kappa index
-
export
(fTxt=None, fPlot=None, title=None)¶ Saves confusion matrix in:
- a text file
- a graphic file
- Args:
- fTxt (str): path to the output text file
- fPlot (str): path to the output graphic file
- title (str): title of the chart
-
kappa
¶ Computes kappa index based on 2 vectors
- Returns:
- float: kappa index
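The kappa index is commonly understood as Cohen's kappa, which compares the observed agreement with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). Assuming that definition applies here, a minimal sketch computed from a confusion matrix could look like this (the function name `cohen_kappa` is hypothetical, and the matrix is toy data):

```python
def cohen_kappa(cm):
    """Cohen's kappa from a square confusion matrix (list of lists)."""
    total = sum(sum(row) for row in cm)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / total ** 2
    return (observed - expected) / (1 - expected)


cm = [[20, 5], [10, 15]]
print(round(cohen_kappa(cm), 3))  # 0.4
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. The sketch does not handle the degenerate case where the chance agreement equals 1.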
-
maplearn.ml.confusion.
confusion_cl
(cm, labels, os1, os2)¶ Computes confusion between 2 given classes (expressed in percentage) based on a confusion matrix
- Args:
- cm (matrix): confusion matrix
- labels (array): vector of labels
- os1 and os2 (int): codes of the classes
- Returns:
- float: confusion percentage between 2 classes
maplearn.ml.distance module
Distance
Computes pairwise distance between 2 matrices, using one of several metrics (Euclidean is the default)
- Example:
>>> import numpy as np
>>> from maplearn.ml.distance import Distance
>>> y1 = np.random.random(50)
>>> y2 = np.random.random(50)
>>> dist = Distance(y1, y2)
>>> dist.run()
-
class
maplearn.ml.distance.
Distance
(x=None, y=None)¶ Bases:
object
Computes pairwise distance between 2 matrices (x and y)
- Args:
- x (matrix)
- y (matrix)
-
compare
(x=None, y=None, methods=[])¶ Compares pairwise distances obtained with different metrics
- Args:
- x and y (matrices)
- methods (list): list of metrics used to compute pairwise distance. If empty, every available metric will be compared
-
dtw
(x=None, y=None)¶ Dynamic Time-Warping distance
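For readers unfamiliar with dynamic time warping, here is a minimal sketch of the classic dynamic-programming formulation on two 1-D sequences. It is not the implementation behind this method: the function name `dtw_distance` and the choice of absolute difference as the local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b[j - 1] is reused
                cost[i][j - 1],      # b advances, a[i - 1] is reused
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [2, 3, 4]))     # 2.0
```

Unlike a point-by-point Euclidean distance, the sequences may have different lengths, and a repeated value costs nothing when it can be aligned with the same element twice.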
-
lcs
(x=None, y=None, eps=10, delta=3)¶ Distance based on Longest Common Subsequence
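The meaning of `eps` and `delta` is not documented here. In the common time-series formulation of the longest common subsequence, two points match when their values differ by less than `eps` and their positions by at most `delta`. The sketch below follows that formulation purely as an assumption; the function names and the normalisation by the shorter length are illustrative choices, not maplearn's confirmed behaviour.

```python
def lcss_length(a, b, eps, delta):
    """Length of the longest common subsequence with tolerances.

    Assumption: a[i] and b[j] match when |a[i] - b[j]| < eps
    and |i - j| <= delta.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) < eps and abs(i - j) <= delta:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcss_distance(a, b, eps, delta):
    """Turn the similarity into a distance between 0 and 1."""
    return 1 - lcss_length(a, b, eps, delta) / min(len(a), len(b))


a = [1.0, 2.0, 3.0, 4.0]
b = [1.1, 2.2, 9.0, 4.1]
print(lcss_length(a, b, eps=0.5, delta=1))    # 3
print(lcss_distance(a, b, eps=0.5, delta=1))  # 0.25
```

The outlier 9.0 is simply skipped rather than penalised, which is what makes this family of distances robust to noise.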
-
run
(x=None, y=None, meth='euclidean')¶ Distance calculation according to a specified method
- Args:
- x (matrix)
- y (matrix)
- meth (str): name of the metric distance to use
- Returns:
- matrix of pairwise distance values
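As an illustration of what a matrix of pairwise distances contains, the following sketch computes Euclidean distances between every row of `x` and every row of `y` in plain Python. The function name `pairwise_euclidean` is hypothetical and the sketch is independent of the Distance class.

```python
import math


def pairwise_euclidean(x, y):
    """Return the len(x) by len(y) matrix of Euclidean distances."""
    return [[math.dist(a, b) for b in y] for a in x]


x = [(0.0, 0.0), (1.0, 1.0)]
y = [(3.0, 4.0), (1.0, 1.0), (0.0, 2.0)]
for row in pairwise_euclidean(x, y):
    print(row)
```

Cell (i, j) holds the distance between the i-th row of `x` and the j-th row of `y`, so the result has as many rows as `x` and as many columns as `y`.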
-
simplex
(x=None, y=None, sigma=50)¶ Simplex distance
maplearn.ml.reduction module
Dimensionality reduction
The number of dimensions is reduced by selecting some of the features (like in the kbest approach) or transforming them (like in PCA…). This reduction is applied to samples and the data to predict in a further step.
Several approaches are available, which are listed in the class attribute “ALL_ALGOS”.
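Feature selection can be sketched in a few lines. In scikit-learn, the k-best approach is provided by `SelectKBest` together with a score function; the block below swaps in a simpler score, the absolute Pearson correlation between each feature and the target, to stay library-free. The names `pearson` and `select_k_best` are hypothetical and the data is made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def select_k_best(X, y, k):
    """Keep the k columns of X most correlated (in absolute value) with y."""
    columns = list(zip(*X))
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in X]
    return keep, reduced


X = [[1, 5, 0.3], [2, 5, 0.1], [3, 5, 0.4], [4, 5, 0.2]]
y = [10, 20, 30, 40]
keep, reduced = select_k_best(X, y, k=1)
print(keep, reduced)
```

The same column indices must then be applied to the data to predict, so that samples and unknown data share the same reduced feature space.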
-
class
maplearn.ml.reduction.
Reduction
(data=None, algorithm=None, **kwargs)¶ Bases:
maplearn.ml.machine.Machine
This class reduces the number of dimensions by selecting some of the features or transforming them (like in PCA…). This reduction is applied to samples and the data to predict in a further step.
- Args:
- data (PackData): dataset to reduce
- algorithm (list): list of algorithm(s) to apply on dataset
- **kwargs: parameters about the reduction (number of components) or the dataset (like features)
- Attributes:
- attributes inherited from Machine class
- ncomp (int): number of components expected
-
fit_1
(algo)¶ Fits one reduction algorithm to the dataset
- Args:
- algo (str): name of the algorithm to fit
-
load
(data)¶ Loads dataset to reduce
- Args:
- data (PackData): dataset to load
-
predict_1
(algo)¶ Applies chosen way of reduction to the dataset
- Args:
- algo (str): name of the algorithm to apply
-
run
(predict=True, ncomp=None)¶ Executes reduction of dimensions (fits and applies)
- Args:
- predict (bool): should the reduction be applied, or should the algorithm just be fitted?
- ncomp (int): number of dimensions expected
- Returns:
- array: reduced features data
- array: reduced samples features
- list: list of features
maplearn.ml.regression module
Regression
In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.
Regression analysis is supervised and needs samples for fitting. The output will be a matrix with float values.
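The fit-then-predict cycle of a regression can be illustrated with ordinary least squares on a single independent variable. This is a minimal sketch under that assumption; the function names `fit_line` and `predict` are hypothetical and unrelated to the algorithms available in maplearn.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, xs):
    """Apply a fitted (slope, intercept) model to new values."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Samples: the dependent variable follows y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
model = fit_line(xs, ys)
print(model)
print(predict(model, [5.0, 6.0]))
```

The predicted values are floats on a continuous scale, in contrast with the integer class codes produced by classification and clustering.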
Example:
>>> import os
>>> from maplearn.datahandler.loader import Loader
>>> from maplearn.datahandler.packdata import PackData
>>> from maplearn.ml.regression import Regression
>>> loader = Loader('boston')
>>> data = PackData(X=loader.X, Y=loader.Y, data=loader.aData)
>>> reg = Regression(data=data, dirOut=os.path.join('maplearn_path', 'tmp'))
>>> # algo: name (str) of the regression algorithm to fit
>>> reg.fit_1(algo)
-
class
maplearn.ml.regression.
Regression
(data=None, algorithm=None, **kwargs)¶ Bases:
maplearn.ml.machine.Machine
Applies regression using 1 or several algorithm(s) onto a specified dataset
- Args:
- data (PackData): dataset to play with
- algorithm (list or str): name of the algorithm(s) to use
- **kwargs: more parameters like k-fold
Attributes and properties are inherited from Machine class
-
fit_1
(algo)¶ Fits one regression algorithm
- Arg:
- algo (str): name of the algorithm to fit
-
load
(data)¶ Loads necessary data for regression, with samples (labels are float values).
- Arg:
- data (PackData): data to play with
- Returns:
- int: 0 if the data loaded correctly, a non-zero value otherwise
- TODO:
- checks a few things when loading…
-
optimize
(algo)¶ Optimize parameters of a regression algorithm
- Args:
- algo (str): name of the regressor to use
-
predict_1
(algo, proba=False)¶ Predicts Y using one regressor (specified by algo)
Args:
- algo (str): key of the regressor to use
- proba (bool): should probabilities (if available) given by algorithm be added to result?
-
run
(predict=False)¶ Applies every regressor specified in the ‘algorithm’ property
- Args:
- predict (bool): should the regressor only be fitted, or also used to predict?
maplearn.ml.machine module
Machine Learning class
Fits and predicts results using one or several machine learning algorithm(s).
This is an abstract class that should not be used directly. Use one of these classes instead:
- Classification: supervised classification
- Clustering: unsupervised classification
- Regression: regression
- Reduction: to reduce dimensions of a dataset
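The relationship between an abstract processor and its concrete subclasses can be sketched as follows. Everything in this block is hypothetical (`MachineSketch`, `ConstantSketch`, the `VALUES` table): it only illustrates the pattern of a shared `run` loop calling per-algorithm `fit_1` and `predict_1` methods, and makes no claim about how Machine is actually implemented.

```python
from abc import ABC, abstractmethod


class MachineSketch(ABC):
    """Hypothetical stand-in for an abstract multi-algorithm processor."""

    def __init__(self, algorithm=None):
        # Accept a single algorithm name or a list of names.
        if algorithm is None:
            algorithm = []
        elif isinstance(algorithm, str):
            algorithm = [algorithm]
        self.algorithm = list(algorithm)
        self.result = {}

    @abstractmethod
    def fit_1(self, algo):
        """Fit one algorithm, identified by its name."""

    @abstractmethod
    def predict_1(self, algo):
        """Predict with one fitted algorithm."""

    def run(self, predict=False):
        """Fit every listed algorithm and optionally predict with it."""
        for algo in self.algorithm:
            self.fit_1(algo)
            if predict:
                self.result[algo] = self.predict_1(algo)
        return self.result


class ConstantSketch(MachineSketch):
    """Toy subclass: each 'algorithm' just returns a fixed value."""

    VALUES = {"zero": 0, "one": 1}

    def __init__(self, algorithm=None):
        super().__init__(algorithm)
        self.fitted = set()

    def fit_1(self, algo):
        self.fitted.add(algo)

    def predict_1(self, algo):
        return self.VALUES[algo]


machine = ConstantSketch(algorithm=["zero", "one"])
print(machine.run(predict=True))
```

Because `fit_1` and `predict_1` are abstract, instantiating the base class directly raises a `TypeError`, which mirrors the advice not to use the abstract class itself.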
-
class
maplearn.ml.machine.
Machine
(data=None, algorithm=None, **kwargs)¶ Bases:
object
Class to apply one or several machine learning algorithm(s) on a given dataset.
Args:
- data (PackData): data to use with machine learning algorithm(s)
- algorithm (list or str): algorithm(s) to use
Attributes:
- algo (str): key code of the currently used algorithm
- result (dataframe): result(s) predicted by algorithm(s)
- proba (dataframe): probabilities produced by some algorithm(s)
Properties:
- algorithm (list): machine learning algorithm(s) to use
-
ALL_ALGOS
= {}¶
-
algorithm
¶ Gets the list of algorithms that will be used when running the class
-
fit_1
(algo)¶ Fits an algorithm to dataset
-
load
(data)¶ Loads necessary data to machine learning algorithm(s)
Args:
- data (PackData): dataset used by machine learning algorithm(s)
-
predict_1
(algo, export=False)¶ Predict a result using a given algorithm
Args:
- algo (str): key name identifying the algorithm to use
- export (bool): should the result be exported?
-
run
(predict=False)¶ Apply machine learning task(s) using every specified algorithm(s)
Args:
- predict (boolean): should machine learning algorithm(s) be used to predict results (or just be fitted to samples)?