maplearn.ml package

Machine Learning

What is Machine Learning?

From Wikipedia: “Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task.”

_images/machine_learning_en.svg

So, we use Machine Learning to predict results about unknown data:

  • Is this new email spam?
  • Is this an image of a cat or a dog?
  • How many people are going to buy my new product?
  • Applications are infinite…

To answer these questions, we use mathematical models (the cloud in the above figure) that need to be trained (or fitted) before they can make predictions.

What to predict?

Depending on the nature of the values to be predicted, we will talk about:

  • classification when the values are discrete (also called categorical)
  • regression when the values are continuous
_images/classif_reg.png

Classification and regression both need samples for training: they belong to supervised learning. If you do not have samples, you should consider unsupervised classification, also called clustering.

Note

Regression, on the other hand, has no unsupervised counterpart: it cannot be performed without samples.
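As a rough illustration of the rule above, the sketch below picks a task from the sample labels alone. `suggest_task` is a hypothetical helper written for this page, not part of maplearn, and it assumes that integer or string labels are discrete while float labels are continuous.

```python
def suggest_task(labels):
    """Suggest a learning task from sample labels (hypothetical helper, not in maplearn)."""
    if not labels:
        # No samples at all: only unsupervised learning is possible
        return "clustering"
    if all(isinstance(value, (int, str)) for value in labels):
        # Discrete (categorical) values
        return "classification"
    # Continuous values
    return "regression"


print(suggest_task([]))              # no samples
print(suggest_task(["cat", "dog"]))  # discrete labels
print(suggest_task([0.5, 1.2, 3.4])) # continuous labels
```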

Maplearn: machine Learning modules

_images/logo_scikit-learn.png

In maplearn, machine learning is powered by scikit-learn. One reason for this choice is its excellent documentation, which is worth reading to go further.

Maplearn provides 3 modules corresponding to each of these tasks:

  1. Classification
  2. Clustering
  3. Regression

Two other modules are linked to these tasks:

  • Confusion: confusion matrix (used to evaluate classifications)
  • Distance: computes distance using different formulas

Another task that machine learning can accomplish is reducing the number of dimensions (also called features):

  • Reduction: dimensionality reduction

The last submodule is needed internally but should not be used directly:

  • Machine: abstract class of a machine learning processor, one or more algorithms can be applied

Submodules

maplearn.ml.classification module

Classification

Classification methods are used to generate a map with each pixel assigned to a class based on its multispectral composition. The classes are determined based on the spectral composition of training areas defined by the user.

Classification is supervised and needs samples to fit on. The output will be a matrix with integer values.

Example:
>>> import os
>>> from maplearn.datahandler.loader import Loader
>>> from maplearn.datahandler.packdata import PackData
>>> from maplearn.ml.classification import Classification
>>> loader = Loader('iris')
>>> data = PackData(X=loader.X, Y=loader.Y, data=loader.aData)
>>> lst_algos = ['knn', 'lda', 'rforest']
>>> # 'maplearn_path' is a placeholder: replace it with a real directory
>>> dir_out = os.path.join('maplearn_path', 'tmp')
>>> clf = Classification(data=data, dirOut=dir_out, algorithm=lst_algos)
>>> clf.run()
class maplearn.ml.classification.Classification(data=None, algorithm=None, **kwargs)

Bases: maplearn.ml.machine.Machine

Apply supervised classification onto a dataset:

  • samples needed for fitting
  • data to predict
Args:
  • data (PackData): data to play with
  • algorithm (list or str): name of an algorithm or list of algorithm(s)
  • **kwargs: other parameters like kfold
export_tree(out_file=None)

Exports a decision tree

Args:
  • out_file (str): path to the output file
fit_1(algo, verbose=True)

Fits a classifier using cross-validation

Arg:
  • algo (str): name of the classifier
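To make the idea of cross-validation concrete, here is a minimal k-fold split written with the standard library only. It is a sketch under assumptions: `kfold_indices` is a hypothetical name, the folds are contiguous and unshuffled, and the cross-validation actually performed by `fit_1` may differ (for instance by shuffling or stratifying).

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample is used exactly once for testing, and a classifier would be fitted on `train` and scored on `test` for every fold.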
load(data)

Loads necessary data for supervised classification:

  • samples (X and Y): necessary for fitting
  • other (unknown) data to predict, after fitting
Args:
  • data (PackData)
optimize(algo)

Optimize parameters of a classifier

Args:
  • algo (str): name of the classifier to use
predict_1(algo, proba=True, verbose=True)

Predict classes using a fitted algorithm applied to unknown data

Args:
  • algo (str): name of the algorithm to apply
  • proba (bool): should probabilities be added to result
run(predict=False, verbose=True)

Applies every classifier specified in the ‘algorithm’ property

Args:
  • predict (bool): should the classifier only be fitted, or also used to predict?
maplearn.ml.classification.lcs_kernel(x, y)

Custom kernel based on LCS (Longest Common Subsequence)

Args:
  • x and y (matrices)
Returns:
matrix of float values
maplearn.ml.classification.skreport_md(report)

Converts a classification report given by scikit-learn into a markdown table

TODO:
  • should be replaced by a pandas dataframe

Arg:
  • report (str): classification report
Returns:
str_table: a table formatted as markdown
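The conversion can be sketched as follows. This is not the code of `skreport_md`: `report_to_markdown` is a hypothetical function, and `SAMPLE_REPORT` is a hand-written string that imitates the text layout of scikit-learn's classification report.

```python
SAMPLE_REPORT = """
              precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.67      1.00      0.80         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4
"""


def _is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False


def report_to_markdown(report):
    """Turn a text classification report into a markdown table (sketch)."""
    lines = [line for line in report.splitlines() if line.strip()]
    columns = lines[0].split()  # precision, recall, f1-score, support
    table = ["| class | " + " | ".join(columns) + " |",
             "|" + " --- |" * (len(columns) + 1)]
    for line in lines[1:]:
        tokens = line.split()
        # Count trailing numeric tokens, keeping at least one token for the label
        n_values = 0
        while (n_values < min(len(columns), len(tokens) - 1)
               and _is_number(tokens[-1 - n_values])):
            n_values += 1
        label = " ".join(tokens[:len(tokens) - n_values])
        values = tokens[len(tokens) - n_values:]
        # Rows such as "accuracy" have fewer values: pad them on the left
        values = [""] * (len(columns) - len(values)) + values
        table.append("| " + " | ".join([label] + values) + " |")
    return "\n".join(table)


markdown = report_to_markdown(SAMPLE_REPORT)
print(markdown)
```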
maplearn.ml.classification.svm_kernel(x, y)

Custom kernel based on DTW (Dynamic Time Warping)

Args:
  • x and y (matrices)
Returns:
matrix of float values

maplearn.ml.clustering module

Clustering (unsupervised classification)

A clustering algorithm groups the given samples, each represented as a vector x in the N-dimensional feature space, into a set of clusters according to their spatial distribution in the N-D space. Clustering is an unsupervised classification as no a priori knowledge (such as samples of known classes) is assumed to be available.

Clustering is unsupervised and does not need samples for fitting. The output will be a matrix with integer values.
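To illustrate how samples can be grouped without any labels, here is a plain k-means (Lloyd's algorithm) sketch using the standard library only. It is an illustration, not one of the algorithms shipped with maplearn: the function names, the naive initialisation from the first points and the toy data are all assumptions.

```python
def sq_dist(a, b):
    """Squared euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(points, n_clusters, n_iter=20):
    """Group points into n_clusters clusters; returns (labels, centers)."""
    # Naive, deterministic initialisation: the first points become the centers
    centers = [list(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        labels = [min(range(n_clusters), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centers = kmeans(points, n_clusters=2)
print(labels)
```

As in the description above, the output is a vector of integer cluster codes, one per sample.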

Example:
>>> import os
>>> from maplearn.datahandler.loader import Loader
>>> from maplearn.datahandler.packdata import PackData
>>> from maplearn.ml.clustering import Clustering
>>> loader = Loader('iris')
>>> data = PackData(X=loader.X, Y=loader.Y, data=loader.aData)
>>> lst_algos = ['mkmeans', 'birch']
>>> # 'maplearn_path' is a placeholder: replace it with a real directory
>>> dir_out = os.path.join('maplearn_path', 'tmp')
>>> cls = Clustering(data=data, dirOut=dir_out, algorithm=lst_algos)
>>> cls.run()
class maplearn.ml.clustering.Clustering(data=None, algorithm=None, **kwargs)

Bases: maplearn.ml.machine.Machine

Apply one or several methods of clustering onto a dataset

Args:
  • data (PackData): dataset to play with
  • algorithm (str or list): name of algorithm(s) to use
  • **kwargs: more parameters about clustering, such as the ‘metric’ to use or the number of clusters expected (‘n_clusters’)
fit_1(algo, verbose=True)

Fits one clustering algorithm

Arg:
  • algo (str): name of the algorithm to fit
load(data)

Loads necessary data for clustering: no samples are needed.

Arg:
  • data (PackData): data to play with
predict_1(algo, export=False, verbose=True)

Makes a clustering prediction using one algorithm

Args:
  • algo (str): name of the algorithm to use
  • export (bool): should the result be exported?

maplearn.ml.confusion module

Confusion matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a classification algorithm (see the ‘Classification’ class).

Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes.
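The layout described above (rows for actual classes, columns for predicted classes) can be sketched with the standard library. `confusion_matrix` here is a hypothetical helper for illustration, not the method used by the `Confusion` class.

```python
def confusion_matrix(y_true, y_pred):
    """Return (labels, matrix) with actual classes as rows, predicted as columns."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix


y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]
cm_labels, cm = confusion_matrix(y_true, y_pred)
print(cm_labels)
for row in cm:
    print(row)
```

Off-diagonal cells show the confusions: here one cat was predicted as a dog, and one dog as a cat.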

Example:
>>> import numpy as np
>>> from maplearn.ml.confusion import Confusion
>>> # creates 2 vectors representing labels
>>> y_true = np.random.randint(0, 15, 100)
>>> y_pred = np.random.randint(0, 15, 100)
>>> cm = Confusion(y_true, y_pred)
>>> cm.calcul_matrice()
>>> cm.calcul_kappa()
>>> print(cm)
class maplearn.ml.confusion.Confusion(y_sample, y_predit, fTxt=None, fPlot=None)

Bases: object

Computes confusion matrix based on 2 vectors of labels:

  1. labels of known samples
  2. predicted labels
Args:
  • y_sample (vector): vector with known labels
  • y_predit (vector): vector with predicted labels
  • fTxt (str): path to the text file to write confusion matrix into
  • fPlot (str): path to the chart file to plot the confusion matrix into
Attributes:
  • y_sample (vector): true labels (ground data)
  • y_predit (vector): corresponding predicted labels
  • cm (matrix): confusion matrix filled with integer values
  • kappa (float): kappa index
  • score (float): precision score
TODO:
  • y_sample and y_predit should be renamed y_true and y_pred
calcul_matrice()

Computes a confusion matrix and displays the result

Returns:
  • matrix (integer): confusion matrix
  • float: kappa index
export(fTxt=None, fPlot=None, title=None)

Saves confusion matrix in:

  • a text file
  • a graphic file
Args:
  • fTxt (str): path to the output text file
  • fPlot (str): path to the output graphic file
  • title (str): title of the chart
kappa

Computes kappa index based on 2 vectors

Returns:
  • float: kappa index
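Assuming the kappa index is Cohen's kappa (the usual choice for evaluating classifications, but an assumption here), it can be derived from a confusion matrix as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below uses a hypothetical `cohen_kappa` function and a small hand-made matrix.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal
    observed = sum(matrix[i][i] for i in range(n_classes)) / total
    # Chance agreement: product of row and column marginals, summed over classes
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n_classes)
    ) / total ** 2
    # Note: undefined (division by zero) when expected == 1, e.g. a single class
    return (observed - expected) / (1 - expected)


kappa_partial = cohen_kappa([[1, 1], [1, 2]])
kappa_perfect = cohen_kappa([[2, 0], [0, 3]])
print(round(kappa_partial, 3), kappa_perfect)
```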
maplearn.ml.confusion.confusion_cl(cm, labels, os1, os2)

Computes confusion between 2 given classes (expressed in percentage) based on a confusion matrix

Args:
  • cm (matrix): confusion matrix
  • labels (array): vector of labels
  • os1 and os2 (int): codes of the classes
Returns:
  • float: confusion percentage between 2 classes

maplearn.ml.distance module

Distance

Computes pairwise distances between 2 matrices, using one of several metrics (euclidean is the default)

Example:
>>> import numpy as np
>>> from maplearn.ml.distance import Distance
>>> y1 = np.random.random(50)
>>> y2 = np.random.random(50)
>>> dist = Distance(y1, y2)
>>> dist.run()
class maplearn.ml.distance.Distance(x=None, y=None)

Bases: object

Computes pairwise distance between 2 matrices (x and y)

Args:
  • x (matrix)
  • y (matrix)
compare(x=None, y=None, methods=[])

Compares the pairwise distances obtained with different metrics

Args:
  • x and y (matrices)
  • methods (list): list of metrics used to compute pairwise distance. If empty, every available metric is compared
dtw(x=None, y=None)

Dynamic Time-Warping distance
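For readers unfamiliar with Dynamic Time Warping, here is the classic dynamic-programming formulation for two 1-D sequences. It is a textbook sketch with a hypothetical function name, using the absolute difference as local cost; the `dtw` method may use another local cost or add constraints.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences (textbook version)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] repeated
                                     cost[i][j - 1],      # b[j-1] repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # same shape, stretched in time
print(dtw_distance([0, 0], [1, 1]))
```

Unlike the euclidean distance, DTW tolerates sequences of different lengths and local shifts in time.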

lcs(x=None, y=None, eps=10, delta=3)

Distance based on Longest Common Subsequence
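A common formulation of this distance for numeric sequences is sketched below. The meaning given to the parameters is an assumption, not taken from maplearn's code: `eps` is read as the largest value difference for two points to match, and `delta` as the largest index offset allowed between matched points. The function names and the normalisation by the shorter length are also assumptions.

```python
def lcss_length(a, b, eps, delta):
    """Length of the longest common subsequence under eps/delta tolerances."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps and abs(i - j) <= delta:
                # Points match: extend the common subsequence
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcss_distance(a, b, eps=10, delta=3):
    """Distance in [0, 1]: 0 when the shorter sequence is fully matched."""
    return 1.0 - lcss_length(a, b, eps, delta) / min(len(a), len(b))


print(lcss_distance([1, 2, 3, 4], [1, 50, 3, 4], eps=0.5, delta=1))
```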

run(x=None, y=None, meth='euclidean')

Distance calculation according to a specified method

Args:
  • x (matrix)
  • y (matrix)
  • meth (str): name of the metric distance to use
Returns:
matrix of pairwise distance values
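The shape of the returned matrix can be illustrated with the default euclidean metric. This is a standard-library sketch with a hypothetical function name, not the implementation of `run`: cell (i, j) holds the distance between row i of x and row j of y.

```python
import math


def pairwise_euclidean(x, y):
    """Matrix of euclidean distances between every row of x and every row of y."""
    return [
        [math.sqrt(sum((u - v) ** 2 for u, v in zip(row_x, row_y))) for row_y in y]
        for row_x in x
    ]


x = [(0, 0), (3, 4)]
y = [(0, 0), (6, 8)]
distances = pairwise_euclidean(x, y)
for row in distances:
    print(row)
```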
simplex(x=None, y=None, sigma=50)

Simplex distance

maplearn.ml.reduction module

Dimensionality reduction

The number of dimensions is reduced by selecting some of the features (as in the kbest approach) or by transforming them (as in PCA…). This reduction is applied to the samples and to the data to predict in a further step.

Several approaches are available; they are listed in the class attribute “ALL_ALGOS”.
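The selection approach can be sketched as: score every feature against the labels, then keep the k best-scoring ones. The example below uses the absolute Pearson correlation as the score, which is an assumption chosen for simplicity (scikit-learn's SelectKBest accepts other score functions), and `select_k_best` is a hypothetical helper.

```python
import math


def pearson(u, v):
    """Pearson correlation between two sequences (0.0 if one of them is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v) if std_u and std_v else 0.0


def select_k_best(X, y, k):
    """Keep the k features most correlated with y; returns (reduced X, kept indices)."""
    columns = list(zip(*X))
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:k])  # preserve the original feature order
    return [[row[j] for j in kept] for row in X], kept


X = [[1, 5, 0.1], [2, 5, 0.9], [3, 5, 0.2], [4, 5, 0.8]]
y = [1, 2, 3, 4]
X_reduced, kept = select_k_best(X, y, k=2)
print(kept, X_reduced)
```

The constant second feature carries no information and is dropped. The same column indices would then be applied to the data to predict.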

class maplearn.ml.reduction.Reduction(data=None, algorithm=None, **kwargs)

Bases: maplearn.ml.machine.Machine

This class reduces the number of dimensions by selecting some of the features or by transforming them (as in PCA…). This reduction is applied to the samples and to the data to predict in a further step.

Args:
  • data (PackData): dataset to reduce
  • algorithm (list): list of algorithm(s) to apply on dataset
  • **kwargs: parameters about the reduction (number of components) or the dataset (like features)
Attributes:
  • attributes inherited from the Machine class
  • ncomp (int): number of components expected
fit_1(algo)

Fits one reduction algorithm to the dataset

Args:
  • algo (str): name of the algorithm to fit
load(data)

Loads dataset to reduce

Args:
  • data (PackData): dataset to load
predict_1(algo)

Applies chosen way of reduction to the dataset

Args:
  • algo (str): name of the algorithm to apply
run(predict=True, ncomp=None)

Executes reduction of dimensions (fits and applies)

Args:
  • predict (bool): should the reduction be applied, or should the algorithm just be fitted?
  • ncomp (int): number of dimensions expected
Returns:
  • array: reduced features data
  • array: reduced samples features
  • list: list of features

maplearn.ml.regression module

Regression

In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.

Regression analysis is supervised and needs samples for fitting. The output will be a matrix with float values.
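The simplest instance of this fit-then-predict process is an ordinary least-squares line with one independent variable. The sketch below uses only the standard library; `fit_line` and `predict_line` are hypothetical names, unrelated to the regressors available in maplearn.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_line(x_new, slope, intercept):
    """Predict continuous (float) values for unknown data."""
    return [slope * value + intercept for value in x_new]


# Samples: the dependent variable y is known for each x
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
predicted = predict_line([4, 5], slope, intercept)
print(slope, intercept, predicted)
```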

Example:

>>> import os
>>> from maplearn.datahandler.loader import Loader
>>> from maplearn.datahandler.packdata import PackData
>>> from maplearn.ml.regression import Regression
>>> loader = Loader('boston')
>>> data = PackData(X=loader.X, Y=loader.Y, data=loader.aData)
>>> reg = Regression(data=data, dirOut=os.path.join('maplearn_path', 'tmp'))
>>> algo = 'lm'  # hypothetical key: replace with the name of an available regressor
>>> reg.fit_1(algo)
class maplearn.ml.regression.Regression(data=None, algorithm=None, **kwargs)

Bases: maplearn.ml.machine.Machine

Applies regression using 1 or several algorithm(s) onto a specified dataset

Args:
  • data (PackData): dataset to play with
  • algorithm (list or str): name of the algorithm(s) to use
  • **kwargs: more parameters like k-fold

Attributes and properties are inherited from Machine class

fit_1(algo)

Fits one regression algorithm

Arg:
  • algo (str): name of the algorithm to fit
load(data)

Loads necessary data for regression, with samples (labels are float values).

Arg:
  • data (PackData): data to play with
Returns:
  • int: 0 if the data loaded correctly, a non-zero value otherwise
TODO:
  • check a few things when loading…
optimize(algo)

Optimize parameters of a regression algorithm

Args:
  • algo (str): name of the regressor to use
predict_1(algo, proba=False)

Predicts Y using one regressor (specified by algo)

Args:

  • algo (str): key of the regressor to use
  • proba (bool): should probabilities (if available) given by algorithm be added to result?
run(predict=False)

Applies every regressor specified in the ‘algorithm’ property

Args:
  • predict (bool): should the regressor only be fitted, or also used to predict?

maplearn.ml.machine module

Machine Learning class

Fits one or several machine learning algorithm(s) and predicts results with them.

This is an abstract class that should not be used directly. Instead, use one of these classes:

  • Classification: supervised classification
  • Clustering: unsupervised classification
  • Regression: regression
  • Reduction: to reduce dimensions of a dataset
class maplearn.ml.machine.Machine(data=None, algorithm=None, **kwargs)

Bases: object

Class to apply one or several machine learning algorithm(s) on a given dataset.

Args:

  • data (PackData): data to use with machine learning algorithm(s)
  • algorithm (list or str): algorithm(s) to use

Attributes:

  • algo (str): key code of the currently used algorithm
  • result (dataframe): result(s) predicted by algorithm(s)
  • proba (dataframe): probabilities produced by some algorithm(s)

Properties:

  • algorithm (list): machine learning algorithm(s) to use
ALL_ALGOS = {}
algorithm

Gets the list of algorithms that will be used when running the class

fit_1(algo)

Fits an algorithm to dataset

load(data)

Loads necessary data to machine learning algorithm(s)

Args:

  • data (PackData): dataset used by machine learning algorithm(s)
predict_1(algo, export=False)

Predict a result using a given algorithm

Args:

  • algo (str): key name identifying the algorithm to use
  • export (bool): should the result be exported?
run(predict=False)

Applies machine learning task(s) using every specified algorithm

Args:

  • predict (boolean): should the machine learning algorithm(s) be used to predict results, or just be fitted to samples?
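The relationship between `run`, `fit_1` and `predict_1` can be pictured with the sketch below. It is an assumption about the general pattern, not maplearn's actual code: `MachineSketch` and `ConstantSketch` are hypothetical classes, and the real `Machine` also handles data loading, export and probabilities.

```python
from abc import ABC, abstractmethod


class MachineSketch(ABC):
    """Hypothetical abstract base: loops over algorithms, subclasses do the work."""

    def __init__(self, algorithm=None):
        if algorithm is None:
            algorithm = []
        elif isinstance(algorithm, str):
            algorithm = [algorithm]  # accept a single name or a list of names
        self.algorithm = list(algorithm)
        self.result = {}

    @abstractmethod
    def fit_1(self, algo):
        """Fit one algorithm."""

    @abstractmethod
    def predict_1(self, algo):
        """Predict with one fitted algorithm."""

    def run(self, predict=False):
        for algo in self.algorithm:
            self.fit_1(algo)
            if predict:
                self.result[algo] = self.predict_1(algo)


class ConstantSketch(MachineSketch):
    """Toy subclass: every 'algorithm' predicts the most frequent training label."""

    def __init__(self, labels, n_unknown, algorithm=None):
        super().__init__(algorithm)
        self.labels, self.n_unknown = labels, n_unknown
        self.fitted = {}

    def fit_1(self, algo):
        self.fitted[algo] = max(set(self.labels), key=self.labels.count)

    def predict_1(self, algo):
        return [self.fitted[algo]] * self.n_unknown


machine = ConstantSketch([1, 1, 2], n_unknown=3, algorithm="dummy")
machine.run(predict=True)
print(machine.result)
```

Because the base class declares abstract methods, instantiating it directly raises a `TypeError`, which mirrors the advice not to use `Machine` itself.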