maplearn.datahandler package

Data handlers

Intermediary classes that sit between file(s) and the dataset

  • packdata: creates a dataset with samples and data
  • labels: labels associated with features (in samples)
  • loader: loads data from a file or known datasets
  • writer: writes data into a file
  • signature: graphs a dataset
  • plotter: generic class to make charts

Submodules

maplearn.datahandler.packdata module

Machine Learning dataset

A machine learning dataset is classically a table where:

  • columns are all variables that can be used by machine learning algorithms
  • rows correspond to the individuals

Variables

The variables fall into two categories:

  1. the variables for which you have information: these are the predictors (or features)
  2. the variable to predict, also called label

Individuals

  • The individuals for whom you know the label are called samples.
  • The others are simply called data.
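
As a minimal sketch of this vocabulary, individuals can be split into samples and data depending on whether their label is known. The table and the helper name `split_samples_and_data` below are made up for illustration and are not part of maplearn.

```python
# Hypothetical table: each row is an individual, the last value is its label
# (None when the label is unknown).
table = [
    (5.1, 3.5, "setosa"),
    (6.2, 2.9, None),
    (5.9, 3.0, "virginica"),
    (4.7, 3.2, None),
]


def split_samples_and_data(rows):
    """Split rows into samples (X, Y) and unlabelled data."""
    X, Y, data = [], [], []
    for *features, label in rows:
        if label is None:
            data.append(features)   # label unknown: to be predicted
        else:
            X.append(features)      # label known: usable for fitting
            Y.append(label)
    return X, Y, data


X, Y, data = split_samples_and_data(table)
print(len(X), len(Y), len(data))
```

Here X and Y hold the two labelled individuals (the samples) and data holds the two unlabelled ones.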
class maplearn.datahandler.packdata.PackData(X=None, Y=None, data=None, **kwargs)

Bases: object

PackData: a container for datasets

A PackData contains:

  • samples (Y and X) to fit algorithm(s)
    • Y: a vector with samples’ labels
    • X: a matrix with samples’ features
  • data: 2d matrix with features to use for prediction

PackData checks that samples are compatible with data (same features…) and that the dataset is compatible with machine learning algorithm(s).

Example:
>>> import numpy as np
>>> from maplearn.datahandler.packdata import PackData
>>> data = np.random.random((10, 5))
>>> x = np.random.random((10, 5))
>>> y = np.random.randint(1, 10, size=10)
>>> ds = PackData(x, y, data)
>>> print(ds)
Args:
  • X (array): 2d matrix with features of samples
  • Y (array): vector with labels of samples
  • data (array): 2d matrix with features
  • **kwargs: other parameters about dataset (features, na…)
Attributes:
  • not_nas: vector with non-NA indexes
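
The kind of compatibility check that PackData is described as performing can be sketched as follows. This is only an illustration under assumptions: the function name `check_compatibility` and the exact rules are hypothetical and do not reproduce the library's implementation.

```python
import numpy as np


def check_compatibility(X, Y, data):
    """Raise ValueError if samples and data cannot be used together."""
    X, Y, data = np.asarray(X), np.asarray(Y), np.asarray(data)
    if X.ndim != 2 or data.ndim != 2:
        raise ValueError("X and data must be 2d matrices")
    if Y.ndim != 1:
        raise ValueError("Y must be a vector")
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must describe the same number of samples")
    if X.shape[1] != data.shape[1]:
        raise ValueError("samples and data must share the same features")
    return True


rng = np.random.default_rng(0)
X = rng.random((10, 5))
Y = rng.integers(1, 10, size=10)
data = rng.random((20, 5))
print(check_compatibility(X, Y, data))
```

Note that data may contain any number of rows; only the number of columns (features) has to match X.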
X

X (array): 2d matrix with features of samples

Y

Y (array): vector with labels of samples

balance(seuil=None)

Balances samples by removing some individuals from the largest classes.

Args:
  • seuil (int): threshold, i.e. the maximum number of samples kept in a class
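
A minimal sketch of balancing by capping class size is given below. How maplearn chooses which individuals to remove is not documented here, so random undersampling is an assumption, and the function `undersample` is hypothetical.

```python
import random
from collections import Counter


def undersample(X, Y, threshold, seed=0):
    """Keep at most `threshold` samples per class (random undersampling)."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(Y):
        by_class.setdefault(label, []).append(i)
    kept = []
    for indexes in by_class.values():
        if len(indexes) > threshold:
            indexes = rng.sample(indexes, threshold)
        kept.extend(indexes)
    kept.sort()  # preserve the original order of the samples
    return [X[i] for i in kept], [Y[i] for i in kept]


Y = [1] * 8 + [2] * 3 + [3] * 5          # made-up, unbalanced labels
X = [[float(i)] for i in range(len(Y))]  # one dummy feature per sample
X_bal, Y_bal = undersample(X, Y, threshold=4)
print(Counter(Y_bal))
```

Classes with fewer samples than the threshold (class 2 here) are left untouched.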
classes

dict: labels classes and associated number of individuals

data

data (array): 2d matrix with features

features

list: list of features of the dataset

load(X=None, Y=None, data=None, features=None)

Loads data into the PackData

Args:
  • X (array): 2d matrix with features of samples
  • Y (array): vector with labels of samples
  • data (array): 2d matrix with features
  • features (list): list of features
plot(prefix='sig')

Plots the dataset (signature):

  • one chart for all the samples
  • one chart per class of samples

Args:
  • prefix (str): prefix of output files to save charts in
reduit(meth='lda', ncomp=None)

Reduces number of dimensions of data and X

Args:
  • meth (str): reduction method to apply
  • ncomp (int): number of dimensions expected
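
To illustrate what reducing the dimensions of both X and data means, here is a sketch that uses PCA computed with NumPy. PCA is swapped in because it needs no labels and no extra dependency; it is not the default 'lda' method of `reduit`, and the function `reduce_pca` is hypothetical.

```python
import numpy as np


def reduce_pca(X, data, ncomp):
    """Project X and data on the first `ncomp` principal components of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:ncomp]
    return (X - mean) @ components.T, (data - mean) @ components.T


rng = np.random.default_rng(0)
X = rng.random((10, 5))
data = rng.random((20, 5))
X_red, data_red = reduce_pca(X, data, ncomp=2)
print(X_red.shape, data_red.shape)
```

The same projection is applied to X and data so that a model fitted on the reduced samples can still be applied to the reduced data.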
scale()

Normalizes data and X matrices
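
The documentation does not say which normalization is applied. As an illustration only, the sketch below assumes standardization (zero mean, unit variance) with statistics shared by X and data; the function `standardize` is hypothetical.

```python
import numpy as np


def standardize(X, data):
    """Center and scale X and data with statistics shared by both."""
    stacked = np.vstack([X, data])
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (X - mean) / std, (data - mean) / std


X = np.array([[1.0, 10.0], [2.0, 10.0]])
data = np.array([[3.0, 10.0], [4.0, 10.0]])
X_scaled, data_scaled = standardize(X, data)
print(np.vstack([X_scaled, data_scaled]).mean(axis=0))
```

Using the same mean and standard deviation for both matrices keeps samples and data on a common scale.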

separability(metric='euclidean')

Performs separability analysis between samples

Args:
  • metric (str): name of the distance used
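
One simple way to look at separability is to measure the distance between class centroids. The sketch below does this with the Euclidean distance; it is an assumption about what such an analysis may involve, not the method used by maplearn, and `centroid_distances` is a hypothetical helper.

```python
import math
from itertools import combinations


def centroid_distances(X, Y):
    """Euclidean distance between the mean feature vectors of each class pair."""
    groups = {}
    for row, label in zip(X, Y):
        groups.setdefault(label, []).append(row)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


X = [[0.0, 0.0], [0.0, 2.0], [3.0, 5.0], [3.0, 5.0]]  # made-up features
Y = ["a", "a", "b", "b"]                              # made-up labels
distances = centroid_distances(X, Y)
print(distances)
```

The larger the distance between two classes, the easier they should be to tell apart with these features.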

maplearn.datahandler.labels module

Labels

This class handles labels associated with features in samples:

  • counts how many samples belong to each class
class maplearn.datahandler.labels.Labels(Y, codes=None, output=None)

Bases: object

Sample labels used in the PackData class

Args:
  • Y (array): vector with samples’ labels
  • codes (dict): dictionary with label codes and associated descriptions
Attributes:
  • summary ()
  • dct_codes (dict): dictionary with label codes and associated descriptions
Property:
  • Y (array): vector containing labels of samples (codes)
Y

Samples (as a vector)

convert()

Conversion between codes

count()

Summarizes labels of each class (how many samples for each class)

libelle2code()

Converts labels’ names into corresponding codes
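
The two operations above (converting label names into codes, then counting samples per class) can be sketched with plain Python. The nomenclature and label names are made up, and this does not reproduce the internals of the Labels class.

```python
from collections import Counter

# Hypothetical nomenclature: label code -> description.
codes = {1: "forest", 2: "water", 3: "crop"}
names = ["forest", "water", "water", "crop", "forest", "water"]

# Invert the dictionary to go from names to codes.
name_to_code = {name: code for code, name in codes.items()}
Y = [name_to_code[name] for name in names]

# Number of samples per class code.
counts = dict(Counter(Y))
print(Y)
print(counts)
```

Inverting the dictionary only works when each description is unique, which is assumed here.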

maplearn.datahandler.loader module

Loads data from a file

This class aims to feed a PackData. It gathers data from one or more files, or uses known datasets stored in a library.

class maplearn.datahandler.loader.Loader(source, **kwargs)

Bases: object

Loads data from a file or a known dataset

Args:
  • source (str): path to the file to load or name of a dataset (“iris” for example)
  • **kwargs: other attributes to drive loading (handles NA, labels…)
Attributes:
  • src (dict): information about the source (type, path…)
  • X: samples’ features
  • Y: samples’ labels
  • aData: data to predict
  • matrix: (needed when loading from a raster file)
  • features
  • nomenclature
Examples:
  • Loading data from a known dataset:

    >>> from maplearn.datahandler.loader import Loader
    >>> ldr = Loader('iris')
    >>> print(ldr)
    >>> print(ldr.X, ldr.Y)
    >>> print(ldr.aData)
    
  • Loading data from a file (here an Excel file):

    >>> import os
    >>> ldr = Loader(os.path.join('maplearn_path', 'datasets',
    ...                           'ex1.xlsx'))
    >>> print(ldr)
    >>> print(ldr.X, ldr.Y)
    
X

Matrix of values corresponding to samples

Y

Vector of labels describing samples. Values to be predicted by machine learning algorithm

aData

Data to predict

df

Dataframe loaded

features

List of features contained in the dataset

matrix

Data served as a matrix. Needed when loading data from an image

nomenclature

Legend of labels. Dictionary mapping label codes to their corresponding names

run(**kwargs)

Gets samples (X with features and Y containing labels)

Args:
  • **kwargs:
    • features (list): features to load
    • label (str): column with class labels (description)
    • label_id (str): column with labels codes
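
To make the role of these three keyword arguments concrete, the sketch below extracts X, Y and a nomenclature from a small made-up table using only the standard library. The column names and the helper `get_samples` are hypothetical; the real `run` method works on the source given to the Loader.

```python
import csv
import io

# Made-up table standing in for a loaded file.
raw = io.StringIO(
    "b1,b2,b3,class,class_id\n"
    "0.1,0.4,0.7,forest,1\n"
    "0.2,0.5,0.8,water,2\n"
)


def get_samples(rows, features, label, label_id):
    """Return X, Y and the code -> name nomenclature from dict rows."""
    X, Y, nomenclature = [], [], {}
    for row in rows:
        X.append([float(row[f]) for f in features])  # selected features only
        code = int(row[label_id])
        Y.append(code)
        nomenclature[code] = row[label]
    return X, Y, nomenclature


X, Y, nomenclature = get_samples(
    csv.DictReader(raw), features=["b1", "b3"], label="class", label_id="class_id"
)
print(X, Y, nomenclature)
```

Here `features` selects the columns that go into X, `label_id` gives the codes stored in Y, and `label` provides the descriptions used to build the nomenclature.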

maplearn.datahandler.writer module

Writes data into a file

This class is to be used with PackData. It writes data into a single file (several formats are supported).

class maplearn.datahandler.writer.Writer(path=None, **kwargs)

Bases: object

Writes data in a file (different formats available)

Args:
  • path (str): path to the file to write data into
  • **kwargs:
    • origin (str): path to the original file used as a model
path
run(data, path=None, na=None, dtype=None)

Writes data into a file

Args:
  • data (pandas dataframe): dataset to write
  • path (str): path to the file to write data into
  • na : value used as a code for “NoData”
  • dtype (np.dtype): desired data type
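
The effect of the `na` and `dtype` arguments can be illustrated with NumPy. This sketch assumes that missing values are stored as NaN and replaced by the `na` code before casting; the helper `prepare_for_writing` is hypothetical and is not how Writer is necessarily implemented.

```python
import numpy as np


def prepare_for_writing(values, na=None, dtype=None):
    """Replace NaN by the `na` code, then cast to `dtype` if given."""
    out = np.array(values, dtype=float)
    if na is not None:
        out[np.isnan(out)] = na
    if dtype is not None:
        out = out.astype(dtype)
    return out


values = [[1.0, np.nan], [3.0, 4.0]]
result = prepare_for_writing(values, na=-9999, dtype=np.int16)
print(result)
```

Replacing NaN before the cast matters because NaN has no integer representation.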

maplearn.datahandler.signature module

Signature

This class makes charts about a dataset:

  • spectral signature
  • temporal signature
Example:
>>> from maplearn.datahandler.loader import Loader
>>> from maplearn.datahandler.signature import Signature
>>> ldr = Loader('iris')
>>> sig = Signature(ldr.X)
>>> sig.plot(title='test')
class maplearn.datahandler.signature.Signature(data, features=None, model='boxplot', output=None)

Bases: object

Makes charts about a dataset:

  • one global graph
  • one graph per class in samples (if samples are available)
Args:
  • data (array or DataFrame): data to plot
  • features (list): name of columns
  • model (str): how to plot signature (plot or boxplot)
  • output (str): path to the output directory where plots will be saved
plot(title='Signature du jeu de donnees', file=None)

Plots the (spectral) signature of data as boxplots or points, depending on the number of features

Args:
  • title (str): title to add to the plot
  • file (str): name of the output file
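
A boxplot signature summarizes each feature by a few statistics. The sketch below computes a five-number summary per feature with the standard library, without drawing anything. The data, the feature names and the helper `signature_summary` are made up for illustration.

```python
import statistics


def signature_summary(data, features):
    """Per-feature minimum, quartiles and maximum (the basis of a boxplot)."""
    summary = {}
    for name, column in zip(features, zip(*data)):
        q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
        summary[name] = (min(column), q1, median, q3, max(column))
    return summary


data = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]  # rows are individuals
summary = signature_summary(data, features=["b1", "b2"])
print(summary)
```

Plotting these summaries side by side, one box per feature, gives the kind of chart the boxplot model refers to.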
plot_class(data_class, label='', file=None)

Plots the signature of one class over that of the whole dataset

Args:
  • data_class (dataframe): data of one class
  • label (str): label of the class to plot
  • file (str): path to the file to save the chart in

maplearn.datahandler.plotter module