tdamapper

tdamapper.core

Core tools for creating and analyzing Mapper graphs.

The Mapper algorithm is a method for exploring the shape and structure of high-dimensional datasets, by constructing a graph representation called Mapper graph. The algorithm has three main steps:

Filtering: Apply a lens function (also called filter) to map the data points to a lower-dimensional space, such as a scalar value or a 2D plane.
Covering: Arrange the lens space into overlapping open sets, using a cover algorithm such as uniform intervals or balls.
Clustering: Group the data points in each open set into clusters, using a clustering algorithm such as single-linkage or DBSCAN.

The Mapper graph consists of nodes that represent clusters of data points, and edges that connect overlapping clusters (clusters obtained from different open sets can possibly overlap). For more details on the Mapper algorithm and its applications, see

Gurjeet Singh, Facundo Mémoli and Gunnar Carlsson, “Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition”, Eurographics Symposium on Point-Based Graphics, 2007.

This module provides the class tdamapper.core.MapperAlgorithm, which encapsulates the algorithm and its parameters. The Mapper graph produced by this module is a NetworkX graph object.

class tdamapper.core.MapperAlgorithm(cover, clustering)

Bases: object

A class for creating and analyzing Mapper graphs.

This class provides two methods fit() and fit_transform(). Once fitted, the Mapper graph is stored in the attribute graph_ as a networkx.Graph object.

This class adopts the same interface as scikit-learn’s estimators for ease and consistency of use. However, it’s important to note that this is not a proper scikit-learn estimator as it does not validata the input in the same way as a scikit-learn estimator is required to do. This class can work with datasets whose elements are arbitrary objects when feasible for the supplied parameters.

Parameters:

cover (A class compatible with tdamapper.cover.Cover) – The cover algorithm to apply to lens space.
clustering (A class from tdamapper.clustering, or a class from sklearn.cluster) – The clustering algorithm to apply to each subset of the dataset.

fit(X, y=None)

Create the Mapper graph and store it for later use.

This method stores the result of tdamapper.core.mapper_graph() in the attribute graph_ and returns a reference to the calling object.

Parameters:

X (array-like of shape (n, m) or list-like of length n) – The dataset to be mapped.
y (array-like of shape (n, k) or list-like of length n) – The lens values for each point in the dataset.

Returns:

The object itself.

fit_transform(X, y)

Create the Mapper graph.

This method is equivalent to calling tdamapper.core.mapper_graph().

Parameters:

X (array-like of shape (n, m) or list-like of length n) – The dataset to be mapped.
y (array-like of shape (n, k) or list-like of length n) – The lens values for each point in the dataset.

Returns:

The Mapper graph.

Return type:

networkx.Graph

tdamapper.core.aggregate_graph(X, graph, agg)

Apply an aggregation function to the nodes of a graph.

This function takes a dataset and a graph, and computes an aggregation value for each node of the graph, based on the data points that are associated with that node. The aggregation function can be any callable that takes a list of values and returns a single value, such as sum, mean, max, min, etc.

The function returns a dictionary that maps each node of the graph to its aggregation value. The keys of the dictionary are the nodes of the graph, and the values are the aggregation values.

Parameters:

X (array-like of shape (n, m) or list-like) – The dataset to be aggregated.
graph (networkx.Graph.) – The graph to apply the aggregation function to.
agg (Callable.) – The aggregation function to use.

Returns:

A dictionary of node-aggregation pairs.

Return type:

dict

tdamapper.core.mapper_connected_components(X, y, cover, clustering)

Identify the connected components of the Mapper graph.

A connected component is a maximal set of nodes that are reachable from each other by following the edges. This function assigns a unique integer label to each point in the dataset, based on the connected component of the Mapper graph that it belongs to.

This function uses a union-find data structure to efficiently keep track of the connected components as it scans the points of the dataset. This approach should be faster than computing the Mapper graph by first calling tdamapper.core.mapper_graph() and then calling networkx.connected_components() on it.

Parameters:

X (array-like of shape (n, m) or list-like of length n) – The dataset to be mapped.
y (array-like of shape (n, k) or list-like of length n) – The lens values for each point in the dataset.
cover (A class compatible with tdamapper.cover.Cover) – The cover algorithm to apply to lens space.
clustering (A class from tdamapper.clustering, or a class from sklearn.cluster) – The clustering algorithm to apply to each subset of the dataset.

Returns:

A list of labels. The label at position i identifies the connected component of the point at position i in the dataset.

Return type:

list[int]

tdamapper.core.mapper_graph(X, y, cover, clustering)

Create the Mapper graph.

This function first identifies the unique cluster labels that each point of the dataset belongs to. These labels are used to identify the nodes of the Mapper graph. Then the edges of the Mapper graph are created by checking if any two nodes share some points in their corresponding clusters.

This function return the Mapper graph as an object of type networkx.Graph. Each node has an attribute ‘size’ that stores the number of points contained in its corresponding cluster, and an attribute ‘ids’ that stores the indices of the points in the dataset that are contained in the cluster.

Parameters:

X (array-like of shape (n, m) or list-like of length n) – The dataset to be mapped.
y (array-like of shape (n, k) or list-like of length n) – The lens values for each point in the dataset.
cover (A class compatible with tdamapper.cover.Cover) – The cover algorithm to apply to lens space.
clustering (A class from tdamapper.clustering, or a class from sklearn.cluster) – The clustering algorithm to apply to each subset of the dataset.

Returns:

The Mapper graph.

Return type:

networkx.Graph

tdamapper.core.mapper_labels(X, y, cover, clustering)

Identify the nodes of the Mapper graph.

The function first covers the lens space with overlapping sets, using the cover algorithm provided. Then, for each set, it clusters the points of the dataset that have lens values within that set, using the clustering algorithm provided. The clusters are then labeled with unique integers, starting from zero for each set. The function then adds an offset to the cluster labels, such that the labels are distinct across all sets. The offset is equal to the maximum label of the previous set plus one.

The function returns a list of node labels for each point in the dataset. The list at position i contains the labels of the nodes that the point at position i belongs to. The labels are sorted in ascending order, and there are no duplicates. If i < j, the labels at position i are strictly less than those at position j.

Parameters:

X (array-like of shape (n, m) or list-like of length n) – The dataset to be mapped.
y (array-like of shape (n, k) or list-like of length n) – The lens values for each point in the dataset.
cover (A class compatible with tdamapper.cover.Cover) – The cover algorithm to apply to lens space.
clustering (A class from tdamapper.clustering, or a class from sklearn.cluster) – The clustering algorithm to apply to each subset of the dataset.

Returns:

A list of node labels for each point in the dataset.

Return type:

list[list[int]]

tdamapper.cover

Open cover construction for the Mapper algorithm.

An open cover is a collection of open subsets of a dataset whose union spans the whole dataset. Unlike clustering, open subsets do not need to be disjoint. Indeed, the overlaps of the open subsets define the edges of the Mapper graph.

class tdamapper.cover.BallCover(radius, metric, flat=True)

Bases: ProximityCover

Cover algorithm based on ball proximity function implemented as tdamapper.proximity.BallProximity.

Parameters:

radius (float) – The radius of the open balls, must be positive.
metric (Callable) – The (pseudo-)metric function that defines the distance between points, must be symmetric, positive, and satisfy the triangle-inequality, i.e. \(metric(x, z) \leq metric(x, y) + metric(y, z)\) for every x, y, z in the dataset.
flat (bool, optional) – A flag that indicates whether to use a flat or a hierarchical vantage point tree, defaults to False.

class tdamapper.cover.Cover

Bases: object

Abstract interface for cover algorithms.

This is a naive implementation. Subclasses should override the methods of this class to implement more meaningful cover algorithms.

apply(X)

Covers the dataset with a single open set.

This is a naive implementation that should be overridden by subclasses to implement more meaningful cover algorithms.

Parameters:: X (array-like of shape (n, m) or list-like of length n) – A dataset of n points to be covered with open subsets.
Returns:: A generator that produces a single list of ints whose elements are the indices of the data points, ranging from 0 to n - 1.
Return type:: generator of lists of ints

class tdamapper.cover.CubicalCover(n_intervals, overlap_frac, flat=True)

Bases: ProximityCover

Cover algorithm based on the cubical proximity function implemented as tdamapper.proximity.CubicalProximity.

Parameters:

n_intervals (int) – The number of intervals to use for each dimension, must be positive and less than or equal to the length of the dataset.
overlap_frac (float) – The fraction of overlap between adjacent intervals on each dimension, must be in the range (0.0, 1.0).
flat (bool, optional) – A flag that indicates whether to use a flat or a hierarchical vantage point tree, defaults to False.

class tdamapper.cover.KNNCover(neighbors, metric, flat=True)

Bases: ProximityCover

Cover algorithm based on knn proximity function implemented as tdamapper.proximity.KNNProximity.

Parameters:

neighbors (int) – The number of neighbors to use for the KNN Proximity function, must be positive and less than the length of the dataset.
metric (Callable) – The (pseudo-)metric function that defines the distance between points, must be symmetric, positive, and satisfy the triangle-inequality, i.e. \(metric(x, z) \leq metric(x, y) + metric(y, z)\) for every x, y, z in the dataset.
flat (bool, optional) – A flag that indicates whether to use a flat or a hierarchical vantage point tree, defaults to False.

class tdamapper.cover.ProximityCover(proximity)

Bases: Cover

Cover algorithm based on proximity-net.

This class creates an open cover by calling the function tdamapper.proximity.proximity_net().

Parameters:: proximity (Anything compatible with class tdamapper.proximity.Proximity) – The proximity function to use.

apply(X)

Covers the dataset using proximity-net on the specified proximity function.

The proximity function is used to create an open set whenever a point is picked from tdamapper.proximity.proximity_net().

Parameters:: X (array-like of shape (n, m) or list-like of length n) – A dataset of n points to be covered with open subsets.
Returns:: A generator yielding lists of ints whose elements are the indices of the data points.
Return type:: generator of lists of ints

class tdamapper.cover.TrivialCover

Bases: ProximityCover

Cover algorithm based on the trivial proximity function implemented as tdamapper.proximity.TrivialProximity.

tdamapper.proximity

Proximity functions for open cover algorithms.

A proximity function is a function that maps each point into a subset of the dataset that contains the point itself.

Proximity functions, implemented as subclasses of class tdamapper.proximity.Proximity, are a convenient way to implement open cover algorithms by using the proximity_net construction. Proximity-net is implemented by function tdamapper.proximity.proximity_net(), and used by the class tdamapper.cover.ProximityCover.

class tdamapper.proximity.BallProximity(radius, metric, flat=True)

Bases: Proximity

Proximity function based on open balls of fixed radius.

An open ball is a set of points placed within a certain distance from a center. This class maps each point to the open ball of fixed radius centered on the point itself.

Parameters:

radius (float) – The radius of the open balls, must be positive.
metric (Callable) – The (pseudo-)metric function that defines the distance between points, must be symmetric, positive, and satisfy the triangle-inequality, i.e. \(metric(x, z) \leq metric(x, y) + metric(y, z)\) for every x, y, z in the dataset.
flat (bool, optional) – A flag that indicates whether to use a flat or a hierarchical vantage point tree, defaults to False.

fit(X)

Train internal parameters.

This method creates a vptree on the dataset in order to perform fast range queries in the func:tdamapper.proximity.BallProximity.search method.

Parameters:: X (array-like of shape (n, m) or list-like of length n) – A dataset of n points used to extract parameters and perform training.
Returns:: The object itself.
Return type:: self

search(x)

Return a list of neighbors for the query point.

This method uses the internal vptree to perform fast range queries.

Parameters:: x (Any) – A query point for which we want to find neighbors.
Returns:: The indices of the neighbors contained in the dataset.
Return type:: list[int]

class tdamapper.proximity.CubicalProximity(n_intervals, overlap_frac, flat=True)

Bases: Proximity

Proximity function based on open hypercubes of uniform size and overlap.

A hypercube is a multidimensional generalization of a square or a cube. The size and overlap of the hypercubes are determined by the number of intervals and the overlap fraction parameters. This class maps each point to the hypercube with the nearest center.

Parameters:

n_intervals (int) – The number of intervals to use for each dimension, must be positive and less than or equal to the length of the dataset.
overlap_frac (float) – The fraction of overlap between adjacent intervals on each dimension, must be in the range (0.0, 1.0).
flat (bool, optional) – A flag that indicates whether to use a flat or a hierarchical vantage point tree, defaults to False.

fit(X)

Train internal parameters.

This method builds an internal tdamapper.cover.BallCover attribute that allows efficient queries of the dataset.

Parameters:: X (array-like of shape (n, m) or list-like of length n) – A dataset of n points used to extract parameters and perform training.
Returns:: The object itself.
Return type:: self

search(x)

Return a list of neighbors for the query point.

This method takes a target point as input and returns the hypercube whose center is closest to the target point.

Parameters:: x (Any) – A query point for which we want to find neighbors.
Returns:: The indices of the neighbors contained in the dataset.
Return type:: list[int]

class tdamapper.proximity.KNNProximity(neighbors, metric, flat=True)

Bases: Proximity

Proximity function based on k-nearest neighbors (KNN).

This class maps each point to the set of the k nearest neighbors to the point itself.

Parameters:

neighbors (int) – The number of neighbors to use for the KNN Proximity function, must be positive and less than the length of the dataset.
metric (Callable) – The (pseudo-)metric function that defines the distance between points, must be symmetric, positive, and satisfy the triangle-inequality, i.e. \(metric(x, z) \leq metric(x, y) + metric(y, z)\) for every x, y, z in the dataset.
flat (bool, optional) – A flag that indicates whether to use a flat or a hierarchical vantage point tree, defaults to False.

fit(X)

Train internal parameters.

This method creates a vptree on the dataset in order to perform fast KNN queries in the func:tdamapper.proximity.BallProximity.search method.

Parameters:: X (array-like of shape (n, m) or list-like of length n) – A dataset of n points used to extract parameters and perform training.
Returns:: The object itself.
Return type:: self

search(x)

Return a list of neighbors for the query point.

This method queries the internal vptree in order to perform fast KNN queries.

Parameters:: x (Any) – A query point for which we want to find neighbors.
Returns:: The indices of the neighbors contained in the dataset.
Return type:: list[int]

class tdamapper.proximity.Proximity

Bases: object

Abstract interface for proximity functions.

This is a naive implementation. Subclasses should override the methods of this class to implement more meaningful proximity functions.

fit(X)

Train internal parameters.

This is a naive implementation that should be overridden by subclasses to implement more meaningful proximity functions.

Parameters:: X (array-like of shape (n, m) or list-like of length n) – A dataset of n points used to extract parameters and perform training.
Returns:: The object itself.
Return type:: self

search(x)

Return a list of neighbors for the query point.

This is a naive implementation that returns all the points in the dataset as neighbors. This method should be overridden by subclasses to implement more meaningful proximity functions.

Parameters:: x (Any) – A query point for which we want to find neighbors.
Returns:: A list containing all the indices of the points in the dataset.
Return type:: list[int]

class tdamapper.proximity.TrivialProximity

Bases: Proximity

Proximity function that returns the whole dataset.

This class creates a single open set that contains all the points in the dataset.

fit(X)

Train internal parameters.

This method stores the dataset in an internal attribute.

Parameters:: X (array-like of shape (n, m) or list-like of length n) – A dataset of n points used to extract parameters and perform training.
Returns:: The object itself.
Return type:: self

search(x)

Return a list of neighbors for the query point.

This method returns a list of all the ids ranging from zero to the length of the dataset.

Parameters:: x (Any) – A query point for which we want to find neighbors.
Returns:: The indices of the neighbors contained in the dataset.
Return type:: list[int]

tdamapper.proximity.proximity_net(X, proximity)

Compute proximity-net for a given proximity function.

This function uses an iterative algorithm to construct the proximity-net. It starts with an arbitrary point and builds an open cover around it based on the proximity function. Then it discards the covered points and repeats the process on the remaining points until all points are covered.

This function applies an iterative algorithm to create the proximity-net. It picks an arbitrary point and forms an open cover calling the proximity function on the chosen point. The points contained in the open cover are then marked as covered, and discarded in the following steps. The procedure is repeated on the leftover points until every point is eventually covered.

This function returns a generator that yields each element of the proximity-net as a list of ids. The ids are the indices of the points in the original dataset.

Parameters:

X (array-like of shape (n, m) or list-like of length n) – A dataset of n points.
proximity (tdamapper.cover.Proximity) – A proximity function

Returns:

A generator of lists of ids.

Return type:

generator of lists of ints

tdamapper.clustering

Clustering module for the Mapper algorithm.

This module implements some tools for the clustering step of Mapper algorithm, which groups the data points in each open set into clusters using a clustering algorithm of choice. The clusters are then used to form the nodes of the Mapper graph, and are connected by edges if they share points in the overlap.

class tdamapper.clustering.FailSafeClustering(clustering, verbose=True)

Bases: object

A delegating clustering algorithm that prevents failure.

This class wraps a clustering algorithm and handles any exceptions that may occur during the fitting process. If the clustering algorithm fails, instead of throwing an exception, a single cluster containing all points is returned. This can be useful for robustness and debugging purposes.

Parameters:

clustering (Anything compatible with a sklearn.cluster class.) – A clustering algorithm to delegate to.
verbose (bool) – Set to True to log exceptions. The default is True.

fit(X, y=None)

class tdamapper.clustering.MapperClustering(cover=None, clustering=None)

Bases: object

A clustering algorithm based on the Mapper graph.

The Mapper algorithm constructs a graph from a dataset, where each node represents a cluster of points and each edge represents an overlap between clusters. Each point in the dataset belongs to one or more nodes in the graph. These nodes are therefore all connected and share the same connected component in the Mapper graph. This class clusters point according to their connected component in the Mapper graph calling the function tdamapper.core.mapper_connected_components().

Parameters:

cover (A class from tdamapper.cover) – The cover algorithm to apply to lens space.
clustering (A class from tdamapper.clustering, or a class from sklearn.cluster) – The clustering algorithm to apply to each subset of the dataset.

fit(X, y=None)

class tdamapper.clustering.TrivialClustering

Bases: object

A clustering algorithm that returns a single cluster.

This class implements a trivial clustering algorithm that assigns all data points to the same cluster. It can be used as an argument of the class tdamapper.core.MapperAlgorithm to skip clustering in the construction of the Mapper graph.

fit(X, y=None)

Fit the clustering algorithm to the data.

Parameters:

X (array-like of shape (n, m) or list-like of length n) – The dataset to be mapped.
y – Ignored.

Returns:

self

tdamapper.plot

This module provides functionalities to visualize the Mapper graph.

class tdamapper.plot.MapperLayoutInteractive(graph, dim, seed=42, iterations=50, colors=None, agg=<function nanmean>, title=None, width=512, height=512, cmap='jet')

Bases: object

Class for generating and visualizing the Mapper graph.

This class creates a metric embedding of the Mapper graph in 2D or 3D and converts it into a Plotly figure suitable for interactive display.

Parameters:

graph (networkx.Graph, required) – The precomputed Mapper graph to be embedded. This can be obtained by calling tdamapper.core.mapper_graph() or tdamapper.core.MapperAlgorithm.fit_transform().
dim (int) – The dimension of the graph embedding (2 or 3).
seed (int, optional (default: 42)) – The random seed used to construct the graph embedding.
iterations (int, optional (default: 50)) – The number of iterations used to construct the graph embedding.
colors (array-like of shape (n,) or list-like of size n) – An array of values that determine the color of each node in the graph, useful for highlighting different features of the data.
agg (Callable, optional) – A function used to aggregate the colors data when multiple points are mapped to a single node. The final color of each node is obtained by mapping the aggregated value with the colormap cmap.
title (str, optional) – The title to be displayed alongside the figure.
width (int, optional (default: 512)) – The desired width of the figure in pixels.
height (int, optional (default: 512)) – The desired height of the figure in pixels.
cmap (str, optional) – The name of a colormap used to map color data values, aggregated by agg, to actual RGBA colors.

plot()

Plot the Mapper graph.

Returns:: An interactive Plotly figure that can be displayed on screen and notebooks. For 3D embeddings, the figure requires a WebGL context to be shown.
Return type:: plotly.graph_objects.Figure

update(seed=None, iterations=None, colors=None, agg=None, title=None, width=None, height=None, cmap=None)

Update the figure.

This method modifies the figure returned by the plot function. After calling this method, the figure will be updated according to the supplied parameters.

Parameters:

seed (int, optional) – The random seed used to construct the graph embedding.
iterations (int, optional) – The number of iterations used to construct the graph embedding.
colors (array-like of shape (n,) or list-like of size n) – An array of values that determine the color of each node in the graph, useful for highlighting different features of the data.
agg (Callable, optional) – A function used to aggregate the colors data when multiple points are mapped to a single node. The final color of each node is obtained by mapping the aggregated value with the colormap cmap.
title (str, optional) – The title to be displayed alongside the figure.
width (int, optional) – The desired width of the figure in pixels.
height (int, optional) – The desired height of the figure in pixels.
cmap (str, optional) – The name of a colormap used to map color data values, aggregated by agg, to actual RGBA colors.

class tdamapper.plot.MapperLayoutStatic(graph, dim, seed=42, iterations=50, colors=None, agg=<function nanmean>, title=None, width=512, height=512, cmap='jet')

Bases: object

Class for generating and visualizing the Mapper graph.

This class creates a metric embedding of the Mapper graph in 2D and converts it into a matplotlib figure suitable for static display.

Parameters:

graph (networkx.Graph, required) – The precomputed Mapper graph to be embedded. This can be obtained by calling tdamapper.core.mapper_graph() or tdamapper.core.MapperAlgorithm.fit_transform().
dim (int) – The dimension of the graph embedding (only 2 is supported, for compatibility).
seed (int, optional (default: 42)) – The random seed used to construct the graph embedding.
iterations (int, optional (default: 50)) – The number of iterations used to construct the graph embedding.
colors (array-like of shape (n,) or list-like of size n) – An array of values that determine the color of each node in the graph, useful for highlighting different features of the data.
agg (Callable, optional) – A function used to aggregate the colors data when multiple points are mapped to a single node. The final color of each node is obtained by mapping the aggregated value with the colormap cmap.
title (str, optional) – The title to be displayed alongside the figure.
width (int, optional (default: 512)) – The desired width of the figure in pixels.
height (int, optional (default: 512)) – The desired height of the figure in pixels.
cmap (str, optional) – The name of a colormap used to map color data values, aggregated by agg, to actual RGBA colors.

plot()

Plot the Mapper graph.

Returns:: A static matplotlib figure that can be displayed on screen and notebooks.
Return type:: matplotlib.figure.Figure, matplotlib.axes.Axes