API Reference
tdamapper.core
Core tools for creating and analyzing Mapper graphs.
The Mapper algorithm is a method for exploring the shape and structure of high-dimensional datasets, by constructing a graph representation called Mapper graph. The algorithm has three main steps:
Filtering: Apply a lens function (also called filter) to map the data points to a lower-dimensional space, such as a scalar value or a 2D plane.
Covering: Arrange the lens space into overlapping open sets, using a cover algorithm such as uniform intervals or balls.
Clustering: Group the data points in each open set into clusters, using a clustering algorithm such as single-linkage or DBSCAN.
The Mapper graph consists of nodes that represent clusters of data points, and edges that connect overlapping clusters (clusters obtained from different open sets can possibly overlap). For more details on the Mapper algorithm and its applications, see
Gurjeet Singh, Facundo Mémoli and Gunnar Carlsson, “Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition”, Eurographics Symposium on Point-Based Graphics, 2007.
This module provides the class tdamapper.core.MapperAlgorithm, which
encapsulates the algorithm and its parameters. The Mapper graph produced by
this module is a NetworkX graph object.
- class tdamapper.core.FailSafeClustering(clustering: Clustering[T] | None = None, verbose: bool = True)[source]
Bases:
ParamsMixin,Generic[T]A delegating clustering algorithm that prevents failure.
This class wraps a clustering algorithm and handles any exceptions that may occur during the fitting process. If the clustering algorithm fails, instead of throwing an exception, a single cluster containing all points is returned. This can be useful for robustness and debugging purposes.
- Parameters:
clustering – A clustering algorithm to delegate to.
verbose – A flag to log clustering exceptions. Set to True to enable logging, or False to suppress it.
- fit(X: ArrayRead[T], y: ArrayRead[T] | None = None) FailSafeClustering[T][source]
Fit the clustering algorithm to the data.
If the clustering algorithm fails, a single cluster containing all points is returned.
- Parameters:
X – A dataset of n points.
y – Ignored.
- Returns:
self
- labels_: list[int]
- class tdamapper.core.TrivialClustering[source]
Bases:
ParamsMixin,Generic[T]A clustering algorithm that returns a single cluster.
This class implements a trivial clustering algorithm that assigns all data points to the same cluster. It can be used as an argument of the class
tdamapper.core.MapperAlgorithmto skip clustering in the construction of the Mapper graph.- fit(X: ArrayRead[T], y: ArrayRead[T] | None = None) TrivialClustering[T][source]
Fit the clustering algorithm to the data.
- Parameters:
X – A dataset of n points.
y – Ignored.
- Returns:
self
- labels_: list[int]
- class tdamapper.core.TrivialCover[source]
Bases:
ParamsMixin,Generic[T]Cover algorithm that covers data with a single subset containing the whole dataset.
This class creates a single open set that contains all the points in the dataset.
- apply(X: ArrayRead[T]) Iterator[list[int]][source]
Covers the dataset with a single open set.
- Parameters:
X – A dataset of n points.
- Returns:
A generator of lists of ids.
- fit(X: ArrayRead[T]) TrivialCover[T][source]
Fit the cover algorithm to the data.
- Parameters:
X – A dataset of n points. Ignored.
- Returns:
self
- tdamapper.core.aggregate_graph(X: ArrayRead[S], graph: Graph, agg: Callable[[...], Any]) dict[int, Any][source]
Apply an aggregation function to the nodes of a graph.
This function takes a dataset and a graph, and computes an aggregation value for each node of the graph, based on the data points that are associated with that node. The aggregation function can be any callable that takes a list of values and returns a single value, such as sum, mean, max, min, etc.
The function returns a dictionary that maps each node of the graph to its aggregation value. The keys of the dictionary are the nodes of the graph, and the values are the aggregation values.
- Parameters:
X – A dataset of n points.
graph – The graph to apply the aggregation function to.
agg – The aggregation function to use.
- Returns:
A dictionary of node-aggregation pairs.
- tdamapper.core.mapper_connected_components(X: ArrayRead[S], y: ArrayRead[T], cover: Cover[T], clustering: Clustering[S], n_jobs: int = 1) list[int][source]
Identify the connected components of the Mapper graph.
A connected component is a maximal set of nodes that are reachable from each other by following the edges. This function assigns a unique integer label to each point in the dataset, based on the connected component of the Mapper graph that it belongs to.
This function uses a union-find data structure to efficiently keep track of the connected components as it scans the points of the dataset. This approach should be faster than computing the Mapper graph by first calling
tdamapper.core.mapper_graph()and then callingnetworkx.connected_components()on it.- Parameters:
X – A dataset of n points.
y – Lens values for the n points of the dataset.
cover – A cover algorithm.
clustering – The clustering algorithm to apply to each subset of the dataset.
n_jobs – The maximum number of parallel clustering jobs. This parameter is passed to the constructor of
joblib.Parallel.
- Returns:
A list of labels. The label at position i identifies the connected component of the point at position i in the dataset.
- tdamapper.core.mapper_graph(X: ArrayRead[S], y: ArrayRead[T], cover: Cover[T], clustering: Clustering[S], n_jobs: int = 1) Graph[source]
Create the Mapper graph.
This function first identifies the unique cluster labels that each point of the dataset belongs to. These labels are used to identify the nodes of the Mapper graph. Then the edges of the Mapper graph are created by checking if any two nodes share some points in their corresponding clusters.
This function return the Mapper graph as an object of type
networkx.Graph. Each node has an attribute ‘size’ that stores the number of points contained in its corresponding cluster, and an attribute ‘ids’ that stores the indices of the points in the dataset that are contained in the cluster.- Parameters:
X – A dataset of n points.
y – Lens values for the n points of the dataset.
cover – A cover algorithm.
clustering – The clustering algorithm to apply to each subset of the dataset.
n_jobs – The maximum number of parallel clustering jobs. This parameter is passed to the constructor of
joblib.Parallel.
- Returns:
The Mapper graph.
- tdamapper.core.mapper_labels(X: ArrayRead[S], y: ArrayRead[T], cover: Cover[T], clustering: Clustering[S], n_jobs: int = 1) list[list[int]][source]
Identify the nodes of the Mapper graph.
The function first covers the lens space with overlapping sets, using the cover algorithm provided. Then, for each set, it clusters the points of the dataset that have lens values within that set, using the clustering algorithm provided. The clusters are then labeled with unique integers, starting from zero for each set. The function then adds an offset to the cluster labels, such that the labels are distinct across all sets. The offset is equal to the maximum label of the previous set plus one.
The function returns a list of node labels for each point in the dataset. The list at position i contains the labels of the nodes that the point at position i belongs to. The labels are sorted in ascending order, and there are no duplicates. If i < j, the labels at position i are strictly less than those at position j.
- Parameters:
X – A dataset of n points.
y – Lens values for the n points of the dataset.
cover – A cover algorithm.
clustering – A clustering algorithm.
n_jobs – The maximum number of parallel clustering jobs. This parameter is passed to the constructor of
joblib.Parallel.
- Returns:
A list of node labels for each point in the dataset.
- tdamapper.core.proximity_net(search: SpatialSearch[S], X: ArrayRead[S]) Iterator[list[int]][source]
Covers the dataset using proximity-net.
This function applies an iterative algorithm to create the proximity-net. It picks an arbitrary point and forms an open cover calling the proximity function on the chosen point. The points contained in the open cover are then marked as covered, and discarded in the following steps. The procedure is repeated on the leftover points until every point is eventually covered.
This function returns a generator that yields each element of the proximity-net as a list of ids. The ids are the indices of the points in the original dataset.
- Parameters:
X – A dataset of n points.
- Returns:
A generator of lists of ids.
tdamapper.cover
Open cover construction for the Mapper algorithm.
An open cover is a collection of open subsets of a dataset whose union spans the whole dataset. Unlike clustering, open subsets do not need to be disjoint. Indeed, the overlaps of the open subsets define the edges of the Mapper graph.
- class tdamapper.cover.BallCover(radius: float = 1.0, metric: Literal['euclidean', 'manhattan', 'minkowski', 'chebyshev', 'cosine'] | Metric[Any] = 'euclidean', metric_params: dict[str, Any] | None = None, kind: Literal['flat', 'hierarchical'] = 'flat', leaf_capacity: int = 1, leaf_radius: float | None = None, pivoting: Literal['disabled', 'random', 'furthest'] = 'disabled')[source]
Bases:
ParamsMixin,Generic[T_contra]Cover algorithm based on ball proximity function, which covers data with open balls of fixed radius.
An open ball is a set of points within a specified distance from a center point. This class maps each point to its corresponding open ball with a fixed radius centered on the point itself.
- Parameters:
radius – The radius of the open balls. Must be a positive value.
metric – The metric used to define the distance between points. Accepts any value compatible with tdamapper.utils.metrics.get_metric.
metric_params – Additional parameters for the metric function, to be passed to tdamapper.utils.metrics.get_metric.
kind – Specifies whether to use a flat or a hierarchical vantage point tree.
leaf_capacity – The maximum number of points in a leaf node of the vantage point tree. Must be a positive value.
leaf_radius – The radius of the leaf nodes. If not specified, it defaults to the value of radius. Must be a positive value.
pivoting – The method used for pivoting in the vantage point tree.
- apply(X: ArrayRead[T_contra]) Iterator[list[int]][source]
Covers the dataset using proximity-net.
This function returns a generator that yields each element of the open cover as a list of ids. The ids are the indices of the points in the original dataset.
- Parameters:
X – A dataset of n points.
- Returns:
A generator of lists of ids.
- class tdamapper.cover.BaseCubicalCover(n_intervals: int = 1, overlap_frac: float | None = None, kind: Literal['flat', 'hierarchical'] = 'flat', leaf_capacity: int = 1, leaf_radius: float | None = None, pivoting: Literal['disabled', 'random', 'furthest'] = 'disabled')[source]
Bases:
objectBase class for cubical cover algorithms, which cover data with open hypercubes of uniform size and overlap. This class provides the basic functionality for cubical covers, including the initialization of parameters and the methods for computing the center of a hypercube and its overlap.
A hypercube is a multidimensional generalization of a square or a cube. The size and overlap of the hypercubes are determined by the number of intervals and the overlap fraction parameters. This class maps each point to the hypercube with the nearest center.
- Parameters:
n_intervals – The number of intervals to use for each dimension. Must be positive and less than or equal to the length of the dataset.
overlap_frac – The fraction of overlap between adjacent intervals on each dimension, must be in the range (0.0, 0.5]. If not specified, the overlap_frac is computed such that the volume of the overlap within each hypercube is half the total volume.
kind – Specifies whether to use a flat or a hierarchical vantage point tree.
leaf_capacity – The maximum number of points in a leaf node of the vantage point tree. Must be a positive value.
leaf_radius – The radius of the leaf nodes. If not specified, it defaults to the value of radius. Must be a positive value.
pivoting – The method used for pivoting in the vantage point tree.
- fit(X: ArrayRead[ndarray[Any, dtype[float64]]]) BaseCubicalCover[source]
Train internal parameters.
This method builds an internal
tdamapper.cover.BallCoverattribute that allows efficient queries of the dataset.- Parameters:
X – A dataset of n points.
- Returns:
The object itself.
- search(x: ndarray[Any, dtype[float64]]) list[int][source]
Return a list of neighbors for the query point.
This method takes a target point as input and returns the hypercube whose center is closest to the target point.
- Parameters:
x – A query point for which we want to find neighbors.
- Returns:
The indices of the neighbors contained in the dataset.
- class tdamapper.cover.CubicalCover(n_intervals: int = 1, overlap_frac: float | None = None, algorithm: Literal['standard', 'proximity'] = 'proximity', kind: Literal['flat', 'hierarchical'] = 'flat', leaf_capacity: int = 1, leaf_radius: float | None = None, pivoting: Literal['disabled', 'random', 'furthest'] = 'disabled')[source]
Bases:
ParamsMixinWrapper class for cubical cover algorithms, which cover data with open hypercubes of uniform size and overlap. This class delegates its methods to either
tdamapper.cover.StandardCubicalCoverortdamapper.cover.ProximityCubicalCover, based on the algorithm parameter.A hypercube is a multidimensional generalization of a square or a cube. The size and overlap of the hypercubes are determined by the number of intervals and the overlap fraction parameters.
- Parameters:
n_intervals – The number of intervals to use for each dimension. Must be positive and less than or equal to the length of the dataset.
overlap_frac – The fraction of overlap between adjacent intervals on each dimension, must be in the range (0.0, 0.5]. If not specified, the overlap_frac is computed such that the volume of the overlap within each hypercube is half the total volume.
algorithm – Specifies whether to use standard cubical cover, as in
tdamapper.cover.StandardCubicalCoveror proximity cubical cover, as intdamapper.cover.ProximityCubicalCover.kind – Specifies whether to use a flat or a hierarchical vantage point tree.
leaf_capacity – The maximum number of points in a leaf node of the vantage point tree. Must be a positive value.
leaf_radius – The radius of the leaf nodes. If not specified, it defaults to the value of radius. Must be a positive value.
pivoting – The method used for pivoting in the vantage point tree.
- apply(X: ArrayRead[ndarray[Any, dtype[float64]]]) Iterator[list[int]][source]
Covers the dataset using hypercubes.
This method delegates to the apply method of the internal cubical cover used.
- Parameters:
X – A dataset of n points.
- Returns:
A generator of lists of ids.
- fit(X: ArrayRead[ndarray[Any, dtype[float64]]]) CubicalCover[source]
Train internal parameters.
This method delegates to the
fit()method of the internal cubical cover used.- Parameters:
X – A dataset of n points.
- Returns:
The object itself.
- search(x: ndarray[Any, dtype[float64]]) list[int][source]
Return a list of neighbors for the query point.
This method delegates to the search method of the internal cubical cover used.
- Parameters:
x – A query point for which we want to find neighbors.
- Returns:
The indices of the neighbors contained in the dataset.
- class tdamapper.cover.KNNCover(neighbors: int = 1, metric: Literal['euclidean', 'manhattan', 'minkowski', 'chebyshev', 'cosine'] | Metric[Any] = 'euclidean', metric_params: dict[str, Any] | None = None, kind: Literal['flat', 'hierarchical'] = 'flat', leaf_capacity: int | None = None, leaf_radius: float = 0.0, pivoting: Literal['disabled', 'random', 'furthest'] = 'disabled')[source]
Bases:
ParamsMixin,Generic[T_contra]Cover algorithm based on KNN proximity function, which covers data using k-nearest neighbors (KNN).
This class maps each point to the set of the k nearest neighbors to the point itself.
- Parameters:
neighbors – The number of neighbors to use for the KNN Proximity function, must be positive and less than the length of the dataset.
metric – The metric used to define the distance between points. Accepts any value compatible with tdamapper.utils.metrics.get_metric.
metric_params – Additional parameters for the metric function, to be passed to tdamapper.utils.metrics.get_metric.
kind – Specifies whether to use a flat or a hierarchical vantage point tree.
leaf_capacity – The maximum number of points in a leaf node of the vantage point tree. If not specified, it defaults to the value of neighbors. Must be a positive value.
leaf_radius – The radius of the leaf nodes. Must be a positive value.
pivoting – The method used for pivoting in the vantage point tree.
- apply(X: ArrayRead[T_contra]) Iterator[list[int]][source]
Covers the dataset using proximity-net.
This function returns a generator that yields each element of the open cover as a list of ids. The ids are the indices of the points in the original dataset.
- Parameters:
X – A dataset of n points.
- Returns:
A generator of lists of ids.
- class tdamapper.cover.ProximityCubicalCover(n_intervals: int = 1, overlap_frac: float | None = None, kind: Literal['flat', 'hierarchical'] = 'flat', leaf_capacity: int = 1, leaf_radius: float | None = None, pivoting: Literal['disabled', 'random', 'furthest'] = 'disabled')[source]
Bases:
BaseCubicalCover,ParamsMixinCover algorithm based on the cubical proximity function, which covers data with open hypercubes of uniform size and overlap. The cubical cover is obtained by selecting a subsect of all the hypercubes that intersect the dataset using proximity net (see
tdamapper.core.Proximity). For an open cover containing all the hypercubes interecting the dataset usetdamapper.core.StandardCubicalCover.A hypercube is a multidimensional generalization of a square or a cube. The size and overlap of the hypercubes are determined by the number of intervals and the overlap fraction parameters. This class maps each point to the hypercube with the nearest center.
- Parameters:
n_intervals – The number of intervals to use for each dimension. Must be positive and less than or equal to the length of the dataset.
overlap_frac – The fraction of overlap between adjacent intervals on each dimension, must be in the range (0.0, 0.5]. If not specified, the overlap_frac is computed such that the volume of the overlap within each hypercube is half the total volume.
kind – Specifies whether to use a flat or a hierarchical vantage point tree.
leaf_capacity – The maximum number of points in a leaf node of the vantage point tree. Must be a positive value.
leaf_radius – The radius of the leaf nodes. If not specified, it defaults to the value of radius. Must be a positive value.
pivoting – The method used for pivoting in the vantage point tree.
- apply(X: ArrayRead[ndarray[Any, dtype[float64]]]) Iterator[list[int]][source]
Covers the dataset using proximity-net.
This function returns a generator that yields each element of the open cover as a list of ids. The ids are the indices of the points in the original dataset.
- Parameters:
X – A dataset of n points.
- Returns:
A generator of lists of ids.
- class tdamapper.cover.StandardCubicalCover(n_intervals: int = 1, overlap_frac: float | None = None, kind: Literal['flat', 'hierarchical'] = 'flat', leaf_capacity: int = 1, leaf_radius: float | None = None, pivoting: Literal['disabled', 'random', 'furthest'] = 'disabled')[source]
Bases:
BaseCubicalCover,ParamsMixinCover algorithm based on the standard open cover, which covers data with open hypercubes of uniform size and overlap. The standard cover is obtained by selecting all the hypercubes that intersect the dataset.
A hypercube is a multidimensional generalization of a square or a cube. The size and overlap of the hypercubes are determined by the number of intervals and the overlap fraction parameters. This class maps each point to the hypercube with the nearest center.
- Parameters:
n_intervals – The number of intervals to use for each dimension. Must be positive and less than or equal to the length of the dataset.
overlap_frac – The fraction of overlap between adjacent intervals on each dimension, must be in the range (0.0, 0.5]. If not specified, the overlap_frac is computed such that the volume of the overlap within each hypercube is half the total volume.
kind – Specifies whether to use a flat or a hierarchical vantage point tree.
leaf_capacity – The maximum number of points in a leaf node of the vantage point tree. Must be a positive value.
leaf_radius – The radius of the leaf nodes. If not specified, it defaults to the value of radius. Must be a positive value.
pivoting – The method used for pivoting in the vantage point tree.
- apply(X: ArrayRead[ndarray[Any, dtype[float64]]]) Iterator[list[int]][source]
Covers the dataset using landmarks.
This function yields all the hypercubes intersecting the dataset.
This function returns a generator that yields each element of the open cover as a list of ids. The ids are the indices of the points in the original dataset.
- Parameters:
X – A dataset of n points.
- Returns:
A generator of lists of ids.
tdamapper.learn
This module provides classes based on the Mapper algorithm, a technique from topological data analysis (TDA) for extracting insights from complex data. Each class is designed to be compatible with scikit-learn’s estimator APIs, ensuring seamless integration with existing machine learning pipelines.
Users can leverage these classes to explore high-dimensional data, visualize relationships, and uncover meaningful structures in a manner that aligns with scikit-learn’s conventions for estimators.
- class tdamapper.learn.MapperAlgorithm(cover: Cover[T_contra] | None = None, clustering: Clustering[S_contra] | None = None, failsafe: bool = True, verbose: bool = True, n_jobs: int = 1)[source]
Bases:
EstimatorMixin,ParamsMixin,Generic[S_contra,T_contra]A class for creating and analyzing Mapper graphs.
This class provides two methods
fit()andfit_transform(). Once fitted, the Mapper graph is stored in the attribute graph_ as anetworkx.Graphobject.This class adopts the same interface as scikit-learn’s estimators for ease and consistency of use. However, it’s important to note that this is not a proper scikit-learn estimator as it does not validata the input in the same way as a scikit-learn estimator is required to do. This class can work with datasets whose elements are arbitrary objects when feasible for the supplied parameters.
- Parameters:
cover – A cover algorithm. If no cover is specified,
tdamapper.core.TrivialCoveris used, which produces a single open cover containing the whole dataset.clustering – The clustering algorithm to apply to each subset of the dataset. If no clustering is specified,
tdamapper.core.TrivialClusteringis used, which produces a single cluster for each subset.failsafe – A flag that is used to prevent failures. If True, the clustering object is wrapped by
tdamapper.core.FailSafeClustering.verbose – A flag that is used for logging, supplied to
tdamapper.core.FailSafeClustering. If True, clustering failures are logged. Set to False to suppress these messages.n_jobs – The maximum number of parallel clustering jobs. This parameter is passed to the constructor of
joblib.Parallel.
- fit(X: ArrayRead[S_contra], y: ArrayRead[T_contra] | None = None) MapperAlgorithm[S_contra, T_contra][source]
Create the Mapper graph and store it for later use.
This method stores the result of
tdamapper.core.mapper_graph()in the attribute graph_ and returns a reference to the calling object.- Parameters:
X – A dataset of n points.
y – Lens values for the n points of the dataset.
- Returns:
The object itself.
- fit_transform(X: ArrayRead[S_contra], y: ArrayRead[T_contra]) Graph[source]
Create the Mapper graph.
This method is equivalent to calling
tdamapper.core.mapper_graph().- Parameters:
X – A dataset of n points.
y – Lens values for the n points of the dataset.
- Returns:
The Mapper graph.
- graph_: Graph
- class tdamapper.learn.MapperClustering(cover: Cover[T_contra] | None = None, clustering: Clustering[S_contra] | None = None, n_jobs: int = 1)[source]
Bases:
EstimatorMixin,ParamsMixin,Generic[S_contra,T_contra]A clustering algorithm based on the Mapper graph.
The Mapper algorithm constructs a graph from a dataset, where each node represents a cluster of points and each edge represents an overlap between clusters. Each point in the dataset belongs to one or more nodes in the graph. These nodes are therefore all connected and share the same connected component in the Mapper graph. This class builds clusters of points according to their connected component in the Mapper graph.
This class does not compute the Mapper graph but calls the function
tdamapper.core.mapper_connected_components(), which is faster.- Parameters:
cover – A cover algorithm.
clustering – The clustering algorithm to apply to each subset of the dataset.
n_jobs – The maximum number of parallel clustering jobs. This parameter is passed to the constructor of
joblib.Parallel.
- fit(X: ArrayRead[S_contra], y: ArrayRead[T_contra] | None = None) MapperClustering[S_contra, T_contra][source]
Fit the clustering algorithm to the data.
- Parameters:
X – A dataset of n points.
y – Ignored.
- Returns:
self
- labels_: list[int]
tdamapper.plot
This module provides functionalities to visualize the Mapper graph.
- class tdamapper.plot.MapperPlot(graph: Graph, dim: Literal[2, 3], iterations: int = 50, seed: int | None = None, layout_engine: Literal['igraph', 'networkx'] = 'igraph')[source]
Bases:
objectClass for generating and visualizing the Mapper graph.
This class creates a metric embedding of the Mapper graph in 2D or 3D and converts it into a plot.
- Parameters:
graph – The precomputed Mapper graph to be embedded. This can be obtained by calling
tdamapper.core.mapper_graph()ortdamapper.core.MapperAlgorithm.fit_transform().dim – The dimension of the graph embedding.
iterations – The number of iterations used to construct the graph embedding.
seed – The random seed used to construct the graph embedding.
layout_engine – The engine used to compute the graph layout in the specified dimensions.
- plot_matplotlib(colors: ~numpy.ndarray[~typing.Any, ~numpy.dtype[~numpy.float64]], node_size: int = 1, agg: ~typing.Callable[[...], ~typing.Any] = <function nanmean>, title: str | None = None, cmap: str = 'jet', width: int = 512, height: int = 512) tuple[Figure, Axes][source]
Draw a static plot using Matplotlib.
- Parameters:
colors – An array of values that determine the color of each node in the graph, useful for highlighting different features of the data.
node_size – A scaling factor for node size.
agg – A function used to aggregate the colors array over the points within a single node. The final color of each node is obtained by mapping the aggregated value with the colormap cmap.
title – The title to be displayed alongside the figure.
cmap – The name of a colormap used to map colors data values, aggregated by agg, to actual RGBA colors.
width – The desired width of the figure in pixels.
height – The desired height of the figure in pixels.
- Returns:
A static matplotlib figure that can be displayed on screen and notebooks.
- plot_plotly(colors: ~numpy.ndarray[~typing.Any, ~numpy.dtype[~numpy.float64]], node_size: int | float | list[int | float] = 1, agg: ~typing.Callable[[...], ~typing.Any] = <function nanmean>, title: str | list[str] | None = None, cmap: str | list[str] = 'jet', width: int | None = None, height: int | None = None) Figure[source]
Draw an interactive plot using Plotly.
- Parameters:
colors – An array of values that determine the color of each node in the graph, useful for highlighting different features of the data.
node_size – A scaling factor for node size. When node_size is a list, the figure will display a slider with the specified values.
agg – A function used to aggregate the colors array over the points within a single node. The final color of each node is obtained by mapping the aggregated value with the colormap cmap.
title – The title for the colormap. When colors has shape (n, m) and title is a list of string, each item will be used as title for its corresponding colormap.
cmap – The name of a colormap used to map colors data values, aggregated by agg, to actual RGBA colors.
width – The desired width of the figure in pixels.
height – The desired height of the figure in pixels.
- Returns:
An interactive Plotly figure that can be displayed on screen and notebooks. For 3D embeddings, the figure requires a WebGL context to be shown.
- plot_plotly_update(fig: Figure, colors: ndarray[Any, dtype[float64]] | None = None, node_size: int | float | list[int | float] | None = None, agg: Callable[[...], Any] | None = None, title: str | None = None, cmap: str | None = None, width: int | None = None, height: int | None = None) Figure[source]
Draw an interactive plot using Plotly on a previously rendered figure.
This is typically faster than calling MapperPlot.plot_plotly on a new set of parameters.
- Parameters:
fig – A Plotly Figure object obtained by calling the method MapperPlot.plot_plotly.
colors – An array of values that determine the color of each node in the graph, useful for highlighting different features of the data.
node_size – A scaling factor for node size.
agg – A function used to aggregate the colors array over the points within a single node. The final color of each node is obtained by mapping the aggregated value with the colormap cmap.
title – The title to be displayed alongside the figure.
cmap – The name of a colormap used to map colors data values, aggregated by agg, to actual RGBA colors.
width – The desired width of the figure in pixels.
height – The desired height of the figure in pixels.
- Returns:
An interactive Plotly figure that can be displayed on screen and notebooks. For 3D embeddings, the figure requires a WebGL context to be shown.
- plot_pyvis(output_file: str, colors: ~numpy.ndarray[~typing.Any, ~numpy.dtype[~numpy.float64]], node_size: int = 1, agg: ~typing.Callable[[...], ~typing.Any] = <function nanmean>, title: str | None = None, cmap: str = 'jet', width: int = 512, height: int = 512) None[source]
Draw an interactive HTML plot using PyVis.
- Parameters:
output_file – The path where the html file is written.
colors – An array of values that determine the color of each node in the graph, useful for highlighting different features of the data.
node_size – A scaling factor for node size.
agg – A function used to aggregate the colors array over the points within a single node. The final color of each node is obtained by mapping the aggregated value with the colormap cmap.
title – The title to be displayed alongside the figure.
cmap – The name of a colormap used to map colors data values, aggregated by agg, to actual RGBA colors.
width – The desired width of the figure in pixels.
height – The desired height of the figure in pixels.
tdamapper.protocols
Protocols for type hinting in tdamapper.
These protocols define the expected interfaces for various components used in the tdamapper library, such as array-like structures, metrics, cover algorithms, clustering algorithms, and spatial search algorithms. They are used to ensure that the components adhere to the expected behaviors and can be used interchangeably in the library’s functions and methods.
- class tdamapper.protocols.Array(*args, **kwargs)[source]
Bases:
ArrayRead[T],ArrayWrite[T],Protocol[T]Protocol for an array-like structure.
- class tdamapper.protocols.ArrayRead(*args, **kwargs)[source]
Bases:
Protocol[T_co]Protocol for a read-only array-like structure.
- class tdamapper.protocols.ArrayWrite(*args, **kwargs)[source]
Bases:
Protocol[T_contra]Protocol for a writeable array-like structure.
- class tdamapper.protocols.Clustering(*args, **kwargs)[source]
Bases:
Protocol[T_contra]Protocol for clustering algorithms.
A clustering algorithm groups data points into clusters, each represented by an integer label. Labels are typically non-negative and may be non-contiguous.
- fit(X: ArrayRead[T_contra], y: ArrayRead[T_contra] | None = None) Clustering[T_contra][source]
Fit the clustering algorithm to the data.
- Parameters:
X – A dataset of n points.
y – A dataset of targets. Typically ignored and present for compatibility with scikit-learn’s clustering interface.
- Returns:
The fitted clustering object.
- labels_: list[int]
- class tdamapper.protocols.Cover(*args, **kwargs)[source]
Bases:
Protocol[T_contra]Protocol for cover algorithms.
A cover algorithm collects open sets from a dataset such that each point belongs to at least one open set. The open sets are represented as lists of indices, where each index corresponds to a point in the dataset. The open sets are eventually overlapping.
- class tdamapper.protocols.Metric(*args, **kwargs)[source]
Bases:
Protocol[T_contra]Protocol for a metric function.
- class tdamapper.protocols.SpatialSearch(*args, **kwargs)[source]
Bases:
Protocol[T_contra]Protocol for spatial search algorithms.
A spatial search algorithm is a method for finding neighbors of a query point in a dataset.
- fit(X: ArrayRead[T_contra]) SpatialSearch[T_contra][source]
Train internal parameters.
- Parameters:
X – A dataset of n points.
- Returns:
The object itself.
tdamapper.utils.metrics
Utilities for computing metrics.
This module provides functions to calculate various distance metrics. A metric, or distance function, is a function that maps two points to a double value, representing the “distance” between them. For a function to qualify as a valid metric, it must satisfy the following properties:
- Symmetry: The distance between two points is the same regardless of the
order, i.e.: \(d(x, y) = d(y, x)\) for all \(x\) and \(y\).
- Positivity: The distance between two distinct points is always positive,
i.e.: \(d(x, y) > 0\) for all distinct \(x\) and \(y\), and \(d(x, x) = 0\) for every \(x\).
- Triangle inequality: The distance between two points is less than or equal
to the sum of the distances from a third point, i.e.: \(d(x, z) \leq d(x, y) + d(y, z)\) for all points \(x, y, z\).
Supported distance metrics include:
Euclidean: The square root of the sum of squared differences between the components of vectors.
Manhattan: The sum of the absolute differences between the components of vectors.
Minkowski: A generalization of the Euclidean and Chebyshev distances, parameterized by an order p.
Chebyshev: The maximum absolute difference between the components of vectors.
Cosine: A distance on unit vectors based on cosine similarity.
- tdamapper.utils.metrics.chebyshev() Metric[Any][source]
Return the Chebyshev distance function for vectors.
The Chebyshev distance is defined as the maximum absolute difference between the components of the vectors.
- Returns:
The Chebyshev distance function.
- tdamapper.utils.metrics.cosine() Metric[Any][source]
Return the cosine distance function for vectors.
The cosine similarity between the input vectors ranges from -1.0 to 1.0.
A value of 1.0 indicates that the vectors are in the same direction.
A value of 0.0 indicates orthogonality (the vectors are perpendicular).
A value of -1.0 indicates that the vectors are diametrically opposed.
The cosine distance is derived from the cosine similarity \(s\) and is defined as: \(d(x, y) = \sqrt{2 \cdot (1 - s(x, y))}\)
This definition ensures that the cosine distance satisfies the triangle inequality on unit vectors.
- Returns:
The cosine distance function.
- tdamapper.utils.metrics.euclidean() Metric[Any][source]
Return the Euclidean distance function for vectors.
The Euclidean distance is defined as the square root of the sum of the squared differences between the components of the vectors.
- Returns:
The Euclidean distance function.
- tdamapper.utils.metrics.get_metric(metric: Literal['euclidean', 'manhattan', 'minkowski', 'chebyshev', 'cosine'] | Metric[Any], **kwargs: Any) Metric[Any][source]
Return a distance function based on the specified string or callable.
- Parameters:
metric – The metric to use. If a callable function is provided, it is returned directly. Otherwise, predefined metric names returned by
tdamapper.utils.metrics.get_supported_metrics()are supported.kwargs – Additional keyword arguments (e.g., ‘p’ for Minkowski distance).
- Returns:
The selected distance metric function.
- Raises:
ValueError – If an invalid metric string is provided.
- tdamapper.utils.metrics.get_supported_metrics() list[Literal['euclidean', 'manhattan', 'minkowski', 'chebyshev', 'cosine']][source]
Return a list of supported metric names.
- Returns:
A list of supported metric names.
- tdamapper.utils.metrics.manhattan() Metric[Any][source]
Return the Manhattan distance function for vectors.
The Manhattan distance is defined as the sum of the absolute differences between the components of the vectors.
- Returns:
The Manhattan distance function.
- tdamapper.utils.metrics.minkowski(p: int | float) Metric[Any][source]
Return the Minkowski distance function for order p on vectors.
The Minkowski distance is a generalization of the Euclidean and Chebyshev distances. When p = 1, it is equivalent to the Manhattan distance, and when p = 2, it is equivalent to the Euclidean distance. When p is infinite, it is equivalent to the Chebyshev distance.
- Parameters:
p – The order of the Minkowski distance.
- Returns:
The Minkowski distance function.
tdamapper.utils.vptree
A module for fast knn and range searches, depending only on a given metric
- class tdamapper.utils.vptree.VPTree(X: ArrayRead[T], metric: Metric[T], kind: Literal['flat', 'hierarchical'] = 'flat', leaf_capacity: int = 1, leaf_radius: float = 0.0, pivoting: Literal['disabled', 'random', 'furthest'] = 'disabled')[source]
Bases:
Generic[T]A Vantage Point Tree, or vp-tree, for fast range-queries and knn-queries.
- Parameters:
X – A dataset of n points.
metric – The metric used to define the distance between points. Accepts any value compatible with tdamapper.utils.metrics.get_metric.
kind – Specifies whether to use a flat or a hierarchical vantage point tree.
leaf_capacity – The maximum number of points in a leaf node of the vantage point tree. Must be a positive value.
leaf_radius – The radius of the leaf nodes. Must be a positive value.
pivoting – The method used for pivoting in the vantage point tree.
- ball_search(point: T, eps: float, inclusive: bool = True) list[T][source]
Perform a ball search in the Vantage Point Tree.
This method searches for all points within a specified radius from a given point.
- Parameters:
point – The query point from which to search for neighbors.
eps – The radius within which to search for neighbors. Must be positive.
inclusive – Whether to include points exactly at the distance eps from point.
- Returns:
A list of points within the specified radius from the given query point.
- knn_search(point: T, k: int) list[T][source]
Perform a k-nearest neighbors search in the Vantage Point Tree.
This method searches for the k-nearest neighbors to a given query point.
- Parameters:
point – The point from which to search for nearest neighbors.
k – The number of nearest neighbors to search for. Must be positive.
- Returns:
A list of the k-nearest neighbors to the given query point.