September 04, 2025
Technological advances have spurred an increase in data complexity and dimensionality. We are now in an era in which data sets containing thousands of features are commonplace. To digest and analyze such high-dimensional data, dimension reduction techniques have been developed and advanced along with computational power. Of these techniques, nonlinear methods are most commonly employed because of their ability to construct visually interpretable embeddings. Unlike linear methods, these methods non-uniformly stretch and shrink space to create a visual impression of the high-dimensional data. Since capturing high-dimensional structures in a significantly lower number of dimensions requires drastic manipulation of space, nonlinear dimension reduction methods are known to occasionally produce false structures, especially in noisy settings. In an effort to deal with this phenomenon, we developed an interactive tool that enables analysts to better understand and diagnose their dimension reduction results. It uses various analytical plots to provide a multi-faceted perspective on the results, helping analysts judge the legitimacy of the structures they see. The tool is available via an R package named DRtool.
Keywords: Non-Linear Dimension Reduction, R Package, R Shiny
The potency of nonlinear dimension reduction methods lies in their flexibility, allowing them to model complex data structures. That same flexibility, however, makes them difficult to use and interpret. Each method requires a slew of hyperparameters that need to be calibrated, and even when adequately calibrated, these methods require a trained eye to interpret. For example, the two most popular nonlinear dimension reduction methods, t-SNE and UMAP, sometimes generate misleading results [1], [2]. The resulting embeddings often display clusters, even when no clusters exist in the data, and cluster sizes and locations can be unreliable. We have developed an interactive tool that analysts may use to conduct a post-hoc analysis of clusterings of their high-dimensional data. The tool uses the minimum spanning tree (MST) to describe the global structure of clusters and provide an additional perspective on inter-cluster relationships. This allows analysts to extract more information from their dimension reduction results by making it easier to differentiate signal from noise.
In this paper, we describe the analytical plots provided by the tool (Section 2). We present an MST stability experiment, demonstrating the MST’s ability to approximate high-dimensional structure, as well as power and size analyses for a novel hypothesis test (Section 3). Finally, we walk through the use of the tool on two separate data sets (Section 4).
Graphs have been applied to many multivariate statistical problems. The authors of [3] introduced the minimal ascending path spanning tree as a way to test for multimodality. The Friedman-Rafsky test [4], along with its modern variations [5]–[7], use the MST to construct a multivariate two-sample test. Single-linkage clustering [8] and runt pruning [9] are both intimately related to the MST. In the context of dimension reduction, IsoMap [10] makes use of neighborhood graphs, [11] introduces the maximum information spanning tree, and [12] uses the MST. These methods, which fall under the category of manifold learning, use graphs to model high-dimensional data assumed to be drawn uniformly from a high-dimensional manifold. An accurate low-dimensional embedding can then be constructed from these graphs. It’s apparent that graphs are useful for describing high-dimensional data, especially when it comes to dimension reduction and cluster analysis. Our tool uses the MST to analyze the reliability of visualizations produced by nonlinear dimension reduction methods.
We’ve opted for the MST for several key properties. First, the MST and shortest paths along it are quick to compute. Second, the MST contains a unique path between any two vertices, providing a well-defined metric on the data. Last, it provides a good summary of the data’s structure. It contains the nearest-neighbor graph as a subgraph, and any edge deletion in the MST partitions the vertices into two sets for which the deleted edge is the shortest distance between them [4].
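For concreteness, these computations are inexpensive with standard graph libraries. The following minimal sketch, assuming the igraph package and a toy data matrix (this is illustrative, not DRtool’s internal code), builds the MST from pairwise distances and extracts the unique path between two vertices.

    library(igraph)

    set.seed(1)
    X <- matrix(rnorm(100 * 5), nrow = 100)  # toy data: 100 points in 5 dimensions
    D <- as.matrix(dist(X))                  # pairwise Euclidean distances
    g <- graph_from_adjacency_matrix(D, mode = "undirected", weighted = TRUE)
    tree <- mst(g)                           # minimum spanning tree of the complete graph

    # the unique MST path between two vertices, e.g. points 1 and 100
    path <- shortest_paths(tree, from = 1, to = 100)$vpath[[1]]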
The MST is meant to provide a robust description of the data’s global structure, and more specifically, inter-cluster relationships. As such, it should be stable in the presence of noise and unaffected by local perturbations of the data. To demonstrate MST stability, we study the effect of random noise on the inter-cluster relationships explained by the MST.
To derive the inter-cluster relationships from the MST, we first take the medoid subtree, i.e. the minimal subtree containing the medoid of each cluster, then apply a simplification procedure (Algorithm 1). The algorithm collapses paths of non-medoid vertices into single edges of equal length. We refer to the output as the simplified medoid subtree. It encodes the global inter-cluster relationships within the data.
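A minimal sketch of the simplification step, in the spirit of Algorithm 1 and assuming an igraph tree with a logical vertex attribute is_medoid (names are illustrative, not DRtool’s internals): non-medoid vertices of degree two are spliced out, and their two incident edges are replaced by a single edge of summed length.

    library(igraph)

    simplify_medoid_subtree <- function(tree) {
      repeat {
        deg <- degree(tree)
        cand <- which(deg == 2 & !V(tree)$is_medoid)
        if (length(cand) == 0) break
        v <- cand[1]
        nb <- as.numeric(neighbors(tree, v))
        w <- sum(incident(tree, v)$weight)   # combined length of the collapsed path
        tree <- add_edges(tree, nb, attr = list(weight = w))
        tree <- delete_vertices(tree, v)
      }
      tree
    }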
To compare simplified medoid subtrees, we use the Robinson-Foulds metric [13]. The R-F metric was originally introduced to quantify the dissimilarity of phylogenetic trees, but the algorithm generalizes to arbitrary weighted trees. It looks at partitions of each tree created by removing individual edges, then counts the number of partitions present in one tree but not the other. We modified the algorithm (Algorithm 2) to specifically measure the dissimilarity in medoid vertices.
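The following sketch illustrates the bipartition counting idea behind our modified metric, under the assumption that both trees carry the same medoid names (the vertex attributes name and is_medoid are illustrative, and details of Algorithm 2 are simplified). Each edge deletion induces a split of the medoids; the distance counts splits present in one tree but not the other.

    medoid_splits <- function(tree) {
      meds <- which(V(tree)$is_medoid)
      med_names <- V(tree)$name[meds]
      ref <- min(med_names)                  # fixed medoid defining a canonical side
      splits <- sapply(seq_len(ecount(tree)), function(i) {
        side <- components(delete_edges(tree, i))$membership[meds]
        names(side) <- med_names
        paste(sort(med_names[side == side[ref]]), collapse = ",")
      })
      unique(splits)
    }

    rf_distance <- function(t1, t2) {
      s1 <- medoid_splits(t1)
      s2 <- medoid_splits(t2)
      length(setdiff(s1, s2)) + length(setdiff(s2, s1))
    }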
The main objective is to analyze and leverage the structural data embedded in the MST. For example, paths between clusters are used to study inter-cluster relationships in the context of the underlying manifold from which the data are drawn.
To start, the user must provide a data matrix, a low-dimensional embedding, and a clustering. From there, the MST is calculated and various analytical plots are provided. The primary plot is the low-dimensional embedding colored according to the provided clustering. There is an option to overlay the medoid MST to understand the global structure of the clusters.
The remaining plots require the user to select two groups of interest, which can be done interactively in one of two ways. The first is to select two endpoints. The MST path between them is calculated and projected onto the low-dimensional embedding, and the two groups are the classes to which the endpoints belong. The second is to select custom groups. The user interacts with the low-dimensional embedding by drawing a boundary around each group, and the projected path then connects the medoids of the two groups. Once the groups and path are specified, the user is provided additional plots for investigating the relationship between the two selected groups of points.
To better understand the path of interest, a local projection method is applied to visualize the path and nearby points in two dimensions. The goal of the projection is to “unwind” the path so it can be used to study the relationship between the two selected groups. We apply Principal Component Analysis followed by regularized Canonical Correlation Analysis in a method we’ve dubbed the PCA–rCCA method.
Let \(P \in \mathbb{R}^{k \times p}\) be the matrix of high-dimensional path points with endpoints \(p_1, p_k \in \mathbb{R}^p\). Let \(X \in \mathbb{R}^{n \times p}\) be the matrix of points of interest, i.e. the points belonging to the selected groups and the points along the path.
The concept is to use Canonical Correlation Analysis to determine the two-dimensional linear projection that best unwinds the path. Given two matrices, CCA iteratively calculates linear combinations of each matrix’s variables, known as canonical variate pairs, that maximize covariance. These pairs are chosen to be orthogonal, so they give rise to a projection subspace. To unwind \(P\), we use CCA to compare \(P\) against a degree \(d\) polynomial design matrix \(P_d\), \[P_d = \begin{bmatrix} 1 & 1^2 & \cdots & 1^d \\ 2 & 2^2 & \cdots & 2^d \\ \vdots & \vdots & & \vdots \\ k & k^2 & \cdots & k^d \end{bmatrix},\] whose rows index position along the path. The first two canonical variate pairs are used to construct a two-dimensional projection that maximizes the covariance between the projections of \(P\) and \(P_d\). This process generates a two-dimensional subspace onto which we can project all of \(X\). Regularization is required to avoid singularity because \(p\) is often much greater than \(k\). The regularization constant for \(P\) is chosen using cross-validation. No regularization constant is needed for \(P_d\). See [14] for details.
One issue with this method is that the projected path often travels along the outskirts of the plot. This is due to the near-orthogonality of high-dimensional data [15]. Because the non-path points are often nearly orthogonal to the projection subspace, they are overly shrunk in the projection. The path points are less affected because the projection subspace is selected to retain the path’s shape. While this phenomenon doesn’t discredit the entire plot, it misrepresents the path’s location relative to the rest of the points.
To alleviate this issue, we apply PCA to the entirety of \(X\) prior to applying rCCA. Removing extraneous dimensions that contain mostly noise prevents excess noise from being mistaken for independence. When rCCA is applied post-PCA, the projected path’s position relative to the rest of the points is more credible.
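A condensed sketch of the PCA–rCCA pipeline, assuming the CCA package’s rcc() for the regularized step; the fixed regularization constant stands in for the cross-validated choice, and the variable names are illustrative. P holds the k path points and X the points of interest, both in the original p dimensions.

    library(CCA)

    k <- nrow(P); d <- 3; n_pc <- 100

    pc   <- prcomp(X, rank. = n_pc)            # PCA step: drop extraneous noise dimensions
    P_pc <- predict(pc, P)                     # path points in PCA coordinates

    P_d <- outer(seq_len(k), seq_len(d), `^`)  # degree-d polynomial design matrix

    fit  <- rcc(P_pc, P_d, lambda1 = 0.1, lambda2 = 0)  # lambda1 chosen by CV in practice
    proj <- predict(pc, X) %*% fit$xcoef[, 1:2]         # 2-D projection of all points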
The user is responsible for calibrating the dimensionality of the PCA step and the degree \(d\) of the reference polynomial design matrix. When picking the number of dimensions, we recommend starting with a moderately large number relative to the dimensionality of the original data. The proportion of variance retained by the selected number of dimensions is displayed in the upper right-hand corner of the plot. A larger number of dimensions retains more information but may misrepresent the location of the path relative to the rest of the points, while a smaller number may discard some of the variation in the data. As such, the user is encouraged to try different numbers of dimensions. To calibrate \(d\), we recommend starting with \(d = 2\) and incrementing \(d\) until the shape of the path stabilizes.
The user is also given the option to overlay a kernel density estimate, which requires calibrating the bandwidth. The recommended procedure is to begin with a large bandwidth that estimates one mode, then gradually decrease the bandwidth until two modes appear. If the two modes correspond to the two groups of interest, and more modes do not immediately appear as the bandwidth continues to decrease, then a bimodal distribution is a reasonable description of the data.
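This sweep is easy to mimic outside the tool. A small sketch, assuming x holds one-dimensional coordinates along the projected path (a simplification of the tool’s two-dimensional overlay):

    count_modes <- function(x, bw) {
      dens <- density(x, bw = bw)
      sum(diff(sign(diff(dens$y))) == -2)    # local maxima of the estimate
    }

    for (bw in seq(2, 0.2, by = -0.2)) {
      cat(sprintf("bw = %.1f: %d mode(s)\n", bw, count_modes(x, bw)))
    }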
Another perspective on the relationship between the two selected groups can be gained by studying the local structure of the MST. The degree of connectivity between the two groups within the MST serves as a measure of separation. A large degree of connectivity indicates less separation, while a small degree of connectivity indicates more. This idea motivates a hypothesis test.
The test statistic, meant to quantify local connectivity, is based on the number of edges connecting the two groups of interest. Counting single edges alone, however, is too restrictive a measure. Consider the case when the two selected groups are polar ends of the same cluster. Because the medial region of the cluster does not belong to either group, there will be zero edges connecting the two groups, indicating the maximal degree of separation. This result is undesirable because the two groups actually belong to the same cluster.
Instead, the test statistic counts the number of connecting paths rather than single edges. These paths are referred to as crossings and are counted according to the following procedure. The minimal subtree containing both groups is isolated. Because the two groups may not be adjacent in the MST, this subtree may contain points belonging to other clusters as well. To extract the structural relationship between the two groups of interest, the subtree must be simplified. The simplification process collapses paths between the two groups of interest into edges that can be counted (Algorithm 3).
To count the number of crossings, the number of edges between the two groups in the simplified subtree is counted. It is also possible for a point of non-interest to act as a mediator along a path between the two groups of interest. To account for this scenario, for each point of non-interest adjacent to both groups, we also count its maximal degree to both groups.
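A hedged sketch of the counting rule, assuming an igraph subtree that has already been simplified (Algorithm 3) with a vertex attribute group equal to 1 or 2 for the two groups of interest and 0 for points of non-interest; the mediator rule below reflects one reading of “maximal degree to both groups.”

    count_crossings <- function(sub) {
      ep <- ends(sub, E(sub), names = FALSE)
      g1 <- V(sub)$group[ep[, 1]]
      g2 <- V(sub)$group[ep[, 2]]
      direct <- sum((g1 == 1 & g2 == 2) | (g1 == 2 & g2 == 1))

      mediated <- 0    # points of non-interest adjacent to both groups
      for (v in which(V(sub)$group == 0)) {
        nb <- V(sub)$group[neighbors(sub, v)]
        if (any(nb == 1) && any(nb == 2)) {
          mediated <- mediated + max(sum(nb == 1), sum(nb == 2))
        }
      }
      direct + mediated
    }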
The null distribution should correspond to the number of crossings in the case when both groups belong to the same cluster. Because a cluster can be drawn from any number of unimodal distributions, we must consider a composite null hypothesis including all such distributions. Among these distributions, the one that maximizes the probability of rejection, i.e. the least favorable member, must be used to calibrate the test so that it has the correct size. That way, the probability of Type I error does not exceed the pre-specified significance level under every member of the composite null hypothesis.
We are in search of the unimodal distribution that minimizes the number of edges crossing a pre-specified hyperplane, representing the boundary between the groups. Finding this distribution is an intractable problem, so assumptions must be made. If we assume the number of edges crossing the hyperplane is proportional to the marginal density in a neighborhood around the hyperplane, then we may reduce the problem to the one-dimensional case.
Let \(n_1, n_2 > 0\) be the sample sizes of each group and \(c \in [-1, 1]\) the location of the mode. Let \(\mathcal{F}\) be the family of distributions on \([-1, 1]\) such that
\(f\) is increasing on \([-1, c]\),
\(f\) is decreasing on \([c, 1]\),
\(\int_{-1}^0 f = \frac{n_1}{n_1+n_2}\), and
\(\int_0^1 f = \frac{n_2}{n_1+n_2}\).
Let \(\epsilon \in (0, 1)\). The aim is to find an \(f \in \mathcal{F}\) that minimizes \(\int_{-\epsilon}^\epsilon f\).
The proof (Appendix A) is broken up into multiple cases based on the values of the ratio \(n_1/n_2\) and \(c\). In all cases, a piecewise constant solution exists in which the density near the hyperplane is proportional to the density of the less dense group. This formal problem motivates the procedure by which the null distribution is simulated. First, the density of each group is approximated, \[D_j = \frac{n_j}{\prod_i \sigma_i^j}\] for \(j = 1,2\), where \(\sigma_i^j\) is the \(i\)th singular value of group \(j\). The product of singular values is used to estimate the volume of each group because high-dimensional clusters tend to look Gaussian [15]. Extraneous noise dimensions are removed prior to this process to avoid biasing the volume estimates. Now suppose \(D_1 < D_2\). Then \(n_1\) points are uniformly sampled from a hyperrectangle with side lengths \(\sqrt{12}\sigma_i^1\) (the factor \(\sqrt{12}\) ensures the variance of the sample in each principal direction equals the variance of the original data along the corresponding principal component), and the number of MST edges crossing the hyperplane is recorded. Repeated simulation yields an approximate null distribution to which the test statistic is compared. The returned \(p\)-value is the percentile of the test statistic within this bootstrapped null distribution. A one-sided test is employed because we are only interested in rejecting the null for sufficiently small numbers of edge crossings.
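A sketch of the simulation, assuming the less dense group has n1 points with singular values sigma after noise dimensions are removed, and taking the separating hyperplane to be the midplane in the first coordinate (an illustrative choice, not necessarily the package’s):

    library(igraph)

    simulate_null_crossings <- function(n1, sigma, B = 500) {
      replicate(B, {
        # uniform sample from a hyperrectangle matched to the group's variances
        Z <- sapply(sqrt(12) * sigma, function(s) runif(n1, -s / 2, s / 2))
        D <- as.matrix(dist(Z))
        tree <- mst(graph_from_adjacency_matrix(D, mode = "undirected", weighted = TRUE))
        e <- ends(tree, E(tree), names = FALSE)
        sum(sign(Z[e[, 1], 1]) != sign(Z[e[, 2], 1]))  # edges crossing the hyperplane
      })
    }

    # one-sided bootstrap p-value for an observed crossing count obs:
    # p_val <- mean(simulate_null_crossings(n1, sigma) <= obs)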
Power and size analyses are conducted in controlled situations. They are described in Section 2.7, and the results are presented in Section 3.2.
The heatmap is a very useful tool for comparing groups because it provides a feature-by-feature perspective. It pinpoints the exact features in which the two groups differ the most. The interactive heatmap also allows users to select and analyze sub-heatmaps, providing a more focused view on specific features. The features are ordered according to difference in group means.
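The ordering rule itself is simple. A minimal sketch, assuming X1 and X2 hold the rows of the data matrix belonging to the two selected groups:

    # features sorted by absolute difference in group means
    ord <- order(abs(colMeans(X1) - colMeans(X2)), decreasing = TRUE)
    heat_data <- rbind(X1, X2)[, ord]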
Along with the data and clustering, the user may also supply metadata corresponding to the samples in the original data. The metadata for each group are presented via pie charts for categorical variables and box plots for numerical variables. These plots are useful for discovering trends in the data.
To demonstrate the MST’s robust ability to describe global structure, we conducted a stability experiment. 1,500 samples were randomly chosen from the MNIST data set of handwritten digits [16]. Each \(28 \times 28\)-pixel image was flattened into a vector of length \(28^2 = 784\), so the data contain 1,500 samples in \(784\) dimensions. A PCA pre-processing step reduced the number of dimensions to 300. The simplified medoid subtree \(T\) was then calculated.
Random Gaussian noise was then added to the data and the new simplified medoid subtree \(T'\) was calculated. The R-F distance \(RF(T, T')\) was recorded. This process was repeated 30 times.
To better interpret the R-F distances, we constructed a null distribution of distances as a reference for comparison. These distances should represent R-F distances between trees that do not portray similar global structures and inter-cluster relationships. To generate the null distribution from the data, we randomly permuted the class labels and computed the R-F distances between the resulting simplified medoid subtrees and the original simplified medoid subtree. By randomly re-labelling the clusters, we simulate examples with distinct global structures. Figure 1 shows the R-F distances produced by adding noise and by permuting the class labels. The simplified medoid subtrees generated by adding noise were significantly closer to the original, in terms of R-F distance, than those generated by randomly permuting the class labels, showing that the inter-cluster relationships encoded in the MST are robust to noise.
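A condensed sketch of the experiment, assuming a helper simplified_medoid_subtree(X, labels) that builds the MST, extracts the medoid subtree, and applies Algorithm 1 (a hypothetical wrapper around the sketches above), together with the rf_distance() sketch; the noise scale is illustrative.

    T0 <- simplified_medoid_subtree(X, labels)   # X: PCA-reduced MNIST sample

    noise_dists <- replicate(30, {
      Xn <- X + matrix(rnorm(length(X), sd = 1), nrow = nrow(X))  # sd is illustrative
      rf_distance(T0, simplified_medoid_subtree(Xn, labels))
    })

    null_dists <- replicate(30, {                # permuted labels: distinct structure
      rf_distance(T0, simplified_medoid_subtree(X, sample(labels)))
    })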
Figure 1: MST stability results on the MNIST data set.
The power of the test depends on the relative degree of separation between the two clusters. The experiment is set up as follows. Let \(c \in (0,1]\). 50 points are randomly sampled from \([-2, -c] \times [-1,1]^{p-1}\), and 50 points are randomly sampled from \([c, 2] \times [-1,1]^{p-1}\). In other words, two hyperrectangular \(p\)-dimensional clusters are sampled and separated by a distance of \(2c\). The MST test is run and the \(p\)-value is recorded. Through simulation, the power at varying levels of \(c\) and \(p\) is estimated. The power is expected to increase with \(c\) and decrease with \(p\). In higher dimensions, distances are inflated due to the increased noise-to-signal ratio, so the inter-cluster separation appears less significant.
To ensure the probability of Type I error does not exceed \(5\%\), a size analysis is also conducted. The same experiment with \(c = 0\), i.e. no separation between the two clusters, determines the size. See Figure 2 for results.
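A sketch of one run of the experiment, assuming a hypothetical mst_test() wrapper that returns the bootstrapped p-value:

    sample_two_boxes <- function(c, p, n = 50) {
      left  <- cbind(runif(n, -2, -c), matrix(runif(n * (p - 1), -1, 1), n))
      right <- cbind(runif(n,  c,  2), matrix(runif(n * (p - 1), -1, 1), n))
      list(X = rbind(left, right), groups = rep(1:2, each = n))
    }

    # estimated power at the 5% level; setting c = 0 gives the size experiment
    # mean(replicate(100, {
    #   s <- sample_two_boxes(c = 0.5, p = 10)
    #   mst_test(s$X, s$groups) < 0.05
    # }))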
Figure 2: Power analysis of the MST test.
The estimated size was well below \(5\%\). At each number of dimensions, the size experiment was simulated 100 times. Under the null hypothesis, the test never returned a significant result at 5, 10, and 20 dimensions. At 50 and 100 dimensions, the test returned a significant result only once each. The test is conservative because the size must not exceed \(5\%\) for any member of the composite null hypothesis.
To demonstrate use of the tool, we explore the MNIST data set in detail. Again, the \(28 \times 28\)-pixel images were flattened and 1,500 samples were randomly sampled. A PCA pre-processing step was applied prior to applying UMAP [17] to construct a two-dimensional embedding. To replicate a real use case, we study a k-means clustering instead of the true class labels (Figure 3). The reader may follow along using the run_example(example="MNIST", cluster="kmeans") function in our DRtool package.
Figure 3: UMAP embedding of the MNIST data set colored according to the k-means clustering.
At first glance, there are three major instances of disagreement between the UMAP embedding and the k-means clustering. Classes 1 and 2 seem to form one cluster together, class 4 is split into two separate clusters, and class 9 is merged with points from other clusters.
There seems to be minimal separation between classes 1 and 2, suggesting they may correspond to the same digit. We select a path from point 25,483 in class 1 to point 44,483 in class 2. For a closer look, we first turn to the Path Projection Plot. The chosen number of dimensions is 100, which retains 97% of the variance, and the path stabilizes at a CCA degree of 4.
The resulting plot depicts overlap between the two classes. Adjusting the bandwidth of the density estimate to 1.5 shows a unimodal density, suggesting the two classes may come from the same population. Showing the MST edges also does not provide any evidence of separation. The MST test results, however, may suggest otherwise. Seven crossings are counted when the approximate expectation under the null is 11.03 with a standard error of 3.523. While the bootstrapped \(p\)-value of \(0.06\) is not significant at the 5% level, its closeness to the threshold warrants a more careful examination.
Inspection of the handwritten digits themselves reveals an interesting trend. While the majority of samples from both classes depict the digit one, the angle of the stroke differs drastically between the two classes (Figure 4). Following the path from class 1 to class 2, the strokes become more slanted. Both the MST path and MST test were able to detect this phenomenon, even though the two classes technically corresponded to the same digit.
Figure 4: Path projection plot of classes 1 and 2.
Overall, the analytical plots provide a deeper look into the situation. The images of the digit one follow a skewed unimodal density centered around those drawn with a vertical stroke. The tail contains those drawn with more slanted strokes, specifically strokes drawn from the top right to the bottom left. While UMAP correctly captured this cluster, the gradual decline in density associated with increasingly slanted ones is better depicted in the Path Projection Plot.
Class 4 is split between two different clusters in the UMAP embedding. We use the drawing tool to select the two clusters as our groups. The path projection settings are calibrated to 100 dimensions and a CCA degree of three. We also select the Group Coloring setting, so the points are colored according to group rather than class. Analysis of the plot and estimated density does not provide evidence of separation. The MST edges, however, are more revealing upon close inspection. There are few inter-group edges, even in overlapping regions (Figure 5). In contrast, the MST test counts a larger number of edge crossings: 14 are counted when only 12.02 are expected, with a standard error of 3.378.
Figure 5: Path projection plot of class 4 clusters.
According to the true class labels, these clusters correspond to distinct digits (Figure B1). The top class 4 cluster corresponds to the digit eight, while the bottom class 4 cluster corresponds to the digit five. The connecting class 1 cluster corresponds to the digit three. Three, five, and eight share common strokes, leading to blurred boundaries between their respective clusters and making the separation difficult for the MST test to detect. However, when increasing the sample size to 1,000 randomly sampled images of the digits 3, 5, and 8 (up from 553), the MST test comes back significant for all pairwise comparisons (Table B1). Because the test is conservative by construction, the small effect sizes were difficult to detect at the smaller sample size. This isn’t particularly surprising, because UMAP was also unable to detect the separation.
Class 9 is well-separated, but its cluster also contains some points from other classes, mainly class 6. To determine whether these points should belong to the same class, we use the drawing tool to select the class 9 points and the remaining points in the cluster as our groups. The path projection settings are calibrated to 100 dimensions and a CCA degree of five. Together, the points form a unimodal cluster, as shown by the approximate density calculated with a bandwidth of 1.3 (Figure 6). Visually, there is also a consistent density of MST edges throughout the cluster, even where the two groups meet. The MST test agrees: 16 crossings are counted, just below the expected value of 16.37 under the null hypothesis. All evidence points towards merging these two groups.
Figure 6: Path projection plot of class 9 and the remainder of the cluster.
According to the true class labels, this entire cluster corresponds to the digit six (Figure B1). The k-means clustering incorrectly scattered these points across multiple classes.
We now explore a mass cytometry data set [18] covering 35 samples originating from eight distinct human tissues enriched for T and natural killer cells. The data are processed and labeled in line with the procedure used in [19]. 3,000 cells were randomly sampled. To replicate a real use case, we explore a \(k\)-means clustering instead of the true class labels (Figure 7). The reader may follow along using the run_example(example="Wong", cluster="kmeans") function.
Figure 7: UMAP embedding of the Wong data set colored according to the k-means clustering.
Most of the \(k\)-means clustering seems to agree with the UMAP embedding. However, classes 4 and 8 are both split between two distinct clusters. Class 3 is also separated into three smaller sub-clusters.
Class 4 is split between two separate clusters in the UMAP embedding. To diagnose the split, we select the two clusters using the drawing tool. The path projection settings are calibrated to 20 dimensions and a CCA degree of two. We also select the Group Coloring setting, so the points are colored according to group rather than class. The plot, along with the estimated density, does not provide any evidence of separation (Figure 8). The two selected groups also have 18 crossings, larger than the null expectation of 15.62. All evidence indicates the two groups were sampled from the same population, in agreement with the \(k\)-means clustering.
Figure 8: Path projection plot of class 4 clusters.
To better understand why these two clusters are separated despite minimal evidence of separation, we reference the heatmap and metadata. According to the heatmap, the two groups differ most in CD8 T cell counts. This is confirmed by the cell labels provided by [19], which were passed to the tool as metadata. So while separation wasn’t observed by the MST, the discrepancy in CD8 T cell counts accounts for the splitting of the class 4 points.
As it turns out, the \(k\)-means clustering got it right. Together, these two groups make up the cells sampled from skin tissue (Figure B2). Within the skin tissue cells, however, exist two subgroups differentiated by CD8 T cell count.
The majority of class 8 points lie in a self-contained cluster. However, the rest lie in a separate nearby cluster. We select the two class 8 clusters as our two groups and study the projection of the path between them. The path projection settings are calibrated to 20 dimensions and a CCA degree of three. Similar to the UMAP embedding, the path projection plot depicts one dense cluster containing a majority of the points (Figure 9). The remainder of the points fall to one side in a low-density region. There is certainly not enough evidence to conclude the two groups belong to separate clusters from this plot alone.
Figure 9: Path projection plot of class 8 clusters.
Running the MST test returns interesting results: 20 crossings are counted when only 16.11 are expected, with a standard error of 4.44. The number of crossings is well above the number expected under the null hypothesis. This is due to the disparity in group densities. In order to preserve the size of the test, the null distribution must be constructed using a unimodal distribution whose density is similar to that of the less dense cluster (Section 2.4.2). This is why the expected number of crossings is so low. This phenomenon, however, has statistical meaning. The low relative density of group 1 provides minimal evidence from which we may draw conclusions. This lack of confidence is reflected in the low null expectation, and thus the inflated \(p\)-value.
Overall, there is not enough evidence to declare the two groups correctly separated. This is not quite correct according to the true labels. The two groups were not sampled from the same type of tissue, but the situation is slightly more complicated. See Figure B2 for details.
The class 3 points are separated into three clusters: two larger, elongated clusters to the left and one smaller cluster to the right (Figure 7). Initial analysis of the path projection plot and MST test did not reveal any evidence of separation. However, the heatmap and metadata explain the separation captured by UMAP. The two larger clusters to the left are completely disjoint in their CD161 counts but very similar in all other features. Meanwhile, the third, smaller cluster is distinct from the two larger clusters in its TCRgD counts. The metadata also reveal that the left two clusters consist of CD8 T cells, while the right cluster consists of Tgd cells.
To better understand why the MST was unable to capture the separation made apparent by UMAP, we investigated the original data. Recall 3,000 cells were randomly sampled from a total of 327,457. Of these 3,000 cells, 242 belong to one of the two major sub-clusters in class 3. The MST test on these 242 cells did not reject (\(p = 0.87\)). To increase power, 4,000 cells were randomly sampled from class 3, and the MST test was run on those belonging to the two major sub-clusters. The test still did not reject, but came much closer at \(p = 0.17\). When 5,000 cells were randomly sampled from class 3, the test did reject with \(p < 0.01\). It seems the MST does capture the separation, but the test’s power is too low to reject at smaller sample sizes.
We have introduced our R package, DRtool, and exemplified its use cases. The MST serves as an effective medium for understanding high-dimensional relationships and structures. The various analytical tools provided by the package allow the user to extract a maximal amount of information from the MST by providing multiple perspectives. Such a multi-faceted view is necessary for understanding contemporary dimension reduction methods, which try to fit hundreds, or even thousands, of dimensions’ worth of information into only two dimensions. Advances in multiple fields have led to a surge in complex data, necessitating tools such as ours that help analysts assess and confirm their dimension reduction results.
Future work should explore alternate methods for projecting paths into two dimensions. The goal of the projection is to “unwind” the path, which is a non-linear transformation, but non-linear methods could pose two problems. First, most non-linear methods do not have a natural out-of-sample extension that can be used to project points of interest other than the path points. Second, non-linear methods can be prone to overfitting, especially when the path only contains a handful of points. Linear methods, on the other hand, define a linear transformation on the entire data space, so the projection naturally extends to points not on the path. Their rigidity also prevents overfitting. The downside is that linear methods are known to fail in high dimensions due to the near-orthogonality of high-dimensional data. They also shrink space, which may obscure fine structural details that only non-linear methods are capable of capturing.
Future work should also explore alternate methods of estimating cluster volume when calculating cluster density during the MST testing procedure. The product of singular values works well for clusters that are roughly ellipsoidal or rectangular, but can fail for irregularly shaped clusters. A better density estimate could increase the power of the MST test.
All data and code are freely available at https://www.github.com/JustinMLin/DRtool.
The authors report there are no competing interests to declare.
Appendices
Let \(n_1, n_2 > 0\) be the sample sizes of each group and \(c \in [-1,1]\) the location of the mode. Without loss of generality, assume \(c \geq 0\). Let \(\mathcal{F}\) be the family of distributions on \([-1, 1]\) such that
\(f\) is increasing on \([-1, c]\),
\(f\) is decreasing on \([c, 1]\),
\(\int_{-1}^0 f = \frac{n_1}{n_1+n_2}\), and
\(\int_0^1 f = \frac{n_2}{n_1+n_2}\).
Let \(\epsilon \in (0, 1)\). The aim is to find an \(f \in \mathcal{F}\) that minimizes \(\int_{-\epsilon}^\epsilon f\).
Lemma A.1. If \(c \geq 0\) and \(\epsilon > 0\), then \[\int_{-\epsilon}^0 f \geq \epsilon\frac{n_1}{n_1+n_2}\] for all \(f \in \mathcal{F}\).
Proof. Let \(f \in \mathcal{F}\). By means of contradiction, suppose \[\int_{-\epsilon}^0 f < \epsilon\frac{n_1}{n_1+n_2}.\] Then \[\int_{-1}^{-\epsilon} f = \int_{-1}^0 f - \int_{-\epsilon}^0 f = \frac{n_1}{n_1+n_2} - \int_{-\epsilon}^0 f > (1-\epsilon)\frac{n_1}{n_1+n_2}.\] Hence, there exists \(x_0 \in (-1, -\epsilon)\) such that \[f(x_0) > \frac{n_1}{n_1+n_2}.\] By the unimodality constraint, \[f(x) \geq f(x_0) > \frac{n_1}{n_1+n_2} \textrm{ for all } x \in [-\epsilon, 0].\] This implies \[\int_{-\epsilon}^0 f \geq \epsilon\frac{n_1}{n_1+n_2},\] a contradiction. Therefore, \[\int_{-\epsilon}^0 f \geq \epsilon\frac{n_1}{n_1+n_2}.\] ◻
Let \(f \in \mathcal{F}\). By Lemma A.1 and the unimodality constraint, \[f(x) \geq \sup\{f(y): -\epsilon \leq y \leq 0\} \geq \frac{n_1}{n_1+n_2} \textrm{ for all } x\in [0, \epsilon).\] Hence, \[\int_0^\epsilon f \geq \epsilon\frac{n_1}{n_1+n_2}.\] We’ve shown \[\int_{-\epsilon}^\epsilon f = \int_{-\epsilon}^0 f + \int_0^\epsilon f \geq 2\epsilon\frac{n_1}{n_1+n_2} \textrm{ for all } f \in \mathcal{F}.\] Therefore, the function \(f':[-1,1] \to \mathbb{R}\) defined by \[f'(x) = \begin{cases} \frac{n_1}{n_1+n_2} & -1 \leq x \leq \epsilon \\ \frac{n_2 - \epsilon n_1}{(1-\epsilon)(n_1 + n_2)} & \epsilon < x \leq 1 \end{cases}\] belongs to \(\mathcal{F}\) and minimizes \(\int_{-\epsilon}^\epsilon f\).
Let \(f \in \mathcal{F}\). By Lemma A.1, there exists \(x_0 \in (-\epsilon, 0]\) such that \(f(x_0) \geq \frac{n_1}{n_1+n_2}\). Thus, \[f(x) \geq f(x_0) \geq \frac{n_1}{n_1+n_2} \textrm{ for all } x \in [0, c)\] \[\Rightarrow \int_0^c f \geq c\frac{n_1}{n_1+n_2}.\]
By the unimodality constraint, \[\frac{\int_c^\epsilon f}{\int_\epsilon^1 f} \geq \frac{\epsilon - c}{1-\epsilon}\] \[\begin{align} \Rightarrow \int_c^\epsilon f &\geq \frac{\epsilon-c}{1-\epsilon} \int_\epsilon^1 f \\ &= \frac{\epsilon-c}{1-\epsilon} \left(\int_0^1 f - \int_0^c f - \int_c^\epsilon f \right) \\ &= \frac{\epsilon-c}{1-\epsilon}\frac{n_2}{n_1+n_2} - \frac{\epsilon-c}{1-\epsilon}\int_0^c f - \frac{\epsilon-c}{1-\epsilon}\int_c^\epsilon f \end{align}\] \[\Rightarrow \frac{1-c}{1-\epsilon} \int_c^\epsilon f \geq \frac{\epsilon-c}{1-\epsilon}\frac{n_2}{n_1+n_2} - \frac{\epsilon-c}{1-\epsilon}\int_0^c f\] \[\Rightarrow \int_c^\epsilon f \geq \frac{\epsilon-c}{1-c}\frac{n_2}{n_1+n_2} - \frac{\epsilon-c}{1-c}\int_0^c f\] It follows that \[\begin{align} \int_0^\epsilon f &= \int_0^c f + \int_c^\epsilon f \\ &\geq \int_0^c f + \frac{\epsilon-c}{1-c}\frac{n_2}{n_1+n_2} - \frac{\epsilon-c}{1-c}\int_0^c f \\ &= \left(1 - \frac{\epsilon-c}{1-c}\right)\int_0^c f + \frac{\epsilon-c}{1-c}\frac{n_2}{n_1+n_2} \\ &\geq \left(1 - \frac{\epsilon-c}{1-c}\right)c\frac{n_1}{n_1+n_2} + \frac{\epsilon-c}{1-c}\frac{n_2}{n_1+n_2} \\ & = c\frac{n_1}{n_1+n_2} + \frac{(\epsilon-c)(n_2 - cn_1)}{(1-c)(n_1+n_2)}. \end{align}\]
Hence, \[\int_{-\epsilon}^\epsilon f = \int_{-\epsilon}^0 f + \int_0^\epsilon f \geq \epsilon\frac{n_1}{n_1+n_2} + c\frac{n_1}{n_1+n_2} + \frac{(\epsilon-c)(n_2-cn_1)}{(1-c)(n_1+n_2)}.\] Therefore, the function \(f':[-1,1] \to \mathbb{R}\) defined by \[f'(x) = \begin{cases} \frac{n_1}{n_1+n_2} & -1 \leq x \leq c \\ \frac{n_2 - cn_1}{(1-c)(n_1 + n_2)} & c < x \leq 1 \end{cases}\] belongs to \(\mathcal{F}\) and minimizes \(\int_{-\epsilon}^\epsilon f\).
Let \(f \in \mathcal{F}\). First we show \[\int_0^\epsilon f \geq \epsilon\frac{n_2}{n_1+n_2}.\] By means of contradiction, assume \[\int_0^\epsilon f < \epsilon\frac{n_2}{n_1+n_2}.\] There exists \(x_0 \in [0, \epsilon)\) such that \[f(x_0) < \frac{n_2}{n_1+n_2}.\]
If \(x_0 \leq c\), then \[f(x) \leq f(x_0) < \frac{n_2}{n_1+n_2} < \frac{n_1}{n_1+n_2} \textrm{ for all } x \in (-\epsilon, 0]\] \[\Rightarrow \int_{-\epsilon}^0 f < \epsilon\frac{n_1}{n_1+n_2},\] contradicting Lemma A.1.
If \(x_0 > c\), then \[f(x) \leq f(x_0) < \frac{n_2}{n_1+n_2} \textrm{ for all } x \in [\epsilon, 1]\] \[\Rightarrow \int_\epsilon^1 f < (1-\epsilon)\frac{n_2}{n_1+n_2}.\] Thus, \[\int_0^1 f = \int_0^\epsilon f + \int_\epsilon^1 f < \epsilon\frac{n_2}{n_1+n_2} + (1-\epsilon)\frac{n_2}{n_1+n_2} = \frac{n_2}{n_1+n_2},\] contradicting \(f \in \mathcal{F}\). Therefore, \[\int_0^\epsilon f \geq \epsilon\frac{n_2}{n_1+n_2}.\]
We’ve shown \[\int_{-\epsilon}^\epsilon f = \int_{-\epsilon}^0 f + \int_0^\epsilon f \geq \epsilon\frac{n_1}{n_1+n_2} + \epsilon\frac{n_2}{n_1+n_2}.\] Therefore, the function \(f':[-1,1] \to \mathbb{R}\) defined by \[f'(x) = \begin{cases} \frac{n_1}{n_1+n_2} & -1 \leq x \leq 0 \\ \frac{n_2}{n_1 + n_2} & 0 < x \leq 1 \end{cases}\] belongs to \(\mathcal{F}\) and minimizes \(\int_{-\epsilon}^\epsilon f\).
Let \(f \in \mathcal{F}\). By Lemma A.1, \[\int_{-\epsilon}^0 f \geq \epsilon\frac{n_1}{n_1+n_2}.\] A symmetrical argument shows \[\int_0^\epsilon f \geq \epsilon\frac{n_2}{n_1+n_2}.\] Therefore, the function \(f':[-1,1] \to \mathbb{R}\) defined by \[f'(x) = \begin{cases} \frac{n_1}{n_1+n_2} & -1 \leq x \leq 0 \\ \frac{n_2}{n_1+n_2} & 0 < x \leq 1 \end{cases}\] belongs to \(\mathcal{F}\) and minimizes \(\int_{-\epsilon}^\epsilon f\).
Figure B1: MNIST embedding with true labels.
Figure B2: Wong embedding with true labels.