The distribution of persistent cycles
Let (mathcal {S}) be a d-dimensional metric measure space, and let ({textbf{X}}_n = (X_1,ldots ,X_n)in S^n) be a sequence of random variables (points), whose joint probability law is denoted by (mathbb {P}_n). Let (mathbb {P}= (mathbb {P}_n)_{n=1}^infty), and denote (mathbb {S}=(mathcal {S},mathbb {P})), which we refer to as the sampling model. Fix a filtration type (mathcal {T}) (e.g., echor Rips), and a homological degree (k>0), and consider the k-th noise persistence diagram ({{textrm{dgm}}}_k^{_{textbf{N}}}({textbf{X}}_n;mathcal {T})), which in short we denote by ({{textrm{dgm}}}_k). We study the distribution of the random persistence values (left{ pi (p)right} _{pin {{textrm{dgm}}}_k}) (where (pi (p) = {{,textrm{death},}}(p)/{{,textrm{birth},}}(p))), and refer to them as the (pi)-values of the diagram. Theoretical analysis shows that the largest (pi)-value (({{,textrm{death},}}/{{,textrm{birth},}}) ratio) of points inthe noise (({{textrm{dgm}}}_k^{_{textbf{N}}})) is (o((log n)^{1/k}))26, while the (pi)-values of the signal features (({{textrm{dgm}}}_k^{_{textbf{S}}})) are (Theta (n^{1/d}))33. Thus, the (pi)-values provide a strong separation (asymptotically) between signal and noise in persistence diagrams. We stress that in this paper we study the entire ensemble of persistence values, not only the maximal ones.
We begin by considering the case where (mathbb {P}_n) is a product measure, and the points (X_1,ldots ,X_n) are iid (independent and identically distributed). Given ({{textrm{dgm}}}_k) as defined above, denote the empirical measure of all (pi)-values
$$begin{aligned} Pi _{n} = Pi _{n}(mathbb {S},mathcal {T},k):= frac{1}{|{{textrm{dgm}}}_k|}sum _{pin {{textrm{dgm}}}_k}delta _{pi (p)}, end{aligned}$$
where (delta _x) is the Dirac delta-measure at x. In Fig.2 we present the CDF of (Pi _{n}) for the ech complex with various choices of (mathbb {S}) and k. Similar plots are available for the Rips complex in Sect. 3 of the Supplementary Information. We observe that if we fix d (dimension of (mathcal {S})), (mathcal {T}), and k, then the resulting CDF depends on neither the space (mathcal {S}) nor the distribution (mathbb {P}_n). This leads to our first conjecture.
The distribution of (pi)-values in the echcomplex. We take the empirical CDFs of the (pi)-values (log-scale), computed from various iid samples. The legend format is (mathcal {T}/ mathbb {P}/ d / k), where (mathcal {T}) is the complex type, (mathbb {P}) is the probability distribution, d is the dimension of the sampling space, and k is the degree of homology computed. By box,torus, sphere, projective(-plane), and Klein(-bottle) we refer to the uniform distribution on the respective space or its most natural parametrization, while normal and cauchy refer to the corresponding non-uniform distributions. See Methods and Sect. 3 in the Supplementary Information for further details.
Fix (d,mathcal {T},) and (k>0). For any (mathbb {S}in mathcal {I}_d),
$$begin{aligned} lim _{nrightarrow infty }Pi _{n} = Pi ^*_{d,mathcal {T},k}, end{aligned}$$
where (Pi ^*_{d,mathcal {T},k}) is a probability distribution on ([1,infty )).
The precise notion of convergence and the extent of the class (mathcal {I}_d) are to be determined as future work. We conjecture that (mathcal {I}_d) is quite large. In our experiments, the space (mathcal {S}) varied across a wide range of manifolds and other spaces. The distribution (mathbb {P}_n) is continuous and iid, but otherwise fairly generic (possibly even without moments, see the Cauchy example in Fig.2). We name this phenomenon weak universality, since on one hand the limit is independent of (mathbb {S}) (hence, universal), while on the other hand it does depend on (d,mathcal {T},k) and the iid assumption. This is in contrast to the results we discuss next.
The following procedure was discovered partly by chance. While non-intuitive, the results are striking. Given a random persistence diagram ({{textrm{dgm}}}_k), for each (pin {{textrm{dgm}}}_k) apply the transformation
$$begin{aligned} ell (p) := A{{,mathrm{log log },}}(pi (p)) + B, end{aligned}$$
(1)
where
$$begin{aligned} A = {left{ begin{array}{ll}1 &{} mathcal {T}=text {Rips},\ 1/2 &{} mathcal {T}= check{textrm{C}}text {ech},end{array}right. }qquad B = -lambda - A{bar{L}}, end{aligned}$$
(2)
and where ({bar{L}} = frac{1}{|{{textrm{dgm}}}_k|} sum _{pin {{textrm{dgm}}}_k}{{,mathrm{log log },}}(pi (p))) and (lambda) is the Euler-Mascheroni constant (=0.5772156649(ldots)). We refer to the set (left{ ell (p)right} _{pin {{textrm{dgm}}}_k}) as the (ell)-values of the diagram. In Fig.3 we present the empirical CDFs of the (ell)-values, as well as the kernel density estimates for their PDFs, for all the iid samples that were included in Fig.2. The plots for the Rips complex are similar, and can be found in Sect. 3 of the Supplementary Information. We observe that all the different settings ((mathbb {S},mathcal {T},k)) yield exactly the same distribution under the transformation given by (1). We refer to this phenomenon as strong universality.
The distribution of (ell)-values. (top) All the iid samples included in Fig.2 (26 curves). (bottom) A selection non-iid and real-data point clouds. The left column shows the empirical CDF, and the middle column is the kernel density estimate for the PDF, with the LGumbel distribution shown as the dashed line. The right column shows the QQ plots compared this distribution.
While strong universality for iid point-clouds is by itself a very unexpected and useful behavior, a natural question is how generally it applies in other scenarios. In Fig.3 we also include a selection of non-iid samples and real-data (see Experimental results and Methods for details). While the distribution of the (pi)-values for these models is vastly different than the iid case, all of these examples exhibit the same strong universality behavior.
To summarize, our experiments highly indicate that persistent (ell)-values have a universal limit for a wide class of sampling models (mathbb {S}), denoted by (mathcal {U}). For our main conjecture, we consider the empirical measure of all (ell)-values,
$$begin{aligned} mathcal {L}_n = mathcal {L}_n(mathbb {S},mathcal {T},k):= frac{1}{|textrm{dgm}_k|}sum _{pin {{textrm{dgm}}}_k}delta _{ell (p)}. end{aligned}$$
For any (mathbb {S}in mathcal {U}), (mathcal {T}), and (kge 1),
$$begin{aligned} lim _{nrightarrow infty }mathcal {L}_n = mathcal {L}^*, end{aligned}$$
where (mathcal {L}^*) is independent of (mathbb {S}), (mathcal {T}), and k.
Observe that in this Conjecture, the only dependence on the distribution generating the point-cloud isin the value of B (2) (similar to the role the mean and the variance play in the central limit theorem). In Sect. 5 of the Supplementary Information, we examine the value of B for different iid settings. As suggested by Conjecture1, our experiments confirm that the value of B (for the iid case) depends on (d,mathcal {T},k), but is otherwise independent of (mathbb {S}). Revealing the exact relationship between all parameters remains future work.
We note that models with homogeneous spacing between the points, such as perturbed lattice models, or repulsive point processes, do not follow Conjecture2. See Sect. 2.4 in the Supplementary Information.
A natural question is whether the observed limiting distribution (mathcal {L}^*) is a familiar one, and in particular, if it has a simple expression. Surprisingly, it seems that the answer might be yes. We denote the left-skewed Gumbel distribution by (textrm{LGumbel}), whose CDF and PDF are given by
$$begin{aligned} F(x) = 1-e^{-e^x},quad text {and}quad f(x) = e^{x-e^{x}}. end{aligned}$$
(3)
The expected value of this distribution is the Euler-Mascheroni constant ((lambda)) used in(2). In Fig.3, the black dashed lines represent the CDF and PDF of the LGumbel distribution. In addition, the right column presents the QQ-plots of all the different models compared to the LGumbel distribution. These plots provide very strong evidence for the validity of our final conjecture.
(mathcal {L}^* = textrm{LGumbel}).
The Gumbel distribution often emerges as the limit of extreme value distributions (i.e., minima or maxima). We wish to emphasize that the limiting LGumbel distribution in Conjecture 3 describes the entire ensemble of all (ell)-values and so it does not describe any extreme values. Consequently, while there may exist an indirect connection to extreme value theory, it is by no means evident or straightforward. The appearance of the LGumbel distribution in this context, is therefore quite surprising.
The plots in Fig.3 exhibit a remarkable fit with the LGumbel distribution, with one minor exception observed in the deviation of the tails of the QQ-plots. This slight deviation can be attributed to slow convergence rates. Existing theoretical results26 indicate that the largest (pi)-value tends to infinity with n, but its growth rate is logarithmic. Consequently, the largest (ell)-value is of order (log log log (n)), and so particularly the tails will exhibit a slow rate of convergence. Since the point-clouds we are able to process consist of at most millions of points, our ability to accurately capture the tail of the distribution is quite limited. It is important to note that this limitation holds regardless of the validity of Conjecture3. While the limiting distribution has a non-compact support, our experiments cover a restricted range of death/birth values. Therefore, if we draw the QQ-plot against any other distribution with non-compact support, we will observe similar deviations.
We present a large body of experimental evidence collected to support our conjectures. As the results do not seem to be significantly impacted by the choice of n, we leave this detail to the Supplementary Information (Section 3). Complementing the statistical plots presented here and in the Supplementary Information, we also performed a Kolmogorov-Smirnov goodness-of-fit test for our entire collection of point-clouds. The details are provided in Sect. 3.4 of the Supplementary Information. The test did not detect significant difference between the distribution of (ell)-values and the LGumbel distribution in any of the point-clouds, providing further support for the validity of Conjectures1, 2, 3.
We began by considering samples from the uniform distribution on various compact manifolds, with diverse geometry and topology. Next, we tested non-uniform distributions in (mathbb {R}^d), by taking products of well-known distributions on (mathbb {R}). We attempted to test a wide range of settings. The beta distribution has a compact support ([0,1]), while the normal and Cauchy distributions are supported on (mathbb {R}). The standard normal distribution is an obvious choice and Cauchy was chosen as a heavy-tailed distribution without moments. Finally, we considered more complex modelssampling from the configuration space of a closed five-linkage, and stratified spaces (intersecting manifolds of different dimensions). The results for many of the experiments are in Figs.2 and 3 (see Sect. 3 of the Supplementary Information for the complete set of experiments). All of the iid sampling models we tested support Conjectures1, 2, 3.
To better understand the extent of universality, as well as to consider more realistic models, we tested more complex cases. We tested two vastly different models: sampling points from the path of a d-dimensional Brownian motion, and a discrete-time sample of the Lorenz dynamical systema well-studied chaotic system. The results in Fig.3 confirm that these non-iid models exhibit strong universality as well. Surprisingly, the results for the Brownian motion demonstrate the best fit with the LGumbel distribution, among all the settings we tested (see Figs. 10 and 11 in the Supplementary Information). This could be related to the fractal, or self-similarity, behavior of the Brownian motion, but remains a topic for future study.
The most important test for Conjectures2 and 3 is with real world data. We tested three different examples (see Methods for mode details). (1) Natural images: We sampled (3times 3) patches from natural gray-scale images taken from van Hateren and van der Schaaf dataset34. We applied the dimension reduction procedure proposed byLee etal.35, which results in a point-cloud on a 7-dimensional sphere embedded in (mathbb {R}^8). We tested both the 7-dimensional point-cloud, as well as its lower-dimensional projections. (2) Audio recording: We applied the time-delay embedding transformation36 to an arbitrary speech recording to create a d-dimensional point-cloud. (3) Sentence embeddings: We used a pretrained sentence transformer37, to convert the entire text in a book into a 384-dimensional point-cloud. Our experiments using both the ech and Rips complexes, show a remarkable matching to the universal distribution (see Fig.3 for a subset).
Based on Conjectures23, we present a hypothesis testing framework for individual cycles in persistence diagrams. We address finite and infinite cycles separately.
Given a persistence diagram ({{textrm{dgm}}}= left{ p_1,ldots ,p_mright}), our goal is to determine for each point (p_i) whether it is signal or noise. This can be modelled as a multiple hypothesis testing problem with the i-th null-hypothesis, denoted (H_0^{(i)}), is that (p_i) is a noisy cycle. Assuming Conjectures2 and 3, we can formalize the null hypothesis in terms of the (ell)-values (1) as
$$begin{aligned} H_0^{(i)}: ell (p_i) sim textrm{LGumbel}. end{aligned}$$
In other words, cycles that deviate significantly from the LGumbel distribution should be declared as signal. If the observed persistence (ell)-value is x, then its corresponding p-value is computed via
$$begin{aligned} ptext {-value}_i = mathbb {P}left( ell (p_i) ge x;|;H_0^{(i)}right) = e^{-e^x}. end{aligned}$$
(4)
Since we are testing multiple cycles simultaneously, we applied the Bonferroni correction to the p-values, which sufficed for our experiments. The signal part of a diagram (for significance level (alpha)) can thus be recovered, via
$$begin{aligned} {{,{textrm{dgm}}}}_k^{_{textbf{S}}}(alpha ) = left{ pin {{{textrm{dgm}}}}_k: e^{-e^{ell (p)}} < frac{alpha }{|{{{textrm{dgm}}}}_k|}right} . end{aligned}$$
Computing persistent homology for an entire filtration is often intractable. The common practice is to fix a threshold (tau), and compute ({{textrm{dgm}}}_k(tau)) for the partial filtration. This often introduces cycles that are infinitei.e., born prior to (tau), but die after (tau). The question we address here is how to efficiently determine whether such cycles are statistically significant. Let (p=(textrm{b},textrm{d})in {{textrm{dgm}}}_k(tau)) be an infinite cycle, i.e., (textrm{b}le tau) is known and (textrm{d}>tau) is unknown. While we do not know (ell (p)), we observe that (ell (p)> tau /textrm{b}), which gives an upper bound for the p-value,
$$begin{aligned} ptext {-value}_i < ptext {-value}_i(tau ):= e^{-e^{tau /textrm{b}}}. end{aligned}$$
If (ptext {-value}_i(tau )) is below the required significance value (e.g.(alpha /|{{textrm{dgm}}}_k(tau )|)), we can declare p as significant, despite not knowing the true death-time. Otherwise, we can determine the minimal value (tau ^*) required so that (ptext {-value}_i(tau ^*)) is below the significance value. We then compute ({{textrm{dgm}}}_k(tau ^*)), and if the cycle represented by p remains infinite (i.e.(textrm{d} > tau ^*)), we declare it significant. We observe that for measuring significance, we do not need to know the exact value of (textrm{d}), only whether it is smaller or larger than (tau ^*), and we need onlyto compute the filtration up to (tau ^*), rather than the actual death time (textrm{d}). The key point is that the death time (textrm{d}) may be much larger than (tau ^*).
The procedure we just described works well for studying a single infinite cycle. However, it is likely that ({{textrm{dgm}}}_k(tau )) contains multiple infinite cycles. Moreover, increasing the threshold may result in new infinite cycles emerging as well. We therefore propose the iterative procedure described in Algorithm 1. Briefly, at every step the algorithm picks one infinite cycle, and chooses the next threshold (tau) so that we can determine if it is significant or not. The value (pi _{min }(x)) in the Algorithm 1, is the minimum (pi)-value required so that the resulting p-value (4) is smaller than x. Formally,
$$begin{aligned} pi _{min }(x) = ell ^{-1}left( F^{-1} left( 1-xright) right) = ell ^{-1}({{,mathrm{log log },}}(1/x)), end{aligned}$$
where F is the CDF of the LGumbel distribution. In the algorithm, we choose the earliest-born infinite cycle ((min (I))), while we could have chosen the latest-born ((max (I))), or any intermediate value. This choice represents a trade-off between the number of iterations needed and the overestimation of (tau). Choosing the earliest born cycle results in the smallest threshold, but with potentially more iterations, while choosing the last cycle will have fewer iterations with a possible overestimation of (tau).
Algorithm 1 Finding the threshold for infinite cycles
(begin{aligned} & tau leftarrow tau_{0} \ & {mathbf{do}} \ & quad D leftarrow {text{dgm}}_{k} (tau ) \ & quad I leftarrow left{ {{text{b}}:({text{b}},{text{d}}) in D,,{text{d}} = infty ,{text{and}},tau /{text{b}} < pi_{{{text{min}}}} left( {alpha /left| D right|} right)} right} \ & quad tau leftarrow left{ {begin{array}{*{20}l} {min ,(I) cdot pi_{min } left( {alpha /left| D right|} right)} hfill & {I ne emptyset } hfill \ tau hfill & {I = emptyset } hfill \ end{array} } right. \ & {mathbf{while}},left| I right| > 0 \ & {mathbf{return}},tau \ end{aligned})
We present two examples for our hypothesis testing framework. In all our experiments, we set the desired significance level to be (alpha =0.05).
Computing p-values:We begin with a toy example, sampling 1000 points on an 8-shape (a wedge of circles) in (mathbb {R}^2) (see Fig.4), where we vary the width of the neck. We expect one cycle to always be significant (the outer one), but the significance of the second cycle depends on the width of the neck. For each width value we computed the persistence diagram, and checked how many cycles were significant (i.e., (ptext {-value}
Computing p-values for the 8-shape (a wedge of circles). (top-left) Persistence diagrams for two instances of the 8-shape with different neck gaps, with the significance lines shown for (alpha =0.05) (the dotted and dashed lines correspond to (W=0.1) and (W=0.4), respectively). (top-right) The average number of signal cycles detected in 100 repetitions, as a function of the neck gap. (bottom) In green we show significant cycles. On the left ((W=0.1)) we see two significant cycles (p-values = 0.0005,0.012), and on the right ((W=0.4)) only the outer cycle is significant (p-value=0.0015).
Next, we apply this method to a real-world dataset, specifically the van-Hateren natural image database mentioned earlier. The main claim by de-Silva & Carlsson30 is that the space of (3times 3) patches has a 3-circle structure in (mathbb {R}^8), leading to the conclusion that the patches are concentrated around a Klein-bottle38. This was supported by five relatively long 1-cycles in the persistence diagram computed over the patches. To provide quantitative statistical support for this claim, we randomly selected a subset of the patches, processed them30, and computed p-value for all cycles (using the Rips complex). We repeated this experiment for varying numbers of patches, and computed the average number of detected signal cycles over 250 trials. The results are presented in Fig.5. Firstly, we observe that there exists a single 1-cycle that is nearly always detected (the primary circle), while other cycles appear as we increase the sample size. Secondly, we observe that the fifth cycle is intermittently detected. Plotting a 2-dimensional projection of the points, we see that this cycle contains very sparse areas, increasing its birth time and consequently the p-value.
To conclude, using this approach, we are able to correctly detect the signal cycles discovered by de-Silva & Carlsson30, as well as quantitatively declare the significance level for each cycle.
Testing the 3-circle model in the natural image patches. (left) Four different 2-dimensional projections of the 8-dimensional patches point-cloud. The projection on the top-right shows a 1-cycle that is quite thin and contains large gaps. (right) The p-value curve for the patches dataset as a function of the number of samples. We observe that we are not detecting 5-cycles in 100% of the cases. This is most likely due to the thin cycle.
Infinite cycles:To test Algorithm 1, we used a point-cloud on a 2-dimensional torus in (mathbb {R}^3). Computing the p-values of the two signal 1-cycles requires a massive complex and becomes computationally intractable for even a few thousand points. Using Algorithm 1, the filtration size is incrementally increased until the signal was detected. At 50k points, this saves approximately 80% of the edges that would otherwise be needed. This ratio would significantly increase for higher dimensional simplexes. See Sect. 7 in the Supplementary Information for the complete details.
Link:
A universal null-distribution for topological data analysis | Scientific ... - Nature.com
Read More..