SISH
SISH is a histology-image search pipeline that addresses the scalability issues of speed, storage and pixel-wise label scarcity. Although image-level class labels, or annotations of whether a pair or triplet of images is similar or dissimilar, have often been leveraged to explicitly learn an embedding space that captures semantic similarity between images, or to identify key points before retrieval50,51, it is difficult to apply such techniques directly to WSI search because of the high dimensionality of large WSIs and the lack of patch-level annotation. Instead, SISH builds on a set of preprocessed mosaics from WSIs without pixel-wise or ROI-level labels to reduce storage and labelling costs, relying on indices learned via self-supervised learning and pretrained embeddings. SISH scales with a constant search speed, regardless of the size of the database, by taking advantage of the discrete latent codes of a VQ-VAE and using guided-search and ranking algorithms. We present these essential components of SISH in this section and provide a pictorial illustration of the methods in Fig. 2. For clarity, we have summarized all symbols used in the following text in Supplementary Table 18.
VQ-VAE41 is a variant of a variational autoencoder (VAE) that introduces a training objective that allows for discrete latent codes. Let e ∈ ℝ^{K×D} be the latent space (that is, the codebook), where K is the number of discrete codewords and D is the dimension of the codewords. We set K = 128 and D = 256 in our experiments. To decide the codeword of a given input, an encoder q encodes the input x as z_e(x). The final codeword z_q(x) of x and the training objective function are given by
$$\mathbf{z}_q(\mathbf{x})=\mathbf{e}_k,\quad \mathrm{where}\ k=\operatorname{argmin}_j \left\Vert \mathbf{z}_e(\mathbf{x})-\mathbf{e}_j\right\Vert,$$
(1)
$$\log p(\mathbf{x}\mid \mathbf{z}_q(\mathbf{x}))+\left\Vert \mathrm{sg}[\mathbf{z}_e(\mathbf{x})]-\mathbf{e}\right\Vert+\alpha \left\Vert \mathbf{z}_e(\mathbf{x})-\mathrm{sg}[\mathbf{e}]\right\Vert,$$
(2)
where α is a hyperparameter and sg denotes the stop-gradient operation. A stop-gradient operation acts as the identity function during the forward pass while having zero gradient during the backward pass. The first term in the objective function optimizes the reconstruction of the encoder and decoder, the second term is used to update the codebook, and the third term is used to prevent the encoder's output from diverging too far from the latent space. The architecture of our VQ-VAE model is shown in detail in Supplementary Fig. 4. We re-ordered the codebook on the basis of the value of the first principal component and changed the latent code accordingly, as we found that the re-ordered codebook provides a more semantic representation of the original input image (see Supplementary Fig. 5).
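As a concrete illustration of equations (1) and (2), the following is a minimal PyTorch-style sketch of the quantization step and the three-term objective. It treats the latent as a flat (batch, D) tensor and uses an illustrative value of α; it is a simplification, not the exact model of Supplementary Fig. 4, and a full VQ-VAE additionally copies gradients through the quantization step with the straight-through estimator.

import torch
import torch.nn.functional as F

def quantize(z_e, codebook):
    # z_e: (B, D) encoder outputs; codebook: (K, D) matrix of codewords e.
    # Equation (1): assign each encoder output to its nearest codeword.
    distances = torch.cdist(z_e, codebook)         # (B, K) pairwise Euclidean distances
    k = distances.argmin(dim=1)                    # index of the nearest codeword
    z_q = codebook[k]                              # quantized latent z_q(x)
    return z_q, k

def vqvae_loss(x, x_recon, z_e, z_q, alpha=0.25):
    # Equation (2): reconstruction + codebook + commitment terms.
    # detach() plays the role of the stop-gradient operator sg[.];
    # the value of alpha here is illustrative, not taken from the paper.
    recon = F.mse_loss(x_recon, x)                 # stands in for -log p(x | z_q(x))
    codebook_term = F.mse_loss(z_q, z_e.detach())  # || sg[z_e(x)] - e ||
    commitment = F.mse_loss(z_e, z_q.detach())     # || z_e(x) - sg[e] ||
    return recon + codebook_term + alpha * commitment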
We show how each patch i in the mosaic of a WSI can be represented by a tuple (p_i, h_i) composed of a patch index p_i and a patch texture feature h_i. To obtain p_i, we encode and re-map the latent code z_i using the encoder and the re-ordered codebook from the VQ-VAE. The index p_i is determined by the following equations:
$$\mathbf{z}_{i,1}=\mathrm{AVGPOOL}(2,2)(\mathbf{z}_i)$$
(3)
$$\mathbf{z}_{i,2}=\mathrm{AVGPOOL}(2,2)(\mathbf{z}_{i,1})$$
(4)
$$\mathbf{z}_{i,3}=\mathrm{AVGPOOL}(2,2)(\mathbf{z}_{i,2})$$
(5)
$$p_i=\mathrm{SUM}(\mathbf{z}_{i,1})+\mathrm{SUM}(\mathbf{z}_{i,2})\times 10^{6}+\mathrm{SUM}(\mathbf{z}_{i,3})\times 10^{11}$$
(6)
$$\mathrm{SUM}(\mathbf{z}_{i,1})\in [0,130048],\ \mathrm{SUM}(\mathbf{z}_{i,2})\in [0,32512],\ \mathrm{SUM}(\mathbf{z}_{i,3})\in [0,8128]$$
(7)
To convert the information in the latent code from higher to lower resolution, we apply a series of 2 × 2 average pooling operations. We then take the summation to aggregate the latent code at each resolution, as the summation operator has better expressive power than the mean or maximum52. We obtain the final integer index by summing the information aggregated at each resolution, multiplied by 10^0, 10^6 and 10^11, respectively. The intuition behind choosing these powers is to keep the information of the latent code at each resolution (that is, z_i,1, z_i,2 and z_i,3) separate. For example, multiplying SUM(z_i,2) by 10^6 separates the feature in the second layer from SUM(z_i,1), since the maximum of the latter is 130,048. Likewise, multiplying SUM(z_i,3) by 10^11 separates the feature in the last layer from the previous two. We insert each p_i into the vEB tree for fast search. We apply this process to the WSIs in the datasets to build our databases. Note that the time complexity of all operations in the vEB tree is O(log log M). On the basis of the properties of the vEB tree, M can be determined by
$$M=2^{x} > \max(p_i),$$
(8)
where x is the minimum integer that satisfies the inequality. Since the latent-code entries range from 0 to 127, we can determine the maximum summation SUM(z) at each level. Solving the inequality, we find that the minimum M that satisfies it is M = 2^50 = 1,125,899,906,842,624. Because M is a constant that depends only on the index-generation pipeline, our search performance is O(1). One limitation of using vEB is its large space complexity O(M), where M depends on the size of the codebook and the dynamic range of the index used for search; however, M remains fixed and does not scale with the number of data points (WSIs or patches) in the database. To obtain h_i, we use DenseNet121 to extract features from the 1,024 × 1,024 patch at 20× magnification, then follow the algorithm proposed in ref. 36 to binarize them (that is, starting from the first element, if the next value is smaller than the current one, the current value is assigned 0, and 1 otherwise).
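To make the index construction concrete, the sketch below computes p_i from a latent-code grid following equations (3) to (6) and binarizes a feature vector with the rule just described. The 64 × 64 grid size is implied by the ranges in equation (7) (127 × 32 × 32 = 130,048, and so on); rounding the pooled sums to integers and the handling of the final feature element are our assumptions.

import numpy as np

def patch_index(z_i):
    # z_i: (64, 64) array of codeword indices in [0, 127] from the re-ordered codebook.
    def avg_pool_2x2(z):
        h, w = z.shape
        return z.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    z1 = avg_pool_2x2(z_i.astype(float))   # equation (3): 32 x 32
    z2 = avg_pool_2x2(z1)                  # equation (4): 16 x 16
    z3 = avg_pool_2x2(z2)                  # equation (5):  8 x 8
    # Equation (6): the powers of 10 keep the three resolutions from overlapping,
    # since SUM(z1) <= 130,048 < 10^6 and SUM(z2) * 10^6 <= 3.26 x 10^10 < 10^11.
    return (int(round(z1.sum()))
            + int(round(z2.sum())) * 10**6
            + int(round(z3.sum())) * 10**11)

def binarize(features):
    # Binarization of the DenseNet121 feature h_i: scan the vector and assign 0
    # where the next value is smaller than the current one, 1 otherwise.
    f = np.asarray(features)
    return (f[1:] >= f[:-1]).astype(np.uint8)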
In addition to creating the tuple that represents each patch, we also build a hash table H with p_i as key and the metadata of the patch as value. The metadata include the texture feature h_i, the name of the slide associated with the patch, the coordinates on the slide from which the patch is cropped, the file format of the slide and the diagnosis of the slide. Note that different patches can share the same key; in this case, the value is a list that stores the metadata of each patch.

If the size of the search database is significantly large, which is expected to be the case for most practical real-world databases, the search time would exceed that of the pre- and post-processing steps. When running a fixed number of iterations, the K-means clustering algorithm (Lloyd's algorithm) has time complexity O(BKIC), where B is the number of patches in a WSI, K is the number of cluster centroids, I is the number of iterations and C is the dimension of each input data point. For fixed I, K and C, the initial clustering step of mosaic construction is O(B). To obtain the final mosaic, a fixed percentage (for example, 5%) of patches is sampled from each cluster, and hence the resulting mosaic varies from slide to slide with size B′ = 0.05B. During retrieval, the number of total candidates proposed (before ranking) is T(k_succ + k_pred)B′ (see the next section for the definitions of T, k_succ and k_pred). For ranking, the complexity is O(B′). Therefore, given fixed k_succ, k_pred and T, the time complexity of retrieval is O(B′). Note that, since the size of a WSI is capped by the physical size of the glass slide and the tissue specimen, for a fixed patch size we can always pick a reasonable constant B_max that upper-bounds the maximum B in the database and in incoming query slides. Therefore, the entire workflow has a theoretical constant time complexity of O(1). In real-world scenarios, where we expect the size of the database to scale to hundreds of thousands or millions of slides, the time complexity of retrieval would dominate over other steps such as mosaic generation and ranking if we did not use an O(1) search algorithm and it instead scaled with O(n) or O(n log n), where n is the size of the database. We note, however, that although in most practical scenarios with increasingly large databases the size of the WSI database (n) would be larger than the number of patches in the query slide (B), in rare cases where the database is very small, such that the average B is not negligible compared with n, the search operation still has constant O(1) complexity but the speed of the overall pipeline may be limited by mosaic generation, O(B_max). Mosaic generation can also be completed before case review, further improving search speeds.
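For illustration, one entry of the hash table H introduced above might look as follows; the key and field names are ours, chosen to mirror the metadata listed above rather than the exact schema of the released code.

# Hypothetical layout of one entry of the hash table H.
# Several patches can share the same index p_i, so each value is a list of metadata records.
H = {
    400016000065000: [                               # key: patch index p_i (illustrative value)
        {
            "texture_feature": [1, 0, 1, 1, 0],      # binarized DenseNet121 feature h_i (truncated)
            "slide_name": "slide_0001",              # slide the patch was cropped from
            "coordinates": (34816, 12288),           # location of the patch on the slide
            "file_format": ".svs",
            "diagnosis": "lung adenocarcinoma",
        },
    ],
}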
For clarity, we use m_i to denote the patch index in the mosaic of the query slide, to distinguish it from those in the database. Given a query slide I represented as I = {(m_1, h_1), (m_2, h_2), …, (m_k, h_k)} with k patches, where each tuple is composed of the index of the patch m_i and its texture feature h_i, we apply guided-search to each tuple and return the corresponding results r_i. The output takes the form R_I = {r_1, r_2, …, r_k}. Each r_i = {(p_i1, μ_i1), (p_i2, μ_i2), …, (p_in, μ_in)} is a set of tuples consisting of the indices of similar patches (p_i1, p_i2, …, p_in) and their associated metadata (μ_i1, μ_i2, …, μ_in). μ_ij includes all metadata associated with the j-th patch plus the Hamming distance between h_i and h_j. A visual illustration is shown in Fig. 2.
The drawback of using only m_i for the query is that the patch index is sensitive to minor changes in z_i,3. For example, a patch whose SUM(z_i,3) differs from another's by 1 incurs a 10^11 difference in index, placing the two patches far apart in the vEB tree. To address this issue, we create sets of candidate indices m_i,c+ and m_i,c−, along with the original m_i, by adding and subtracting an integer C, T times, from SUM(z_i,3). We call the helper functions forward-search and backward-search to search the neighbour indices in m_i,c+ and m_i,c−, respectively. Both functions include only those neighbouring indices whose Hamming distance from the query h_i is smaller than a threshold θ_h. The details of these algorithms are presented in Algorithms 1 through 3.
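The candidate-index construction can be written compactly as below, a simplified rendering of the first half of Algorithm 1; the default C and T shown are those listed with the algorithm.

def candidate_indices(m_i, C=50 * 10**11, T=10):
    # Build the neighbour-index sets m_i,c+ and m_i,c- around the query index m_i.
    # Because C is a multiple of 10^11, each step perturbs only the SUM(z_i,3)
    # component of the index and leaves the lower-resolution components intact.
    m_c_plus = [m_i] + [m_i + t * C for t in range(1, T + 1)]
    m_c_minus = [m_i - t * C for t in range(1, T + 1)]
    return m_c_plus, m_c_minus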
Algorithm 1 Guided Search Algorithm
H ← hash table  // Hash table with patch index as key and metadata as value
C, T ← 50 × 10^11, 10  // Integer and number of times for addition and subtraction
θ_h ← 128  // Threshold of the Hamming distance between query patch index and the neighbour
k_succ, k_pred ← 375, 375  // Number of times to call vEB.Successor() and vEB.Predecessor()
function GUIDED-SEARCH(m_i, h_i, C, T, θ_h, k_pred, k_succ, H, vEB)
    m_i,c+, m_i,c−, results ← {}, {}, {}
    V ← {}
    m_i,c+.insert(m_i)
    for t ← 1, 2, ..., T do
        m_tmp+, m_tmp− ← m_i + tC, m_i − tC
        m_i,c+.insert(m_tmp+)
        m_i,c−.insert(m_tmp−)
    results+, V ← FORWARD-SEARCH(m_i,c+, k_succ, θ_h, V, H, vEB)
    results− ← BACKWARD-SEARCH(m_i,c−, k_pred, θ_h, V, H, vEB)
    results.insert(results+)
    results.insert(results−)
    results ← SORT-ASCENDING(results, key = Hamming distance)
    return results
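For readers without a van Emde Boas tree implementation at hand, the following sketch mirrors the control flow of Forward-Search (Algorithm 2) using sortedcontainers.SortedList as a stand-in for the successor query; H is the hash table described above and h_i is the query patch's binarized feature. It is an illustration under these assumptions, not the released implementation.

from sortedcontainers import SortedList   # ordered stand-in for the vEB tree

def hamming(a, b):
    # Hamming distance between two equal-length binary vectors.
    return sum(x != y for x, y in zip(a, b))

def forward_search(h_i, m_c_plus, k_succ, theta_h, V, H, index):
    # index: SortedList of all patch indices p_i in the database (the keys of H).
    res_plus = []
    for start in m_c_plus:
        succ_cnt, succ_prev = 0, start
        while succ_cnt < k_succ:
            pos = index.bisect_right(succ_prev)        # successor of succ_prev
            if pos == len(index):
                break                                  # no successor exists
            succ = index[pos]
            if succ in V:
                break                                  # already visited
            if len(H[succ]) == 0:
                succ_prev = succ                       # key emptied, e.g. same patient as the query
                continue
            # keep the stored patch with the smallest Hamming distance under this key
            j = min(range(len(H[succ])),
                    key=lambda t: hamming(h_i, H[succ][t]["texture_feature"]))
            meta = H[succ][j]
            dist_j = hamming(h_i, meta["texture_feature"])
            if dist_j < theta_h:
                V.add(succ)
                res_plus.append((dist_j, meta))
            succ_cnt, succ_prev = succ_cnt + 1, succ
    return res_plus, V

A backward_search counterpart would be symmetric, using index.bisect_left(succ_prev) - 1 to obtain the predecessor instead of the successor.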
Our ranking function Ranking (Algorithm 4) takes the results R_I = {r_1, r_2, …, r_k} from Guided-Search as input. The output is the top-k similar slides given the query slide I. We set k equal to 5 for all experiments, except for anatomic-site retrieval, where k is equal to 10. The intuition behind Ranking is to find the most promising patches in R_I on the basis of their uncertainty. It relies on three helper functions: Weighted-Uncertainty-Cal (Algorithm 5), Clean (Algorithm 6) and Filtered-By-Prediction (Algorithm 7).
Weighted-Uncertainty-Cal (Algorithm 5) takes R_I as input and calculates the uncertainty of each r_i by computing its entropy (that is, over the frequency of occurrence of slide labels). The lower the entropy, the less uncertain the patch, and vice versa. The output is the entropy of r_i, along with records that summarize the diagnosis occurrences and Hamming distances of the elements in r_i. The disadvantage of counting the occurrences naively in the entropy calculation is that the most frequent diagnosis in the anatomic site dominates the result and therefore downplays the importance of the others. We introduce a weighted-occurrence approach to address this issue. The approach counts the diagnosis occurrences by considering the percentage of the diagnosis in the given site and the diagnosis's position in the retrieval results. It calculates the weight of each diagnosis in the anatomic site as the reciprocal of the diagnosis count. We normalize the weights such that their summation is equal to a constant N. A diagnosis's final value in r_i is the normalized weight of the diagnosis multiplied by the inverse of the position at which the diagnosis appears in r_i. Therefore, the same diagnosis can have different weighted occurrences depending on its position in r_i. As such, less frequent diagnoses and those with lower Hamming distance (that is, close to the beginning of the retrieval results) gain more importance in the ranking process. As illustrated in Fig. 2, we summarize R_I with three metadata values, S_lb, S_m and S_l, to facilitate the subsequent processes. Specifically, S_m is a list that stores tuples of the form (index of r_i, entropy, Hamming distance information in μ_ij, length of r_i), S_l is an array that stores only the length of each r_i, and S_lb is a nested dictionary that stores the disease occurrences in r_i.
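The weighted-occurrence entropy can be sketched as follows; this is our simplified reading of Weighted-Uncertainty-Cal (the full Algorithm 5 is not reproduced in this excerpt), so the exact normalization and bookkeeping may differ from the paper's implementation.

import math

def weighted_uncertainty(r_i, diag_count, N=10):
    # r_i: retrieval results for one query patch, ordered by ascending Hamming distance,
    #      as (patch_index, metadata) tuples. diag_count: number of slides per diagnosis
    #      within the anatomic site. N = 10 (fixed site) or N = 30 (anatomic site).
    inv = {d: 1.0 / c for d, c in diag_count.items()}      # reciprocal of diagnosis count
    scale = N / sum(inv.values())
    weight = {d: w * scale for d, w in inv.items()}        # normalized weights sum to N

    # Weighted occurrence: site weight scaled by the inverse of the position at which
    # the diagnosis appears, so early (low Hamming distance) hits count more.
    occurrence = {}
    for pos, (_, meta) in enumerate(r_i, start=1):
        d = meta["diagnosis"]
        occurrence[d] = occurrence.get(d, 0.0) + weight[d] / pos

    # Entropy of the weighted-occurrence distribution: lower means a more certain patch.
    total = sum(occurrence.values())
    entropy = -sum((v / total) * math.log(v / total) for v in occurrence.values() if v > 0)
    return entropy, occurrence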
Algorithm 2 Forward Search Algorithm
function FORWARD-SEARCH(m_i,c+, k_succ, θ_h, V, H, vEB)
    res+ ← {}
    for i+ in m_i,c+ do
        succ_cnt, succ_prev ← 0, i+
        while succ_cnt < k_succ do
            succ ← vEB.Successor(succ_prev)
            if succ ∈ V or succ is empty then
                break
            else if H[succ].len() == 0 then  // The case when the patient is identical to query slide I
                succ_prev ← succ
                continue
            else
                // Find the patch with smallest Hamming distance in the same key
                dist_j, j ← Argmin_j(Hamming-Distance(h_i, H[succ]))
                if dist_j < θ_h then
                    V.insert(succ)
                    meta ← H[succ][j]
                    res+.insert((dist_j, meta))
                succ_cnt, succ_prev ← succ_cnt + 1, succ
    return res+, V

Algorithm 3 Backward Search Algorithm
function BACKWARD-SEARCH(m_i,c−, k_pred, θ_h, V, H, vEB)
    res− ← {}
    for i− in m_i,c− do
        pred_cnt, pred_prev ← 0, i−
        while pred_cnt < k_pred do
            pred ← vEB.Predecessor(pred_prev)
            if pred ∈ V or pred is empty then
                break
            else if H[pred].len() == 0 then  // The case when the patient is identical to query slide I
                pred_prev ← pred
                continue
            else
                // Find the mosaic with smallest Hamming distance in the same key
                dist_j, j ← Argmin_j(Hamming-Distance(h_i, H[pred]))
                if dist_j < θ_h then
                    V.insert(pred)
                    meta ← H[pred][j]
                    res−.insert((dist_j, meta))
                pred_cnt, pred_prev ← pred_cnt + 1, pred
    return res−

Algorithm 4 Results Ranking Algorithm
function RANKING(R_s, D_inv, N, K)
    if R_s is empty then
        return
    D_inv ← NORMALIZE(D_inv, N)  // Normalize the reciprocal of diagnosis count so that the sum is equal to N. N = 10 for the fixed site and N = 30 for the anatomic site experiments, respectively.
    S_lb, S_m ← {}, {}
    S_l ← {}
    for each patch's results r_i in R_s do
        if r_i is not empty then
            Ent, label_cnt, dist ← WEIGHTED-UNCERTAINTY-CAL(r_i, D_inv)