Machine-learning-guided recognition of α and β cells from label-free infrared micrographs of living human islets of Langerhans

From image collection to dataset creation

The whole machine learning workflow is schematically represented in Fig. 1. In brief, it starts with algorithm training, which consists of three main phases: (i) live-islet autofluorescence intensity imaging, exciting at 740 nm and collecting in the 420–460 nm range, which is dominated by NAD(P)H and lipofuscin signals; (ii) NAD(P)H autofluorescence lifetime imaging at the same focal plane in live islets at both low (2.2 mM) and high (16.7 mM) glucose, with subtraction of the intrinsic lipofuscin signal, to produce metabolic data in terms of the balance between free and protein-bound NAD(P)H; (iii) islet fixation and immunostaining with antibodies against glucagon and insulin to identify single α and β cells, followed by extraction of single-cell information from both intensity and lifetime data through spatial matching of the immunofluorescence and live-islet acquisitions (Fig. 1a). We then curate the manual processing of the experimental data to extract a set of numerical features (Fig. 1b) and store them in a feature matrix. Each row of the matrix is associated with an outcome (cell identity, obtained by immunofluorescence) denoted as either α or β, which is recorded in the target vector. The majority of the dataset is used to train a model that captures the relationship between numerical features and cell type (Fig. 1c). The rest of the dataset is used during the testing phase, in which model performance is evaluated by predicting cell type on the data withheld from training. Upon successful completion of the testing phase, the model becomes capable of inferring cell type (i.e. the target vector) from newly collected autofluorescence intensity and lifetime imaging data alone, eliminating the need to perform immunostaining for cell-type recognition.

General workflow. (a) Using a fluorescence microscope equipped with a FLIM (Fluorescence Lifetime Imaging Microscopy) module, data were collected from 15 human Langerhans islets as three types of images: autofluorescence intensity (grayscale cartoon), FLIM images, typically visualized as a phasor plot (blue cloud), and immunofluorescence images (red and green cartoon, where red represents α cells and green represents β cells). (b) Single-cell data were obtained through manual segmentation of the acquired images, which resulted in one image per segmented cell. For each cell, a number of parameters were calculated and included in a dataset then used to train a machine learning algorithm. (c) After a testing phase, the model can be employed to determine cell identity from new images without the need for additional immunostaining.

In more detail, to build the input dataset we performed label-free multi-photon imaging of human islets (Fig. 2a), which provided two distinct types of data: islet autofluorescence intensity (Fig. 2a, top panel) and lifetime (Fig. 2a, center panel) micrographs. The autofluorescence signal was elicited at 740 nm through multiphoton excitation and collected in the 420–460 nm optical window. Each islet was measured twice: first at 2 mM glucose, which maintains a starvation condition, and then after 5–10 min of exposure to 16 mM glucose, which stimulates insulin secretion from β cells. Following multiphoton imaging of live islets, these were fixed and prepared for immunofluorescence (Fig. 2a, bottom panel). This step involves tissue fixation, followed by permeabilization and, ultimately, incubation with anti-glucagon (red signal) and anti-insulin (green signal) antibodies to identify α and β cells, respectively. After image acquisition, manual segmentation (Fig. 2b) was carried out to extract single-cell information: 151 features were extracted (Fig. 2c) and used to construct what is referred to as the feature matrix. Each row of the matrix is associated with an outcome, namely cell identity, denoted as either α or β and recorded in the target vector. In the end, the feature matrix contains data from N = 1932 cells, each described by 151 features, whereas the target vector exclusively contains the immunofluorescence-derived cell identities.
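As a concrete illustration of this step, the sketch below shows how such a feature matrix and target vector could be assembled in Python with pandas. This is not the authors' code: the helper functions `extract_features` and `cell_identity` are hypothetical placeholders for the segmentation-derived feature extraction and the immunofluorescence-based labeling described above.

```python
# Minimal sketch (not the authors' code) of dataset assembly:
# one row per segmented cell, one column per numerical feature.
import pandas as pd

def build_dataset(cells, extract_features, cell_identity):
    """Assemble the feature matrix X (n_cells x 151) and the target vector y.

    `extract_features(cell)` is assumed to return a dict of 151 numerical
    features; `cell_identity(cell)` returns the immunofluorescence label
    ("alpha" or "beta").
    """
    records = [extract_features(cell) for cell in cells]
    labels = [cell_identity(cell) for cell in cells]
    X = pd.DataFrame(records)                # feature matrix
    y = pd.Series(labels, name="cell_type")  # target vector
    return X, y
```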

From image collection to dataset creation. (a) Human Langerhans islets' autofluorescence and lifetime are measured using label-free fluorescence microscopy, yielding an autofluorescence image (top) and a phasor plot (center) of the islet as the result of the live-cell imaging step. In the following step, fixation and permeabilization are performed. Then, islets are incubated with antibodies (green: anti-insulin, red: anti-glucagon), leading to the corresponding immunofluorescence image (bottom) of the islet. (b) The previously acquired autofluorescence images are manually segmented by outlining Regions Of Interest (ROIs), yielding single-cell data. As for the entire islet, each single cell has an associated autofluorescence image, a phasor plot, and cell-identity information obtained from immunofluorescence. (c) Single-cell images are used to extract 151 features per cell, which are organized in a feature matrix. In this matrix, each row represents a single cell and each column corresponds to a specific feature. The target vector contains information about cell identity, which is derived from the immunofluorescence images.

Most of the numerical entries of the feature matrix (the complete list is reported in the Supplementary Material) are derived from either autofluorescence intensity or lifetime data through descriptive statistics parameters including, for instance, minimum and maximum values, trends, the range of most common values, and data dispersion (Fig. 3). Notably, in the optical window used for NAD(P)H detection, human islets also display marked autofluorescence originating from lipofuscin-enriched granules20,21. These granules, byproducts of lysosomal digestion, are primarily composed of lipids and proteins, and their abundance correlates directly with donor age19,22. Since α and β cells are known to possess different amounts of lipofuscin19, we included a parametrization of lipofuscin granules by estimating their area normalized by the cell area. Cell morphology is instead described by three key parameters: cell area, perimeter, and circularity. Circularity quantifies how closely the cell shape resembles a perfect circle, with a value of 1 indicating a perfect circle. As for the autofluorescence lifetime data, the Fourier transformation converts the lifetime decay measured in each pixel of the image into a data point in the phasor plot, characterized by three parameters: the g and s coordinates, which describe the time constant of the autofluorescence decay, and the frequency of observation of each specific set of g, s coordinates. Phasor clusters were quantitatively analyzed by extracting both the cluster barycenter and its standard deviation. In addition, by combining phasor-FLIM data acquired at the two glucose concentrations, further information about cell metabolism could be obtained: the shift in NAD(P)H lifetime upon glucose stimulation can be used as a descriptor of the average metabolic balance between glycolysis and oxidative phosphorylation in α and β cells. Finally, the infrared-imaging-derived features were supplemented with donor-related clinical parameters (Table S1) such as age, body mass index (BMI), and the insulin stimulatory index (SI), the latter being the overall insulin secretion efficiency of the donor-derived islets measured by a standard ELISA assay.
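To make the nature of these descriptors more concrete, the snippet below sketches how a few of them could be computed for a single segmented cell: circularity as 4πA/P², box-plot-style intensity statistics, the barycenter and spread of the phasor cloud, and the lipofuscin area fraction. The exact definitions, thresholds, and feature names used in the paper may differ; this is an illustrative sketch only.

```python
# Illustrative sketch of a few single-cell features; names and exact
# definitions are assumptions, not the authors' implementation.
import numpy as np

def morphology_features(area, perimeter):
    # Circularity = 4*pi*A / P^2 equals 1 for a perfect circle.
    return {"area": area, "perimeter": perimeter,
            "circularity": 4 * np.pi * area / perimeter ** 2}

def intensity_features(pixels):
    # Descriptive statistics of the autofluorescence intensity within the cell ROI.
    q1, q3 = np.percentile(pixels, [25, 75])
    return {"intensity_min": float(pixels.min()),
            "intensity_max": float(pixels.max()),
            "intensity_mean": float(pixels.mean()),
            "intensity_std": float(pixels.std()),
            "intensity_whisker_high": q3 + 1.5 * (q3 - q1)}  # upper box-plot whisker

def phasor_features(g, s):
    # Barycenter and spread of the per-pixel phasor cloud (g, s coordinates).
    return {"g_mean": float(g.mean()), "s_mean": float(s.mean()),
            "g_std": float(g.std()), "s_std": float(s.std())}

def lipofuscin_feature(lipofuscin_mask, cell_mask):
    # Total lipofuscin granule area normalized by the cell area.
    return {"lipofuscin_area_fraction": lipofuscin_mask.sum() / cell_mask.sum()}
```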

Overview of calculated features. In total, 151 features (in italic) have been extracted from phasor, autofluorescence, clinical, and experimental data. These features describe or summarize (in bold): Phasor plot characteristics, Cell metabolism, Cell morphology, Lipofuscin content, Donor-related demographic and clinical data, and Experimental conditions. In addition, various descriptive statistics parameters are used as general-purpose descriptors to summarize the relevant autofluorescence and phasor characteristics.

To facilitate the exploratory data analysis, we employed Principal Component Analysis (PCA)23 as a dimensionality reduction algorithm. To choose a suitable number of components without excessive information loss, we first plotted the explained variance as a function of the number of components (Fig. 4a). The variance explained by each additional component decreases rapidly, so we reduced the dimensionality of the dataset from 151 to 2, making the entire dataset amenable to visualization in a 2D Cartesian plot and enabling us to observe the impact of specific features through color mapping. The PCA outcome is represented as a 1932 × 2 matrix, i.e. one point per single cell.
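A minimal scikit-learn sketch of this step is given below; it assumes the feature matrix `X` is the numerical dataset described above and follows the standard PCA workflow rather than reproducing the authors' exact code.

```python
# Minimal PCA sketch with scikit-learn, assuming X is the (1932 x 151)
# feature matrix; features are standardized before projection.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_projection(X, n_components=2):
    X_scaled = StandardScaler().fit_transform(X)   # features on comparable scales
    pca = PCA(n_components=n_components)
    X_2d = pca.fit_transform(X_scaled)             # (n_cells, 2) matrix for the scatterplot
    return X_2d, pca.explained_variance_ratio_     # per-component explained variance
```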

Explorative data analysis with PCA and K-means clustering. The dataset dimensionality has been reduced from 151 to 2 using PCA to allow graphical representation. (a) Plotting the explained variance with respect to the number of principal components gives an idea of how many components/dimensions are needed to retain enough information after dimensionality reduction. The explained variance drops rapidly, meaning that two components are enough to visualize the data without significant information loss. (b) The two-dimensional PCA scatterplot (bottom right) shows some separation on the basis of cell type, although the clustering is only mild, as confirmed by the kernel density estimation plots of the first principal component (top) and the second principal component (bottom left). This suggests that classification is possible, but requires complex algorithms, e.g. a supervised approach or neural networks. (c) Using the experimental glucose concentration as colormap, it becomes evident that glucose concentration does not significantly affect cell classification. This implies that glucose concentration has low classification power, and that the classification model should be able to classify cells independently of this experimental condition. (d) The elbow method was used to choose a suitable number of clusters by computing, for each iteration, the WCSS (Within-Cluster Sum of Squares, i.e. the sum of squared distances of all points from the centroid of the cluster they belong to). The elbow (red dot) indicates the chosen number of clusters, which is 10. (e) The Gini impurity index has been computed for all clusters to assess within-cluster heterogeneity. The ideal case would be having only one class per cluster, which would result in Gini = 0. However, the average Gini among all clusters is 0.37.

For instance, when data are color-coded according to cell type, α and β cells show only mild segregation (Fig. 4b, bottom right), as confirmed by the kernel density estimation (KDE) plots of both the first principal component (Fig. 4b, top) and the second one (Fig. 4b, bottom left), suggesting that classification might be achievable, but only with sophisticated supervised algorithms. If cells are instead color-coded according to the glucose concentration used in the experiment (Fig. 4c), the coloring does not track the separation between α and β cells. This implies that glucose concentration may not possess strong classification power, and thus that the algorithm should be able to classify cells independently of the experimental glucose concentration used. To support more quantitatively the need for a supervised learning approach, we conducted a clustering analysis using the widely employed k-means algorithm. First, we selected a proper number of clusters using the elbow method: k-means is run iteratively with a progressively increasing number of clusters and, for each iteration, the WCSS (Within-Cluster Sum of Squares) is calculated, which quantifies how tightly data points are bound to their cluster centroid. The optimal number of clusters should ideally match the number of classes of the classification problem (i.e. 2), but this would perform poorly here, as demonstrated by the elbow-test results (Fig. 4d). The best WCSS is reached for the highest number of clusters, but this in turn is a sign of overfitting: the number of clusters chosen was therefore 10 (Fig. 4d, red dot). For the chosen number of clusters, we assessed the performance of k-means by quantifying data heterogeneity within each cluster using the Gini impurity index (Fig. 4e), exploiting the labels obtained by immunofluorescence. The ideal scenario would be Gini = 0, indicating that a cluster contains only one class; conversely, higher values indicate increasingly mixed clusters (for a two-class problem the worst case is Gini = 0.5). The average Gini coefficient across all clusters is 0.37, which supports our hypothesis that a supervised approach is needed. To give the reader a more synthetic view of the results, we calculated the ROC_AUC (i.e. the area under the ROC curve) of a two-cluster k-means on the PCA-reduced data, obtaining 0.60, thus reinforcing our conclusions: the explorative data analysis using PCA showed only mild clustering of α and β cells, prompting us to use supervised classification algorithms.
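The clustering analysis can be sketched as follows with scikit-learn and NumPy: the elbow curve is obtained from the k-means inertia (WCSS) at increasing numbers of clusters, and the Gini impurity of each cluster is computed from the immunofluorescence labels. Parameter values such as the maximum number of clusters are illustrative assumptions, not the paper's settings.

```python
# Sketch of the elbow method and per-cluster Gini impurity; assumes X_2d is
# the PCA-reduced data and y the immunofluorescence-derived label vector.
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X_2d, k_max=15, random_state=0):
    # KMeans.inertia_ is the Within-Cluster Sum of Squares (WCSS) for a given k.
    return [KMeans(n_clusters=k, n_init=10, random_state=random_state)
            .fit(X_2d).inertia_ for k in range(1, k_max + 1)]

def mean_gini(cluster_ids, y):
    # Gini impurity per cluster: 0 = pure cluster, higher = more mixed classes.
    ginis = []
    for c in np.unique(cluster_ids):
        _, counts = np.unique(np.asarray(y)[cluster_ids == c], return_counts=True)
        p = counts / counts.sum()
        ginis.append(1.0 - np.sum(p ** 2))
    return float(np.mean(ginis))
```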

Before training the model, we cleaned the dataset by manually reviewing cells and discarding those for which cell identity could not be confidently determined, to prevent the introduction of noise into the training phase (Fig. 5a). The following step involved data-preprocessing operations to favor model performance and stability: these included numerical encoding of categorical features, feature scaling, and handling of missing values and outliers. A critical point in data preprocessing was addressing dataset imbalance, i.e. the unequal number of α and β cells in the training set. Simply discarding cells from the most abundant class (i.e. β cells) could lead to a biased model, owing to the high biological heterogeneity of Langerhans islets (Table S2)24, while several algorithms are built on the assumption of balanced classes as inputs. To address this, we employed the Synthetic Minority Oversampling Technique (SMOTE)25. This algorithm leverages existing data to generate synthetic data entries, rebalancing the ratio between the two classes from 2:1 to 1:1 and thus improving model training. At this point, the dataset was divided into training and test sets (Fig. S1) to prevent overestimation of model performance during testing. Model performance and stability were further enhanced by implementing both cross-validation and hyperparameter-tuning procedures: repeated stratified fivefold cross-validation (with 3 repetitions) was applied, and Grid Search was chosen for hyperparameter tuning. The area under the ROC curve (ROC_AUC) was selected as the optimization metric, given its appropriateness for machine-learning problems based on imbalanced classes, as in this case. Four different algorithms were trained and tested (Fig. 5b) using 970 cells for training (a mix of real data and synthetic data generated by SMOTE) and 216 cells (real data) for testing. Training and testing performances were then compared based on various metrics, including precision, recall, and F1 score, in addition to the area under the ROC curve. Regarding the two cell types under study (Fig. 5c), β cells generally exhibited scores exceeding 0.80, while α cells showed slightly lower overall performances, ranging from 0.60 to 0.70. This discrepancy may be linked to the degree of cell-type-specific information embedded in the extracted biological features. For instance, it was recently demonstrated that β cells have a significantly higher lipofuscin content than α cells (roughly twofold)22 and display a distinct metabolic shift toward oxidative phosphorylation upon glucose stimulation26, which is not as clearly observed in α cells19. In this scenario, the extracted features convey enough information to classify β cells with confidence, whereas decisions on α cells require more effort. All the tested algorithms showed high performance, but unsatisfactory precision or recall in α-cell classification, with the exception of XGBoost. XGBoost displayed high performance and classification stability (i.e. all the computed scores were quite similar within the same class), and was thus selected for a further optimization step.
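A hedged sketch of this pipeline, using imbalanced-learn, scikit-learn, and XGBoost, is shown below. Following common practice, SMOTE is applied here to the training split only, the labels are assumed to be 0/1-encoded, and the hyperparameter grid is purely illustrative; none of these details are taken from the paper.

```python
# Sketch of the preprocessing/model-selection pipeline: stratified split,
# SMOTE rebalancing of the training set, repeated stratified 5-fold CV and
# Grid Search with ROC_AUC as the optimization metric.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)
from xgboost import XGBClassifier

def train_classifier(X, y, random_state=0):
    # y is assumed to be 0/1-encoded (e.g. alpha = 0, beta = 1).
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state)
    X_tr, y_tr = SMOTE(random_state=random_state).fit_resample(X_tr, y_tr)

    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=random_state)
    grid = {"n_estimators": [100, 300],          # illustrative ranges only
            "max_depth": [3, 5],
            "learning_rate": [0.05, 0.1]}
    search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                          grid, scoring="roc_auc", cv=cv, n_jobs=-1)
    search.fit(X_tr, y_tr)
    return search.best_estimator_, (X_te, y_te)  # held-out real cells for testing
```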

Supervised-learning results from four different models. (a) After creating the feature matrix and target vector, data undergo several preprocessing steps to enhance the performance and stability of classification. The process starts with manual cleaning, in which only cells with clearly defined identities are retained, excluding over a thousand cells and resulting in a cleaned dataset of 861 cells. Preprocessing includes encoding categorical features, handling missing values, handling outliers, and scaling the data. The dataset is then rebalanced using SMOTE (Synthetic Minority Oversampling Technique) and split into training and test sets. The training set, after SMOTE, comprises 970 cells and 151 features. Before training, cross-validation and hyperparameter tuning are performed to obtain a stable and high score. The model is then tested on the test set, which can be considered new, unseen data. (b) Four different algorithms are tested and compared: multivariate logistic regression, boosted decision trees (XGBoost), Support Vector Machine for classification, and K-Nearest Neighbors for binary classification. Each algorithm is optimized using the most common hyperparameter ranges and Grid Search as the optimization algorithm. (c) Evaluation of precision, recall, F1 score, and the area under the ROC curve reveals that XGBoost is the most promising algorithm in terms of classification performance and stability. XGBoost is further optimized with Optuna, allowing the selection of a wider hyperparameter range to improve its performance.

The optimization of XGBoost was performed using Optuna27, which, contrary to Grid Search, does not evaluate all possible hyperparameter combinations but efficiently explores the hyperparameter space through sampling and pruning algorithms. For feature selection, we leveraged the embedded method of the XGBoost algorithm, which provides an importance score for each feature, ranging from 0 to 1, based on its significance within the classification task. After an initial XGBoost training using all features, these were sorted from most to least relevant, and new training phases were initiated with restricted numbers of features defined by different cutoff thresholds (Fig. 6a). This process was aimed at enhancing model performance and, potentially, reducing computational cost. A detailed view of all computed scores can be found in Table S3. The model with the highest performance achieved a ROC_AUC of 0.86 using the top 116 features out of 151, indicating that the majority of the features are essential for optimal classification (Fig. 6b). This is likely due to the high biological heterogeneity of Langerhans islet cells, both within and across donors. As mentioned earlier, Rouiller and co-workers showed that α and β cells disaggregated from rat islets can be separated using fluorescence-activated cell sorting (FACS). This separation relied on their intrinsic autofluorescence (mostly due to flavoproteins excited at 488 nm) and the characteristic size of the cells18. This observation prompts us to consider the value of delving deeper into the analysis of intrinsic signals (e.g. by building a more complex algorithm, such as deep learning, at the expense of interpretability) or of extracting more information-rich features, in order to achieve similar or higher model performances based on standard imaging. In any case, a supervised classification algorithm is needed to avoid underperforming in α- or β-cell classification, as evidenced by the k-means analysis in the exploratory data analysis. To make a direct comparison with XGBoost, we applied the same pre-processing to the dataset as in the algorithm training phase and then applied the two-cluster k-means, obtaining ROC_AUC = 0.72, much lower than XGBoost (Fig. 6b). Coming back to model interpretability, XGBoost has an embedded method that allows the most important features, i.e. those explaining most of the classification power, to be extracted and identified. By plotting the nine most important features (Fig. 6c), we can observe that 6 out of 9 are related to static autofluorescence, and the first three alone explain more than 60% of the classification power, suggesting that most of the classification-relevant information is encoded in the autofluorescence intensity. Indeed, by color-mapping the PCA plot (Fig. S2) for the most important feature (i.e. intensity_all_whisker_high), it can be seen that it follows the cell-type distribution shown in the KDE plots. This observation is also corroborated by previous reports on the higher lipofuscin content of β cells22 and their increased fluorescence intensity due to oxidative metabolism28,29,30 as compared to α cells. To ensure model stability, we conducted additional assessments. First, we increased the number of folds from 5 to 10, implementing tenfold repeated stratified cross-validation. All training and Optuna-optimization steps were repeated and the same evaluation scores calculated (Table S4a), showing ROC_AUC = 0.86, comparable to the fivefold cross-validation results (Fig. 6b), as were the other metrics.
Additionally, we performed the Salzberg test31, a method that involves shuffling the labels in the target vector of the training dataset, allowing the algorithm to learn from noise. This test showed a ROC_AUC=0.53, which is a 33% decrease for both training and testing (Table S4b), confirming that the model optimized during the standard training procedure was not influenced by overfitting. Furthermore, we attempted to classify data that had been excluded from training during the dataset cleaning procedure. The resulting ROC_AUC was 0.64, and all computed metrics displayed lower performance (Table S4c), thus validating the effectiveness of the cleaning procedure.
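The snippet below sketches how the Optuna search and the importance-based feature selection described above might look; the search space, the number of trials, and the helper names are assumptions rather than the paper's actual settings.

```python
# Sketch of Optuna-based hyperparameter optimization and selection of the
# top-k features by XGBoost's embedded importance scores.
import numpy as np
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def optimize_xgboost(X_tr, y_tr, cv, n_trials=100):
    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "max_depth": trial.suggest_int("max_depth", 2, 10),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        model = XGBClassifier(**params, eval_metric="logloss")
        # Mean cross-validated ROC_AUC is the quantity Optuna maximizes.
        return cross_val_score(model, X_tr, y_tr, scoring="roc_auc", cv=cv).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params

def top_k_features(fitted_model, feature_names, k=116):
    # Rank features by the embedded importance score and keep the k most relevant.
    order = np.argsort(fitted_model.feature_importances_)[::-1]
    return [feature_names[i] for i in order[:k]]
```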
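A minimal sketch of the label-shuffling control is given below, assuming `model_factory` is a hypothetical callable returning a fresh, already-configured classifier; a ROC_AUC near chance level on the shuffled labels is the expected outcome when the real model is not merely memorizing noise.

```python
# Sketch of the Salzberg-style control: permute the training labels so the
# classifier can only learn noise, then evaluate it on the untouched test set.
import numpy as np
from sklearn.metrics import roc_auc_score

def salzberg_test(model_factory, X_tr, y_tr, X_te, y_te, random_state=0):
    rng = np.random.default_rng(random_state)
    y_shuffled = rng.permutation(np.asarray(y_tr))   # break the feature-label link
    model = model_factory().fit(X_tr, y_shuffled)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```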

XGBoost Optimization with Optuna, Feature Importance, and Model Stability Assessment. XGBoost performance has been further improved via feature selection and a wider hyperparameter search using Optuna. (a) We selected the optimal number of features by training XGBoost with Optuna on subsets of the most important features, finding that almost all features are needed for optimal performance. (b) The best model was obtained with 116 features; it shows ROC_AUC = 0.86 and a precision comparable with that reported by other researchers for FACS on dissociated cells. (c) Plotting the 9 most important features shows that more than 50% of the classification power comes almost entirely from the autofluorescence images.
