Identifying microRNAs associated with tumor immunotherapy response using an interpretable machine learning model … – Nature.com

ICB response prediction using miRNA expression profiles

We compiled predicted ICB responses from the TIDE model, together with miRNA expression profiles, for 7721 samples across 19 tumor types in The Cancer Genome Atlas (TCGA) dataset (Table 1). To predict immunotherapy response from miRNA expression profiles, we first developed a random forest classifier to determine CTL levels. The optimal parameters for the random forest classifier were determined through a grid search with tenfold cross-validation (Table 2). Using these parameters, we trained the random forest classifier on the training data and assessed its predictive performance on the independent test data. The classifier predicted the CTL levels well, with an AUC of 0.9400 (Fig. 2A). Evaluated with the F1 score and balanced accuracy, the classifier achieved an F1 score of 0.9849 and a balanced accuracy of 0.7182.
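As a rough illustration of the tuning and evaluation procedure described above, the following sketch runs a grid search with tenfold cross-validation in scikit-learn and then scores the refit model once on a held-out split. The synthetic data, the parameter grid, the 80/20 split, and the use of ROC AUC as the tuning metric are assumptions made for the example; the grid actually searched is the one reported in Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for a samples-by-miRNA expression matrix (NOT TCGA data).
# The classes are deliberately imbalanced (roughly 20% / 80%).
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5,
    weights=[0.2, 0.8], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid, chosen only to keep the example fast.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Evaluate the refit best model once on the held-out test split.
best = search.best_estimator_
y_pred = best.predict(X_test)
y_prob = best.predict_proba(X_test)[:, 1]
metrics = {
    "auc": roc_auc_score(y_test, y_prob),
    "f1": f1_score(y_test, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
}
print(search.best_params_, metrics)
```

Reporting balanced accuracy next to F1 is useful when one class dominates, because F1 on the majority class can stay high even when the minority class is often misclassified.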

Figure 2. Predicted results for each model learned using miRNA expression profiles. (A) ROC curve and AUC of the random forest classifier that predicts the CTL level. The class True signifies the high group and False signifies the low group. (B) Scatterplot of the random forest regression model for predicting the dysfunction score. The red line indicates the regression line. (C) Scatterplot of the random forest regression model for predicting the exclusion score. The red line indicates the regression line. (D) Scatterplot of the stepwise prediction model predicting ICB response based on the TIDE score. The red line indicates the regression line.

Next, we predicted the dysfunction and exclusion scores using random forest regression. A grid search with tenfold cross-validation was performed to determine the optimal parameters for random forest regression (Table 2). With the optimal parameters, two random forest regression models, one for the dysfunction score and one for the exclusion score, were trained independently on the training data. On the independent test data, the MSE was 0.0361 for both the dysfunction and the exclusion regression models. The Pearson correlation coefficient (PCC) between the observed and predicted values was also calculated. The PCC was 0.8158 for the dysfunction model and 0.8704 for the exclusion model, indicating a strong positive correlation between the predicted and actual values in both models (Fig. 2B,C).
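For reference, the two regression metrics reported here can be computed as in the minimal sketch below. The function names and the toy values are illustrative only and are not taken from the study.

```python
import math

def mse(observed, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def pearson(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

# Toy values, not study data.
observed = [0.10, 0.40, 0.35, 0.80]
predicted = [0.20, 0.30, 0.45, 0.70]
print(mse(observed, predicted), pearson(observed, predicted))
```

The two metrics answer different questions: MSE measures the absolute size of the errors, whereas the PCC measures how closely predictions and observations follow a common linear trend, so a model can have a small MSE and still show only a modest PCC.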

Finally, we predicted the ICB responses based on the TIDE score with a two-step machine learning model that combines the random forest classifier for CTL prediction with the random forest regression models for the dysfunction and exclusion scores. The MSE of the combined stepwise model was 0.0360. Furthermore, the observed and predicted values showed a strong positive correlation, with a PCC of 0.9270 (Fig. 2D).
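The stepwise combination can be sketched as follows. The sketch assumes the TIDE convention in which the final score is the dysfunction score for CTL-high samples and the exclusion score for CTL-low samples; the text does not spell out this rule, so it should be read as an assumption. The function name and the toy values are hypothetical.

```python
def stepwise_tide_score(ctl_is_high, dysfunction_score, exclusion_score):
    """Pick the score that matches the predicted CTL group.

    Assumption: CTL-high samples use the dysfunction score and
    CTL-low samples use the exclusion score.
    """
    return dysfunction_score if ctl_is_high else exclusion_score

# Toy per-sample outputs of the three upstream models (not study data).
ctl_pred = [True, False, True]
dys_pred = [0.42, -0.10, 0.05]
exc_pred = [-0.30, 0.61, 0.12]

tide_pred = [
    stepwise_tide_score(c, d, e)
    for c, d, e in zip(ctl_pred, dys_pred, exc_pred)
]
print(tide_pred)  # [0.42, 0.61, 0.05]
```

Under this assumption, an error in the first step (the CTL classifier) routes a sample to the wrong regression model, which is why the classifier's performance matters for the final score.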

Thereafter, we used SHAP, an interpretable machine learning approach, to analyze the results of our machine learning models. Using SHAP analysis, we identified informative miRNAs that contributed to the prediction of target values. Figure 3 shows the top 20 miRNAs ranked according to their feature importance scores in each model.

Figure 3. Shapley value plots showing feature importance. (A) SHAP feature importance for the random forest classifier to predict CTL level, (B) summary plot for the random forest classifier when the CTL prediction model predicts the CTL level is high, (C) summary plot for the random forest classifier when the CTL prediction model predicts the CTL level is low, (D) SHAP feature importance for random forest regression to predict dysfunction score, (E) summary plot for random forest regression to predict dysfunction score, (F) SHAP feature importance for random forest regression to predict exclusion score, and (G) summary plot for random forest regression to predict exclusion score. (A, D, F) are plots that arrange features based on the average of the absolute Shapley values, which serve as indicators of feature importance. (B, C, E, G) are summary plots that depict feature importance and feature effects simultaneously. Each point signifies the Shapley value of one feature for one instance. The x-axis represents the Shapley value, and the y-axis represents each feature. The color of each point indicates whether the feature value (i.e., the miRNA expression value) is high or low.

For CTL-level prediction based on the random forest classifier, hsa-miR-155 was the most informative feature, with the highest Shapley value. For both high and low CTL predictions, the expression of hsa-miR-155 was positively associated with the predicted CTL level (Fig. 3B,C). Notably, miR-155 is an essential factor orchestrating the CD8+ T cell response in cancer, and its overexpression has been associated with an enhanced anti-tumor response [22,23]. hsa-miR-150, which had the second-highest impact on model predictions, exhibited a similar trend. miR-150 also plays a crucial role in the differentiation and functional regulation of CD8+ T cells [24]; its absence leads to a decline in the killing ability of CD8+ T cells [24]. In addition, hsa-miR-4772, hsa-miR-21, hsa-miR-142, and hsa-miR-10a were identified with notably high Shapley values.

In the random forest regression model used to predict the dysfunction score, the miRNA with the highest Shapley value was hsa-miR-10b. The Shapley value of hsa-miR-10b was negative when its expression was low and positive when its expression was high (Fig. 3E), indicating a positive correlation between hsa-miR-10b expression and the dysfunction prediction. In contrast, hsa-miR-183 was negatively correlated with the dysfunction prediction. Both miR-150 and miR-155 were positively correlated with the dysfunction prediction and played an important role in dysfunction mechanisms, as well as in CTL-level predictions. Furthermore, miR-151a and miR-210 exhibited negative correlations, similar to miR-183.

In the random forest regression model predicting the exclusion score, hsa-miR-10b also showed the largest Shapley value (Fig. 3G); however, in contrast to the dysfunction prediction model, hsa-miR-10b expression was negatively correlated with the exclusion prediction. This observation illustrates how the exclusion prediction, whose mechanism is opposite to that of dysfunction, is negatively correlated with the dysfunction prediction. hsa-miR-150 and hsa-miR-155 likewise behaved in the opposite way to the dysfunction results. Additionally, hsa-miR-10a, which was also identified in the CTL-level prediction, was positively correlated with the exclusion prediction and played an important role in the model prediction. Furthermore, the expression levels of miR-194-1 and miR-194-2 were negatively correlated with the exclusion prediction.

Next, we verified whether ICB response could be predicted using a small number of informative miRNAs. We selected miRNAs with an average absolute Shapley value of 0.01 or higher (SHAP 0.01). Using this criterion, three miRNAs were identified in the CTL model, five in the dysfunction prediction model, and 12 in the exclusion prediction model (Fig. 3A,D,F). Because only a limited number of features were used to construct the models, we employed simple algorithms to predict immunotherapy response.
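The selection criterion can be illustrated with a short sketch: average the absolute Shapley values per feature across samples, then keep the features at or above the threshold. The matrix layout (rows are samples, columns are features), the function names, and the placeholder miRNA names are assumptions made for the example, not the authors' code.

```python
def mean_abs_shap(shap_matrix):
    """Average |Shapley value| per feature over all samples (rows)."""
    n_samples = len(shap_matrix)
    n_features = len(shap_matrix[0])
    return [
        sum(abs(row[j]) for row in shap_matrix) / n_samples
        for j in range(n_features)
    ]

def select_informative(feature_names, shap_matrix, threshold=0.01):
    """Return (name, importance) pairs at or above the threshold, ranked."""
    importance = mean_abs_shap(shap_matrix)
    kept = [(n, v) for n, v in zip(feature_names, importance) if v >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy Shapley values: 3 samples x 3 features. Names are placeholders.
names = ["mir-A", "mir-B", "mir-C"]
shap_matrix = [
    [0.050, -0.004, 0.020],
    [-0.030, 0.002, 0.015],
    [0.040, -0.003, -0.025],
]
selected = select_informative(names, shap_matrix, threshold=0.01)
print([name for name, _ in selected])  # ['mir-A', 'mir-C']
```

Taking the absolute value before averaging matters: a feature whose Shapley values are large but of mixed sign would otherwise average out to near zero and be discarded.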

To predict the CTL level, we applied logistic regression [25] and determined the optimal parameters by conducting a grid search with tenfold cross-validation (Table 2). The model using the three informative miRNAs achieved an F1 score of 0.9805, a balanced accuracy of 0.7249, and an AUC value of 0.9300 (Fig. S1A). This analysis confirmed that a small subset of highly informative miRNAs displayed a similar performance in predicting CTL levels, even when a logistic regression model was utilized.

Subsequently, dysfunction and exclusion scores were predicted from the small number of informative miRNAs using multiple linear regression. With the top miRNAs (SHAP 0.01), the MSE was 0.0754 for the dysfunction model and 0.0840 for the exclusion prediction model. The PCCs between the predicted and actual values were 0.5707 and 0.6638 for the dysfunction and exclusion prediction models, respectively (Fig. S1B,C). From these results, we confirmed that the performance was slightly degraded with a reduced number of features; however, the models still demonstrated comparable performance with only a small number of selected miRNAs.
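A minimal sketch of such a reduced model is shown below: ordinary least squares on five selected features, scored by MSE and PCC on held-out samples. The data are simulated, and the coefficients, noise level, and 160/40 split are assumptions chosen only to make the example run; they do not reproduce the study's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated expression of 5 selected features for 200 samples (toy data).
X = rng.normal(size=(200, 5))
true_coef = np.array([0.3, -0.2, 0.1, 0.0, 0.05])  # hypothetical effects
y = X @ true_coef + 0.1 + rng.normal(scale=0.05, size=200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Prepend a column of ones so the first coefficient is the intercept.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
A_test = np.column_stack([np.ones(len(X_test)), X_test])

# Ordinary least squares fit on the training samples.
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Score on the held-out samples with the same metrics used in the text.
y_hat = A_test @ coef
test_mse = float(np.mean((y_test - y_hat) ** 2))
test_pcc = float(np.corrcoef(y_test, y_hat)[0, 1])
print(test_mse, test_pcc)
```

Because a linear model has one coefficient per feature, each selected miRNA's contribution can be read directly from `coef`, which is part of the appeal of pairing a small feature set with a simple model.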

Finally, to predict the ICB responses based on the TIDE scores, we applied a stepwise machine learning model combining the logistic regression classifier for the CTL level with the linear regression models for the dysfunction and exclusion scores. The MSE of the model using the informative miRNAs (SHAP 0.01) was 0.0690, and the predicted and observed values were strongly positively correlated, with a PCC of 0.8457 (Fig. S1D).

Similarly, we applied a stricter criterion for identifying informative miRNAs and verified whether even fewer miRNAs could still predict immunotherapy response accurately. We selected miRNAs with an average absolute Shapley value of 0.02 or higher (SHAP 0.02); two miRNAs were identified in the CTL model, four in the dysfunction prediction model, and five in the exclusion prediction model (Fig. 3A,D,F).

For CTL level prediction using logistic regression, the model obtained an F1 score of 0.9800, a balanced accuracy of 0.7459, and an AUC value of 0.91 (Fig. S2A). In addition, the models showed good performance for dysfunction and exclusion score prediction using linear regression. The MSE for dysfunction prediction was 0.0810 and that for exclusion prediction was 0.0984. The PCCs were 0.5220 and 0.5900 for the dysfunction score and exclusion score prediction models, respectively (Fig. S2B,C). Furthermore, for the ICB response prediction based on the TIDE scores using the two-step machine learning model combining logistic regression and linear regression, the MSE was 0.0753 and the PCC was 0.8595 (Fig. S2D). Although the performance was slightly lower than that of the model using all miRNAs for predicting the ICB response, these results suggest that informative miRNAs based on Shapley values still exhibit strong predictive capability, even with a limited number of miRNAs and relatively simple classification and regression models.

To examine the biological roles of the informative miRNAs, we predicted the target genes of the informative miRNAs selected by Shapley values using miRDB and TargetScan. A list of the genes targeted by the top miRNAs from each model is shown in Tables S1 and S2. We investigated the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways enriched in the target genes. Tables 3-5 and Tables S3-S5 show the results of enrichment analyses using the informative miRNAs (SHAP 0.01) of each model. The top 20 pathways are listed in Tables 3-5 in ascending order of P value, and all KEGG pathways satisfying statistical significance (adjusted P value < 0.05) are shown in Tables S3-S5.
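To make the enrichment step concrete, the sketch below scores each pathway with a one-sided hypergeometric test and adjusts the P values with the Benjamini-Hochberg procedure. The text does not state which test or correction was used, so both are assumptions, as are the function names and the toy gene counts.

```python
from math import comb

def hypergeom_pvalue(n_background, n_pathway, n_targets, n_overlap):
    """P(overlap >= n_overlap) when n_targets genes are drawn at random."""
    total = comb(n_background, n_targets)
    upper = min(n_pathway, n_targets)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy counts per pathway: (genes in pathway, overlap with the target list).
pathways = {"pathway_1": (50, 15), "pathway_2": (80, 9), "pathway_3": (40, 4)}
n_background, n_targets = 1000, 100

raw = [
    hypergeom_pvalue(n_background, size, n_targets, hit)
    for size, hit in pathways.values()
]
adj = benjamini_hochberg(raw)
significant = [name for name, q in zip(pathways, adj) if q < 0.05]
print(significant)
```

In this toy setup, only the pathway whose overlap is far above what random sampling would give passes the adjusted threshold; pathways whose overlap is close to the expected count do not.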

The first-ranked KEGG pathway in the CTL-level prediction model was the TNF signaling pathway (Table 3), followed by Hepatitis B and the IL-17 signaling pathway. Hepatitis B is a significant contributor to hepatocellular carcinoma (HCC) [26]. Additionally, immune-related pathways such as the Fc epsilon RI signaling pathway and the T cell and B cell receptor signaling pathways ranked near the top. Enrichment analysis also revealed several other cancer-related terms, including Pathways in cancer, PI3K-Akt signaling pathway, Prostate cancer, Renal cell carcinoma, and Pancreatic cancer (Table S3). These results suggest a significant role for these miRNAs and their target genes in cancer and immunotherapy.

Tables 4 and S4 present the informative pathways identified using the dysfunction score prediction model. One of the most significantly enriched pathways was Melanogenesis, which produces mutagenic intermediates that induce immunosuppression. The next-ranked terms were the Wnt signaling pathway and the ErbB signaling pathway. Moreover, our analysis identified various cancer-related terms, including Hepatocellular carcinoma, Prostate cancer, Breast cancer, and Gastric cancer, as well as Pathways in cancer (Table S4).

Tables 5 and S5 present the pathways identified using the exclusion score prediction model. The first pathway is Proteoglycans in cancer; proteoglycans play a significant role in regulating cytokine and chemokine expression on the cell surface. Moreover, various cancer-related pathways and terms were identified, such as the MAPK, PI3K-Akt, and Rap1 signaling pathways, Pathways in cancer, Prostate cancer, Renal cell carcinoma, Lung cancer, and Breast cancer, along with immune-related pathways such as the T cell and B cell receptor signaling pathways and Helper T cell differentiation. Furthermore, the presence of the PD-L1 expression and PD-1 checkpoint pathway in cancer indicates that the genes targeted by these miRNAs are directly associated with immunotherapy.

Additionally, several pathways related to the brain and neurons were observed, including Axon guidance [27], the neurodevelopmental process by which neurons send axons to reach their correct targets; the Neurotrophin signaling pathway [28], in which neurotrophins are proteins that support the survival, development, and function of neurons; Long-term potentiation [29], a process that strengthens signal transmission between neurons; and Dopaminergic synapse and Cholinergic synapse (Table S5). This could be because the majority of CTL-low (exclusion) samples belonged to the TCGA LGG tumor type (Table S6).

Enrichment analysis using the top miRNAs (SHAP 0.02) of each model also identified diverse pathways related to cancer and immunity (Tables S7-S9). These findings provide valuable insights into the molecular mechanisms underlying exclusion and immune response regulation in cancer.

We proceeded to validate the stepwise machine learning model based on random forests trained on all miRNAs, using data from 12 distinct tumor types not included in the previous training and test phases (Fig. 1F,I and Table S10). For the random forest classifier predicting CTL levels, we achieved an F1 score of 0.9912 and an AUC value of 0.9400 (Fig. S3A). For the random forest regression models, the MSE was 0.0478 for the dysfunction score prediction model and 0.0641 for the exclusion score prediction model. The MSE of the stepwise machine learning model for predicting the ICB response based on the TIDE score was 0.0475. Moreover, the predicted and actual values were positively correlated (PCC = 0.8698) (Fig. S3B-D).

Furthermore, we validated the predictive potential of our immunotherapy response prediction model using small subsets of informative miRNAs (SHAP 0.01 and SHAP 0.02) by applying the same approaches to the 12 tumor types (Fig. 1F,I). The model employing informative miRNAs (SHAP 0.01) to predict CTL levels using logistic regression showed an F1 score of 0.9901 and an AUC of 0.9300 (Fig. S4A). In the dysfunction and exclusion score predictions using linear regression, the MSEs were 0.0660 and 0.0677, respectively. Moreover, a positive correlation was observed between the predicted and actual values (PCC = 0.2899 and 0.4198, respectively) (Fig. S4B,C). Lastly, the stepwise model used to predict the ICB response based on the TIDE score with informative miRNAs (SHAP 0.01) yielded an MSE of 0.0661 and a PCC of 0.8335 (Fig. S4D).

Additionally, we evaluated the models built with the smaller set of informative miRNAs selected under the stricter criterion (SHAP 0.02). CTL-level prediction using the logistic regression classifier showed an F1 score of 0.9904 and an AUC of 0.9300 (Fig. S5A). For the linear regression models, the dysfunction score prediction model showed an MSE of 0.0585 and a PCC of 0.3822, and the exclusion score prediction model showed an MSE of 0.0797 and a PCC of 0.2816 (Fig. S5B,C). In addition, for the prediction of the ICB response using the combined stepwise machine learning model with SHAP 0.02, the MSE was 0.0594 and the PCC was 0.8538 (Fig. S5D). Notably, the results from the external validation datasets confirmed not only that our model exhibited robust predictive performance regardless of tumor type, but also that the informative miRNAs were useful for tumor immunotherapy response prediction.

We further validated the stepwise machine learning model trained on all miRNAs using external independent data from the Pan-Cancer Analysis of Whole Genomes (PCAWG). The parameters of each model were set through grid search with tenfold cross-validation (Table S11). For the random forest classifier predicting CTL levels, we achieved an F1 score of 0.9589 and an AUC value of 0.9226 (Table S12). For the random forest regression models, the MSE was 0.0245 for the dysfunction score prediction model and 0.0251 for the exclusion score prediction model (Table S12). The MSE of the stepwise machine learning model for predicting ICB response based on the TIDE score was 0.0248 (Table S12).

Furthermore, we identified informative miRNAs using SHAP analysis in the PCAWG cohort (Fig. S6). In addition, we investigated which miRNAs were informative in each tumor type using SHAP (Table S13). The informative miRNAs from the TCGA cohort were similarly identified in the PCAWG dataset, although a direct comparison of the miRNAs is difficult because TCGA reports precursor miRNA expression whereas PCAWG provides the mature forms. For instance, miR-150 was important in the CTL and dysfunction models, and miR-155 was also ranked highly.

We also validated the ICB response prediction models in the PCAWG cohort using the informative miRNAs (SHAP 0.01 and SHAP 0.02) extracted from the TCGA cohort (Table S12). The model employing informative miRNAs (SHAP 0.01) achieved an F1 score of 0.9556 and an AUC of 0.9161 for predicting CTL levels via logistic regression. For dysfunction and exclusion score predictions using linear regression, the MSEs were 0.0371 and 0.0528, respectively. The stepwise model for predicting ICB response based on the TIDE score with informative miRNAs (SHAP 0.01) yielded an MSE of 0.0376.

Similarly, the model using informative miRNAs (SHAP 0.02) extracted from the TCGA cohort attained an F1 score of 0.9527 and an AUC of 0.9097 for predicting CTL levels via logistic regression. For dysfunction and exclusion score predictions using linear regression, the MSEs were 0.0364 and 0.0798, respectively. Finally, the stepwise model for predicting ICB response based on the TIDE score with informative miRNAs (SHAP 0.02) yielded an MSE of 0.0364. The results with the external PCAWG datasets further affirmed the effectiveness of the informative miRNAs in predicting ICB responses.

Next, we applied the random forest-based ICB response prediction model to the TCGA cohort, stratified by tumor type, to investigate variations in the efficacy of ICI treatment across tumor types. The parameters of each model were set through grid search with tenfold cross-validation (Table S11). The MSE values of the combined stepwise models for each tumor type ranged from 0.0093 to 0.0494 (Table S12). Notably, these results were closely similar to the predictive performance derived from the entire tumor cohort. This suggests that the differences in ICI treatment response among cancer types are minimal.

In addition, we investigated which miRNAs were informative in each tumor type using SHAP (Table S13). Although there were some differences between tumor types, some informative miRNAs, such as miR-150 and miR-155, were frequently found among the highly ranked miRNAs. This result indicates that these miRNAs are closely related to ICB responses across tumor types.

Moreover, we evaluated how well the stepwise model pre-trained on all 19 TCGA cohorts predicted the test data (20%) for each tumor type (Table S14). The MSE ranged from 0.0113 to 0.1824 using all miRNAs. Using the informative miRNAs (SHAP 0.01), the MSE ranged from 0.0166 to 0.5530. Similarly, in the SHAP 0.02 model, the MSE ranged from 0.0159 to 0.5562. These results show that the informative miRNAs can be used to predict ICB treatment responses across a variety of cancer types.
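The per-tumor-type evaluation amounts to grouping the test samples by tumor type and computing the MSE within each group. A minimal sketch follows, with a hypothetical function name and placeholder tumor-type labels and values rather than study data.

```python
from collections import defaultdict

def mse_by_group(groups, observed, predicted):
    """Mean squared error computed separately for each group label."""
    squared_errors = defaultdict(list)
    for g, o, p in zip(groups, observed, predicted):
        squared_errors[g].append((o - p) ** 2)
    return {g: sum(v) / len(v) for g, v in squared_errors.items()}

# Toy test-set predictions labelled with placeholder tumor types.
tumor_type = ["type_A", "type_A", "type_B", "type_B"]
observed = [0.1, 0.3, 0.5, 0.9]
predicted = [0.2, 0.2, 0.5, 0.5]

per_type = mse_by_group(tumor_type, observed, predicted)
print(per_type)
```

Reporting the range of per-group values, as done above, shows the spread that a single pooled MSE would hide.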
