A comparative study of 11 non-linear regression models highlighting autoencoder, DBN, and SVR, enhanced by SHAP … – Nature.com

Result 1: Evaluation of the prediction accuracy of 11 AI models

Given the complex relationship between phenotype and genotype, genes can contribute positively or negatively to traits; some genes have a significant impact, while others have only a minor influence. We therefore applied non-linear regression algorithms in this study.

Considering the specific problem of predicting phenotypes and the characteristics of the SNP genotype dataset, we conducted a comparison using the most suitable non-linear regression machine learning and deep learning models. Seven machine learning models and four deep learning models were selected for this purpose. The seven machine learning models are SVR (Support Vector Regression), XGBoost (Extreme Gradient Boosting) regression, Random Forest regression, LightGBM regression, Gaussian Process (GP) regression, Decision Tree regression, and Polynomial regression. The four deep learning models are Deep Belief Network (DBN) regression, Artificial Neural Network (ANN) regression, Autoencoder regression, and Multi-Layer Perceptron (MLP) regression.

In this study, we collected phenotype data from 1918 soybean accessions and used the corresponding SNP genotype data. To address the large size and redundancy of the genotype data, we employed a two-step preprocessing approach. First, we used a one-hot encoder to convert the genotype data (A/T/C/G nucleotide codes) into a numeric array. Then, Principal Component Analysis (PCA) was used to reduce the dimensionality of the data. Finally, the selected models were applied using their respective algorithms.
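The two preprocessing steps above can be sketched as follows. This is a minimal illustration with a toy genotype matrix, not the study's actual data or component count; the variable names and the choice of two principal components are assumptions for demonstration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

# Toy genotype matrix: 4 accessions x 3 SNP sites (A/T/C/G calls).
genotypes = np.array([
    ["A", "T", "G"],
    ["A", "C", "G"],
    ["T", "C", "G"],
    ["T", "T", "A"],
])

# Step 1: one-hot encode each SNP column into a binary array.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(genotypes).toarray()

# Step 2: reduce dimensionality with PCA (the component count is a tunable choice).
pca = PCA(n_components=2)
features = pca.fit_transform(onehot)
print(features.shape)  # one row per accession, one column per component
```

In practice the number of retained components would be chosen from the explained-variance curve of the real SNP matrix rather than fixed at two.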

GridSearchCV is a cross-validation-based hyperparameter search procedure. We employed it to fine-tune the hyperparameters and identify the best model for phenotype prediction. To evaluate the performance of the regression models, we used four metrics: R2 (R-squared), MAE (Mean Absolute Error), MSE (Mean Squared Error), and MAPE (Mean Absolute Percentage Error). These metrics were used to assess the prediction accuracy of each model.
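A hedged sketch of this tuning-and-evaluation workflow is shown below, using SVR as one of the candidate models. The synthetic data and the small parameter grid are illustrative only, not the study's actual settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR
from sklearn.metrics import (
    r2_score,
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200) + 1.0  # non-linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search over a small illustrative hyperparameter grid with 5-fold CV.
grid = GridSearchCV(SVR(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
grid.fit(X_train, y_train)

# Score the tuned model on held-out data with all four metrics.
y_pred = grid.predict(X_test)
print("R2:  ", r2_score(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_test, y_pred))
```

The same pattern applies to each of the other ten models, with a model-specific parameter grid.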

Our results showed that, among the seven machine learning models and four deep learning models, Polynomial Regression exhibited the highest training performance, with an R2 of 1.000 and MAE, MSE, and MAPE values of 0.000 during training, indicating a near-perfect match between predicted and actual values (Table 1). Among the seven machine learning models, LightGBM achieved the highest training-set R2, an impressive 0.967, followed closely by SVR with a training R2 of 0.926. On the test set, the SVR model performed best, achieving the highest R2 of 0.637, closely followed by Polynomial Regression with an R2 of 0.614. For Mean Absolute Error (MAE), the LightGBM model recorded the lowest training value, 0.068; on the test set, Polynomial Regression achieved the lowest MAE of 0.216, followed by SVR with an MAE of 0.237. For Mean Squared Error (MSE), LightGBM again showed the lowest training value, 0.009, while on the test set SVR achieved the lowest MSE of 0.096, followed by Polynomial Regression at 0.102. For Mean Absolute Percentage Error (MAPE), Polynomial Regression had the lowest training value, followed by LightGBM with a MAPE of 0.025; on the test set, Polynomial Regression and SVR recorded the lowest MAPE values, 0.080 and 0.086 respectively (Table 1).

Among the four deep learning models, the ANN model achieved the best training performance, with the highest training R2 of 0.995 and the lowest training MAE (0.011), MSE (0.001), and MAPE (0.004) (Table 1). When comparing the R2, MAE, and MSE metrics in the testing phase, the Autoencoder model performed best, as detailed below.

Across all the models evaluated, the Autoencoder model had the highest R2 on the test set, reaching an impressive 0.991. It also obtained the lowest test MAE of 0.034 and the lowest test MSE of 0.002, indicating an excellent fit (Table 1). Furthermore, the Autoencoder achieved the lowest test MAPE of 0.011, indicating good performance on unseen data.

Examining the test results, the correlation analysis reveals that R2_Autoencoder (0.991) outperforms R2_DBN (0.704), R2_SVR (0.637), and R2_Polynomial Regression (0.614). In the MAE analysis, MAE_Autoencoder (0.034) is lower than MAE_DBN (0.201), MAE_Polynomial Regression (0.216), and MAE_SVR (0.237). The MSE analysis shows that MSE_Autoencoder (0.002) is less than MSE_DBN (0.082), MSE_SVR (0.096), and MSE_Polynomial Regression (0.102). In the MAPE analysis, MAPE_Autoencoder (0.011) is lower than MAPE_DBN (0.072), MAPE_Polynomial Regression (0.080), and MAPE_SVR (0.086).

In summary, based on our analysis of predictive model accuracy, the top four models are Autoencoder, DBN, SVR, and Polynomial Regression. This includes two machine learning models, SVR and Polynomial Regression, and two deep learning models, Autoencoder and DBN.

It should be noted that several articles have highlighted the drawbacks of percentage-error metrics like MAPE. Stephan and Roland caution against relying on MAPE for selecting the best forecasting method or rewarding accuracy, emphasizing the potential pitfalls associated with its minimization (Stephan and Roland, 2011)35. To further assess each model's performance and better understand the relative disparities between testing and training results, taking into account the magnitudes of the values being compared, we applied Relative Difference Analysis (RDA) to all four evaluation metrics (Table 1). Our findings revealed that among the 11 models analyzed, the Autoencoder demonstrated the most favorable performance, with relative difference values of 0.001 for R2, 0.014 for MAE, 0.046 for MSE, and 0.008 for MAPE. The Decision Tree model achieved the lowest relative difference for MAPE, 0.156, but also exhibited the highest R2 relative difference, 1.035. The Polynomial Regression model displayed the highest relative difference values for the error metrics: 2.000 for MAE, 2.000 for MSE, and 2.000 for MAPE (Table 1).
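A minimal sketch of the relative-difference computation is shown below. The exact definition used in the study is not given in this section, so the symmetric form here is an assumption; it is consistent with the reported value of 2.000 for Polynomial Regression, which is the maximal gap this formula can produce when a training error is exactly zero.

```python
def relative_difference(train_value: float, test_value: float) -> float:
    """Symmetric relative difference: |a - b| / mean(|a|, |b|)."""
    denom = (abs(train_value) + abs(test_value)) / 2.0
    return abs(train_value - test_value) / denom if denom else 0.0

# Polynomial regression MAE: train 0.000 vs test 0.216 -> 2.0 (maximal gap,
# because the denominator is half the test value when the train value is 0).
print(relative_difference(0.000, 0.216))  # 2.0

# A small train/test gap yields a correspondingly small relative difference.
print(relative_difference(0.033, 0.034))
```

The appeal of this form over a plain ratio is that it stays bounded (in [0, 2]) and treats train and test values symmetrically, which matters when one of the two is near zero.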

It has come to our attention that the R2 test score of the Multiple Linear Regression (MLR) model is approximately four times lower than the R2 train score of the Autoencoder model. Additionally, the test loss of the MLR model is noticeably higher than its train loss. The test MAE is 0.331, a staggering 1.08E+14 times greater than the MLR model's train MAE, and the MLR model's test MSE is 1.26E+28 times higher than its train MSE (Table 1). This observation strongly suggests a severe case of underfitting in the MLR model, as depicted in Table 1. Underfitting is distinct from overfitting, in which a model performs well on the training data but struggles to generalize to the testing data. Underfitting becomes evident when a model's simplicity prevents it from establishing a meaningful relationship between the input and output variables. The underfitting of the MLR model indicates that a linear model is too simplistic to be used effectively for phenotype prediction.
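The point about linear models underfitting non-linear genotype-phenotype relationships can be illustrated with a small synthetic example (not the study's data): a purely linear fit to a quadratic target yields a low R2 on both splits, because the model family simply cannot represent the relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 1))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=300)  # quadratic (non-linear) target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linear = LinearRegression().fit(X_train, y_train)

# Both scores are low: the model is too simple, i.e. it underfits.
print("train R2:", round(r2_score(y_train, linear.predict(X_train)), 3))
print("test R2: ", round(r2_score(y_test, linear.predict(X_test)), 3))
```

Contrast this with overfitting, where the training score is high while the test score collapses.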

In order to further evaluate these 11 models, we plotted prediction accuracy evaluation based on Mean Absolute Error (MAE) (Fig.1), as well as overfitting evaluation based on Mean Squared Error (MSE) (Fig.2).

Histogram of Prediction Accuracy Evaluation of 11 Models by MAE Value. In this figure, the panels display accuracy scores from the model evaluation using Mean Absolute Error (MAE). The blue dots represent predictions on the training data (y_train_pred), while the orange dots correspond to predictions on the testing data (y_test_pred). The X-axis represents the true values, and the Y-axis represents the predicted values.

Overfitting Evaluation of 11 Models Based on MSE Value. In this figure, the plots illustrate the evaluation of overfitting for each model. The blue line represents the Mean Squared Error (MSE) of the training data, while the orange line represents the MSE of the testing data. The Y-axis indicates the MSE value, and the X-axis corresponds to a different parameter for each model: for Decision Tree, XGBoost, and Random Forest, the X-axis represents the max depth; for Gaussian Process, the degrees of freedom; for SVR (Support Vector Regression), the C value; for LightGBM, the number of iterations; for Polynomial regression, the polynomial degree; for DBN (Deep Belief Network) regression and Multilayer Perceptron, the hidden layer size; and for the Autoencoder and ANN (Artificial Neural Network) models, the number of epochs. Each model's performance and overfitting tendencies can be observed and compared using these representations.

The probability plots of standardized residuals for each regression model provide a clear visual representation. The true values and predictions of the Autoencoder model align well along the 45-degree line, with an MAE of 0.03 for the training set and 0.03 for the test set, demonstrating that the model's predictions adhere to the normality assumption. Similarly, the SVR model (MAE train=0.11, MAE test=0.24), XGBoost model (MAE train=0.12, MAE test=0.25), and DBN model (MAE train=0.03, MAE test=0.02) also show good alignment between true values and predictions. By contrast, the Multilayer Perceptron, Decision Tree, Polynomial Regression, and MLR models exhibit a looser aggregation, with data points scattered more widely around the 45-degree line (Fig.1). The results of the overfitting analysis indicate that the SVR, LightGBM, Autoencoder, and ANN models fit both the training and test data exceptionally well, demonstrating stable performance (Fig.2). While the testing loss of the MLP model fluctuates significantly when the hidden layer size is below 400, it fits the training and test data robustly once the hidden layer size exceeds 400. In contrast, the Decision Tree and DBN models show relatively poorer fits. As evident from the figures, the Decision Tree model displays the smallest disparity between training and testing losses when the maximum depth is set to 5.0; when the depth is below or above 5.0, the gap tends to widen. For the DBN model, a relatively stable gap between training and testing losses is maintained for hidden layer sizes below 100, but the gap gradually increases beyond that. Similarly, the Polynomial regression model performs well when the polynomial degree is below 7; however, once the degree surpasses 9, there is a sharp increase in the gap between the training and testing losses (Fig.2). Both the Random Forest and Gaussian Process models exhibit a growing gap between training and testing losses as the maximum depth or the degrees of freedom increase (Fig.2).

In summary, based on our comprehensive analysis, it is evident that the Autoencoder, SVR, and ANN outperform the other models in relative terms. These models are suitable for genotype-to-phenotype prediction and minor-QTL mapping, and could be powerful tools in AI-assisted breeding practice.

Our objective is to discover the most effective artificial intelligence model and utilize feature selection techniques to pinpoint genes responsible for specific physiological activities in plants. These identified genes will aid in precise phenotype prediction and gene function mining. To ensure reliability, efficiency, low computational requirements, versatility, and openness, this study employs the Support Vector Regression (SVR) model as an illustrative example. We assess four distinct feature selection algorithms: Variable Ranking, Permutation, SHAP, and Correlation Matrix. Beyond the feature importance data, the Correlation Matrix method provides additional insights: a heatmap is employed to visualize the strength of correlations. In Fig.3, we present the heatmap of the top 100 features identified through the Correlation Matrix analysis based on the SVR model (Fig.3). Additionally, the SHAP output plot offers a concise representation of the distribution and variability of SHAP values for each feature. Fig.4 illustrates the summary beeswarm plot of the top 20 features derived from our SHAP importance analysis based on the SVR model, effectively capturing the relative effect of all the features across the entire dataset (Fig.4).
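A minimal sketch of two of the four feature-selection approaches (permutation importance and the correlation matrix) around an SVR model is shown below. The synthetic data and variable names are illustrative assumptions; the SHAP beeswarm of Fig.4 would additionally use the separate `shap` package (for a non-tree model such as SVR, typically a model-agnostic explainer like `shap.KernelExplainer` over a background sample).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only features 0 and 3 actually influence the target in this toy example.
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)

model = SVR().fit(X, y)

# Permutation importance: the drop in score when each feature is shuffled.
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
perm_ranking = np.argsort(perm.importances_mean)[::-1]

# Correlation with the target per feature (the basis of a heatmap as in Fig.3).
corr_with_y = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
)
corr_ranking = np.argsort(corr_with_y)[::-1]

print(perm_ranking[:2], corr_ranking[:2])  # the informative features rank first
```

Unlike these two methods, SHAP assigns each feature a signed per-sample contribution, which is what makes the negative contributions discussed below visible.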

Correlogram of Top 100 Features (SNPs) Identified in SVR Correlation Analysis. The figure displays a heatmap representing the correlations between the top 100 features (Single Nucleotide Polymorphisms, SNPs) identified in the SVR (Support Vector Regression) correlation analysis. The heatmap uses varying shades of gray, with higher values indicating stronger correlations between the variables. This visualization allows for a clear assessment of the interrelationships among the features, providing valuable insights into their associations and potential implications in the study.

Summary Beeswarm Plot of Top 20 Features from SHAP Importance Analysis based on SVR Model. This figure presents a beeswarm plot summarizing the top 20 features derived from our SHAP (SHapley Additive exPlanations) importance analysis using the SVR (Support Vector Regression) model. The plot visually captures the relative effect of each feature across the entire dataset, allowing for a comprehensive understanding of their respective influences. The beeswarm plot provides an intuitive representation of the feature importances, aiding in the identification of key contributors to the model's predictions and facilitating insightful data-driven decisions.

We ranked all SNPs based on the absolute values of feature importance obtained from each of the four feature selection methods (see Supplementary 1). Because the ranking results do not satisfy the assumptions of normality and equal variances, we assessed the significance of the differences between these rankings using the Wilcoxon signed-rank test instead of the paired t-test.
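This paired, non-parametric comparison can be sketched as follows; the toy rankings below stand in for the actual SNP rankings of any two of the four methods.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Toy ranks of the same 50 SNPs under two feature-selection methods.
ranks_method_a = rng.permutation(50)
ranks_method_b = rng.permutation(50)

# Wilcoxon signed-rank test on the paired rank differences: unlike the
# paired t-test, it does not assume the differences are normally distributed.
stat, p_value = wilcoxon(ranks_method_a, ranks_method_b)
print(f"statistic={stat:.1f}, p={p_value:.3f}")
# p < 0.05 would indicate a significant difference between the two rankings.
```

Repeating this over all six method pairs yields a table of pairwise p-values such as Table 2.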

Our results showed that the difference between the Variable Ranking and Permutation rankings is significant at the P < 0.05 level. The differences between the Variable Ranking and the Correlation Matrix or SHAP rankings were not significant, nor were the differences between the Permutation ranking and the Correlation Matrix or SHAP rankings. The difference between the Correlation Matrix and SHAP rankings was also not significant (Table 2).

Compared to the importance results of the other three methods, SHAP importance provides much richer information about genes with negative contributions (Supplementary 1). Understanding both positive and negative contributions is vital for studying a gene's function and its role in plant physiological activities. Consequently, we used the SHAP importance results in the subsequent biological analysis.

By employing the Basic Local Alignment Search Tool (BLAST), we conducted a comparative analysis of the sequences associated with 1033 single nucleotide polymorphisms (SNPs) against the annotated genes available in the SoyBase database (https://www.soybase.org/). Among these SNPs, 253 displayed a perfect match with their corresponding genes (see Supplementary 2). Subsequently, we performed a Gene Ontology (GO) analysis on the 111 genes corresponding to these SNPs and mapped their positions onto the soybean chromosomes, as illustrated in Fig.5.

Whole Genome View of 111 Identified Genes. The figure presents a visual representation of identified genes, where each red dot represents a corresponding gene from the BLAST (Basic Local Alignment Search Tool) hit. The genes displayed in the plot are related to soybean branching. This comprehensive genome view provides valuable insights into the spatial distribution and clustering patterns of the branching-related genes, aiding in the exploration and understanding of their potential functional significance.

We conducted GO enrichment analysis on these 111 genes from three aspects: molecular function, cellular component, and biological process. Our analysis revealed that the GO terms related to Biological Process could be clustered into seven categories, with a total of 31 gene occurrences. The most prominent category was "signal transduction" (11 of 31), followed by "translation" and "lipid metabolic process", each accounting for six of the 31 occurrences. Regarding Molecular Function, the GO terms could be grouped into 13 categories, with a total of 157 gene occurrences. The most prevalent category was "protein binding" (31 of 157), followed by "transferase activity" (22 of 157) and "kinase activity" (20 of 157). Concerning Cellular Component, the GO terms could be classified into 21 categories, with a total of 380 gene occurrences. The largest category was "plasma membrane" (56 of 380), followed by "cytoplasm" (42 of 380) and "extracellular region" (42 of 380). For detailed results, please refer to Fig.6 and Supplementary 3.

Analysis of GO ontologies distribution. The figure displays three pie charts representing the distribution of three kinds of Gene Ontology (GO) ontologies, namely Cellular Component, Molecular Function, and Biological Process. Each pie chart is color-coded to distinguish different types of GO, and the size of each segment represents the proportion of that specific GO type within its respective ontology category. The accompanying number table provides the count of genes associated with each GO type, followed by the ID and category of the corresponding GO term. This analysis provides a comprehensive overview of the functional annotations of the genes in the study, highlighting their involvement in various cellular components, molecular functions, and biological processes.

Furthermore, we performed Gene Ontology enrichment analysis using the agriGO database. The outcomes revealed the functional distribution of the 111 genes associated with biological processes (Fig.7). Notably, these processes exhibited a significant level (level 19) of overall metabolic activity. We observed negative regulation between multicellular organismal processes and cell recognition. Additionally, a complex interplay of negative and positive regulation among reproduction-related processes, including reproductive process, pollination, pollen-pistil interaction, and recognition of pollen, was detected (Fig.7).

GO Term Enrichment Analysis of 244 Genes using the agriGO Database Corresponding to Biological Function. The figure presents the results of GO term enrichment analysis performed on the 244 genes using the agriGO database, focusing on their biological functions. The color shading ranges from red to yellow, representing the significance levels of the enriched GO terms, with red indicating strong significance and yellow indicating weaker significance. Furthermore, different arrow types indicate the regulatory relationships between the enriched GO terms and the genes; for instance, a green arrow signifies negative regulation, while other arrow types correspond to other regulation types. This analysis provides valuable insights into the functional annotations and regulatory relationships of the studied genes, shedding light on their roles and potential biological implications.
