Development of machine learning-based predictors for early diagnosis of hepatocellular carcinoma | Scientific Reports – Nature.com

Derivation of HCC predictors

The overall analysis procedure is shown in Fig. 1. In the present study, we combined the two feature selection methods and eight classification algorithms described above to build sixteen predictors for HCC diagnosis, using gene expression profiles of 988 HCC and 332 CwoHCC samples from the GEO database. First, from these profiles we obtained 25,341,086 stable gene pairs in HCC and 20,559,429 stable gene pairs in CwoHCC. Among these, 5765 were stable reversal gene pairs between HCC tissues and CwoHCC tissues. Then, filtering the gene pairs against 2902 secreted genes, we obtained 242 gene pairs in which both gene i and gene j were secreted genes. Next, based on the new profiles with 242 features (gene pairs) (see Methods section), we determined the optimal feature subset (see Fig. 2). Table 1 compares the classification performance of the predictors in terms of accuracy, F1-score and AUC. The results in Table 1 showed that nine predictors (mRMR+KNN, mRMR+SVM, mRMR+LR, mRMR+XGBoost, mRMR+LMT, MRMD+KNN, MRMD+SVM, MRMD+LR and MRMD+LMT) performed excellently on all performance metrics, each reaching an accuracy of 1, an F1-score of 1 and an AUC of 1. Among these nine predictors, mRMR+KNN and mRMR+SVM used the fewest gene pairs, 11 each (see Table 2).
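The pair-filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy expression values, the secreted-gene set and the 99% stability threshold are all assumptions made for the example, and the paper's Methods section defines the actual criteria.

```python
from itertools import combinations

def stable_pairs(samples, threshold=0.99):
    """Map each stable gene pair (i, j) to its ordering direction.

    samples: list of dicts, gene name -> expression value.
    +1 means gene i > gene j in at least `threshold` of the samples,
    -1 means gene i < gene j in at least `threshold` of the samples.
    The threshold value is an assumption for illustration only.
    """
    genes = sorted(samples[0])
    n = len(samples)
    result = {}
    for i, j in combinations(genes, 2):
        higher = sum(1 for s in samples if s[i] > s[j])
        lower = sum(1 for s in samples if s[i] < s[j])
        if higher / n >= threshold:
            result[(i, j)] = 1
        elif lower / n >= threshold:
            result[(i, j)] = -1
    return result

def reversal_pairs(stable_a, stable_b):
    """Pairs that are stable in both groups with opposite directions."""
    return sorted(p for p, d in stable_a.items() if stable_b.get(p) == -d)

def encode(sample, pairs):
    """Binary feature vector: 1 if gene i > gene j in this sample, else 0."""
    return [1 if sample[i] > sample[j] else 0 for i, j in pairs]

# Toy data (hypothetical genes A, B, C; not from the paper).
hcc = [{"A": 5, "B": 1, "C": 3}, {"A": 6, "B": 2, "C": 4}]
cwohcc = [{"A": 4, "B": 8, "C": 3}, {"A": 5, "B": 9, "C": 2}]

reversed_pairs = reversal_pairs(stable_pairs(hcc), stable_pairs(cwohcc))

# Keep only pairs in which both genes are in the (hypothetical) secreted set.
secreted = {"A", "B"}
secreted_pairs = [p for p in reversed_pairs
                  if p[0] in secreted and p[1] in secreted]

# One binary feature per retained gene pair, per sample.
features = [encode(s, secreted_pairs) for s in hcc + cwohcc]
```

In this toy case two pairs reverse between the groups, one survives the secreted-gene filter, and the resulting single feature separates the two groups. The real analysis works the same way in outline but over millions of candidate pairs.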

The workflow of analyses.

The IFS curve. Features (gene pairs) ranked by the mRMR and MRMD feature selection methods were added one by one, and the optimal feature subset was the one at which the highest accuracy was achieved.
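The selection loop behind such a curve can be sketched as below, reading IFS as incremental feature selection. The function name, the `evaluate` callback and the toy accuracy values are hypothetical. Breaking ties in favour of the smaller subset is an assumption, chosen because the text prefers predictors with the fewest gene pairs.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score every top-k prefix of a ranked feature list.

    ranked_features: features ordered from most to least important
                     (for example, by mRMR or MRMD).
    evaluate: callable taking a feature subset and returning a score
              such as cross-validated accuracy.
    Returns (best_k, best_score, curve); on ties the smallest k wins.
    """
    curve = []
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        curve.append((k, score))
        if score > best_score:  # strict '>' keeps the smallest k on ties
            best_k, best_score = k, score
    return best_k, best_score, curve

# Toy example: hypothetical accuracies indexed by subset size.
toy_accuracy = {1: 0.80, 2: 0.92, 3: 1.00, 4: 1.00, 5: 0.97}
ranked = ["pair1", "pair2", "pair3", "pair4", "pair5"]

best_k, best_score, curve = incremental_feature_selection(
    ranked, lambda subset: toy_accuracy[len(subset)]
)
```

In practice `evaluate` would train and validate a classifier on the chosen columns, and `curve` holds the points that would be plotted.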

Subsequently, we used independent datasets (the testing set, GEO sets, ICGC set and TCGA set) to validate the performance of the various algorithms. As shown in Table 3, for the 3057 HCC samples and 84 CwoHCC samples, the MRMD+SVM predictor with 28 gene pairs (see Table S3) achieved the highest accuracy and F1-score among all predictors in the independent datasets; its accuracy, F1-score and AUC were 0.9834, 0.9915 and 0.9278 (95% CI 0.8915–0.9642), respectively. However, the mRMR+SVM predictor with 11 gene pairs achieved the highest AUC among all predictors in the independent datasets, 0.9384 (95% CI 0.9255–0.9514).
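For readers who want to relate these metrics to raw counts, the following sketch computes them from a binary confusion matrix using the standard definitions, with HCC taken as the positive class. The counts in the example are invented for illustration and are not taken from Table 3.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion counts.

    tp/fn: positive (HCC) samples predicted as HCC / as CwoHCC.
    tn/fp: negative (CwoHCC) samples predicted as CwoHCC / as HCC.
    """
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical counts: 100 positives, 10 negatives.
demo = binary_metrics(tp=90, fn=10, tn=8, fp=2)
```

When positives greatly outnumber negatives, as with 3057 HCC versus 84 CwoHCC samples, accuracy and F1-score are driven mostly by sensitivity. Specificity and AUC therefore carry most of the information about the negative class.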

Because the mRMR+SVM and mRMR+KNN predictors, which used the fewest gene pairs (11), performed well on all performance metrics in the independent data, and the MRMD+SVM predictor achieved the highest accuracy and F1-score in the independent datasets among the 16 predictors, we focused on these three predictors in the subsequent analysis. Their detailed validation results in biopsy and surgery samples are shown in Table 4. For biopsy samples in the testing set (29 HCC samples and 48 CwoHCC samples), both the mRMR+SVM and mRMR+KNN predictors yielded a sensitivity of 1 and a specificity of 1, whereas the MRMD+SVM predictor yielded a sensitivity of 1 and a specificity of 0.8542. In the GEO biopsy sets (GSE121248, GSE47197), the mRMR+SVM predictor correctly classified 96.18% of the 131 HCC samples, the mRMR+KNN predictor correctly classified 66.41%, and the MRMD+SVM predictor correctly classified all (100%) of them. For surgery samples in the testing set (220 HCC samples and 36 CwoHCC samples), the sensitivity and specificity of the mRMR+SVM and mRMR+KNN predictors were both 1, whereas those of the MRMD+SVM predictor were 1 and 0.8889, respectively. These results demonstrated that the mRMR+SVM, mRMR+KNN and MRMD+SVM predictors could correctly discriminate HCC from CwoHCC when using biopsy samples.

For surgery samples in the GEO surgery sets, 84.1% of the 2063 HCC samples were correctly classified by the mRMR+SVM predictor, 70.04% by the mRMR+KNN predictor and 98.01% by the MRMD+SVM predictor. Moreover, among the 2063 HCC samples, the mRMR+SVM predictor correctly recognized 79.76% of the 657 formalin-fixed paraffin-embedded (FFPE) HCC samples (GSE109211, GSE62743, GSE46444, GSE10141, GSE164760, GSE19977) as HCC, whereas the mRMR+KNN predictor correctly classified 58.14% of the 657 FFPE HCC samples and the MRMD+SVM predictor correctly classified 99.85% of them. This result demonstrated that the mRMR+SVM and mRMR+KNN predictors were applicable to FFPE samples with RNA degradation. For the RNA-seq expression data obtained from TCGA and ICGC, the mRMR+SVM predictor with 11 gene pairs correctly identified 99.19% of the 371 HCC samples and 98.77% of the 243 HCC samples, respectively.

The mRMR+KNN predictor with 11 gene pairs correctly identified 98.11% of the 371 HCC RNA-seq samples and 97.94% of the 243 HCC RNA-seq samples, and the MRMD+SVM predictor with 28 gene pairs correctly identified all 371 and all 243 HCC RNA-seq samples. This result demonstrated that the mRMR+SVM, mRMR+KNN and MRMD+SVM predictors had cross-platform ability. In summary, these three predictors had cross-platform ability and could discriminate HCC from CwoHCC when using surgery samples, including FFPE samples with RNA degradation.

Furthermore, as shown in Table S4, the mRMR+SVM predictor correctly classified 82.86% of the 741 samples of normal tissue from patients with HCC (NwHCC) and 82.04% of the 334 samples of cirrhosis tissue from patients with HCC (CwHCC); the mRMR+KNN predictor correctly classified 67.48% of the 741 NwHCC samples and 57.49% of the 334 CwHCC samples; and the MRMD+SVM predictor correctly classified 99.87% of the 741 NwHCC samples and 97.01% of the 334 CwHCC samples. This result showed that these three predictors could distinguish HCC-adjacent tissues (CwHCC and NwHCC) from CwoHCC when using biopsy and surgery samples.

In conclusion, for biopsy and surgery samples, these three predictors could distinguish HCC and its adjacent tissues (CwHCC and NwHCC) from CwoHCC even when the sampling location was inaccurate and the samples were FFPE samples with RNA degradation. Additionally, these three predictors had cross-platform ability. Importantly, the HCC diagnostic signature based on MRMD+SVM outperformed the mRMR+KNN and mRMR+SVM predictors in some independent datasets.

To further verify the performance of the mRMR+SVM, mRMR+KNN and MRMD+SVM predictors developed in the current study, we compared them with existing predictors. Two published studies have derived REO-based signatures for early HCC diagnosis: one by Ao et al. and our own previous work. In 2018, combining rank difference with a majority voting rule, Ao et al. presented a signature derived from 491 HCC samples and 149 CwoHCC samples. This signature, comprising 19 gene pairs chosen from 72 reversal gene pairs, yielded an accuracy of 0.9969. In 2020, we identified an early diagnostic signature of HCC from 857 reversal gene pairs on the basis of mRMR and SVM. Using 1091 HCC samples and 242 CwoHCC samples, 11 gene pairs were derived and denoted as the signature, which achieved an accuracy of 1. Because the training data differ, directly comparing the results in this paper with those reported in the previous studies would be unfair. Therefore, we applied the same evaluation criteria to all methods and, to assess the effectiveness of the presented predictors objectively, compared their results in the independent datasets.
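A majority voting rule over gene pairs, of the general kind mentioned above, might look like the sketch below. The details of Ao et al.'s rule are not given in this text, so the voting direction (gene i > gene j counts as a vote for HCC), the strict-majority cutoff, and the gene names G1 to G6 are all assumptions for illustration.

```python
def majority_vote(sample, signature):
    """Classify one sample by voting over gene pairs.

    sample: dict, gene name -> expression value.
    signature: list of (gene_i, gene_j) pairs; by assumption, each pair
               votes "HCC" when gene_i is expressed higher than gene_j.
    A strict majority of HCC votes is required (also an assumption).
    """
    hcc_votes = sum(1 for i, j in signature if sample[i] > sample[j])
    return "HCC" if hcc_votes > len(signature) / 2 else "CwoHCC"

# Hypothetical three-pair signature and two toy samples.
signature = [("G1", "G2"), ("G3", "G4"), ("G5", "G6")]
sample_a = {"G1": 5, "G2": 1, "G3": 4, "G4": 2, "G5": 1, "G6": 3}  # 2 votes
sample_b = {"G1": 1, "G2": 5, "G3": 4, "G4": 2, "G5": 1, "G6": 3}  # 1 vote

label_a = majority_vote(sample_a, signature)
label_b = majority_vote(sample_b, signature)
```

Because each vote depends only on the within-sample ordering of two genes, a rule like this needs no cross-sample normalization, which is the usual motivation for REO-based signatures.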

As shown in Table 2, on the training set both the mRMR+SVM predictor and the mRMR+KNN predictor, each with 11 gene pairs, achieved an accuracy of 1 and an F1-score of 1 while using the fewest gene pairs. The MRMD+SVM predictor with 28 gene pairs also achieved an accuracy of 1 and an F1-score of 1. As shown in Table 3, for a total of 3057 HCC samples and 84 CwoHCC samples, the mRMR+SVM predictor was the best predictor in terms of AUC (0.9384), with an accuracy of 0.8914 and an F1-score of 0.9351. In Table 4 and Table S4, for biopsy samples, the mRMR+SVM predictor correctly identified 96.18% of the 131 HCC samples from 2 datasets (GSE121248, GSE47197) as HCC. Moreover, it classified 75.26% of the 97 NwHCC samples from 2 datasets (GSE121248 and GSE64041) and all 80 CwHCC samples in GSE54236 as HCC. The MRMD+SVM predictor, in turn, correctly identified all 131 HCC samples as HCC and classified all 97 NwHCC samples and all 80 CwHCC samples as HCC. For surgery samples, 1800 HCC samples from 24 datasets were used for evaluation, 657 of which were FFPE HCC samples from 6 datasets. The mRMR+SVM predictor identified the 1800 HCC samples and the 657 FFPE HCC samples with sensitivities of 0.8428 and 0.7976, respectively, and the MRMD+SVM predictor did so with sensitivities of 0.9872 and 0.9985, respectively. This result demonstrated that the mRMR+SVM and MRMD+SVM predictors had the potential to classify FFPE samples with partial RNA degradation. Moreover, the mRMR+SVM predictor predicted 614 of the 741 NwHCC samples from 9 datasets and 229 of the 334 CwHCC samples from 6 datasets as HCC, whereas the MRMD+SVM predictor predicted all 741 NwHCC samples and all 334 CwHCC samples as HCC.

For RNA-seq data, the mRMR+SVM predictor identified 368 of the 371 HCC samples from TCGA and 11 of the 50 NwHCC tissues as HCC, whereas the MRMD+SVM predictor identified all 371 HCC samples and all 50 NwHCC tissues as HCC. In addition, the mRMR+SVM predictor correctly identified 240 of the 243 HCC samples from ICGC as HCC, and the MRMD+SVM predictor correctly identified all 243.

Table S4 shows how well HCC and its adjacent non-cancerous tissues (NwHCC and CwHCC) were distinguished from CwoHCC in biopsy and surgery samples. For the 131 HCC biopsy samples, the sensitivities of the proposed mRMR+SVM predictor with 11 gene pairs (18 secreted genes) and the MRMD+SVM predictor with 28 gene pairs were 0.7526 and 1, respectively, both higher than that of Ao's method (0.6031). The proposed mRMR+SVM predictor also identified the 80 CwHCC samples better than Ao's method. Additionally, among these methods, the mRMR+SVM and MRMD+SVM predictors gave the better classification of the 657 HCC FFPE samples, the 1800 HCC surgery samples (which include the 657 HCC FFPE samples) and all 1931 HCC samples (the 1800 HCC surgery samples plus the 131 HCC biopsy samples). For the 657 HCC FFPE samples, the accuracies of Ao's method, our previous method (11 gene pairs, 2020), the proposed mRMR+SVM predictor and the proposed MRMD+SVM predictor were 0.172, 0.3973, 0.7976 and 0.9985, respectively. For the 1800 HCC samples, the corresponding accuracies were 0.6639, 0.7656, 0.8428 and 0.9872. For the 1931 HCC samples, the accuracy was 0.6572 for Ao's method and 0.7815 for our previous method, rising to 0.8503 and 0.97 for the proposed mRMR+SVM and MRMD+SVM predictors, respectively. These results suggested that the mRMR+SVM and MRMD+SVM predictors performed better than Ao's method and our previous method.

In conclusion, the methods developed in this paper achieved higher accuracy and better predictive and diagnostic ability than other published methods, especially for FFPE samples. Therefore, the mRMR+SVM and MRMD+SVM predictors were deemed superior and more suitable for facilitating early HCC diagnosis in clinical practice.
