Cancer-associated genes and essentiality scores
We first determined whether cancer-related genes are likely to have high essentiality scores. We aggregated several essentiality scores calculated by multiple metrics5 for the list of genes identified in the COSMIC Census database (Oct 2018) and for all other human protein coding genes. Two different approaches to scoring gene essentiality are available. The first group of methods calculates essentiality scores by measuring the degree of loss of function caused by a change (represented by variation detection) in the gene; it comprises the residual variation intolerance score (RVIS), LoFtool, Missense-Z, the probability of loss-of-function intolerance (pLI) and the probability of haplo-insufficiency (Phi). The second group (Wang, Blomen, Hart and EvoTol) studies the impact of variation on cell viability. For all of these methods, a higher score indicates a higher degree of essentiality. Each method is described in detail in5.
We find that on average the cancer genes exhibit a higher degree of essentiality than all protein coding human genes, across all metrics (Fig. 1). Genes associated with cancer have higher essentiality scores on average in both categories (intolerance to variants and cell line viability) compared to the average scores across all human genes; P values are consistently < 0.00001 (Table 1).
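As a minimal illustration of this comparison (not the exact analysis pipeline), assume the aggregated scores sit in a table with one row per gene, one column per essentiality metric and a boolean flag for COSMIC Census membership; the file and column names below are hypothetical, and a Mann–Whitney U test is used purely for illustration, as only the resulting P values are reported here.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical input: one row per gene, one column per essentiality metric,
# plus a boolean "is_cosmic" flag for COSMIC Census membership.
scores = pd.read_csv("essentiality_scores.csv")

metrics = ["RVIS", "LoFtool", "Missense_Z", "pLI", "Phi"]  # intolerance-based metrics
for metric in metrics:
    cancer = scores.loc[scores["is_cosmic"], metric].dropna()
    other = scores.loc[~scores["is_cosmic"], metric].dropna()
    # One-sided test: cancer genes are expected to score higher (more essential).
    stat, p = mannwhitneyu(cancer, other, alternative="greater")
    print(f"{metric}: cancer mean = {cancer.mean():.3f}, "
          f"other mean = {other.mean():.3f}, p = {p:.2e}")
```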
We also investigated whether Tumor Suppressor Genes (TSGs) and oncogenes, as distinct groups, show different degrees of essentiality. (If a gene is known to be both an oncogene and a TSG, its essentiality score was included in both groups.) We found no significant differences in average essentiality for either group compared to the set of all cancer genes (Table 1; Fig. 1).
These results are of particular interest in the context of cancer, as essential genes have been shown to evolve more slowly than nonessential genes20,21,22, although some discrepancies have been reported22. A slower evolutionary rate implies a lower probability of evolving resistance to a cancer drug. This is particularly important for anticancer drugs, as these drugs have been reported to change the selection pressure when administered, leading to increased drug resistance23.
This association between cancer-related genes and essentiality scores prompted us to develop methods to identify cancer-related genes using this information. We used a machine-learning approach, applying and testing a range of open-source algorithms to produce the most accurate classifier. We focused on properties related to protein–protein interaction networks, as essential genes are likely to encode hub proteins, i.e., those with the highest degree values in the network21,24.
A total of nine different modelling approaches (or configurations) were run on the data to ensure the selection of the best-performing approach (these are listed in Supplementary Information Table 2, along with their performance metrics). The metric used to rank the models was Logarithmic Loss (LogLoss), an appropriate and well-established performance measure for binary classification models. LogLoss measures the confidence of each prediction and heavily penalises confident but incorrect classifications. The metric selection mechanism takes the type of model (binary classification in this case) and the distribution of values into consideration when recommending a performance metric; other performance metrics were also calculated (Supplementary Information Table 2). All metrics were calculated on the validation and test (holdout) sets to ensure that the model is not over-fitting. The model with the best LogLoss was the eXtreme Gradient Boosted Trees Classifier with Early Stopping. It shows very similar LogLoss values for the training/validation and holdout data sets (Table 2), indicating no over-fitting.
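For intuition, LogLoss averages -[y log(p) + (1 - y) log(1 - p)] over all predictions, so confident but wrong predictions are penalised far more heavily than cautious ones. A small illustrative example with toy values (not our data):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]

# Confident, mostly correct predictions give a low (good) LogLoss...
confident_good = [0.95, 0.05, 0.90, 0.10]
# ...while a single confident mistake (third gene) is penalised heavily.
confident_bad = [0.95, 0.05, 0.05, 0.10]

print(log_loss(y_true, confident_good))  # ~0.08
print(log_loss(y_true, confident_bad))   # ~0.80
```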
The model development workflow (i.e., the model blueprint) is shown in Fig. 2. It shows the pre-processing steps and the algorithm used in our final model, and illustrates the steps involved in transforming the input into a model. In this diagram, the Ordinal encoding of categorical variables node converts categorical variables to an ordinal scale, while the Missing Values Imputed node imputes missing values. Numeric variables with missing values were imputed with an arbitrary value (default 9999). This is effective for tree-based models, as they can learn a split between the arbitrary value (9999) and the rest of the data (which is far away from this value).
Model development stages.
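A minimal sketch of such a blueprint, assuming a table of gene properties with a binary is_cancer_gene target (file and column names are hypothetical, and the automated platform's exact pipeline is not reproduced here):

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

data = pd.read_csv("gene_properties.csv")   # hypothetical table of gene properties
y = data.pop("is_cancer_gene")              # hypothetical binary target column

# Ordinal encoding of categorical variables.
cat_cols = data.select_dtypes(include="object").columns
data[cat_cols] = OrdinalEncoder(handle_unknown="use_encoded_value",
                                unknown_value=-1).fit_transform(data[cat_cols])

# Impute missing numeric values with an arbitrary constant far from the real data;
# tree-based models can learn a split separating 9999 from the genuine values.
data = data.fillna(9999)

X_train, X_valid, y_train, y_valid = train_test_split(
    data, y, test_size=0.2, stratify=y, random_state=0)

# Gradient boosted trees with early stopping on the validation LogLoss.
model = xgb.XGBClassifier(n_estimators=1000, eval_metric="logloss",
                          early_stopping_rounds=50)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
```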
To demonstrate the effectiveness of our model, we constructed a lift chart (Fig. 3). The entire validation dataset was divided into 10 segments (bins) ordered by average predicted outcome, and for each segment the chart shows the average actual outcome (whether a gene has been identified as a cancer gene or not) alongside the average predicted outcome. The left side of the curve corresponds to segments where the model predicted low scores, and the right side to segments where it predicted high scores. The "Predicted" blue line displays the average prediction score for the rows in each bin; the "Actual" red line displays the actual percentage of cancer genes among the rows in that bin. By showing the actual outcomes alongside the predicted values, we can see how close the predictions are to the known outcome in each segment of the dataset, and whether accuracy diverges between segments dominated by confirmed cancer genes and those dominated by non-cancer genes.
The Lift Chart illustrating the accuracy of the model.
In general, the steeper the actual line and the more closely the predicted line matches it, the better the model. A close relationship between these two lines is indicative of predictive accuracy, and a consistently increasing line is another good indicator of satisfactory model performance. The lift chart for our model (Fig. 3) thus indicates high predictive accuracy.
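A sketch of how such a lift chart can be derived from validation-set predictions (file name, column names and plotting details are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical validation-set results: one row per gene with the actual label (0/1)
# and the model's predicted probability of being a cancer gene.
df = pd.read_csv("validation_predictions.csv")   # columns: actual, predicted

# Split the rows into 10 equal-sized bins ordered by predicted score,
# then average the actual and predicted outcomes within each bin.
df["bin"] = pd.qcut(df["predicted"].rank(method="first"), q=10, labels=False)
lift = df.groupby("bin")[["actual", "predicted"]].mean()

lift.plot(marker="o")
plt.xlabel("Bin (ordered by average predicted score)")
plt.ylabel("Average outcome")
plt.show()
```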
In addition, the confusion matrix (Table 3) and the summary statistics (Table 4) show the actual versus predicted values for both true/false categories for our training dataset (80% of the total dataset). The model reached just over 89% specificity and 60% sensitivity in predicting cancer genes. This means that we are able to detect over half of cancer genes successfully while misclassifying only around 10% of non-cancer genes within the training/validation datasets. The summary statistics (Table 4) also include the F1 score (the harmonic mean of precision and recall) and the Matthews Correlation Coefficient (MCC, the geometric mean of the regression coefficients of the problem and its dual). The low F1 score reflects our choice to maximise the true negative rate (preventing significant misclassification of non-cancer genes).
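These summary statistics follow directly from the confusion matrix; a small sketch with toy labels and probabilities (the real threshold and data are not reproduced here):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef

# Toy labels and predicted probabilities, one value per gene.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
probabilities = np.array([0.8, 0.3, 0.6, 0.4, 0.1, 0.7, 0.2, 0.5])

threshold = 0.5                                    # illustrative classification threshold
y_pred = (probabilities >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)   # fraction of non-cancer genes correctly rejected
sensitivity = tp / (tp + fn)   # fraction of cancer genes correctly detected
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(specificity, sensitivity, f1, mcc)
```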
To further confirm the model's ability to predict cancer genes, we applied the model to the 190 new cancer genes added to the COSMIC Cancer Census between October 2018 and April 2020. The model predicted 56 of these 190 newly added genes as cancer genes; all 56 had previously been counted among the model's false positives. This indicates that the model is indeed suitable for predicting novel candidate cancer genes that could be experimentally confirmed later. A full ranked list of candidate genes predicted to be cancer associated by our model is available in Supplementary Information Table 3.
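As an illustration only, checking the overlap between the model's candidate list and the newly added census genes reduces to a simple set intersection (file names, column names and the threshold below are hypothetical):

```python
import pandas as pd

# Hypothetical inputs: the model's per-gene scores and the genes newly added to the
# COSMIC Census between October 2018 and April 2020.
predictions = pd.read_csv("gene_predictions.csv")        # columns: gene, score
new_census = set(pd.read_csv("cosmic_additions.csv")["gene"])

threshold = 0.5                                          # illustrative threshold
predicted_cancer = set(predictions.loc[predictions["score"] >= threshold, "gene"])

recovered = predicted_cancer & new_census
print(f"{len(recovered)} of {len(new_census)} newly added census genes were predicted")
```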
Another way to visualise the model performance, and to determine the optimal score to use as a threshold between cancer and non-cancer genes, is the prediction distribution graph (Fig. 4), which illustrates the distribution of outcomes. The distribution in purple shows the outcomes where a gene is not classified as a cancer gene, while the distribution in green shows the outcomes where a gene is classified as a cancer gene. The dividing line represents the selected threshold at which the binary decision creates a desirable balance between true negatives and true positives. Figure 4 shows how well our model discriminates between the prediction classes (cancer gene or non-cancer gene) and shows the selected score (threshold) used to make a binary (true/false) prediction for a gene to be classified as a candidate cancer gene. Every prediction to the left of the dividing line is classified as non-cancer associated and every prediction to the right is classified as cancer associated.
The prediction distribution graph showing how well the model discriminates between cancer and non-cancer genes.
The prediction distribution graph can be interpreted as follows: purple to the left of the threshold line represents instances where genes were correctly classified as non-cancer (true negatives); green to the left of the threshold line represents instances incorrectly classified as non-cancer (false negatives); purple to the right of the threshold line represents instances incorrectly classified as cancer genes (false positives); and green to the right of the threshold line represents instances correctly classified as cancer genes (true positives). The graph again confirms that the model was able to accurately distinguish cancer from non-cancer genes.
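A sketch of how such a prediction distribution plot can be produced from validation-set predictions (file name, column names and the threshold value are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("validation_predictions.csv")   # hypothetical columns: actual, predicted
threshold = 0.5                                  # illustrative threshold

# Overlay the score distributions of the two actual classes; scores left of the line
# are called non-cancer, scores right of the line are called cancer.
df.loc[df["actual"] == 0, "predicted"].plot.hist(bins=50, alpha=0.5, label="non-cancer genes")
df.loc[df["actual"] == 1, "predicted"].plot.hist(bins=50, alpha=0.5, label="cancer genes")
plt.axvline(threshold, linestyle="--")
plt.xlabel("Predicted score")
plt.legend()
plt.show()
```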
Using the receiver operating characteristic (ROC) curve produced for our model (Fig. 5), we were able to evaluate the accuracy of prediction. The AUC (area under the curve) is a metric for binary classification that considers all possible thresholds and summarises performance in a single value; the larger the area under the curve, the more accurate the model. An AUC of 0.5 shows that predictions based on the model are no better than a random guess, while an AUC of 1.0 shows that they are perfect (this is highly uncommon and usually indicates a flaw, such as features being used in training that should not be known in advance and thus reveal the outcome). As the area under the curve is 0.86, we conclude that the model is accurate. The circle intersecting the ROC curve represents the threshold chosen for classification of genes; it is used to transform the probability scores assigned to each gene into binary classification decisions, where each gene is classified as a potential cancer gene or not.
The receiver operator characteristic (ROC) curve indicating model performance.
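A sketch of how the ROC curve and AUC can be computed from the same validation-set predictions (file and column names are again hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

df = pd.read_csv("validation_predictions.csv")   # hypothetical columns: actual, predicted

fpr, tpr, thresholds = roc_curve(df["actual"], df["predicted"])
auc = roc_auc_score(df["actual"], df["predicted"])

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```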
Feature impact measures how much worse a model's error score would be if the model made predictions after randomly shuffling the values of a single input feature (while leaving the other values unchanged), and thus shows how useful each feature is for the prediction. The scores were normalised so that the most important feature is assigned 100% and the other features are scaled relative to it. This helps identify the properties that are particularly important for predicting cancer genes and furthers our understanding of the biological aspects that might underlie the propensity of a gene to be a cancer gene.
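This shuffling procedure corresponds to permutation importance; a minimal sketch, reusing the hypothetical fitted model and validation split from the training sketch above and normalising the scores to the most important feature:

```python
from sklearn.inspection import permutation_importance

# Reuses the hypothetical fitted `model` and validation split (X_valid, y_valid)
# from the earlier training sketch. Scoring uses LogLoss, so a feature's importance
# is how much the loss worsens when that feature's values are shuffled.
result = permutation_importance(model, X_valid, y_valid,
                                scoring="neg_log_loss", n_repeats=10, random_state=0)

# Normalise so that the most impactful feature scores 100%.
impact = 100 * result.importances_mean / result.importances_mean.max()
for name, score in sorted(zip(X_valid.columns, impact), key=lambda item: -item[1]):
    print(f"{name}: {score:.1f}%")
```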
Closeness and degree are ranked as the properties with the highest feature impact (Fig. 6). Both are protein–protein interaction network properties, indicating a central role of the protein product within the network, and both correlate with the likelihood of cancer association. Other important properties, such as the Phi essentiality score (probability of haploinsufficiency compared to a baseline neutral expectation) and Tajima's D regulatory (a measure of intra-species genetic variation and of the proportion of rare variants), indicate that increased essentiality accompanied by the occurrence of rare variants increases the likelihood of pathological impact and of the gene being linked to cancer initiation or progression. We also note that a greater gene or transcript length increases the chance of a somatic mutation occurring within that gene, thus increasing the likelihood of it being a cancer gene.
The top properties ranked by their relative importance used to make the predictions by the model.
To confirm that the selected model's performance is optimal for the input data used, we created a new blended model combining the second- and third-best modelling approaches tested within our project and compared its performance metric (AUC) with that of our selected model. We found that the improvement was small (0.008) despite the added complexity: the blended model achieved an AUC of 0.866 versus 0.858 for our single selected model.
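A sketch of such a blend, assuming the blended prediction is a simple average of the two models' probabilities (the platform's exact blending method is not specified here; file and column names are hypothetical):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical per-gene predicted probabilities from the two runner-up models,
# alongside the actual labels; the blend here is a simple average of probabilities.
df = pd.read_csv("model_predictions.csv")   # columns: actual, model_2, model_3
blended = (df["model_2"] + df["model_3"]) / 2

print("blended AUC:", roc_auc_score(df["actual"], blended))
```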
We also retrained our model on a dataset that excludes general gene properties and found that the reduction in model performance, while evident, was very small: the model trained on this dataset achieved an AUC of 0.835 and a sensitivity of 55% at a specificity of 89%. This small reduction in predictive performance indicates that essentiality and protein–protein interaction network properties are the most important features for predicting cancer genes, and that the information carried by general gene properties can for the most part be represented by these properties. This can be rationalised, as longer genes (median transcript length = 3737) tend to have the highest number of protein–protein interactions25.
According to a recent comprehensive review of cancer driver gene prediction models, the best-performing machine learning model currently is driverMAPS with an AUC of 0.94, followed by HotNet2 with an AUC of 0.814. When comparing our model with the 12 other reviewed cancer driver gene prediction models by AUC, our model would come second with an AUC of 0.86. Our predictive model achieved a better AUC than the best model that used a similar network-based approach (HotNet2, AUC = 0.81) and better than the best function-based prediction model (MutPanning, AUC = 0.62). The strong performance of our model indicates the importance of combining different and distinctive gene properties when building prediction models, while avoiding reliance on the frequency approach, which can mask important driver genes detected in fewer samples. Despite the apparent success and high AUC score of our model, this result should be treated with some caution. The AUC value is based on the ROC curve, which is constructed by varying the threshold and plotting the resulting sensitivities against the corresponding false positive rates. Several statistical methods are available to compare two AUC results and determine whether the difference is significant26,27,28; these methods require the ranking of the variables in their calculations (e.g., to calculate the variance or covariance of the AUC). The rankings of predicted cancer-associated genes were not available for all of the other 12 cancer driver gene prediction methods, so we were unable to determine whether the differences between the AUC score of our method and the AUC scores of these methods are significant.
The driverMAPS (Model-based Analysis of Positive Selection) method, the only method with a higher AUC than our model, identifies candidate cancer genes on the assumption that these genes exhibit elevated mutation rates in functionally important sites29. Thus, driverMAPS combines frequency- and function-based principles. Unlike our model, which uses selected cohorts of gene properties, the parameters used in driverMAPS are mainly derived and estimated from factors influencing positive selection on somatic mutations. However, there are a few features in common between the two models, such as dN/dS.
Although driverMAPS had the best overall performance, network-based methods (like ours) showed much higher sensitivity than driverMAPS, potentially making them more suited to distinguishing cancer driver from non-driver genes. The driverMAPS paper29 provides a list of novel driver genes, and we found that 35% of these novel candidate genes were also predicted by our model. Differences in the genes identified as cancer-related by the two approaches could be attributed to the different nature of the features used by the two models. There is evidence30 pointing to genes with low mutation rates but important roles in driving the initiation and progression of tumours, and genes with high mutation rates have also been shown to be less vital than expected in driving tumour initiation31. This variability in the correlation between mutation rate and identified driver genes might explain why some genes identified as cancer-related by driverMAPS are not identified by our model. Our model uses properties that are available for most protein coding genes, whereas driverMAPS applies to genes already identified in tumour samples and predicts their likelihood of being driver cancer genes; thus, the candidate list of genes provided by driverMAPS is substantially smaller than our list. Using an ensemble method that evaluates both the driverMAPS score and our model's score for each gene may produce a more reliable outcome, although this would require further validation.
Enriching the model's training dataset with additional properties that correlate with oncogenes could enhance its predictive ability and further elevate the accuracy of the model. One potential feature is whether a gene is an ohnolog.
Paralogs retained from the whole genome duplication (WGD) events that occurred in all vertebrates some 500 Myr ago are called ohnologs, after Susumu Ohno32. Ohnologs have been shown to be prone to dominant deleterious mutations and are frequently implicated in cancer and genetic diseases32. We investigated the enrichment of ohnologs within cancer-associated genes. Ohnolog genes can be divided into three sets: strict, intermediate and relaxed, constructed using statistical confidence criteria32. We found that 44% of the total number of cancer-associated genes (as reported in the COSMIC census) belong to an ohnolog family (using the strict and intermediate thresholds). Considering that 20% of all known human genes are ohnologs (strict and intermediate) and that cancer-associated genes comprise less than 4% of all human genes, the enrichment of ohnologs among cancer-related genes is about two times higher than expected. If only ohnologs passing the strict threshold are considered, the fraction of cancer-related genes that are ohnologs is still high at 34%.
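The fold enrichment behind these figures follows from the reported fractions (illustrative arithmetic only, not the exact gene counts):

```python
# Fractions reported in the text (strict + intermediate ohnolog thresholds).
frac_cancer_genes_ohnolog = 0.44   # share of COSMIC census genes that are ohnologs
frac_all_genes_ohnolog = 0.20      # share of all human genes that are ohnologs

fold_enrichment = frac_cancer_genes_ohnolog / frac_all_genes_ohnolog
print(f"fold enrichment: {fold_enrichment:.1f}x")   # ~2.2x, i.e. about twice expected
```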
When performing pathway analysis (carried out using PANTHER gene ontology release 17.0), we found that cancer-associated ohnologs show statistically significant enrichment (>tenfold) in many pathways, particularly within signalling pathways known to be cancer associated such as Jak/STAT, RAS and P53 (Supplementary Information Table 4). On the other hand, ohnologs that are not cancer associated are present in fewer signalling pathways and at lower levels of enrichment.