Enhanced performance of gene expression predictive models with … – Nature.com

Advances in machine learning have revolutionised many other fields. With increasing computational power and decreasing costs, modern deep learning networks allow scientists to tackle tasks that would otherwise be impossible to solve. Genomics is no exception [1,2]. The first attempts to predict expression solely from the DNA sequence started just after the Human Genome Project [3]; however, they had many limitations [4,5] and concentrated mainly on classical modelling approaches. Those limitations started to disappear with the spread of deep learning models. One of the first major studies to use CNNs [6] and XGBoost [7], ExPecto [1], started a new era in expression prediction. The field then continued with further CNN-based models, including Basenji2 [8], and finally with transformer-based models such as Enformer [2].

In our study, we decided to take a standard CNN-based approach and extend it by changing the input to include spatial genomic information. The ExPecto model that we chose to extend takes the 20 kbp surrounding the TSS of a given gene and uses the sequence of that region as input to a deep neural network that predicts epigenetic factors. From those factors, the tissue-specific gene expression profile is calculated with a high Spearman correlation score. We investigated whether epigenetic marks alone are sufficient for the complex task of predicting expression, and hypothesised that, while they are highly informative, there is still room for improvement. To test this, we examined the effect of the spatial chromatin architecture inside cell nuclei on expression by comparing models built with and without 3D information.
To do that, we modified the ExPecto algorithm so that it uses not only the 20 kbp region around the TSS but also regions that are linearly distal yet spatially close, owing to interactions mediated by specific proteins of interest. An overview of the proposed algorithm, SpEx (Spatial Gene Expression), is shown in Fig. 1.

Fig. 1. The architecture of SpEx. Spatial heatmaps are used to identify regions close to the TSS (excluding ±20 kbp from the TSS); the sequence of those regions is passed to the classic ExPecto deep learning model, which generates the epigenetic signal over them. The classical ExPecto features are merged with those obtained from the spatially close regions, and decision trees predict the expression levels. See Methods for more information about the algorithm.
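
The region selection and feature merging described above can be illustrated with a minimal sketch. It is not the authors' implementation (see Methods for that). It assumes that one row of a contact heatmap is available for the bin containing the TSS, and the names `spatially_close_regions`, `merge_features`, `top_k`, and `min_contacts` are hypothetical, as are the bin size and the choice to rank bins by contact frequency.

```python
# Flank around the TSS that is skipped when looking for spatially close
# regions (the 20 kbp exclusion mentioned in the caption).
EXCLUDED_FLANK = 20_000


def spatially_close_regions(contact_row, tss, bin_size, top_k=3, min_contacts=1):
    """Return (start, end) of the bins in strongest contact with the TSS bin.

    contact_row[i] is the contact frequency between the bin containing the
    TSS and bin i of the same chromosome (an assumed input format).
    Bins overlapping TSS +/- EXCLUDED_FLANK are skipped.
    top_k and min_contacts are hypothetical selection parameters.
    """
    candidates = []
    for index, contacts in enumerate(contact_row):
        start, end = index * bin_size, (index + 1) * bin_size
        overlaps_window = (start < tss + EXCLUDED_FLANK
                           and end > tss - EXCLUDED_FLANK)
        if overlaps_window or contacts < min_contacts:
            continue
        candidates.append((contacts, start, end))
    # Strongest contacts first; ties are broken by genomic position.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [(start, end) for _, start, end in candidates[:top_k]]


def merge_features(promoter_features, distal_features):
    """Concatenate TSS-window features with those from distal regions."""
    return list(promoter_features) + list(distal_features)


if __name__ == "__main__":
    # Toy heatmap row with 20 bins of 10 kbp and a TSS at 105 kbp.
    row = [0] * 20
    row[2], row[9], row[15], row[18] = 7, 50, 4, 9
    print(spatially_close_regions(row, tss=105_000, bin_size=10_000, top_k=2))
```

In this toy row, the strongest signal sits in the bin next to the TSS, which falls inside the excluded flank and is therefore ignored. Only linearly distal bins are returned, which is the point of adding spatial information on top of the local window.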

To validate the model, we designed an empirical study of how specific protein-mediated interactions help in predicting gene expression. We selected the three proteins most important for loop formation: cohesin, CTCF, and RNAPOL2. The effects of those proteins failing to bind or to be produced properly have been shown in multiple studies, which prompted us to ask whether machine learning models improve when given 3D information from interactions mediated by those proteins.

Cohesin is a protein complex discovered in 1997 [9,10] by two separate groups of scientists. The complex consists of SMC1, SMC3, RAD21, and SCC3. In human cell lines, however, SCC3 (present in yeast) is replaced by its paralogues SA1 [11], SA2 [12], and SA3 [13]. SA3 appears in cohesin only during meiosis [14], so we concentrate on SA1 and SA2, which form cohesin in somatic cells. The complex is essential for the proper functioning of the cell nucleus: it is fundamental for loop extrusion [15], it stabilises topologically associating domains (cohesin-SA1) [16], and it allows interactions between enhancers and promoters (cohesin-SA2) [16]. Depletion of cohesin in a nucleus removes all the domains [17] and completely destroys the spatial organisation of chromatin. Mutations of cohesin negatively affect gene expression, e.g. in Cornelia de Lange syndrome [18,19] and cancer [20], where the altered complex cannot sustain its proper function, leading to disease.

CTCF (CCCTC-binding factor) is an 11-zinc finger protein. Its primary function is the organisation of the 3D landscape of the genome [21]. This includes creating topologically associating domains (TADs) [22,23,24], loop extrusion [25], and alternative splicing [26]. The protein very often works with the cohesin complex described above, allowing loop formation. As a regulator of the genome, CTCF binds to specific binding motifs and regulates the surrounding loci. Mutations in those motifs can therefore cause it to bind improperly, contributing to disease development. Disease-associated mutations are not limited to the binding sites, however: mutations in the CTCF protein itself have been shown to significantly influence the development of multiple conditions. Examples of diseases induced by mutations in the CTCF protein include MSI-positive endometrial cancers [27], breast cancers [28,29], and head and neck cancer [30].

There are three common RNA polymerase complexes in eukaryotic organisms: I, II, and III [31]. In this study, we focus mainly on RNAPOL2, as it is responsible for transcribing DNA into messenger RNA [32,33] and thus has the most significant impact on gene expression. The mechanisms that create RNAPOL2 loops are complex and require not only the RNAPOL2 protein but also several other transcription factors [34,35]. Mutations in those transcription factors have been linked to various diseases [36], including acute myeloid leukaemia [37,38,39], von Hippel-Lindau disease [40,41], sporadic cerebellar hemangioblastomas [42], benign mesenchymal tumours [43], xeroderma pigmentosum, Cockayne syndrome, trichothiodystrophy [44], and Rubinstein-Taybi syndrome [45].

Multiple studies have described the spatial landscape created by cohesin-mediated chromatin loops. The first major cohesin ChIA-PET study, from 2014 [46], showed the internal organisation of chromatin in the chromosomes. For example, the study provided a list of enhancer-promoter interactions, which can serve as a starting point for gene expression studies.

The next study, from 2020 [47], extended the 2014 study and showed that, among 24 human cell types, 72% of those loops are the same, while the remaining 28% are correlated with gene expression in the different cell lines. Those loops mostly connect enhancers to promoters, thus regulating gene expression. Another interesting insight from this study is that the differing interaction profiles are effective in clustering the cell types by the tissue they were taken from.

CTCF, as mentioned above, is responsible for loop extrusion, which is why CTCF-mediated interactions are a popular subject of investigation. As with the cohesin complexes, ChIA-PET is used to obtain the interactions mediated by CTCF. One of the major studies, from 2015 [48], shows the genomic landscape across 4 cell lines. The authors discovered that SNPs occurring in the motif of a CTCF-binding site can alter the existence of the loop and thereby contribute to disease development. They assessed the SNPs residing in core CTCF motifs and found 70 of them. Of those, 32 were available from previous GWAS studies, and 8 were strongly associated (via linkage disequilibrium) with disease development.

Another study, from 2019 [49], analysed mutations in 1962 WGS datasets covering 21 different cancer types. This analysis, combined with CTCF ChIA-PET data, showed that disruption of insulators (which create the domains) by motif mutations and improper binding of CTCF (and the resulting loss of the loop) leads to cancer development. Using a computational approach, the authors found 21 potentially cancerous insulators.

Transcription-related chromatin interactions, such as those mediated by RNAPOL2, are of great interest as well, since they control transcription directly. The study from 2012 [50] described RNAPOL2-mediated ChIA-PET interactions in 5 different cell lines to show the transcriptional genomic landscape. Another study, from 2020 [51], performed the same experiments on the RWPE-1, LNCaP, VCaP, and DU145 cancer cell lines. Like the 2012 study, it showed the spatial interactions based on RNAPOL2, but this time in cancer cell lines. Furthermore, it showed that cohesin and CTCF interactions provide a stable structural framework for the RNAPOL2 interactions to regulate expression, making all of the proteins described in this section crucial for proper gene expression.

Those findings were the main motivation for our analysis: based on this evidence, cohesin, CTCF, and RNAPOL2 interactions should provide more information on gene expression and thus improve the metrics of machine learning models. In this work, we present an extension of the ExPecto [1] deep learning model that is enriched with spatial information and, as expected, improves the statistical metrics.

ExPecto [1] is a model introduced in 2018 for predicting gene expression from sequence. It uses a deep neural network, namely a convolutional neural network (CNN), composed most importantly of 6 convolutional layers and 2 max-pooling layers (the activation function for all layers is ReLU). For the exact architecture, see the original paper. As mentioned, the input to the network is the DNA sequence, and the output consists of 2002 epigenetic factors collected from ENCODE and Roadmap Epigenomics. The network takes a 2000 bp window and predicts the epigenomic signal of its middle 200 bp, using the remaining base pairs as context. The model is then applied to the 20,000 bp region surrounding the TSS with a step size given by the aforementioned 200 bp, yielding 2002 features for each of 200 bins (100 to the left and 100 to the right), so the total number of features describing a gene is 400,400. Those features are then transformed using exponential functions (10 upstream and 10 downstream of the TSS), so the final number of features is 40,040. Finally, XGBoost (namely, gradient boosting of linear regression models) is used to predict gene expression. The authors obtained a Spearman correlation score of 0.819, with testing done on chromosome 8.
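
The feature arithmetic above (2002 × 200 = 400,400 per-bin values reduced to 2002 × 20 = 40,040 features) can be made concrete with a short sketch. The text does not give the exact form of the exponential functions, so the decay rates in `DECAY_RATES` and the weighting `exp(-rate * |distance| / 200)` are assumptions chosen only for illustration. The sketch also assumes that the 100 bins on each side of the TSS are contiguous 200 bp bins, and all function names are hypothetical.

```python
import math

N_MARKS = 2002        # epigenetic features predicted per bin (from the text)
BIN_SIZE = 200        # bp predicted per window position (from the text)
BINS_PER_SIDE = 100   # 100 bins to the left and 100 to the right of the TSS
# Assumed decay rates: the text only states 10 functions per side.
DECAY_RATES = [0.01 * (i + 1) for i in range(10)]


def bin_distances():
    """Signed distance (bp) from the TSS to the centre of each bin."""
    return [(i - BINS_PER_SIDE + 0.5) * BIN_SIZE
            for i in range(2 * BINS_PER_SIDE)]


def build_weights():
    """One weight vector per (side, decay rate) pair: 2 * 10 = 20 in total."""
    distances = bin_distances()
    weights = []
    for side in (-1, 1):              # -1: upstream bins, +1: downstream bins
        for rate in DECAY_RATES:
            weights.append([
                math.exp(-rate * abs(d) / BIN_SIZE) if d * side > 0 else 0.0
                for d in distances
            ])
    return weights


WEIGHTS = build_weights()


def exponential_features(profile):
    """Reduce one mark's 200 per-bin values to 20 weighted sums."""
    return [sum(w * v for w, v in zip(weights, profile)) for weights in WEIGHTS]


def reduce_gene(per_bin_predictions):
    """Flatten the 20 weighted sums of every mark into one feature vector."""
    reduced = []
    for profile in per_bin_predictions:
        reduced.extend(exponential_features(profile))
    return reduced


if __name__ == "__main__":
    print(N_MARKS * 2 * BINS_PER_SIDE)   # per-bin values per gene
    print(N_MARKS * len(WEIGHTS))        # values per gene after the reduction
```

Weighting bins by their distance from the TSS keeps the contribution of nearby signal high while shrinking each mark's 200-bin profile to 20 numbers, which gives the gradient-boosted model a far smaller input.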
