Unsupervised ensemble-based phenotyping enhances discoverability of genes related to left-ventricular morphology – Nature.com

In the following, we present our GWAS results. First, we investigate handcrafted phenotypes. Second, we examine unsupervised phenotypes obtained via shape PCA. Finally, we examine the results of our proposed UPE approach.

The loci were annotated with gene names on the basis of proximity to the lead single nucleotide polymorphism (SNP) if there was no additional causal evidence in the literature, or with nearby genes likely to mediate the association. For this, we used a diverse array of tools: the functional mapping and annotation (FUMA) web tool14, g:Profiler15, S-PrediXcan16 and the Ensembl Biomart database17. Among the candidate genes provided by these tools, a literature review was conducted to find evidence of an association with cardiovascular phenotypes, or experimental. Genes with asterisks were annotated solely on the basis of proximity and hence constitute totally novel findings.

We performed GWAS on traditional cardiac indices obtained using our segmentation approach. These indices were LVEDV, LV sphericity index at end diastole (LVEDSph), LV myocardial mass (LVM) and LV mass-to-volume ratio (LVMVR=LVM/LVEDV). Note that the LVEDSph as calculated here has not been investigated in previous GWAS (although a related phenotype, named LV internal dimensions was studied in an early GWAS of echocardiography-derived LV traits18). Details on how to compute this phenotype can be found in the Supplementary Information.

In the following, we discuss the associations found for each of these phenotypes. The Manhattan plots are shown in Extended Data Figs. 14.

For LVEDV, we discover nine independent associations. The association at intergenic SNP rs11153730 is probably related to PLN. This gene plays a crucial role in cardiomyocyte calcium handling by acting as a primary regulator of the SERCA protein (sarco- or endoplasmic reticulum Ca2+-ATPase), which transports calcium from the cytosol into the SR1 (ref. 19). Mutations in PLN have a well-established relationship with dilated cardiomyopathy (DCM)20. In ref. 4, PLN was found to be associated with LVEDV and LVESV. However, ref. 2 does not report this locus for the same phenotypes. The locus on chromosome 2 (with lead SNP rs2042995) is widely known to be associated with TTN. This gene encodes the protein titin, which is responsible for assembling myocyte sarcomere, and determines the stretching, contraction and passive stiffness of the myocardium21. This gene has been reported by refs. 2,4,11. rs375034445 lies within the body of BAG3; this is a well-known cardiac gene coding for a cellular protein that is predominantly expressed in skeletal and cardiac muscle, which plays a role in myocyte homeostasis and in the development of heart failure22; also, it shows a stronger association with LVESV and LV ejection fraction (LVEF), as found in previous studies2,4. The locus near the ATXN2 gene has previously been reported for LVEDV and stroke volume (SV)4. A candidate casual gene for this association is gene MYL2, the lead SNP (rs35350651) lies 558808 base pairs away from this genes transcription start site (TSS)23. The gene TMEM43 has been found in ref. 4 in association with LVESV and LVEF. Finally, gene MYH6 harbours SNP rs365990. This gene provides instructions for making a protein known as the cardiac -myosin heavy chain, which is expressed throughout the myocardium during early cardiac development24. Mutations in this gene, as well as the neighbouring MYH7 responsible for the -myosin heavy chain, have been linked to several pathologies: cardiomyopathies, arrhythmias and congenital heart disease (CHD). Two additional associations are located close to genes RRAS2 and ATG4D, respectively.

For LVEDSph, we find nine additional independent associations, apart from the PLN locus. rs35564079 is located 8,250bp upstream of the TSS of NKX2-5, in chromosome 5. This gene plays a crucial role in heart development; in particular, in the formation of the heart tube, which is a structure that will eventually give rise to the heart and great vessels. NKX2-5 helps determine the hearts position in the chest and also develops the heart valves and septa. Mutations in the NKX2-5 gene have been associated with several types of congenital heart defect, including atrial septal defects and atrioventricular block25. It has not been reported in refs. 2 or 4, but shows borderline significance with the fractal dimension of the LV trabeculae11. rs72007904 is located 300kb upstream of the TSS of the gene ABRA. ABRA codes for a cardiac and skeletal muscle-specific actin-binding protein located in the Z disc and M-line and binds with actin. Consistent with this, it is differentially expressed in cardiac tissues and skeletal muscle in the genotype-tissue expression (GTEx) data. ABRA has been associated with DCM in mice26. rs35001652 is close to KDM1A, a gene that codes for a histone demethylase involved in cardiac development, according to studies in mice27. rs463106 lies in the body of gene PRDM6. The mouse homologue of this gene, Prdm6, has been found to be important in early cardiac development28. An interesting association, with SNP rs162746, is close to gene EN1, however, we were not able to find a strong candidate gene in this region. Finally, rs573709385 lies in a gene desert in chromosome 2, the closest protein-coding genes are ACVR2A and ZEB2 (both at around 1.6Mb).

For LVM, four associations are found: rs4767239 is probably related to developmental gene TBX5 (T-box transcription factor 5), which has a known role in developing the heart and the limbs29. Through familial studies, mutations in this gene have been associated with Holt-Oram syndrome, a developmental disorder affecting the heart and upper limbs. In particular, there have been no recent reports on GWAS on LV phenotypes. The locus near the CENPW gene has a cardiac gene, HEY2, possibly causal for this association. HEY2 has been shown to suppress cardiac hypertrophy through an inhibitory interaction with GATA4, a transcription factor that plays a key role in cardiac development and hypertrophy30. HEY proteins are direct targets of Notch signalling and have been shown to regulate multiple key steps in cardiovascular development. Studies have found that the loss of HEY2 in mice leads to cardiac defects with high postnatal lethality31. This locus has also been reported as associated to right-ventricular phenotypes32. rs3740293 overlaps gene SYNPO2L, which is highly expressed in cardiac tissues (LV and atrial appendage) and skeletal muscle, making it a strong candidate gene. This SNP is also close to gene MYOZ1, which is also supported by our GWAS study (section on transcriptome-wide association studies, TWAS). Both genes have been previously proposed as candidates for cardiac phenotypes, in particular atrial fibrillation33,34. However, MYOZ1 shows very high expression only in the latter. Loss-of-function variants in this SYNPO2L have also been found causative of atrial fibrillation35, supporting this gene as a more likely candidate. rs73243622 is close to the candidate gene PPARGC1A. Finally, gene CDKN1A has been found in ref. 4 in association with LVESV and LVEF. Finally, for LVMVR, three new loci were found, apart from the PLN locus: rs2070458 close to SMARCB1 (in chromosome 22), rs17460016 in the FNDC3B locus (in chromosome 3) and rs12542527 (in chromosome 8). The last is an eQTL for the MTSS1 gene also linked to LV fractal dimension11.

The detailed summary statistics for the significant associations with handcrafted phenotypes are provided as Supplementary Data.

A shape PCA model was fit to our set of meshes (Methods). The effect on LV shape for the first 16 modes is shown in the Supplementary Fig. 7. GWAS was performed for these 16 modes and 18 independent loci were found with study-wide significance (P<3.1109). PC1, which is highly correlated with LVEDV, reconfirms the associations with TTN, MYL2 and MYH6. A new association, in chromosome 4, is an indel (chr4:120304290_GC_G) located 200kb downstream of MYOZ2. This gene codes for protein that functions by tethering calcineurin to alpha-actinin at Z-discs in muscle cells and inhibits the pathological cardiac hypertrophic response36. Another candidate gene in this locus is PDE5A. Indeed, some of the strongest associations overlap the body of this gene (although not the lead variant, which is the indel mentioned above). It has been shown that PDE5A is expressed in cardiac myocytes and may have pro-hypertrophic effects37.

PC2 is strongly linked with a new locus in chromosome 17, GOSR2. This component seems to be linked to LV conicity. Ref. 11 reports the GOSR2 locus as significantly associated with trabecular fractal dimension in slices 3 and 4, however, previous GWAS in global LV indices have not reported this locus. More broadly in the literature on genetics of cardiovascular phenotypes, it has been reported as associated to ascending aorta distensibility38, mitral valve geometry39 and CHD40.

PC3, highly correlated with LVEDSph, re-discovers the PLN and NKX2-5 loci. It also adds an association in chromosome 1, the SNP rs12142143, which lies within the ACTN2 gene. This gene codes for the Z disc protein -actinin-2. This locus has been reported for SV in ref. 4.

PC6 has hits in the TBX5 and NKX2-5 loci, with a new association near the NAV3 gene, that has been found to play a role in heart development in zebrafish41. PC7 is associated to a SNP near the TSS of PITX2 gene. It encodes for a transcription factor required for mammalian development, and disruption in its expression in humans causes CHD and is associated with atrial fibrillation. PC10 is linked to the PRDM6 locus (discussed before in connection with LVEDSph). PC11 is associated to SNPs rs59894072 (close to TBX3, a known cardiac gene42) and rs56229089. The second, in turn, is close (1Mb) to two possible candidate genes: KCNJ2, a potassium channel gene that is active in skeletal muscles and cardiac muscles43 and SOX9, a gene implicated in cardiac development44. The detailed summary statistics for the significant associations with shape PCs are provided in Supplementary Data.

CoMAs were trained on LV meshes at end diastole, using a range of network hyperparameters. The reconstruction performance for these models is shown in Supplementary Fig. 1.

GWAS was performed on all latent variables, for all training runs achieving a good reconstruction performance (Methods). A run is an instance of model training, defined by the choice of hyperparameters: in particular, random seeds controlling training and validation samples, weight initialization, network architecture and KullbackLeibler divergence weight. The number of such runs was R=36. The results obtained with nz=8 and nz=16 (8 and 16 latent variables, respectively) are reported, with a total number of 384 latent variables in the pooled representation. First, we examine the prevalence of significant GWAS loci found in all runs of our ensemble. To count the loci, we split the genome into approximately linkage disequilibrium-independent genomic regions45 and computed the number of loci below the usual genome-wide significance threshold of 5108 (see details in the Methods section); Table 1 shows the results.

We found 49 independent associations with study-wide significance. All of the previously discussed findings are recovered by UPE with study-wide significance, except the following loci: MTSS1, TBX3, PPARGC1A and FNDC3B (the last two show with suggestive significance in UPE). The summary statistics of the GWAS for the best latent variable of each of these 49 loci are displayed in Table 1. When a gene name is displayed in bold letters, it means that this locus was found only via the ensemble approach. Most loci have previous evidence supporting their plausible role in cardiac pathways. In addition, many of them are totally novel and represent interesting avenues for further research.

In what follows, we perform an in-depth analysis of our novel genetic findings in the light of recent literature.

We now describe loci that have not been linked to structural LV phenotypes in recent GWAS, but count with other types of evidence.

rs11706187 is probably linked to developmental gene SHOX2. The mouse homologue of SHOX2, Shox2, is essential to differentiate cardiac pacemaker cells by repressing Nkx2-5 (ref. 46). Whereas both TBX5 and NKX2-5 are highly expressed in adult cardiac tissues according to GTEx data, SHOX2 is not highly expressed in these tissues. A possible hypothesis is that rs11706187 regulates the expression of SHOX2 in developmental or pre-adult stages.

A particularly interesting association, with the SNP rs2245109, is located within the body of the STRN gene on chromosome 2 and is probably causally related to it: this gene encodes the protein striatin, which is expressed in cardiomyocytes and has been shown to interact with other proteins involved in the mechanism of myocardial function47. Mutations in this gene have been shown to lead to DCM in dogs48. In humans, there has been a recent GWAS on heart failure that reported this locus, but our study links it with cardiac morphology. Moreover, our estimated effect size is substantially higher; suggesting that this latent variable is an endophenotype closer to the underlying biology. This could provide insight to unravel the aetiology of a heterogeneous condition such as heart failure. The lead SNP has a high minor allele frequency (MAF) of 47.4%. This locus also contains eQTLs for this gene, as evidenced by TWAS (section TWAS). Something similar occurs with the RNF11 locus, although this does not reach genome-wide significance for heart failure (P=3.2106). The lead variant for this locus is an indel with low frequency (MAF 1.4%) and large estimated standardized effect size ((hat{beta }=)0.138). This locus has also been linked to the QRS (a combination of the Q, R and S waves) interval, although the causative gene is not clear49, some candidates being RNF11 itself, CDKN2C, C1orf185 and FAF1.

The SRL gene, which encodes the sarcalumenin protein, harbours the SNP rs889807. Sarcalumenin is a protein that binds Ca2+ located in the longitudinal sarcoplasmic reticulum of the heart. Its main function is to regulate Ca2+ reuptake in the sarcoplasmic reticulum by interacting with the cardiac sarco (endo)plasmic reticulum Ca2+-ATPase 2a (SERCA2a). According to GTEx data, this gene is highly expressed in adult cardiac tissue (both in the LV and atrial appendage tissues) and skeletal muscle.

Several associations lie near genes of the ADATMS (a disintegrin and metalloproteinase with thrombospondin motifs) family50: ADAMTS1 and ADAMTS5 (near rs2830977 on chromosome 21, with P=1.41010), ADAMTS6 (rs753963943 on chromosome 5, P=5.61011) and ADAMTS18 (chromosome 16, P=5.21013).

An association lies 260kb upstream of GATA6, a transcription factor that plays a critical role in the development of the heart. It has been found to regulate the hypertrophic response51. Sequence variants in this gene have been discovered to predispose for CHD phenotypes52,53.

rs12889267 lies 3,700kb upstream of the TSS of NDRG2. This gene has been demonstrated to play a role in protection against ischaemia and/or reperfusion injury, in a study in rats54.

One SNP overlaps KDM2A. As KDM1A, it is a histone demethylase gene. Although its link to the heart is less clear, there exists evidence from knockout studies in mice that supports its importance in embrionic development, including heart development55.

rs206524 is located within a gene for long non-coding RNA, LINC01254. A possible candidate protein-coding gene is NDUVF2, located 1.3Mb upstream of the SNP. According to the GTEx dataset, NDUFV2 is highly expressed in cardiac and skeletal muscle tissue.

rs12046416 is located 8,268bp upstream of the TSS of GJA5, a gene that is expressed in atrial myocytes and mediates the coordinated electrical activation of the atria56.

In addition to the loci with previous evidence discussed above, we report a number of novel genetic loci with P

In addition to genetic loci with P

A cluster of associations in chromosome 1 is located in a region that includes the S100 family of genes. In particular, the lead SNP in this region, rs985242, is located within the genes S100A1 and S100A13. The S100 is a family of low-weight Ca2+-binding EF-hand proteins, with 25 human genes identified.

The SNP rs28681517 lies within gene ADAMTSL3, whose associated protein has been shown to play a crucial role in maintaining cardiac structure and function in mice58.

SNP rs569550 lies 578,846 base pairs away from KCNQ1, which belongs to a large family of genes that provide instructions for making potassium channels. KCNQ1 encodes the alpha subunit of the potassium channel KvLQT1. Mutations in KCNQ1 are responsible for the long QT syndrome59.

Deletion 15:48690566_TC_T is a relatively common variant (MAF 14.4%), and is located 10kb downstream of the transcription end site of FBN1. Mutations in this gene are associated with Marfan syndrome, a genetic disorder that affects connective tissues in the body. It can have various manifestations, including cardiovascular complications.

rs9814240 is a coding variant in the LMCD1 gene. Mutations in this gene are causative of hypertrophic cardiomyopathy in mice60, however, no association had been found between variants in this gene and human cardiac phenotypes. Moreover, this gene has been found to interact with (the homologous of) GATA6 in mice61. GATA6 is located near one of the loci discovered with study-wide significance.

The effect of these loci on the LV morphology was evaluated by selecting the single phenotype with the strongest P value for the associated locus. To help characterize these latent variables, the Spearman correlation coefficient between the latter and the handcrafted LV indices were calculated and shown in Supplementary Table 4. We also examine the shapes of the average mesh within different ranges of quantiles for this latent variable, from 0 through 1. This is shown in Fig. 2, along with the associated Manhattan plots, for the loci PLN, TTN and STRN. The direction of effect is shown by indicating with arrows which allele favours which shape. We observe a very distinct effect on the morphology of each of these SNPs. While the PLN variant influences a latent variable that has a a smaller effect on LVEDV (Spearman r=0.722) and a strong link to LVEDSph (r=0.532), the best latent variable for TTN gene shows a greater correlation with LVEDV (r=0.910). Consistent with this, the GWAS on LVEDSph shows no significant signal for TTN, but a strong one for PLN (P=1020, Extended Data Fig. 2), which is also in line with a previous finding of ours10. Furthermore, these findings are in line with the effects of PC1 and PC3, where TTN and PLN loci are found, respectively.

ac, Manhattan plots for LV latent variables with best association for SNPs at the PLN (a), TTN (b) and STRN (c) loci. On top are shown the average meshes corresponding to the following range of quantiles, for each latent variable (from left to right): [0, 0.01], [0.095, 0.105], [0.495, 0.505], [0.895, 0.905] and [0.99, 1].

The SNP in the STRN gene is associated with a subtle phenotype that controls mitral orientation without a concomitant change in LV size (Fig. 2). This is consistent with the fact that it was not discovered in previous studies of structural LV phenotypes. Notably, this effect is consistent with the observed effect of PC4, for which this locus reaches genome-wide significance (see Supplementary Fig. 2 for the effect of PC4).

We performed TWAS using the S-PrediXcan tool16, to test the possibility of a mediating effect of gene expression and intron excision events on structural phenotypes. This tool is fed with models that impute gene expression and intron excision data on the basis of the genotype, which in turn were trained using data from the GTEx project, v.8 (ref. 62).

Our focus was on cardiovascular tissues, specifically the LV, atrial appendage and coronary, aortic and tibial arteries. To maintain statistical rigour, we applied a significance threshold of PGEx=2.2109, which adjusts for multiple comparisons (324 phenotypes and 68,919 tissuegene pairs). Similarly, for alternative splicing, the threshold was set at PAS=8.21010, considering the same multiple testing correction (187,535 being the number of intron-tissue pairs tested).

In the cardiac tissues (LV and atrial appendage), we identified genes located within loci of previously reported genes. In the LV, these included NKX2-5, STRN, SYNPO2L (FUT11, SEC24C and SYNPO2L itself), PLN, HEY2 (CENPW gene), TTN (FKBP7 gene), CENPV, GOSR2 (MAPT and GOSR2 itself) and FDPS (SCAMP3, ARHGEF2, RIT1, GOSR2, MAPT, HCN3, GBA, MSTO1, RUSC1, FUT11, SYT11, ADAM15 and FDPS itself). For the atrial appendage, the genes included PLN, STRN, NKX2-5, SYNPO2L and MYOZ1 within the SYNPO2L locus, as well as FKBP7 and SCAMP3. Many of these genes had been previously implicated on the basis of independent knowledge, bolstering the evidence for their potential causal roles. Notably, our analysis also revealed the direction of the effect on gene expression: higher PLN expression was associated with a more spherical LV morphology, while lower NKX2-5 expression was linked to the same phenotype (refer to Fig. 2b). Furthermore, an elevated STRN expression (in both cardiac tissues) was associated with a more horizontal mitral orientation (Fig. 2c). Detailed results for significant gene expression associations are provided as Supplementary Data.

In the case of arterial tissues, we found significant associations within various loci, such as the SYNPO2L locus (with the genes AGAP5, FUT11, SEC24C and ARHGAP27), FDPS (ARHGEF2, CLK2, FAM189B, GBA, GON4L, HCN3, NPR1 and SYT11), CENPW, TTN (PRKRA and FKBP7 genes), PLN (CEP85L and PLN), GOSR2 (WNT3, CRHR1, LRRC37A and MAPT), KDM2A, LINC01562, MYH6 (MYH6 and MYH7), RP11-383I23.2, RP11-574K11.29, SCAMP3, MYL2 (SH2B3 gene), SOST and TCF21.

Detailed results for intron excision events are provided in Supplementary Data.

We use the tool g:Profiler to find pathways for which our sets of genes were enriched. To define the gene sets, we selected a region of 100kb around each lead variant and chose the genes whose TSS was located within that window. Gene ontology terms belong to one of three different categories: molecular functions, cellular components and biological processes. Within the cellular component category, we have found a relevant enriched term, Sarcomere, comprising the following nine genes from our query: ACTN2, MYOZ1, SYNPO2L, BAG3, TNNT3, TNNI2, MYH6, MYH7, KY (P=9.2103). Within the biological process category, the terms Myofibril assembly, striated muscle cell development and sarcomere organization result enriched (P=1.2103, P=1.4103 and P=1.5103, respectively). Within the molecular function category, the term calcium-dependent protein binding is enriched (P=2.9108), although it is composed of nine members of the S100A family (which encompass a single locus), apart from SYT8 and TNNT3.

To detect pleiotropic effects, we performed a phenome-wide association study of the lead SNPs from Table 1. For this, we queried the Integrative Epidemiology Unit OpenGWAS Projects database. The results are included in the Supplementary Data File. We discuss briefly here some associations with cardiovascular phenotypes. A number of loci were associated to cardiac electrical phenotypes: CDKN1A, NDRG2, PLN, TBX5 and MYH6. The following loci were associated to pulse rate: SYNPO2L, NDRG2, MYH6, SRL, GOSR2, GATA6, ACTN2, KIAA1755, TMEM43, SLC27A6 and FNDC3B. The lead SNP at the PRDM6 locus was associated to heart rate recovery post exercise. The following loci were associated to blood pressure phenotypes (diastolic, systolic or hypertension): SYNPO2L, KCNQ1, MYL2, NDRG2, MYH6, SRL, GOSR2, GATA6, HSPB7, RNF11, EFEMP1, FNDC3B, NME9, PRDM6 and PLN. Finally, SYNPO2L, TBX5, MYH6, GOSR2, PITX2 and CDKN1A were associated to cardiac arrhytmias.

We set apart a subset of 5,470 UKBB participants of British ancestry for which the whole pipeline was run identically to the individuals from the discovery set. We report the detailed results in the Supplementary Material, including the estimated statistical power for each SNP on the basis of the effect size estimate (hat{beta }) from the discovery phase. Among the 49 study-wide significant loci, we report 28 that replicate with P<0.05 (whereas seven replicate with the more stringent Bonferroni threshold of P<0.05/49), as well as 47 loci for which the estimated direction of effect is consistent with that found in the discovery phase. For the suggestive associations, 11 loci replicated (out of 25) with the threshold of P<0.05, whereas 22 have a concordant direction of effect between the discovery and replication phases.

For comparison, we collected the GWAS summary statistics from previous studies on LV phenotypes, derived also from UKBB CMR images, namely refs. 2,4 and 11. We also include the results for LVESV, SV and LVEF from these studies. However, note that the unsupervised features studied in this work are static and were extracted using only the end-diastolic phase.

The comparison can be seen in Fig. 3. For each locus in Table 1 (which all pass the Bonferroni threshold), this figure displays the association P value found in previous GWAS and on our own GWAS of handcrafted phenotypes. Shades of red represent non-genome-wide significant associations, whereas shades of blue represent genome-wide significant ones and white corresponds to the PGW threshold. The second column represents the best P value across all traditional phenotypes for the loci given in the columns. Therefore, a shade of red in this column means that the locus is novel in the context of LV structural phenotypes.

The leftmost column corresponds to the best association found for that locus across the ensemble of phenotypes, whereas the second column corresponds to the best P value for that locus across the previous GWAS, where the P values are two-sided and derived from a linear association t-statistic (no adjustments were made for multiple comparisons). The white colour corresponds to the genome-wide significance threshold of 5108, whereas the shades of red and blue correspond to weaker and stronger associations, respectively. SV denotes stroke volume. LVEDVi, LVESVi and SVi denote the indexed versions of the phenotypes, that is, the phenotype divided by the participants body surface area. Finally, LVFD sn stands for LV trabecular fractal dimension measured at the nth slice of the LV longitudinal axis (for details, refer to the original publication in ref. 11). Grey squares (NA, not applicable) mean that the genetic variant was not tested in the corresponding study.

Source data

Originally posted here:
Unsupervised ensemble-based phenotyping enhances discoverability of genes related to left-ventricular morphology - Nature.com

Related Posts

Comments are closed.