Multivariate statistical approach and machine learning for the evaluation of biogeographical ancestry inference in the forensic field | Scientific…

PCA, PLS-DA and XGBoost models at inter-continental level

As proposed in several papers [28,69,70,71], PCA was first performed as a preliminary investigation of the available datasets for the four selected AIMs panels for BGA inference. As expected, for the first level of BGA inference (i.e., inter-continental BGA), several separate clusters corresponding to African, American, Asian, European, and Oceanian individuals were observed in the space of the first two PCs (Fig. 1). This result was clear for all the evaluated AIMs panels.

PCA Scores plots showing the PCA models obtained for the different evaluated AIMs panels.

After an initial PCA with the Asian continent treated as a whole, Asia was subdivided into its regions for two reasons: its breadth (the individuals in our dataset belonged to different regions of Asia), and the fact that predicting biogeographical origin within the Asian continent has been, and still is, extensively studied in the forensic field [25,72,73,74,75,76].

When Asia is split, as in our dataset, into Central, East, North, and South Asian populations, the PCA plot shows that African, East Asian, Oceanian, and (partially) North Asian and European individuals were better differentiated from the other tested individuals, while American, South Asian, Central Asian, and Middle East subjects overlapped in the PCA space. Better separation of the evaluated populations can be observed in Additional file 1: Fig. S1, which also involves a third principal component (for a total CEV% of 83%).
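
The CEV% figure quoted above can be illustrated with a minimal sketch, assuming CEV% denotes the cumulative explained variance of the retained components. The per-component variances below are hypothetical and are not taken from the study; the function name is also an illustrative choice.

```python
from itertools import accumulate


def cumulative_explained_variance(component_variances):
    """Return the cumulative explained variance (in %) after each component.

    `component_variances` must list the variance (eigenvalue) of every
    component, largest first, so that the total is the full data variance.
    """
    total = sum(component_variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return [100.0 * running / total for running in accumulate(component_variances)]


# Hypothetical variances for five components (not values from the paper).
variances = [5.0, 2.0, 1.0, 1.0, 1.0]
cev = cumulative_explained_variance(variances)

# CEV% after keeping the first three components.
cev_three_pcs = cev[2]
```

Reading `cev[2]` answers the question "how much of the total variance do the first three PCs capture together", which is the quantity a three-component plot is usually summarised by.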

All the evaluated AIMs panels show a similar degree of separation among individuals belonging to different continental areas. However, only the African individuals form a separate cluster in all the panels. This is presumably due to the complex history of humans in Africa, which includes demographic events that influenced patterns of genetic variation across the continent, and to the fact that modern humans first appeared in Africa roughly 250,000 to 350,000 years before present and subsequently migrated to other parts of the world [77].

As shown in Additional file 1: Fig. S1a,b, the African individuals generate an elongated cluster (dark yellow) that extends towards the gray one corresponding to the Middle East region. By evaluating the African individuals closest to the Middle East cluster, we observed that they belong to the populations of northern Africa. The Middle East cluster lies between the European and South Asian clusters and partly overlaps with them. The light blue cluster, which corresponds to the admixed and non-admixed American population, is projected toward the European cluster and partly overlaps with it, suggesting that admixed American individuals carry a substantial proportion of European ancestry [78].

As can be observed in Fig. 1 and in Additional file 1: Fig. S1, the distribution of the populations in the space of the PCs reflects their distribution across the globe: geographically distant populations are located far apart in the PCA plot, while geographically close populations, regardless of the continent they belong to, are close together.

Similar PCA plots were obtained by Glusman et al. [79] and Haber et al. [80] using a significantly greater number of SNPs (300,000 and 240,000, respectively) than those tested in any of the forensic panels. Therefore, as previously highlighted [28,69,70], despite the limited number of SNPs, the performance of each panel across populations was generally consistent, even if some genetic markers performed better than others.

However, although PCA allows us to assign an individual to his/her population of origin through a visual, intuitive, and easy-to-interpret approach, it does not provide significant divergence between populations, and it cannot be used alone in a forensic context because it does not provide an accurate statistical estimate of the weight of the evidence [69].

PLS-DA was then applied to the same experimental sets, building on the PCA modeling results, to develop more reliable discrimination models to classify the variables. As a result, for the first level of BGA inference (i.e., inter-continental BGA), African, American, East Asian, South Asian, Central Asian, North Asian, European, and Oceanian individuals were effectively separated using models involving two latent variables (LVs) (Fig. 2). This result held for all the evaluated panels.

PLS-DA Scores plots showing the models obtained for the different evaluated AIMs panels.

Even if the PCA and PLS-DA plots may seem similar, the obtained Receiver Operating Characteristic (ROC) curves, together with the values of sensitivity, specificity, and AUC, highlight the importance of a statistical tool to infer BGA. PLS-DA models for African, American, Asian, European, and Oceanian individuals provided optimal predictions, with CEV% values higher than 98% for all populations in all the panels investigated, except for Oceania in the EUROFORGEN (CEV% 88%), ForenSeq (CEV% 79%), MAPlex (CEV% 86%) and Thermo Fisher (CEV% 79%) panels, and America in the ForenSeq panel (CEV% 95%). The Oceania results might be affected by the small number of individuals with this ancestry in the dataset. All the developed models provided a CEV% of about 80% or higher, and all the tested AIMs panels gave reliable results, underlining the need to use a proper classification model, rather than PCA modeling, to infer BGA robustly.

In addition, the PLS-DA model showed that the MAPlex panel differentiates South Asian individuals from the others with a high degree of accuracy (AUC = 0.9828). As expected from the preliminary assessment of MAPlex [29], no other panel considered in this study was comparable with it in enhancing South Asian differentiation (Fig. 3). Outstanding discrimination was obtained for East Asian populations in all the panels considered, together with weaker discrimination for Central and North Asian populations. The latter is probably due to the limited number of Asian population samples in our dataset, the use of markers unsuited to discriminating these areas, and the fact that Asia has been a critical hub of human migration and population admixture [81,82,83].

ROC curves, sensitivity, specificity, and AUC values for the tested continental populations.

As shown in Fig. 3, some populations show poor sensitivity and specificity values. As an example, South Asian individuals have low values for the EUROFORGEN, ForenSeq and Thermo Fisher panels, while they are classified with promising results using the MAPlex panel. Similar behaviours are also observed for Middle East and Oceania individuals. These results reflect the fact that some panels, like MAPlex, were developed to investigate specific populations in depth (i.e., Asia-Pacific populations), so their classification may identify such individuals better [29]. On the other hand, some populations (such as Oceanian and Middle East subjects) were represented by fewer individuals than the other tested populations, so the classification performance is not optimal and might be improved by increasing the number of investigated subjects.
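
As an illustration of how per-population sensitivity and specificity can be derived from a multi-class classifier, the following minimal sketch treats one population as the positive class and all the others as negatives (a one-vs-rest reading). The labels and predictions are hypothetical toy data, not results from the study, and the function name is an illustrative choice.

```python
def one_vs_rest_rates(true_labels, predicted_labels, target):
    """Sensitivity and specificity for `target` against all other classes."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target:
            if pred == target:
                tp += 1  # target individual correctly recognised
            else:
                fn += 1  # target individual missed
        else:
            if pred == target:
                fp += 1  # other individual wrongly called target
            else:
                tn += 1  # other individual correctly rejected
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy example with a single Oceanian individual.
truth = ["African", "African", "European", "European", "Oceanian", "South Asian"]
preds = ["African", "African", "European", "South Asian", "European", "South Asian"]

european_rates = one_vs_rest_rates(truth, preds, "European")
oceanian_rates = one_vs_rest_rates(truth, preds, "Oceanian")
```

With only one Oceanian individual in the toy data, a single misclassification drives the sensitivity for that class to zero, which mirrors the point made above about populations with few available subjects.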

In accordance with Phillips et al. [29], our results indicated enhanced South Asian differentiation (AUC = 0.98) using the MAPlex panel compared to the other forensic panels (Fig. 4), but no increased differentiation between West Eurasian and East Asian populations was detected.

Comparison between AUC values of different populations obtained from the PLS-DA and XGBoost models at inter-continental level, with Asia divided into regions.

Afterward, the XGBoost algorithm was tested to compare its performance with that of PLS-DA, in order to evaluate another ML model aimed at obtaining optimal and feasible inference models for BGA prediction. The performance of the best XGBoost model obtained from the grid search approach is reported in Table 1 in terms of sensitivity, specificity, and AUC.
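
The grid search mentioned above can be sketched generically as an exhaustive loop over every combination of candidate hyperparameter values. This section does not list the grid that was actually searched, so the values below are hypothetical. The parameter names (`max_depth`, `learning_rate`, `n_estimators`) are common XGBoost hyperparameters, and `toy_score` is a placeholder standing in for a cross-validated performance measure; it does not train any model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid (assumption: not the grid used in the study).
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}


def toy_score(params):
    """Placeholder score; a real search would return e.g. a cross-validated AUC."""
    return (
        1.0
        - 0.05 * abs(params["max_depth"] - 5)
        - abs(params["learning_rate"] - 0.1)
        - (300 - params["n_estimators"]) / 1000
    )


best_params, best_score = grid_search(param_grid, toy_score)
```

In practice `score_fn` would fit the classifier on training folds and score it on held-out folds, so that the selected combination is not tuned to the data it is later evaluated on.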

As can be seen from the values reported in Table 1, the XGBoost model provides interesting results, although slightly lower than those of the PLS-DA models, especially when comparing the AUC values (Fig. 4).

As shown in Fig. 4, optimal AUC values (close to 1) were observed for African, American, East Asian, and European populations using the PLS-DA method, while lower results (around 0.8) were obtained for the Central Asia, Middle East, North Asia, Oceania, and South Asia areas (the latter with the exception of the MAPlex panel with a PLS-DA model). The best results were achieved with PLS-DA modeling, whose AUC values were substantially higher than those obtained by XGBoost. The worst predictions overall were those involving the South Asian populations, with AUC values around 0.6.

In parallel, STRUCTURE software was tested as a benchmark. The AUC of STRUCTURE was calculated by comparing its ancestry predictions with the real ancestry origins of the tested populations and individuals. First, the number of clusters K (i.e., populations) selected for the comparison was set equal to the number of ancestry populations tested for the different PLS-DA and XGBoost models at inter-continental and intra-continental levels. Then, using CLUMPP together with STRUCTURE, we obtained the Q-matrices containing the membership coefficients of each individual in each cluster. Each individual was therefore assigned to the ancestry (k-th cluster) showing the highest membership coefficient. This approach allowed us to obtain ROC curves and AUC values for comparing the STRUCTURE approach with the predictions and performance of the PLS-DA and XGBoost models.
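
The assignment rule described for STRUCTURE can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the Q-matrix is hypothetical toy data, the mapping from cluster index to population name is assumed to be known, and the AUC is computed one-vs-rest using the membership coefficient as the score, which is one plausible way to obtain a ROC curve from a Q-matrix.

```python
def assign_by_max_membership(q_matrix, cluster_names):
    """Assign each individual to the cluster with the highest coefficient."""
    assignments = []
    for row in q_matrix:
        best = max(range(len(row)), key=row.__getitem__)
        assignments.append(cluster_names[best])
    return assignments


def one_vs_rest_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical cluster labels and Q-matrix (each row sums to 1).
cluster_names = ["African", "European", "East Asian"]
true_ancestry = ["African", "African", "European", "European",
                 "East Asian", "East Asian"]
q_matrix = [
    [0.90, 0.07, 0.03],
    [0.40, 0.55, 0.05],  # African individual with a high European coefficient
    [0.10, 0.85, 0.05],
    [0.45, 0.50, 0.05],
    [0.02, 0.08, 0.90],
    [0.05, 0.15, 0.80],
]

assigned = assign_by_max_membership(q_matrix, cluster_names)
african_auc = one_vs_rest_auc(
    [row[0] for row in q_matrix],
    [label == "African" for label in true_ancestry],
)
```

Here the second individual is assigned to the European cluster despite a true African origin, and that single ranking error is what lowers the African AUC below 1.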

A comparison of the AUC values obtained for the different populations with the PLS-DA, XGBoost and STRUCTURE models at inter-continental level is reported in Fig. 5. As can be observed, PLS-DA modeling performed better than STRUCTURE for continents such as Africa, America, Europe and most of Asia (Central, East and North Asia) in all the panels investigated. Different results were observed for South Asia, the Middle East and Oceania, where the STRUCTURE model seems to work best in almost all the panels investigated, with the exception of the MAPlex panel for South Asia. The worst predictions were those of XGBoost, with AUC values on average lower than those of STRUCTURE except for Central and North Asia.

Comparison of AUC values of different populations obtained from PLS-DA, XGBoost, and STRUCTURE at inter-continental level, with Asia divided into regions.

The PCA model was then assessed for inferring BGA at intra-continental level and, as expected [28,69], unsatisfactory separations were observed (an example is shown in Fig. 6 for the MAPlex panel). In particular, the following countries and populations were evaluated for the different geographical areas:

Africa: African Caribbeans, Gambia, Kenya, Nigeria, Sierra Leone;

America: Colombia, Mexican Ancestry from Los Angeles, Mexico, Peru, Puerto Rico;

Asia: Bangladesh, China, India, Japan, Pakistan, Sri Lanka;

Europe: Finland, France, Great Britain, Italy, Spain, Israel.

PCA Scores plots showing the PCA models obtained for the different countries and populations tested using the MAPlex panel.

These countries and populations were selected because each had more than 80 genotyped individuals in the analyzed dataset; Oceanian individuals were therefore not considered, since the number of genotyped subjects was too limited. As observed in Fig. 6, no significant differences or clusters were detected with the PCA exploratory strategy. In the Asian population plot, Japan and China formed a cluster distinct from the other Asian countries. However, although the MAPlex panel was specifically developed to differentiate Asian populations, it can discriminate South from East Asian populations but cannot separate the sub-populations within these geographical areas from each other. Similar results were observed for all the other BGA AIMs panels (Additional file 1: Figs. S2, S3, S4, S5).

In summary, while this traditional multivariate approach allows us to suggest the BGA of known individuals at the inter-continental level, it fails at intra-continental level, presumably because the statistical method is incapable of classifying the variables.

Therefore, the application of the PCA model can be considered inadequate for forensic BGA inference goals. For this reason, we adopted proper classification models, such as PLS-DA and XGBoost, to improve our models' performance and obtain adequate separations among the populations.

PLS-DA and XGBoost models were then evaluated at intra-continental level. Figure 7 reports the models and the performance results of the PLS-DA model built to discriminate among the African populations.

ROC curves, sensitivity, specificity, and AUC values for African countries and populations.

In the African scenario, the best results were achieved by the EUROFORGEN and Thermo Fisher panels, although the MAPlex panel also provided interesting results.

The AUC values of the EUROFORGEN panel (Fig. 7), between 0.8 and 0.9 for two of the five populations analyzed and greater than 0.9 for the remaining three, suggest excellent and outstanding discrimination, respectively, by the SNPs in the panel. The Thermo Fisher and MAPlex panels obtained similar results.

Presumably due to the limited number of markers in the panel, the worst classification performance was provided by the ForenSeq panel, with an average AUC value of 0.798, the lowest among the panels. These results can also be assessed from the scores plots reported in Additional file 1: Fig. S6, where several clusters are visible in the PLS-DA models built using the different AIMs panels.

The AUC values very close to 1 observed for the African population in all the panels tested (Fig. 3) highlight their outstanding discrimination at the inter-continental level, with a slightly lower capability, albeit excellent in most of the panels, at intra-continental level (Fig. 7). Indeed, the average AUC values for the African populations range from acceptable discrimination for the ForenSeq panel (average AUC = 0.798) to outstanding discrimination for the MAPlex and Thermo Fisher panels, with average AUC values of 0.92 and 0.91, respectively.
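
The verbal categories used above (acceptable, excellent, outstanding) can be made explicit with a small helper. The thresholds are inferred from how the text uses the terms (0.8 to 0.9 excellent, above 0.9 outstanding, 0.798 acceptable); the lower bound of 0.7 for "acceptable", the handling of values exactly on a boundary, and the "poor" label are assumptions added for illustration. Only the three average AUC values quoted above are used as input.

```python
def discrimination_label(auc):
    """Map an AUC value to a verbal discrimination category."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc > 0.9:
        return "outstanding"
    if auc >= 0.8:
        return "excellent"
    if auc >= 0.7:  # assumed lower bound for "acceptable"
        return "acceptable"
    return "poor"  # assumed label for anything lower


# Average intra-continental AUC values for the African populations, as quoted in the text.
average_auc = {"ForenSeq": 0.798, "MAPlex": 0.92, "Thermo Fisher": 0.91}
labels = {panel: discrimination_label(auc) for panel, auc in average_auc.items()}
```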

The XGBoost model was also applied, and Table S1 in Additional file 1 shows the sensitivity, specificity, and AUC values for the African populations.

The AUC values of the PLS-DA and XGBoost models were compared (Fig. 8).

Comparison of AUC values obtained from PLS-DA and XGBoost model for African population.

Interesting AUC values (around 0.9) were observed for African Caribbean, Gambian, Kenyan, and Nigerian individuals, while the worst results (0.8 for PLS-DA, 0.6 for XGBoost) were obtained for the subjects from Sierra Leone, presumably because of the lower number of individuals in that population. Again, the best performance was achieved using PLS-DA modeling.

In the American framework (Fig. 9), no specific panel or model outperformed the others. Good discrimination results were observed using the EUROFORGEN and MAPlex panels for the individuals from Mexico, Peru, and Puerto Rico (in all cases, the AUC value is higher than 0.97), and from Colombia (for MAPlex only, with an AUC value of 0.85). On the other hand, the Thermo Fisher panel showed the best results in discriminating the individuals of Mexican ancestry living in Los Angeles (US) (AUC value of 0.88), although the ForenSeq panel also provided remarkable results (AUC value of 0.84). The Thermo Fisher panel also provided reliable classification results (AUC value of 0.98) for subjects from Puerto Rico, as did the EUROFORGEN (0.97) and MAPlex (0.99) panels. These results can also be observed in the scores plots reported in Additional file 1: Fig. S7, which show several clusters among the tested countries and populations.

ROC curves, sensitivity, specificity, and AUC values for American countries and populations.

In addition, in the American scenario, all the panels investigated except MAPlex show AUC values higher than 0.95 at inter-continental level (Fig. 3), and only a very slightly lower discrimination capability was observed at intra-continental level, with average AUC values higher than 0.90 for all panels (Fig. 9). Particular attention should be paid to the MAPlex panel: in this case, the AUC value at inter-continental level is much lower (AUC = 0.77) than the average value obtained at intra-continental level (mean AUC = 0.93), showing better discrimination at intra-continental level than at inter-continental level. This might be because the analyzed data are less variable (and fewer populations are tested) and, in this scenario, the algorithms are able to predict and infer BGA with improved performance.

Table S2 in Additional file 1 shows the sensitivity, specificity, and AUC values of the XGBoost model for the American populations. The AUC values of the PLS-DA and XGBoost models were compared (Fig. 10).

Comparison of AUC values obtained from PLS-DA and XGBoost model for American population.

As shown in Fig. 10, optimal AUC values (around 1 for PLS-DA) were observed when inferring the BGA of individuals from Mexico, Peru, and Puerto Rico, while lower performance (around 0.8 for PLS-DA) was obtained for Colombian individuals and individuals of Mexican ancestry from Los Angeles. Again, the best performance was achieved using PLS-DA modeling.

In the Asian framework (Fig. 11), similar results were obtained. On average, the best results were obtained with the Thermo Fisher and MAPlex panels, especially for the individuals from China, Japan, and Pakistan, with AUC values of 0.99, 0.98 and 0.95, respectively, for the Thermo Fisher panel and 0.98, 0.98 and 0.86 for the MAPlex panel. Excellent discrimination was also achieved for India, Bangladesh and Sri Lanka, with AUC values greater than 0.80, showing the ability of these two panels to differentiate sub-populations.

ROC curves, sensitivity, specificity, and AUC values for Asian countries and populations.

The scores plot showed two separate clusters: the first consists of China and Japan, while the second contains the individuals from Bangladesh, India, Pakistan, and Sri Lanka (Additional file 1: Fig. S8).

Table S3 in Additional file 1 shows the sensitivity, specificity, and AUC values of the XGBoost model for the Asian populations. The AUC values of the PLS-DA and XGBoost models were compared (Fig. 12).

Comparison of AUC values obtained from PLS-DA and XGBoost model for Asian population.

The best AUC values (around 1 for PLS-DA) were obtained when inferring the BGA of individuals from China, Japan, and Pakistan, while lower results (around 0.8 for PLS-DA) were obtained for individuals from Bangladesh, India, and Sri Lanka. The lowest results were shown by the XGBoost model on Bangladesh subjects and, once again, the best performance was achieved with PLS-DA modeling.

Finally, no specific AIMs panel or model outperformed the others when evaluating the European countries and populations, except that the ForenSeq panel gave the worst results, presumably due to the low number of markers analyzed. The scores plots show several separate clusters for all the evaluated populations (Additional file 1: Fig. S9).

As shown in Fig. 13, the best discrimination result was achieved for the Finnish population (AUC 0.93) for all the panels investigated. It should be noted that the best results for the French individuals were obtained with the EUROFORGEN and MAPlex AIMs panels, while for the other groups (Italians, English, Spanish, and Finns) the results are comparable.

ROC curves, sensitivity, specificity, and AUC values for European countries and populations.

Table S4 in Additional file 1 shows the sensitivity, specificity, and AUC values of the XGBoost model for the European populations. The AUC values of the PLS-DA and XGBoost models were compared (Fig. 14).

Comparison of AUC values obtained from PLS-DA and XGBoost model for European population.

Optimal AUC values (around 0.91) were observed for all the PLS-DA models in this scenario, whereas the XGBoost models showed significantly lower results.

The STRUCTURE approach was also compared with the PLS-DA and XGBoost models at intra-continental level, using the populations selected for Africa as an example. The comparison of the AUC values is reported in Fig. 15. As already observed at inter-continental level, PLS-DA and, in most cases, XGBoost provided on average better performance in terms of accuracy than the STRUCTURE approach at intra-continental level as well.

Comparison of AUC values obtained from PLS-DA, XGBoost and STRUCTURE model for African population.

Comparing the ROC curves of all the forensic panels at inter-continental and intra-continental levels, a decrease in the accuracy of BGA inference was observed at intra-continental level. This decrease may be explained by three factors: the natural geographical distribution of some populations (populations that share geographical borders and cultural practices are closely related genetically and show similar genetic patterns [72]); the SNPs in the forensic panels, which were selected to discriminate populations at continental level [28,69,73]; and the number of SNPs, which is relatively low compared to that used in other genetic fields through NGS technology.

PLS-DA and XGBoost at intra-continental level provided, on average, better performance in terms of accuracy than the STRUCTURE approach. In particular, the results showed that PLS-DA performed better than STRUCTURE at both inter- and intra-continental level. Similar results were achieved by Jombart et al. [43] when comparing a supervised classification approach like DAPC with STRUCTURE. Furthermore, both the PLS-DA and STRUCTURE methods provide graphical outputs for interpreting the results of the classification models. STRUCTURE presents the results as a bar plot (which is extremely helpful, for instance, when interpreting admixtures), while PLS-DA modelling produces a scatter plot of the tested populations, which helps to evaluate the goodness of the developed classification and allows new individuals to be projected into the calculated Scores plots. On the other hand, the GenoGeographer approach makes excellent use of likelihood ratio modelling, since it allows the tested populations and the predictions to be compared in terms of log10 LR. Similarly, our XGBoost and PLS-DA approaches provide numerical results for the performance of the models (in terms of ROC curves) and for the classification of new individuals (in terms of the probability of classification for each new tested individual).
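
To illustrate what a comparison in terms of log10 LR means, the following is a minimal sketch of a likelihood ratio between two candidate populations for one SNP profile. It is not GenoGeographer's implementation: it assumes independent biallelic SNPs, Hardy-Weinberg genotype proportions, and allele frequencies strictly between 0 and 1. The genotypes and frequencies are hypothetical.

```python
import math


def genotype_probability(alt_count, alt_freq):
    """Hardy-Weinberg probability of carrying 0, 1 or 2 copies of an allele."""
    p = alt_freq
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}[alt_count]


def log10_lr(genotypes, freqs_pop_a, freqs_pop_b):
    """log10 of P(profile | population A) / P(profile | population B)."""
    total = 0.0
    for count, freq_a, freq_b in zip(genotypes, freqs_pop_a, freqs_pop_b):
        total += math.log10(genotype_probability(count, freq_a))
        total -= math.log10(genotype_probability(count, freq_b))
    return total


# Hypothetical three-SNP profile and allele frequencies in two populations.
profile = [2, 0, 1]
freqs_a = [0.9, 0.1, 0.5]
freqs_b = [0.1, 0.9, 0.5]

lr_a_vs_b = log10_lr(profile, freqs_a, freqs_b)
```

A positive value favours population A and a negative value favours population B; swapping the two frequency lists flips the sign. The third SNP has the same frequency in both populations and therefore contributes nothing, which is why ancestry panels are built from markers whose frequencies differ between populations.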
