Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN |…

Data collection

To prepare the training data and labels, we downloaded the whole-genome sequencing (WGS) data for 10,575 MTB isolates from the sequence read archive (SRA) database17 and obtained corresponding lineage and phenotypic drug susceptibility test (DST) data from CRyPTIC Consortium and the 100,000 Genomes project in an excel file, which is also available in the supplementary of their publication15. The phenotypic DST results for the drugs were used as labels when training and evaluating our ML models. All the data were collected and shared by the CRyPTIC Consortium and the 100,000 Genomes Project15. Like the datasets used by previous studies, this dataset is imbalanced in that most isolates are susceptible, and the minority of them are resistant for all the four first-line drugs (Fig.1) and four second-line drugs. The numbers of isolate samples with phenotypic DST results available are 7138, 7137, 6347 and 7081 for EMB, INH, PZA and RIF, respectively. There are 6291 shared isolates among the four sample sets. In addition, 6820 out of the 10,575 isolates have phenotypic DST result available for each of the four second-line drugs.

Phenotypic overview of the MTB isolates. This bar chart shows numbers of susceptible and resistant isolates with DST results available for each of the four first-line drugs.

To detect the potential genetic features that could contribute to MTB drug resistance classification, we used a command-line tool called ARIBA18. ARIBA is a very rapid, flexible and accurate AMR genotyping tool that generates detailed and customizable outputs from which we extracted genetic features. First, we downloaded all reference data from CARD, which included not only references from different MTB strains but also from other bacteria (e.g., Staphylococcus aureus). Secondly, we clustered reference sequences based on their similarity. Then we used this collection of reference clusters as our pan-genome reference and aligned read pairs of an isolate to them. For each cluster that had reads mapped, we ran local assemblies, found the closest reference, and identified variants. After running these steps, ARIBA generated files including a summary file for alignment quality, a report file containing information of detected variants and AMR-associated genes, and a read depth file. For each cluster, the read depth file provides counts of the four DNA bases on each locus of the closest reference where reads were mapped.

Next, we filtered out low-quality mappings that did not pass the match criteria defined in ARIBAs GitHub wiki18. From these high-quality mappings, we collected novel variants in coding regions, well-studied resistance-causing variants and AMR-associated gene presences that were detected from at least one out of the 10,575 isolates as 263 genetic features. In addition, we included indicator variables for each of the 19 lineages into our feature vector resulting in a total of 282 features.

We applied two traditional ML algorithms, RF and LR, on the sample sets labeled with phenotypic DST results (see Data collection section) to train MTB AMR classifiers for the eight drugs (first-line and second-line), where the feature vector for each sample consists of the 282 features mentioned in Genetic feature extraction section.

RF is an ensemble method and made up of tens or hundreds of estimators (decision trees) to compress overfitting19,20. A final prediction is an average or majority vote of the predictions of all trees. It is often used when there are large training datasets and a large number of input features. Moreover, RF is good at dealing with imbalanced data by using class weighting. Here we trained each RF classifier with 1000 estimators.

LR is a popular regression technique for modeling binary dependent variable21. By using a sigmoid function (logit), linear regression is transformed into logistic regression so that the prediction range is [0, 1] for outputting probabilities. Then, LR model is fitted using maximum likelihood estimation. During the training process, we applied L1 regularization on LR models for feature selection and to prevent overfitting22.

CNN is a class of deep neural networks that takes multi-dimensional data as input23. When we sayCNN, generally, we refer to a 2-dimensional CNN, which is often used for image classification. However, there are two other types of CNN used in practice: 1-dimensional and 3-dimensional CNNs. Conv1D is generally used for time-series data where the kernel moves on one dimension and the input and output data are 2-dimensional. Conv2d and 3D kernels move on two dimensions and three dimensions, respectively.

Because deep learning algorithms require substantial computational power, we performed feature selection to only keep relevant features as input for deep learning algorithms. First, we randomly selected 80 percent of samples to calculate the importance of each feature by using the scikit-learn RF feature importance function that averages the impurity decrease from each feature across the trees to determine the final importance of each variable24. Then, we tuned the feature importance cutoff to find the one that maximizes the F1-score of an RF model trained on the remaining 20 percent of samples. For each of the eight drugs, features were selected when their feature importance scores were bigger than the optimal cutoff. The tuning processes for first-line drugs are visualized in Fig.2.

Feature importance cutoff tuning. For the four first-line drugs, when the cutoff increases, the F1-score quickly increases to its maximum and then continues to decrease. The cutoffs maximized F1-scores are 0.0004 (EMB), 0.0006 (INH), 0.0008 (PZA) and 0.0016 (RIF).

After the relevant features were selected, we designed and built a multi-input CNN architecture with TensorFlow Keras25 that took N inputs of 421 matrices representing N selected SNP features into the first layer. Each 421 matrix consists of normalized DNA base counts for each locus within a 21-base reference sequence window centered on the focal SNP (Fig.3). We generated normalized counts based on the raw base counts extracted from the read depth file mentioned in Genetic feature extraction section. Our convolutional architecture starts with two 1D convolutional layers followed by a flattening layer for each SNP input. Then, it concatenates the N flattening layers with the inputs of AMR-associated gene presence and lineage features. Finally, we added three fully connected layers to complete the deep neural network architecture (Fig.4). It smoothly integrates sequential and non-sequential features.

Conversion of raw base counts at each locus of a 21-base reference window into normalized base counts as Conv1D input of each selected SNP feature. The raw base counts were derived from reference-reads alignment, as shown on the left of this figure. The center of the window is the locus of a selected SNP feature. The normalized base counts at each locus are the percentage of the four DNA bases (ACGT), respectively.

Flowchart of our 1D CNN architecture.

Read more:
Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN |...

Related Posts

Comments are closed.