
In this section we describe the clinical data set, data preprocessing, feature selection process, classifiers used, and design for the machine learning experiments.

The data we study is a collection of seven clinical studies available via the NCBI Gene Expression Omnibus, Series GSE73072. The details of these studies can be found in [1], but we briefly summarize them here and in Fig. 1.

Each of the seven studies enrolled individuals to be infected with one of four viruses associated with a common respiratory infection. Studies DEE2-DEE5 challenged participants with H1N1 or H3N2. Studies Duke and UVA challenged participants with HRV, while DEE1 challenged individuals with RSV.

In all cases, individuals had blood samples taken at regular intervals of 4–12 h both prior to and after infection; see Fig. 1 for details. Specific time points are measured as hours since infection and vary by study. In total, 148 human subjects were involved, with approximately 20 sampled time points per person. Blood samples were run through undirected microarray assays. The CEL data files available via GEO were read and processed using RMA (Robust Multi-array Average) normalization through several Bioconductor packages [23], producing expression values across 22,277 microarray probes.

To address the time-of-infection question, we separate the training and test samples into 9 bins in time post-inoculation, each with a categorical label; see Fig. 1. The first six categories correspond to disjoint 8-h intervals in the first 2 days after inoculation, and the last three categories are disjoint 24-h intervals from hour 48 to hour 120. In addition to this 9-class classification problem, we also studied a relaxed binary prediction problem of whether a subject belongs to the early phase of infection (time since inoculation \(\le\) 48 h) or the later phase (time since inoculation > 48 h). Results for this binary classification are inferred from the 9-class problem, i.e., a predicted label is mapped to its phase and counted as correct when it falls in the same phase as the true label.
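
As a concrete illustration, the following sketch maps an hours-since-inoculation value to one of the nine bins. The handling of boundary times is an assumption, since the text does not specify on which side each interval closes.

```python
def time_bin(hours_since_inoculation):
    """Map hours since inoculation to one of the 9 categorical time bins:
    six 8-h bins covering hours 0-48, then three 24-h bins covering 48-120."""
    h = hours_since_inoculation
    if h < 0 or h > 120:
        return None                          # outside the labeled window
    if h < 48:
        return int(h // 8)                   # bins 0-5: 8-h intervals
    return 6 + min(int((h - 48) // 24), 2)   # bins 6-8: 24-h intervals
```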

After the data is processed, we apply the following general pipeline for each of the 14 experiments enumerated in Fig. 2 (top panel):

1. Partition the data into training and testing sets based on the classification experiment.

2. Normalize the data to correct for batch effects between subjects (e.g., using the linear batch normalization routine in the limma package [24]).

3. Identify comprehensive sets of predictive features using the Iterative Feature Removal (IFR) approach [6], which aims to extract all discriminatory features in high-dimensional data through repeated application of sparse linear classifiers such as Sparse Support Vector Machines (SSVM).

4. Identify network architectures and other hyperparameters for the Artificial Neural Network (ANN) and Centroid-Encoder (CE) using a five-fold cross-validation experiment on the training data.

5. Evaluate the features identified in step 3 on the test data. This is done by training and evaluating a new model using the selected features with a leave-one-subject-out cross-validation scheme on the test study. The evaluation metric is the Balanced Success Rate (BSR); throughout this study we use BSR as a balanced measure of performance that accounts for imbalanced class sizes while remaining easy to interpret. A minimal sketch of this evaluation step follows the list.
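
The sketch below illustrates the leave-one-subject-out evaluation in step 5 under simplifying assumptions: a single plain linear SVM stands in for the full model suite, and `X`, `y`, `subjects`, and `selected_features` are hypothetical arrays/index lists supplied by the earlier steps.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

def loso_bsr(X, y, subjects, selected_features):
    """Leave-one-subject-out evaluation: hold out all samples of one subject
    at a time, train on the rest using only the selected features, and
    report the BSR (balanced accuracy) over the pooled held-out predictions."""
    Xs = X[:, selected_features]
    preds = np.empty_like(y)
    for tr, te in LeaveOneGroupOut().split(Xs, y, groups=subjects):
        clf = LinearSVC(max_iter=5000).fit(Xs[tr], y[tr])
        preds[te] = clf.predict(Xs[te])
    return balanced_accuracy_score(y, preds)
```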

For each of the training sets illustrated in Fig. 2 (top panel; stripes), features are selected using the IFR algorithm with an SSVM classifier. This is done separately for all pairwise combinations of categorical labels (time bins); the 9-class experiment leads to 9-choose-2 = 36 pairwise combinations. For each of these 36 combinations of time bins, features are selected using the following steps.

First, the input data to the IFR algorithm is partitioned into a training set and a validation set. Next, sets of features that produce high accuracy on the validation set are selected iteratively. In each iteration, features that have previously been selected are masked out so that they are not used again. Feature selection is halted once the predictive rate on the validation data drops below a specified threshold. This results in one feature-set for a particular training-validation partition of the input data.

Next, additional training-validation partitions are repeatedly sampled, and the feature selection described above is repeated for each partition, producing a different feature-set for each one. These feature-sets are then combined by a set union, and the frequency with which each individual feature appears across feature-sets is tracked. The feature frequency is used to rank the features: the more often a particular feature is discovered, the more important it is considered.
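
A minimal sketch of one IFR pass, assuming NumPy arrays and using scikit-learn's L1-penalized LinearSVC as a stand-in for the SSVM solver; the accuracy threshold and iteration cap are illustrative values, not those used in the study.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

def iterative_feature_removal(X_tr, y_tr, X_val, y_val,
                              acc_threshold=0.75, max_iters=50):
    """One IFR pass: repeatedly fit a sparse linear classifier, record the
    features it uses, mask them out, and stop once validation accuracy
    drops below the threshold."""
    available = np.ones(X_tr.shape[1], dtype=bool)      # features not yet removed
    selected = []
    for _ in range(max_iters):
        idx = np.where(available)[0]
        clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000)
        clf.fit(X_tr[:, idx], y_tr)
        acc = balanced_accuracy_score(y_val, clf.predict(X_val[:, idx]))
        if acc < acc_threshold:
            break
        used = idx[np.abs(clf.coef_).ravel() > 1e-8]     # nonzero-weight features
        if used.size == 0:
            break
        selected.extend(used.tolist())
        available[used] = False                          # mask them out
    return selected
```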

The size of this combined feature-set, although only about 5–20% of the original number of features, is still often too large for classification, so as a last step we reduce the size of this feature-set. This is done by performing a grid search using a linear SVM (without a sparsity penalty) on the training data, taking the top n features, ranked by frequency, that maximize the average of the per-class true positive rates, i.e., the BSR. Once the features have been selected, we perform a more detailed leave-one-subject-out classification for the experiments described in the Results and visualized in Fig. 2, using the classifiers described in the Methods section.
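
A sketch of this final reduction step, assuming `feature_runs` is a list of the feature-sets returned by the repeated IFR passes; the candidate grid of n values is an illustrative assumption.

```python
from collections import Counter
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def select_top_n(feature_runs, X, y, grid=(10, 25, 50, 100, 200)):
    """Rank features by how often they appear across the repeated IFR runs,
    then pick the top-n (over a small grid) maximizing the balanced accuracy
    of a plain linear SVM with no sparsity penalty."""
    freq = Counter(f for run in feature_runs for f in set(run))
    ranked = [f for f, _ in freq.most_common()]
    best_n, best_bsr = grid[0], -1.0
    for n in grid:
        clf = LinearSVC(C=1.0, max_iter=5000)
        bsr = cross_val_score(clf, X[:, ranked[:n]], y,
                              scoring="balanced_accuracy", cv=5).mean()
        if bsr > best_bsr:
            best_n, best_bsr = n, bsr
    return ranked[:best_n], best_bsr
```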

Feature selection produced 36 distinct feature-sets, one for each distinct choice of two time bins from the nine possible labels. To address the question of commonality or importance of the features selected on a time bin for a specific pathway, we implemented a heuristic scoring system. For a fixed time bin (say, bin1) and a fixed feature-set (say, bin1_vs_bin2; quantities summarized in Table 2), the associated collection of features was referenced against the GSEA MSigDB. This database includes both canonical pathways and gene sets derived from data mining; we refer to anything in this collection generically as a gene set. A score for each MSigDB gene set was assigned for a given feature-set (bin1_vs_bin2) based on the fraction of that gene set's members appearing in the feature-set. For instance, a score of 0.5 for a hypothetical GENE_SET_A on feature-set bin1_vs_bin2 means that 50% of the features in GENE_SET_A are present in bin1_vs_bin2.

A score for a pathway on a time bin by itself was then defined as the sum of the scores for that pathway over all feature-sets involving that bin. Continuing the example, the score for GENE_SET_A on bin1 would be the sum of the scores for GENE_SET_A on feature-sets bin1_vs_bin2, bin1_vs_bin3, and so on through bin1_vs_bin9, with equal weighting.
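
A compact sketch of this heuristic, where `feature_sets` maps pair names (e.g., 'bin1_vs_bin2') to sets of gene identifiers and `gene_sets` maps MSigDB gene-set names to sets of member genes; both containers are hypothetical stand-ins for the actual data structures used.

```python
def gene_set_scores(feature_sets, gene_sets, bin_label):
    """Heuristic pathway scoring: for each gene set, sum over all pairwise
    feature-sets involving `bin_label` the fraction of the gene set's
    members found in that feature-set."""
    scores = {}
    for gs_name, gs_genes in gene_sets.items():
        total = 0.0
        for pair_name, feats in feature_sets.items():
            if bin_label not in pair_name.split("_vs_"):
                continue                         # only pairs involving this bin
            total += len(gs_genes & feats) / len(gs_genes)
        scores[gs_name] = total
    return scores

# Example: gene_set_scores(feature_sets, gene_sets, "bin1") sums the scores
# for bin1_vs_bin2 through bin1_vs_bin9, as described above.
```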

Certainly, there are several subtle statistical and combinatorial questions relating to this procedure. Direct comparison of pathways and gene sets is challenging due to their overlapping nature (features may belong to multiple gene sets). The number of features associated with a gene set can vary from under 10 to over 1000, which may complicate a scoring system based on percentage overlap, such as ours. Attempting to use a mathematically or statistically rigorous procedure to account for these and other potential factors is a worthy exercise, but we believe our heuristic is sufficient for an explainable high-level summary of the composition of the feature-sets found.

In this section we describe the classifiers and how they are applied for the classification task. We also describe how the feature-sets are adapted to different classifiers.

After feature selection, we evaluate the features on the test sets based on successful classification into the nine time bins. For each experiment shown in Fig. 1, we use the feature-sets extracted on its training set and evaluate the models using leave-one-subject-out cross-validation on the test set. Each experiment is repeated 25 times to capture variability. For the binary classifiers (SSVM and linear SVM), we used a multiclass method in which each of the 9-choose-2 = 36 pairwise models uses its respective feature-set. In contrast, we used a single classification model for the ANN and CE because these models can handle multiple classes directly; the feature-set for these models is created by taking the union of the 36 pairwise feature-sets.

Balanced Success Rate (BSR). Throughout the Results section, we report predictive power in terms of the BSR. This is a simple average of the true positive rates for each of the categories. The BSR serves as a simple, interpretable metric, especially when working with imbalanced data sets, and gives a holistic view of classification performance that generalizes easily to multiclass problems. For example, if the true positive rates in a 3-class problem were \(TPR_1 = 95\%\), \(TPR_2 = 50\%\), and \(TPR_3 = 65\%\), the BSR for the multiclass problem would be \((TPR_1 + TPR_2 + TPR_3)/3 = 70\%\).
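
A direct sketch of this definition (equivalent to scikit-learn's balanced_accuracy_score):

```python
import numpy as np

def balanced_success_rate(y_true, y_pred):
    """BSR: the unweighted mean of the per-class true positive rates."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tprs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(tprs))

# Worked example from the text: per-class TPRs of 0.95, 0.50, and 0.65
# give a BSR of (0.95 + 0.50 + 0.65) / 3 = 0.70.
```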

We implement a pairwise (one-vs-one) model for training and classification to extend the binary classifiers described below (the ANN and CE do not require this). For a data set with c unique classes, c-choose-2 models are built using the relevant subsets of the data. Learned model parameters and the features selected for each model are stored and later used when discriminatory features are needed in the test phase.

After training, classification is done by a simple voting scheme: a new sample is classified by all c-choose-2 classifiers and assigned the label that receives a plurality of the votes. If a tie occurs, the class is decided by an unbiased coin flip among the winning labels. In a nine-class problem, this corresponds to 36 classifiers and feature-sets.
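
A minimal sketch of this voting scheme, assuming `models` and `feature_sets` are dictionaries keyed by class pairs and `x` is a single sample vector (all hypothetical names):

```python
from itertools import combinations
import numpy as np

def ovo_predict(models, feature_sets, x, classes, rng=None):
    """Pairwise (one-vs-one) voting: each binary classifier votes using its
    own feature subset; ties are broken by a fair random choice."""
    rng = rng or np.random.default_rng()
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        feats = feature_sets[(a, b)]                          # features for this pair
        label = models[(a, b)].predict(x[feats].reshape(1, -1))[0]
        votes[label] += 1
    top = max(votes.values())
    winners = [c for c, v in votes.items() if v == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)
```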

Linear SVM. For a plain linear SVM model, the implementation in the scikit-learn package in Python was used [25]. While scikit-learn also has built-in support to extend this binary classifier to multiclass problems, either by one-vs-one or one-vs-all approaches, we only use it for binary classification problems, or for the binary sub-problems of a one-vs-one scheme in a multiclass problem. The optimization problem was introduced in [26] and requires the solution of

$$\begin{aligned} \min_{w,b} \;&\Vert w \Vert_2^2 \quad \text{subject to} \\ &y^i \left( w \cdot x^i - b \right) \ge 1, \quad \text{for all } i \end{aligned}$$

(1)

where \(y^i\) represents the class label of sample i, assigned to \(\pm 1\), \(x^i\) represents a sample vector, \(w\) is the weight vector, and \(b\) is a bias (a scalar shift). This approach has seen widespread use and success in biological feature extraction [27,28].
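
A minimal usage sketch of the scikit-learn linear SVM referenced above; the regularization value C is an illustrative assumption, and in practice the soft-margin formulation is solved, with a large C approximating the hard-margin problem in Eq. (1).

```python
from sklearn.svm import LinearSVC

# Plain linear SVM (no sparsity penalty); labels may be any two classes.
clf = LinearSVC(C=1.0, max_iter=5000)
# clf.fit(X_train[:, pair_features], y_train)
# clf.coef_ holds the weight vector w; clf.intercept_ holds the bias term.
```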

Sparse SVM (SSVM). The SSVM problem replaces the 2-norm in the objective of Eq. (1) with a 1-norm, which is understood to promote sparsity (many zero coefficients) in the coefficient vector \(\textbf{w}\). This allows one to ignore the corresponding features and is our primary tool for feature selection when coupled with Iterative Feature Removal [6]. Arbitrary p-norm SVMs were introduced in [29], and \(\ell_1\)-norm sparse SVMs were further developed for feature selection in [6,30,31].
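
For concreteness, the 1-norm analogue of Eq. (1) can be written as follows (a sketch of the hard-margin form; in practice a soft-margin version with slack variables is typically solved):

$$\begin{aligned} \min_{w,b} \;&\Vert w \Vert_1 \quad \text{subject to} \\ &y^i \left( w \cdot x^i - b \right) \ge 1, \quad \text{for all } i \end{aligned}$$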

With a standard one-hot encoding of the labels, inherently multiclass methods (here, the neural networks) do not need to be adapted to handle a multiclass problem as the linear methods do. Nor is there a straightforward way to restrict a forward pass through the network to time-dependent feature subsets; doing so would beg the (time of infection) question. Instead, for these methods we simply take the union of all pairwise feature-sets built to classify pairs of time bins and allow the multiclass algorithm to learn any necessary relationships internally. The specifics of the neural networks are described below.

Artificial Neural Networks (ANN). We apply a standard feed-forward neural network trained to learn the labels of the training data. In all of the classification tasks, we used two hidden layers with 500 ReLU units each. We used the whole training set to calculate the gradient of the loss function (cross-entropy) while updating the network parameters using Scaled Conjugate Gradient Descent (SCG); see [32].
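
A minimal stand-in for this ANN using scikit-learn; note that scikit-learn does not provide Scaled Conjugate Gradient, so the full-batch L-BFGS solver is used here as an assumption rather than the authors' exact training procedure, and `union_features` is a hypothetical index list.

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers of 500 ReLU units, trained with a cross-entropy loss.
ann = MLPClassifier(hidden_layer_sizes=(500, 500), activation="relu",
                    solver="lbfgs", max_iter=1000)
# ann.fit(X_train[:, union_features], y_train)
# y_pred = ann.predict(X_test[:, union_features])
```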

Centroid-Encoder (CE). This is a variation of an autoencoder that can be used for both visualization and classification. Consider a data set with N samples and M classes. The classes are denoted \(C_j, j = 1, \dots, M\), and the indices of the data associated with class \(C_j\) are denoted \(I_j\). We define the centroid of each class as \(c_j = \frac{1}{|C_j|}\sum_{i \in I_j} x_i\), where \(|C_j|\) is the cardinality of class \(C_j\). Unlike an autoencoder, which maps each point \(x_i\) to itself, CE maps each point \(x_i\) to its class centroid \(c_j\) by minimizing the following cost function over the parameter set \(\theta\):

$$\begin{aligned} \mathscr{L}_{ce}(\theta) = \frac{1}{2N}\sum_{j=1}^{M} \sum_{i \in I_j} \Vert c_j - f(x_i; \theta) \Vert_2^2 \end{aligned}$$

(2)

The mapping \(f\) is composed of a dimension-reducing mapping \(g\) (encoder) followed by a dimension-increasing reconstruction mapping \(h\) (decoder). The output of the encoder is used as a supervised visualization tool [33], and attaching another layer that maps to the one-hot encoded labels, followed by fine-tuning, provides a classifier. For further details, see [34]. In all of the classification tasks, we used three hidden layers (\(500 \rightarrow 100 \rightarrow 500\)) with ReLU activations for the centroid mapping. We then attached a classification layer with one-hot encoding to the encoder (\(500 \rightarrow 100\)) to learn the class labels of the samples. The model parameters were updated using SCG.
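
The sketch below is a minimal PyTorch rendering of the centroid-mapping stage with the 500 → 100 → 500 architecture described above. The Adam optimizer, learning rate, and epoch count are illustrative assumptions (the study uses SCG), integer class labels are assumed, and the classification fine-tuning stage is only indicated in a comment.

```python
import torch
import torch.nn as nn

class CentroidEncoder(nn.Module):
    """Encoder (n_features -> 500 -> 100) and decoder (100 -> 500 -> n_features)."""
    def __init__(self, n_features):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 500), nn.ReLU(),
                                     nn.Linear(500, 100), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(100, 500), nn.ReLU(),
                                     nn.Linear(500, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_centroid_encoder(model, X, y, epochs=200, lr=1e-3):
    """Centroid-encoder pretraining: each sample is pulled toward the mean
    (centroid) of its own class, minimizing the loss in Eq. (2)."""
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y)
    centroids = {int(c): X[y == c].mean(dim=0) for c in torch.unique(y)}
    targets = torch.stack([centroids[int(c)] for c in y])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = 0.5 * ((model(X) - targets) ** 2).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return model

# A classification head (100 -> number of classes) can then be attached to
# model.encoder and fine-tuned with a cross-entropy loss, as described above.
```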
