In this section we describe the clinical data set, data preprocessing, feature selection process, classifiers used, and design for the machine learning experiments.
The data we study is a collection of seven clinical studies available via the NCBI Gene Expression Omnibus, Series GSE73072. The details of these studies can be found in1, but we briefly summarize here and in Fig.1.
Each of the seven studies enrolled individuals to be infected with one of four viruses associated with a common respiratory infection. Studies DEE2-DEE5 challenged participants with H1N1 or H3N2. Studies Duke and UVA challenged participants with HRV, while DEE1 challenged individuals with RSV.
In all cases, individuals had blood samples taken at regular intervals every 412h both prior to and after infection; see Fig.1 for details. Specific time points are measured as hours since infection and vary by study. In total, 148 human subjects were involved with approximately 20 sampled time points per person. Blood samples were run through undirected microarray assays. CEL data files available via GEO were read and processed using RMA (Robust Multi-array Average) normalization through use of several Bioconductor packages23 producing expression values across 22,277 microarray probes.
To address the time of infection question, we separate the training and test samples into 9 bins in time post-inoculation, each with a categorical label; see Fig.1. The first six categories correspond to disjoint 8-h intervals in the first 2days after inoculation, and the last three categories are disjoint 24-h intervals from hours 48 to 120h. In addition to this 9-class classification problem, we also studied a relaxed binary prediction problem of whether a subject belongs to the early phase of infection (time of inoculation (le ) 48h) or later phase (time of infection > 48h). Results for this binary classification are inferred from the 9-class problem, i.e., if a classified label is associated to a time in the first 2days, it is considered correctly labeled.
After the data is processed, we apply the following general pipeline for each of the 14 experiments enumerated in Fig.2 (top panel):
Partition the data into training and testing sets based on the classification experiment.
Normalize the data to correct for batch effects seen between subjects (e.g., using the linear batch normalization routine in the limma package24).
Identify comprehensive sets of predictive features using the Iterative Feature Removal (IFR) approach6, which aims to extract all discriminatory features in high-dimensional data with repeated application of sparse linear classifiers such as Sparse Support Vector Machines (SSVM).
Identify network architectures and other hyperparameters for Artificial Neural Networks (ANN) and Centroid Encoder (CE) by utilizing a five-fold cross-validation experiment on the training data.
Evaluate the features identified from the step 3 on the test data. This is done by training and evaluating a new model using the selected features with a leave-one-subject-out cross validation scheme on the test study. The metric used for evaluation is BSR; throughout this study we utilize BSR as a balanced representation of performance accounting for imbalanced class sizes while being easy to interpret.
For each of the training sets illustrated in Fig.2 (top panel; stripes), features are selected using the IFR algorithm, with an SSVM classifier. This is done separately for all pairwise combinations of categorical labels (time bins); a 9-class experiment leads to 9-choose-2 = 36 pairwise combinations. So, for each of these 36 combinations of time bins, features are selected using the following steps.
First, the input data to IFR algorithm is partitioned into training and validation set. Next, sets of features that produce high accuracy on the validation set are selected iteratively. In each iteration, features that have previously been selected are masked out, so that theyre not used again. The feature selection is halted once the predictive rates on the validation data drops below a specified threshold. This results in one feature-set for a particular training-validation set of the input data.
Next, more training-validation partitions are repeatedly sampled, and the feature selection, as described above, is repeated for each partition-set; this results in a different feature-set for each new partition-set. Then, these different feature-sets are combined by applying a set union operation and the frequency of each individual feature is tracked if they are discovered in multiple feature-sets. The feature frequency is used to rank the features; the more a particular feature is discovered, the more important it is.
The size of this combined feature-set, although about 520% of the original feature-set size, is still often large for classification, so a last step we reduce the size of this feature-set. This is done by performing a grid-search using a linear SVM (without sparsity penalty) on the training data, taking the top n features, ranked by frequency, which maximize the average of true positive rates on every class, or BSR. Once the features have been selected, we perform a more detailed leave-one-subject-out classification for the experiments described in the Results and visualized in Fig.2 using the classifiers described in Methods section.
Feature selection produced 36 distinct feature-sets, coming from all distinct choices of two time bins from the nine labels possible. To address the question of commonality or importance of features selected on a time bin for a specific pathway, we implemented a heuristic scoring system. For a fixed time bin (say, bin1) and a fixed feature-set (say, bin1_vs_bin2; quantities summarized in Table 2) the associated collection of features was referenced against the GSEA MSigDB. This database includes both canonical pathways and gene sets as the result of data miningwe refer to anything in this collection generically as a gene set. A score for each MSigDB gene set was assigned for a given feature-set (bin1_vs_bin2) based on the ratio of features in the feature-set which appear in the gene set. For instance, a score of 0.5 for hypothetical GENE_SET_A for feature-set bin1_vs_bin2 represents the fact that 50% of the features in GENE_SET_A are present in bin1_vs_bin2.
A score for pathway on a time bin by itself was defined as the sum of the scores for that pathway on all feature-sets related to it. Continuing the example, a score for GENE_SET_A on bin1 would be the sum of the scores for GENE_SET_A for feature-set bin1_vs_bin2, bin1_vs_bin3, all the way up to bin1_vs_bin9, with equal weighting.
Certainly, there are several subtle statistical and combinatorial questions relating to this procedure. Direct comparison of pathways and gene sets is challenging due their overlapping nature (features may belong to multiple gene sets). The number of features associated with a gene set can vary anywhere from under 10, to over 1000 and may complicate a scoring system based on percentage overlap, such as ours. Attempting to use a mathematically or statistically rigorous procedure to account for these, and other potential factors is a worthy exercise, but we believe our heuristic is sufficient for an explainable high-level summary of the composition of the feature-sets found.
In this section we describe the classifiers and how they are applied for the classification task. We also describe how the feature-sets are adapted to different classifiers.
After feature selection, we evaluate the features on test sets based on successful classification in the nine time bins. For each experiment shown in Fig.1, we use the feature-sets extracted on its training set and evaluate the models using leave-one-subject-out cross validation on the test set. Each experiment is repeated 25 times to capture variability. For the binary classifiersSSVM and linear SVMwe used a multiclass method, with each of its ({9 atopwithdelims ()2}) pairwise models using respective feature-sets. On the other hand, we used a single classification model for ANN and CE because these models can handle multiple classes. The feature-set for these models are created by taking a union of ({9 atopwithdelims ()2} = 36) pairwise feature-sets.
Balanced Success Rate (BSR) Throughout the Results section, we report predictive power in terms of BSR. This is a simple average of true positive rates for each of the categories. The BSR serves as a simple, interpretable metric especially when working with imbalanced data sets and gives a holistic view of classification performance that easily generalizes to multiclass problems. For example, if true positive rates in a 3-class problem were (TPR_1 = 95%), (TPR_2 = 50%), and (TPR_3 = 65%), the BSR for the multiclass problem would be ((TPR_1 + TPR_2 + TPR_3)/3 = 70%).
We implement a pairwise model (or one-vs-one model) for training and classification to extend the binary classifiers described below (ANN and CE do not require these). For a data set with c unique classes, c-choose-2 models are built using the relevant subsets of the data. Learned model parameters and features selected for each model are stored and later used when discriminatory features are needed in the test phase.
After training, classification is done by a simple voting scheme: a new sample is classified by all c-choose-2 classifiers and assigned the label that had the plurality of the vote. If a tie occurs, the class is decided by an unbiased coin flip between the winning labels. In a nine-class problem, this corresponds to 36 classifiers and feature-sets being selected.
Linear SVM For a plain linear SVM model, the implementation in the scikit-learn package in Python was used25. While scikit-learn also has built-in support to extend this binary classifier to multiclass problems, either by one-vs-one or one-vs-all approaches, we only use it for binary classification problems, or for binary sub-problems of a one-vs-one scheme for a multiclass problem. The optimization problem was introduced by26 and requires the solution to
$$begin{aligned} begin{aligned} textrm{min}_{w,b} ;&||w||_2^2 quad text {subject to} \&y^i ( w cdot x^i - b ) ge 1, quad text { for all } i end{aligned} end{aligned}$$
(1)
where (y^i) represent class labels assigned to (pm 1), (x^i) represent vector samples, w represents the weight vector and b represents a bias (a scalar shift). This approach has seen widespread use and success in biological feature extraction27,28.
Sparse SVM (SSVM) The SSVM problem replaces the 2-norm in the objective of equation 1 with a 1-norm, which is understood to promote sparsity (many zero coefficients) in the coefficient vector ({textbf{w}}). This allows one to ignore those features and is our primary tool for feature selection when coupled with Iterated Feature Removal6. Arbitrary p-norm SVM were introduced in29 and (ell _1)-norm sparse SVM were further developed for feature selection in6,30,31.
After a standard one-hot encoding scheme, inherently multiclass methods (here: neural networks) do not need to be adapted to handle a multiclass problem as with linear methods, nor is there a straightforward way to encode the use of time-dependent features in passing new data forward through the neural network; this would be begging the (time of infection) question. Instead, for these methods, we simply take the union of all pairwise features built to classify pairs of time bins, then allow the multiclass algorithm to learn any necessary relationships internally. The specifics of the neural networks are described below.
Artificial Neural Networks (ANN) We apply a standard feed-forward neural network trained to learn the labels of the training data. In all the classification tasks, we used two hidden layers with 500 ReLU activation in each layer. We used the whole training set to calculate the gradient of the loss function (Cross-entropy) while updating the network parameters using Scaled Conjugate Gradient Descent (SCG); see32.
Centroid-Encoder (CE) This is a variation of an autoencoder which can be used for both visualization and classification purposes. Consider a data set with N samples and M classes. The classes denoted (C_j, j = 1, dots , M) where the indices of the data associated with class (C_j) are denoted (I_j). We define centroid of each class as (c_j=frac{1}{|C_j|}sum _{i in I_j} x_i) where (|C_j|) is the cardinality of class (C_j). Unlike autoencoder, which maps each point (x_i) to itself, CE will map each point (x_i) to its class centroid (c_j) by minimizing the following cost function over the parameter set (theta ):
$$begin{aligned} begin{aligned} {mathscr {L}}_{ce}(theta )=frac{1}{2N}sum ^M_{j=1} sum _{i in I_j}Vert c_j-f(x_i; theta ))Vert ^2_2 end{aligned} end{aligned}$$
(2)
The mapping f is composed of a dimension reducing mapping g (encoder) followed by a dimension increasing reconstruction mapping h (decoder). The output of the encoder is used as a supervised visualization tool33, and attaching another layer to map to the one-hot encoded labels and further training by fine-tuning provides a classifier. For further details, see34. In all of the classification tasks, we used three hidden layers ((500 rightarrow 100 rightarrow 500)) with ReLU activation for centroid mapping. After that we attached a classification layer with one-hot-encoding to the encoder ((500 rightarrow 100)) to learn the class label of the samples. The model parameters were updated using SCG.
Continue reading here:
Using machine learning to determine the time of exposure to ... - Nature.com
- What Is Machine Learning? | How It Works, Techniques ... [Last Updated On: September 5th, 2019] [Originally Added On: September 5th, 2019]
- Start Here with Machine Learning [Last Updated On: September 22nd, 2019] [Originally Added On: September 22nd, 2019]
- What is Machine Learning? | Emerj [Last Updated On: October 1st, 2019] [Originally Added On: October 1st, 2019]
- Microsoft Azure Machine Learning Studio [Last Updated On: October 1st, 2019] [Originally Added On: October 1st, 2019]
- Machine Learning Basics | What Is Machine Learning? | Introduction To Machine Learning | Simplilearn [Last Updated On: October 1st, 2019] [Originally Added On: October 1st, 2019]
- What is Machine Learning? A definition - Expert System [Last Updated On: October 2nd, 2019] [Originally Added On: October 2nd, 2019]
- Machine Learning | Stanford Online [Last Updated On: October 2nd, 2019] [Originally Added On: October 2nd, 2019]
- How to Learn Machine Learning, The Self-Starter Way [Last Updated On: October 17th, 2019] [Originally Added On: October 17th, 2019]
- definition - What is machine learning? - Stack Overflow [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Artificial Intelligence vs. Machine Learning vs. Deep ... [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Machine Learning in R for beginners (article) - DataCamp [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Machine Learning | Udacity [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Machine Learning Artificial Intelligence | McAfee [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Machine Learning [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- AI-based ML algorithms could increase detection of undiagnosed AF - Cardiac Rhythm News [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- The Cerebras CS-1 computes deep learning AI problems by being bigger, bigger, and bigger than any other chip - TechCrunch [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- Can the planet really afford the exorbitant power demands of machine learning? - The Guardian [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- New InfiniteIO Platform Reduces Latency and Accelerates Performance for Machine Learning, AI and Analytics - Business Wire [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- How to Use Machine Learning to Drive Real Value - eWeek [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- Machine Learning As A Service Market to Soar from End-use Industries and Push Revenues in the 2025 - Downey Magazine [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Rad AI Raises $4M to Automate Repetitive Tasks for Radiologists Through Machine Learning - - HIT Consultant [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Machine Learning Improves Performance of the Advanced Light Source - Machine Design [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Synthetic Data: The Diamonds of Machine Learning - TDWI [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- The transformation of healthcare with AI and machine learning - ITProPortal [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Workday talks machine learning and the future of human capital management - ZDNet [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Machine Learning with R, Third Edition - Free Sample Chapters - Neowin [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Verification In The Era Of Autonomous Driving, Artificial Intelligence And Machine Learning - SemiEngineering [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Podcast: How artificial intelligence, machine learning can help us realize the value of all that genetic data we're collecting - Genetic Literacy... [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- The Real Reason Your School Avoids Machine Learning - The Tech Edvocate [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- Siri, Tell Fido To Stop Barking: What's Machine Learning, And What's The Future Of It? - 90.5 WESA [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- Microsoft reveals how it caught mutating Monero mining malware with machine learning - The Next Web [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- The role of machine learning in IT service management - ITProPortal [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- Global Director of Tech Exploration Discusses Artificial Intelligence and Machine Learning at Anheuser-Busch InBev - Seton Hall University News &... [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- The 10 Hottest AI And Machine Learning Startups Of 2019 - CRN: The Biggest Tech News For Partners And The IT Channel [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- Startup jobs of the week: Marketing Communications Specialist, Oracle Architect, Machine Learning Scientist - BetaKit [Last Updated On: November 30th, 2019] [Originally Added On: November 30th, 2019]
- Here's why machine learning is critical to success for banks of the future - Tech Wire Asia [Last Updated On: December 2nd, 2019] [Originally Added On: December 2nd, 2019]
- 3 questions to ask before investing in machine learning for pop health - Healthcare IT News [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- Machine Learning Answers: If Caterpillar Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- Measuring Employee Engagement with A.I. and Machine Learning - Dice Insights [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- Amazon Wants to Teach You Machine Learning Through Music? - Dice Insights [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- Machine Learning Answers: If Nvidia Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- AI and machine learning platforms will start to challenge conventional thinking - CRN.in [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning Answers: If Twitter Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning Answers: If Seagate Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning Answers: If BlackBerry Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Amazon Releases A New Tool To Improve Machine Learning Processes - Forbes [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Another free web course to gain machine-learning skills (thanks, Finland), NIST probes 'racist' face-recog and more - The Register [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Kubernetes and containers are the perfect fit for machine learning - JAXenter [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- TinyML as a Service and machine learning at the edge - Ericsson [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- AI and machine learning products - Cloud AI | Google Cloud [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning | Blog | Microsoft Azure [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning in 2019 Was About Balancing Privacy and Progress - ITPro Today [Last Updated On: December 25th, 2019] [Originally Added On: December 25th, 2019]
- CMSWire's Top 10 AI and Machine Learning Articles of 2019 - CMSWire [Last Updated On: December 25th, 2019] [Originally Added On: December 25th, 2019]
- Here's why digital marketing is as lucrative a career as data science and machine learning - Business Insider India [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Dell's Latitude 9510 shakes up corporate laptops with 5G, machine learning, and thin bezels - PCWorld [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Finally, a good use for AI: Machine-learning tool guesstimates how well your code will run on a CPU core - The Register [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Cloud as the enabler of AI's competitive advantage - Finextra [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Forget Machine Learning, Constraint Solvers are What the Enterprise Needs - - RTInsights [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Informed decisions through machine learning will keep it afloat & going - Sea News [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- The Problem with Hiring Algorithms - Machine Learning Times - machine learning & data science news - The Predictive Analytics Times [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- New Program Supports Machine Learning in the Chemical Sciences and Engineering - Newswise [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- AI-System Flags the Under-Vaccinated in Israel - PrecisionVaccinations [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- New Contest: Train All The Things - Hackaday [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- AFTAs 2019: Best New Technology Introduced Over the Last 12 MonthsAI, Machine Learning and AnalyticsActiveViam - www.waterstechnology.com [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Educate Yourself on Machine Learning at this Las Vegas Event - Small Business Trends [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Seton Hall Announces New Courses in Text Mining and Machine Learning - Seton Hall University News & Events [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Looking at the most significant benefits of machine learning for software testing - The Burn-In [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Leveraging AI and Machine Learning to Advance Interoperability in Healthcare - - HIT Consultant [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Adventures With Artificial Intelligence and Machine Learning - Toolbox [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Five Reasons to Go to Machine Learning Week 2020 - Machine Learning Times - machine learning & data science news - The Predictive Analytics Times [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Uncover the Possibilities of AI and Machine Learning With This Bundle - Interesting Engineering [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Learning that Targets Millennial and Generation Z - HR Exchange Network [Last Updated On: January 23rd, 2020] [Originally Added On: January 23rd, 2020]
- Red Hat Survey Shows Hybrid Cloud, AI and Machine Learning are the Focus of Enterprises - Computer Business Review [Last Updated On: January 23rd, 2020] [Originally Added On: January 23rd, 2020]
- Vectorspace AI Datasets are Now Available to Power Machine Learning (ML) and Artificial Intelligence (AI) Systems in Collaboration with Elastic -... [Last Updated On: January 23rd, 2020] [Originally Added On: January 23rd, 2020]
- What is Machine Learning? | Types of Machine Learning ... [Last Updated On: January 23rd, 2020] [Originally Added On: January 23rd, 2020]
- How Machine Learning Will Lead to Better Maps - Popular Mechanics [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]
- Jenkins Creator Launches Startup To Speed Software Testing with Machine Learning -- ADTmag - ADT Magazine [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]
- An Open Source Alternative to AWS SageMaker - Datanami [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]
- Machine Learning Could Aid Diagnosis of Barrett's Esophagus, Avoid Invasive Testing - Medical Bag [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]
- OReilly and Formulatedby Unveil the Smart Cities & Mobility Ecosystems Conference - Yahoo Finance [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]