Dataset
The dataset was originally reported by Ravel et al.16. The study was registered at clinicaltrials.gov under ID NCT00576797. The protocol was approved by the institutional review boards at Emory University School of Medicine, Grady Memorial Hospital, and the University of Maryland School of Medicine. Written informed consent was obtained by the authors of the original study.
Samples were taken from 394 asymptomatic women. 97 of these patients were categorized as positive for BV, based on Nugent score. In the preprocessing of the data, information about community group, ethnicity, and Nugent score was removed from the training and testing datasets. Ethnicity information was stored to be referenced later during the ethnicity-specific testing. 16S rRNA values were listed as a percentage of the total 16S rRNA sample, so those values were normalized by dividing by 100. pH values ranged on a scale from 1 to 14 and were normalized by dividing by 14.
Each experiment was run 10 times, with a different random seed defining the shuffle state, to gauge variance of performance.
Four supervised machine learning models were evaluated. Logistic regression (LR), support vector machine (SVM), random forest (RF), and multi-layer perceptron (MLP) models were implemented with the scikit-learn Python library. LR fits a decision boundary that separates the data into two classes. SVM finds a hyperplane that maximizes the margin between two classes. These methods were implemented to test whether boundary-based models can perform fairly among different ethnicities. RF is a model that creates an ensemble of decision trees and was implemented to test how a decision-based model would classify each patient. MLP passes information through layers of nodes and adjusts the weights and biases of each node to optimize its classification. MLP was implemented to test how fairly a neural network-based approach would perform on the data.
Five-fold stratified cross-validation was used to prevent overfitting and to ensure that each ethnicity had at least two positive cases in the test folds. Data were stratified by a combination of ethnicity and diagnosis so that every fold contained each group with a comparable distribution.
For each supervised machine learning model, hyperparameter tuning was performed using the grid search method from the scikit-learn Python library. Within the training subset of each cross-validation fold, nested cross-validation with 4 folds and 2 repeats was used.
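The stratification described above can be sketched as follows. This is a minimal illustration on hypothetical toy data, not the authors' code: the arrays `ethnicity`, `diagnosis`, and `X`, the helper `make_folds`, and the group sizes are all assumptions. It uses scikit-learn's `StratifiedKFold` on a label that combines ethnicity and diagnosis, with the random seed controlling the shuffle.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 4 ethnicity groups x 2 diagnoses, 10 samples each.
ethnicity = np.repeat(["Asian", "Black", "Hispanic", "White"], 20)
diagnosis = np.tile(np.repeat([0, 1], 10), 4)
X = np.zeros((len(diagnosis), 3))  # placeholder feature matrix

# Combine both attributes into one label so folds are stratified on the pair.
strata = np.array([f"{e}_{d}" for e, d in zip(ethnicity, diagnosis)])


def make_folds(seed):
    """Return (train_idx, test_idx) pairs for one shuffled 5-fold split."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return list(cv.split(X, strata))


folds = make_folds(seed=0)

# Count positive cases per ethnicity in every test fold.
positives_per_fold = [
    {e: int(((ethnicity[test] == e) & (diagnosis[test] == 1)).sum())
     for e in np.unique(ethnicity)}
    for _, test in folds
]
print(positives_per_fold[0])
```

Calling `make_folds` with a different seed for each repetition gives a different shuffle per run while keeping the same per-group proportions in every fold.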
For Logistic Regression, the following hyper-parameters were tested: solver (newton-cg, lbfgs, liblinear) and the inverse of regularization strength C (100, 10, 1.0, 0.1, 0.01).
For SVM, the following hyper-parameters were tested: kernel (polynomial, radial basis function, sigmoid) and the inverse regularization parameter C (10, 1.0, 0.1, 0.01).
For Random Forest, the following hyper-parameters were tested: number of estimators (10, 100, 1000) and maximum features (square root and logarithm to base 2 of the number of features).
For Multi-layer Perceptron, the following hyper-parameters were tested: hidden layer size (3 hidden layers of 10, 30, and 10 neurons, or 1 hidden layer of 20 neurons), solver (stochastic gradient descent and Adam optimizer), regularization parameter alpha (0.0001 or 0.05), and learning rate (constant and adaptive).
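The grids listed above map onto scikit-learn parameter dictionaries roughly as in the sketch below. The dictionary `param_grids`, the synthetic data from `make_classification`, the `balanced_accuracy` scoring choice, and `max_iter=1000` are assumptions made for illustration. Only the LR grid is fitted here to keep the example fast, and the inner loop uses `RepeatedStratifiedKFold` with 4 folds and 2 repeats.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Hyper-parameter grids as listed in the text, written as scikit-learn names.
param_grids = {
    "LR": {"solver": ["newton-cg", "lbfgs", "liblinear"],
           "C": [100, 10, 1.0, 0.1, 0.01]},
    "SVM": {"kernel": ["poly", "rbf", "sigmoid"],
            "C": [10, 1.0, 0.1, 0.01]},
    "RF": {"n_estimators": [10, 100, 1000],
           "max_features": ["sqrt", "log2"]},
    "MLP": {"hidden_layer_sizes": [(10, 30, 10), (20,)],
            "solver": ["sgd", "adam"],
            "alpha": [0.0001, 0.05],
            "learning_rate": ["constant", "adaptive"]},
}

# Synthetic stand-in for the training subset of one outer fold (assumption).
X_train, y_train = make_classification(n_samples=120, n_features=10,
                                       random_state=0)

# Inner loop: 4 folds, 2 repeats, run only on the training subset.
inner_cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grids["LR"],
                      cv=inner_cv,
                      scoring="balanced_accuracy")
search.fit(X_train, y_train)
best_params = search.best_params_
print(best_params)
```

The fitted `search` object then predicts on the held-out outer fold with the best parameter combination, so the test fold never influences tuning.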
The models were evaluated using the following metrics: balanced accuracy, average precision, false positive rate (FPR), and false negative rate (FNR). Balanced accuracy was chosen to better capture the practical performance of the models on an unbalanced dataset. Average precision is an estimate of the area under the precision-recall curve, analogous to AUC, which is the area under the receiver operating characteristic (ROC) curve. The precision-recall curve is used instead of the ROC curve to better capture the performance of the models on an unbalanced dataset39. Previous studies with this dataset reveal particularly good AUC scores and accuracy, which is to be expected with a highly unbalanced dataset.
The precision-recall curve was generated using the true labels and predicted probabilities from every fold of every run to summarize the overall precision-recall performance for each model. Balanced accuracy and average precision were computed using the corresponding functions found in the sklearn.metrics package. FPR and FNR were computed using the equations below39.
Below are the equations for the metrics used to test the Supervised Machine Learning models:
$${Precision}=\frac{{TP}}{{TP}+{FP}}$$
(1)
$${Recall}=\frac{{TP}}{{TP}+{FN}}$$
(2)
$${Balanced}\,{Accuracy}=\frac{1}{2}\left(\frac{{TP}}{{TP}+{FN}}+\frac{{TN}}{{TN}+{FP}}\right)$$
(3)
$${FPR}=\frac{{FP}}{{FP}+{TN}}$$
(4)
$${FNR}=\frac{{FN}}{{FN}+{TP}}$$
(5)
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
$${Average}\,{Precision}=\sum _{n}\left({R}_{n}-{R}_{n-1}\right){P}_{n}$$
(6)
where R denotes recall, and P denotes precision.
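Equations 1 to 6 can be written out directly in plain Python, as in the sketch below. The function names `confusion_counts`, `rates`, and `average_precision` are illustrative, not from the study, which used `sklearn.metrics` for balanced accuracy and average precision. The sketch assumes binary 0/1 labels, at least one positive and one negative sample, at least one positive prediction, and no tied scores in the average-precision sum.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def rates(y_true, y_pred):
    """Equations 1-5 computed from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "balanced_accuracy": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }


def average_precision(y_true, scores):
    """Equation 6: sum of (R_n - R_{n-1}) * P_n over ranked samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp, prev_recall, ap = 0, 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        tp += y_true[i]
        precision = tp / rank
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


example = rates([1, 1, 0, 0, 0, 1], [1, 0, 0, 1, 0, 1])
ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example, ap)
```

In the first example TP = 2, FN = 1, FP = 1, and TN = 2, so FPR and FNR are both 1/3 and balanced accuracy is 2/3.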
The performance of the models was compared as previously stated. Once a model made its predictions, the stored ethnicity information was used to identify which ethnicity each predicted label and actual label belonged to. These per-ethnicity subsets were then used as inputs for the metrics functions.
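A minimal sketch of this per-ethnicity evaluation step is shown below, assuming predictions, true labels, and stored ethnicity labels are aligned lists. The helper names `group_by_ethnicity` and `false_negative_rate` and the toy values are hypothetical, and each group is assumed to contain at least one positive case.

```python
from collections import defaultdict


def group_by_ethnicity(y_true, y_pred, ethnicity):
    """Split aligned true/predicted labels into per-ethnicity subsets."""
    groups = defaultdict(lambda: ([], []))
    for t, p, e in zip(y_true, y_pred, ethnicity):
        groups[e][0].append(t)
        groups[e][1].append(p)
    return dict(groups)


def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP), computed on one subset."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fn / (fn + tp)


# Toy values; ethnicity was stored separately and never shown to the model.
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1]
ethnicity = ["Asian", "Asian", "Asian", "Black", "Black", "Black"]

subsets = group_by_ethnicity(y_true, y_pred, ethnicity)
fnr_by_group = {e: false_negative_rate(t, p) for e, (t, p) in subsets.items()}
print(fnr_by_group)
```

The same subsets can be passed to any other metric function, which makes gaps between groups directly comparable.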
To see how training on data containing one ethnicity affects the performance and fairness of the model, an SVM model was trained on subsets that each contained only one ethnicity. Information on which ethnicity each datapoint belonged to was not given to the models.
To increase the performance and accuracy of the model, several feature selection methods were used to reduce the 251 features used to train the machine learning models. The resulting feature sets were then used to achieve similar or higher accuracy with the same machine learning models. The feature selection methods were the ANOVA F-test, the two-sided T-Test, Point Biserial correlation, and Gini impurity. The libraries used for these feature selection tests were the statistics and scikit-learn packages in Python. Each feature test was performed with all ethnicities, then with only the white subset, only the Black subset, only the Asian subset, and only the Hispanic subset.
The ANOVA F-Test was used to select 50 features with the highest F-value. The function used calculates the ANOVA F-value between the feature and target variable using variance between groups and within the groups. The formula used to calculate this is defined as:
$$F=\frac{{SSB}/(k-1)}{{SSW}/(n-k)}$$
(7)
where k is the number of groups, n is the total sample size, SSB is the sum of squares between groups, and SSW is the sum of squares within groups, so that the numerator and denominator are the between-group and within-group variances.
The two-tailed T-Test was used to compare the rRNA data of the BV-negative and BV-positive groups. The two-tailed T-Test compares the means of two independent groups. Its null hypothesis is that the means of the two groups are equal, while the alternative hypothesis is that they are not equal. The dataset was split into BV-negative and BV-positive samples, and the mean of each feature was compared between the two groups to find significant differences. A p-value <0.05 allows us to reject the null hypothesis that the means of the two groups are the same, indicating a significant difference between the positive and negative groups for that feature. Thus, we use a p-value of less than 0.05 to select important features. The number of features selected was between 40 and 75, depending on the ethnicity group used. The formula for finding the t-value is defined as:
$$t=\frac{\left({\bar{x}}_{1}-{\bar{x}}_{2}\right)}{\sqrt{\frac{({{s}_{1}})^{2}}{{n}_{1}}+\frac{({{s}_{2}})^{2}}{{n}_{2}}}}$$
(8)
where \(\bar{x}_{1}\) and \(\bar{x}_{2}\) are the means of the two groups, \(s_{1}\) and \(s_{2}\) are their standard deviations, and \(n_{1}\) and \(n_{2}\) are the numbers of samples in the two groups. The p-value is then obtained from the t-value using the cumulative distribution function (CDF) of the t-distribution, which gives the area under the curve of its probability density. The degrees of freedom are also needed to calculate the p-value. They reflect the number of independent observations available for the estimate, with a higher number giving a more precise result. The formulas are defined as:
$${\rm{df}}={n}_{1}+{n}_{2}-2$$
(9)
$${p}=2\cdot \left(1-{\rm{CDF}}\left(\left|t\right|,{\rm{df}}\right)\right)$$
(10)
where \({\rm{df}}\) denotes the degrees of freedom and \(n_{1}\) and \(n_{2}\) are the numbers of samples in the two groups.
The Point Biserial correlation test is used to compare categorical against continuous data. For our dataset, it was used to compare the categorical BV-negative or BV-positive classification against the continuous rRNA bacterial data. Each feature has an associated p-value and correlation value. Features were first restricted to those with a p-value below an alpha of 0.2 and then further restricted to those with correlation values >0.5, which indicate a strong correlation. The alpha value is the significance threshold below which a p-value is considered significant. An alpha of 0.2 was chosen because the Point Biserial test tends to return higher p-values. The formula is defined as:
$${r}_{pb}=\frac{\left({M}_{1}-{M}_{0}\right)}{s}\,\sqrt{pq}$$
(11)
where M1 is the mean of the continuous variable for the categorical variable with a value of 1; M0 is the mean of the continuous variable for the categorical variable with a value of 0; s denotes the standard deviation of the continuous variable; p is the proportion of samples with a value of 1 to the sample set; and q is the proportion of samples with a value of 0 to the sample set.
Two feature sets were made from the Point Biserial test. The first feature set included only the features that were statistically significant at a p-value of <0.2, which returned 60 to 100 significant features depending on the ethnicity set used. The second feature set included features restricted by both a p-value <0.2 and a correlation value greater than 0.5. This second feature set contained 8 to 15 features depending on the ethnicity set used.
Features were also selected using Gini impurity. Gini impurity measures the impurity of a node at which a binary split is made, that is, the probability of misclassifying a randomly chosen data point in that node. For this method, a Random Forest model was fitted to the dataset, and each feature was scored by the reduction in Gini impurity it produced when used to split nodes. The larger the reduction in impurity after the split, the more important the feature is in predicting the target variable. The Gini impurity value varies between 0 and 1. Using Gini, the total number of features was reduced to 3 to 10 features when using the ethnicity-specific sets and 20 features when using all ethnicities. The formula is defined as:
$${Gini}=1-\sum _{i}{{p}_{i}}^{2}$$
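The correlation filter behind the second feature set can be sketched with Python's `statistics` module, following Equation 11. This is an illustrative sketch only: `point_biserial`, `select_by_correlation`, and the toy feature names are hypothetical, \(s\) is taken as the population standard deviation (an assumption, under which the result equals the Pearson correlation), and the p-value filter at alpha = 0.2 is omitted.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(values, labels):
    """Equation 11 with s as the population standard deviation (assumption)."""
    ones = [v for v, lab in zip(values, labels) if lab == 1]
    zeros = [v for v, lab in zip(values, labels) if lab == 0]
    p = len(ones) / len(values)   # proportion of samples labelled 1
    q = len(zeros) / len(values)  # proportion of samples labelled 0
    return (mean(ones) - mean(zeros)) / pstdev(values) * sqrt(p * q)


def select_by_correlation(features, labels, threshold=0.5):
    """Keep feature names whose correlation exceeds the threshold."""
    return [name for name, values in features.items()
            if point_biserial(values, labels) > threshold]


# Toy example: one feature tracks the label, the other does not.
labels = [0, 0, 1, 1]
features = {"taxon_a": [1, 2, 3, 4], "taxon_b": [2, 1, 2, 1]}

r_a = point_biserial(features["taxon_a"], labels)
selected = select_by_correlation(features, labels)
print(round(r_a, 4), selected)
```

For `taxon_a`, the group means are 3.5 and 1.5, the standard deviation is about 1.118, and p = q = 0.5, giving a correlation of about 0.894, so only that feature passes the 0.5 cutoff.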
(12)
where \(p_{i}\) is the proportion of each class in the node.
The five sets of selected features, one from each of the five ethnicity sets, were used to train models with the four supervised machine learning algorithms (LR, MLP, RF, SVM) on the full dataset using our nested cross-validation scheme as previously described. All features were selected using the training sets only and were then applied to the test sets for testing. Five-fold stratified cross-validation was used for each model to gather performance statistics, including means and confidence intervals.
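One common way to guarantee that features are selected on the training folds only is to place the selector inside a scikit-learn `Pipeline`, so it is refitted within each fold. The sketch below is an assumption about how this could be done, not the authors' code: it uses synthetic data with 251 features, `SelectKBest` with `f_classif` to keep the 50 highest F-values, and LR as a stand-in classifier.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in with the same number of features as the dataset.
X, y = make_classification(n_samples=200, n_features=251, n_informative=10,
                           random_state=0)

# The selector sits inside the pipeline, so each fold fits it on its own
# training split and only applies the chosen columns to the test split.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=50)),
    ("clf", LogisticRegression(max_iter=1000)),
])

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=outer_cv, scoring="balanced_accuracy")
mean_score = scores.mean()

# Fitting once on all data shows how many features the selector keeps.
n_selected = int(pipe.fit(X, y).named_steps["select"].get_support().sum())
print(mean_score, n_selected)
```

Selecting features on the full dataset before splitting would leak test-set information into training, which is the problem this arrangement avoids.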
Originally posted here:
Ethnic disparity in diagnosing asymptomatic bacterial vaginosis ... - Nature.com