In this subsection, we present details on how we process the dataset, turn it into a network graph and finally how we produce, and process features that belong to the graph. Topics to be covered are:
splitting the data,
preprocessing,
feature importance and selection,
computation of similarity between samples, and
generating of the raw graph.
After preprocessing the data, the next step is to split the dataset into training and test samples for validation purposes. We selected cross-validation (CV) as the validation method since it is the de facto standard in ML research. For CV, the full dataset is split into k folds; and the classifier model is trained using data from (k-1) folds then tested on the remaining kth fold. Eventually, after k iterations,, average performance scores (like F1 measure or ROC) of all folds are used to benchmark the classifier model.
A crucial step of CV is selecting the right proportion between the training and test subsamples, i.e., number of folds. Determining the most appropriate number of folds k for a given dataset is still an open research question17, besides de facto standard for selecting k is accumulated around k=2, k=5, or k=10. To address the selection of the right fold size, we have identified two priorities:
Priority 1Class Balance: We need to consider every split of the dataset needs to be class-balanced. Since the number of class types has a restrictive effect on selecting enough similar samples, detecting the effective number of folds depends heavily on this parameter. As a result, whenever we deal with a problem which has low represented class(es), we selected k=2.
Priority 2High Representation: In our model, briefly, we build a network from the training subsamples. Efficient network analysis depends on the size (i.e., number of nodes) of the network. Thus, maximize training subsamples with enough representatives from each class (diversity) is our priority as much as we can when splitting the dataset. This way we can have more nodes. In brief, whenever we do not cross priority 1, we selected k=5.
By balancing these two priorities, we select efficient CV fold size by evaluating the characteristics of each datasets in terms of their sample size and the number of different classes. The selected fold value for each dataset will be specified in the Experiments and results section. To fulfill the class balancing priority, we employed stratified sampling. In this model, each CV fold contains approximately the same percentage of samples of each target class as the complete set.
Preprocessing starts with the handling of missing data. For this part, we preferred to omit all samples which have one or more missing feature(s). By doing this, we have focused merely on developing the model, skipping trivial concerns.
As stated earlier, GSNAc can work on datasets that may have both numerical and categorical values. To ensure proper processing of those data types, as a first step, we separate numerical and categorical features18. First, in order to process them mathematically, categorical (string) features are transformed into unique integers for each unique category by a technique called labelization. It is worth noting that, against the general approach, we do not use the one-hot-encoding technique for transforming categorical features, which is the method of creating dummy binary-valued features. Labelization does not generate extra features, whereas one-hot-encoding extend the number of features.
For the numerical part, as a very important stage of preprocessing, scaling19 of the features follows. Scaling is beneficial since the features may have a very different range and this might affect scale-dependent processes like distance computation. We have two generally accepted scaling techniques which are normalization and standardization. Normalization transforms features linearly into a closed range like [0, 1], which does not affect the variation of values among features. On the other hand, standardization transforms the feature space into a distribution of values that are centered around the mean with a unit standard deviation. This way, the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation. Since GSNAc is heavily dependent on vectorial distances, we do not prefer to lose the structure of the variation within features and this way our choice for scaling the features becomes normalization. Here, it is worth mentioning that all the preprocessing is applied on the training part of the data and transformed on the test data, ensuring no data leakage occurs.
Feature Importance (FI) broadly refers to the scoring of features based on their usefulness in prediction. It is obvious that in any problem some features might be more definitive in terms of their predictive capability of the class. Moreover, a combination of some features may have a higher effect than others in total than the sum of their capacity in this sense. FI models, in general, address this type of concern. Indeed, almost all ML classification algorithms use a FI algorithm under the hood; since this is required for the proper weighting of features before feeding data into the model. It is part of any ML classifier and GSNAc. As a scale-sensitive model, vectorial similarity needs to benefit much from more distinctive features.
For computing feature importance, we preferred to use an off-the-shelf algorithm, which is a supervised k-best feature selection18 method. The K-best feature selection algorithm simply ranks all features by evaluating features ANOVA analysis against class labels. ANOVA F-value analyzes the variance between each feature and its respective class and computes F-value which is the ratio of the variation between sample means, over the variation within the samples. This way, it assigns F values as features importance. Our general strategy is to keep all features for all the datasets, with an exception for genomic datasets, that contain thousands of features, we practiced omitting. For this reason, instead of selecting some features, we prefer to keep all and use the importance learned at this step as the weight vector in similarity calculation.
In this step, we generate an undirected network graph G, its nodes will be the samples and its edges will be constructed using the distance metrics20 between feature values of the samples. Distances will be converted to similarity scores to generate an adjacency matrix from the raw graph. As a crucial note, we state that since we aim to predict test samples by using G, in each batch, we only process the training samples.
In our study for constructing a graph from a dataset we defined edge weights as the inverse of the Euclidean distance between the sample vectors. Simply, Euclidean distance (also known as L2-norm) gives the unitless straight line (shortest) distance between two vectors in space. In formal terms, for f-dimensional vectors u and v, Euclidean distance is defined as:
$$dleft(u,vright)=sqrt[2]{sum_{f}{left({u}_{i}-{v}_{i}right)}^{2}}$$
A slightly modified use of the Euclidean distance is introducing the weights for dimensions. Recall from the discussion of the feature importance in the former sections, some features may carry more information than others. So, we addressed this factor by computing a weighted form of L2 norm based on distance which is presented as:
$${dist_L2}_{w}left(u,vright)=sqrt[2]{sum_{f}{{w}_{i}({u}_{i}-{v}_{i})}^{2}}$$
where w is the n-dimensional feature importance vector and i iterates over numerical dimensions.
The use of the Euclidean distance is not proper for the categorical variables, i.e. it is ambiguous and not easy to find how much a canarys habitat sky is distant from a sharks habitat sea. Accordingly, whenever the data contains categorical features, we have changed the distance metric accordingly to L0 norm. L0 norm is 0 if categories are the same; it is 1 whenever the categories are different, i.e., between the sky and the sea L0 norm is 1, which is the maximum value. Following the discussion of weights for features, the L0 norm is also computed in a weighted form as ({dist_L0}_{w}left(u,vright)=sum_{f}{w}_{j}(({u}_{j}ne {v}_{j})to 1)), where j iterates over categorical dimensions.
After computing the weighted pairwise distance between all the training samples, we combine numerical and categorical parts as: ({{dist}_{w}left(u,vright)}^{2}={{dist_L2}_{w}left(u,vright)}^{2}+ {{dist_L0}_{w}left(u,vright)}^{2}). With pairwise distances for each pair of samples, we get a n x n square and symmetric distance matrix D, where n is the number of training samples. In matrix D, each element shows the distance between corresponding vectors.
$$D= left[begin{array}{ccc}0& cdots & d(1,n)\ vdots & ddots & vdots \ d(n,1)& cdots & 0end{array}right]$$
We aim to get a weighted network, where edge weights represent the closeness of its connected nodes. We need to first convert distance scores to similarity scores. We simply convert distances to similarities by subtracting the maximum distance on distances series from each element.
$$similarity_s(u,v)=mathrm{max}_mathrm{value}_mathrm{of}(D)-{dist}_{w}left(u,vright)$$
Finally, after removing self-loops (i.e. setting diagonal elements of A to zero), we use adjacency matrix A to generate an undirected network graph G. In this step, we delete the lower triangular part (which is symmetric to the upper triangular part) to avoid redundancy. Note that, in transition from the adjacency matrix to a graph, the existence of a (positive) similarity score between two samples u and v creates an edge between them, and of course, the similarity score will serve as the vectorial weight of this particular edge in graph G.
$$A= left[begin{array}{ccc}-& cdots & s(1,n)\ vdots & ddots & vdots \ -& cdots & -end{array}right]$$
The raw graph generated in this step is a complete graph: that is, all nodes are connected to all other nodes via an edge having some weight. Complete graphs are very complex and sometimes impossible to analyze. For instance, it is impossible to produce some SNA metrics such as betweenness centrality in this kind of a graph.
- What Is Machine Learning? | How It Works, Techniques ... [Last Updated On: September 5th, 2019] [Originally Added On: September 5th, 2019]
- Start Here with Machine Learning [Last Updated On: September 22nd, 2019] [Originally Added On: September 22nd, 2019]
- What is Machine Learning? | Emerj [Last Updated On: October 1st, 2019] [Originally Added On: October 1st, 2019]
- Microsoft Azure Machine Learning Studio [Last Updated On: October 1st, 2019] [Originally Added On: October 1st, 2019]
- Machine Learning Basics | What Is Machine Learning? | Introduction To Machine Learning | Simplilearn [Last Updated On: October 1st, 2019] [Originally Added On: October 1st, 2019]
- What is Machine Learning? A definition - Expert System [Last Updated On: October 2nd, 2019] [Originally Added On: October 2nd, 2019]
- Machine Learning | Stanford Online [Last Updated On: October 2nd, 2019] [Originally Added On: October 2nd, 2019]
- How to Learn Machine Learning, The Self-Starter Way [Last Updated On: October 17th, 2019] [Originally Added On: October 17th, 2019]
- definition - What is machine learning? - Stack Overflow [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Artificial Intelligence vs. Machine Learning vs. Deep ... [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Machine Learning in R for beginners (article) - DataCamp [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Machine Learning | Udacity [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Machine Learning Artificial Intelligence | McAfee [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- Machine Learning [Last Updated On: November 3rd, 2019] [Originally Added On: November 3rd, 2019]
- AI-based ML algorithms could increase detection of undiagnosed AF - Cardiac Rhythm News [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- The Cerebras CS-1 computes deep learning AI problems by being bigger, bigger, and bigger than any other chip - TechCrunch [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- Can the planet really afford the exorbitant power demands of machine learning? - The Guardian [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- New InfiniteIO Platform Reduces Latency and Accelerates Performance for Machine Learning, AI and Analytics - Business Wire [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- How to Use Machine Learning to Drive Real Value - eWeek [Last Updated On: November 19th, 2019] [Originally Added On: November 19th, 2019]
- Machine Learning As A Service Market to Soar from End-use Industries and Push Revenues in the 2025 - Downey Magazine [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Rad AI Raises $4M to Automate Repetitive Tasks for Radiologists Through Machine Learning - - HIT Consultant [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Machine Learning Improves Performance of the Advanced Light Source - Machine Design [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Synthetic Data: The Diamonds of Machine Learning - TDWI [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- The transformation of healthcare with AI and machine learning - ITProPortal [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Workday talks machine learning and the future of human capital management - ZDNet [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Machine Learning with R, Third Edition - Free Sample Chapters - Neowin [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Verification In The Era Of Autonomous Driving, Artificial Intelligence And Machine Learning - SemiEngineering [Last Updated On: November 26th, 2019] [Originally Added On: November 26th, 2019]
- Podcast: How artificial intelligence, machine learning can help us realize the value of all that genetic data we're collecting - Genetic Literacy... [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- The Real Reason Your School Avoids Machine Learning - The Tech Edvocate [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- Siri, Tell Fido To Stop Barking: What's Machine Learning, And What's The Future Of It? - 90.5 WESA [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- Microsoft reveals how it caught mutating Monero mining malware with machine learning - The Next Web [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- The role of machine learning in IT service management - ITProPortal [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- Global Director of Tech Exploration Discusses Artificial Intelligence and Machine Learning at Anheuser-Busch InBev - Seton Hall University News &... [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- The 10 Hottest AI And Machine Learning Startups Of 2019 - CRN: The Biggest Tech News For Partners And The IT Channel [Last Updated On: November 28th, 2019] [Originally Added On: November 28th, 2019]
- Startup jobs of the week: Marketing Communications Specialist, Oracle Architect, Machine Learning Scientist - BetaKit [Last Updated On: November 30th, 2019] [Originally Added On: November 30th, 2019]
- Here's why machine learning is critical to success for banks of the future - Tech Wire Asia [Last Updated On: December 2nd, 2019] [Originally Added On: December 2nd, 2019]
- 3 questions to ask before investing in machine learning for pop health - Healthcare IT News [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- Machine Learning Answers: If Caterpillar Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- Measuring Employee Engagement with A.I. and Machine Learning - Dice Insights [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- Amazon Wants to Teach You Machine Learning Through Music? - Dice Insights [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- Machine Learning Answers: If Nvidia Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 8th, 2019] [Originally Added On: December 8th, 2019]
- AI and machine learning platforms will start to challenge conventional thinking - CRN.in [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning Answers: If Twitter Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning Answers: If Seagate Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning Answers: If BlackBerry Stock Drops 10% A Week, Whats The Chance Itll Recoup Its Losses In A Month? - Forbes [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Amazon Releases A New Tool To Improve Machine Learning Processes - Forbes [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Another free web course to gain machine-learning skills (thanks, Finland), NIST probes 'racist' face-recog and more - The Register [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Kubernetes and containers are the perfect fit for machine learning - JAXenter [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- TinyML as a Service and machine learning at the edge - Ericsson [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- AI and machine learning products - Cloud AI | Google Cloud [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning | Blog | Microsoft Azure [Last Updated On: December 23rd, 2019] [Originally Added On: December 23rd, 2019]
- Machine Learning in 2019 Was About Balancing Privacy and Progress - ITPro Today [Last Updated On: December 25th, 2019] [Originally Added On: December 25th, 2019]
- CMSWire's Top 10 AI and Machine Learning Articles of 2019 - CMSWire [Last Updated On: December 25th, 2019] [Originally Added On: December 25th, 2019]
- Here's why digital marketing is as lucrative a career as data science and machine learning - Business Insider India [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Dell's Latitude 9510 shakes up corporate laptops with 5G, machine learning, and thin bezels - PCWorld [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Finally, a good use for AI: Machine-learning tool guesstimates how well your code will run on a CPU core - The Register [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Cloud as the enabler of AI's competitive advantage - Finextra [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Forget Machine Learning, Constraint Solvers are What the Enterprise Needs - - RTInsights [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- Informed decisions through machine learning will keep it afloat & going - Sea News [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- The Problem with Hiring Algorithms - Machine Learning Times - machine learning & data science news - The Predictive Analytics Times [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- New Program Supports Machine Learning in the Chemical Sciences and Engineering - Newswise [Last Updated On: January 13th, 2020] [Originally Added On: January 13th, 2020]
- AI-System Flags the Under-Vaccinated in Israel - PrecisionVaccinations [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- New Contest: Train All The Things - Hackaday [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- AFTAs 2019: Best New Technology Introduced Over the Last 12 MonthsAI, Machine Learning and AnalyticsActiveViam - www.waterstechnology.com [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Educate Yourself on Machine Learning at this Las Vegas Event - Small Business Trends [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Seton Hall Announces New Courses in Text Mining and Machine Learning - Seton Hall University News & Events [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Looking at the most significant benefits of machine learning for software testing - The Burn-In [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Leveraging AI and Machine Learning to Advance Interoperability in Healthcare - - HIT Consultant [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Adventures With Artificial Intelligence and Machine Learning - Toolbox [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Five Reasons to Go to Machine Learning Week 2020 - Machine Learning Times - machine learning & data science news - The Predictive Analytics Times [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Uncover the Possibilities of AI and Machine Learning With This Bundle - Interesting Engineering [Last Updated On: January 22nd, 2020] [Originally Added On: January 22nd, 2020]
- Learning that Targets Millennial and Generation Z - HR Exchange Network [Last Updated On: January 23rd, 2020] [Originally Added On: January 23rd, 2020]
- Red Hat Survey Shows Hybrid Cloud, AI and Machine Learning are the Focus of Enterprises - Computer Business Review [Last Updated On: January 23rd, 2020] [Originally Added On: January 23rd, 2020]
- Vectorspace AI Datasets are Now Available to Power Machine Learning (ML) and Artificial Intelligence (AI) Systems in Collaboration with Elastic -... [Last Updated On: January 23rd, 2020] [Originally Added On: January 23rd, 2020]
- What is Machine Learning? | Types of Machine Learning ... [Last Updated On: January 23rd, 2020] [Originally Added On: January 23rd, 2020]
- How Machine Learning Will Lead to Better Maps - Popular Mechanics [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]
- Jenkins Creator Launches Startup To Speed Software Testing with Machine Learning -- ADTmag - ADT Magazine [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]
- An Open Source Alternative to AWS SageMaker - Datanami [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]
- Machine Learning Could Aid Diagnosis of Barrett's Esophagus, Avoid Invasive Testing - Medical Bag [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]
- OReilly and Formulatedby Unveil the Smart Cities & Mobility Ecosystems Conference - Yahoo Finance [Last Updated On: January 30th, 2020] [Originally Added On: January 30th, 2020]