Detection of diabetic patients in people with normal fasting glucose … – BMC Medicine

Data collection and processing

The physical examination data were derived from three hospitals: the First Affiliated Hospital of Wannan Medical College, Beijing Luhe Hospital of Capital Medical University, and Daqing Oilfield General Hospital. The three datasets were named D1, D2, and D3, respectively. The first step was data cleaning, in which samples with missing or abnormal values were excluded. According to the WHO criteria for diagnosing prediabetes and diabetes, we screened the samples with normal fasting glucose (< 6.1 mmol/L) and classified them into two groups by HbA1c level with a threshold of 6.5%: diabetes patients (HbA1c ≥ 6.5%) and normal/healthy samples. After preprocessing, 61,059, 369, and 3247 samples were retained in D1, D2, and D3, which contained 603, 3, and 21 subjects with diabetes (the positive samples), respectively. We then split D1 into a training set, validation set, and test set at a 6:1:3 ratio using stratified random sampling. D2 and D3 were used as newly recruited independent test sets.

All datasets contained 27 physical examination characteristics, including sex, age, height, body mass index (BMI), fasting blood glucose (FBG), white blood cell count (WBC), neutrophil (NEU), absolute neutrophil count (ANC), lymphocyte (LYM), absolute lymphocyte count (ALC), monocyte (MONO), absolute monocyte count (AMC), eosinophil (EOS), absolute eosinophil count (AEC), basophil (BASO), absolute basophil count (ABC), hemoglobin (HGB), hematocrit (HCT), mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), red cell distribution width (RDW), platelets (PLT), mean platelet volume (MPV), platelet distribution width (PDW), plateletcrit (PCT), red blood cell count (RBC), and mean corpuscular hemoglobin concentration (MCHC).

Given the severe class imbalance of all datasets, the SMOTE (synthetic minority over-sampling technique) method was employed on the training set to oversample the positive samples. SMOTE generates new minority-class samples by interpolating between a sample and its k-nearest neighbors [22], making the positive samples as numerous as the negative samples in the training set. The process was implemented with the imblearn package in Python. Finally, we applied Z-score normalization to all datasets, with the mean and standard deviation calculated from the training set.
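The interpolation step that SMOTE performs can be sketched without the imblearn package; the following is a minimal NumPy illustration of the idea (the function name and toy data are our own, not from the study):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation
    between a random minority sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(rng)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))           # pick a random minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]            # its k nearest neighbors (skip itself)
        neighbour = X_min[rng.choice(nn)]
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic[j] = X_min[i] + gap * (neighbour - X_min[i])
    return synthetic

# toy minority class: 10 samples, 3 features
X_min = np.random.default_rng(0).normal(size=(10, 3))
X_new = smote_oversample(X_min, n_new=20, rng=1)
print(X_new.shape)  # (20, 3)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled set stays inside the convex hull of the minority class.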

With the physical examination data, we present a computational framework for identifying diabetic patients with NFG, as shown in Fig. 1. First, we preprocessed the three datasets D1, D2, and D3 as introduced above; D1 was divided into training, validation, and test sets at 6:1:3, while D2 and D3 served as independent test sets for evaluating the final model. In view of the class imbalance of the datasets, we applied an oversampling method to the training set. Then, multiple widely used machine learning methods, including logistic regression (LR), random forest (RF), support vector machine (SVM), and deep neural network (DNN), were exploited to construct the predictor. Next, we applied feature selection to the best of the four predictors to improve the feasibility of the tool and assessed its performance on the independent test sets. Finally, feature importance analysis was used to screen the variables most relevant to the incidence of diabetes. We also devised a framework for identifying the risk factors of diabetes at the individual level and developed an online tool to boost its clinical practice.

Overview of the DRING approach

To build the predictive model, four machine learning methods were initially employed: LR, RF, SVM, and DNN. LR is a variation of linear regression prominently used in classification tasks [23]; it finds the best fit describing the linear relationship between the response variable and the input features and then converts the output to a probability with a sigmoid function. RF is composed of numerous decision trees, which are essentially collections of if-then conditions [24]. A decision tree recursively splits the data into subsets based on the best feature and criterion until a stopping criterion is met. In RF, each decision tree is independently trained on a random subset of samples and features, which reduces the risk of overfitting; the final decision is a vote over all trees, improving the overall accuracy and robustness of the model. SVM, one of the most popular machine learning methods, classifies samples by finding a hyperplane in feature space that maximizes the margin between points from different classes [25]. It can handle non-linearly separable data by using kernels such as the linear, polynomial, and radial basis functions, which map the original feature space into a high-dimensional space. The LR, RF, and SVM models were constructed with the scikit-learn package in Python 3.8, using default parameters during training. A DNN [26] contains an input layer, hidden layers, and an output layer, each holding many neurons, with neurons in adjacent layers connected. In a DNN, each connection is generally a linear transformation followed by an activation function. Here, we used the ReLU function to activate the linear neurons and a softmax function to output the prediction result. In addition, we used dropout and L2 regularization in the hidden layers to prevent overfitting. Residual blocks were also added to the DNN to simplify training.
The DNN was implemented with the PyTorch package. In this study, the DNN model achieved its best performance with 6 layers and an initial learning rate of 0.0018. The loss on the training and validation sets is depicted in Additional file 1: Fig. S1. We chose the model with the best performance on the validation set for further optimization.
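As a rough illustration of the setup above, the sketch below trains the three scikit-learn models on synthetic data standing in for D1, using the same 6:1:3 stratified split; the data, variable names, and the raised `max_iter` for LR (so the toy example converges) are our assumptions, not the study's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 27))                  # 27 examination features (synthetic)
y = (X[:, 0] + X[:, 1] + rng.normal(size=1000) > 2).astype(int)

# stratified 6:1:3 split: carve off the 30% test set, then 1/7 of the rest as validation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=1/7, stratify=y_tr, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_va, y_va)       # validation accuracy
print(scores)
```

In practice the model would be selected on balanced metrics rather than raw accuracy, given the class imbalance discussed below.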

Machine learning models for classification tasks are commonly evaluated by multiple well-established metrics, for example, sensitivity, accuracy, and the area under the receiver operating characteristic curve (AUC). Given the severely unbalanced classes of the validation and test sets, we used sensitivity, specificity, balanced accuracy, AUC, and the area under the precision-recall curve (PR-AUC) to evaluate the models, calculated with the following formulas.

$$\mathrm{Sensitivity}=\mathrm{TPR}=\frac{TP}{TP+FN}$$

(1)

$$\mathrm{Specificity}=\mathrm{TNR}=\frac{TN}{TN+FP}$$

(2)

$$\mathrm{Balanced\ accuracy}=\frac{TPR+TNR}{2}$$

(3)

TP (true positive) is the number of correctly classified diabetes patients. FP (false positive) denotes the number of normal subjects who were predicted as diabetic. TN (true negative) represents the number of correctly classified healthy subjects. FN (false negative) is the number of diabetes patients who were classified as healthy. All the above metrics range from 0 to 1.
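Formulas (1)-(3) reduce to a few lines of Python; the confusion-matrix counts below are hypothetical, chosen only to match D3's class sizes (21 positives, 3226 negatives), not actual results:

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity (TPR), specificity (TNR), and balanced accuracy
    from confusion-matrix counts, per formulas (1)-(3)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

# hypothetical counts: 18 of 21 diabetes patients and 2900 of 3226 healthy correct
sens, spec, bacc = classification_metrics(tp=18, fp=326, tn=2900, fn=3)
print(round(sens, 3), round(spec, 3), round(bacc, 3))  # 0.857 0.899 0.878
```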

Although the predictive model based on 27 features performed considerably well, possibly redundant or noisy features may still affect decision making. To maximize the effective information of the features and simplify the model, we used manual curation and max-relevance and min-redundancy (mRMR) [27] to extract key features for the final model. For manual curation, we first selected the features with significant differences between the positive and negative samples. To enhance the stability of the predictive model, we then removed the features causing severe collinearity; as a result, 13 features were retained. For consistency's sake, the size of the feature subset was also set to 13 when performing the mRMR analysis. Feature selection was executed on the training set to reduce the risk of overfitting. Analysis of feature importance can interpret the prediction model and discover the features most relevant to diabetes. Here, the importance of each feature was measured by its corresponding weight coefficient in the LR model.
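The greedy mRMR idea can be sketched as follows; note this simplification uses absolute Pearson correlation in place of mutual information, so it approximates the criterion in [27] rather than reproducing the study's implementation:

```python
import numpy as np

def mrmr_select(X, y, n_select):
    """Greedy mRMR-style selection: at each step add the feature with the
    highest relevance to the label minus its mean redundancy with the
    features already chosen (both measured by |Pearson correlation|)."""
    n_feat = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]       # start with the most relevant feature
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 27))                   # toy stand-in for the 27 features
y = (X[:, 0] + X[:, 5] > 0).astype(int)
selected = mrmr_select(X, y, n_select=13)        # mirror the 13-feature subset
print(selected)
```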

We developed an online tool, DRING (http://www.cuilab.cn/dring), based on the predictive models with the 13 features filtered by manual curation and mRMR, where the former is the preferred option. The backend of the website was implemented in Python 2.7, and the interactive pages were built with a combination of HTML5, Bootstrap 4, and JavaScript.

Feature importance analysis can help to explain the model; however, it fails to explore the risk factors for incident diabetes at the individual level. To find the potential risk factors for a specific individual, we drew on the permutation feature importance (PFI) algorithm [24, 28], which is designed to quantify the importance of each variable of a dataset. Here, we adapted PFI to assess the contributions of the features of an individual. Specifically, it involves the following four steps: (1) given a feature vector, we first create a series of random permutations for one of the features based on the input dataset; (2) we then calculate the prediction result for each new feature vector; (3) the contribution of the permuted feature is defined as formula (4):

$$P=\left|{P}_{r}-\frac{1}{k}\sum_{i=1}^{k}{P}_{i}\right|$$

(4)

\(P_r\) is the risk score for diabetes calculated with the initial feature vector, here the predicted probability of diabetes; \(P_i\) is the prediction result of the ith permutation, and \(k\) is the number of permutations; (4) the above steps are performed iteratively for each feature. Here, we set k to 100,000. A feature with a higher value contributes more to the risk of diabetes.
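The four steps and formula (4) can be sketched as follows; the toy logistic "model", the function names, and the reduced k (far below the study's 100,000, for speed) are all illustrative assumptions:

```python
import numpy as np

def individual_pfi(predict, x, X_ref, k=1000, rng=None):
    """Per-individual feature contributions adapted from permutation feature
    importance: for each feature, replace the individual's value with k values
    drawn from the reference dataset and measure |P_r - mean(P_i)| (formula 4)."""
    rng = np.random.default_rng(rng)
    p_r = predict(x)                              # risk score on the original vector
    contrib = np.empty(len(x))
    for f in range(len(x)):
        draws = rng.choice(X_ref[:, f], size=k)   # permuted values for feature f
        p_i = np.empty(k)
        for i, v in enumerate(draws):
            x_new = x.copy()
            x_new[f] = v
            p_i[i] = predict(x_new)
        contrib[f] = abs(p_r - p_i.mean())
    return contrib

# toy logistic "model": feature 0 dominates the risk score, feature 2 is inert
w = np.array([2.0, 0.1, 0.0])
predict = lambda x: 1 / (1 + np.exp(-x @ w))
X_ref = np.random.default_rng(0).normal(size=(500, 3))
x = np.array([1.5, -0.2, 0.8])
contrib = individual_pfi(predict, x, X_ref, k=500, rng=1)
print(np.argmax(contrib))  # feature 0 contributes most
```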
