This study was designed to use freely available open TB CXR datasets as training data for our AI algorithm. Subsequent accuracy analyses were performed using independent CXR datasets and actual TB cases from our hospital. All image data were de-identified to ensure privacy. This study was reviewed and approved by the institutional review board (IRB) of Kaohsiung Veterans General Hospital, which waived the requirement for informed consent (IRB no.: KSVGH23-CT4-13). This study adheres to the principles of the Declaration of Helsinki.
The flowchart of the study design is shown in Fig. 1. Because TB is highly prevalent and its imaging presentation varies, TB cannot be entirely excluded when a CXR shows pneumonia or other entities. Our preliminary research indicated that a model trained solely on TB vs. normal CXRs produced bimodally distributed predictive values. As a result, CXRs that were abnormal but not indicative of TB usually received predictive values that were either too high or too low, and the model failed to differentiate these abnormal cases from normal or TB cases. For common CXR abnormalities such as pneumonia and pleural effusion, the TB risk is lower, but not zero. We therefore trained two models on two different training datasets, one for TB detection and another for abnormality detection, and averaged their output predictive values.
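The averaging step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `combined_tb_score`, the assumption that each model emits a value in [0, 1], and the example values are all hypothetical.

```python
def combined_tb_score(p_tb_model: float, p_abnormal_model: float) -> float:
    """Average the predictive values of the two models (assumed to lie in [0, 1])."""
    for p in (p_tb_model, p_abnormal_model):
        if not 0.0 <= p <= 1.0:
            raise ValueError("predictive values are assumed to lie in [0, 1]")
    return (p_tb_model + p_abnormal_model) / 2.0


# Hypothetical outputs (TB model, abnormality model) for three CXRs.
examples = {
    "typical TB": (0.95, 0.90),
    "pneumonia": (0.10, 0.85),
    "normal": (0.05, 0.05),
}

for name, (p_tb, p_abn) in examples.items():
    print(f"{name}: combined score = {combined_tb_score(p_tb, p_abn):.3f}")
```

With values like these, an abnormal non-TB film lands between the normal and TB films instead of at one extreme, which is the behavior the two-model design aims for.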
Flow chart of model training and validations.
The features of the CXR datasets used for training are summarized in Table 1. The inclusion criteria are CXRs showing TB, other abnormalities, or normal findings. Both posteroanterior and anteroposterior view CXRs are included. The exclusion criteria are poor-quality CXRs, lateral view CXRs, pediatric CXRs, and CXRs with lesions too small to detect at a size of 224 × 224 pixels. All CXR images were confirmed by C.F.C. to ensure both image quality and correctness.
Training dataset 1 is used for training algorithms to detect the typical TB pattern on CXR. A total of 348 TB CXRs and 3806 normal CXRs were collected from various open datasets for training, including the Shenzhen dataset from Shenzhen No. 3 People's Hospital, the Montgomery dataset19,20, and Kaggle's RSNA Pneumonia Detection Challenge21,22.
Training dataset 2 is used for training algorithms to detect CXR abnormalities. A total of 1150 abnormal CXRs and 627 normal CXRs were collected from the ChestX-ray14 dataset23. The abnormal CXRs consisted of consolidation: 185, cardiomegaly: 235, pulmonary edema: 139, pleural effusion: 230, pulmonary fibrosis: 106, and mass: 255.
In this study, we employed Google Teachable Machine (GoogleTM)18, a free online AI software dedicated to image classification. GoogleTM provides a user-friendly web-based graphical interface that allows users to execute deep neural network computations and train image classification models with minimal coding requirements. By utilizing transfer learning, GoogleTM significantly reduces the computational time and amount of data required for deep neural network training. Within GoogleTM, the base model for transfer learning was MobileNet, a model pretrained by Google on the ImageNet dataset, which features 14 million images, and capable of recognizing 1,000 classes of images. Transfer learning is achieved by modifying the last 2 layers of the pre-trained MobileNet and then continuing with training on the specific image recognition task18,24. In GoogleTM, all images are resized and cropped to 224 × 224 pixels for training. Of the images, 85% are automatically assigned to the training dataset and the remaining 15% to the validation dataset used to calculate accuracy.
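The 85%/15% division can be illustrated with a short Python sketch. It only mimics the proportions described here; how GoogleTM performs the split internally (for example, per class or over the whole set) is not stated in the text, and the function name, the fixed seed, and the file identifiers are assumptions made for illustration.

```python
import random


def split_train_validation(items, train_fraction=0.85, seed=0):
    """Shuffle items and split them into training and validation subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers for the 348 TB + 3806 normal CXRs of training dataset 1.
image_ids = [f"cxr_{i:04d}" for i in range(348 + 3806)]
train_ids, val_ids = split_train_validation(image_ids)
print(len(train_ids), len(val_ids))
```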
The hardware employed in this study included a 12th-generation Intel Core i9-12900K CPU with 16 cores, operating at 3.2–5.2 GHz, an NVIDIA RTX A5000 GPU equipped with 24 GB of error-correction code (ECC) graphics memory, 128 GB of random-access memory (RAM), and a 4 TB solid-state disk (SSD).
To evaluate the accuracy of the algorithms, we collected clinical CXR data for TB, normal cases, and pneumonia/other disease from our hospital.
Validation dataset 1 included 250 de-identified CXRs retrospectively collected from Kaohsiung Veterans General Hospital (VGHKS). The CXRs were dated between January 1, 2010 and February 27, 2023. This dataset included 83 TB cases (81 confirmed by microbiology and 2 confirmed by pathology), 84 normal cases, and 83 abnormal other-than-TB cases (73 pneumonia, 14 pleural effusion, 10 heart failure, and 4 fibrosis; some cases had combined features). The image size of these CXRs ranged from 1760 to 4280 pixels in width and from 1931 to 4280 pixels in height.
Validation dataset 2 is a smaller dataset derived from validation dataset 1 for comparing the performance of the algorithm with that of physicians. It included CXRs of 50 TB, 33 normal, and 22 abnormal other-than-TB cases (22 pneumonia, 5 pleural effusion, 1 heart failure, and 1 fibrosis). The features of the two validation datasets are provided in Table 1.
Data collected from clinical CXR cases included demographic data (such as age and sex), radiology reports, clinical diagnoses, microbiological reports, and pathology reports. All clinical TB cases included in the study had their diagnosis confirmed by microbiology or pathology, and their CXRs were performed within 1 month of TB diagnosis. Normal CXRs were also reviewed by C.F.C., and radiology reports were taken into account. Pneumonia/other disease cases were identified by reviewing medical records and examinations, with diagnoses based on the clinical physicians' judgement and no evidence of TB detected within a three-month period.
We employed validation dataset 2 to evaluate the accuracy of TB detection by 5 clinical physicians (all board-certified pulmonologists; average experience 10 years, range 5–16 years). Each physician performed the test without additional clinical information and was asked to estimate the probability of TB for each CXR, to consider whether sputum TB examinations were needed, and to classify the CXR into one of three categories: typical TB pattern, normal pattern, or abnormal pattern (less like TB).
We also collected the radiology reports for validation dataset 2 to evaluate their sensitivity for detecting TB. Reports mentioning suspicion of TB or mycobacterial infection were classified as typical TB pattern. Reports indicating abnormal patterns such as infiltration, opacity, pneumonia, effusion, edema, mass, or tumor (but without mentioning tuberculosis, TB, or mycobacterial infection) were classified as abnormal pattern (less like TB). Reports demonstrating no evident abnormalities were classified as normal pattern. Furthermore, by analyzing the pulmonologists' decisions regarding sputum TB examinations, we estimated the sensitivity of TB detection in the pulmonologists' actual clinical practice.
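The report classification rules can be expressed as a simple keyword check, sketched below in Python. The text does not say whether reports were categorized by hand or by software, so this is only an illustration of the rule order (TB terms take priority over other abnormal terms). The function names, the whole-word matching, and the sample reports are assumptions; a real implementation would also need to handle negation (for example "no evidence of TB"), plurals, and synonyms, which this sketch ignores.

```python
import re

TB_TERMS = ("tuberculosis", "tb", "mycobacterial")
ABNORMAL_TERMS = ("infiltration", "opacity", "pneumonia", "effusion",
                  "edema", "mass", "tumor")


def _mentions(text: str, terms) -> bool:
    """Return True if any term occurs as a whole word in the lowercased text."""
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)


def classify_report(report: str) -> str:
    """Map a radiology report to one of the three categories described above."""
    text = report.lower()
    if _mentions(text, TB_TERMS):
        return "typical TB pattern"
    if _mentions(text, ABNORMAL_TERMS):
        return "abnormal pattern (less like TB)"
    # Assumption: a report with none of the listed terms is treated as normal.
    return "normal pattern"


# Hypothetical report snippets.
for report in ("Right upper lobe cavitary lesion, suspect TB.",
               "Left pleural effusion.",
               "No active lung lesion."):
    print(report, "->", classify_report(report))
```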
Continuous variables are represented as mean ± standard deviation (SD) or median (interquartile range [IQR]), while categorical variables are represented as number (percentage). For accuracy analysis, the receiver operating characteristic (ROC) curve was used to compute the area under the curve (AUC). Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), likelihood ratio (LR), overall accuracy, and F1 score were calculated. A confusion matrix was used to illustrate the accuracy of each AI model. Boxplots were used to evaluate the distribution of the predicted values of the AI models for each etiology subgroup.
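For readers unfamiliar with the AUC, the following Python sketch computes it through its rank interpretation: the probability that a randomly chosen positive case receives a higher predicted value than a randomly chosen negative case, with ties counted as one half. This is equivalent to the area under the empirical ROC curve, but it is not necessarily how the study's statistical software computed it; the function name and the example data are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = TB, 0 = non-TB) and predicted values.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
```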
The formulas for each accuracy calculation are as follows:
(TP is true positives, TN is true negatives, FP is false positives, FN is false negatives, P is all positives, and N is all negatives.)
$$\begin{gathered}
\text{P} = \text{TP} + \text{FN}, \hfill \\
\text{N} = \text{TN} + \text{FP}, \hfill \\
\text{Sensitivity} = \text{TP}/\text{P} \times 100, \hfill \\
\text{Specificity} = \text{TN}/\text{N} \times 100, \hfill \\
\text{PPV} = \text{TP}/\left( \text{TP} + \text{FP} \right) \times 100, \hfill \\
\text{NPV} = \text{TN}/\left( \text{TN} + \text{FN} \right) \times 100, \hfill \\
\text{LR}+ = \text{sensitivity}/\left( 1 - \text{specificity} \right), \hfill \\
\text{LR}- = \left( 1 - \text{sensitivity} \right)/\text{specificity}, \hfill \\
\text{Overall accuracy} = \left( \text{TP} + \text{TN} \right)/\left( \text{P} + \text{N} \right) \times 100, \hfill \\
\text{F1 score} = \left( 2 \times \text{sensitivity} \times \text{PPV} \right)/\left( \text{sensitivity} + \text{PPV} \right) \times 100, \hfill \\
\end{gathered}$$
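As a worked example, the formulas above can be sketched in Python. The confusion-matrix counts are hypothetical and are not results from this study. The sketch assumes that sensitivity, specificity, and PPV enter the LR and F1 formulas as proportions (0 to 1) and that every denominator is nonzero, apart from the two LR edge cases handled explicitly.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute the accuracy measures from confusion-matrix counts."""
    p = tp + fn            # all positives
    n = tn + fp            # all negatives
    sens = tp / p          # proportions are used inside LR and F1
    spec = tn / n
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "sensitivity": sens * 100,
        "specificity": spec * 100,
        "PPV": ppv * 100,
        "NPV": npv * 100,
        "LR+": sens / (1 - spec) if spec < 1 else float("inf"),
        "LR-": (1 - sens) / spec if spec > 0 else float("inf"),
        "overall accuracy": (tp + tn) / (p + n) * 100,
        "F1 score": 2 * sens * ppv / (sens + ppv) * 100,
    }


# Hypothetical counts: 50 TB cases and 50 non-TB cases.
metrics = diagnostic_metrics(tp=40, fn=10, tn=45, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```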
A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography | Scientific Reports - Nature.com