This research complied with all relevant ethical regulations. The study that produced the AI assistance dataset29 used in this study was determined by the Massachusetts Institute of Technology (MIT) Committee on the Use of Humans as Experimental Subjects to be exempt through exempt determination E-2953.
This study used 324 retrospective patient cases from Stanford University's healthcare system containing chest X-rays and clinical histories, which include patients' indication, vitals and labs. In this study, we analyzed data collected from a total of 140 radiologists participating in two experiment designs. The non-repeated-measure design included 107 radiologists (Supplementary Fig. 1). Each radiologist read 60 patient cases across four subsequences that each contained 15 cases. Each subsequence corresponded to one of four treatment conditions: with AI assistance and clinical histories; with AI assistance and without clinical histories; without AI assistance and with clinical histories; and without AI assistance and without clinical histories. The four subsequences and associated treatment conditions were organized in a random order. The 60 patient cases were randomly selected and randomly assigned to one of the treatment conditions. This design included across-subject and within-subject variations in the treatment conditions; it did not allow within-case-subject comparisons because each case was encountered only once by a radiologist38. Order effects were mitigated by the randomization of treatment conditions. The repeated-measure design included 33 radiologists (Supplementary Fig. 2). Each radiologist read a total of 60 patient cases, each under each of the four treatment conditions, producing a total of 240 diagnoses. Each radiologist completed the experiment in four sessions and read the same 60 randomly selected patient cases in each session. In each session, 15 cases were read in each treatment arm in batches of five cases. Treatments were randomly ordered. As a result, the radiologist read each patient case under a different treatment condition in each of the four sessions.
There was a 2-week washout period15,39,40 between every session to minimize order effects of radiologists reading the same case multiple times. This design included across-subject and within-subject variations as well as across-case-radiologist and within-case-radiologist variations in treatment conditions. Order effects were mitigated by the randomization of treatment conditions. No enrichment was applied to the data collection process. We combined data from both experiment designs from the clinical history conditions. Further details about the data collection process are available in a separate study29, which focuses on establishing a Bayesian framework for defining optimal human–AI collaboration and characterizing actual radiologist behavior in incorporating AI assistance. The study was determined exempt by the MIT Committee on the Use of Humans as Experimental Subjects through exempt determination E-2953.
There are 15 pathologies with corresponding AI predictions: abnormal, airspace opacity, atelectasis, bacterial/lobar pneumonia, cardiomediastinal abnormality, cardiomegaly, consolidation, edema, lesion, pleural effusion, pleural other, pneumothorax, rib fracture, shoulder fracture and support device hardware. These pathologies, the interrelations among these pathologies and additional pathologies without AI predictions can be visualized in a hierarchical structure in Supplementary Fig. B.1. Radiologists were asked to familiarize themselves with the hierarchy before starting, had access to the figure throughout the experiment and had to provide predictions for pathologies following this hierarchy. This aimed to maximize clarity on the specific pathologies referenced in the experiment. When radiologists received AI assistance, they were simultaneously presented with the AI predictions for these 15 pathologies along with the patient's chest X-ray and, if applicable, their clinical history. The AI predictions were presented in the form of prediction probabilities on a 0–100 scale. The AI predictions were generated by the CheXpert model8, which is a DenseNet121 (ref. 41)-based model for chest X-rays that has been shown to perform similarly to board-certified radiologists. The model generated a single prediction for fracture that was used as the AI prediction for both rib fracture and shoulder fracture. Authors of the CheXpert model8 decided on the 14 pathologies (with a single prediction for fracture) based on the prevalence of observations in radiology reports in the CheXpert dataset and clinical relevance, conforming to the Fleischner Society's recommended glossary42 whenever applicable. Among the pathologies, they included 'Pneumonia' (corresponding to bacterial/lobar pneumonia) to indicate the diagnosis of primary infection and 'No Finding' (corresponding to abnormal) to indicate the absence of all pathologies.
These pathologies were set in the creation of the CheXpert labeler8, which has been applied to generate labels for reports in the CheXpert dataset and MIMIC-CXR43, which are among the largest chest X-ray datasets publicly available.
The ground truth probabilities for a patient case were determined by averaging the continuous predicted probabilities, on a 0–100 scale, of five board-certified radiologists from Mount Sinai Hospital with at least 10 years of experience and chest radiology as a subspecialty. For instance, if the predicted probabilities of the five board-certified radiologists are 91, 92, 92, 100 and 100, respectively, the ground truth probability is 95. The prevalence of the pathologies, based on a ground truth probability threshold of 50 for a pathology being present, is shown in Supplementary Table 1.
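The averaging rule can be illustrated with a minimal Python sketch. The function names are hypothetical, and whether the threshold of 50 is applied strictly (greater than) or inclusively is not stated in the text, so the strict comparison below is an assumption.

```python
from statistics import mean


def ground_truth_probability(reader_probs):
    """Average the 0-100 probabilities given by the expert readers."""
    return mean(reader_probs)


def is_present(reader_probs, threshold=50):
    """Binarize the ground truth for prevalence counts.

    Assumption: a pathology counts as present when the averaged
    probability is strictly above the threshold.
    """
    return ground_truth_probability(reader_probs) > threshold


# The worked example from the text: five expert probabilities for one case.
example = [91, 92, 92, 100, 100]
print(ground_truth_probability(example), is_present(example))
```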
The participating radiologists represent a diverse set of institutions recruited through two means. Their primary affiliations include large, medium and small clinical settings and non-clinical settings. Additionally, some radiologists are affiliated with an academic hospital, whereas others are not. Radiologists in the non-repeated-measure design were recruited from teleradiology companies. Radiologists in the repeated-measure design were recruited from the Vinmec health system in Vietnam. Details about the participating radiologists and recruitment process can be found in Supplementary Note | Participant recruitment and affiliation.
The experiment interface and instructions presented to participating radiologists can be found in Supplementary Note | Experiment interface and instructions. Before entering the experiment, radiologists were instructed to walk through the experiment instructions, the hierarchy of pathological findings, basic information and performance of the AI model, video demonstration of the experiment interface and examples, consent clauses, comprehension check questions, information on bonus payment that incentivizes effort and practice patient cases covering four treatment conditions and showing example AI predictions from the AI model used in the experiment.
Sex and gender statistics of the participating radiologists and patient cases are available in Supplementary Tables 39 and 40, respectively. Sex and gender were not considered in the original data collection procedures. Disaggregated information about sex and gender at the individual level was collected in the separate study and will be made available29.
We used the empirical Bayes method30 to shrink the raw mean heterogeneous treatment effects and performance metrics of individual radiologists measured on the dataset toward the grand mean to ameliorate overestimating heterogeneity due to sampling error. The values include AI's treatment effects on error, sensitivity and specificity and performance metrics on unassisted error, sensitivity and specificity.
Assume that \(t_{r}\) is radiologist \(r\)'s true mean treatment effect from AI assistance or any metric of interest. We observe
$$\tilde{t}_{r}=t_{r}+\eta_{r}$$
(1)
which differs from \(t_{r}\) by \(\eta_{r}\). We use a normal distribution as the prior distribution over the metric of interest. The mean of the prior distribution can be computed as
$$E\left[\tilde{t}_{r}\right]=E\left[t_{r}\right],$$
(2)
the mean of the observed mean metric of interest of radiologists. The variance of the prior distribution can be computed as
$$E\left[\left(t_{r}-E\left[t_{r}\right]\right)^{2}\right]=E\left[\left(\tilde{t}_{r}-E\left[\tilde{t}_{r}\right]\right)^{2}\right]-E\left[\eta_{r}^{2}\right],$$
(3)
the variance of the observed mean metric of interest of radiologists minus the estimated \(E\left[\eta_{r}^{2}\right]\). We can estimate \(E\left[\eta_{r}^{2}\right]\) with
$$E\left[\eta_{r}^{2}\right]=E\left[\left(\frac{1}{N_{r}}\sum\limits_{i}t_{ir}-E\left[t_{ir}\right]\right)^{2}\right]=E\left[\frac{\sum_{i}\left(t_{ir}-E\left[t_{ir}\right]\right)^{2}}{N_{r}}\right]=E\left[\mathrm{s.e.}\left(\tilde{t}_{r}\right)^{2}\right].$$
(4)
Denote the estimated mean and variance of the prior distribution as \(\mu_{0}\) and \(\sigma_{0}^{2}\). We can compute the mean of the posterior distribution for radiologist \(r\) as
$$\frac{\sigma_{r}^{2}\mu_{0}+\sigma_{0}^{2}\mu_{r}}{\sigma_{0}^{2}+\sigma_{r}^{2}}$$
(5)
where \(\mu_{r}=\widetilde{t}_{r}\) and \(\sigma_{r}=\mathrm{s.e.}\left(\widetilde{t}_{r}\right)\); we can compute the variance of the posterior as
$$\frac{\sigma_{0}^{2}\sigma_{r}^{2}}{\sigma_{0}^{2}+\sigma_{r}^{2}}$$
(6)
where \(\sigma_{r}=\mathrm{s.e.}\left(\widetilde{t}_{r}\right)\). The updated mean of the posterior distribution is the radiologist's metric of interest after shrinkage.
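The shrinkage steps in equations (2) to (6) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name is hypothetical, the prior variance uses the population variance of the observed means, and clipping a negative prior variance at zero is an added safeguard that the text does not mention.

```python
from statistics import mean, pvariance


def empirical_bayes_shrink(estimates, std_errors):
    """Shrink per-radiologist estimates toward the grand mean.

    estimates[r]  : observed mean metric for radiologist r
    std_errors[r] : standard error of that observed mean
    Returns a list of (posterior_mean, posterior_variance) tuples.
    """
    mu0 = mean(estimates)                           # equation (2)
    noise_var = mean(se ** 2 for se in std_errors)  # equation (4)
    # Equation (3); the clip at zero is an assumption, not from the text.
    sigma0_sq = max(pvariance(estimates) - noise_var, 0.0)

    shrunk = []
    for mu_r, se_r in zip(estimates, std_errors):
        sigma_r_sq = se_r ** 2
        denom = sigma0_sq + sigma_r_sq
        if denom == 0.0:
            # No prior spread and no sampling noise: nothing to shrink.
            shrunk.append((mu_r, 0.0))
            continue
        post_mean = (sigma_r_sq * mu0 + sigma0_sq * mu_r) / denom  # equation (5)
        post_var = sigma0_sq * sigma_r_sq / denom                  # equation (6)
        shrunk.append((post_mean, post_var))
    return shrunk


# Made-up treatment effects for four radiologists with equal standard errors.
for post_mean, post_var in empirical_bayes_shrink([-2.0, 0.0, 1.0, 5.0], [1.0] * 4):
    print(round(post_mean, 3), round(post_var, 3))
```

Estimates with larger standard errors are pulled more strongly toward the grand mean, which is the intended correction for heterogeneity that is inflated by sampling error.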
For the analysis on treatment effects on absolute error, we focus on high-prevalence pathologies with prevalence greater than 10%. Radiologists' baseline performance without AI assistance is generally highly accurate on low-prevalence pathologies, where they correctly predict that a pathology is not present, and, as a result, there is little variation in radiologists' errors. This is especially true when computing each individual radiologist's treatment effect. When there is zero variance in the performance of a radiologist under a treatment condition, the associated standard error estimate is zero, making it impossible to perform inference on this radiologist's treatment effect.
The combined characteristics model was fitted on a training set of half of the radiologists (n = 68) to predict treatment effects of the test set of the remaining half (n = 68). The treatment effect predictions on the test set were used as the combined characteristics score for splitting the test set radiologists into binary subgroups (based on whether a particular radiologist's combined characteristics score was smaller than or equal to the median treatment effect of radiologists computed from all available reads). Then, the same procedure was repeated after flipping the training set and test set radiologists to split the other set of radiologists into binary subgroups. The experience-based characteristics of radiologists in the randomly split training set and test set were balanced: one set contained 27 radiologists with less than or equal to 6 years of experience and 41 radiologists with more than 6 years of experience, and the other set contained 41 and 27, respectively. One set contained 47 radiologists who did not specialize in thoracic radiology and 21 radiologists who did, and the other set contained 54 and 14 radiologists, respectively. One set contained 32 radiologists without experience with AI tools and 36 radiologists with experience, and the other set contained 31 and 37, respectively.
To compute a radiologist's observed mean treatment effect and the corresponding standard errors and the overall treatment effect of AI assistance across subgroups, we built a linear regression model with the following formulation using the statsmodels library: error ~ 1 + C(treatment). Here, error refers to the absolute error of a radiologist prediction; 1 refers to an intercept term; and treatment refers to a binary indicator of whether the prediction is made with or without AI assistance. This formulation allows us to compute the treatment effect of AI assistance for both non-repeated-measure and repeated-measure data.
For the analyses on experience-based radiologist characteristics and AI error, we computed the treatment effects of subgroups split based on the predictor of interest by building a linear regression model with the following formulation using the statsmodels library: error ~ 1 + C(subgroup) + C(treatment):C(subgroup). Here, error refers to the absolute error of a radiologist prediction; 1 refers to an intercept term; subgroup refers to an indicator of the subgroup that the radiologist is split into; and treatment refers to a binary indicator of whether the prediction is made with or without AI assistance. This formulation allows us to compute the subgroup-specific treatment effect of AI assistance for both non-repeated-measure data and repeated-measure data.
To account for correlations of observations within patient cases and radiologists, we computed cluster-robust standard errors that are two-way clustered at the patient case and radiologist level for all inferences unless otherwise specified44,45. With the statsmodels library's ordinary least squares (OLS) class, we used a clustered covariance estimator as the type of robust sandwich estimator and defined two-way groups based on identifiers of the patient cases and radiologists. The approach assumes that regression model errors are independent across clusters defined by the patient cases and radiologists and adjusts for correlations within clusters.
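A regression of this form with two-way clustered standard errors might look like the following sketch. The data are synthetic and the column names (`radiologist`, `case`, `treatment`, `error`) are hypothetical; the only statsmodels features relied on are the formula interface and the `cov_type="cluster"` option, which accepts a two-column integer array of group identifiers for two-way clustering.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_radiologists, n_cases = 40, 60

# Synthetic long-format reads: one row per (radiologist, case, treatment).
reads = pd.DataFrame(
    [(r, c, t) for r in range(n_radiologists) for c in range(n_cases) for t in (0, 1)],
    columns=["radiologist", "case", "treatment"],
)
r_idx = reads["radiologist"].to_numpy()
c_idx = reads["case"].to_numpy()
t = reads["treatment"].to_numpy()

# Made-up data-generating process: AI lowers absolute error by about 2 points,
# with radiologist-level and case-level noise so that clustering matters.
reads["error"] = (
    20.0
    + rng.normal(0.0, 3.0, n_radiologists)[r_idx]
    + rng.normal(0.0, 3.0, n_cases)[c_idx]
    + (-2.0 + rng.normal(0.0, 1.0, n_radiologists)[r_idx]) * t
    + rng.normal(0.0, 5.0, len(reads))
)

# Two-way clustering: a two-column integer array of (case, radiologist) ids.
groups = reads[["case", "radiologist"]].to_numpy()

fit = smf.ols("error ~ 1 + C(treatment)", data=reads).fit(
    cov_type="cluster", cov_kwds={"groups": groups}
)
effect = fit.params["C(treatment)[T.1]"]
se = fit.bse["C(treatment)[T.1]"]
print(f"treatment effect: {effect:.2f} (two-way clustered s.e. {se:.2f})")
```

The subgroup formulation from the text can be substituted for the formula string once a `subgroup` column is available. If identifiers are strings rather than integers, converting them to integer codes first (for example with `pandas.factorize`) is a reasonable precaution.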
The reversion to the mean effect and the mechanism of split sampling in avoiding reversion to the mean are explained in the following derivation:
Suppose that \(u_{i,r}^{*}\) and \(a_{i,r}^{*}\) are the true unassisted and assisted diagnostic error of radiologist \(r\) on patient case \(i\). Suppose that we measure \(u_{i,r}=u_{i,r}^{*}+e_{i,r}^{u}\) and \(a_{i,r}=a_{i,r}^{*}+e_{i,r}^{a}\) where \(e_{i,r}^{u}\) and \(e_{i,r}^{a}\) are measurement errors. Assume that the measurement errors are independent of \(u_{i,r}^{*}\) and \(a_{i,r}^{*}\).
To study the relationship between unassisted error and treatment effect, we intend to build the following linear regression model:
$$u_{r}^{*}-a_{r}^{*}=\beta u_{r}^{*}+e_{r}^{*}$$
(7)
where the error is independent of the independent variable, and \(u_{r}^{*}\) and \(a_{r}^{*}\) are the mean unassisted and assisted performance of radiologist \(r\). Here, the moment condition
$$E\left[e_{i,r}^{*}\times u_{i,r}^{*}\right]=0$$
(8)
is as desired. This univariate regression estimates the true value of \(\beta\), which is defined as
$$\frac{\mathrm{Cov}\left(u_{r}^{*}-a_{r}^{*},\,u_{r}^{*}\right)}{\mathrm{Var}\left(u_{r}^{*}\right)}$$
(9)
However, because we have access only to noisy measurements \(u_{r}\) and \(a_{r}\), consider instead an approach that builds the model
$$u_{r}-a_{r}=\beta u_{r}+e_{r}$$
(10)
and assumes the moment condition
$$E\left[e_{r}\times u_{r}\right]=0.$$
(11)
This linear regression model using noisy measurements instead generates the following estimate of \(\beta\):
$$\frac{\mathrm{Cov}\left(u_{r}-a_{r},u_{r}\right)}{\mathrm{Var}\left(u_{r}\right)}=\frac{\mathrm{Cov}\left(u_{r}^{*}-a_{r}^{*},u_{r}^{*}\right)+\mathrm{Var}\left(e_{r}^{u}\right)}{\mathrm{Var}\left(u_{r}^{*}\right)+\mathrm{Var}\left(e_{r}^{u}\right)}$$
(12)
which is incorrect because of the additional \(\mathrm{Var}\left(e_{r}^{u}\right)\) terms in the numerator and the denominator. The additional term in the denominator represents attenuation bias, which we address in detail in a later subsection. The term in the numerator represents the reversion to the mean issue, which we now discuss in further detail.
As the equation shows, the bias caused by reversion to the mean is positive. This term exists because the moment condition \(E\left[e_{r}\times u_{r}\right]=0\), equation (11), is not valid at the true value of \(\beta\) as shown in the following derivation:
$$\begin{array}{rcl}E\left[\left(u_{r}-a_{r}-\beta u_{r}\right)\times u_{r}\right]&=&E\left[\left(\left(1-\beta\right)u_{r}-a_{r}\right)\times u_{r}\right]\\ &=&E\left[\left(\left(1-\beta\right)\left(u_{r}^{*}+e_{r}^{u}\right)-\left(a_{r}^{*}+e_{r}^{a}\right)\right)\times u_{r}\right]\\ &=&E\left[\left(\left(\left(1-\beta\right)u_{r}^{*}-a_{r}^{*}\right)+\left(1-\beta\right)e_{r}^{u}-e_{r}^{a}\right)\times u_{r}\right]\\ &=&E\left[\left(e_{r}^{*}+\left(1-\beta\right)e_{r}^{u}-e_{r}^{a}\right)\times u_{r}\right]\\ &=&\left(1-\beta\right)E\left[e_{r}^{u}\times u_{r}\right]\\ &=&\left(1-\beta\right)\mathrm{Var}\left(e_{r}^{u}\right)\ne 0.\end{array}$$
Split sampling solves this bias by using separate patient cases for computing unassisted error and treatment effect. A simple construction of split sampling is to use a separate case \(i\) for computing the treatment effect and to use the remaining cases to compute unassisted error. With this construction, we obtain the following estimate of \(\beta\):
$$\frac{\mathrm{Cov}\left(u_{i,r}-a_{i,r},u_{\ne i,r}\right)}{\mathrm{Var}\left(u_{\ne i,r}\right)}$$
(13)
where \(u_{i,r}\) is the unassisted performance on case \(i\) for radiologist \(r\), and \(u_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than \(i\). If the errors on each case used to compute \(u_{r}^{*}\) and \(a_{r}^{*}\) are independent, the estimate of \(\beta\) is equal to
$$\frac{\mathrm{Cov}\left(u_{r}^{*}-a_{r}^{*},u_{r}^{*}\right)}{\mathrm{Var}\left(u_{\ne i,r}\right)}$$
(14)
The remaining discrepancy in the denominator again represents attenuation bias and is addressed in a later subsection.
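A small simulation can make the reversion to the mean bias concrete. The sketch below is illustrative only: all parameter values are made up, and the true treatment effect is set to the same constant for every radiologist so that the true \(\beta\) is zero. The naive regression of equation (10) should then recover a clearly positive slope close to \(\mathrm{Var}(e_{r}^{u})/(\mathrm{Var}(u_{r}^{*})+\mathrm{Var}(e_{r}^{u}))\), whereas the split-sample estimate of equation (13) should stay near zero.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Slope of a univariate least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


random.seed(0)
n_radiologists, n_cases = 2000, 5
true_effect = 2.0  # u* - a* is identical for everyone, so the true beta is 0
noise_sd = 6.0     # per-case measurement noise (made-up value)

naive_x, naive_y, split_x, split_y = [], [], [], []
for _ in range(n_radiologists):
    u_star = random.gauss(20.0, 3.0)
    a_star = u_star - true_effect
    u = [u_star + random.gauss(0.0, noise_sd) for _ in range(n_cases)]
    a = [a_star + random.gauss(0.0, noise_sd) for _ in range(n_cases)]

    # Naive construction: the same noisy mean appears on both sides.
    naive_x.append(mean(u))
    naive_y.append(mean(u) - mean(a))

    # Split sampling: case i gives the effect, the other cases give the regressor.
    for i in range(n_cases):
        others = u[:i] + u[i + 1:]
        split_x.append(mean(others))
        split_y.append(u[i] - a[i])

naive_beta = ols_slope(naive_x, naive_y)
split_beta = ols_slope(split_x, split_y)
print(f"naive slope: {naive_beta:.3f}, split-sample slope: {split_beta:.3f}")
```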
To study unassisted error as a predictor of treatment effect, we built a linear regression model with the following formulation using the statsmodels library: treatment effect ~ 1 + unassisted error. We designed the following split sampling construction to maximize data efficiency when computing the independent and dependent variables in the linear regression.
Let \(i\) index a patient case and \(r\) index a radiologist. Assume that a radiologist reads \(N_{u}\) cases unassisted and \(N_{a}\) cases assisted. Recall that the unassisted and assisted cases are disjoint for the non-repeated-measure data; they overlap exactly for the repeated-measure data.
For the non-repeated-measure design, we adopt the following construction:
$$u_{i,r}-a_{r}=\beta x_{\ne i,r}+\varepsilon_{u_{i,r}}+\varepsilon_{a_{r}}$$
(15)
where \(x_{\ne i,r}=\frac{1}{N_{u}-1}\sum_{k\ne i}u_{k,r}\) and \(a_{r}=\frac{1}{N_{a}}\sum_{k}a_{k,r}\). Here, \(x_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than \(i\); \(u_{i,r}\) is the unassisted performance on case \(i\) for radiologist \(r\); and \(a_{r}\) is the mean assisted performance on all assisted cases for radiologist \(r\).
For the repeated-measure design, we adopt the following construction:
$$u_{i,r}-a_{i,r}=\beta x_{\ne i,r}+\varepsilon_{u_{i,r}}+\varepsilon_{a_{i,r}}$$
(16)
where \(x_{\ne i,r}=\frac{1}{N_{u}-1}\sum_{k\ne i}u_{k,r}\). Here, \(x_{\ne i,r}\) is the mean unassisted performance computed on all cases other than \(i\); \(u_{i,r}\) is the unassisted performance on case \(i\) for radiologist \(r\); and \(a_{i,r}\) is the assisted performance on case \(i\) for radiologist \(r\).
To study unassisted error as a predictor of assisted error, we built a linear regression model with the following formulation using the statsmodels library: assisted error ~ 1 + unassisted error. We designed the following split sampling construction that maximizes data efficiency when computing the independent and dependent variables in the linear regression.
For the non-repeated-measure design, we adopt the following construction:
$$a_{i,r}=\beta x_{r}+\varepsilon_{i,r}$$
(17)
where \(x_{r}=\frac{1}{N_{u}}\sum_{k}u_{k,r}\). Here, \(x_{r}\) is the mean unassisted performance computed on all unassisted cases, and \(a_{i,r}\) is the assisted performance on case \(i\) for radiologist \(r\).
For the repeated-measure design, we adopt the following construction:
$$a_{i,r}=\beta x_{\ne i,r}+\varepsilon_{i,r}$$
(18)
where \(x_{\ne i,r}=\frac{1}{N_{u}-1}\sum_{k\ne i}u_{k,r}\). Here, \(x_{\ne i,r}\) is the mean unassisted performance computed on all unassisted cases other than \(i\), and \(a_{i,r}\) is the assisted performance on case \(i\) for radiologist \(r\).
The constructions above again emphasize the necessity for split sampling. Without split sampling, the mean unassisted performance, which is the independent variable of the linear regression, will be correlated with the error terms due to overlapping patient cases, leading to a bias in the regression.
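The four constructions in equations (15) to (18) can be written as a pair of small helper functions that build the regression rows for one radiologist. This is a sketch under stated assumptions: the function names are hypothetical, and for the repeated-measure design the two input lists are assumed to be aligned so that position \(i\) refers to the same patient case in both.

```python
from statistics import mean


def leave_one_out_means(values):
    """Mean of all entries except the i-th, for every i."""
    total, n = sum(values), len(values)
    return [(total - v) / (n - 1) for v in values]


def treatment_effect_rows(unassisted, assisted, repeated_measure):
    """(x, y) pairs for one radiologist, following equations (15) and (16)."""
    x = leave_one_out_means(unassisted)
    if repeated_measure:
        # Same cases in both arms: y = u_{i,r} - a_{i,r}.
        y = [u - a for u, a in zip(unassisted, assisted)]
    else:
        # Disjoint cases: y = u_{i,r} - a_r, with a_r the mean assisted error.
        a_bar = mean(assisted)
        y = [u - a_bar for u in unassisted]
    return list(zip(x, y))


def assisted_error_rows(unassisted, assisted, repeated_measure):
    """(x, y) pairs for one radiologist, following equations (17) and (18)."""
    if repeated_measure:
        # Exclude the unassisted read of the case being predicted.
        x = leave_one_out_means(unassisted)
    else:
        # No shared cases, so every unassisted read can be used.
        x = [mean(unassisted)] * len(assisted)
    return list(zip(x, assisted))


# Made-up absolute errors for one radiologist.
print(treatment_effect_rows([10, 20, 30], [5, 15], repeated_measure=False))
print(assisted_error_rows([10, 20, 30], [5, 15, 25], repeated_measure=True))
```

Stacking these rows across radiologists gives the data for the two regressions described above; in every row the regressor is computed from cases that do not enter the dependent variable.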
We adjusted for attenuation bias for the split sampling linear regression formulations.
We want to estimate regressions of the form
$$Y_{r}=\beta_{0}+\beta_{1}E\left[x_{r}\right]+\varepsilon_{r}$$
(19)
where \(Y_{r}\) is an outcome for radiologist \(r\) and \(E\left[x_{r}\right]\) is radiologist \(r\)'s average unassisted performance. We observe
$$\widetilde{x}_{r}=\frac{1}{N_{r}}\sum\limits_{i}x_{ir}=E\left[x_{r}\right]+\eta_{r}$$
(20)
where \(\eta_{r}=\frac{1}{N_{r}}\sum\limits_{i}x_{ir}-E\left[x_{r}\right]\) and \(E\left[\eta_{r}x_{r}\right]=0\) and \(E\left[\eta_{r}\varepsilon_{r}\right]=0\), which are justified by independent and identically distributed (i.i.d.) sampling of cases and split sampling, respectively.
Using observations from the experiment, we estimate the following regression:
$$Y_{r}=\gamma_{0}+\gamma_{1}\tilde{x}_{r}+\varepsilon_{r}$$
(21)
Recall that
$$\begin{array}{rcl}\hat{\gamma}_{1}&\overset{p}{\to}&\frac{E\left[\left(x_{r}+\eta_{r}-E\left[x_{r}\right]\right)\left(Y_{r}-E\left[Y_{r}\right]\right)\right]}{E\left[\left(x_{r}+\eta_{r}-E\left[x_{r}\right]\right)^{2}\right]}\\ &=&\frac{E\left[\left(x_{r}-E\left[x_{r}\right]\right)\left(Y_{r}-E\left[Y_{r}\right]\right)\right]}{E\left[\left(x_{r}-E\left[x_{r}\right]\right)^{2}\right]+E\left[\eta_{r}^{2}\right]}=\beta_{1}\lambda\end{array}$$
(22)
where \(\lambda=\frac{E\left[\left(x_{r}-E\left[x_{r}\right]\right)^{2}\right]}{E\left[\left(x_{r}-E\left[x_{r}\right]\right)^{2}\right]+E\left[\eta_{r}^{2}\right]}\) and \(\beta_{1}=\frac{E\left[\left(x_{r}-E\left[x_{r}\right]\right)\left(Y_{r}-E\left[Y_{r}\right]\right)\right]}{E\left[\left(x_{r}-E\left[x_{r}\right]\right)^{2}\right]}\). We can estimate \(\lambda\) using a plug-in estimator for each term in the data: (1)
$$\begin{array}{rcl}E\left[\eta_{r}^{2}\right]&=&E\left[\left(\frac{1}{N_{r}}\sum\limits_{i}x_{ir}-E\left[x_{ir}\right]\right)^{2}\right]\\ &=&E\left[\frac{\sum_{i}\left(x_{ir}-E\left[x_{ir}\right]\right)^{2}}{N_{r}}\right]=E\left[\mathrm{s.e.}\left(\tilde{x}_{r}\right)^{2}\right].\end{array}$$
(23)
This is the standard error of the mean estimator. (2)
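One way the correction implied by equations (22) and (23) could be computed is sketched below. The function names are hypothetical. The text above does not spell out the second plug-in term, so the sketch assumes that the denominator of \(\lambda\) is estimated by the variance of the observed means \(\tilde{x}_{r}\), which equals \(E[(x_{r}-E[x_{r}])^{2}]+E[\eta_{r}^{2}]\) under the stated independence conditions.

```python
from statistics import mean, pvariance


def attenuation_factor(observed_means, std_errors):
    """Plug-in estimate of lambda from equation (22).

    observed_means[r] : noisy mean unassisted performance of radiologist r
    std_errors[r]     : standard error of that mean
    """
    noise_var = mean(se ** 2 for se in std_errors)  # equation (23)
    # Assumption: the variance of the observed means estimates the
    # denominator Var(true x) + E[eta^2].
    total_var = pvariance(observed_means)
    return (total_var - noise_var) / total_var


def correct_slope(gamma_hat, lam):
    """Undo attenuation: gamma_1 converges to beta_1 * lambda, so divide."""
    return gamma_hat / lam


# Made-up numbers: four radiologists, each mean measured with s.e. 2.
lam = attenuation_factor([10.0, 14.0, 18.0, 22.0], [2.0, 2.0, 2.0, 2.0])
print(lam, correct_slope(0.4, lam))
```

Because \(\lambda\) lies between 0 and 1 whenever the noise variance is smaller than the total variance, dividing by it moves the estimated slope away from zero.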
Heterogeneity and predictors of the effects of AI assistance on radiologists - Nature.com