A pipeline for the fully automated estimation of continuous reference … – Nature.com

Overview of automated pipeline

Our developed pipeline leverages routine measurements and sophisticated statistical tools to estimate continuous RIs. For this purpose, we utilize a state-of-the-art indirect method for RI estimation (refineR28) in combination with a statistical method to estimate smooth curves (GAMLSS29). The pipeline is highly modular and operates in a sequential procedure. It consists of four main steps (Fig.1):

Flowchart of the developed pipeline for the estimation of continuous reference intervals from routine data. Based on the raw input data (a), overlapping age groups with at least N=1000 samples are defined (b). An indirect method (here refineR) is applied to each defined age group to estimate a model describing the non-pathological distribution (c). This model is then used to compute a probability of being non-pathological along the concentration range (Eq.(1)), i.e. that a data point with that concentration originated from the non-pathological distribution (d). The light green colored area represents a probability of 1 (most likely non-pathological), and the dark blue colored region a probability of 0 (most likely pathological) (d). These estimated weights are then assigned to each data point (e). Following that, a statistical method to estimate smooth curves (here GAMLSS) is applied to the weighted input data set. The resulting parametric model enables the derivation of continuous reference intervals and percentile charts (f). To interpret and analyze pediatric data in more detail, percentile charts can also be presented with alternative scaling of the covariate age (f inset).

First, the raw input data (Fig.1a) are assigned to overlapping age groups with a high temporal resolution (1) (Fig.1b). For each of these groups, an indirect method (here refineR28) is applied to estimate a model describing the non-pathological distribution (2) (Fig.1c). These estimated models are then used to assign a weight to each raw data point describing the probability of being non-pathological, i.e. originating from the non-pathological distribution (3) (Fig.1d, e). Following that, a statistical method to estimate parametric, smooth curves (here GAMLSS29) is applied to the raw input data while taking the assigned probabilities into account. This parametric model then enables the derivation of age-specific percentiles representing estimates of continuous RIs and percentile charts (4) (Fig.1f). In addition, the model allows for conversion of test results into z-scores.

A more detailed description of the individual steps is given in the following sub-sections.

The first step in the pipeline is to assign the data to overlapping age groups. In order to model the dynamics observed in early infancy and puberty appropriately, we define age groups with high temporal resolution (Fig.1b). After birth, age groups are constructed for each day of life. To ensure a sufficient amount of data for the application of the indirect method, we expand each group equally to the left and the right until we reach a minimum of N=1000 samples per group. For this minimum sample size, refineR was previously shown to achieve robust results for fractions of pathological samples20%30. If there are multiple samples per subject in one age group, only the measurement with the most central age is used and the other samples are removed in this group to avoid intra-patient correlation effects31. To reduce computation time, the width of the groups is linearly expanded with increasing age. We employ a 1% increase of width based on the age in days, as it is assumed that physiological dynamics are most pronounced and rapid in the first days and months of life5,14 (e.g. for a group with age at 1000days, the expanded age group would cover the range 9951005days). This 1% increase of age range reduces the total number of age groups from 6570 to 479 in our case.

For each of the age groups defined in step 1, an indirect method is applied to estimate age-group specific RIs and models describing the underlying non-pathological distribution (Fig.1c). Here, we apply the refineR algorithm28 (refineR v1.5.1) to the different groups.

The refineR algorithm, as all indirect methods, operates on routine measurements and assumes that these consist of a mixture of pathological and non-pathological samples with the latter being in the majority. Further, it is assumed that the distribution of the non-pathological samples can be modeled with a (1- or 2-parameter) Box-Cox transformed normal distribution. After a pre-processing step to determine the central concentration region, a multi-level grid search is carried out to find the model that best describes the histogram of the raw routine measurements. refineR was recently described and evaluated in detail28,30 and could demonstrate convincing performance for sample sizes as small as N=1000 with a pathological fraction of up to 20% as well as for different distribution types, reaching from normal over skewed to heavily skewed distributions30. The algorithm is provided as an open-source R-package on CRAN (https://cran.r-project.org/package=refineR).

In a next step, the estimated models describing the non-pathological distribution for each age group are utilized to compute and assign a weight to each raw data point reflecting its probability of being non-pathological.

For each age group and thus each parametric model, the weights are computed as the ratio of densities of the estimated model and the total distribution of the raw data within this group (Eq.(1), Fig.1d).

$$Pleft(conc, aright)=mathrm{min}left(1,{text{max}}left(0,frac{{D}_{np}left(conc, aright)}{{D}_{total}left(conc, aright)}right)right)$$

(1)

with conc=concentration, a=age group, Dnp=density of non-pathological distribution, Dtotal=density of total distribution of raw input data. To increase robustness for the estimation of P, slight Gaussian smoothing is applied to the concentration-dependent ratio.

A probability P of 1 denotes that samples are assumed to originate from the non-pathological distribution (e.g. located within peak region) (Fig.1d light green region), while a probability of 0 means that samples are most likely pathological (e.g. located in outer regions) (Fig.1d dark blue region). Between these two extreme points, there is a transition region where continuous probabilities between 0 and 1 are assigned to the samples (Fig.1d light blue region). Overall, the samples within the peak region contribute more to the GAMLSS estimation than those at the outer regions (Fig.1e).

To account for the fact that one subject can contribute with multiple samples to the whole dataset leading to potential bias caused by intra-subject correlation in distinct age-classes, the data points are weighted down if the subject is represented multiple times as described in Eq.(2):

$$P_corrleft(subject, conc, aright)=frac{P(conc,a)}{{N}_{subject}}$$

(2)

with Nsubject=number of samples for a subject in the whole dataset, and P(conc, a) as described in Eq.(1).

As the age groups are occasionally expanded and thus overlap, the weight within each age group is assigned only to the central age. Hence, exactly one weight is set to each data point and is used in the following step to indicate the probability of individual samples being non-pathological (Fig.1e).

Having assigned the weight or probability of being non-pathological to each data point (Fig.1e), the next step is to estimate a parametric model for the continuous RIs. Here, we utilize the generalized additive models for location, scale, and shape (GAMLSS)29.

Specifically, we apply the LMS (lambda-mu-sigma) method with extensions, with the basic assumption that the response variable follows a distribution that depends on the covariate of interest, here the age of the subject. Similar to the refineR model assumptions, the LMS method, or also known as BoxCox Cole and Green distribution, fits three parameters: skewness (lambda), location (mu), and scale (sigma)32. In order to allow for more flexibility in model fitting, additional distribution families, which can be seen as extensions to the LMS method, are tested. Namely, the Box-Cox power exponential and the Box-Cox t-distribution. Both allow for fitting a forth parameter that models the kurtosis of the distribution33. These different distribution models were recently shown to achieve good performance in estimating continuous RIs for non-pathological datasets34. The best model out of these different distribution families is then selected based on the Bayesian Information Criterion (BIC).

The GAMLSS framework (https://cran.r-project.org/package=gamlss) is utilized to estimate a parametric model for the continuous RI curves. It results in a single model representing the age-dependent RI curves.

After computing the parametric model, it can be used to estimate percentile charts and thus continuous RIs over the range of the covariate, here from birth to adulthood (Fig.1f). Additionally, the parametric model allows for conversion of test results into z-scores. The 2.5th and 97.5th percentiles, i.e. the typical reference limits, would correspond to a z-score of approximately 1.96 and+1.96. Making use of z-scores corresponds to normalizing RIs and laboratory test results and thus facilitates the interpretation of the latter13,35.

In order to provide CIs for the estimated continuous RIs we use bootstrapping. Thus, we randomly sample from the input dataset with replacement and use the randomly sampled data set as input for the pipeline. The pipeline is then executed 100 times resulting in 100 parametric GAMLSS models. These are then used to predict the different percentiles for each day of life and the CIs are computed as the central 95% region of the estimated percentile points. We found that the estimated CIs can occasionally be strongly asymmetric causing the original model estimated on the whole data set to lie outside of the central 95% region. In such situations, the CI was expanded to include the estimate of the original model.

To showcase the application of the developed pipeline, we evaluate pediatric datasets of three biomarkers that are known for their extensive dynamics during physiological development: Alkaline phosphatase (ALP), Creatinine (CREA), and Hemoglobin (HB). We retrieved pseudonymized test results for boys and girls with ages 018years measured during routine care at a tertiary care center (Department of Pediatrics and Adolescent Medicine, University Hospital Erlangen, Germany). The measurements were performed on Roche cobas instruments (ALP, CREA) and SYSMEX instruments (HB) between 2010 and 2022 (Table 1). The datasets contained the numeric test results, sex of the subject, the age in days, and a non-traceable subject identifier. Both in- and outpatient test results were included in the datasets and no outlier exclusion was performed prior to applying the pipeline. Use of pseudonymized patient datasets obtained during patient care without patients explicit consent is in accordance with the applicable German/Bavarian regulations. The performed analyses of pediatric datasets have been approved and the need for informed consent was waived by the Ethical Review Boards of the University Hospital Erlangen, reference numbers 97_17 Bc and 216_21 Bc.

To evaluate the quality of results of the developed pipeline, we compare the results obtained to continuous and discrete RIs, previously estimated using other populations and methods.

For the comparison, we include RIs established as part of the CALIPER project (CAnadian Laboratory Initiative in PEdiatric Reference Intervals)21,36, the PEDREF study (Next-Generation Pediatric Reference Intervals)13,14, and extracted from package inserts for the different assays provided by the manufacturers23,24,25,26,27. For PEDREF and for CALIPER we received continuous RIs, while the package inserts only contain discrete, age-group specific RIs.

The CALIPER project has established age- and sex-specific RIs for children from birth to adulthood for over 170 biomarkers using population-based methods. For a subset of these, continuous RIs are reported as well21,22,37,38. However, due to a limited number of samples from healthy neonates and young children, and the extensive dynamics during these early stages of life, the first 6 or 12months of life are excluded. For ALP, CREA, and HB, continuous RIs were generated using nonparametric quantile regression with penalized splines, and the resulting RIs (2.5th and 97.5th percentiles) are reported for each half year of life21,22. Additionally, for ALP and CREA, continuous RIs have recently been established using LMS-based models38. As our datasets for ALP and CREA were measured on Roche cobas devices, but the CALIPER RIs are based on samples measured on Abbott ARCHITECT c8000 instruments, we transferred the reference limits (RLs) to Roche devices using the previously established equation for discrete RLs39.

PEDREF is a multi-center, data-driven project utilizing data from laboratory information systems from 13 German university hospitals to establish pediatric RIs13. For ALP and CREA fractional polynomial functions are provided to compute the RIs for boys and girls using age as an input parameter14. For HB, the RIs were established using penalized splines and the specific values are provided for each day of life from 0 to 18 years13.

See the article here:

A pipeline for the fully automated estimation of continuous reference ... - Nature.com

Related Posts

Comments are closed.