Maxwell Mansolf played lead role in formal analysis, investigation, methodology, project administration, validation, visualization, writing of original draft, and writing of review and editing and equal role in conceptualization and data curation. Courtney K. Blackwell played supporting role in writing of original draft and writing of review and editing and equal role in conceptualization, investigation, and project administration. Peter Cummings played lead role in data curation and supporting role in software, writing of original draft, and writing of review and editing. Seohyun Choi played supporting role in writing of original draft and equal role in data curation and project administration. David Cella played lead role in conceptualization, funding acquisition, and supervision; supporting role in methodology, writing of original draft, and writing of review and editing; and equal role in project administration.
Correspondence concerning this article should be addressed to Maxwell Mansolf, Department of Medical Social Sciences, Feinberg School of Medicine, Northwestern University, 625 N. Michigan Ave, Fl 27, Chicago, IL 60611, United States. Email: maxwell.mansolf@northwestern.edu
The publisher's final edited version of this article is available at Psychol Assess.
The Child Behavior Checklist (CBCL) and Strengths and Difficulties Questionnaire (SDQ) both measure emotional and behavioral problems in children and adolescents, and scores on the two instruments are highly correlated. When administrative needs compel practitioners to change the instrument used or data from the two measures are combined to perform pooled analyses, it becomes necessary to compare scores on the two instruments. To enable such comparisons, we linked scores on three domains (Internalizing, Externalizing, and Total Problems) of the CBCL and SDQ in three age groups spanning 2–17 years. After assessing linking assumptions, we compared item response theory (IRT) and equipercentile linking methods to identify the most statistically justifiable link, ultimately selecting equipercentile linking with loglinear smoothing due to its minimal bias and the ability to link raw SDQ scores with both T-scores and raw scores from the CBCL. We derived crosswalk conversion tables to convert scores on one measure to the metric of the other and discuss the use of these tables in research and practice.
Keywords: Child Behavior Checklist, Strengths and Difficulties Questionnaire, linking, equating, assessment
Childhood mental health assessment is important in clinical practice to screen for emotional and behavioral problems and in research to assess severity and monitor change. In the U.S., 13%–25% of children are diagnosed with a mental health disorder annually (Egger & Angold, 2006; Perou et al., 2013), and diagnostic comorbidity is common (Ghandour et al., 2019). Early identification of problems can help address contributing factors and reduce downstream negative effects on school readiness, academic achievement and attainment, peer relationships, substance use, and criminal behavior (Child Mind Institute, 2015; Perou et al., 2013). The importance of screening early and often is highlighted by the American Academy of Pediatrics, Committee on Psychosocial Aspects of Child and Family Health and Task Force on Mental Health (2009), which recommends that clinicians be equipped to identify and address pediatric mental health concerns during well-care visits. Practical limitations to administering pediatric mental health screeners may result in measure underutilization and thus underidentification of children who need additional services. Cost, time, and ease of use can prohibit implementing certain assessment tools despite their high reliability and validity (Cohen et al., 2008). Also, the measures used can change between administrations, for example, due to a change in budget or strategic direction or a desire to reduce time burden (e.g., Zuckerbrot et al., 2007). Such a switch necessitates a way to compare scores from different instruments to ensure continuity of monitoring.
In addition, researchers may encounter situations in which multiple instruments were administered to different groups of children. Such situations may arise when data from multiple sites are combined in a multistudy program or when existing data were collected using one measure but prospective data will be collected using a different measure. These harmonization issues are common in large-scale consortium research (Collins & Manolio, 2007; Feigelson et al., 2006; Pedersen et al., 2013; Smith-Warner et al., 2006; Willett et al., 2007), and with a broader movement to standardize measurement across nationally funded research programs (Downey & Olson, 2013) and the proliferation of modern perspectives on data sharing (Fischer & Zigmond, 2010; Van Noorden, 2013), it is increasingly important to be able to place scores from different measures onto a common metric. Crosswalk conversion tables, while originally developed to compare scores on aptitude tests (Lord & Wingersky, 1984), have become popular tools for converting scores between instruments in psychosocial and health sciences, with recent research linking measures of depression (Choi et al., 2014), anxiety (Schalet et al., 2014), pain (Askew et al., 2013), psychological distress (Batterham et al., 2018), and fatigue (Lai et al., 2014), to name a few. While crosswalk conversions have limitations, including the introduction of error when the to-be-linked scores are not perfectly correlated (e.g., Lord, 1982) and potential breakdown of the linking relationship across subgroups (Dorans, 2004; Petersen, 2008), they are beneficial for comparing levels of functioning across samples and the results of studies which use different instruments. Crosswalk tables can also be used to harmonize data at the score level when the linked instruments do not share items or response formats. 
While integrative data analytic methods (IDA; Curran & Hussong, 2009; Hussong et al., 2013), in which joint latent variable modeling of multiple measures is used to analyze the corresponding constructs on a common metric, are theoretically better suited than crosswalk conversion tables for such harmonization with respect to bias and precision of the resulting estimates, this approach requires the analyst to be trained in latent variable modeling and to have access to raw co-administration data or derived item parameters from concurrent calibration. In contrast, crosswalk tables permit the resulting linked scores to be used directly by analysts with no training in latent variable modeling and do not require overlapping administration. In the present study, we used established linking methodology (Choi et al., 2014) to link two popular measures of emotional and behavioral functioning, the Child Behavior Checklist (CBCL; Achenbach, 1991; Achenbach & Ruffle, 2000) and Strengths and Difficulties Questionnaire (SDQ; Goodman, 1997).
The CBCL has a long history of clinical and nonclinical use for assessing behavioral and emotional problems in children (Achenbach, 1991; Achenbach & Ruffle, 2000), with several age-based versions: preschool parent/teacher-report form (1.5–5 years); school-age parent/teacher-report form (6–18 years); and school-age self-report form (11–18 years). The CBCL has many scoring options, including examination of individual items, syndrome scales (e.g., Anxious/Depressed), Diagnostic and Statistical Manual of Mental Disorders (DSM)-oriented scales (e.g., Depressive Problems, Anxiety Problems), domain scores (i.e., Internalizing, Externalizing), and/or the Total Problems score, which encompasses all assessed problems. The SDQ assesses similar constructs to the CBCL and has a similar response format and question structure (Goodman, 1997). Like the CBCL, the SDQ has multiple forms: a preschool form (2–4 years), a younger child school-age form (4–10 years), an older child school-age form (11–17 years), and a self-report form (11–17 years). The versions of the SDQ have nearly identical items, and each measures five specific domains: Emotional Problems, Conduct Problems, Hyperactivity, Peer Problems, and Prosocial, which measures positive functioning. Like the CBCL, the SDQ can also yield Internalizing, Externalizing, and Total Problems scores by combining the various subscales. As a considerably shorter measure (25 items in total), the SDQ is becoming popular as a simpler measure of child emotional and behavioral problems. The overlap in domains between the CBCL and SDQ makes them good candidates for score linking.
As the CBCL and SDQ are both commonly used, the present study seeks to link or map scores from each instrument to the other to facilitate data pooling and enable practitioners and researchers to compare the scores across measures. To date, only one study has attempted to link the CBCL and SDQ. Stevens et al. (2021) linked the CBCL and SDQ Total scores in a sample of residential youth using equipercentile equating, yielding crosswalk tables which can be used to convert SDQ Total Difficulties scores to CBCL Total Problems T-scores or vice versa. However, a rigorous analysis of test linking assumptions was not performed: The correlation between CBCL and SDQ scores was not provided, and Cronbach’s α was only reported for each scale individually. Furthermore, only a single crosswalk table was constructed despite the separate CBCL norms for males and females, and no assessment of the quality of the linkage was conducted, either in the estimation sample or in key subsamples (e.g., male and female). Finally, Stevens et al. only linked the school-age forms in a sample consisting of youth in a residential care facility; therefore, results do not generalize to preschool-aged children and may not generalize to outpatient or nonpatient youth, the latter of which make up the broader demographic of those likely to be evaluated in clinical practice and research settings.
The authors recruited participants through an internet panel company, collecting data from three samples of 500 parents. Each sample was defined by the age of the index child: 2 years 0 months to 5 years 11 months (ages 2–5 sample), 6 years 0 months to 11 years 11 months (ages 6–11 sample), and 12 years 0 months to 17 years 11 months (ages 12–17 sample). Table 1 provides child and parent demographic information for each sample by child gender. The samples were collected with the goal of obtaining racial and ethnic representation consistent with national norms (~70%–80% White, 20% Hispanic or Latino, 10% Black or African American), allowing natural fallout on other demographic variables (e.g., education, income). With respect to race and ethnicity, these goals were generally achieved (72.7%–81.4% White across age and gender subgroups, 19.5%–26.7% Hispanic or Latino, 14.3%–23.1% Black or African American). Educational attainment, which was not selected for, was higher in the current samples than in the general U.S. population, with 94.3%–99.2% with high school diploma or higher and 36.3%–71.2% with Bachelor’s degree or higher compared to 88.6% and 33.1%, respectively, in the U.S. population (U.S. Census Bureau, 2019). Participants were parents or legal guardians of the index children, who were informed that they would be asked questions about their children’s health and well-being and provided informed consent prior to participation. The overall project has institutional review board approval and this specific substudy was deemed exempt. Each parent completed both the CBCL and SDQ, yielding a single-group design which is ideal for linking (Dorans, 2007). The order of the instruments was randomized for each participant, such that half received the CBCL first and the other half received the SDQ first. This study was not preregistered. Data and analysis code are not publicly available.
Child and Parent Demographic Information for Analysis Samples by Child Gender
Child age group | Ages 2–5 | | Ages 6–11 | | Ages 12–17 | |
---|---|---|---|---|---|---|
Child gender | Female | Male | Female | Male | Female | Male |
Number of participants | 245 | 255 | 217 | 283 | 236 | 264 |
Child age (M) | 4.0 | 3.9 | 8.8 | 9.0 | 14.9 | 14.9 |
Child Hispanic/Latino (%) | 20.4% | 22.7% | 26.7% | 24.0% | 19.5% | 21.6% |
Child race (%) | ||||||
White | 72.7% | 72.9% | 79.7% | 80.6% | 75.8% | 81.4% |
Black or African American | 20.0% | 23.1% | 14.3% | 14.5% | 18.2% | 14.8% |
American Indian or Alaska Native | 4.9% | 0.8% | 3.2% | 1.1% | 2.5% | 0.8% |
Asian | 7.3% | 5.1% | 2.8% | 3.5% | 5.9% | 3.4% |
Native Hawaiian or Pacific Islander | 2.0% | 0.8% | 1.4% | 0.4% | 0.4% | 0.4% |
Other | 6.1% | 7.1% | 4.1% | 3.9% | 3.8% | 3.4% |
Parent age (M) | 31.3 | 30.9 | 36.4 | 37.8 | 44.3 | 43.0 |
Parent gender (%) | ||||||
Female | 78.4% | 65.9% | 64.1% | 44.9% | 68.2% | 48.9% |
Male | 21.2% | 34.1% | 35.9% | 55.1% | 31.8% | 51.1% |
Other | 0.4% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Parent education (%) | ||||||
Some high school | 5.7% | 4.7% | 5.1% | 1.4% | 2.1% | 0.8% |
High school graduate | 31.0% | 24.3% | 10.6% | 13.1% | 16.5% | 9.8% |
Some college | 17.1% | 20.8% | 23.0% | 11.3% | 12.7% | 10.6% |
Associate degree | 9.8% | 9.8% | 8.3% | 8.8% | 13.6% | 7.6% |
Bachelor’s degree | 24.9% | 24.7% | 27.2% | 30.7% | 27.5% | 30.7% |
Master’s degree | 9.8% | 12.9% | 20.3% | 26.5% | 21.2% | 28.0% |
Professional or doctorate degree | 1.6% | 2.7% | 5.5% | 8.1% | 6.4% | 12.5% |
Table 2 provides the age range for index children and the instruments administered to each sample. The recommended age groups for the CBCL and SDQ do not fully overlap; specifically, while the CBCL norms range from 1.5 years to 5 years 11 months, 6 years to 11 years 11 months, and 12 years to 17 years 11 months, the SDQ versions are recommended for children 2 years to 4 years 11 months, 4 years to 10 years 11 months, and 11 years to 17 years 11 months, resulting in two discordant age ranges: 5 years to 5 years 11 months and 11 years to 11 years 11 months. Rather than add two additional, narrowly defined groups to our design to account for these two age ranges, we allowed the administered ages for the SDQ to not fully align with the corresponding SDQ version, administering the SDQ 2–4 years to parents of 5-year-old children (n = 127) and the SDQ 4–10 years to parents of 11-year-old children (n = 64). This yielded only three, rather than five, samples. We opted for this design because the SDQ forms have few differences: only 4 items were changed (out of 25) between SDQ 2–4 years and 4–10 years and 5 items were changed (out of 25) between SDQ 4–10 years and SDQ 11–17 years; most of these were minor wording changes that did not impact the item concept (e.g., SDQ 2–4 years: “Often fights with other children or bullies them” vs. SDQ 4–10 years: “Often fights with other youth or bullies them”; SDQ 4–10 years: “Nervous or clingy in new situations, easily loses confidence” vs. SDQ 11–17 years: “Nervous in new situations, easily loses confidence”). In addition, the CBCL has been normed to the general U.S. population, and we aimed to use these norms to calculate T-scores for linking and evaluate the average level of emotional and behavioral functioning in our samples with respect to normative levels. 
The availability of these norms added additional utility to aligning the age structure of our samples with the norming structure of the CBCL, rather than the nonnormed age groupings defined by SDQ versions.
Child Age and Instrument Version for Each Score-Linking Sample
Child age | CBCL version | SDQ version |
---|---|---|
2–5 years, 11 months | CBCL preschool 1.5–5 years | SDQ 2–4 years |
6–11 years, 11 months | CBCL school-age 6–17 years | SDQ 4–10 years |
12–17 years, 11 months | CBCL school-age 6–17 years | SDQ 11–17 years |
Note. CBCL = Child Behavior Checklist; SDQ = Strengths and Difficulties Questionnaire.
The items in the CBCL correspond to problematic behaviors observed within the past 6 months, rated on a 3-point scale: 0 = Not True, 1 = Somewhat or Sometimes True, and 2 = Very True or Often True. The preschool CBCL contains 100 items covering multiple syndrome scales which can be combined into two domain scores and one total score: Internalizing domain score (Emotionally Reactive, Anxious/Depressed, Somatic Complaints, and Withdrawn); Externalizing domain score (Attention Problems and Aggressive Behavior); and the Total Problems score (i.e., all Internalizing and Externalizing syndrome scales, and Sleep Problems and Other Problems scales). The school-age CBCL similarly contains 113 items and syndrome scales that are combined to produce an Internalizing score (Anxious/Depressed, Withdrawn/Depressed, and Somatic Complaints); Externalizing score (Rule-Breaking Behavior and Aggressive Behavior); and Total score (i.e., all Internalizing and Externalizing syndrome scales, and Social Problems, Thought Problems, Attention Problems, and Other Problems). The CBCL DSM-oriented scales are not included in the current analyses.
All versions of the SDQ contain 25 items, with 5 items measuring each of 5 dimensions: Emotional Problems, Conduct Problems, Hyperactivity, Peer Problems, and Prosocial Behavior, the last of which focuses on strengths of a child’s emotional and behavioral functioning. Like the CBCL, SDQ items are scored on a 3-point scale of 0 = Not True, 1 = Somewhat True, and 2 = Certainly True regarding the child’s behavior over the past 6 months. A Total Difficulties score can be calculated as the sum of the first four subscales. Additionally, the Emotional Problems and Peer Problems subscales can be combined to produce an Internalizing score, while Conduct Problems and Hyperactivity can be combined to produce an Externalizing score. Factor analyses have explored models including these higher order factors as an alternative latent structure of the SDQ; while results have been mixed, they generally favor the use of Internalizing and Externalizing dimensions over the less reliable subscale scores (Dickey & Blumberg, 2004; Koskelainen et al., 2001; Van Leeuwen et al., 2006). The Internalizing and Externalizing factors have also shown better convergent and discriminant validity across informants and with respect to clinical disorder than the component subscales, although the subscales can provide marginal utility over these higher order factors (Goodman et al., 2010).
All analyses were conducted separately for the Internalizing, Externalizing, and Total Problems domains in R (R Core Team, 2020). Confirmatory factor analyses (CFAs) were conducted using the lavaan package (Rosseel, 2012).
To assess the assumptions of test linking in the CBCL and SDQ, we followed the recommendations of Choi et al. (2014). As a preliminary step, we examined item content for the two measures and calculated the correlation between SDQ and CBCL summed scores to ensure that the two measures were essentially measuring the same constructs.
The next linking assumption we tested was unidimensionality: that a single latent variable can primarily account for the pattern of covariance among the combined item responses to the two measures. To this end, we calculated statistics from classical test theory (CTT: item-total correlations) and exploratory factor analysis (EFA: first to second eigenvalue ratio, number of factors identified by parallel analysis, number of eigenvalues greater than 1) using the psych package (Revelle, 2020) in R. We also calculated coefficient α to assess the internal consistency of the combined item sets.
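The CTT and EFA statistics named above can be sketched in a few lines. The following Python example is only an illustration under stated assumptions: the analyses in this study were run in R with the psych package, the function names here are hypothetical, the data are simulated, and the eigenvalues come from Pearson rather than polychoric correlations. Parallel analysis is omitted.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    n_items = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - sum_item_var / total_var)

def corrected_item_total(items):
    """Correlation of each item with the sum of the remaining items."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

def eigenvalue_summary(items):
    """First-to-second eigenvalue ratio and count of eigenvalues above 1."""
    corr = np.corrcoef(np.asarray(items, dtype=float), rowvar=False)
    evals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    return {
        "ratio_1_to_2": float(evals[0] / evals[1]),
        "n_above_1": int((evals > 1.0).sum()),
    }

# Simulated one-factor data: 500 respondents, 12 items scored 0, 1, or 2.
rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 1))
latent = 0.8 * theta + 0.6 * rng.normal(size=(500, 12))
items = np.digitize(latent, bins=[0.0, 1.0])

alpha = cronbach_alpha(items)
summary = eigenvalue_summary(items)
item_total = corrected_item_total(items)
print(round(alpha, 3), summary, item_total.round(2))
```

A large first-to-second eigenvalue ratio and a single eigenvalue above 1 are the patterns one would expect when one dominant dimension underlies the combined item set.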
Our unidimensionality assessment also included CFAs of the combined CBCL and SDQ item sets. CFA permits more specific tests of unidimensionality than CTT and EFA, including whether residual relationships between items exist after accounting for the common influence of the single latent variable. CFA also enables the calculation of coefficient omega (ω; McDonald, 1999) reliability, which can be a better indicator of the reliability of the items as indicators of a single common factor than coefficient α (Revelle & Zinbarg, 2009). In particular, ω relies on the unidimensionality of a measure and, given that assumption, quantifies the proportion of variance in total scores attributable to the underlying latent variable using parameter estimates from a fitted confirmatory factor model. To estimate coefficient ω, we estimated a one-factor CFA model using the WLSMV estimator (Muthén, 1984; Muthén et al., 1997), which properly accounts for the ordinal level of measurement in the item responses and thus provides the most theoretically justifiable tests of the fit of the data to a unidimensional model. We assessed the fit of the resulting models using commonly used statistical indices and benchmark values (Hopwood & Donnellan, 2010; Lance et al., 2006), including the comparative fit index (CFI; >.90 = adequate fit, >.95 = very good fit), the Tucker–Lewis index (TLI; >.90 = adequate fit, >.95 = very good fit), the root-mean-square error of approximation (RMSEA), and the standardized root-mean-square residual (SRMR). We also examined the z statistic (estimate/SE) for differences between the observed and model-implied covariance matrices to identify the locations of statistically meaningful residuals. Finally, we calculated coefficient ω using the method described in Green and Yang (2009), which uses the categorical CFA model estimated using WLSMV to determine the proportion of variance in total scores attributable to the latent variable in the categorical factor model.
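For orientation, the familiar linear form of coefficient ω for a one-factor model can be written as follows. This is a simplified sketch, not the estimator used here: it assumes continuous indicators, a factor variance fixed at 1, loadings λ_i, and residual variances θ_i for k items (symbols introduced for illustration), whereas the Green and Yang (2009) approach works from the categorical model.

```latex
\omega = \frac{\left(\sum_{i=1}^{k} \lambda_i\right)^{2}}
              {\left(\sum_{i=1}^{k} \lambda_i\right)^{2} + \sum_{i=1}^{k} \theta_i}
```

The numerator is the total-score variance explained by the common factor, and the denominator is the model-implied variance of the total score.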
Multiple linking methods were used and compared to identify potential problems with linking and to evaluate the sensitivity of the linkage to the assumptions underlying each method (Kolen & Brennan, 2004). To link the CBCL and SDQ, we used the following statistical techniques (Choi et al., 2010): IRT-based fixed-parameter calibration (Haebara, 1980) and nonparametric equipercentile linking (Kolen & Brennan, 2004; Lord, 1982). To conserve space, details on these methods, their statistical frameworks, and their strengths and weaknesses are described in Supplemental Methods. Linking analyses were conducted in R using the mirt package (Chalmers, 2012) for item response theory (IRT) parameter estimation and the PROsetta (Choi & Lim, 2020) and equate packages (Albano, 2016) to derive linking functions.
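To make the equipercentile idea concrete, the sketch below links two simulated score distributions by matching mid-percentile ranks. It is a deliberately simplified, unsmoothed illustration in Python rather than the equate-based R workflow used in this study. The function names, score ranges, and data are all hypothetical, and interpolating between integer score points only approximates the textbook procedure.

```python
import numpy as np

def percentile_ranks(scores, score_range):
    """Mid-percentile rank (0 to 100) of each integer score in score_range."""
    scores = np.asarray(scores)
    counts = np.array([(scores == s).sum() for s in score_range], dtype=float)
    rel_freq = counts / counts.sum()
    cum_below = np.cumsum(rel_freq) - rel_freq  # proportion strictly below
    return 100.0 * (cum_below + rel_freq / 2.0)

def equipercentile_link(x_scores, y_scores, x_range, y_range):
    """Map each X score to the Y score with the same percentile rank."""
    pr_x = percentile_ranks(x_scores, x_range)
    pr_y = percentile_ranks(y_scores, y_range)
    # Linear interpolation of Y scores at the percentile ranks of X.
    linked = np.interp(pr_x, pr_y, np.asarray(y_range, dtype=float))
    return {int(x): float(y) for x, y in zip(x_range, linked)}

# Simulated single-group design: every respondent has both scores.
rng = np.random.default_rng(1)
theta = rng.normal(size=2000)
short_form = np.clip(np.rint(6 + 3.5 * theta + rng.normal(size=2000)), 0, 20).astype(int)
long_form = np.clip(np.rint(18 + 10 * theta + 3 * rng.normal(size=2000)), 0, 60).astype(int)

crosswalk = equipercentile_link(short_form, long_form, range(0, 21), range(0, 61))
for score in (0, 5, 10, 15, 20):
    print(score, round(crosswalk[score], 1))
```

Scores that were never observed receive degenerate percentile ranks in this unsmoothed version, which is one practical reason to presmooth the score distributions before linking.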
Each linking method yields a linking function, a (typically nonlinear) function relating scores on each instrument to a corresponding score on the other. To evaluate these functions, we statistically examined the relationships between linked and actual CBCL and SDQ scores, calculating the correlation between linked and observed scores, bias in linked scores, and standard deviation of differences between linked and observed scores. We also constructed Bland–Altman plots (Bland & Altman, 1999), which graph the average of each observed and linked score on the X-axis and the difference between these scores on the Y-axis, to assess linking bias across the score range. Based on the choice of linking function, crosswalk tables were constructed for converting SDQ raw scores to CBCL raw scores (and vice versa) and SDQ raw scores to CBCL T-scores (and vice versa) using the selected function. Crosswalk tables have been made publicly available at the American Psychological Association’s (APA) Open Science Framework (OSF) repository and can be accessed at https://osf.io/n5s9u/.
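The evaluation statistics described above are simple to compute once linked and observed scores are paired. The Python sketch below uses hypothetical function names and toy numbers. It assumes that bias is defined as linked minus observed and that standardization uses the standard deviation of the observed destination scores. The 1.96 SD limits of agreement follow the usual Bland–Altman convention.

```python
import numpy as np

def linking_accuracy(observed, linked):
    """Correlation, bias, and SD of differences for linked vs. observed scores."""
    observed = np.asarray(observed, dtype=float)
    linked = np.asarray(linked, dtype=float)
    diff = linked - observed  # assumed sign convention: linked minus observed
    sd_observed = observed.std(ddof=1)
    return {
        "r": float(np.corrcoef(observed, linked)[0, 1]),
        "bias": float(diff.mean()),
        "sd_diff": float(diff.std(ddof=1)),
        "bias_in_sd_units": float(diff.mean() / sd_observed),
    }

def bland_altman_points(observed, linked):
    """Coordinates and limits of agreement for a Bland-Altman plot."""
    observed = np.asarray(observed, dtype=float)
    linked = np.asarray(linked, dtype=float)
    means = (observed + linked) / 2.0   # X-axis
    diffs = linked - observed           # Y-axis
    half_width = 1.96 * diffs.std(ddof=1)
    limits = (float(diffs.mean() - half_width), float(diffs.mean() + half_width))
    return means, diffs, limits

# Toy example: linked scores that run about one point high.
observed = [10, 12, 14, 16, 18, 20]
linked = [11, 13, 14, 18, 19, 21]
stats = linking_accuracy(observed, linked)
means, diffs, limits = bland_altman_points(observed, linked)
print(stats, limits)
```

Plotting `diffs` against `means` shows whether the discrepancy between linked and observed scores changes across the score range.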
Once linking functions were derived, we assessed their invariance across subgroups (subgroup invariance; Choi et al., 2014) in the male and female preschool samples, repeating the assessments outlined above in these subsamples using the linking function for the combined preschool sample. Bland–Altman plots were constructed for each domain in the combined preschool sample, assessing the performance of the combined linking function in each subsample, and for the four linking functions for each domain in the school-age samples. Since the age groups to which the SDQ was administered did not exactly overlap with the recommended age groups for administration, we repeated the evaluation of the relationships between linked and actual scores using only these two ages as a final test of subgroup invariance.
In practice, desired conversions between scores may not coincide perfectly with the administrations presented herein; for example, an 11-year-old child’s SDQ 11–17 score may need to be evaluated according to the CBCL T-score standards for ages 6–11 years, but the CBCL for ages 6–11 years was co-administered with the SDQ 4–10 in this study. A finding of measurement invariance across SDQ age forms would justify the use of SDQ to CBCL crosswalk tables in such nonstandard settings. To evaluate this measurement invariance, we combined the three samples of SDQ item responses and estimated confirmatory factor models four times, imposing measurement invariance constraints by gender, age, neither, or both, with freed structural parameters (means, variances) across groups and constrained measurement parameters. These models were estimated using the CFA of polychoric correlations, as described above, to obtain structural equation modeling (SEM)-based fit statistics and residuals. To obtain deviance statistics, including Akaike’s information criterion (AIC), AIC with correction for small sample size (AICc), Bayesian information criterion (BIC), and sample size adjusted BIC (SABIC), we also estimated these models in an IRT framework. Fit comparisons and deviance-based tests of differences in fit between these models constitute omnibus tests of metric and scalar invariance: If applying constraints on measurement parameters (item slopes and intercepts in IRT; factor loadings and thresholds in CFA) does not meaningfully impact model fit, then metric (factor loading/item slope) and scalar (threshold/intercept) invariance can be assumed to hold across SDQ versions, justifying the application of SDQ–CBCL crosswalk tables in nonstandard settings. 
We assumed that all factor structures were unidimensional and identical across groups and, when constraining measurement parameters to equality across groups, we freely estimated the mean and variance of the latent variables in the two older age groups to account for potential group differences in the mean and variance of the corresponding latent trait; in short, we assumed configural invariance, but did not assume structural invariance (Vandenberg & Lance, 2000).
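The deviance-based statistics used for these model comparisons follow standard definitions, sketched below in Python. The function name and the log-likelihoods, parameter counts, and sample size in the usage example are hypothetical and are not results from this study.

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """AIC, AICc, BIC, and sample-size-adjusted BIC from a model log-likelihood."""
    deviance = -2.0 * log_lik
    aic = deviance + 2.0 * n_params
    # Small-sample correction; requires n_obs > n_params + 1.
    aicc = aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    bic = deviance + n_params * math.log(n_obs)
    sabic = deviance + n_params * math.log((n_obs + 2) / 24.0)
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "SABIC": sabic}

# Hypothetical comparison: measurement parameters constrained vs. freed by group.
constrained = information_criteria(log_lik=-15250.0, n_params=64, n_obs=1500)
freed = information_criteria(log_lik=-15180.0, n_params=184, n_obs=1500)
for name in ("AIC", "AICc", "BIC", "SABIC"):
    preferred = "constrained" if constrained[name] < freed[name] else "freed"
    print(name, round(constrained[name], 1), round(freed[name], 1), preferred)
```

Lower values indicate the preferred model. When the constrained model is preferred, the equality constraints on measurement parameters cost little in fit relative to the parameters they save.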
In practice, it may be necessary to convert SDQ scores to either CBCL raw scores or CBCL T-scores, necessitating the construction of two sets of conversion tables: SDQ raw score to CBCL raw score (and vice versa), and SDQ raw score to CBCL T-score (and vice versa). These conversion options permit two methods of calculating CBCL T-scores from SDQ raw scores: (a) convert SDQ raw scores to CBCL raw scores, then convert the resulting scores to CBCL T-scores using the CBCL’s T-score conversion tables or (b) convert SDQ raw scores directly to CBCL T-scores using a separate linking function derived for this purpose. To optimize the concordance between these two methods, we evaluated linking assumptions and derived linking functions in samples which matched the age–gender structure of the T-score conversion tables of the CBCL: combining male and female preschool-age children into a single preschool sample, while analyzing each school-age group (6–11, 12–17) and gender (male, female) combination separately. We also evaluated linking assumptions and derived and evaluated linking functions in the male and female preschool samples separately to assess subgroup invariance for these two groups.
The CBCL is far more comprehensive than the SDQ, with more items being included in each of Internalizing, Externalizing, and Total Problems domains. Generally, the set of CBCL items used to calculate a score includes, in a slightly less verbose form, the set of SDQ items used to calculate that score (e.g., school-age CBCL item Fears and SDQ 4–10 item Many fears, easily scared). However, many differences exist between the two measures. First, in the Internalizing domain, the SDQ includes only one item on somatic complaints (e.g., Often complains of headaches), while the CBCL includes a whole subscale of Somatic Complaints in its Internalizing score. The SDQ includes many items on social interactions (e.g., Has at least one good friend; Picked on or bullied by other children) in its Peer Problems subscale, which is included in Internalizing, but the CBCL does not include similar items in its Internalizing score. For Externalizing, the CBCL and SDQ differ in that the CBCL includes many more items on disobedience, illegal behavior, and moodiness, which are only represented by a small number of less severe items (e.g., Generally obedient; Often lies or cheats; Steals from home, school, or elsewhere) in the SDQ. Furthermore, while the SDQ Hyperactivity scale, included in the Externalizing score, includes items relating to both hyperactivity (e.g., Restless, overactive) and attention (e.g., Easily distracted, concentration wanders), such items are included on the CBCL Attention Problems subscale (e.g., Impulsive; Can’t concentrate) which is included in the Externalizing domain on the preschool CBCL but not the school-age CBCL. 
Finally, and in addition to these differences between the Internalizing and Externalizing items, the Total Problems score on the CBCL contains many items with no corresponding items in the SDQ, including Thought Problems (e.g., Twitching; Sees things) and the broad set of problems included in Other Problems (e.g., Cruel to animals; Wets the bed; Shows off). Any imperfections in the performance of the linking functions derived herein can potentially be attributed to these nontrivial content differences.
Figure 1 contains statistics used to evaluate unidimensionality, reliability, and between-test correlations in the combined CBCL–SDQ item sets. Across all samples and domains, the reliability of the combined item sets was high (α > .95, ω > .94). EFA and CFA metrics supported the unidimensionality of the combined Internalizing items across all samples: ratios of first to second eigenvalues in EFA were above 11, SDQ and CBCL scores were correlated above .83, and CFA model fit was very good (CFI > .96, TLI > .95, RMSEA < .047, SRMR < .08). For all samples except the preschool samples, the same trends held for the Total and Externalizing scores, albeit with less agreement between summed scores for Externalizing (.76 > r > .72) than Total (.86 > r > .83) scores. In the preschool samples (separate and combined), correlations between CBCL and SDQ summed scores were high for Externalizing and Total scores (.86 > r > .81); however, the first to second eigenvalue ratio was lower (>5.9) and CFA model fit was noticeably worse (CFI > .91, TLI > .91, RMSEA < .068, SRMR < .11), though still considered “adequate” according to our a priori benchmarks, suggesting some multidimensionality in these domains and samples.
Unidimensionality Assessment of Combined CBCL and SDQ Items
Note. (a) Classical test theory and exploratory factor analysis (CTT/EFA) statistics; (b) confirmatory factor analysis (CFA) statistics. CBCL = Child Behavior Checklist; SDQ = Strengths and Difficulties Questionnaire; PA # Fact = number of factors identified in parallel analysis; Evals = eigenvalues.
A detailed analysis of EFA parameter estimates and CFA residuals revealed sizable local dependencies among positively valenced items in all versions of the SDQ, a phenomenon which has been reported elsewhere (e.g., van de Looij-Jansen et al., 2011). By “positively valenced” items, we refer to items which indicate a lack of emotional or behavioral problems (e.g., 7. Generally obedient), rather than the presence of emotional or behavioral problems (e.g., 5. Often loses temper). To account for these local dependencies, we performed a post hoc modification of our confirmatory factor models, reestimating them after adding a positive valence factor orthogonal to the general factor which loaded only on the positively valenced SDQ items (7, 11, 14, 21, and 25 in all SDQ forms), yielding a two-tier model (e.g., Cai, 2010). Model fit information for these models is presented in Figure 1 alongside the corresponding information for the unidimensional models. Because these factors were orthogonal, we were able to calculate coefficient ω hierarchical (ωh; Gignac, 2015), also reported in Figure 1, allowing us to assess the proportion of item-level variance accounted for by the general factor in each categorical factor model. This modification resulted in some fit indices shifting from “adequate” to “very good” for the Externalizing scales in the 2–5 age group but resulted in little change to model-based reliability (ω ≈ ωh ≈ .99); this indicates that, even when valence effects are taken into account, an overwhelming majority of total score variance in all domains and samples is attributable to the general factor. In the estimated CFA models, threshold values were similar between the two measures, while factor loadings on the general factor in the SDQ were markedly lower for items with positive valence. 
This, combined with the larger number of items in the CBCL, suggests that the CBCL is to be preferred when the most reliable measurement is desired, but otherwise the SDQ is an adequate substitute. See Supplemental Results, Supplemental Figures S2a–S2e and S3a–S3e, and Supplemental Tables S2a–S2o for specific discussion and parameter estimates for the categorical CFA models.
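As an illustration of how ω and ωh relate in a model with one general factor and one orthogonal positive valence factor, the following minimal sketch computes both coefficients from standardized loadings. It assumes a linear factor model; the categorical models reported here may require a different computation, and the function name and all loading values are hypothetical rather than estimates from this study.

```python
import numpy as np

def omega_coefficients(general, specific, uniqueness):
    """Omega and omega-hierarchical for one general and one orthogonal specific factor.

    general    : loadings on the general factor, one per item
    specific   : loadings on the specific factor (0 for items that do not load on it)
    uniqueness : residual variances, one per item
    """
    general = np.asarray(general, dtype=float)
    specific = np.asarray(specific, dtype=float)
    uniqueness = np.asarray(uniqueness, dtype=float)

    var_general = general.sum() ** 2    # summed-score variance due to the general factor
    var_specific = specific.sum() ** 2  # summed-score variance due to the specific factor
    total = var_general + var_specific + uniqueness.sum()

    return {
        "omega": (var_general + var_specific) / total,
        "omega_h": var_general / total,
    }

# Hypothetical standardized loadings: four items, two of them positively valenced.
g = np.array([0.8, 0.8, 0.8, 0.8])
s = np.array([0.0, 0.0, 0.3, 0.3])
result = omega_coefficients(g, s, 1.0 - g**2 - s**2)
```

When ω and ωh are nearly equal, as reported above, the specific factor contributes almost nothing to summed score variance.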
Differences between linking functions (IRT-based and the two equipercentile methods) were negligible; however, the equipercentile linking function with loglinear smoothing exhibited two practical advantages over the other two linking functions, and we selected this method for constructing crosswalk tables. First, IRT-based linking was not possible when CBCL T-scores were used, and if different linking functions were used for raw scores and T-scores, the resulting conversion tables could cause confusion. Thus, we found it most efficient to use equipercentile linking for all score types. Second, loglinear smoothing permits interpolation and extrapolation to some scores which were not observed in the data, allowing us to construct more complete crosswalk tables.
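The core idea of equipercentile linking can be sketched as follows: each score on the source instrument is mapped to the destination score with the same percentile rank. This is a simplified, unsmoothed version for integer scores, not the procedure used to build the crosswalk tables; it omits the loglinear presmoothing step, and the function names and example data are hypothetical.

```python
import numpy as np

def percentile_rank(scores, max_score):
    """Percentile rank (as a proportion) at each integer score 0..max_score."""
    counts = np.bincount(scores, minlength=max_score + 1).astype(float)
    rel = counts / counts.sum()
    below = np.cumsum(rel) - rel        # proportion scoring strictly below
    return below + rel / 2.0            # mid-percentile convention

def equipercentile_link(x_scores, y_scores, x_max, y_max):
    """Map each integer X score to the Y metric by matching percentile ranks."""
    p_x = percentile_rank(x_scores, x_max)
    counts_y = np.bincount(y_scores, minlength=y_max + 1).astype(float)
    cum_y = np.cumsum(counts_y / counts_y.sum())

    linked = np.empty(x_max + 1)
    for x, p in enumerate(p_x):
        # Smallest Y score whose cumulative proportion exceeds p.
        y_u = min(int(np.searchsorted(cum_y, p, side="right")), y_max)
        g_below = cum_y[y_u - 1] if y_u > 0 else 0.0
        width = cum_y[y_u] - g_below
        frac = (p - g_below) / width if width > 0 else 0.5
        linked[x] = y_u - 0.5 + frac    # interpolate within the unit interval
    return linked

# Hypothetical data: the second instrument is the first shifted up by 2 points.
x = np.array([0, 1, 1, 2, 2, 2, 3])
crosswalk = equipercentile_link(x, x + 2, x_max=3, y_max=5)
```

Because the unsmoothed version depends on observed frequencies alone, scores with no observations produce unstable or undefined conversions, which is the gap that smoothing is meant to fill.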
Crosswalk Conversion Statistics for Linking Samples (a) and Subsamples (b)
Note. CBCL = Child Behavior Checklist; SDQ = Strengths and Difficulties Questionnaire; IRT = IRT-based fixed-parameter equating; Equip = equipercentile equating with no smoothing; EquipL = equipercentile equating with loglinear smoothing. Subsamples in (b) include gender subsamples for the preschool CBCL (ages 2–5) and samples which were administered a different version of the SDQ than recommended for their age group (ages 5 and 11). The Y-axis on the middle and right-hand panels is in standard deviation units with respect to the destination metric (e.g., SDQ score for CBCL-to-SDQ comparisons).
Not surprisingly, the linking functions performed more poorly in their subsamples than in the samples used for their estimation, as illustrated in panel (b) of Figure 2, and the degree of deterioration depended on the subsample. The male and female preschool subsamples had comparable performance to the full preschool sample, with some increase in variability of differences between linked scores but low bias and high correlation between linked and observed scores. As in the full preschool sample, the Externalizing score linking performed more poorly than the Internalizing and Total linking in the preschool subsamples. The same pattern generally held in groups which were administered a version of the SDQ other than the recommended version for their age group. In these groups, the Externalizing factor exhibited poor performance, with low correlations between linked and observed scores, large mean differences between linked and observed scores, or both. In contrast, these statistics were much better for the Internalizing and Total scores in these subsamples, albeit not as good as those for the samples used to derive the linking functions. We accept the latter set of differences as minor in size (e.g., mean difference of 0.1 SDs or less, corresponding to a small effect size in the Cohen’s d metric) and conclude that, other than Externalizing, the crosswalk tables generated herein generalize well to these nonstandard subsamples.
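The kinds of statistics discussed above (bias, variability of differences, and correlation between linked and observed scores) can be computed along the following lines. This is a minimal sketch assuming paired linked and observed scores on the destination metric; the function name is hypothetical, and scaling by the sample standard deviation of the observed scores is one plausible reading of "standard deviation units with respect to the destination metric."

```python
import numpy as np

def linking_diagnostics(linked, observed):
    """Summaries of agreement between linked and observed scores, in observed-score SD units."""
    linked = np.asarray(linked, dtype=float)
    observed = np.asarray(observed, dtype=float)
    sd = observed.std(ddof=1)           # SD of the destination metric in this sample
    diff = linked - observed
    return {
        "mean_diff_sd": diff.mean() / sd,                 # bias
        "sd_diff_sd": diff.std(ddof=1) / sd,              # variability of differences
        "rmsd_sd": np.sqrt(np.mean(diff ** 2)) / sd,      # root mean squared difference
        "r": np.corrcoef(linked, observed)[0, 1],         # linked-observed correlation
    }
```

Under this convention, a mean difference of 0.1 or less would fall under the "minor" threshold described in the text.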
Finally, we constructed crosswalk tables linking SDQ summed scores to CBCL T-scores and vice versa using equipercentile linking with loglinear smoothing. Root mean squared differences between T-scores derived from SDQ summed scores directly and T-scores derived by first converting SDQ summed scores to CBCL summed scores were at most 1.5 T-score points, most of which is likely due to the rounding of CBCL summed scores required in the intermediate step of the latter approach. Thus, for each combination of sample (combined 2–5, male 6–11, female 6–11, male 12–17, female 12–17) and domain (Internalizing, Externalizing, Total Problems) used for deriving summed score-linking functions, we computed direct SDQ to CBCL T-score crosswalk tables, which are included with raw score conversion tables in Supplemental Tables S1a–S1o.
Figure 3 contains fit statistics from CFA and deviance statistics from IRT analyses for multiple group models of the SDQ, where groups are defined by gender (male and female) and age (2–5, 6–11, 12–17). Two-tier models with a positive valence factor are also included in Figure 3; these models generally fit better than the corresponding unidimensional models by both fit and deviance metrics, with generally unacceptable fit for the unidimensional models (except for Internalizing, where fit was generally acceptable) and large differences in deviance between unidimensional and two-tier models (~175 points for Internalizing; ~530 points for Externalizing; ~985 points for Total Problems). Within each set of models, there was little difference in relative fit according to SEM fit statistics for the different levels of measurement invariance, with a slight preference for the models with measurement invariance by both age and gender due to the larger number of degrees of freedom in these models. Models with measurement invariance by gender, but not by age, tended to have the best CFA-based model fit and lowest AIC and AICc, while models with measurement invariance by age and gender tended to have the lowest BIC and SABIC, although statistics for these models were generally similar regardless of the metric used. Based on these results, we concluded that the SDQ is essentially measurement invariant across its three forms, and that the crosswalk tables included herein can be used for nonstandard conversions. This also suggests, but does not prove, that the items with slight wording variations across SDQ forms function similarly across age ranges, and a more targeted study in which the same raters receive both variations of each item would be needed to verify that the unique items in each SDQ form have identical functioning in nonstandard age ranges.
Assessment of Measurement Invariance in the Strengths and Difficulties Questionnaire (SDQ)
Note. The displayed deviance values represent the difference between each statistic and the minimum of that statistic, calculated by domain.
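For readers who wish to reproduce this kind of comparison, the information criteria named above can be obtained from a model's deviance (−2 log-likelihood), number of free parameters, and sample size using their standard definitions. The sketch below also expresses each value relative to the minimum across models, as in the figure note; the function names and input values are hypothetical and are not results from this study.

```python
import math

def information_criteria(deviance, n_params, n_obs):
    """Standard information criteria from deviance (-2 log-likelihood)."""
    k, n = n_params, n_obs
    aic = deviance + 2 * k
    return {
        "AIC": aic,
        "AICc": aic + (2 * k * (k + 1)) / (n - k - 1),      # small-sample correction
        "BIC": deviance + k * math.log(n),
        "SABIC": deviance + k * math.log((n + 2) / 24),     # sample-size-adjusted BIC
    }

def relative_to_minimum(values):
    """Difference between each model's statistic and the smallest value."""
    lowest = min(values.values())
    return {name: value - lowest for name, value in values.items()}

# Hypothetical deviances and parameter counts for two invariance models.
bic_by_model = {
    "invariant by age and gender": information_criteria(10250.0, 40, 2000)["BIC"],
    "invariant by gender only": information_criteria(10180.0, 120, 2000)["BIC"],
}
bic_differences = relative_to_minimum(bic_by_model)
```

Because BIC and SABIC penalize parameters more heavily than AIC and AICc at these sample sizes, it is unsurprising that they favor the more constrained invariance model.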
In this study, we linked three domains (Internalizing, Externalizing, and Total Problems) of the CBCL and SDQ according to the age and gender structure of the CBCL T-score conversion tables. The Internalizing and Total scores on the CBCL and SDQ were essentially unidimensional when combined into a single item set, with noticeable but not meaningful deviations from unidimensionality. In contrast, the Externalizing domain deviated slightly from unidimensionality in the 2–5 sample, albeit with acceptable unidimensional model fit according to the metrics used, and in the school-age samples, the Externalizing summed scores were not as strongly correlated between the CBCL and SDQ as the other domains. Results were essentially identical after combining males and females within the preschool (2–5) sample.
Across the three linking methods used (IRT-based fixed-parameter, equipercentile with and without loglinear smoothing), bias was minimal, but for the Externalizing domain in the 6–11 and 12–17 samples, correlations between linked and observed scores were lower than for other domains and samples. The three score-linking methods yielded roughly identical results, and therefore, to enable extrapolation and interpolation of linked scores, equipercentile equating with loglinear smoothing was chosen to construct crosswalk tables. Aside from the Externalizing domain, linking functions yielded acceptable results when applied to the preschool gender subsamples and to subsamples in which children were administered a version of the SDQ other than the recommended version for their age. The three versions of the SDQ were essentially measurement invariant, justifying the conversion of any SDQ score to the metric of the preschool or school-age CBCL and other nonstandard conversions.
Equipercentile linking was used to construct conversions between SDQ summed scores and CBCL T-scores, with little difference in CBCL T-scores generated by first linking SDQ summed scores to CBCL summed scores or by directly linking SDQ summed scores to CBCL T-scores. The full set of crosswalk tables (Supplemental Tables S1a–S1o) can be used to convert SDQ to CBCL scores and vice versa, where CBCL scores are either provided/desired as T-scores or summed scores. While potentially unwieldy, this large number of tables provides a comprehensive map between scores on these two instruments across their many parent-report versions. Furthermore, these tables permit the placement of SDQ scores onto a normed metric, namely that of the CBCL, providing an ability to compare scores to their expected distribution in the general population.
As mentioned, score-level linking is one of several possible methods for IDA. Another approach is to estimate factor scores using item response models estimated on combined data from multiple instruments (Curran & Hussong, 2009; Hussong et al., 2013). For researchers interested in applying this approach instead, we include our combined CBCL–SDQ CFA parameter estimates in Supplemental Tables S2a–S2o. These parameters can be used to estimate factor scores, which share the same generalizability limitations as our current sample, using lavaan, Mplus (Muthén & Muthén, 1998–2011), or similar software.
Compared to the crosswalk tables presented in a previous CBCL–SDQ linking study conducted in a residential care setting (Stevens et al., 2021; Table 2), the crosswalk tables accompanying this report predict lower CBCL Total Problems T-scores from lower SDQ Total Difficulties scores (mean difference of −2.56 for SDQ = 0), but higher CBCL scores from higher SDQ scores (mean difference of 19.5 for SDQ = 36). Interestingly, for the age groups overlapping between the two studies (ages 12–17), mean SDQ raw scores (9.9–10.7) and CBCL T-scores (54.1–56.8) reported herein were fairly similar to those reported in Stevens et al. (12.1 and 56.9, respectively), suggesting that differences in crosswalk conversions are a function of sample characteristics other than the distribution of observed scores. Specifically, the sample in Stevens et al. was recruited through a residential care facility and, relative to the current sample, had a higher percentage of male participants (60%) and of Black or African American participants (24.6%), with a smaller percentage of Hispanic or Latino participants (8.8%). While no link is expected to hold in all possible subpopulations (Newton, 2010), these differences suggest that future research may be needed to assess the generalizability of CBCL–SDQ crosswalk conversions across certain groups.
The current work and Stevens et al. (2021) share several broader limitations. Both studies targeted specific populations and the linking functions derived in these works may not generalize to new populations of interest. The choice of rater is also a potential methodological artifact of both studies; in the current work, all reporters were the parents of the target child, while in Stevens et al., reporters also included other caregivers, parole officers, and program officers. These reporting protocols limit the generalizability of the linking functions derived herein and in Stevens et al. to other reporters (e.g., medical professionals, teachers) not represented in these analyses. Given the differences between the linking functions derived herein and those in Stevens et al., users should acknowledge any differences between their population and/or rater and those used in deriving these linking functions and may consider conducting a targeted linking study for their population of interest.
For many analyses, not all SDQ or CBCL summed scores or CBCL T-scores were represented in our data; however, equipercentile linking with loglinear smoothing allows for interpolation and extrapolation of linked scores, such that the scores presented in Supplemental Tables S1a–S1o represent the range of possible observed scores in the to-be-linked instruments. While the interpolated scores are likely as reliable as the scores which were directly linked, we repeat the general statistical recommendation to treat extrapolated scores with caution. Table 3 lists the observed mean and range of all linked scores (summed scores and T-scores), and values outside of this range should be treated as potentially unreliable, especially when interpreted for a given individual, for example, in clinical decision-making for an individual child.
Observed Score Ranges in Linking Samples
Dimension | Age | Gender | SDQ summed score: Min | SDQ summed score: Mean | SDQ summed score: Max | CBCL summed score: Min | CBCL summed score: Mean | CBCL summed score: Max | CBCL T-score: Min | CBCL T-score: Mean | CBCL T-score: Max | CBCL T-score: % Borderline | CBCL T-score: % Clinical |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Internalizing | 2–5 | Both | 0 | 4 | 17 | 0 | 12.5 | 64 | 29 | 53.2 | 95 | 7.7 | 14.6 |
Internalizing | 6–11 | Female | 0 | 4.3 | 16 | 0 | 10.7 | 62 | 33 | 54.8 | 99 | 8.9 | 15.5 |
Internalizing | 6–11 | Male | 0 | 5.4 | 16 | 0 | 14 | 64 | 34 | 58.9 | 100 | 8.4 | 25.1 |
Internalizing | 12–17 | Female | 0 | 4.9 | 15 | 0 | 13.2 | 60 | 33 | 55 | 99 | 9.6 | 17 |
Internalizing | 12–17 | Male | 0 | 5.3 | 17 | 0 | 13.2 | 64 | 34 | 58.2 | 100 | 8.9 | 24.3 |
Externalizing | 2–5 | Both | 0 | 7.1 | 20 | 0 | 14.2 | 47 | 28 | 42.4 | 63 | 0 | 0 |
Externalizing | 6–11 | Female | 0 | 5.5 | 19 | 0 | 10.7 | 65 | 34 | 53.9 | 96 | 3.8 | 16 |
Externalizing | 6–11 | Male | 0 | 6.7 | 19 | 0 | 15.8 | 68 | 33 | 58.3 | 98 | 5.1 | 25.8 |
Externalizing | 12–17 | Female | 0 | 5.1 | 16 | 0 | 12.2 | 63 | 34 | 51.5 | 96 | 6.1 | 12.2 |
Externalizing | 12–17 | Male | 0 | 5.4 | 20 | 0 | 12.4 | 70 | 34 | 54.8 | 100 | 5.4 | 21.2 |
Total | 2–5 | Both | 0 | 11.1 | 34 | 0 | 43 | 170 | 28 | 53.5 | 95 | 8.7 | 13.2 |
Total | 6–11 | Female | 0 | 9.8 | 30 | 0 | 39.9 | 224 | 25 | 54.4 | 97 | 6.6 | 16.9 |
Total | 6–11 | Male | 0 | 12.1 | 34 | 0 | 55.3 | 224 | 24 | 58.8 | 97 | 5.8 | 29.5 |
Total | 12–17 | Female | 0 | 9.9 | 31 | 0 | 45.9 | 214 | 24 | 54.1 | 97 | 10.5 | 15.7 |
Total | 12–17 | Male | 0 | 10.7 | 35 | 0 | 46.5 | 238 | 24 | 56.8 | 100 | 5.8 | 25.9 |
Note. CBCL = Child Behavior Checklist; SDQ = Strengths and Difficulties Questionnaire; min = minimum; max = maximum.
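One way to apply the recommendation to treat extrapolated conversions with caution is to flag any score that falls outside the observed range for its subgroup before converting it. The sketch below does this for SDQ summed scores using three rows copied from the table above; the dictionary layout, key format, and function name are hypothetical, and a full implementation would include every row and score type.

```python
# Observed (min, max) SDQ summed scores for a few subgroups, copied from Table 3.
# Keys are (domain, age group, gender); this layout is an illustrative assumption.
OBSERVED_SDQ_RANGE = {
    ("Internalizing", "2-5", "Both"): (0, 17),
    ("Total", "12-17", "Female"): (0, 31),
    ("Total", "12-17", "Male"): (0, 35),
}

def is_extrapolated(domain, age_group, gender, sdq_score):
    """Return True if the score lies outside the range observed in the linking sample."""
    low, high = OBSERVED_SDQ_RANGE[(domain, age_group, gender)]
    return not (low <= sdq_score <= high)

# An SDQ Total score of 33 was observed among males aged 12-17 but not among females.
flag_female = is_extrapolated("Total", "12-17", "Female", 33)
flag_male = is_extrapolated("Total", "12-17", "Male", 33)
```

A flagged score can still be converted with the crosswalk tables, but the resulting value rests on the smoothed model rather than on observed data.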
Table 3 also contains the percentages of individuals in each subgroup with CBCL T-scores classified as Borderline (between 65 and 70) or Clinically Meaningful (above 70). Despite being constructed as a general population sample, the percentages reported in Table 3 differ from what would be expected based on a normal distribution (6.7% Borderline and 9.1% Clinical). In particular, Externalizing scores were very low in the youngest subsample (ages 2–5), with no individuals scoring in the Borderline or Clinical range, while all other scores and subgroups had much higher percentages of Clinical scores (12.2%–29.5%) than expected in a general population sample. While these distributions could be due to idiosyncrasies of our sample, which was not collected through methods traditionally used in norming studies (e.g., stratified random sampling), they could also be indicative of outdated norms, as the current sample was collected in 2020 and the CBCL norms used herein were established in 2001. This time gap could influence standards for “normative” emotional and behavioral problems, similar to the Flynn effect in intelligence testing (Flynn, 1987), and an updated norming study for the CBCL could clarify this issue.
Due to the poor fit of the unidimensional Externalizing and Total models in the preschool forms and the relatively low correlation between observed and equated Externalizing scores in the school-age forms, we recommend exercising caution when applying the Externalizing conversions and preschool Total conversions. For preschool Externalizing and Total scores, while the linkages themselves were fairly stable, these composite scores on what are essentially multidimensional measures (e.g., Externalizing including conduct, hyperactivity, and inattention symptoms) can be difficult to interpret. The lower correlation and poor subscale invariance in the school-age Externalizing linkages are more concerning for placing scores onto a common metric, as they suggest that scores converted using these crosswalk tables may not be as similar as desired to the scores that would have been obtained on the instrument that was not administered. These differences could be due to aforementioned differences in the operationalizations of Externalizing in the two measures: while the Externalizing domain in the school-age CBCL focuses on conduct problems, half of the Externalizing items in the SDQ focus on attention and hyperactivity problems, an entirely different domain. We therefore recommend that Externalizing scores on these two measures not be treated, strictly speaking, as fully linked measures of the same underlying construct.
In clinical practice, the Externalizing and preschool Total crosswalk tables may be unsuitable for application at the level of the individual in high-stakes decisions, for instance, when assessing changes in functioning in an inpatient or residential care setting as a decision criterion for clinical or social welfare decisions. In general, these qualifications do not apply to the Internalizing conversions for any age group or Total scores for the school-age groups, for which unidimensionality and linking analyses resulted in more reliable linking functions. Correlations between CBCL and SDQ summed scores for these domains and samples fluctuated around .866, which is a commonly used cutoff for classifying linked scores as equated, indicating interchangeability of the two sets of scores (Dorans & Walker, 2007; Newton, 2010). Because these correlations are close to the cutoff, we would not recommend treating any linked scores as completely equivalent in high-stakes decisions for individuals; rather, these linked scores can be considered the most likely scores that would be obtained had the to-be-linked instrument been administered instead. As with almost any psychological test, high-stakes decisions should not be based on (linked) test scores alone but in combination with clinical severity of disorder and level of functioning as assessed by individuals trained in making such assessments.
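The .866 value is, to our understanding, the correlation at which the reduction-in-uncertainty index, 1 − √(1 − r²), reaches 50%; the cited sources should be consulted for the exact criterion, which is not spelled out here. The short sketch below evaluates that index for a given correlation. The function name is hypothetical.

```python
import math

def reduction_in_uncertainty(r):
    """Proportional reduction in prediction error from knowing the other score: 1 - sqrt(1 - r^2)."""
    return 1.0 - math.sqrt(1.0 - r ** 2)

# At r = .866, r^2 is approximately .75, so the index is approximately one half.
at_cutoff = reduction_in_uncertainty(0.866)
```

Correlations that fluctuate around the cutoff therefore correspond to reductions in uncertainty that fluctuate around one half, which is why the linked scores are described as close to, but not clearly, interchangeable.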
That being said, the aforementioned issues with the Externalizing and preschool Total scores do not necessarily preclude the use of their associated crosswalk conversions in research that requires pooling data from these measures, for example, in transdisciplinary consortium research or post hoc data aggregation. Despite these issues, the linked scores remain the best estimate of the score that would have been obtained on the measure that was not administered given an observed score on the administered measure, and placing scores on a common metric, although imperfect, retains utility over conducting no harmonization at all (e.g., Skaggs, 2005). Accordingly, we recommend that analysts using these linking functions conduct sensitivity analyses to determine whether the instrument used to generate a score influences their results; in regression analyses, this would involve including an indicator denoting the administered test as a covariate and moderator of the effect(s) of interest. Assuming otherwise equivalent samples, significant moderation would indicate that conclusions may be sensitive to the measure used, any other differences between administrations (e.g., population differences) notwithstanding. While this sensitivity analysis is an important tool to use whenever scores are linked to a common metric, it is particularly important when linkages are not perfectly reliable, as with the Externalizing and preschool Total Problems crosswalks presented herein. These linkages, when properly incorporated into pooled data analyses, provide a valuable tool for increasing the statistical power available to answer important research questions in child psychosocial health while permitting the statistical assessment of method effects caused by differences in assessment.
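The sensitivity analysis described above can be sketched as an ordinary least squares regression in which the instrument indicator enters both as a covariate and as a moderator of the linked score. The example is a minimal illustration with hypothetical variable names; it returns point estimates only, and in practice a statistics package would be used to obtain standard errors and a significance test for the interaction term.

```python
import numpy as np

def fit_instrument_moderation(outcome, linked_score, used_sdq):
    """OLS of outcome on linked score, instrument indicator, and their interaction.

    used_sdq is 1 if the score was converted from the SDQ and 0 if it came
    directly from the CBCL (a hypothetical coding).
    """
    outcome = np.asarray(outcome, dtype=float)
    linked_score = np.asarray(linked_score, dtype=float)
    used_sdq = np.asarray(used_sdq, dtype=float)

    design = np.column_stack([
        np.ones_like(linked_score),     # intercept
        linked_score,                   # effect of interest
        used_sdq,                       # instrument as covariate
        linked_score * used_sdq,        # instrument as moderator
    ])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    names = ["intercept", "score", "instrument", "score_x_instrument"]
    return dict(zip(names, coef))
```

An interaction estimate near zero is consistent with the effect of interest being similar regardless of which instrument generated the score, whereas a large interaction suggests that conclusions depend on the measure administered.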
Furthermore, it should be acknowledged that the correlation between linked and observed scores was not perfect and that there is inherent error in the process of converting scores on one instrument to equated scores on the other. This error is partly a function of sample size: while the sample sizes used to derive linking functions here (~250–500) are well beyond those needed to yield meaningful statistical gains from equating scores (50–200; Aşiret & Sünbül, 2016), they are smaller than in some large-scale test linking studies, for example, in high-stakes achievement test linking (e.g., Pommerich et al., 2000).
When both SDQ and CBCL scores must be used in the same analysis, researchers have an option of transforming CBCL scores into SDQ scores or vice versa. The CBCL has more items than the SDQ and is a more reliable measure; therefore, when roughly equal numbers of individuals complete both instruments, we recommend transforming SDQ scores into CBCL scores. However, if the number of SDQ scores is considerably higher than the number of CBCL scores, we recommend transforming the CBCL scores to SDQ scores instead, recognizing the tradeoff between the reliability of each score and the necessary loss of reliability when scores from one instrument are linked to the metric of another.
With an estimated 17 million children facing a mental health problem in the U.S. (Child Mind Institute, 2015), the need for robust pediatric screening is of utmost importance. Pragmatic approaches to assessing and monitoring children’s emotional and behavioral functioning are required for such practices to become commonplace. The present study addresses this need by providing health care providers and researchers a way to convert scores between two popular measures. While caution is noted when implementing such harmonized scores in practice, particularly at the individual level, the ability to score the CBCL and SDQ on a common metric via the crosswalks developed in this study provides a much-needed option to expand the utility of these measurement systems and ultimately pediatric mental health screening in general.
We applied modern score-linking methodology to create score conversion tables for the Child Behavior Checklist and the Strengths and Difficulties Questionnaire, two commonly used measures of emotional and behavioral problems in children, allowing scores on each instrument to be converted to the metric of the other.