Standards for Selecting Quality Tools

Quality tools require:

Standardization, meaning that a test is given in exactly the same way to a large sample of children in order to produce (in the case of assessment-level and diagnostic tools) a normative set of age-equivalent scores and quotients. The standardization sample must be current, i.e., collected within the last 10 years, because national demographics change frequently. The sample must also be representative, meaning that it reflects contemporary national demographics (e.g., parents’ levels of education, geographic areas of residence, income, and languages spoken at home). Typically, Census Bureau data are used to illustrate, usually via matching percentages, that the characteristics of a test’s sample reflect the country as a whole.
Standardization in the most prevalent languages (e.g., in the US, in both English and Spanish at the very least). This does not mean that separate norms should be generated for each language: all children must be held to the same performance standards so that examiners can tell who is behind, i.e., likely to be unready for kindergarten, where curricular goals and determination of school success are virtually universal. Still, it is critical that translations be carefully done and thoroughly vetted so that dual-language learners and bilingual children are given an equal opportunity for success (or failure).
Proof of reliability. There are several kinds of reliability that should be included in every test manual:
1. Inter-rater, meaning that two different examiners derive the same results when testing the same child within a short period of time. This illustrates that the directions are clear and that the test norms can be used confidently;
2. Test-retest, meaning that children’s performance is highly similar when they are tested again within a short period of time. This indicates that test stimuli and test directions are clear enough to both examiner and child;
3. Internal consistency, meaning that similar kinds of items “hang together” (e.g., that motor items don’t cluster with language items, which would suggest that directions for motor tasks demand too many language skills for the tasks to be a meaningful measure of motor skills).
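As a rough illustration of how two of these reliability indicators are often quantified, the sketch below computes a Pearson correlation (one common way to summarize test-retest or inter-rater agreement) and Cronbach's alpha (one common index of internal consistency). The function names and all scores are hypothetical, made up for illustration; test manuals may report other statistics.

```python
from statistics import mean, pvariance


def pearson_r(first, second):
    """Pearson correlation between two sets of paired scores,
    e.g. the same children tested twice (test-retest)."""
    m1, m2 = mean(first), mean(second)
    cov = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    sd1 = sum((a - m1) ** 2 for a in first) ** 0.5
    sd2 = sum((b - m2) ** 2 for b in second) ** 0.5
    return cov / (sd1 * sd2)


def cronbach_alpha(rows):
    """Cronbach's alpha for internal consistency.

    rows: one list per child, each holding that child's item scores.
    """
    k = len(rows[0])                       # number of items
    columns = list(zip(*rows))             # scores grouped by item
    item_var = sum(pvariance(col) for col in columns)
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_var / total_var)


if __name__ == "__main__":
    # Hypothetical raw scores for five children tested two weeks apart.
    time1 = [12, 15, 9, 20, 17]
    time2 = [13, 14, 10, 19, 18]
    print("test-retest r:", round(pearson_r(time1, time2), 2))

    # Hypothetical item-level scores (4 items) for the same five children.
    items = [
        [1, 1, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 1, 1],
        [1, 0, 1, 1],
    ]
    print("Cronbach's alpha:", round(cronbach_alpha(items), 2))
```

Values closer to 1 indicate stronger agreement or consistency; what counts as acceptable is a judgment set by the field and the test's purpose, not by this sketch.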
Proof of validity. There are various kinds of validity, only the most critical of which are described below:
1. Concurrent Validity, i.e., high correlations with diagnostic measures, along with indicators of how age-equivalent scores or quotients compare with those from diagnostic measures.
2. Discriminant Validity. Ideally, though not always, manuals include proof of discriminant validity, meaning that there are distinct patterns of test performance for children with different disabilities (e.g., that children with cerebral palsy perform differently from children with language impairment), and that performance on each domain measured correlates highly with performance on similar domains of diagnostic measures.
3. Predictive Validity. Rare but valuable is proof of predictive validity, meaning that test results predict future outcomes and thus that current test results have meaningful long-term implications. The longitudinal studies required are expensive, time-consuming, and arduous to conduct, which is why they are uncommon. Nevertheless, some do exist, if not in the test manual then in the research literature (most particularly in the ERIC and PsycINFO databases). These databases are also a great source for finding a range of validity studies conducted by various authors.
Proof of accuracy (also known as criterion-related validity). This is a critical requirement for screening tests, which must establish cutoff scores that determine whether a child is probably behind versus probably OK so that swift decisions can be made about whether referrals are needed. To establish cutoffs, screening test scores are compared to concurrent diagnostic testing, informally known as the “gold standard”. Ideally, the diagnostic battery is a comprehensive one that includes a range of measures determining the presence of common disabilities that require intervention. The common disabilities are, in order of prevalence, language impairment, learning disabilities, intellectual disabilities, and autism spectrum disorder. ADHD is not always considered a disability by early intervention or special education programs, but rather more of a barrier to success, in the same way that stairs rather than wheelchair ramps are a barrier to those with physical disabilities.
Indicators of accuracy are most often defined as sensitivity and specificity: the percentage of children with problems who are correctly detected, and the percentage of children without problems who are correctly identified as such. False positives (also known as over-referrals) are somewhat less of a concern because research shows that children who perform poorly on screens but do not qualify for early intervention or special education tend to score below average on diagnostic measures and to have psychosocial risk factors that suggest ongoing delays. So, false-positive results provide a helpful indicator that other kinds of services are needed (e.g., Head Start, parent training, developmental promotion, mental health or social services). For more information, please see our research pages and particularly the peer-reviewed paper called “Are Over-Referrals On Screening Tests Really a Problem?”
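The comparison of screening results against concurrent diagnostic testing can be sketched as follows. This is a minimal illustration, assuming that lower screening scores indicate concern and that a child "fails" the screen when the score falls below the cutoff; the function name, the cutoff, and all data are hypothetical.

```python
def screening_accuracy(scores, has_disability, cutoff):
    """Compare screening results with diagnostic ("gold standard") results.

    scores:          screening test scores, one per child
    has_disability:  True/False diagnostic outcome for the same children
    cutoff:          a child fails the screen if score < cutoff (assumption)
    """
    tp = fn = tn = fp = 0
    for score, diagnosed in zip(scores, has_disability):
        failed_screen = score < cutoff
        if diagnosed and failed_screen:
            tp += 1          # problem correctly detected
        elif diagnosed:
            fn += 1          # problem missed (under-referral)
        elif failed_screen:
            fp += 1          # over-referral
        else:
            tn += 1          # typical development correctly identified
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positives": fp,
        "false_negatives": fn,
    }


if __name__ == "__main__":
    # Hypothetical data for eight children.
    scores = [3, 5, 8, 9, 4, 10, 5, 7]
    diagnosed = [True, True, False, False, True, False, False, True]
    result = screening_accuracy(scores, diagnosed, cutoff=6)
    print(result["sensitivity"], result["specificity"])  # prints 0.75 0.75
```

Moving the cutoff trades one indicator against the other: a higher cutoff in this sketch catches more children with problems (higher sensitivity) at the cost of more over-referrals (lower specificity).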

For assessment-level and diagnostic measures, accuracy indicators in the form of sensitivity and specificity are rarely reported and not actually needed, because such tools are usually administered to children who have already failed screens and for whom determination of eligibility and progress monitoring are the more central tasks at hand. Nevertheless, it is important to know how results differ between assessment-level and diagnostic tools (e.g., Are the age-equivalent scores or quotients deflated or inflated? How well do they match those of other tests?). Occasionally, and optimally, assessment-level tools tie results to screening-type cutoff scores to illustrate when a developmental age equivalent is sufficiently behind chronological age to warrant a referral. This helps professionals make wise decisions about families’ needs.

Each (US) State has criteria for early intervention and special education eligibility. Age-equivalent scores (used to render a percentage of delay) and quotients are a typical part of those criteria. Examiners should refer to State standards to guide decisions about referrals.
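As a small worked sketch of how an age-equivalent score can be turned into a percentage of delay, the code below uses the commonly applied formula (chronological age minus age equivalent, divided by chronological age, times 100). The function names are hypothetical, and the 25% threshold in the example is only a placeholder: actual eligibility criteria differ by State and must be taken from State standards.

```python
def percent_delay(chronological_age_months, age_equivalent_months):
    """Percentage by which the age equivalent lags behind chronological age."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    gap = chronological_age_months - age_equivalent_months
    return 100 * gap / chronological_age_months


def meets_delay_criterion(delay_percent, threshold_percent):
    """True if the delay is at least as large as the supplied threshold.

    threshold_percent must come from the relevant State's criteria;
    no default is provided because criteria vary.
    """
    return delay_percent >= threshold_percent


if __name__ == "__main__":
    # Hypothetical child: 24 months old, performing at an 18-month level.
    delay = percent_delay(24, 18)
    print(delay)                                # prints 25.0
    # 25 is a placeholder threshold, not any particular State's rule.
    print(meets_delay_criterion(delay, 25))     # prints True
```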

For more information on psychometric standards for measures, including screening tests, please see our slide show “How to Spot a Quality Screen”.