Assessment Glossary
Core Assessment and Evaluation Terminology
Terminology from the National Council on Measurement in Education (NCME)
(Selected terms and their definitions are provided below; the full glossary of terms is available from NCME.)
- Ability parameter - In item response theory (IRT), a theoretical value indicating the level of a test taker on the ability or trait measured by the test; analogous to the concept of true score in classical test theory.
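  As an illustration (a standard IRT formulation, not part of the NCME definition), the one-parameter logistic (Rasch) model links the ability parameter to the probability of answering an item correctly:

  ```latex
  % Rasch (one-parameter logistic) model: probability of a correct
  % response for a test taker with ability \theta on an item of
  % difficulty b. Higher \theta relative to b means higher probability.
  P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}
  ```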
- Ability testing - The use of tests to evaluate the current performance of a person in some defined domain of cognitive, psychomotor, or physical functioning.
- Accessible/accessibility - Degree to which the items or tasks on a test enable as many test takers as possible to demonstrate their standing on the target construct without being impeded by characteristics of the item that are irrelevant to the construct being measured.
- Achievement levels/proficiency levels - Descriptions of a test taker's level of competency in a particular area of knowledge or skill, usually defined as ordered categories on a continuum, often labeled from "basic" to "advanced," or "novice" to "expert," that constitute broad ranges for classifying performance.
- Achievement test - A test to measure the extent of knowledge or skill attained by a test taker in a content domain in which the test taker has received instruction.
- Assessment - Any systematic method of obtaining information from tests and other sources, used to draw inferences about characteristics of people, objects, or programs; a process designed to systematically measure or evaluate the characteristics or performance of individuals, programs, or other entities, for purposes of drawing inferences; sometimes used synonymously with test.
- Assessment literacy - Knowledge about testing that supports valid interpretations of test scores for their intended purposes, such as knowledge about test development practices, test score interpretations, threats to valid score interpretations, score reliability and precision, test administration, and use.
- Authentic assessment - An assessment containing items that are judged to be measuring the ability to apply and use knowledge in real-world contexts.
- Benchmark assessments - Assessments administered in educational settings at specified times during a curriculum sequence, to evaluate students' knowledge and skills relative to an explicit set of longer-term learning goals. See interim assessments.
- Bias - 1. In test fairness, construct underrepresentation or construct-irrelevant components of test scores which differentially affect the performance of different groups of test takers and consequently the reliability/precision and validity of interpretations and uses of their test scores. 2. In statistics or measurement, systematic error in a test score. See predictive bias, construct underrepresentation, construct irrelevance, fairness.
- Certification - A process by which individuals are recognized (or certified) as having demonstrated some level of knowledge and skill in some domain. See licensing, credentialing.
- Classical test theory - A psychometric theory based on the view that an individual's observed score on a test is the sum of a true score component for the test taker and an independent random error component.
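  In symbols (a standard formulation of the theory, not quoted from NCME), the observed score X is the sum of true score T and random error E:

  ```latex
  % Classical test theory decomposition of an observed score.
  X = T + E, \qquad \operatorname{Cov}(T, E) = 0
  % Because the error component is independent of the true score,
  % observed-score variance decomposes additively:
  \sigma^2_X = \sigma^2_T + \sigma^2_E
  ```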
- Construct - Concept or characteristic the test is designed to measure.
- Construct-irrelevant variance - Variance in test-taker scores that is attributable to extraneous factors that distort the meaning of the scores, and thereby, decrease the validity of the proposed interpretation.
- Convergent evidence - Evidence based on the relationship between test scores and other measures of the same or related construct.
- Credentialing - Granting to a person, by some authority, a credential, such as a certificate, license, or diploma, that signifies an acceptable level of performance in some domain of knowledge or activity.
- Criterion-referenced score interpretation - The meaning of a test score for an individual or an average score for a defined group, indicating an individual’s or group’s level of performance in relationship to some defined criterion domain. Examples of criterion-referenced interpretations include comparison to cut scores, interpretations based on expectancy tables, and domain-referenced score interpretations. (Contrast with norm-referenced score interpretation.)
- Cut score - A specified point on a score scale, such that scores at or above that point are reported, interpreted, or acted upon differently from scores below that point.
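  A minimal sketch of how cut scores partition a score scale into reporting categories; the cut points, labels, and 0-100 scale below are hypothetical:

  ```python
  # Hypothetical cut scores dividing a 0-100 score scale into
  # ordered performance levels (highest cut checked first).
  CUT_SCORES = [(70, "advanced"), (55, "proficient"), (40, "basic")]

  def classify(score: float) -> str:
      """Scores at or above a cut score are reported differently
      from scores below it."""
      for cut, label in CUT_SCORES:
          if score >= cut:
              return label
      return "below basic"

  print(classify(62))  # proficient
  ```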
- Derived score - A score resulting from the conversion of raw scores to another scale in order to enhance interpretation. Examples are percentile ranks, standard scores, and grade-equivalent scores.
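  As a sketch, one common derived score is the percentile rank, computed here against a small illustrative norm group (the data and the "percentage scoring below" convention are assumptions; definitions of percentile rank vary):

  ```python
  def percentile_rank(score: float, norm_group: list[float]) -> float:
      """Percentage of the norm group scoring below the given raw score."""
      below = sum(1 for s in norm_group if s < score)
      return 100 * below / len(norm_group)

  # Illustrative norm-group raw scores.
  norm = [12, 15, 18, 20, 21, 23, 25, 27, 30, 33]
  print(percentile_rank(24, norm))  # 60.0 -> roughly the 60th percentile
  ```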
- Empirical evidence - Evidence based on some form of data, as opposed to that based on logic or theory.
- Error of measurement - The difference between an observed score and the corresponding true score. See standard error of measurement, systematic error, random error, and true score.
- Evaluation - The process of gathering information to make a judgment about the quality or worth of some program or performance. The term also is used to refer to the judgment itself, as in “My evaluation of his work is . . . .”
- Extraneous variance - The variability in test scores that occurs among individuals in a group because of differences in those persons that are irrelevant to what the test is intended to measure. For example, a science test that requires mathematics skills and a reading ability beyond what its content domain specifies will have two sources of extraneous variance. In this case, students’ science scores might differ, not only because of differences in their science achievement, but also because of differences in their (extraneous) mathematics and reading abilities. (See also construct irrelevance.)
- Formative assessment - An assessment process used by teachers and students during instruction that provides feedback to adjust ongoing teaching and learning with the goals of improving students' achievement of intended instructional outcomes.
- Gain score - In testing, the difference between two scores obtained by a test taker on the same test or two equated tests taken on different occasions, often before and after some treatment.
- Generalizability theory - Methodological framework for evaluating reliability/precision in which various sources of error variance are estimated through the application of the statistical techniques of analysis of variance. The analysis indicates the generalizability of scores beyond the specific sample of items, persons, and observational conditions that were studied.
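  For example (a standard result from generalizability theory, not quoted from NCME), in a simple persons-by-items design the estimated variance components combine into a generalizability coefficient for relative decisions:

  ```latex
  % One-facet (persons x items) crossed design:
  %   \sigma^2_p      person (universe-score) variance
  %   \sigma^2_{pi,e} person-by-item interaction confounded with error
  %   n_i             number of items the score generalizes over
  E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n_i}
  ```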
- High-stakes test - A test used to provide results that have important, direct consequences for individuals, programs, or institutions involved in the testing. Contrast with low-stakes tests.
- Inter-rater agreement/consistency - The level of consistency with which two or more judges rate the work or performance of test takers. See inter-rater reliability.
- Inter-rater reliability - Consistency in the rank ordering of ratings across raters. See inter-rater agreement.
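  A minimal sketch contrasting raw agreement with one common chance-corrected index, Cohen's kappa (the ratings below are hypothetical, and kappa is only one of several agreement statistics in use):

  ```python
  from collections import Counter

  def percent_agreement(r1, r2):
      """Proportion of test takers given identical ratings by two raters."""
      return sum(a == b for a, b in zip(r1, r2)) / len(r1)

  def cohens_kappa(r1, r2):
      """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
      n = len(r1)
      p_o = percent_agreement(r1, r2)
      c1, c2 = Counter(r1), Counter(r2)
      p_e = sum(c1[k] * c2[k] for k in c1) / n ** 2  # chance agreement
      return (p_o - p_e) / (1 - p_e)

  rater1 = ["pass", "pass", "fail", "pass", "fail"]
  rater2 = ["pass", "fail", "fail", "pass", "fail"]
  print(percent_agreement(rater1, rater2))  # 0.8
  print(cohens_kappa(rater1, rater2))       # ~0.62
  ```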
- Intra-rater reliability - The degree of agreement among repetitions of a single rater in scoring test takers’ responses. Inconsistencies in the scoring process resulting from influences that are internal to the rater rather than true differences in test takers’ performances result in low intra-rater reliability.
- Low-stakes test - A test used to provide results that have only minor or indirect consequences for individuals, programs, or institutions involved in the testing. Contrast with high-stakes test.
- Mastery test - A test designed to indicate whether a test taker has or has not attained a prescribed level of competence in a domain. See cut score, computer-based mastery test.
- Moderator variable - A variable that affects the direction or strength of the relationship between two other variables.
- Norm-referenced score interpretation - A score interpretation based on a comparison of a test taker's performance to the distribution of performance in a specified reference population. Contrast with criterion-referenced score interpretation.
- Objective test - A test containing items that can be scored without any personal interpretation (subjectivity) required on the part of the scorer. Tests that contain multiple-choice, true-false, and matching items are examples.
- Performance assessments - Assessments for which the test taker actually demonstrates the skills the test is intended to measure by doing tasks that require those skills.
- Performance standards - Descriptions of levels of knowledge and skill acquisition contained in content standards, as articulated through performance level labels (e.g., “basic”, “proficient”, “advanced”), statements of what test takers at different performance levels know and can do, and cut scores or ranges of scores on the scale of an assessment that differentiate levels of performance. See cut scores, performance level, performance level descriptor.
- Random error - A non-systematic error; a component of test scores that appears to have no relationship to other variables.
- Raw score - The score on a test that is often calculated by counting the number of correct answers, but more generally a sum or other combination of item scores.
- Reliability/precision - The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and consistent for an individual test taker; the degree to which scores are free of random errors of measurement for a given group. See generalizability theory, classical test theory, precision of measurement.
- Reliability coefficient - A unit-free indicator that reflects the degree to which scores are free of random measurement error. See generalizability theory.
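  Under classical test theory (a standard identity, not quoted from NCME), the reliability coefficient is the proportion of observed-score variance attributable to true scores:

  ```latex
  % Reliability: ratio of true-score variance to observed-score variance.
  \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}
  ```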
- Response bias - A test taker's tendency to respond in a particular way or style to items on a test (e.g., acquiescence, choice of socially desirable options, choice of 'true' on a true-false test) that yields systematic, construct-irrelevant error in test scores.
- Scoring rubric - The established criteria, including rules, principles, and illustrations, used in scoring constructed responses to individual tasks and clusters of tasks.
- Standard error of measurement - The standard deviation of an individual's observed scores from repeated administrations of a test (or parallel forms of a test) under identical conditions. Because such data cannot generally be collected, the standard error of measurement is usually estimated from group data. See error of measurement.
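  In classical test theory (a standard estimator, not quoted from NCME), the standard error of measurement is typically estimated from the group's observed-score standard deviation and a reliability coefficient:

  ```latex
  % SEM from the observed-score standard deviation \sigma_X and
  % a reliability estimate \rho_{XX'}.
  \mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}
  ```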
- Standard setting - The process, often judgment-based, of using a structured procedure to determine cut scores that define different levels of performance, as specified by performance level labels and performance level descriptors.
- Standardization - 1. In test administration, maintaining a consistent testing environment and conducting the test according to detailed rules and specifications, so that testing conditions are the same for all test takers on the same and multiple occasions. 2. In test development, establishing norms based on the test performance of a representative sample of individuals from the population with which the test is intended to be used.
- Summative assessment - The assessment of a test taker’s knowledge and skills typically carried out at the completion of a program of learning, such as the end of an instructional unit.
- True score - In classical test theory, the average of the scores that would be earned by an individual on an unlimited number of strictly parallel forms of the same test.
- Validation - The process through which the validity of the proposed interpretation of test scores for their intended uses is investigated.
- Validity - The degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test. If multiple interpretations of a test score for different uses are intended, validity evidence for each interpretation is needed.
- Weighted scores/scoring - A method of scoring a test in which the number of points awarded for a correct (or diagnostically relevant) response is not the same for all items. In some cases, the scoring formula awards more points for one response to an item than for another response.
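  A sketch contrasting a simple raw-score count with weighted scoring; the per-item weights and responses below are hypothetical:

  ```python
  # Hypothetical per-item point values: more diagnostic items earn more.
  WEIGHTS = [1, 1, 2, 3, 1]
  responses = [1, 0, 1, 1, 1]  # 1 = correct, 0 = incorrect

  raw_score = sum(responses)                                       # 4
  weighted_score = sum(w * r for w, r in zip(WEIGHTS, responses))  # 7
  print(raw_score, weighted_score)
  ```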