A researcher administers one form of a test on one day and then administers an equivalent form to the same group of people at a later date. This example illustrates alternate-forms reliability (also called parallel-forms reliability or the "coefficient of equivalence"). When correlations are obtained among individual test items, internal consistency reliability (the "coefficient of internal consistency") is being assessed. The 3 methods for obtaining this reliability are split-half (dividing the test into 2 parts and then correlating responses on the 2 parts), Kuder-Richardson Formula 20 (used when test items are scored dichotomously, e.g., "true/false"), and Cronbach's coefficient alpha (used for tests with multiple-scored items, e.g., "never/rarely/sometimes/always").
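As a rough illustration of the last two methods, the sketch below computes KR-20 and coefficient alpha from a small score matrix. The function names and the data are made up for this example, and population variances are assumed throughout; under that assumption the two formulas agree for items scored 0/1.

```python
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """Coefficient alpha; scores is a list of rows, one row of item scores per examinee."""
    k = len(scores[0])                       # number of items
    items = list(zip(*scores))               # transpose: one column per item
    sum_item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)


def kr20(scores):
    """KR-20 for dichotomous (0/1) items: item variance is p * (1 - p)."""
    k = len(scores[0])
    items = list(zip(*scores))
    sum_pq = sum(mean(col) * (1 - mean(col)) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / total_var)


# Hypothetical data: 5 examinees, 4 true/false items scored 1 (correct) or 0.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha_value = cronbach_alpha(responses)
kr20_value = kr20(responses)
print(round(alpha_value, 3), round(kr20_value, 3))
```

For multiple-scored items (e.g., a 1 to 4 rating scale), only cronbach_alpha applies, since the p * (1 - p) shortcut in kr20 holds only for 0/1 scoring.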
Because the split-half method correlates two half-length tests, it usually yields an artificially low reliability coefficient; the Spearman-Brown formula can be used to correct for the effects of shortening the measure. Measures of internal consistency are not good for assessing the reliability of speed tests, because the resulting coefficient would be spuriously inflated.
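The Spearman-Brown correction can be sketched as follows. The function name and the sample correlation of 0.60 are illustrative assumptions; the formula itself is the standard one, r_full = n * r / (1 + (n - 1) * r), with n = 2 for a split-half correction.

```python
def spearman_brown(r, length_factor=2.0):
    """Estimate reliability after changing test length by length_factor.

    r is the reliability of the current (e.g., half-length) test;
    length_factor = 2 corrects a split-half correlation to full length.
    """
    return (length_factor * r) / (1 + (length_factor - 1) * r)


# Hypothetical split-half correlation of 0.60 between the two halves.
half_r = 0.60
full_r = spearman_brown(half_r)
print(round(full_r, 2))
```

With these numbers the corrected estimate is about 0.75, which shows how much the uncorrected half-test correlation understates full-length reliability.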
Instruments that rely on rater judgments should have high inter-rater (interscorer) reliability, which is increased when scoring categories are mutually exclusive (a particular behavior belongs to a single category) and exhaustive (categories cover all possible responses/behaviors). The standard error of measurement estimates the amount of error to be expected in an individual test score and is used to determine a range, referred to as a confidence interval, within which an examinee's true score will likely fall. The formula for the standard error of measurement is SEmeas = SDx × √(1 − rxx), where SDx is the standard deviation of test scores and rxx is the reliability coefficient.
The probability that a person's true score lies within plus or minus 1 standard error of measurement (SEM) of the obtained score is about 68%; within plus or minus 1.96 (roughly 2) SEM, 95%; and within plus or minus 2.58 (roughly 2.5) SEM, 99%. Hypothetically, a test with a reliability coefficient of +1.0 would have a standard error of measurement of 0.0; that is, a test with perfect reliability has no measurement error.
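A short worked sketch of these intervals follows, using the standard classical-test-theory formula SEM = SD × √(1 − rxx). The function names and the example values (SD of 15, reliability of 0.91, obtained score of 100) are assumptions chosen for illustration.

```python
import math


def standard_error_of_measurement(sd, rxx):
    """SEM = SD * sqrt(1 - rxx), where rxx is the reliability coefficient."""
    return sd * math.sqrt(1 - rxx)


def confidence_interval(obtained, sd, rxx, z=1.96):
    """Range around the obtained score: z = 1.0 (68%), 1.96 (95%), 2.58 (99%)."""
    sem = standard_error_of_measurement(sd, rxx)
    return obtained - z * sem, obtained + z * sem


# Hypothetical test: SD = 15, reliability = 0.91, obtained score = 100.
sem = standard_error_of_measurement(15, 0.91)
low, high = confidence_interval(100, 15, 0.91, z=1.96)
print(round(sem, 2), round(low, 2), round(high, 2))

# Perfect reliability leaves no measurement error.
print(standard_error_of_measurement(15, 1.0))
```

Here the SEM is about 4.5 points, so the 95% interval runs from roughly 91.2 to 108.8; with rxx = 1.0 the interval collapses to the obtained score itself.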
The standard error of measurement is inversely related to the reliability coefficient (rxx) and positively related to the standard deviation of test scores (SDx). When practical, alternate-forms reliability is the best reliability coefficient to use. Classical test theory states that an observed score reflects a true score plus random error, so observed score variance is the sum of true score variance and random error variance. Methods of recording behaviors include duration recording (the elapsed time during which the behavior occurs is recorded), frequency recording (the number of times the behavior occurs is recorded), interval recording (the rater notes whether the subject engages in the behavior during a given time period), and continuous recording (all behavior during an observation session is recorded). Simply put, validity refers to the degree to which a test measures what it purports to measure.
A depression scale that only assesses the affective aspects of depression but fails to account for the behavioral aspects would be lacking content validity, which refers to the extent to which test items represent all facets of the content area being measured (e.g., EPPP). Content validity assessment requires a degree of agreement among subject-matter experts, so it includes an element of subjectivity. In addition, tests should correlate highly with other tests that measure the same content domain. In contrast to content validity, face validity exists when a test appears valid to examinees, administrators, and other untrained observers; it is not technically a type of test validity. A personality test that effectively predicts the future behavior of an examinee has criterion-related validity, which is obtained by correlating scores on a predictor test with some external criterion (e.g., academic achievement, job performance).
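One way to picture the criterion-related validity coefficient is as a Pearson correlation between predictor scores and criterion scores. The sketch below assumes that interpretation; the function name and both data lists are invented purely for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: predictor test scores and a later criterion (e.g., GPA).
predictor = [10, 12, 14, 16, 18]
criterion = [2.0, 2.5, 2.4, 3.2, 3.6]

validity_coefficient = pearson_r(predictor, criterion)
print(round(validity_coefficient, 2))
```

A coefficient near 1.0 would suggest that the predictor tracks the criterion closely in this sample; real validity coefficients are typically much lower than in this toy data.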