# Classical Measurement Theory and Reliability

**What is Reliability**

Reliability refers to the consistency of responses on **self-report,
norm-referenced** measures of **attitudes and behavior**. Reliability arises from
**classical measurement theory**, which holds that any score obtained from an
instrument will be a composite of the individual's true pattern and error
variability.

The error is made up of random and systematic components.
Maximizing the instrument's reliability helps to reduce the random error
associated with the scores, whereas the validity of the instrument helps to minimize
systematic error **(see "Validity")**.

The **"true"** score, or true variance, in measurement
depends on the consistency of the instrument as reflected by form and content,
the stability of the responses over time, and freedom from response bias or
differences that could contribute to error. Error related to content results
from the way questions are asked and the mode of instrument administration.

Time can contribute to error through the frequency of measurement and the time frame
imposed by the questions asked. Error due to response differences results from
the state or mood of the respondent, wording of questions that may lead to a
response bias, and the testing or conceptual experience of the subject.

## Forms of Reliability

There are generally two forms of reliability assessment designed to deal with random error: stability and equivalence. Stability is the reproducibility of responses over time. Equivalence is the consistency of responses across a set of items so that there is evidence of a systematic pattern. Both of these forms apply to self-report as well as to observations made by a rater.

For self-report measures, stability is examined through
test-retest procedures; equivalence is assessed through alternative-forms and
internal consistency techniques. For observational measurement, **intrarater and
interrater** techniques assess the two forms of reliability, respectively.

## Stability Reliability

Stability reliability is considered by some to be the only true way to measure the consistency of responses on an instrument. In fact, stability was the primary manner in which early instruments were examined for reliability. Stability is measured primarily through test-retest procedures in which the same instrument is given to the same subjects at two different points in time, commonly 2 weeks apart.

The scores are then correlated, or compared for consistency, using some form of agreement score that depends on the level of measurement. Typically, data are continuous; thus, correlation coefficients and differences between mean scores are usually assessed.

A correlation tells the
investigator whether individuals who scored high on the first administration
also scored high on the second. It does not provide information on whether the
scores are the same. Only a test that looks at the difference in mean scores
will give that information.

## Stability and Its Problems

The problem with stability is that it is not always reasonable to assume that the concept will remain unchanged over time. If the person's true score on a concept changes within 2 weeks, instability and high random error will be assumed, when, in effect, the instrument may be consistently measuring change across time.

Reliance on a 2-week interval for
measuring stability may be faulty. The time interval chosen must directly
relate to the theoretical understanding of the concept being measured.

A special case of stability occurs with instruments that are
completed by raters on the basis of their observations. Intrarater reliability
refers to the need for a rater's scores to remain stable across the course of data
collection and not to change due to increased familiarity and practice with the
instrument. The same assessment procedures are used for intrarater reliability
as for test-retest reliability.

## Equivalence and Its Evaluation

Equivalence is evaluated in two major ways. The first of these predated the availability of high-speed computers and easily accessed statistical packages. This set of techniques deals with the comparison of scores on alternate or parallel forms of the instrument to which the subject responds at the same point in time.

Parallelism means an item on one form has a comparable item on the second form, indexing the same aspect of the concept, and that the means and variances of these items are equal. These scores are compared through correlation or mean differences in a similar manner to stability.

Consistency is assumed if the scores are equivalent. Assessment with alternative/parallel forms is not a comparison of two different measures of the concept; it is a comparison of two essentially identical tests that were developed at the same time through the same procedures.

Therefore, a difficulty
with this approach to equivalent reliability is obtaining a true parallel or
alternative form of an instrument.

## Equivalence and Internal Consistency

A more common way to look at equivalence is through internal consistency procedures. The assumption underlying internal consistency is that the response to a set of scale items should be equivalent. All internal consistency approaches are based on correlational procedures.

An earlier form of internal consistency is split-half reliability, in which
responses to half the items on a scale are randomly selected and compared
with responses on the other half.
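A minimal sketch of the split-half procedure, assuming a respondents-by-items data matrix and adding the standard Spearman-Brown step-up correction for the halved test length (the data and function name are hypothetical):

```python
import numpy as np

def split_half(items, seed=0):
    """Split-half reliability: randomly split the items into two halves,
    correlate the two half scores, then apply the Spearman-Brown
    correction to estimate the full-length reliability."""
    X = np.asarray(items, float)                 # rows = respondents, columns = items
    cols = np.random.default_rng(seed).permutation(X.shape[1])
    half_a = X[:, cols[::2]].sum(axis=1)
    half_b = X[:, cols[1::2]].sum(axis=1)
    r_half = np.corrcoef(half_a, half_b)[0, 1]
    return 2 * r_half / (1 + r_half)             # Spearman-Brown step-up

# Hypothetical 6-item scale completed by 8 respondents.
X = [[3, 4, 3, 4, 3, 4],
     [1, 2, 1, 2, 2, 1],
     [5, 5, 4, 5, 5, 4],
     [2, 2, 3, 2, 2, 3],
     [4, 3, 4, 4, 3, 4],
     [1, 1, 2, 1, 1, 1],
     [5, 4, 5, 5, 4, 5],
     [3, 3, 3, 2, 3, 3]]
reliability = split_half(X)
```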

**Cronbach's (1951) Alpha to Measure Reliability**

Currently, Cronbach's (1951) alpha reliability coefficient is the most prevalent technique for assessing internal consistency. The formula computes the ratio of the variability between individual responses to the total variability in responses, with total variability being a composite of the individual variability and the measurement error.

As a ratio, the values obtained can range from 0 to 1, with 1 indicating
perfect reliability and no measurement error. The ratio thus reflects the
proportion of the total variance in the responses that is due to real
differences between subjects.
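The coefficient can be computed directly from this variance-ratio definition. The following is a minimal sketch assuming a respondents-by-items matrix (the data and function name are hypothetical):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance
    of the total score). Approaches 1 as measurement error shrinks."""
    X = np.asarray(items, float)        # rows = respondents, columns = items
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1).sum()
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 6-item scale completed by 8 respondents; items move together,
# so most of the total variance reflects real differences between subjects.
X = [[3, 4, 3, 4, 3, 4],
     [1, 2, 1, 2, 2, 1],
     [5, 5, 4, 5, 5, 4],
     [2, 2, 3, 2, 2, 3],
     [4, 3, 4, 4, 3, 4],
     [1, 1, 2, 1, 1, 1],
     [5, 4, 5, 5, 4, 5],
     [3, 3, 3, 2, 3, 3]]
alpha = cronbach_alpha(X)
```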

A general guideline for use of Cronbach's alpha to assess an instrument is that well-established instruments must demonstrate a coefficient value above .80, whereas newly developed instruments should reach values of .70 or greater.

This should not be taken to indicate that the higher the coefficient, the better the instrument. Excessively high coefficients indicate redundancy and unnecessary items. A special case of alpha is the Kuder-Richardson 20, which is essentially alpha for dichotomous data.

Cronbach's alpha is based on correlational analysis, which is highly influenced by the number of items and sample size. It is possible to increase the reliability coefficient of a scale by increasing the number of items. A small sample size can result in a reduced reliability coefficient that is a biased estimate.
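The first point, that lengthening a scale raises its reliability coefficient, is usually quantified with the Spearman-Brown prophecy formula; a minimal sketch (the values are hypothetical):

```python
def spearman_brown(r_current, length_factor):
    """Predicted reliability when a test is lengthened by length_factor
    with parallel items (length_factor = 2.0 means doubling the scale)."""
    return (length_factor * r_current) / (1 + (length_factor - 1) * r_current)

# A scale with a coefficient of .60, doubled in length with parallel items:
predicted = spearman_brown(0.60, 2.0)   # 1.2 / 1.6 = 0.75
```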

A limitation of alpha is that items are assumed to be parallel, which means they have identical true scores. When this is not the case, alpha is a lower bound to reliability, and other coefficients for internal consistency, based on models of principal components and common factor analysis (e.g., theta and omega), are more appropriate.

Obtaining an adequate alpha does not mean that examination of internal consistency is complete. Item analysis must also be carried out, focusing on the fit of individual items with the other items and with the total instrument.

Again, observational measures are a special case and require different formulas for the determination of equivalence. Interrater reliability refers to the need for ratings to be essentially equivalent across data collectors and not to differ due to individual rater variability.

The most common assessment procedure, kappa, is based on percent agreement corrected for chance.
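A minimal sketch of Cohen's kappa for two raters assigning categorical codes (the ratings and function name are hypothetical):

```python
import numpy as np

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: observed percent agreement corrected for the
    agreement expected by chance alone."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    p_o = np.mean(r1 == r2)                                       # observed agreement
    cats = np.union1d(r1, r2)
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical dichotomous codes from two data collectors on 10 observations.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
rater_b = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]
kappa = cohen_kappa(rater_a, rater_b)   # 9/10 observed agreement, .50 by chance
```

Simple percent agreement here is .90, but half of that agreement would be expected by chance with these marginal rates, so kappa is the more conservative .80.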

Any discussion of reliability as approached through classical test theory should note more recent proposals for test consistency. Of these proposals, generalizability theory (G theory) has received the most attention.

Unlike classical test theory reliability, G theory can estimate several sources
of random error in one analysis; in the process a generalizability coefficient
is computed. Proponents of G theory believe that its concentration on
dependability rather than reliability offers a more global and flexible
approach to estimating measurement error.
