Reliability and Classical Measurement Theory

Afza.Malik GDA


What is Reliability, Forms of Reliability, Stability Reliability, Stability and Its Problems, Equivalence and Its Evaluation, Equivalence and Internal Consistency, Cronbach's (1951) Alpha to Measure Reliability.

What is Reliability 

    Reliability refers to the consistency of responses on self-report, norm-referenced measures of attitudes and behavior. Reliability arises from classical measurement theory, which holds that any score obtained from an instrument is a composite of the individual's true score and error variability.

    The error is made up of random and systematic components. Maximizing the instrument's reliability helps to reduce the random error associated with the scores, while the validity of the instrument helps to minimize systematic error (see "Validity").
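The decomposition of an observed score into a true score plus random error can be illustrated with a small simulation. Everything below is hypothetical (the score means and error spread are assumptions chosen for illustration), but it shows how reliability, in this framework, is the share of observed-score variance that comes from true differences between people:

```python
import random
import statistics

random.seed(42)

# Classical test theory: observed = true + error.
# Hypothetical true scores for 500 respondents, plus zero-mean random error
# added at measurement time.
true_scores = [random.gauss(50, 10) for _ in range(500)]
errors = [random.gauss(0, 5) for _ in range(500)]
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability here is the proportion of observed-score variance
# attributable to true-score variance.
reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(round(reliability, 2))  # close to 10**2 / (10**2 + 5**2) = 0.80
```

Shrinking the error standard deviation in the sketch pushes the ratio toward 1, which is what "maximizing reliability reduces random error" means in variance terms.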

     The “true” score or variance in measurement relies on the consistency of the instrument as reflected by form and content, the stability of the responses over time, and the freedom from response bias or differences that could contribute to error. Error related to content results from the way questions are asked and the mode of instrument administration. 

    Time can contribute to error by the frequency of measurement and the time frame imposed by the questions asked. Error due to response differences results from the state or mood of the respondent, wording of questions that may lead to a response bias, and the testing or conceptual experience of the subject.

Forms of Reliability

    There are generally two forms of reliability assessment designed to deal with random error: stability and equivalence. Stability is the reproducibility of responses over time. Equivalence is the consistency of responses across a set of items so that there is evidence of a systematic pattern. Both of these forms apply to self-report as well as to observations made by a rater. 

    For self-report measures, stability is examined through test-retest procedures; equivalence is assessed through alternative-forms and internal consistency techniques. For observational measurement, intrarater and interrater techniques assess the two forms of reliability, respectively.

Stability Reliability

    Stability reliability is considered by some to be the only true way to measure the consistency of responses on an instrument. In fact, stability was the primary manner in which early instruments were examined for reliability. Stability is measured primarily through test-retest procedures in which the same instrument is given to the same subjects at two different points in time, commonly 2 weeks apart. 

    The scores are then correlated, or compared for consistency, using some form of agreement score that depends on the level of measurement. Typically, data are continuous; thus, correlation coefficients and differences between mean scores are usually assessed.

    A correlation tells the investigator whether individuals who scored high on the first administration also scored high on the second. It does not provide information on whether the scores are the same. Only a test that looks at the difference in mean scores will give that information.

Stability and Its Problems 

    The problem with stability is that it is not always reasonable to assume that the concept will remain unchanged over time. If the person's true score on a concept changes within 2 weeks, instability and high random error will be assumed when, in fact, the instrument may be consistently measuring change across time.

Reliance on a 2-week interval for measuring stability may be faulty. The time interval chosen must directly relate to the theoretical understanding of the concept being measured.

    A special case of stability occurs with instruments that are completed by raters on the basis of their observations. Intrarater reliability refers to the need for a rater's scores to remain stable across the course of data collection and not to change with increased familiarity and practice with the instrument. The same assessment procedures are used for intrarater reliability as for test-retest reliability.

Equivalence and Its Evaluation

    Equivalence is evaluated in two major ways. The first of these predated the availability of high-speed computers and easily accessed statistical packages. This set of techniques deals with the comparison of scores on alternate or parallel forms of the instrument to which the subject responds at the same point in time. 

    Parallelism means an item on one form has a comparable item on the second form, indexing the same aspect of the concept, and that the means and variances of these items are equal. These scores are compared through correlation or mean differences in a similar manner to stability. 

    Consistency is assumed if the scores are equivalent. Assessment with alternative/parallel forms is not a comparison of two different measures of the concept. It is a comparison of two essentially identical tests that were developed at the same time through the same procedures.

    Therefore, a difficulty with this approach to equivalent reliability is obtaining a true parallel or alternative form of an instrument.

Equivalence and Internal Consistency

    A more common way to look at equivalence is through internal consistency procedures. The assumption underlying internal consistency is that the response to a set of scale items should be equivalent. All internal consistency approaches are based on correlational procedures. 

    An earlier form of internal consistency is split-half reliability, in which responses to half the items on a scale are randomly selected and compared to responses on the other half.

 Cronbach's (1951) Alpha to Measure Reliability

    Currently, Cronbach's (1951) alpha reliability coefficient is the most prevalent technique for assessing internal consistency. The formula computes the ratio of variability between individual responses to the total variability in responses, with total variability being a composite of individual variability and measurement error.

    As a ratio, the values obtained can range from 0 to 1, with 1 indicating perfect reliability and no measurement error. The ratio then reflects the proportion of the total variance in the response that is due to real differences between subjects. 

    A general guideline for using Cronbach's alpha to assess an instrument is that well-established instruments should demonstrate a coefficient above .80, whereas newly developed instruments should reach .70 or greater.

    This should not be taken to indicate that the higher the coefficient, the better the instrument. Excessively high coefficients indicate redundancy and unnecessary items. A special case of alpha is the Kuder-Richardson 20, which is essentially alpha for dichotomous data.
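The computation itself is short enough to show in full. The responses below are hypothetical (six subjects on a four-item Likert scale); the function implements the usual formula, k/(k-1) times one minus the ratio of summed item variances to total-score variance, and the same computation applied to dichotomous 0/1 items yields the Kuder-Richardson 20:

```python
import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha: (k / (k-1)) * (1 - sum(item variances) / var(totals))."""
    k = len(rows[0])
    item_vars = [statistics.variance(col) for col in zip(*rows)]
    total_var = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: six subjects x four Likert items (1-5).
responses = [
    [4, 4, 3, 4],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 2],
    [1, 2, 1, 2],
    [4, 5, 4, 4],
]
print(round(cronbach_alpha(responses), 2))  # prints: 0.96
```

The items in this toy data set rise and fall together across subjects, so the summed item variances are small relative to the total-score variance and alpha comes out high, illustrating how redundancy among items inflates the coefficient.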

    Cronbach's alpha is based on correlational analysis, which is highly influenced by the number of items and the sample size. It is possible to increase the reliability coefficient of a scale simply by increasing the number of items, and a small sample can produce a reduced, biased estimate of the coefficient.

    A limitation of alpha is that items are assumed to be parallel, which means they have identical true scores. When this is not the case, alpha is a lower bound on reliability, and other coefficients for internal consistency, based on models of principal components and common factor analysis (e.g., theta and omega), are more appropriate.

     Obtaining an adequate alpha does not mean that examination of internal consistency is complete. Item analysis must be accomplished and focused on the fit of individual items with the other items and the total instrument.

    Again, observational measures are a special case and require different formulas for the determination of equivalence. Interrater reliability refers to the need for ratings to be essentially equivalent across data collectors and not to differ due to individual rater variability. 

    The most common assessment procedure, kappa, is based on percent agreement corrected for chance agreement.
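The chance correction is the essential difference between kappa and raw percent agreement, and it can be sketched briefly. The ratings below are hypothetical codes from two observers classifying ten events:

```python
def cohen_kappa(ratings_a, ratings_b):
    """Percent agreement corrected for the agreement expected by chance alone."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed
    # over the categories.
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two observers to ten events.
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
print(round(cohen_kappa(rater1, rater2), 2))  # prints: 0.58
```

Here the raters agree on 8 of 10 events (80%), but because both raters say "yes" most of the time, chance alone would produce 52% agreement, so kappa credits the raters with only the agreement beyond that baseline.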

    Any discussion of reliability as approached through classical test theory should note more recent proposals for test consistency. Of these proposals, generalizability theory (G theory) has received the most attention. 

    Unlike classical test theory reliability, G theory can estimate several sources of random error in one analysis; in the process a generalizability coefficient is computed. Proponents of G theory believe that its concentration on dependability rather than reliability offers a more global and flexible approach to estimating measurement error.
