# Data Entry In SPSS

## Data Entry

Coded data typically are transferred onto a data file through a keyboard or computer terminal. Various computer programs can be used for data entry, including spreadsheets or databases. Major software packages for statistical analysis also have data editors that make data entry fairly easy. Each variable has to be given a brief name.

The computer also has to be told whether each variable is categorical or continuous; for example, group is categorical with only two values, 1 or 2. Birth weight (bweight), however, is continuous, ranging from 76 on up. Researchers have to specify the maximum “width” of continuous variables (eg, could the variable priors, number of prior pregnancies, equal 10 or more, which would require two columns?).

It is sometimes necessary to indicate how many
places to the right of the decimal point are required (two places is the
typical default). Once this information is specified, coded data can be typed
in, one subject at a time. Each subject should be assigned a unique ID number,
and this number should be entered along with actual data. This allows
researchers to go back to the original source if there are any difficulties
with the data file. Usually, a consecutive numbering scheme is used, running
from number 1 to the number of actual cases.

## Data Verification and Cleaning

Data entry is a tedious and error-prone task and so it is necessary to verify the entries and correct any mistakes. Several methods of verification exist. The first is to compare visually the numbers printed on a printout of the data file with codes on the original source. A second possibility is to enter all the data twice and to compare the two sets of records, either visually or by computer. Finally, there are special verifying programs designed to perform comparisons during direct data entry.
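The double-entry method can be sketched in a few lines of Python. This is only an illustration, not how any particular verifying program works: the function name `compare_entries`, the dictionary layout, and the sample values are all assumptions made for the example. Each keying pass is stored as a dictionary of records indexed by subject ID, and the function lists every field on which the two passes disagree.

```python
def compare_entries(first_pass, second_pass):
    """Return (id, variable, value_1, value_2) for each disagreement between two keyings."""
    mismatches = []
    for case_id in sorted(set(first_pass) | set(second_pass)):
        rec1 = first_pass.get(case_id)
        rec2 = second_pass.get(case_id)
        if rec1 is None or rec2 is None:
            # The case was keyed in only one of the two passes.
            mismatches.append((case_id, None, rec1, rec2))
            continue
        for var in sorted(set(rec1) | set(rec2)):
            if rec1.get(var) != rec2.get(var):
                mismatches.append((case_id, var, rec1.get(var), rec2.get(var)))
    return mismatches


# Hypothetical data: the same three subjects keyed twice.
first_pass = {
    1: {"group": 1, "bweight": 107, "priors": 1},
    2: {"group": 2, "bweight": 101, "priors": 0},
    3: {"group": 1, "bweight": 119, "priors": 2},
}
second_pass = {
    1: {"group": 1, "bweight": 107, "priors": 1},
    2: {"group": 2, "bweight": 110, "priors": 0},  # digits transposed on the second pass
    3: {"group": 1, "bweight": 119, "priors": 2},
}

discrepancies = compare_entries(first_pass, second_pass)
for item in discrepancies:
    print(item)
```

Each discrepancy would then be resolved by going back to the original source for that ID number.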

Even verified data usually contain some errors. Such errors could result from data entry mistakes, coding problems, or misreporting of information. Data are not ready for analysis until they have been cleaned. Data cleaning involves two types of checks. The first is a check for outliers and wild codes. Outliers are values that lie outside the normal range of values for other cases.

Outliers can be found by inspecting frequency distributions, paying special attention to the lowest and highest values. In some cases, the outliers are true, legitimate values (eg, an annual income of $2 million in a distribution where all other incomes are below $200,000). In other cases, however, outliers indicate an error in data entry that needs to be corrected, as when the frequency distribution reveals a wild code, that is, a code that is not possible.

For example, the variable gender might have the following three defined codes: 1 = female, 2 = male, and 9 = not reported. If we discovered a code of 5 in our data file for gender, it would be clear that an error had been made. The computer could be instructed to list the ID number of the culpable record, and the error could be corrected by finding the proper code on the original source.

Another procedure is to use a program for data entry that automatically performs range checks. Editing of this type, of course, will never reveal all errors. If the gender of a male subject is entered incorrectly as a 1, the mistake may never be detected. Because errors can have a big effect on the analysis and interpretation of data, it is naturally important to perform the coding, entering, verifying, and cleaning with great care.
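A check for wild codes and out-of-range values might look like the following minimal sketch. The gender codes (1, 2, 9) come from the example above; the bounds for `priors`, the record layout, and the function name `find_wild_values` are hypothetical choices made for illustration.

```python
# Legal codes for categorical variables (gender codes are from the text).
VALID_CODES = {"gender": {1, 2, 9}}
# Plausible ranges for continuous variables (hypothetical bounds).
VALID_RANGES = {"priors": (0, 15)}


def find_wild_values(records):
    """List (id, variable, value) for every value outside its defined codes or range."""
    flagged = []
    for rec in records:
        for var, allowed in VALID_CODES.items():
            if rec[var] not in allowed:
                flagged.append((rec["id"], var, rec[var]))
        for var, (low, high) in VALID_RANGES.items():
            if not low <= rec[var] <= high:
                flagged.append((rec["id"], var, rec[var]))
    return flagged


# Hypothetical records.
records = [
    {"id": 1, "gender": 1, "priors": 0},
    {"id": 2, "gender": 2, "priors": 3},
    {"id": 3, "gender": 5, "priors": 1},   # wild code for gender
    {"id": 4, "gender": 9, "priors": 40},  # implausible number of prior pregnancies
]

wild = find_wild_values(records)
for item in wild:
    print(item)
```

As the text cautions, a check like this cannot catch a legal but wrong code, such as a male subject entered as 1.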

The second data-cleaning procedure involves consistency checks, which focus on internal data consistency. In this task, researchers check for errors by testing whether data for different variables are compatible. For example, one question in a survey might ask respondents their current marital status, and another might ask how many times they had been married.

If the data were internally consistent,
respondents who answered “Single, never married” to the first question should
have a zero (or a missing values code) for the second. As another example, if
the respondent's gender were entered with the code for male and there was an
entry of 2 for the variable “Number of pregnancies,” then one of those two
fields would contain an error. Researchers should search for opportunities to
check the consistency of entered data.
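The two consistency checks just described could be expressed as simple rules. In this sketch the gender code for male (2) follows the earlier example, while the marital status code, the missing values code, and the variable names are assumptions introduced only for illustration.

```python
MALE = 2            # gender code from the text (1 = female, 2 = male)
NEVER_MARRIED = 1   # hypothetical code for "Single, never married"
MISSING = 99        # hypothetical missing values code


def consistency_problems(rec):
    """Return a list of descriptions of internally inconsistent answers for one record."""
    problems = []
    if rec["marital"] == NEVER_MARRIED and rec["times_married"] not in (0, MISSING):
        problems.append("never married, yet times_married is nonzero")
    if rec["gender"] == MALE and rec["pregnancies"] not in (0, MISSING):
        problems.append("male, yet pregnancies is nonzero")
    return problems


# Hypothetical records.
records = [
    {"id": 1, "gender": 1, "marital": 1, "times_married": 0, "pregnancies": 2},
    {"id": 2, "gender": 2, "marital": 2, "times_married": 1, "pregnancies": 2},
    {"id": 3, "gender": 1, "marital": 1, "times_married": 1, "pregnancies": 0},
    {"id": 4, "gender": 2, "marital": 1, "times_married": 99, "pregnancies": 0},
]

flagged = {}
for rec in records:
    problems = consistency_problems(rec)
    if problems:
        flagged[rec["id"]] = problems

print(flagged)
```

A flagged record shows only that two fields disagree; deciding which one is wrong still requires the original source.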

## Creating and Documenting the Analysis Files

Once the data set has been created and cleaned, researchers proceed to develop an analysis file, using one of the many available statistical software packages. If the data set was not created within a statistical software package (that is, if it is simply a file of numeric data values), the computer must be told basic information about the data set, such as what the variable names are, where to find values for those variables in the file, and how to determine where one case ends and another one begins.

The decisions that researchers make about coding, variable naming, and so on should be documented in full. Memory should not be trusted; several weeks after coding, researchers may no longer remember if male subjects were coded 1 and female subjects 2, or vice versa. Moreover, colleagues may wish to borrow the data set to perform a secondary analysis.

Regardless of whether one anticipates a secondary analysis, documentation should be sufficiently thorough so that a person unfamiliar with the original research project could use the data. Documentation involves primarily preparing a codebook. A codebook is essentially a listing of each variable together with information about placement in the file, codes associated with the values of the variable, and other basic information. Codebooks can be generated by statistical or data entry programs.

Consider a portion of the SPSS-generated codebook for the first three variables in the data set. This codebook shows variable names in the left column, variable position in the file in the right column, and then various types of information about the variables (eg, measurement level, width of the data, coded values) in the middle.

We see that the variable
GROUP, for example, has an extended label, **"Treatment Group."** GROUP
is specified as a nominal variable occupying only one column (F1). Data for
this variable have values coded either 1 for experimental subjects or 2 for
controls. No missing data code is specified because all subjects are known to
be in either one or the other group. By contrast, the variable BWEIGHT (infant
birth weight in ounces) allows for missing data, with a missing values code of
999.
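The same information can be kept in machine-readable form. The sketch below is not SPSS output; it is a hypothetical Python representation of the two codebook entries described above. The label, measurement level, and format for GROUP and the missing values code for BWEIGHT come from the text, whereas the measurement level and format shown for BWEIGHT are assumptions.

```python
CODEBOOK = {
    "GROUP": {
        "label": "Treatment Group",
        "level": "nominal",
        "format": "F1",
        "values": {1: "Experimental", 2: "Control"},
        "missing": None,  # no missing data code is defined for GROUP
    },
    "BWEIGHT": {
        "label": "Infant birth weight in ounces",
        "level": "scale",  # assumed; not stated in the text
        "format": "F3",    # assumed width of three columns
        "values": {},
        "missing": 999,
    },
}


def decode(variable, raw_value):
    """Translate a raw code: None if it is the missing code, else its value label or itself."""
    entry = CODEBOOK[variable]
    if entry["missing"] is not None and raw_value == entry["missing"]:
        return None
    return entry["values"].get(raw_value, raw_value)


print(decode("GROUP", 2))
print(decode("BWEIGHT", 999))
print(decode("BWEIGHT", 107))
```

Keeping the codebook alongside the data in this way lets someone unfamiliar with the project interpret each code without relying on memory.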

## Preliminary Assessments And Actions

Researchers
typically undertake preliminary assessments of their data and several
pre-analytic activities before they test their hypotheses. Several preparatory
activities are discussed next.

Researchers usually find that their data set has some missing values. There are various ways of dealing with missing data. In selecting an approach, researchers should first determine the distribution and patterning of missing data.

The appropriate solution depends on such factors as the extent of missing data, the role the variable with missing data plays in the analysis (ie, whether the missing values are for dependent, independent, or descriptive variables), and the randomness of the missing data (ie, whether missing values are related in any systematic way to important variables in the study).

The magnitude of the problem differs if only 2% of the values for a relatively minor variable are missing, as opposed to 20% of the values for the main dependent variable. Also, if the missing values come disproportionately from people with certain characteristics, there is likely some bias.

The first step, then, is to determine the extent of the problem by examining frequency distributions on a variable-by-variable basis. (Most researchers routinely begin data analysis by running marginals, that is, constructing frequency distributions for all or most variables in their data set.)

Another step is to examine the cumulative extent of missing values. Statistical programs can be used to create flags to count how many variables are missing for each sample member. Once a missing values flag has been created, a frequency distribution can be computed for this new variable, which would show how many cases had no missing values, one missing value, and so on.
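A missing values flag of this kind can be sketched as follows. The example assumes that missing values are stored as `None`; the variable names, the flag name `nmiss`, and the data are hypothetical.

```python
from collections import Counter


def count_missing(rec, variables):
    """Number of the listed variables that are missing (None) for one sample member."""
    return sum(1 for var in variables if rec[var] is None)


# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 21000, "priors": 1},
    {"id": 2, "age": None, "income": 18000, "priors": 0},
    {"id": 3, "age": 29, "income": None, "priors": None},
    {"id": 4, "age": 41, "income": 30000, "priors": 2},
]
variables = ["age", "income", "priors"]

# Create the flag, then compute its frequency distribution.
for rec in records:
    rec["nmiss"] = count_missing(rec, variables)

nmiss_freq = Counter(rec["nmiss"] for rec in records)
for n_missing in sorted(nmiss_freq):
    print(n_missing, "missing:", nmiss_freq[n_missing], "case(s)")
```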

Another task is to evaluate the randomness of missing values. A simple procedure is to divide the sample into two groups: those with missing data on a specified variable and those without missing data. The two groups can then be compared in terms of other variables in the data set to determine if the two groups are comparable (eg, were men more likely than women to leave certain questions blank?).
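The comparison of the two groups might be sketched like this. The data are invented, missing values are assumed to be stored as `None`, and the helper names `split_by_missing` and `proportion` are hypothetical. The gender codes follow the earlier example (1 = female, 2 = male).

```python
def split_by_missing(records, target):
    """Divide the sample into cases with and without a missing value on `target`."""
    with_missing = [r for r in records if r[target] is None]
    without_missing = [r for r in records if r[target] is not None]
    return with_missing, without_missing


def proportion(records, var, code):
    """Share of records whose value on `var` equals `code`."""
    return sum(1 for r in records if r[var] == code) / len(records)


# Hypothetical records; income is missing for three of eight subjects.
records = [
    {"id": 1, "gender": 1, "income": 21000},
    {"id": 2, "gender": 2, "income": None},
    {"id": 3, "gender": 1, "income": 18000},
    {"id": 4, "gender": 2, "income": None},
    {"id": 5, "gender": 1, "income": None},
    {"id": 6, "gender": 2, "income": 30000},
    {"id": 7, "gender": 1, "income": 26000},
    {"id": 8, "gender": 2, "income": 24000},
]

missing_group, complete_group = split_by_missing(records, "income")
p_male_missing = proportion(missing_group, "gender", 2)
p_male_complete = proportion(complete_group, "gender", 2)
print("Proportion male, income missing:  ", round(p_male_missing, 2))
print("Proportion male, income reported: ", round(p_male_complete, 2))
```

A marked difference between the two proportions would suggest that the missing values are not random with respect to gender.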

Similarly, groups can be created based on
the missing values flag (eg, those with no missing data versus those with any
missing data) and compared on other variables. Once researchers have assessed
the extent and patterning of missing values, decisions must be made about how
to address the problem. Solutions include the following:

1. Delete missing
cases. One simple strategy is to delete a case (ie, a subject) entirely if
there is missing information. When samples are small, it is irksome to throw
away an entire case, especially if data are missing for only one or two
variables. It is, however, advisable to delete cases for subjects with
extensive missing information. This strategy is sometimes referred to as
listwise deletion of missing values.

2. Delete the variable. Another option is to throw out information for particular variables for all subjects. This option is especially suitable when a high percentage of cases have missing values on a specific variable. This may occur if, for example, a question was objectionable and was left blank, or if many respondents did not understand directions and inadvertently skipped it.

When
missing data on a variable are extensive, there may be systematic biases with
regard to those subjects for whom data are available. This approach is clearly
not attractive if the variable is a critical independent or dependent variable.

3. Substitute the
mean or median value. When missing values are reasonably random and when the
problem is not extensive, it may be useful to substitute real data values for
missing value codes. Such a substitution represents a **“best guess”** about what
the value would have been, had data actually been collected. Because of this
fact, researchers usually substitute a value that is typical for the sample,
and typical values usually come from the center of the distribution.

For example, if data for a subject's age were missing and if the average age of subjects in the sample were 45.2 years, we might substitute the value 45 in place of the missing values code for subjects whose age is unknown. This approach is especially useful when there are missing values for variables that comprise a multiple-item scale, such as a Likert scale. Suppose, for example, we had a 20-item scale that measures anxiety, and that one person answered only 18 of the 20 items.

It would not be appropriate to score the scale by adding together responses on the 18 items only, and it would be a waste of information on the 18 answered items to code the entire scale as missing. Thus, for the two missing items, the researcher could substitute the most typical responses (based on either the mean, median, or mode, depending on the distribution of scores), so that a scale score on the full 20 items could be computed.

Needless
to say, this approach makes sense only when a small proportion of scale items
is missing. And, if the scale is a test of knowledge rather than a measure of a
psychosocial characteristic, it usually is more appropriate to consider a
missing value as a **"don't know"** and to mark the item as incorrect.

4. Estimate the missing value. When researchers substitute a mean value for a missing value, there is a risk that the substitution is erroneous, perhaps dramatically so. For example, if the mean value of 45 is substituted for a missing value on a subject's age, but the range of ages in the sample is 25 to 70 years, then the substitute value could be wrong by 20 years or more.

When only a couple of cases have missing data for the variable age, an error of even this magnitude is unlikely to alter the results. However, if missing values are more widespread, it is sometimes worthwhile to substitute values that have a greater likelihood of being accurate. Various procedures can be used to derive estimates of what the missing value should be. A simple method is to use the mean for a subgroup that is most like the case with the missing value.

For example, suppose the mean age for women in the sample were 42.8 years and the mean age for men were 48.9 years. The researcher could then substitute 43 for women with missing age data and 49 for men with missing age data. This likely would result in improved accuracy, compared with using the overall mean of 45 for all subjects with missing age data. Another method is to use multiple regression to “predict” the correct value of missing data.

To continue with the example used previously, suppose that subjects' age was correlated with gender, education, and marital status in this research sample. Based on data from subjects without missing values, these three variables could be used to develop a regression equation that could predict age for subjects whose age information is missing but whose values for the three other variables are not missing.

Some
statistical software packages (such as SPSS) have a special procedure for
estimating the value of missing data. These procedures can also be used to
examine the relationship between the missing values and other variables in the
data set.

5. Delete cases pairwise. Perhaps the most widely used (but not necessarily the best) approach is to delete cases selectively, on a variable-by-variable basis. For example, in describing the characteristics of the sample, researchers might use whatever information is available for each characteristic, resulting in a fluctuating number of cases across descriptors.

Thus, the mean age might be based on 95
cases, the gender distribution might be based on 102 cases, and so on. If the
number of cases fluctuates widely across characteristics, the information is
difficult to interpret because the sample of subjects is essentially a **“moving
target.”** The same strategy is sometimes used to handle missing information for
dependent variables.

For example, in an evaluation of an intervention to reduce patient anxiety, we might have blood pressure, self-reported anxiety, and an observational measure of stress-related behavior as the dependent variables. Suppose that 10 people out of a sample of 100 failed to complete the anxiety scale. We might base the analyses of the anxiety data on the 90 subjects who completed the scale, but use the full sample of 100 subjects in the analyses of the other dependent variables.

The problem with this approach is that it can cause interpretive problems, especially if there are inconsistent findings across outcomes. For example, if there were experimental versus control group differences on all outcomes except the anxiety scale, one possible explanation is that the anxiety scale sample is not representative of the full sample. Researchers usually strive for a rectangular matrix of data: data values for all subjects on all important variables.

In the example just described, we should, at a minimum, perform supplementary analyses using the rectangular matrix, that is, rerun all analyses using the 90 subjects for whom complete data are available. If the results are different from when the full sample was used, we would at least be in a better position to interpret results. In analyses involving a correlation matrix, researchers sometimes use pairwise deletion of cases with missing values.

As this figure shows, correlations between pairs of variables are based on varying numbers of cases (shown in the row labeled N), ranging from 425 subjects for the correlation between Center for Epidemiological Studies Depression scale (CES-D) scores and total income, to 483 subjects for the correlation between highest grade and total number of people in the household.

Pairwise deletion is acceptable for descriptive purposes if data are missing at random and differences in Ns are small. It is especially imprudent to use this procedure for multivariate analyses such as multiple regression because the criterion correlations (ie, the correlations between the dependent variable and the various predictors) are based on nonidentical subsets of subjects.
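The difference between listwise and pairwise deletion can be made concrete with a short sketch. The six records, the variable names, and the helper functions are hypothetical, and missing values are assumed to be stored as `None`. Listwise deletion keeps only the rectangular matrix of complete cases, whereas pairwise deletion lets the number of cases vary from one pair of variables to the next.

```python
from itertools import combinations


def listwise(records, variables):
    """Keep only cases with no missing value on any listed variable."""
    return [r for r in records if all(r[v] is not None for v in variables)]


def pairwise_n(records, variables):
    """Number of cases available for each pair of variables under pairwise deletion."""
    return {
        (a, b): sum(1 for r in records if r[a] is not None and r[b] is not None)
        for a, b in combinations(variables, 2)
    }


# Hypothetical records for three dependent variables.
records = [
    {"id": 1, "bp": 120, "anxiety": 30, "stress": 4},
    {"id": 2, "bp": 135, "anxiety": None, "stress": 6},
    {"id": 3, "bp": 128, "anxiety": 42, "stress": 2},
    {"id": 4, "bp": 118, "anxiety": 25, "stress": 3},
    {"id": 5, "bp": None, "anxiety": 38, "stress": 5},
    {"id": 6, "bp": 140, "anxiety": 45, "stress": 7},
]
variables = ["bp", "anxiety", "stress"]

complete = listwise(records, variables)
pair_ns = pairwise_n(records, variables)
print("Listwise N:", len(complete))
for pair, n in pair_ns.items():
    print(pair, "N =", n)
```

In this invented example the pairwise Ns differ from one another and from the listwise N, which is the “moving target” problem on a small scale.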

Each of these solutions has accompanying
problems, so care should be taken in deciding how missing data are to be
handled. Procedures for dealing with missing data are discussed at greater
length in Allison (2000), Little and Rubin (1990), and Kneipp and McIntosh
(2001).
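Two of the substitution strategies described above (the overall mean and the mean for a similar subgroup) can be illustrated with a small sketch. The ages here are invented and do not reproduce the figures in the text; the function names are hypothetical, missing values are assumed to be stored as `None`, and substituted means are rounded to whole years as in the text's example.

```python
from statistics import mean


def impute_overall_mean(records, var):
    """Replace missing values of `var` with the rounded mean of all observed values."""
    fill = round(mean(r[var] for r in records if r[var] is not None))
    return [dict(r, **{var: fill}) if r[var] is None else dict(r) for r in records]


def impute_subgroup_mean(records, var, group_var):
    """Replace missing values of `var` with the rounded mean of the case's own subgroup.

    Assumes every subgroup has at least one observed value; otherwise mean() raises.
    """
    fills = {}
    for group in {r[group_var] for r in records}:
        observed = [r[var] for r in records if r[group_var] == group and r[var] is not None]
        fills[group] = round(mean(observed))
    return [
        dict(r, **{var: fills[r[group_var]]}) if r[var] is None else dict(r)
        for r in records
    ]


# Hypothetical sample: gender 1 = female, 2 = male; one missing age in each group.
records = [
    {"id": 1, "gender": 1, "age": 40},
    {"id": 2, "gender": 1, "age": 44},
    {"id": 3, "gender": 1, "age": 45},
    {"id": 4, "gender": 1, "age": None},
    {"id": 5, "gender": 2, "age": 47},
    {"id": 6, "gender": 2, "age": 49},
    {"id": 7, "gender": 2, "age": 51},
    {"id": 8, "gender": 2, "age": None},
]

overall = impute_overall_mean(records, "age")
by_gender = impute_subgroup_mean(records, "age", "gender")
print("Overall mean:   ", overall[3]["age"], overall[7]["age"])
print("Subgroup means: ", by_gender[3]["age"], by_gender[7]["age"])
```

Both functions return new lists and leave the original records untouched, so the missing values codes remain available for the supplementary analyses discussed earlier.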

## Assessing Quantitative Data Quality

Steps are often undertaken to assess data quality in the early stage of analysis. For example, when psychosocial scales or composite indexes are used, researchers should usually assess their internal consistency reliability. The distribution of data values for key variables should also be examined to determine any anomalies, such as limited variability, extreme skewness, or the presence of ceiling or floor effects.

For example, a vocabulary test for 10-year-olds likely would yield a clustering of high scores in a sample of 11-year-olds, creating a ceiling effect that would reduce correlations between test scores and other characteristics of the children.

Conversely, there likely would be a clustering of low scores on the test with a sample of 9-year-olds, resulting in a floor effect with similar consequences. In such situations, data may need to be transformed to meet the requirements for certain statistical tests.

## Assessing Bias

Researchers often undertake preliminary analyses to assess the direction and extent of any biases, including the following:

• Non-response bias. When possible, researchers should determine whether a biased subset of people participated in a study. If there is information about the characteristics of all people who were asked to participate in a study (eg, demographic information from hospital records), researchers should compare the characteristics of those who did and did not participate to determine the nature and direction of any biases.

This means that the data file would have
to include both respondents and non-respondents, and a variable indicating their
response status (eg, a variable could be coded 1 for participants and 2 for
those who declined to participate).

• Selection bias. When non-equivalent comparison groups are used (in quasi-experimental or non-experimental studies), researchers should check for selection biases by comparing the groups' background characteristics. It is particularly important to examine possible group differences on extraneous variables that are strongly related to the dependent variable.

These variables can (and should, if
possible) then be controlled, for example, through analysis of covariance
(ANCOVA) or multiple regression. Even when an experimental design has been used,
researchers should check the success of randomization. Random assignment does
not guarantee equivalent groups, so researchers should be aware of
characteristics for which the groups are not, in fact, comparable.

• Attrition bias. In longitudinal studies, it is always important to check for attrition biases, which involves comparing people who did and did not continue to participate in the study in later waves of data collection, based on characteristics of these groups at the initial wave.

In performing any of these analyses, significant
group differences are an indication of bias, and such bias must be taken into consideration
in interpreting and discussing the results. Whenever possible, the biases
should be controlled in testing the main hypotheses.
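Each of these bias checks comes down to comparing two groups on a background characteristic. As a rough sketch, the example below compares the baseline ages of subjects who stayed in a longitudinal study with those who dropped out. It uses Welch's t statistic, which is one common choice and is not named in the text; the data and the function name `welch_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two independent group means."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical ages at the initial wave of data collection.
stayed = [30, 34, 38, 42, 46]
dropped_out = [22, 24, 26, 28, 30]

t_stat = welch_t(stayed, dropped_out)
print("Mean age, stayed:     ", mean(stayed))
print("Mean age, dropped out:", mean(dropped_out))
print("Welch t statistic:    ", round(t_stat, 2))
```

The statistic alone does not establish significance; it would still have to be referred to the t distribution, or the comparison run in a statistical package, before concluding that attrition was biased with respect to age.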
