Probability Calculation and Research Design
A probability
mass function (pmf) is just a full description of the possible outcomes
and their probabilities for some discrete random variable. In some situations
it is written in simple list form, e.g.,
where f(x) is the probability that random
variable X takes on value x, with f(x)=0 implied for all other x values. We can
see that this is a valid probability distribution because each probability is
between 0 and 1 and the sum of all of the probabilities is 1.00. In other cases
we can use a formula for f(x), e.g.,
$$f(x) = \frac{4!}{(4-x)!\,x!}\, p^x (1-p)^{4-x} \quad \text{for } x = 0, 1, 2, 3, 4,$$
which is the so-called binomial distribution
with parameters 4 and p. It is not necessary to understand the mathematics of
this formula for this course, but if you want to try you will need to know that
the exclamation mark symbol is pronounced “factorial” and r! represents the
product of all the integers from 1 to r. As an exception, 0! = 1.
This particular pmf represents the probability distribution for getting x “successes” out of 4 “trials” when each trial has a success probability of p independently. This formula is a shortcut for the five different possible outcome values.
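As a minimal sketch, the five probabilities of the binomial distribution with parameters 4 and p can be tabulated in Python to turn the formula into the list form. The function name `binomial_pmf` and the value p = 0.3 are assumptions made only for illustration; `math.comb(n, x)` computes n!/((n-x)! x!).

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    # comb(n, x) equals n! / ((n - x)! x!)
    return comb(n, x) * p**x * (1 - p) ** (n - x)

p = 0.3  # hypothetical success probability, chosen only for illustration
pmf_list = {x: binomial_pmf(x, 4, p) for x in range(5)}

for x, prob in pmf_list.items():
    print(f"f({x}) = {prob:.4f}")

# A valid pmf must have probabilities that sum to 1.
print("sum =", round(sum(pmf_list.values()), 10))
```

Changing `p` changes the five listed probabilities, but their sum stays at 1.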
If you prefer you can calculate out the five
different probabilities and use the first form for the pmf. Another example is
the so-called geometric distribution, which represents the outcome for an
experiment in which we count the number of independent trials until the first
success is seen. The pmf is:
$$f(x) = p(1-p)^{x-1} \quad \text{for } x = 1, 2, 3, \ldots$$
and it can be shown
that this is a valid distribution with the sum of this infinitely long series
equal to 1.00 for any value of p between 0 and 1. This pmf cannot be written in
the list form. (Again the mathematical details are optional.)
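The claim that the infinitely long series sums to 1 can be explored numerically. The sketch below assumes the geometric pmf has the standard form f(x) = p(1-p)^(x-1) for x = 1, 2, 3, ... (counting the trial on which the first success occurs); the function name and the value p = 0.2 are illustrative choices, not part of the text.

```python
def geometric_pmf(x, p):
    """Probability that the first success occurs on trial x (x = 1, 2, ...)."""
    return p * (1 - p) ** (x - 1)

p = 0.2  # hypothetical success probability
for n_terms in (5, 20, 100):
    # Add up the first n_terms probabilities of the infinite list.
    partial = sum(geometric_pmf(x, p) for x in range(1, n_terms + 1))
    print(f"first {n_terms:3d} terms sum to {partial:.6f}")
```

The partial sums creep toward 1 as more terms are included, which is why no finite list can capture this pmf exactly.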
By definition a random variable takes on numeric values (i.e., it maps real experimental outcomes to numbers). Therefore it is easy and natural to think about the pmf of any discrete quantitative experimental variable, whether it is explanatory or outcome.
For categorical experimental variables, we do not need to assign
numbers to the categories, but we always can do that, and then it is easy to
consider that variable as a random variable with a finite pmf. Of course, for
nominal categorical variables the order of the assigned numbers is meaningless,
and for ordinal categorical variables it is most convenient to use consecutive
integers for the assigned numeric values.
“Probability
mass functions apply to discrete outcomes. A pmf is just a list of all possible
outcomes for a given experiment and the probabilities for each outcome.”
For continuous random variables, we use a somewhat different method for summarizing all of the information in a probability distribution.
This is the probability density
function (pdf), usually represented as “f(x)”, which does not represent
probabilities directly but from which the probability that the outcome falls in
a certain range can be calculated using integration from calculus. (If you
don't remember integration from calculus, don't worry, it is OK to skip over
the details.)
One of the simplest
pdf's is that of the uniform distribution, where all real numbers between a and
b are equally likely and numbers less than a or greater than b are impossible.
The pdf is:
$$f(x) = \frac{1}{b-a} \quad \text{for } a \le x \le b.$$
The general probability formula for any continuous random variable is
$$\Pr(t \le X \le u) = \int_t^u f(x)\,dx.$$
In this formula $\int \cdot \, dx$ means that we must use calculus to carry out integration. Note that we use capital X for the random variable in the probability statement because this refers to the potential outcome of an experiment that has not yet been conducted, while the formulas for pdf and pmf use lower case x because they represent calculations done for each of several possible outcomes of the experiment.
Also note that, in the pdf
but not the pmf, we could replace either or both ≤ signs with < signs
because the probability that the outcome is exactly equal to t or u (to an
infinite number of decimal places) is zero.
So for the continuous uniform distribution,
for any a ≤ t ≤ u ≤ b,
$$\Pr(t \le X \le u) = \int_t^u \frac{1}{b-a}\,dx = \frac{u-t}{b-a}.$$
You can check that this always gives a number between 0 and 1, and the probability of any individual outcome (where u=t) is zero, while the probability that the outcome is some number between a and b is 1 (t=a, u=b). You can also see that, e.g., the probability that X is in the middle third of the interval from a to b is 1/3, etc.
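These checks can be carried out with a few lines of Python. The sketch assumes the interval probability for the continuous uniform distribution is (u - t)/(b - a), which is what integrating the constant density 1/(b - a) from t to u gives; the function name and the endpoints a = 2, b = 8 are hypothetical.

```python
def uniform_interval_prob(t, u, a, b):
    """Pr(t <= X <= u) for X uniform on [a, b], assuming a <= t <= u <= b."""
    if not (a <= t <= u <= b):
        raise ValueError("requires a <= t <= u <= b")
    return (u - t) / (b - a)

a, b = 2.0, 8.0  # hypothetical endpoints
third = (b - a) / 3

print(uniform_interval_prob(a + third, a + 2 * third, a, b))  # middle third
print(uniform_interval_prob(a, b, a, b))                      # whole interval
print(uniform_interval_prob(5.0, 5.0, a, b))                  # single point
```

The single-point case returns zero because the interval has no width, which matches the remark that ≤ and < are interchangeable for a pdf.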
Of course, there are
many interesting and useful continuous distributions other than the continuous
uniform distribution. Some other examples are given below. Each is fully
characterized by its probability density function.
Reading a pdf
In general, we often look at a plot of the
probability density function, f(x), vs. the possible outcome values, x. This
plot is high in the regions of likely outcomes and low in less likely regions.
The well-known standard Gaussian distribution (see 3.2) has a bell-shaped graph
centered at zero with about two thirds of its area between x = -1 and x = +1
and about 95% between x = -2 and x = + 2. But a pdf can have many different
shapes.
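The areas quoted for the standard Gaussian curve can be checked numerically. The following is a minimal sketch that approximates the area under the curve with many thin rectangles (the midpoint rule). The function names are illustrative, and the formula used for the standard Gaussian pdf, exp(-x²/2)/√(2π), is the standard one rather than something stated in this text.

```python
from math import exp, pi, sqrt

def std_gaussian_pdf(x):
    """Height of the standard Gaussian curve at x."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def area_under(pdf, lo, hi, steps=10_000):
    """Approximate the area under pdf between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(pdf(lo + (i + 0.5) * width) for i in range(steps)) * width

print(f"area between -1 and +1: {area_under(std_gaussian_pdf, -1, 1):.4f}")
print(f"area between -2 and +2: {area_under(std_gaussian_pdf, -2, 2):.4f}")
print(f"area between -8 and +8: {area_under(std_gaussian_pdf, -8, 8):.4f}")
```

The first two areas come out close to two thirds and 95%, and the very wide interval captures essentially all of the area, which is 1.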
It is worth understanding that many pdf's come in “families” of similarly shaped curves. These various curves are named or “indexed” by one or more numbers called parameters.
For example, the family of Gaussian (also
called Normal) distributions is indexed by the mean and variance (or standard
deviation) of the distribution. The t-distributions, which are all centered at
0, are indexed by a single parameter called the degrees of freedom. The
chi-square family of distributions is also indexed by a single degrees of
freedom value. The F distributions are indexed by two degrees of freedom
numbers designated numerator and denominator degrees of freedom.
In this course we will not do any integration. We will use tables or a computer program to calculate probabilities for continuous random variables. We don't even need to know the formula of the pdf because the most commonly used formulas are known to the computer by name. Sometimes we will need to specify degrees of freedom or other parameters so that the computer will know which pdf of a family of pdf's to use.
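As one illustration of asking the computer for a distribution by name, Python's standard library provides `statistics.NormalDist`, which takes the mean and standard deviation as its parameters. This is only a sketch: the parameter values below are hypothetical, and the t, chi-square, and F families are not in the standard library (third-party packages such as SciPy provide them).

```python
from statistics import NormalDist

# Hypothetical parameter values: mean 100, standard deviation 15.
scores = NormalDist(mu=100, sigma=15)

# Pr(85 <= X <= 115): the area under this pdf between 85 and 115.
prob = scores.cdf(115) - scores.cdf(85)
print(f"{prob:.4f}")

# A different member of the same family gives a different answer
# for the same interval.
narrow = NormalDist(mu=100, sigma=5)
print(f"{narrow.cdf(115) - narrow.cdf(85):.4f}")
```

No pdf formula appears anywhere in the code; naming the family and supplying its parameters is enough.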
Despite our heavy reliance on the computer, getting a feel for the idea of a probability density function is critical to the level of understanding of data analysis and interpretation required in this course.
At a minimum you should realize that a pdf is a curve with outcome
values on the horizontal axis and the vertical height of the curve tells which
values are likely and which are not. The total area under the curve is 1.0, and
the area under the curve between any two “x” values is the probability that the
outcome will fall between those values.
“For continuous random variables, we calculate
the probability that the outcome falls in some interval, not that the outcome
exactly equals some value. This calculation is normally done by a computer
program which uses integral calculus on a ‘probability density function.’”
Probability calculations
This section reviews the most basic
probability calculations. It is worthwhile, but not essential, to become
familiar with these calculations. For many readers, the boxed material may be
sufficient. You won't need to memorize any of these formulas for this course.
Remember that in probability theory we don't
worry about where probability assignments (a pmf or pdf) come from. Instead we
are concerned with how to calculate other probabilities given the assigned
probabilities. Let's start with calculation of the probability of a
"complex" or "compound" event that is constructed from the
simple events of a discrete random variable.
For example, if we have a discrete random variable that is the number of correct answers that a student gets on a test of 5 questions, i.e., integers in the set {0, 1, 2, 3, 4, 5}, then we could be interested in the probability that the student gets an even number of questions correct, or fewer than 2, or more than 3, or between 3 and 4, etc.
All of these probabilities are for outcomes that are subsets of the sample space of all 6 possible “elementary” outcomes, and all of these are the union (joining together) of some of the 6 possible “elementary” outcomes. In the case of any complex outcome that can be written as the union of some other disjoint (non-overlapping) outcomes, the probability of the complex outcome is the sum of the probabilities of the disjoint outcomes. To complete this example look at Table 3.1 which shows assigned probabilities for the elementary outcomes of the random variable we will call T (the test outcome) and for several complex events.
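The addition rule for disjoint outcomes can be sketched in a few lines of Python. The probabilities below are hypothetical values invented for this illustration (they are not the values from Table 3.1), and the names `pmf_T` and `prob` are likewise illustrative. Exact fractions are used so that the sums are not affected by rounding.

```python
from fractions import Fraction

# Hypothetical probabilities for T = 0..5 (NOT the values from Table 3.1).
pmf_T = {t: Fraction(n, 20) for t, n in zip(range(6), (1, 3, 5, 6, 4, 1))}
assert sum(pmf_T.values()) == 1  # a valid pmf must sum to 1

def prob(event):
    """Sum the probabilities of the disjoint elementary outcomes in the event."""
    return sum(p for t, p in pmf_T.items() if event(t))

print("Pr(T even)      =", prob(lambda t: t % 2 == 0))
print("Pr(T < 2)       =", prob(lambda t: t < 2))
print("Pr(T > 3)       =", prob(lambda t: t > 3))
print("Pr(3 <= T <= 4) =", prob(lambda t: 3 <= t <= 4))
```

Each compound event is just a subset of {0, 1, 2, 3, 4, 5}, so its probability is the sum of the elementary probabilities it contains.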
Disjoint addition rule
You should think of the probability of a
complex event such as T.