In particular, the upper-tail area to the right of 1.960 is 0.025; this is equivalent to saying that the area between 0 and 1.960 is 0.475 (since the entire area under the curve is 1 and the area to the right of 0 is 0.5). Similarly, the two-tail area, which is the sum of the areas to the right of 1.960 and to the left of -1.960, is two times 0.025, or 0.05.
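The tail areas quoted above can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative choices and not part of the book's software files.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

z = 1.960
# Upper-tail area: probability to the right of z.
upper_tail = 1.0 - std_normal.cdf(z)
# Area between 0 and z: cumulative area up to z minus the half to the left of 0.
between_0_and_z = std_normal.cdf(z) - 0.5
# Two-tail area: right of z plus left of -z, equal by symmetry.
two_tail = upper_tail + std_normal.cdf(-z)

print(round(upper_tail, 3), round(between_0_and_z, 3), round(two_tail, 3))
```

Rounded to three decimal places, the three areas should agree with the values 0.025, 0.475, and 0.05 given in the text.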
How does all this help us to make statistical inferences about populations such as that in our home prices example? The essential idea is that we fit a normal distribution model to our sample data and then use this model to make inferences about the corresponding population. For example, we can use probability calculations for a normal distribution (as shown in Figure 1.3) to make probability statements about a population modeled using that normal distribution; we will show exactly how to do this in Section 1.3. Before we do that, however, we pause to consider an aspect of this inferential sequence that can make or break the process. Does the model provide a close enough approximation to the pattern of sample values that we can be confident the model adequately represents the population values? The better the approximation, the more reliable our inferential statements will be.
We saw in Figure 1.2 how a density curve can be thought of as a histogram with a very large sample size. So one way to assess whether our population follows a normal distribution model is to construct a histogram from our sample data and visually determine whether it looks normal, that is, approximately symmetric and bell-shaped. This is a somewhat subjective decision, but with experience you should find that it becomes easier to discern clearly nonnormal histograms from those that are reasonably normal. For example, while the histogram in Figure 1.2 clearly looks like a normal density curve, the normality of the histogram of 30 sample sale prices in Figure 1.1 is less certain. A reasonable conclusion in this case would be that while this sample histogram is not perfectly symmetric and bell-shaped, it is close enough that the corresponding (hypothetical) population histogram could well be normal.
An alternative way to assess normality is to construct a QQ-plot (quantile-quantile plot), also known as a normal probability plot, as shown in Figure 1.4 (see computer help #22 in the software information files available from the book website). If the points in the QQ-plot lie close to the diagonal line, then the corresponding population values could well be normal. If the points generally lie far from the line, then normality is in question. Again, this is a somewhat subjective decision that becomes easier to make with experience. In this case, given the fairly small sample size, the points are probably close enough to the line that it is reasonable to conclude that the population values could be normal.
Figure 1.4 QQ-plot for the home prices example.
There are also a variety of quantitative methods for assessing normality; brief details and references are provided in Section 3.4.2.
Optional: technical details of QQ-plots
For the purposes of this book, the technical details of QQ-plots are not too important. For those who are curious, however, a brief description follows. First, calculate a set of equally spaced percentiles (quantiles) from a standard normal distribution. For example, if the sample size, \(n\), is 9, then the calculated percentiles would be the 10th, 20th, ..., 90th. Then construct a scatterplot with the observed data values ordered from low to high on the vertical axis and the calculated percentiles on the horizontal axis. If the two sets of values are similar (i.e., if the sample values closely follow a normal distribution), then the points will lie roughly along a straight line. To facilitate this assessment, a diagonal line that passes through the first and third quartiles is often added to the plot. The exact details of how a QQ-plot is drawn can differ depending on the statistical software used (e.g., sometimes the axes are switched or the diagonal line is constructed differently).
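The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `qq_points` and the nine sale prices are hypothetical (they are not the book's data), and the percentiles are taken as i/(n + 1), which reproduces the 10th, 20th, ..., 90th percentiles for a sample of size 9. Statistical packages may use slightly different plotting positions.

```python
from statistics import NormalDist

def qq_points(sample):
    """Return (theoretical quantile, ordered observation) pairs for a normal QQ-plot."""
    n = len(sample)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Equally spaced probabilities i/(n+1); for n = 9 these are 0.1, 0.2, ..., 0.9.
    probs = [i / (n + 1) for i in range(1, n + 1)]
    # Standard normal quantiles for the horizontal axis.
    theoretical = [std_normal.inv_cdf(p) for p in probs]
    # Observed values ordered from low to high for the vertical axis.
    observed = sorted(sample)
    return list(zip(theoretical, observed))

# Hypothetical sale prices (in thousands of dollars), for illustration only.
prices = [310.0, 255.0, 199.9, 285.0, 342.5, 270.0, 229.0, 299.0, 265.0]
points = qq_points(prices)
for quantile, price in points:
    print(f"{quantile:6.3f}  {price:6.1f}")
```

Plotting these pairs as a scatterplot gives the QQ-plot; points that fall roughly along a straight line are consistent with a normal distribution.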
1.3 Selecting Individuals at Random: Probability
Having assessed the normality of our population of sale prices by looking at the histogram and QQ-plot of sample sale prices, we now return to the task of making probability statements about the population. The crucial question at this point is whether the sample data are representative of the population for which we wish to make statistical inferences. One way to increase the chance of this being true is to select the sample values from the population at random; we discussed this in the context of our home prices example in Section 1.1. We can then make reliable statistical inferences about the population by considering properties of a model fit to the sample data, provided the model fits reasonably well.
We saw in Section 1.2 that a normal distribution model fits the home prices example reasonably well. However, we can see from Figure 1.1 that a standard normal distribution is inappropriate here, because a standard normal distribution has a mean of 0 and a standard deviation of 1, whereas our sample data have a mean of 278.6033 and a standard deviation of 53.8656. We therefore need to consider more general normal distributions with a mean that can take any value and a standard deviation that can take any positive value (standard deviations cannot be negative).
Let \(Y\) represent the population values (sale prices in our example) and suppose that \(Y\) is normally distributed with mean (or expected value), \(\mathrm{E}(Y)\), and standard deviation, \(\mathrm{SD}(Y)\). This textbook uses this notation with familiar Roman letters in place of the traditional Greek letters, \(\mu\) (mu) and \(\sigma\) (sigma), which, in the author's experience, are unfamiliar and awkward for many students. We can abbreviate this normal distribution as \(\mathrm{Normal}(\mathrm{E}(Y), \mathrm{SD}(Y)^2)\), where the first number is the mean and the second number is the square of the standard deviation (also known as the variance). Then the population standardized \(Z\)-value,
\[ Z = \frac{Y - \mathrm{E}(Y)}{\mathrm{SD}(Y)}, \]
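has a standard normal distribution with mean 0 and standard deviation 1. In symbols,
\[ Y \sim \mathrm{Normal}(\mathrm{E}(Y), \mathrm{SD}(Y)^2) \iff Z = \frac{Y - \mathrm{E}(Y)}{\mathrm{SD}(Y)} \sim \mathrm{Normal}(0, 1^2). \]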
We are now ready to make a probability statement for the home prices example. Suppose that we would consider a home as being too expensive to buy if its sale price is higher than $380,000. What is the probability of finding such an expensive home in our housing market? In other words, if we were to randomly select one home from the population of all homes, what is the probability that it has a sale price higher than $380,000? To answer this question, we need to make a number of assumptions. We have already decided that it is probably safe to assume that the population of sale prices (\(Y\)) could be normal, but we do not know the mean, \(\mathrm{E}(Y)\), or the standard deviation, \(\mathrm{SD}(Y)\), of the population of home prices. For now, let us assume that \(\mathrm{E}(Y) = 280\) and \(\mathrm{SD}(Y) = 50\) (fairly close to the sample mean of 278.6033 and sample standard deviation of 53.8656). (We will be able to relax these assumptions later in this chapter.) From the theoretical result above, \(Z = (Y - 280)/50\) has a standard normal distribution with mean 0 and standard deviation 1.
Next, to find the probability that a randomly selected \(Y\) is greater than 380, we perform some standard algebra on probability statements. In particular, if we write the probability that \(Y\) is bigger than 380 as \(\Pr(Y > 380)\), then we can make changes to \(Y\) (such as adding, subtracting, multiplying, and dividing other quantities) as long as we do the same thing to 380. It is perhaps easier to see how this works by example:
\[ \Pr(Y > 380) = \Pr\left(\frac{Y - 280}{50} > \frac{380 - 280}{50}\right) = \Pr(Z > 2.00). \]
The second equality follows since \(Z\) is defined to be \((Y - 280)/50\), which is a standard normal random variable with mean 0 and standard deviation 1. From the normal table in Section 1.2, the probability that a standard normal random variable is greater than 1.96 is 0.025. Thus, \(\Pr(Z > 2.00)\) is slightly less than 0.025 (draw a picture of a normal density curve with 1.96 and 2.00 marked on the horizontal axis to convince yourself of this fact). In other words, there is slightly less than a 2.5% chance of finding an expensive home (sale price above $380,000) in our housing market, under the assumption that \(Y \sim \mathrm{Normal}(280, 50^2)\).
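The standardization argument can be checked numerically. The sketch below assumes, as in the text, a population mean of 280 and standard deviation of 50 (in thousands of dollars); it uses Python's standard-library `statistics.NormalDist`, and the variable names are illustrative.

```python
from statistics import NormalDist

mean_y = 280.0     # assumed population mean, E(Y)
sd_y = 50.0        # assumed population standard deviation, SD(Y)
threshold = 380.0  # sale price regarded as too expensive

# Standardize the threshold: subtract the mean, then divide by the standard deviation.
z = (threshold - mean_y) / sd_y

# Upper-tail probability computed from the standard normal distribution.
p_via_z = 1.0 - NormalDist().cdf(z)

# The same probability computed directly from Normal(280, 50^2).
p_direct = 1.0 - NormalDist(mean_y, sd_y).cdf(threshold)

print(f"z = {z:.2f}")
print(f"Pr(Z > z) = {p_via_z:.4f}, Pr(Y > 380) = {p_direct:.4f}")
```

Both routes give the same probability, a little under 0.025, which matches the conclusion reached from the normal table.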
For further practice of this kind of calculation, suppose that we have a budget of . What is the probability of finding such an affordable home in our housing market? (You should find it is slightly less than a 10% chance; see Problem 1.10.)
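We can also turn these calculations around. For example, which value of \(Y\) has a probability of 0.025 to the right of it? To answer this, consider the following calculation:
\[ \Pr(Z > 1.96) = 0.025 \]
\[ \Pr\left(\frac{Y - 280}{50} > 1.96\right) = 0.025 \]
\[ \Pr(Y > 1.96(50) + 280) = 0.025 \]
\[ \Pr(Y > 378) = 0.025. \]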
So, the value 378 has a probability of 0.025 to the right of it. Another way of expressing this is that the 97.5th percentile of the variable \(Y\) is $378,000.
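The reverse calculation corresponds to evaluating an inverse cumulative distribution function. A minimal sketch, again assuming \(Y \sim \mathrm{Normal}(280, 50^2)\) and using the standard-library `statistics.NormalDist` with illustrative variable names:

```python
from statistics import NormalDist

mean_y = 280.0  # assumed population mean, E(Y)
sd_y = 50.0     # assumed population standard deviation, SD(Y)

# Hand calculation from the text: un-standardize the table value 1.96.
y_by_hand = 1.96 * sd_y + mean_y

# Software calculation: the value with cumulative probability 0.975 to its left,
# which leaves probability 0.025 to its right.
y_975 = NormalDist(mean_y, sd_y).inv_cdf(0.975)

print(f"by hand: {y_by_hand:.1f}, inverse CDF: {y_975:.3f}")
```

The two results differ only in the third decimal place, because the table value 1.96 is itself a rounded version of the exact standard normal percentile.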