We can tackle prediction problems with a process similar to that of using a confidence interval to estimate a population mean. In particular, we can calculate a prediction interval of the form point estimate ± uncertainty or (point estimate − uncertainty, point estimate + uncertainty). The point estimate is the same one that we used for estimating the population mean, that is, the observed sample mean, $m_Y$. This is because $m_Y$ is an unbiased estimate of the population mean, $\text{E}(Y)$, and we assume that the individual value we are predicting is a member of this population. As discussed in the preceding paragraph, however, the uncertainty is larger for prediction intervals than for confidence intervals. To see how much larger, we need to return to the notion of a model that we introduced in Section 1.2.
We can express the model we have been using to estimate the population mean, $\text{E}(Y)$, as
$$Y_i = \text{E}(Y) + e_i \quad (i = 1, \ldots, n).$$
In other words, each sample value, $Y_i$ (the index $i$ keeps track of the sample observations), can be decomposed into two pieces, a deterministic part that is the same for all values, and a random error part that varies from observation to observation. A convenient choice for the deterministic part is the population mean, $\text{E}(Y)$, since then the random errors have a (population) mean of zero. Since $\text{E}(Y)$ is the same for all values, the random errors, $e$, have the same standard deviation as the values themselves, that is, $\text{SD}(Y)$. We can use this decomposition to derive the confidence interval and hypothesis test results of Sections 1.5 and 1.6 (although it would take more mathematics than we really need for our purposes in this book). Moreover, we can also use this decomposition to motivate the precise form of the uncertainty needed for prediction intervals (without having to get into too much mathematical detail).
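In particular, write the value to be predicted as $Y^*$, and decompose this into two pieces as above:
$$Y^* = \text{E}(Y) + e^*.$$
Then subtract $M_Y$, which represents potential values of repeated sample means, from both sides of this equation:
$$\underbrace{Y^* - M_Y}_{\text{prediction error}} = \underbrace{(\text{E}(Y) - M_Y)}_{\text{estimation error}} + \underbrace{e^*}_{\text{random error}} \tag{1.1}$$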
Thus, in estimating the population mean, the only error we have to worry about is estimation error, whereas in predicting an individual value, we have to worry about both estimation error and random error.
Recall from Section 1.5 that the form of a confidence interval for the population mean is
$$m_Y \pm \text{t-percentile}\,(s_Y/\sqrt{n}).$$
The term $s_Y/\sqrt{n}$ in this formula is an estimate of the standard deviation of the sampling distribution of sample means, $M_Y$, and is called the standard error of estimation. The square of this quantity, $s_Y^2/n$, is the estimated variance of the sampling distribution of sample means, $M_Y$. Then, thinking of $\text{E}(Y)$ as some fixed, unknown constant, $s_Y^2/n$ is also the estimated variance of the estimation error, $\text{E}(Y) - M_Y$, in expression (1.1).
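The estimated variance of the random error, $e^*$, in expression (1.1) is $s_Y^2$. Because the value to be predicted is independent of the sample, the variances of the estimation error and the random error add, and it can then be shown that the estimated variance of the prediction error, $Y^* - M_Y$, in expression (1.1) is $s_Y^2/n + s_Y^2 = s_Y^2(1 + 1/n)$. Then, $s_Y\sqrt{1 + 1/n}$ is called the standard error of prediction.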
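The variance result above can be checked numerically. The following is a minimal simulation sketch, not part of the book's software files: it assumes a normal population with hypothetical values for the mean and standard deviation, and it checks the population counterpart of the result, namely that the variance of $Y^* - M_Y$ is $\text{SD}(Y)^2(1 + 1/n)$. The function name and all numeric settings are illustrative assumptions.

```python
import random
import statistics


def simulate_prediction_error_variance(pop_mean, pop_sd, n, reps, seed=1):
    """Simulate the prediction error Y* - M_Y and return its sample variance."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        m_y = statistics.fmean(sample)        # mean of this particular sample
        y_star = rng.gauss(pop_mean, pop_sd)  # new value, independent of the sample
        errors.append(y_star - m_y)           # prediction error
    return statistics.variance(errors)


# Hypothetical population values, chosen only for illustration.
pop_mean, pop_sd, n = 100.0, 10.0, 30
simulated = simulate_prediction_error_variance(pop_mean, pop_sd, n, reps=20000)
theoretical = pop_sd**2 * (1 + 1 / n)         # SD(Y)^2 (1 + 1/n)

print(f"simulated variance of prediction error:   {simulated:.1f}")
print(f"theoretical variance of prediction error: {theoretical:.1f}")
```

The two printed values should agree up to simulation noise, which shrinks as `reps` increases.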
Thus, in general, we can write a prediction interval for an individual value, $Y^*$, as
$$m_Y \pm \text{t-percentile}\,(s_Y\sqrt{1 + 1/n}),$$
where $m_Y$ is the sample mean, $s_Y$ is the sample standard deviation, $n$ is the sample size, and the t-percentile comes from a t-distribution with $n - 1$ degrees of freedom.
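As an illustration, the confidence interval and prediction interval formulas can both be computed directly from the summary statistics. The sketch below assumes that the t-percentile is looked up in a table and passed in by hand (for a 95% interval, the 97.5th percentile); the function names and the summary statistics are hypothetical and are not the home prices data.

```python
import math


def confidence_interval(m_y, s_y, n, t_percentile):
    """Interval for the population mean: m_Y +/- t-percentile * s_Y / sqrt(n)."""
    half_width = t_percentile * s_y / math.sqrt(n)
    return (m_y - half_width, m_y + half_width)


def prediction_interval(m_y, s_y, n, t_percentile):
    """Interval for an individual value: m_Y +/- t-percentile * s_Y * sqrt(1 + 1/n)."""
    half_width = t_percentile * s_y * math.sqrt(1 + 1 / n)
    return (m_y - half_width, m_y + half_width)


# Hypothetical summary statistics, for illustration only.
m_y, s_y, n = 50.0, 8.0, 25
t_975 = 2.064  # 97.5th percentile of a t-distribution with n - 1 = 24 df

ci = confidence_interval(m_y, s_y, n, t_975)
pi = prediction_interval(m_y, s_y, n, t_975)
print("95% confidence interval for the population mean:", ci)
print("95% prediction interval for an individual value:", pi)
```

Both intervals are centered at the sample mean; only the standard error differs.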
For example, for a 95% interval (i.e., with 2.5% in each tail), the 97.5th percentile would be needed, whereas for a 90% interval (i.e., with 5% in each tail), the 95th percentile would be needed. These percentiles can be obtained from Table C.1. For example, the 95% prediction interval for an individual value of $Y$ picked at random from the population of single-family home sale prices is calculated as
$$m_Y \pm \text{97.5th percentile}\,(s_Y\sqrt{1 + 1/30}),$$
where the percentile comes from a t-distribution with $30 - 1 = 29$ degrees of freedom.
What about the interpretation of a prediction interval? Well, for the home prices example, loosely speaking, we can say that we are 95% confident that the sale price for an individual home picked at random from all single-family homes in this housing market will lie within the interval just calculated. More precisely, if we were to take a large number of random samples of size 30 from our population of sale prices and calculate a 95% prediction interval for each, then 95% of those prediction intervals would contain the (unknown) sale price for an individual home picked at random from the population.
Interpretation of a prediction interval for an individual value:
Suppose we have calculated a 95% prediction interval for an individual value to be $(a, b)$. Then we can say that we are 95% confident that the individual value is between $a$ and $b$.
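The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a normal population with a hypothetical mean and standard deviation, uses samples of size 30, and takes 2.045 as the 97.5th percentile of a t-distribution with 29 degrees of freedom; the function name and settings are illustrative assumptions.

```python
import math
import random
import statistics


def prediction_interval_coverage(pop_mean, pop_sd, n, t_percentile, reps, seed=1):
    """Proportion of simulated prediction intervals that contain a new individual value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        m_y = statistics.fmean(sample)
        s_y = statistics.stdev(sample)
        half_width = t_percentile * s_y * math.sqrt(1 + 1 / n)
        y_star = rng.gauss(pop_mean, pop_sd)  # individual value picked at random
        if m_y - half_width <= y_star <= m_y + half_width:
            hits += 1
    return hits / reps


# Hypothetical normal population; n = 30 matches the sample size used in the text.
coverage = prediction_interval_coverage(
    pop_mean=100.0, pop_sd=10.0, n=30, t_percentile=2.045, reps=20000
)
print(f"proportion of intervals containing the new value: {coverage:.3f}")
```

The printed proportion should be close to 0.95, up to simulation noise.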
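As discussed at the beginning of this section, the 95% prediction interval for an individual value, $Y^*$, is much wider than the 95% confidence interval for the population mean single-family home sale price, which was calculated as
$$m_Y \pm \text{97.5th percentile}\,(s_Y/\sqrt{30}).$$
The two intervals share the same center and the same t-percentile, so with $n = 30$ the prediction interval is $\sqrt{n + 1} = \sqrt{31} \approx 5.6$ times as wide as the confidence interval.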
Unlike for confidence intervals for the population mean, statistical software does not generally provide an automated method to calculate prediction intervals for an individual value. Thus, they have to be calculated by hand using the sample statistics, $m_Y$ and $s_Y$. However, there is a trick that can get around this (although it makes use of simple linear regression, which we cover in Chapter 2). First, create a variable that consists only of the value 1 for all observations. Then, fit a simple linear regression model using this variable as the predictor variable and $Y$ as the response variable, and restrict the model to fit without an intercept (see computer help #25 in the software information files available from the book website). The estimated regression equation for this model will be a constant value equal to the sample mean of the response variable. Prediction intervals for this model will be the same for each value of the predictor variable (see computer help #30), and will be the same as a prediction interval for an individual value. As further practice, calculate a 90% prediction interval for an individual sale price (see Problem 1.10), either by hand or using the trick just described. Because the 95th percentile is smaller than the 97.5th percentile, the resulting interval should be narrower than the 95% prediction interval.
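To see why the trick works, the no-intercept regression can be worked through by hand. The sketch below is not tied to any statistical package: it assumes the standard least squares formulas for a regression through the origin (which anticipate Chapter 2), and the data values and function name are hypothetical.

```python
import math
import statistics


def no_intercept_fit(x, y):
    """Least squares fit of y = b * x; returns b, the residual standard error, and sum(x^2)."""
    sum_xx = sum(xi * xi for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum_xx
    rss = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (len(y) - 1))  # n - 1 df because one parameter is estimated
    return b, s, sum_xx


# Hypothetical data values, for illustration only.
y = [12.1, 9.8, 11.4, 10.7, 13.2, 9.5, 10.9, 12.6]
x = [1.0] * len(y)                     # predictor equal to 1 for every observation

b, s, sum_xx = no_intercept_fit(x, y)
x_new = 1.0
se_pred_regression = s * math.sqrt(1 + x_new**2 / sum_xx)

n = len(y)
se_pred_direct = statistics.stdev(y) * math.sqrt(1 + 1 / n)

print("fitted constant:", b, "| sample mean:", statistics.fmean(y))
print("standard error of prediction (regression):", se_pred_regression)
print("standard error of prediction (direct):    ", se_pred_direct)
```

With the predictor fixed at 1, the fitted constant equals the sample mean and the two standard errors of prediction agree, so multiplying either by the appropriate t-percentile gives the same interval.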
We derived the formula for a confidence interval for a univariate population mean from the t-version of the central limit theorem, which does not require the data values to be normally distributed. However, the formula for a prediction interval for an individual univariate value tends to work better for datasets in which the values are at least approximately normally distributed (see Problem 1.12).
We spent some time in this chapter coming to grips with summarizing data (graphically and numerically) and understanding sampling distributions, but the four major concepts that will carry us through the rest of the book are as follows:
1. Statistical thinking is the process of analyzing quantitative information about a random sample of observations and drawing conclusions (statistical inferences) about the population from which the sample was drawn. An example is using a univariate sample mean, $m_Y$, as an estimate of the corresponding population mean and calculating the sample standard deviation, $s_Y$, to evaluate the precision of this estimate.
2. Confidence intervals are one method for calculating the sample estimate of a parameter (such as the population mean) and its associated uncertainty. An example is the confidence interval for a univariate population mean, which takes the form $m_Y \pm \text{t-percentile}\,(s_Y/\sqrt{n})$.
3. Hypothesis testing provides another means of making decisions about the likely values of a population parameter. An example is hypothesis testing for a univariate population mean, whereby the magnitude of a calculated sample test statistic, $\text{t-statistic} = (m_Y - \text{E}(Y))/(s_Y/\sqrt{n})$, where $\text{E}(Y)$ is the hypothesized value of the population mean, indicates which of two hypotheses (about likely values for the population mean) we should favor.
4. Prediction intervals, while similar in spirit to confidence intervals, tackle the different problem of predicting the value of an individual observation picked at random from the population. An example is the prediction interval for an individual univariate value, which takes the form $m_Y \pm \text{t-percentile}\,(s_Y\sqrt{1 + 1/n})$.
Problems
Computer help refers to the numbered items in the software information files available from the book website. There are brief answers to the even-numbered problems in Appendix F (www.wiley.com/go/pardoe/AppliedRegressionModeling3e).
1.1 Assume that weekly orders of a popular mobile phone at a local store follow a normal distribution with mean $\text{E}(Y)$ and standard deviation $\text{SD}(Y)$.
(a) Find the orders, $Y$, that correspond to the:
    (i) 95th percentile (i.e., find $Y_{95}$ such that $\Pr(Y < Y_{95}) = 0.95$);
    (ii) 50th percentile (i.e., find $Y_{50}$ such that $\Pr(Y < Y_{50}) = 0.50$);
    (iii) 2.5th percentile (i.e., find $Y_{2.5}$ such that $\Pr(Y < Y_{2.5}) = 0.025$).
(b) Suppose $M_Y$ represents potential values of repeated sample means from this population for samples of size $n$. Use the normal version of the central limit theorem to find the mean orders, $M_Y$, that correspond to the:
    (i) 95th percentile (i.e., find $M_{95}$ such that $\Pr(M_Y < M_{95}) = 0.95$);
    (ii) 50th percentile (i.e., find $M_{50}$ such that $\Pr(M_Y < M_{50}) = 0.50$);
    (iii) 2.5th percentile (i.e., find $M_{2.5}$ such that $\Pr(M_Y < M_{2.5}) = 0.025$).
(c) How many phones should the store order to be 95% confident they can meet demand for a particular week?
1.2 Assume that final scores in a statistics course follow a normal distribution with mean $\text{E}(Y)$ and standard deviation $\text{SD}(Y)$.
(a) Find the scores, $Y$, that correspond to the:
    (i) 90th percentile (i.e., find $Y_{90}$ such that $\Pr(Y < Y_{90}) = 0.90$);
    (ii) 99th percentile (i.e., find $Y_{99}$ such that $\Pr(Y < Y_{99}) = 0.99$);
    (iii) 5th percentile (i.e., find $Y_{5}$ such that $\Pr(Y < Y_{5}) = 0.05$).
(b) Suppose $M_Y$ represents potential values of repeated sample means from this population for samples of size $n$ (e.g., average class scores). Use the normal version of the central limit theorem to find the mean scores, $M_Y$, that correspond to the:
    (i) 90th percentile (i.e., find $M_{90}$ such that $\Pr(M_Y < M_{90}) = 0.90$);
    (ii) 99th percentile (i.e., find $M_{99}$ such that $\Pr(M_Y < M_{99}) = 0.99$);
    (iii) 5th percentile (i.e., find $M_{5}$ such that $\Pr(M_Y < M_{5}) = 0.05$).
(c) If the bottom 5% of the class fail, what is the cutoff percentage to pass the class?
(d) The university requires the long-term average class score for this course to be no higher than 75%. Does this requirement seem feasible?