Applied Regression Modeling - Iain Pardoe

Suppose that a random sample of \(n\) data values, represented by \(Y_1, Y_2, \ldots, Y_n\), comes from a population that has a mean of \(\mathrm{E}(Y)\). Imagine taking a large number of random samples of \(n\) data values and calculating the mean and standard deviation for each sample. As before, we will let \(M_Y\) represent the imagined list of repeated sample means, and similarly, we will let \(S_Y\) represent the imagined list of repeated sample standard deviations. Define

\[
t = \frac{M_Y - \mathrm{E}(Y)}{S_Y / \sqrt{n}}.
\]

Under very general conditions, \(t\) has an approximate t-distribution with \(n - 1\) degrees of freedom. The two differences from the normal version of the central limit theorem that we used before are that the repeated sample standard deviations, \(S_Y\), replace an assumed population standard deviation, \(\mathrm{SD}(Y)\), and that the resulting sampling distribution is a t-distribution (not a normal distribution).
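
The repeated-sampling idea behind this result can be sketched with a small simulation. The population used here (normal, with mean 280 and standard deviation 50) and the helper name `t_statistic` are assumptions made purely for illustration; the value 1.311 is the tabulated 90th percentile of the t-distribution with 29 degrees of freedom. If the result holds, roughly 90% of the simulated t-statistics should fall below 1.311.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population, assumed for illustration only.
POP_MEAN, POP_SD = 280.0, 50.0
N = 30                 # sample size
T90_DF29 = 1.311       # 90th percentile of t with 29 df (table value)


def t_statistic(sample, pop_mean):
    """(sample mean - population mean) / (sample SD / sqrt(n))."""
    m = statistics.fmean(sample)
    s = statistics.stdev(sample)
    return (m - pop_mean) / (s / math.sqrt(len(sample)))


reps = 5000
t_values = [
    t_statistic([random.gauss(POP_MEAN, POP_SD) for _ in range(N)], POP_MEAN)
    for _ in range(reps)
]

# Share of repeated samples whose t-statistic is below the 90th percentile.
share_below = sum(t < T90_DF29 for t in t_values) / reps
print(f"Share of t-statistics below {T90_DF29}: {share_below:.3f}")
```

The share will not be exactly 0.90 in any single run, since it is itself based on a finite number of simulated samples.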

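To illustrate, let us repeat the calculations from Section 1.4.1 based on an assumed population mean, \(\mathrm{E}(Y) = 280\), but rather than using an assumed population standard deviation, \(\mathrm{SD}(Y)\), we will instead use our observed sample standard deviation, 53.8656 for \(s_Y\). To find the 90th percentile of the sampling distribution of the mean sale price, \(M_Y\):

\[
\begin{aligned}
\Pr(t_{29} < 1.311) &= 0.90 \\
\Pr\!\left(\frac{M_Y - 280}{53.8656/\sqrt{30}} < 1.311\right) &= 0.90 \\
\Pr\!\left(M_Y < 280 + 1.311\,(53.8656/\sqrt{30})\right) &= 0.90 \\
\Pr(M_Y < 292.893) &= 0.90.
\end{aligned}
\]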

Thus, the 90th percentile of the sampling distribution of \(M_Y\) is $293,000 (to the nearest $1,000).
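
The arithmetic behind this percentile can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, the sample size of 30 is the one used for the home prices example, and 1.311 is the tabulated 90th percentile of the t-distribution with 29 degrees of freedom (hard-coded rather than computed).

```python
import math

n = 30                   # sample size
s_y = 53.8656            # observed sample standard deviation
assumed_mean = 280.0     # assumed population mean E(Y)
t_90 = 1.311             # 90th percentile of t with 29 df (table value)

standard_error = s_y / math.sqrt(n)
percentile_90 = assumed_mean + t_90 * standard_error

print(round(percentile_90, 3))  # 292.893
```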

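Turning this around, what is the probability that \(M_Y\) is greater than 292.893?

\[
\begin{aligned}
\Pr(M_Y > 292.893) &= \Pr\!\left(\frac{M_Y - 280}{53.8656/\sqrt{30}} > \frac{292.893 - 280}{53.8656/\sqrt{30}}\right) \\
&= \Pr(t_{29} > 1.311) \\
&= 0.10.
\end{aligned}
\]

So, the probability that \(M_Y\) is greater than 292.893 is 0.10.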

So far, we have focused on the sampling distribution of sample means, \(M_Y\), but what we would really like to do is infer what the observed sample mean, \(m_Y\), tells us about the population mean, \(\mathrm{E}(Y)\). Thus, while the preceding calculations have been useful for building up intuition about sampling distributions and manipulating probability statements, their main purpose has been to prepare the ground for the next two sections, which cover how to make statistical inferences about the population mean, \(\mathrm{E}(Y)\).

1.5 Interval Estimation

We have already seen that the sample mean, \(m_Y\), is a good point estimate of the population mean, \(\mathrm{E}(Y)\) (in the sense that it is unbiased; see Section 1.4). It is also helpful to know how reliable this estimate is, that is, how much sampling uncertainty is associated with it. A useful way to express this uncertainty is to calculate an interval estimate or confidence interval for the population mean, \(\mathrm{E}(Y)\). The interval should be centered at the point estimate (in this case, \(m_Y\)), and since we are probably equally uncertain that the population mean could be lower or higher than this estimate, it should have the same amount of uncertainty either side of the point estimate. We quantify this uncertainty with a number called the margin of error. Thus, the confidence interval is of the form "point estimate ± margin of error" or "(point estimate - margin of error, point estimate + margin of error)."

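For a 95% confidence interval based on a sample of size 30, the lower bound comes from the 97.5th percentile of the t-distribution with 29 degrees of freedom, which is 2.045:

\[
\begin{aligned}
\Pr(t_{29} < 2.045) &= 0.975 \\
\Pr\!\left(\frac{M_Y - \mathrm{E}(Y)}{S_Y/\sqrt{n}} < 2.045\right) &= 0.975 \\
\Pr\!\left(M_Y - \mathrm{E}(Y) < 2.045\,(S_Y/\sqrt{n})\right) &= 0.975 \\
\Pr\!\left(-\mathrm{E}(Y) < -M_Y + 2.045\,(S_Y/\sqrt{n})\right) &= 0.975 \\
\Pr\!\left(\mathrm{E}(Y) > M_Y - 2.045\,(S_Y/\sqrt{n})\right) &= 0.975.
\end{aligned}
\]

The difference from earlier calculations is that this time \(\mathrm{E}(Y)\) is the focus of inference, so we have not assumed that we know its value. One consequence for the probability calculation is that in the fourth line we have \(-\mathrm{E}(Y)\). To change this to \(+\mathrm{E}(Y)\) in the fifth line, we multiply each side of the inequality sign by \(-1\) (this also has the effect of changing the direction of the inequality sign).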

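This probability statement must be true for all potential values of \(M_Y\) and \(S_Y\). In particular, it must be true for our observed sample statistics, \(m_Y = 278.6033\) and \(s_Y = 53.8656\). Thus, to find the values of \(\mathrm{E}(Y)\) that satisfy the probability statement, we plug in our sample statistics to find

\[
m_Y - 2.045\,(s_Y/\sqrt{n}) = 278.6033 - 2.045\,(53.8656/\sqrt{30}) = 278.6033 - 20.1115 = 258.492.
\]

This shows that a population mean greater than $258,492 would satisfy the expression \(\Pr(t_{29} < 2.045) = 0.975\). In other words, we have found that the lower bound of our confidence interval is $258,492, or approximately $258,000. The value 20.1115 in this calculation is the margin of error.

To find the upper bound, we perform a similar calculation:

\[
\begin{aligned}
\Pr(t_{29} > -2.045) &= 0.975 \\
\Pr\!\left(\frac{M_Y - \mathrm{E}(Y)}{S_Y/\sqrt{n}} > -2.045\right) &= 0.975 \\
\Pr\!\left(M_Y - \mathrm{E}(Y) > -2.045\,(S_Y/\sqrt{n})\right) &= 0.975 \\
\Pr\!\left(-\mathrm{E}(Y) > -M_Y - 2.045\,(S_Y/\sqrt{n})\right) &= 0.975 \\
\Pr\!\left(\mathrm{E}(Y) < M_Y + 2.045\,(S_Y/\sqrt{n})\right) &= 0.975.
\end{aligned}
\]

To find the values of \(\mathrm{E}(Y)\) that satisfy this expression, we plug in our sample statistics to find

\[
m_Y + 2.045\,(s_Y/\sqrt{n}) = 278.6033 + 2.045\,(53.8656/\sqrt{30}) = 278.6033 + 20.1115 = 298.715.
\]

This shows that a population mean less than $298,715 would satisfy the expression \(\Pr(t_{29} > -2.045) = 0.975\). In other words, we have found that the upper bound of our confidence interval is $298,715, or approximately $299,000. Again, the value 20.1115 in this calculation is the margin of error.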

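We can write these two calculations a little more concisely as

\[
\begin{aligned}
\Pr(-2.045 < t_{29} < 2.045) &= 0.95 \\
\Pr\!\left(-2.045 < \frac{M_Y - \mathrm{E}(Y)}{S_Y/\sqrt{n}} < 2.045\right) &= 0.95 \\
\Pr\!\left(M_Y - 2.045\,(S_Y/\sqrt{n}) < \mathrm{E}(Y) < M_Y + 2.045\,(S_Y/\sqrt{n})\right) &= 0.95.
\end{aligned}
\]

As before, we plug in our sample statistics to find the values of \(\mathrm{E}(Y)\) that satisfy this expression:

\[
m_Y \pm 2.045\,(s_Y/\sqrt{n}) = 278.6033 \pm 20.1115 = (258.492,\ 298.715).
\]

This shows that a population mean between $258,492 and $298,715 would satisfy the expression \(\Pr(-2.045 < t_{29} < 2.045) = 0.95\). In other words, we have found that a 95% confidence interval for \(\mathrm{E}(Y)\) for this example is ($258,492, $298,715), or approximately ($258,000, $299,000). It is traditional to write confidence intervals with the lower number on the left.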
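
The interval calculation can be reproduced with a short Python sketch. The function name `t_confidence_interval` is hypothetical, the sample mean of 278.6033 is the value from the book's home prices example (in thousands of dollars), and the 97.5th percentile of 2.045 is hard-coded from a t-table rather than computed.

```python
import math


def t_confidence_interval(sample_mean, sample_sd, n, t_percentile):
    """Return (lower, upper, margin) for mean +/- t-percentile * sd / sqrt(n)."""
    margin = t_percentile * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin


# Home prices example: n = 30, so the t-distribution has 29 degrees of freedom.
lower, upper, margin = t_confidence_interval(
    sample_mean=278.6033, sample_sd=53.8656, n=30, t_percentile=2.045
)

print(f"margin of error: {margin:.4f}")       # margin of error: 20.1115
print(f"95% CI: ({lower:.3f}, {upper:.3f})")  # 95% CI: (258.492, 298.715)
```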

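More generally, using symbols, a 95% confidence interval for a univariate population mean, \(\mathrm{E}(Y)\), results from the following:

\[
\begin{aligned}
\Pr(-\text{97.5th percentile} < t_{n-1} < \text{97.5th percentile}) &= 0.95 \\
\Pr\!\left(-\text{97.5th percentile} < \frac{M_Y - \mathrm{E}(Y)}{S_Y/\sqrt{n}} < \text{97.5th percentile}\right) &= 0.95 \\
\Pr\!\left(M_Y - \text{97.5th percentile}\,(S_Y/\sqrt{n}) < \mathrm{E}(Y) < M_Y + \text{97.5th percentile}\,(S_Y/\sqrt{n})\right) &= 0.95,
\end{aligned}
\]

where the 97.5th percentile comes from the t-distribution with \(n - 1\) degrees of freedom. In other words, plugging in our observed sample statistics, \(m_Y\) and \(s_Y\), we can write the 95% confidence interval as \(m_Y \pm \text{97.5th percentile}\,(s_Y/\sqrt{n})\). In this expression, \(\text{97.5th percentile}\,(s_Y/\sqrt{n})\) is the margin of error.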

For a lower or higher level of confidence than 95%, the percentile used in the calculation must be changed as appropriate. For example, for a 90% interval (i.e., with 5% in each tail), the 95th percentile would be needed, whereas for a 99% interval (i.e., with 0.5% in each tail), the 99.5th percentile would be needed. These percentiles can be obtained from the table Univariate Data in Notation and Formulas (which is an expanded version of the table in Section 1.4.2). Instructions for using the table can be found in Notation and Formulas.
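
The link between the confidence level and the percentile needed can be sketched as follows. The function names are hypothetical, the small lookup table holds standard t-table values for 29 degrees of freedom only (so it applies to the home prices sample of size 30 and nothing else), and the sample mean of 278.6033 is taken from the book's home prices example.

```python
import math

# Tabulated percentiles of the t-distribution with 29 degrees of freedom.
T_PERCENTILES_DF29 = {95.0: 1.699, 97.5: 2.045, 99.5: 2.756}


def upper_percentile(confidence_level):
    """Percentile needed for a two-sided interval, e.g. 0.95 -> 97.5."""
    tail = (1 - confidence_level) / 2  # probability left in each tail
    return round(100 * (1 - tail), 1)


def interval_df29(sample_mean, sample_sd, confidence_level, n=30):
    """Confidence interval using the 29-df table above (valid for n = 30 only)."""
    t = T_PERCENTILES_DF29[upper_percentile(confidence_level)]
    margin = t * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


for level in (0.90, 0.95, 0.99):
    low, high = interval_df29(278.6033, 53.8656, level)
    print(f"{level:.0%}: percentile {upper_percentile(level)}, "
          f"interval ({low:.3f}, {high:.3f})")
```

Higher confidence levels use larger percentiles and therefore give wider intervals.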

Thus, in general, we can write a confidence interval for a univariate mean, \(\mathrm{E}(Y)\), as

\[
m_Y \pm \text{t-percentile}\,(s_Y/\sqrt{n}),
\]

where \(m_Y\) is the sample mean, \(s_Y\) is the sample standard deviation, \(n\) is the sample size, and the t-percentile comes from a t-distribution with \(n - 1\) degrees of freedom. In this expression, \(\text{t-percentile}\,(s_Y/\sqrt{n})\) is the margin of error.

The example above becomes

\[
m_Y \pm \text{97.5th percentile}\,(s_Y/\sqrt{n}) = 278.6033 \pm 2.045\,(53.8656/\sqrt{30}) = 278.6033 \pm 20.1115 = (258.492,\ 298.715).
\]

Computer help #23 in the software information files available from the book website shows how to use statistical software to calculate confidence intervals for the population mean. As further practice, calculate a 90% confidence interval for the population mean for the home prices example (see Problem 1.10). You should find that it is approximately ($261,900, $295,300).

Now that we have calculated a confidence interval, what exactly does it tell us? Well, for the home prices example, loosely speaking, we can say that "we are 95% confident that the mean single-family home sale price in this housing market is between $258,000 and $299,000." This will get you by among friends (as long as none of your friends happen to be expert statisticians). But to provide a more precise interpretation we have to revisit the notion of hypothetical repeated samples. If we were to take a large number of random samples of size 30 from our population of sale prices and calculate a 95% confidence interval for each, then 95% of those confidence intervals would contain the (unknown) population mean. We do not know (nor will we ever know) whether the 95% confidence interval for our particular sample contains the population mean; thus, strictly speaking, we cannot say "the probability that the population mean is in our interval is 0.95." All we know is that the procedure that we have used to calculate the 95% confidence interval tends to produce intervals that under repeated sampling contain the population mean 95% of the time. Stick with the phrase "95% confident" and avoid using the word "probability," and chances are that no one (not even expert statisticians) will be too offended.
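
The repeated-sampling interpretation can be illustrated with a simulation. This is a sketch under assumptions: the population (normal, mean 280, standard deviation 50) is hypothetical and chosen only so that the true mean is known, the function name `ci95` is illustrative, and 2.045 is the tabulated 97.5th percentile of the t-distribution with 29 degrees of freedom.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical population, assumed for illustration only.
POP_MEAN, POP_SD = 280.0, 50.0
N = 30
T975_DF29 = 2.045  # 97.5th percentile of t with 29 df (table value)


def ci95(sample):
    """95% confidence interval for the mean from one sample of size 30."""
    m = statistics.fmean(sample)
    margin = T975_DF29 * statistics.stdev(sample) / math.sqrt(len(sample))
    return m - margin, m + margin


reps = 5000
covered = 0
for _ in range(reps):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    low, high = ci95(sample)
    if low < POP_MEAN < high:
        covered += 1

coverage = covered / reps
print(f"Share of intervals containing the population mean: {coverage:.3f}")
```

Each individual interval either contains the population mean or it does not; the 95% figure describes the long-run behavior of the procedure across the simulated samples.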

Interpretation of a confidence interval for a univariate mean:

Suppose we have calculated a 95% confidence interval for a univariate mean, \(\mathrm{E}(Y)\), to be \((a, b)\). Then we can say that we are 95% confident that \(\mathrm{E}(Y)\) is between \(a\) and \(b\).

Before moving on to Section 1.6, which describes another way to make statistical inferences about population means (hypothesis testing), let us consider whether we can now forget the normal distribution. The calculations in this section are based on the central limit theorem, which does not require the population to be normal. We have also seen that t-distributions are more useful than normal distributions for calculating confidence intervals. For large samples, it does not make much difference (note how the percentiles for t-distributions get closer to the percentiles for the standard normal distribution as the degrees of freedom get larger in Table C.1), but for smaller samples it can make a large difference. So for this type of calculation, we always use a t-distribution from now on. However, we cannot completely forget about the normal distribution yet; it will come into play again in a different context in later chapters.
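
The convergence of t-percentiles toward the standard normal percentile can be seen in a small comparison. The t-values below are standard 97.5th percentiles hard-coded from a t-table for a few illustrative degrees of freedom (they are not read from the book's Table C.1), and the normal percentile comes from Python's standard library.

```python
from statistics import NormalDist

# 97.5th percentiles of the t-distribution for selected degrees of freedom
# (standard table values, rounded to three decimals).
T975_BY_DF = {2: 4.303, 5: 2.571, 10: 2.228, 29: 2.045, 60: 2.000, 120: 1.980}

# 97.5th percentile of the standard normal distribution (about 1.960).
z975 = NormalDist().inv_cdf(0.975)

# Gap between each t-percentile and the normal percentile.
gaps = {df: t - z975 for df, t in T975_BY_DF.items()}

for df, gap in gaps.items():
    print(f"df = {df:>3}: t = {T975_BY_DF[df]:.3f}, excess over normal = {gap:.3f}")
```

The excess shrinks steadily as the degrees of freedom grow, which is why the choice between the two distributions matters mainly for small samples.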

When using a t-distribution, how do we know how many degrees of freedom to use? One way to think about degrees of freedom is in terms of the information provided by the data we are analyzing. Roughly speaking, each data observation provides one degree of freedom (this is where the \(n\) in the degrees of freedom formula comes in), but we lose a degree of freedom for each population parameter that we have to estimate. So, in this chapter, when we are estimating the population mean, the degrees of freedom formula is \(n - 1\). In Chapter 2, when we will be estimating two population parameters (the intercept and the slope of a regression line), the degrees of freedom formula will be \(n - 2\). For the remainder of the book, the general formula for the degrees of freedom in a multiple linear regression model will be \(n - (k + 1)\) or \(n - k - 1\), where \(k\) is the number of predictor variables in the model. Note that this general formula actually also works for Chapter 2 (where \(k = 1\)) and even this chapter (where \(k = 0\), since a linear regression model with zero predictors is equivalent to estimating the population mean for a univariate dataset).
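
The general formula can be written as a one-line function. The name `degrees_of_freedom` is hypothetical and the sample size of 30 is used only as an example.

```python
def degrees_of_freedom(n, k):
    """Degrees of freedom for a linear regression model: n - k - 1.

    n is the sample size and k is the number of predictor variables;
    the extra 1 accounts for the intercept (or, when k = 0, the mean).
    """
    if n <= k + 1:
        raise ValueError("need more observations than estimated parameters")
    return n - k - 1


n = 30
print(degrees_of_freedom(n, 0))  # 29: estimating a univariate mean
print(degrees_of_freedom(n, 1))  # 28: simple linear regression
print(degrees_of_freedom(n, 3))  # 26: three predictor variables
```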
