Applied Regression Modeling - Iain Pardoe, p. 5



Histograms can convey very different impressions depending on the bin width, start point, and so on. Ideally, we want a large enough bin size to avoid excessive sampling noise (a histogram with many bins that looks very wiggly), but not so large that it is hard to see the underlying distribution (a histogram with few bins that looks too blocky). A reasonable pragmatic approach is to use the default settings in whichever software package we are using, and then perhaps to create a few more histograms with different settings to check that we are not missing anything. There are more sophisticated methods, but for the purposes of the methods in this book, this should suffice.
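As a rough illustration of how bin width changes the impression a histogram gives, the sketch below prints text histograms of the same sample at three bin widths. The sample is simulated (it is not the home prices data), and the helper names `bin_counts` and `print_histogram` are hypothetical, introduced only for this example.

```python
import random

def bin_counts(values, start, width):
    """Count values in consecutive bins [start + k*width, start + (k+1)*width)."""
    counts = {}
    for v in values:
        k = int((v - start) // width)  # index of the bin containing v
        counts[k] = counts.get(k, 0) + 1
    # Include empty bins so the printed histogram has no gaps.
    lo, hi = min(counts), max(counts)
    return [(start + k * width, counts.get(k, 0)) for k in range(lo, hi + 1)]

def print_histogram(values, start, width):
    """Print one row of asterisks per bin, labeled by the bin's left edge."""
    for left_edge, count in bin_counts(values, start, width):
        print(f"{left_edge:7.1f} | {'*' * count}")

# Simulated sample of 30 values; an assumption for illustration only.
random.seed(1)
sample = [random.gauss(280, 50) for _ in range(30)]

for width in (10, 50, 150):
    print(f"bin width = {width}")
    print_histogram(sample, start=0, width=width)
```

With a width of 10 the display tends to look wiggly, and with a width of 150 it collapses into a few blocks; an intermediate width usually shows the overall shape most clearly.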

 The sample mean, $m_Y$, is a measure of the central tendency of the data values.

 The sample standard deviation, $s_Y$, is a measure of the spread or variation in the data values.

We will not bother here with the formulas for these sample statistics. Since almost all of the calculations necessary for learning the material covered by this book will be performed by statistical software, the book only contains formulas when they are helpful in understanding a particular concept or provide additional insight to interested readers.

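We can calculate sample standardized values from the data values by subtracting the sample mean and dividing by the sample standard deviation:

$$Z = \frac{Y - m_Y}{s_Y}$$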

Sometimes, it is useful to work with sample standardized values rather than the original data values since sample standardized values have a sample mean of 0 and a sample standard deviation of 1. Try using statistical software to calculate sample standardized values for the home prices data, and then check that the mean and standard deviation of the values are 0 and 1, respectively.
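The check suggested above can be sketched in a few lines of Python using only the standard library. The `prices` list holds made-up values standing in for the home prices data, and the function name `standardize` is a hypothetical helper, not something from the book.

```python
import statistics

def standardize(values):
    """Return sample standardized values: (value - sample mean) / sample sd."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(y - m) / s for y in values]

# Hypothetical sale prices, NOT the book's dataset.
prices = [210.0, 185.5, 320.0, 275.0, 240.0, 199.9]

z_values = standardize(prices)
print("mean of standardized values:", round(statistics.mean(z_values), 10))
print("sd of standardized values:  ", round(statistics.stdev(z_values), 10))
```

Up to floating-point rounding, the mean comes out as 0 and the standard deviation as 1, whatever data values are supplied (provided they are not all identical).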

Statistical software can also calculate additional sample statistics, such as:

 the median (another measure of central tendency, but one that is less sensitive than the sample mean to very small or very large values in the data): half the dataset values are smaller than this quantity and half are larger;

 the minimum and maximum;

 percentiles or quantiles such as the 25th percentile, which is the smallest value that is larger than 25% of the values in the dataset (i.e., 25% of the dataset values are smaller than the 25th percentile, while 75% of the dataset values are larger).
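A minimal sketch of computing the statistics listed above with Python's standard `statistics` module follows. The data values are made up, and `summarize` is a hypothetical helper name. Software packages use slightly different conventions for interpolating percentiles, so the quartiles reported here may differ a little from those produced by other software.

```python
import statistics

def summarize(values):
    """Return the median, extremes, and quartiles of a list of numbers."""
    # quantiles(..., n=4) returns the three cut points that split the data
    # into four equal-probability groups (25th, 50th, 75th percentiles).
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "25th percentile": q1,
        "median": statistics.median(values),
        "75th percentile": q3,
        "max": max(values),
    }

# Hypothetical data values, NOT the book's dataset.
values = [210.0, 185.5, 320.0, 275.0, 240.0, 199.9, 260.0, 305.5]

summary = summarize(values)
for name, value in summary.items():
    print(f"{name:>16}: {value:.2f}")
```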

There are many other methods, numerical and graphical, for summarizing data. For example, another popular graph besides the histogram is the boxplot; see Chapter 6 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e) for some examples of boxplots used in case studies.

1.2 Population Distributions

While the methods of the preceding section are useful for describing and displaying sample data, the real power of statistics is revealed when we use samples to give us information about populations. In this context, a population is the entire collection of objects of interest, for example, the sale prices for all single-family homes in the housing market represented by our dataset. We would like to know more about this population to help us make a decision about which home to buy, but the only data we have is a random sample of 30 sale prices.

Nevertheless, we can employ statistical thinking to draw inferences about the population of interest by analyzing the sample data. In particular, we use the notion of a model, a mathematical abstraction of the real world, which we fit to the sample data. If this model provides a reasonable fit to the data, that is, if it can approximate the manner in which the data vary, then we assume it can also approximate the behavior of the population. The model then provides the basis for making decisions about the population, by, for example, identifying patterns, explaining variation, and predicting future values. Of course, this process can work only if the sample data can be considered representative of the population. One way to address this is to randomly select the sample from the population. There are other more complex sampling methods that are used to select representative samples, and there are also ways to make adjustments to models to account for known nonrandom sampling. However, we do not consider these here; any good sampling textbook should cover these issues.

Since the real world can be extremely complicated (in the way that data values vary or interact together), models are useful because they simplify problems so that we can better understand them (and then make more effective decisions). On the one hand, we therefore need models to be simple enough that we can easily use them to make decisions, but on the other hand, we need models that are flexible enough to provide good approximations to complex situations. Fortunately, many statistical models have been developed over the years that provide an effective balance between these two criteria. One such model, which provides a good starting point for the more complicated models we consider later, is the normal distribution.

The key feature of the normal density curve that allows us to make statistical inferences is that areas under the curve represent probabilities. The entire area under the curve is one, while the area under the curve between one point on the horizontal axis ($a$, say) and another point ($b$, say) represents the probability that a random variable that follows a standard normal distribution is between $a$ and $b$. So, for example, Figure 1.3 shows there is a probability of 0.475 that a standard normal random variable lies between $a = 0$ and $b = 1.96$, since the area under the curve between $a = 0$ and $b = 1.96$ is 0.475.

In particular, the upper-tail area to the right of 1.960 is 0.025; this is equivalent to saying that the area between 0 and 1.960 is 0.475 (since the entire area under the curve is 1 and the area to the right of 0 is 0.5). Similarly, the two-tail area, which is the sum of the areas to the right of 1.960 and to the left of -1.960, is two times 0.025, or 0.05.
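These areas can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); the variable names are chosen for this example only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Area under the density curve between 0 and 1.96.
area_0_to_196 = std_normal.cdf(1.96) - std_normal.cdf(0.0)

# Upper-tail area to the right of 1.96.
upper_tail = 1.0 - std_normal.cdf(1.96)

# Two-tail area: right of 1.96 plus left of -1.96.
two_tail = upper_tail + std_normal.cdf(-1.96)

print(round(area_0_to_196, 3))  # 0.475
print(round(upper_tail, 3))     # 0.025
print(round(two_tail, 3))       # 0.05
```

Going the other way, `std_normal.inv_cdf(0.975)` returns the point with an upper-tail area of 0.025, which is approximately 1.960.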

How does all this help us to make statistical inferences about populations such as that in our home prices example? The essential idea is that we fit a normal distribution model to our sample data and then use this model to make inferences about the corresponding population. For example, we can use probability calculations for a normal distribution (as shown in Figure 1.3) to make probability statements about a population modeled using that normal distribution; we will show exactly how to do this in Section 1.3. Before we do that, however, we pause to consider an aspect of this inferential sequence that can make or break the process. Does the model provide a close enough approximation to the pattern of sample values that we can be confident the model adequately represents the population values? The better the approximation, the more reliable our inferential statements will be.

We saw in Figure 1.2 how a density curve can be thought of as a histogram with a very large sample size. So one way to assess whether our population follows a normal distribution model is to construct a histogram from our sample data and visually determine whether it looks normal, that is, approximately symmetric and bell-shaped. This is a somewhat subjective decision, but with experience you should find that it becomes easier to discern clearly nonnormal histograms from those that are reasonably normal. For example, while the histogram in Figure 1.2 clearly looks like a normal density curve, the normality of the histogram of 30 sample sale prices in Figure 1.1 is less certain. A reasonable conclusion in this case would be that while this sample histogram is not perfectly symmetric and bell-shaped, it is close enough that the corresponding (hypothetical) population histogram could well be normal.
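To see why a histogram of only 30 values can look noticeably less normal than one built from a very large sample, the following sketch simulates both cases and compares histogram bin proportions with the corresponding areas under the standard normal curve. The data are simulated, and the helper `max_bin_error`, the seed, and the unit bin width are assumptions made for illustration.

```python
import random
from statistics import NormalDist

def max_bin_error(n, seed):
    """Largest gap between histogram bin proportions and normal-curve areas."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n)]  # simulated normal sample
    std_normal = NormalDist()  # mean 0, standard deviation 1
    worst = 0.0
    for left in range(-4, 4):  # unit-width bins covering [-4, 4)
        proportion = sum(left <= d < left + 1 for d in draws) / n
        area = std_normal.cdf(left + 1) - std_normal.cdf(left)
        worst = max(worst, abs(proportion - area))
    return worst

err_small = max_bin_error(30, seed=1)
err_large = max_bin_error(100_000, seed=1)
print(f"n = 30:     largest discrepancy {err_small:.3f}")
print(f"n = 100000: largest discrepancy {err_large:.3f}")
```

The discrepancy for the small sample is typically many times larger, even though both samples come from exactly the same normal population. This is one reason to allow some leeway when judging a small-sample histogram.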
