2) Basic Ideas of Linear Regression: The Two-Variable Model
In this chapter we introduced some fundamental ideas of regression analysis. Starting with the key concept of the population regression function (PRF), we developed the concept of the linear PRF. This book is primarily concerned with linear PRFs, that is, regressions that are linear in the parameters regardless of whether or not they are linear in the variables. We then introduced the idea of the stochastic PRF and discussed in detail the nature and role of the stochastic error term u. The PRF is, of course, a theoretical or idealized construct because, in practice, all we have is one or more samples from some population.
This necessitated the discussion of the sample regression function (SRF). We then considered the question of how we actually go about obtaining the SRF. Here we discussed the popular method of ordinary least squares (OLS) and presented the appropriate formulas to estimate the parameters of the PRF. We illustrated the OLS method with a fully worked-out numerical example as well as with several practical examples. Our next task is to find out how good the SRF obtained by OLS is as an estimator of the true PRF. We undertake this important task in Chapter 3.
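As a minimal sketch of the OLS formulas referred to above, the following Python function computes the intercept and slope of the SRF in the two-variable case from deviations about the sample means. The function name `ols_two_variable` and the five data points are illustrative assumptions, not the chapter's worked example.

```python
def ols_two_variable(x, y):
    """Return (b1, b2) for the sample regression function y_hat = b1 + b2*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-products of deviations over sum of squared x deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of sample means.
    b1 = y_bar - b2 * x_bar
    return b1, b2


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b1, b2 = ols_two_variable(x, y)
residuals = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]
print(f"b1 = {b1:.2f}, b2 = {b2:.2f}")  # b1 = 2.20, b2 = 0.60
```

With an intercept in the model, the OLS residuals sum to zero, which gives a quick sanity check on any hand computation.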
3) The Two-Variable Model: Hypothesis Testing
In Chapter 2 we showed how to estimate the parameters of the two-variable linear regression model. In this chapter we showed how the estimated model can be used for the purpose of drawing inferences about the true population regression model. Although the two-variable model is the simplest possible linear regression model, the ideas introduced in these two chapters are the foundation of the more involved multiple regression models that we will discuss in ensuing chapters. As we will see, in many ways the multiple regression model is a straightforward extension of the two-variable model.
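The inference step can be sketched for the slope of the two-variable model as follows. The code estimates the error variance as RSS/(n − 2), forms the standard error of the slope, and computes the t statistic under the null hypothesis that the true slope is zero. The data, the function name `slope_t_statistic`, and the hard-coded critical value (taken from a standard t table for 3 d.f.) are assumptions made for illustration.

```python
import math


def slope_t_statistic(x, y, beta2_null=0.0):
    """Return (b2, se_b2, t) for testing H0: B2 = beta2_null in y = B1 + B2*x + u."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b1 = y_bar - b2 * x_bar
    # Residual sum of squares and the estimator of the error variance (n - 2 d.f.).
    rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)
    se_b2 = math.sqrt(sigma2_hat / s_xx)
    t = (b2 - beta2_null) / se_b2
    return b2, se_b2, t


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b2, se_b2, t_stat = slope_t_statistic(x, y)

# Two-tailed 5% critical t value for n - 2 = 3 d.f., copied from a t table.
T_CRIT = 3.182
reject_null = abs(t_stat) > T_CRIT
ci_low, ci_high = b2 - T_CRIT * se_b2, b2 + T_CRIT * se_b2
print(f"t = {t_stat:.3f}, reject H0: {reject_null}")
```

The test of significance and the confidence interval give the same verdict: H0 is rejected exactly when zero lies outside the interval.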
4) Multiple Regression: Estimation and Hypothesis Testing
In this chapter we considered the simplest of the multiple regression models, namely, the three-variable linear regression model, which has one dependent variable and two explanatory variables. Although in many ways a straightforward extension of the two-variable linear regression model, the three-variable model introduced several new concepts, such as partial regression coefficients, the adjusted and unadjusted multiple coefficients of determination, and multicollinearity. Insofar as estimation of the multiple regression coefficients is concerned, we still worked within the framework of the classical linear regression model and used the method of ordinary least squares (OLS). The OLS estimators of the multiple regression model, like those of the two-variable model, possess several desirable statistical properties summed up in the Gauss-Markov property of best linear unbiased estimators (BLUE).
With the assumption that the disturbance term follows the normal distribution with zero mean and constant variance σ², we saw that, as in the two-variable case, each estimated coefficient in the multiple regression follows the normal distribution with a mean equal to the true population value and the variances given by the formulas developed in the text. Unfortunately, in practice, σ² is not known and has to be estimated. The OLS estimator of this unknown variance is σ̂². But if we replace σ² by σ̂², then, as in the two-variable case, each estimated coefficient of the multiple regression follows the t distribution, not the normal distribution. The knowledge that each multiple regression coefficient follows the t distribution with d.f. equal to (n − k), where k is the number of parameters estimated (including the intercept), means we can use the t distribution to test statistical hypotheses about each multiple regression coefficient individually.
This can be done on the basis of either the t test of significance or the confidence interval based on the t distribution. In this respect, the multiple regression model does not differ much from the two-variable model, except that proper allowance must be made for the d.f., which now depend on the number of parameters estimated. However, when testing the hypothesis that all partial slope coefficients are simultaneously equal to zero, the individual t testing referred to earlier is of no help.
Here we should use the analysis of variance (ANOVA) technique and the attendant F test. Incidentally, testing that all partial slope coefficients are simultaneously equal to zero is the same as testing that the multiple coefficient of determination R² is equal to zero. Therefore, the F test can also be used to test this latter but equivalent hypothesis. We also discussed the question of when to add a variable or a group of variables to a model, using either the t test or the F test. In this context we also discussed the method of restricted least squares.
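The equivalence between the joint hypothesis on the slopes and the hypothesis R² = 0 can be made concrete by writing the F statistic directly in terms of R². The sketch below uses the standard expression with (k − 1) and (n − k) degrees of freedom, together with the usual adjusted R². The function names and the numerical inputs are hypothetical.

```python
def f_from_r2(r2, n, k):
    """F statistic for H0: all partial slope coefficients are zero (i.e. R^2 = 0).

    n is the sample size and k the number of parameters estimated, including
    the intercept, so the numerator and denominator d.f. are (k - 1) and (n - k).
    """
    return (r2 / (k - 1)) / ((1.0 - r2) / (n - k))


def adjusted_r2(r2, n, k):
    """R^2 adjusted for degrees of freedom."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# Hypothetical three-variable regression: R^2 = 0.5 from n = 23 observations.
f_stat = f_from_r2(0.5, n=23, k=3)
r2_adj = adjusted_r2(0.5, n=23, k=3)
print(f"F = {f_stat:.2f}, adjusted R^2 = {r2_adj:.2f}")  # F = 10.00, adjusted R^2 = 0.45
```

The computed F is then compared with the critical F value for (k − 1, n − k) d.f. at the chosen level of significance.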
5) Functional Forms of Regression Models
In this chapter we considered models that are linear in parameters, or that can be rendered as such with suitable transformation, but that are not necessarily linear in variables. There are a variety of such models, each having special applications. We considered five major types of nonlinear-in-variable but linear-in-parameter models, namely:
1. The log-linear model, in which both the dependent variable and the explanatory variable are in logarithmic form.
2. The log-lin or growth model, in which the dependent variable is logarithmic but the independent variable is linear.
3. The lin-log model, in which the dependent variable is linear but the independent variable is logarithmic.
4. The reciprocal model, in which the dependent variable is linear but the independent variable is not.
5. The polynomial model, in which the independent variable enters with various powers.
Of course, there is nothing that prevents us from combining the features of two or more of these models.
Thus, we can have a multiple regression model in which the dependent variable is in log form and some of the X variables are also in log form, but some are in linear form. We studied the properties of these various models in terms of their relevance in applied research, their slope coefficients, and their elasticity coefficients. We also showed with several examples the situations in which the various models could be used. Needless to say, we will come across several more examples in the remainder of the text. In this chapter we also considered the regression-through-the-origin model and discussed some of its features. It cannot be overemphasized that in choosing among the competing models, the overriding objective should be the economic relevance of the various models and not merely the summary statistics, such as R².
Model building requires a proper balance of theory, availability of the appropriate data, a good understanding of the statistical properties of the various models, and the elusive quality that is called practical judgment. Since the theory underlying a topic of interest is never perfect, there is no such thing as a perfect model. What we hope for is a reasonably good model that will balance all these criteria. Whatever model is chosen in practice, we have to pay careful attention to the units in which the dependent and independent variables are expressed, for the interpretation of regression coefficients may hinge upon units of measurement.
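To illustrate how the slope is interpreted in two of these functional forms, the sketch below fits a log-linear model, whose slope is the elasticity, and a log-lin model, whose slope is the instantaneous growth rate. The data are generated from known hypothetical relationships with no error term, so the fitted slopes can be checked against the values built in. The helper `ols` and all numbers are assumptions for illustration.

```python
import math


def ols(x, y):
    """Return (intercept, slope) of the two-variable OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Log-linear model: ln Y = B1 + B2 ln X, so B2 is the elasticity of Y with respect to X.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * xi ** 0.5 for xi in x]  # hypothetical constant-elasticity relation
_, elasticity = ols([math.log(xi) for xi in x], [math.log(yi) for yi in y])

# Log-lin (growth) model: ln Y = B1 + B2 t, so B2 is the instantaneous growth rate.
t = [float(i) for i in range(10)]
y_growth = [100.0 * 1.05 ** ti for ti in t]  # hypothetical 5% compound growth
_, b2 = ols(t, [math.log(yi) for yi in y_growth])
compound_growth_rate = math.exp(b2) - 1.0

print(f"elasticity = {elasticity:.2f}")
print(f"compound growth rate = {compound_growth_rate:.2%}")
```

In the log-lin case the slope itself is ln(1.05), which is slightly below 0.05. Taking the antilog and subtracting 1 recovers the compound rate.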
6) Dummy Variable Regression Models
In this chapter we showed how qualitative, or dummy, variables taking values of 1 and 0 can be introduced into regression models alongside quantitative variables. As the various examples in the chapter showed, the dummy variables are essentially a data-classifying device in that they divide a sample into various subgroups based on qualities or attributes (sex, marital status, race, religion, etc.) and implicitly run individual regressions for each subgroup. Now if there are differences in the responses of the dependent variable to the variation in the quantitative variables in the various subgroups, they will be reflected in the differences in the intercepts or slope coefficients of the various subgroups, or both. Although it is a versatile tool, the dummy variable technique has to be handled carefully. First, if the regression model contains a constant term (as most models usually do), the number of dummy variables must be one less than the number of classifications of each qualitative variable.
Second, the coefficient attached to the dummy variables must always be interpreted in relation to the control, or benchmark, group—the group that gets the value of zero. Finally, if a model has several qualitative variables with several classes, introduction of dummy variables can consume a large number of degrees of freedom (d.f.). Therefore, we should weigh the number of dummy variables to be introduced into the model against the total number of observations in the sample. In this chapter we also discussed the possibility of committing a specification error, that is, of fitting the wrong model to the data. If intercepts as well as slopes are expected to differ among groups, we should build a model that incorporates both the differential intercept and slope dummies.
In this case a model that introduces only the differential intercepts is likely to lead to a specification error. Of course, it is not always easy a priori to find out which is the true model. Thus, some amount of experimentation is required in a concrete study, especially in situations where theory does not provide much guidance. The topic of specification error is discussed further in Chapter 7. In this chapter we also briefly discussed the linear probability model (LPM) in which the dependent variable is itself binary. Although the LPM can be estimated by ordinary least squares (OLS), there are several problems with a routine application of OLS. Some of the problems can be resolved easily and some cannot. Therefore, alternative estimating procedures are needed. We mentioned two such alternatives, the logit and probit models, but we did not discuss them in view of the somewhat advanced nature of these models (but see Chapter 12).
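The benchmark-group interpretation of a dummy coefficient can be checked with a small sketch. When the dependent variable is regressed on an intercept and a single 0/1 dummy, the intercept equals the mean of the benchmark group (the group coded 0) and the dummy coefficient equals the difference between the two group means. The data and variable names are hypothetical.

```python
def ols(x, y):
    """Return (intercept, slope) of the two-variable OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Hypothetical outcomes for two groups; D = 0 marks the benchmark group.
y = [10.0, 12.0, 14.0, 15.0, 17.0, 19.0]
d = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

intercept, differential_intercept = ols(d, y)

group0 = [yi for yi, di in zip(y, d) if di == 0.0]
group1 = [yi for yi, di in zip(y, d) if di == 1.0]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print(f"intercept = {intercept:.1f} (benchmark mean = {mean0:.1f})")
print(f"dummy coefficient = {differential_intercept:.1f} "
      f"(difference in means = {mean1 - mean0:.1f})")
```

Only one dummy is used for the two categories because the model contains an intercept. Adding a second dummy for the benchmark group would make the regressors perfectly collinear.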
7) Model Selection: Criteria and Tests
The major points discussed in this chapter can be summarized as follows:
1. The classical linear regression model assumes that the model used in empirical analysis is “correctly specified.”
2. The term correct specification of a model can mean several things, including:
a. No theoretically relevant variable has been excluded from the model.
b. No unnecessary or irrelevant variables are included in the model.
c. The functional form of the model is correct.
d. There are no errors of measurement.
3. If a theoretically relevant variable(s) has been excluded from the model, the coefficients of the variables retained in the model are generally biased as well as inconsistent, and the error variance and the standard errors of the OLS estimators are biased. As a result, the conventional t and F tests are of questionable value.
4. Similar consequences ensue if we use the wrong functional form.
5. The consequences of including irrelevant variable(s) in the model are less serious in that estimated coefficients still remain unbiased and consistent, the error variance and standard errors of the estimators are correctly estimated, and the conventional hypothesis-testing procedure is still valid. The major penalty we pay is that estimated standard errors tend to be relatively large, which means parameters of the model are estimated rather imprecisely. As a result, confidence intervals tend to be somewhat wider.
6. In view of the potential seriousness of specification errors, in this chapter we considered several diagnostic tools to help us find out if we have the specification error problem in any concrete situation. These tools include a graphical examination of the residuals and more formal tests, such as MWD and RESET.
Since the search for a theoretically correct model can be exasperating, in this chapter we considered several practical criteria that we should keep in mind in this search, such as (1) parsimony, (2) identifiability, (3) goodness of fit, (4) theoretical consistency, and (5) predictive power. As Granger notes, “In the ultimate analysis, model building is probably both an art and a science. A sound knowledge of theoretical econometrics and the availability of an efficient computer program are not enough to ensure success.”
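The bias from omitting a relevant variable can be illustrated numerically. In the sketch below a hypothetical "true" model Y = 1 + 2·X2 + 3·X3 is used without an error term so that the arithmetic is exact. Regressing Y on X2 alone gives a slope equal to the true coefficient of X2 plus the coefficient of the omitted X3 times the slope from regressing X3 on X2. All data and names are assumptions for illustration.

```python
def ols(x, y):
    """Return (intercept, slope) of the two-variable OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Hypothetical regressors that are correlated with each other.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0]
TRUE_B2, TRUE_B3 = 2.0, 3.0
y = [1.0 + TRUE_B2 * a + TRUE_B3 * b for a, b in zip(x2, x3)]

# Misspecified regression: X3 is wrongly left out.
_, short_slope = ols(x2, y)

# Slope from regressing the omitted variable on the included one.
_, b32 = ols(x2, x3)
implied_slope = TRUE_B2 + TRUE_B3 * b32

print(f"slope with X3 omitted = {short_slope:.2f}, true B2 = {TRUE_B2:.2f}")
```

The retained coefficient absorbs part of the effect of the omitted variable. The distortion vanishes only if the omitted variable has a zero coefficient or is uncorrelated with the included one in the sample.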
8) Multicollinearity: What Happens If Explanatory Variables Are Correlated?
An important assumption of the classical linear regression model is that there is no exact linear relationship(s), or multicollinearity, among explanatory variables. Although cases of exact multicollinearity are rare in practice, situations of near exact or high multicollinearity occur frequently. In practice, therefore, the term multicollinearity refers to situations where two or more variables can be highly linearly related. The consequences of multicollinearity are as follows. In cases of perfect multicollinearity we cannot estimate the individual regression coefficients or their standard errors. In cases of high multicollinearity individual regression coefficients can be estimated and the OLS estimators retain their BLUE property.
But the standard errors of one or more coefficients tend to be large in relation to their coefficient values, thereby reducing t values. As a result, based on estimated t values, we can say that the coefficient with the low t value is not statistically different from zero. In other words, we cannot assess the marginal or individual contribution of the variable whose t value is low. Recall that in a multiple regression the slope coefficient of an X variable is the partial regression coefficient, which measures the (marginal or individual) effect of that variable on the dependent variable, holding all other X variables constant.
However, if the objective of study is to estimate a group of coefficients fairly accurately, this can be done so long as collinearity is not perfect. In this chapter we considered several methods of detecting multicollinearity, pointing out their pros and cons. We also discussed the various remedies that have been proposed to solve the problem of multicollinearity and noted their strengths and weaknesses. Since multicollinearity is a feature of a given sample, we cannot foretell which method of detecting multicollinearity or which remedial measure will work in any given concrete situation.
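One commonly used diagnostic, not named in this summary and offered here only as an illustrative sketch, is the variance inflation factor (VIF). With two explanatory variables it is 1/(1 − r²), where r is the sample correlation between them, and it shows how much the variance of a slope estimator is inflated relative to the case of uncorrelated regressors. The data are hypothetical.

```python
def squared_correlation(a, b):
    """Return the squared sample correlation coefficient between a and b."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab ** 2 / (s_aa * s_bb)


def vif_two_regressors(x2, x3):
    """VIF for a model with exactly two explanatory variables."""
    r2 = squared_correlation(x2, x3)
    if r2 >= 1.0:
        # Perfect collinearity: individual coefficients cannot be estimated.
        return float("inf")
    return 1.0 / (1.0 - r2)


# Hypothetical regressors.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0]
vif = vif_two_regressors(x2, x3)
vif_perfect = vif_two_regressors(x2, [2.0 * v for v in x2])

print(f"r^2 = {squared_correlation(x2, x3):.3f}, VIF = {vif:.2f}")
```

As the text notes, no single diagnostic is decisive. A high VIF signals a potential problem but does not by itself show that the coefficients of interest are estimated too imprecisely for the purpose at hand.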
9) Heteroscedasticity: What Happens If the Error Variance Is Nonconstant?
A critical assumption of the classical linear regression model is that the disturbances uᵢ all have the same (i.e., homoscedastic) variance. If this assumption is not satisfied, we have heteroscedasticity. Heteroscedasticity does not destroy the unbiasedness property of OLS estimators, but these estimators are no longer efficient. In other words, OLS estimators are no longer BLUE. If heteroscedastic variances σᵢ² are known, then the method of weighted least squares (WLS) provides BLUE estimators. Despite heteroscedasticity, if we continue to use the usual OLS method not only to estimate the parameters (which remain unbiased) but also to establish confidence intervals and test hypotheses, we are likely to draw misleading conclusions, as in the NYSE Example 9.8. This is because estimated standard errors are likely to be biased and therefore the resulting t ratios are likely to be biased, too.
Thus, it is important to find out whether we are faced with the heteroscedasticity problem in a specific application. There are several diagnostic tests of heteroscedasticity, such as plotting the estimated residuals against one or more of the explanatory variables, the Park test, the Glejser test, or the rank correlation test (see Problem 9.13). If one or more diagnostic tests reveal that we have the heteroscedasticity problem, remedial measures are called for. If the true error variance σᵢ² is known, we can use the method of WLS to obtain BLUE estimators. Unfortunately, knowledge about the true error variance is rarely available in practice.
As a result, we are forced to make some plausible assumptions about the nature of heteroscedasticity and to transform our data so that in the transformed model the error term is homoscedastic. We then apply OLS to the transformed data, which amounts to using WLS. Of course, some skill and experience are required to obtain the appropriate transformations. But without such a transformation, the problem of heteroscedasticity is insoluble in practice. However, if the sample size is reasonably large, we can use White’s procedure to obtain heteroscedasticity-corrected standard errors.
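The statement that OLS on suitably transformed data amounts to WLS can be verified in a small sketch. It assumes, purely for illustration, that the error variance is proportional to X². Dividing the model through by X then gives a homoscedastic error, and OLS on the transformed variables reproduces the direct weighted least squares estimates with weights 1/X². The data and function names are hypothetical, and every X value must be nonzero.

```python
def ols(x, y):
    """Return (intercept, slope) of the two-variable OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def wls_by_transformation(x, y):
    """Estimate Y = B1 + B2*X + u assuming var(u_i) is proportional to X_i**2."""
    y_star = [yi / xi for xi, yi in zip(x, y)]
    x_star = [1.0 / xi for xi in x]
    # Transformed model: Y/X = B1*(1/X) + B2 + u/X, so the roles swap:
    # the slope on 1/X is the original intercept, the intercept is the original slope.
    intercept_star, slope_star = ols(x_star, y_star)
    return slope_star, intercept_star  # (b1, b2) of the original model


def wls_direct(x, y):
    """Minimise the weighted residual sum of squares with weights 1/X_i**2."""
    w = [1.0 / xi ** 2 for xi in x]
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b2 = (sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x)))
    return y_w - b2 * x_w, b2


# Hypothetical data whose scatter widens as X grows.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.1, 7.8, 11.9, 13.0, 18.5]
b1_t, b2_t = wls_by_transformation(x, y)
b1_d, b2_d = wls_direct(x, y)
print(f"transformed OLS: b1 = {b1_t:.4f}, b2 = {b2_t:.4f}")
print(f"direct WLS:      b1 = {b1_d:.4f}, b2 = {b2_d:.4f}")
```

A different assumed pattern, such as variance proportional to X, would call for a different transformation (division by the square root of X), which is why the choice of transformation requires judgment.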
10) Autocorrelation: What Happens If Error Terms Are Correlated?
The major points of this chapter are as follows:
1. In the presence of autocorrelation OLS estimators, although unbiased, are not efficient. In short, they are not BLUE.
2. Assuming the Markov first-order autoregressive, the AR(1), scheme, we pointed out that the conventionally computed variances and standard errors of OLS estimators can be seriously biased.
3. As a result, standard t and F tests of significance can be seriously misleading.
4. Therefore, it is important to know whether there is autocorrelation in any given case. We considered three methods of detecting autocorrelation:
a. graphical plotting of the residuals
b. the runs test
c. the Durbin-Watson d test
5. If autocorrelation is found, we suggest that it be corrected by appropriately transforming the model so that in the transformed model there is no autocorrelation. We illustrated the actual mechanics with several examples.
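The Durbin-Watson d statistic listed above is the ratio of the sum of squared differences of successive residuals to the residual sum of squares. The sketch below computes it for two hypothetical residual series, derives the usual approximation ρ̂ ≈ 1 − d/2, and shows one common form of the corrective transformation under the AR(1) scheme, the generalized difference Y_t − ρY_{t−1}. The function names and data are assumptions for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over the RSS."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def generalized_difference(series, rho):
    """Return series[t] - rho*series[t-1]; the first observation is dropped."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


# Hypothetical residuals: long runs of one sign suggest positive autocorrelation,
# regular sign changes suggest negative autocorrelation.
e_positive = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
e_negative = [1.0, -1.0, 1.0, -1.0]

d_positive = durbin_watson(e_positive)  # well below 2
d_negative = durbin_watson(e_negative)  # well above 2
rho_hat = 1.0 - d_positive / 2.0        # rough estimate of the AR(1) coefficient

# The same transformation is applied to Y and to each X before rerunning OLS.
y = [10.0, 12.0, 13.0, 15.0]
y_star = generalized_difference(y, rho_hat)
print(f"d = {d_positive:.3f}, rho_hat = {rho_hat:.3f}")
```

A value of d near 2 is consistent with no first-order autocorrelation. The formal decision requires the tabulated lower and upper bounds for the given sample size and number of regressors.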
11) Simultaneous Equation Models
In contrast to the single equation models discussed in the preceding chapters, in simultaneous equation regression models what is a dependent (endogenous) variable in one equation appears as an explanatory variable in another equation. Thus, there is a feedback relationship between the variables. This feedback creates the simultaneity problem, rendering OLS inappropriate to estimate the parameters of each equation individually. This is because the endogenous variable that appears as an explanatory variable in another equation may be correlated with the stochastic error term of that equation. This violates one of the critical assumptions of OLS that the explanatory variable be either fixed, or nonrandom, or if random, that it be uncorrelated with the error term. Because of this, if we use OLS, the estimates we obtain will be biased as well as inconsistent. Besides the simultaneity problem, a simultaneous equation model may have an identification problem.
An identification problem means we cannot uniquely estimate the values of the parameters of an equation. Therefore, before we estimate a simultaneous equation model, we must find out if an equation in such a model is identified. One cumbersome method of finding out whether an equation is identified is to obtain the reduced form equations of the model. A reduced form equation expresses a dependent (or endogenous) variable solely as a function of exogenous, or predetermined, variables, that is, variables whose values are determined outside the model. If there is a one-to-one correspondence between the reduced form coefficients and the coefficients of the original equation, then the original equation is identified. A shortcut to determining identification is via the order condition of identification. The order condition counts the number of equations in the model and the number of variables in the model (both endogenous and exogenous).
Then, based on whether some variables are excluded from an equation but included in other equations of the model, the order condition decides whether an equation in the model is underidentified, exactly identified, or overidentified. An equation in a model is underidentified if we cannot estimate the values of the parameters of that equation. If we can obtain unique values of parameters of an equation, that equation is said to be exactly identified. If, on the other hand, the estimates of one or more parameters of an equation are not unique in the sense that there is more than one value of some parameters, that equation is said to be overidentified. If an equation is underidentified, it is a dead-end case. There is not much we can do, short of changing the specification of the model (i.e., developing another model).
If an equation is exactly identified, we can estimate it by the method of indirect least squares (ILS). ILS is a two-step procedure. In step 1, we apply OLS to the reduced form equations of the model; in step 2, we retrieve the original structural coefficients from the reduced form coefficients. ILS estimators are consistent; that is, as the sample size increases indefinitely, the estimators converge to their true values. The parameters of the overidentified equation can be estimated by the method of two-stage least squares (2SLS). The basic idea behind 2SLS is to replace the explanatory variable that is correlated with the error term of the equation in which that variable appears by a variable that is not so correlated. Such a variable is called a proxy, or instrumental, variable. 2SLS estimators, like the ILS estimators, are consistent estimators.
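The mechanics of 2SLS can be sketched with one endogenous explanatory variable X and one exogenous variable Z. Stage 1 regresses X on Z and keeps the fitted values; stage 2 regresses Y on those fitted values. The data below are hypothetical and deliberately constructed so that the disturbance e is correlated with X but has zero sample covariance with Z. This lets 2SLS recover the assumed structural coefficients exactly, which would not happen in real data.

```python
def ols(x, y):
    """Return (intercept, slope) of the two-variable OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def two_stage_least_squares(z, x, y):
    """2SLS for y = B1 + B2*x + e with a single instrument z for x."""
    # Stage 1: regress the endogenous regressor on the instrument.
    a1, a2 = ols(z, x)
    x_hat = [a1 + a2 * zi for zi in z]
    # Stage 2: replace x by its fitted values, which are uncorrelated with e.
    return ols(x_hat, y)


# Hypothetical data: assumed structural equation y = 3 + 0.5*x + e.
z = [-2.0, -1.0, 0.0, 1.0, 2.0]
e = [1.0, -2.0, 2.0, -2.0, 1.0]               # sums to zero, zero covariance with z
x = [1.0 + 2.0 * zi + ei for zi, ei in zip(z, e)]  # x depends on e: simultaneity
y = [3.0 + 0.5 * xi + ei for xi, ei in zip(x, e)]

b1_ols, b2_ols = ols(x, y)
b1_2sls, b2_2sls = two_stage_least_squares(z, x, y)
print(f"OLS slope  = {b2_ols:.3f}")
print(f"2SLS slope = {b2_2sls:.3f}")
```

With a single instrument the equation is exactly identified, so the 2SLS slope coincides with the ILS estimate obtained as the ratio of the reduced form slopes of Y and X on Z.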
12) Selected Topics in Single Equation Regression Models
In this chapter we discussed several topics of considerable practical importance. The first topic we discussed was dynamic modeling, in which time or lag explicitly enters into the analysis. In such models the current value of the dependent variable depends upon one or more lagged values of the explanatory variable(s). This dependence can be due to psychological, technological, or institutional reasons. These models are generally known as distributed lag models. Although the inclusion of one or more lagged terms of an explanatory variable does not violate any of the standard CLRM assumptions, the estimation of such models by the usual OLS method is generally not recommended because of the problem of multicollinearity and the fact that every additional coefficient estimated means a loss of degrees of freedom. Therefore, such models are usually estimated by imposing some restrictions on the parameters of the models (e.g., the values of the various lagged coefficients decline from the first coefficient onward).
This is the approach adopted by the Koyck, the adaptive expectations, and the partial, or stock, adjustment models. A unique feature of all these models is that they replace all lagged values of the explanatory variable by a single lagged value of the dependent variable. Because of the presence of the lagged value of the dependent variable among explanatory variables, the resulting model is called an autoregressive model. Although autoregressive models achieve economy in the estimation of distributed lag coefficients, they are not free from statistical problems. In particular, we have to guard against the possibility of autocorrelation in the error term because in the presence of autocorrelation and the lagged dependent variable as an explanatory variable, the OLS estimators are biased as well as inconsistent.
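As an illustration of the restriction that lag coefficients decline from the first one onward, the Koyck scheme assumes geometrically declining weights β_k = β₀λ^k with 0 < λ < 1. The short-run impact is then β₀ and the long-run impact is the sum of all the weights, β₀/(1 − λ). The sketch below uses hypothetical values of β₀ and λ; in the autoregressive form, λ is the coefficient of the lagged dependent variable.

```python
def koyck_lag_weights(beta0, lam, n_lags):
    """Return the first n_lags + 1 distributed lag coefficients beta0 * lam**k."""
    return [beta0 * lam ** k for k in range(n_lags + 1)]


def koyck_multipliers(beta0, lam):
    """Return (short_run, long_run) impacts implied by the Koyck scheme."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return beta0, beta0 / (1.0 - lam)


# Hypothetical estimates: coefficient of X_t and of the lagged dependent variable.
beta0, lam = 0.4, 0.6
short_run, long_run = koyck_multipliers(beta0, lam)
weights = koyck_lag_weights(beta0, lam, n_lags=200)

print(f"short-run impact = {short_run:.2f}, long-run impact = {long_run:.2f}")
print("first four lag weights:", [round(w, 4) for w in weights[:4]])
```

The closer λ is to 1, the more slowly the weights die out and the larger the gap between the short-run and long-run impacts.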
In discussing the dynamic models, we pointed out how they help us to assess the short- and long-run impact of an explanatory variable on the dependent variable. The next topic we discussed related to the phenomenon of spurious, or nonsense, regression. Spurious regression arises when we regress a nonstationary random variable on one or more nonstationary random variables. A time series is said to be (weakly) stationary if its mean, variance, and covariances at various lags are not time dependent. To find out whether a time series is stationary, we can use the unit root test. If the unit root test (or other tests) shows that the time series of interest is stationary, then the regression based on such time series may not be spurious. We also introduced the concept of cointegration. Two or more time series are said to be cointegrated if there is a stable, long-term relationship between them even though individually each may be nonstationary.
If this is the case, regression involving such time series may not be spurious. Next we introduced the random walk model, with or without drift. Several financial time series are found to follow a random walk; that is, they are nonstationary either in their mean value or their variance or both. Variables with these characteristics are said to follow stochastic trends. Stock prices are a prime example of a random walk. It is hard to tell what the price of a stock will be tomorrow just by knowing its price today. The best guess about tomorrow’s price is today’s price plus or minus a random error term (or shock, as it is called). If we could predict tomorrow’s price fairly accurately, we would all be millionaires!
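A random walk is easy to simulate, which helps show why today's value is the best guess of tomorrow's. In the sketch below each value equals the previous value plus an optional drift plus a random shock, and first differencing recovers the drift plus the shock, a stationary series. The seed, starting value, and shock distribution are arbitrary choices for illustration.

```python
import random


def random_walk(shocks, start=0.0, drift=0.0):
    """Return the path Y_t = Y_{t-1} + drift + u_t, beginning at Y_0 = start."""
    path = [start]
    for u in shocks:
        path.append(path[-1] + drift + u)
    return path


def first_differences(series):
    """Return Y_t - Y_{t-1} for t = 1, ..., len(series) - 1."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


# Hypothetical "stock price": 250 trading days of standard normal shocks.
rng = random.Random(42)
shocks = [rng.gauss(0.0, 1.0) for _ in range(250)]
prices = random_walk(shocks, start=100.0)
changes = first_differences(prices)

# A tiny deterministic case with drift, easy to verify by hand.
small_path = random_walk([1.0, -2.0, 0.5], start=10.0, drift=0.5)
print(small_path)  # [10.0, 11.5, 10.0, 11.0]
```

Because each shock is carried forward permanently, the variance of the level grows with time, while the differenced series has a constant mean and variance.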
The next topic we discussed in this chapter was the dummy dependent variable, where the dependent variable can take values of either 1 or 0. Although such models can be estimated by OLS, in which case they are called linear probability models (LPM), this is not the recommended procedure since probabilities estimated from such models can sometimes be negative or greater than 1. Therefore, such models are usually estimated by the logit or probit procedures. In this chapter we illustrated the logit model with concrete examples. Thanks to excellent computer packages, estimation of logit and probit models is no longer a mysterious or forbidding task.
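The difficulty with the LPM noted above can be seen by comparing the two functional forms directly. The sketch below evaluates a linear probability function and a logistic function at several values of X, using hypothetical coefficients rather than estimates from data. The linear form can stray outside the unit interval, whereas the logistic form cannot.

```python
import math


def lpm_probability(b1, b2, x):
    """Linear probability model: P(Y = 1 | x) = b1 + b2*x (unbounded)."""
    return b1 + b2 * x


def logit_probability(b1, b2, x):
    """Logit model: P(Y = 1 | x) = 1 / (1 + exp(-(b1 + b2*x))), always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(b1 + b2 * x)))


# Hypothetical coefficients for each model, chosen only to show the contrast.
x_values = [0.0, 5.0, 10.0, 20.0]
lpm = [lpm_probability(-0.2, 0.1, x) for x in x_values]
logit = [logit_probability(-3.0, 0.5, x) for x in x_values]

for x, p_lin, p_log in zip(x_values, lpm, logit):
    print(f"x = {x:5.1f}   LPM: {p_lin:6.3f}   logit: {p_log:6.3f}")
```

In the logit model the coefficients are linear in the log of the odds, ln[P/(1 − P)], not in the probability itself, so a slope coefficient measures the change in the log-odds per unit change in X.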