Part A. Multiple choice questions
Answer each question by circling one and only one answer. Each question is worth 3 marks (total 30 marks).
1. When estimating a linear probability model using OLS:
a. The estimators are biased because errors are necessarily heteroskedastic
b. The slope coefficient estimates cannot measure changes in the predicted probability of Y=1
c. The estimators can be asymptotically normally distributed
d. All of the above
2. When internal validity is violated:
a. OLS coefficients no longer measure the partial correlation between the explanatory variable and the dependent variable
b. The population error terms cannot be normally distributed
c. The dependent variable necessarily becomes skewed
d. None of the above
3. Which of the following dependent variables is least like a limited dependent variable?
a. Wages
b. Net assets of a household (total assets minus debts)
c. Number of visits to the dentist in a year
d. An index of happiness where happiness is rated 1 to 10
4. A variable Y is a Bernoulli variable
a. Its distribution has the usual 2 independent parameters representing the mean and the variance
b. Its expected value equals the ratio of the probability of Y=0 to the probability of Y=1
c. Its variance equals the product of the probability of Y=0 and the probability of Y=1
d. All of the above
5. In the probit model seen in class
a. The variance of the error term depends on the vector of explanatory variables
b. The variance of the error term is assumed to be 1
c. The variance of the error term does not need to be specified because of the normality assumption
d. The variance of the error term can be estimated from the variance of the estimated residual
6. In panel data, the problem of attrition refers to
a. The presence of large measurement error in key variables
b. The correlation of measurement errors with explanatory variables
c. The misclassification of key dummy explanatory variables due to measurement error
d. None of the above
7. In the probit model
a. The partial effect of a single continuous explanatory variable X on the predicted probability has the same sign as the estimated coefficient on X
b. The test statistic constructed by the ratio of the estimated coefficient to its standard error is normally distributed because we are using the normal distribution to model the expected value of the dependent variable
c. The partial effects of an explanatory variable are quantitatively close to zero when the standard error of the coefficient on this variable is very large.
d. All of the above
8. You have data on a sample of 95 managers working in large firms in Australia. You estimate a logit model of Y= 1 if earning >$500,000 per annum using as explanatory variables: F=1 if the manager is female (0 otherwise); PHD=1 if the manager has a PHD (0 otherwise); an interaction variable FPHD=F*PHD; TEN=tenure with the firm measured in years (a continuous variable). You find the following estimates:
Index_i = 0.053 – 0.095 F_i + 0.020 PHD_i + 0.007 FPHD_i + 0.0015 TEN_i
         (0.002) (0.011)     (0.009)       (0.003)        (0.0005)
where the standard errors are reported in parentheses. You want to test H0: tenure has no effect on the probability of earning >$500,000 per annum versus H1: tenure has a positive effect on the probability of earning >$500,000 per annum. You will use a 5% level of significance to conduct this test. You get an asymptotic t-statistic equal to 3.0. Using the tables provided at the end of the exam, choose one of the following as an appropriate critical value to conduct this test:
9. Refer to the model and estimates in the previous question. Ceteris paribus, according to these estimates (and ignoring statistical significance):
a. Women without PHDs have a higher probability of earning >$500,000 than men without PHDs.
b. Men with PHDs have a lower probability of earning >$500,000 than men without PHDs.
c. Women with PHDs have higher probability of earning >$500,000 than women without PHDs.
d. Women with PHDs have higher probability of earning >$500,000 than men with PHDs.
10. Refer to the model and estimates in the previous question. You want to test that ceteris paribus, men and women have the same probability of earning >$500,000. Under the null, the Wald test statistic is asymptotically chi-squared distributed with
a. 1 degree of freedom
b. 90 degrees of freedom
c. 93 degrees of freedom
d. 2 degrees of freedom
e. 3 degrees of freedom
Part B. Problem (Total 30 marks)
Equity of access is a primary goal of many health systems. Determining whether Australia’s system (Medicare) meets this goal is an important research question. Consider the case of access to general practitioners (GPs). The probit results presented below in Table 4 are part of an analysis aimed at answering whether there is equitable access to GP services where access is defined on the basis of health needs rather than ability to pay. The data consist of a sample of 3207 single females who were surveyed throughout Australia in 1995. The “dependent variable” for the study was VISIT, an indicator variable that was equal to one if the woman had visited a GP in the last two weeks and zero otherwise. The sample has been divided into two subsets depending on whether the women are less than 40 years old (the “young” subsample) or whether they are greater than 40 years old (the “old” subsample). Table 4 presents estimation results (variable definitions follow the table).
Table 4: Probit estimates for visit to GP (table body not recovered)
i. (8 marks) Discuss the effects of PHI on the probability of visiting a GP and compare these effects for the two subsamples of young and old women. Repeat the exercise for the KIDS variable. Do you think that these variables are likely to violate the zero conditional mean assumption? Discuss.
In both subsamples, the estimated coefficient on PHI is positive; ceteris paribus the probability of visiting a GP is higher for those with PHI than without. The size of coefficients may be discussed using the rule of thumb but these must not be confused with partial effects. The effect is statistically significant among the old while the opposite is true for the young. In the young subsample, the coefficient is insignificant at any conventional level (t statistic for testing irrelevance of PHI against the 2-sided alternative is 0.3295 < 1.645) whereas in the old subsample it is significantly different from zero at the 5% significance level (t statistic = 2.319 > 1.96).
The sign is as expected since PHI makes it cheaper to use GP services and women who expect to visit GPs more often are more likely to purchase PHI. The latter implies that ZCM may be violated due to a selection effect.
In the young subsample, the coefficient on KIDS is positive and statistically different from zero at the 10% level (t statistic = 1.874 > 1.645); the probability of GP visit is higher for those with dependent children.
In the old subsample, the sign of the coefficient indicates that the effect is negative, but the coefficient is not statistically different from zero at conventional levels (|t statistic| = 1.276 < 1.645). A priori, the expected sign is ambiguous; women may visit GPs for their children’s medical care as well as their own (positive) but at the same time they may become busier due to child rearing (negative). In the old subsample, the children may be older, so mothers no longer visit GPs for the children’s health. Other reasonable explanations are acceptable. You can argue both ways on the ZCM assumption: for example, you can argue that fertility decisions are exogenous to GP visits. You could also argue that there is an omitted variable bias (KIDS is picking up some unobserved component, e.g. a better measure of health, beyond what is captured by the existing explanatory variables). Also, if the true underlying relationship depends on the number of resident dependent children, KIDS is top-coded at 1, causing the ZCM assumption to fail due to a measurement error correlated with this variable.
Additional material:
You could also earn marks (lost elsewhere in the question) by discussing the size of the effects. For example, the effect among the young seems non-trivial in the sense that the coefficient’s magnitude is slightly over 40% of that of the coefficient on the poor health indicator (HEALTH) while for the old, the variable seems far less economically relevant relative to HEALTH.
ii. (5 marks) If there is equity of access then variables related to income, education and private health insurance should not affect visits to GPs. When the models are re-estimated without these variables (i.e. with only AGE, HEALTH and KIDS included) the log-likelihood values are -937.92 for the young subsample and -898.63 for the old. Using these results, evaluate the null hypothesis of equity of access.
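Statement of the hypotheses:
H0: the coefficients on the income, education and PHI variables are all zero (equity of access); H1: at least one of these coefficients is nonzero.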
LR test statistics:
LLRYOUNG = 2(-935.52+937.92) = 4.8;
LLROLD = 2(-892.24+898.63) = 12.78.
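The arithmetic can be checked with a short Python sketch. It recomputes both likelihood ratio statistics from the log-likelihood values quoted above and, as an addition not required by the question, attaches p-values from the closed-form upper tail of a chi-squared distribution with 5 degrees of freedom (the degrees of freedom used in this solution). The function names are illustrative only.

```python
import math

def lr_statistic(loglik_unrestricted, loglik_restricted):
    """Likelihood ratio statistic: 2 * (lnL_unrestricted - lnL_restricted)."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)

def chi2_sf_5df(x):
    """Upper-tail probability of a chi-squared(5) variable at x >= 0.

    Closed form valid for 5 degrees of freedom only:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * (1 + x/3).
    """
    tail_normal = math.erfc(math.sqrt(x / 2.0))
    tail_rest = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * (1.0 + x / 3.0)
    return tail_normal + tail_rest

# Log-likelihood values as quoted in the question and in this solution.
lr_young = lr_statistic(-935.52, -937.92)
lr_old = lr_statistic(-892.24, -898.63)

p_young = chi2_sf_5df(lr_young)
p_old = chi2_sf_5df(lr_old)

for label, lr, p in [("young", lr_young, p_young), ("old", lr_old, p_old)]:
    print(f"{label}: LR = {lr:.2f}, p-value = {p:.4f}")
```

A p-value above 0.10 for the young and below 0.05 for the old is consistent with a comparison against tabulated chi-squared critical values.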
Distributions of the test statistics and critical values:
They are asymptotically chi-squared distributed with 5 degrees of freedom under the null.
The appropriate 10% and 5% critical values are 9.2364 and 11.0705 respectively.
Decision rules and conclusions:
Since LLRYOUNG < 9.2364, we fail to reject the null at 10% level in the young subsample; there is not enough evidence to conclude that income, education and PHI variables affect young women’s GP visits.
Since LLROLD > 11.0705, we reject the null at the 5% significance level in the old subsample and conclude that there is some evidence against equity of access among the old women.
iii. (4 marks) Consider two types of women: type #1 where AGE = 20, HEALTH = 1, INCOME = 20 and all other variables = 0; type #2 is identical except that AGE = 60. Write down the equation(s) you would use to compare the probability of visiting a GP for these two types of women. Using the probit results can you determine which of these two types of women are more likely to have visited a GP in the last two weeks? If your answer is yes then make the comparison, if your answer is no then explain what information you would need to make the comparison.
One possible answer is to use the index and argue that the ranking by the probabilities will be the same as that provided by the index:
Index for type #1 = -.791 + -.006*20 + .3930 + 0.0003*20 = -.5120 ≈ -.51
Index for type #2 = -1.1570 + .0055*60 + .6131 – 0.0052*20 = -.3179 ≈ -.32
Since -.51 < -.32 and the standard normal CDF increases in the probit index, a type #2 woman is more likely to visit a GP than a type #1 woman.
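The index comparison can be reproduced with a minimal Python sketch. The coefficients are copied from the index calculation above; reading them as constant, AGE, HEALTH and INCOME terms is an inference from the definitions of the two types, and the standard normal CDF is computed from the error function rather than read from the exam table.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Terms as they appear in the index calculation:
# constant + (AGE coefficient * AGE) + HEALTH coefficient + (INCOME coefficient * INCOME).
# Type #1 (AGE = 20) uses the young equation, type #2 (AGE = 60) the old equation.
index_type1 = -0.791 + (-0.006) * 20 + 0.3930 + 0.0003 * 20
index_type2 = -1.1570 + 0.0055 * 60 + 0.6131 - 0.0052 * 20

prob_type1 = std_normal_cdf(index_type1)
prob_type2 = std_normal_cdf(index_type2)

print(f"type 1: index = {index_type1:.4f}, P(VISIT = 1) = {prob_type1:.4f}")
print(f"type 2: index = {index_type2:.4f}, P(VISIT = 1) = {prob_type2:.4f}")
print(f"difference in predicted probabilities = {prob_type2 - prob_type1:.4f}")
```

Because the CDF is evaluated at the unrounded indices, the difference in probabilities can differ slightly from a figure obtained from a table with indices rounded to two decimals.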
Another possible answer is to write down the normal CDF for the two types and argue that the value for type #2 will be greater than that for type #1.
You could also earn marks (lost elsewhere in the question) by calculating the difference in the probabilities using the table on p.10 of the exam paper; i.e. the difference in the predicted probabilities can be evaluated as (.5-.1255)-(.5-.1950) = .0695 ≈ .07 higher for type #2.
iv. (6 marks) In determining the sample to be used for estimation, any individual who did not report their income or reported zero income was deleted from the analysis. Do you see any real or potential problems with this modelling decision? Can you provide an alternative method to deal with this problem?
Likely problems (one of the following or another sensible problem):
- The potential selection bias that arises when the decision to report zero income or to refuse to report income is correlated with the decision to use GP services. For instance, top income groups may be more protective of their income information and at the same time more health conscious, and hence more likely to visit GPs; excluding these individuals would affect all coefficient estimates, as the model would predict a lower probability of a GP visit on average.
- The decrease in the sample size and the resulting increase in standard errors. The incomplete cases may still provide useful information on the effects of other variables on GP visits, and the researcher has discarded this information.
Alternative solutions (one of the following or another sensible solution):
– Use other information to impute the missing information
– Use dummy variables for missing income.
– More sophisticated imputation methods
– Estimate a selection model (this is covered in more detail later in the class but you may know about it from reading or elsewhere)
v. (7 marks) Explain how you would construct and use a hit and miss table to compare the performance of the models for the two subsamples of women (young and old). (You do not have to actually construct a table.)
Step 1. Calculate a predicted probability for each person in the relevant subsample.
Step 2. Obtain a predicted binary outcome for each person using a classification rule: if person i’s predicted probability exceeds a cutoff c, the predicted outcome is 1 and otherwise 0. Setting c to 0.5 or to the sample mean of the dependent variable is acceptable.
Step 3. For each subsample, tabulate frequencies of predicted and actual binary outcomes in the following form:
                 Predicted
                 0      1
Observed   0     A      B
           1     B’     A’
A (A’) = the total number of women whose predicted and observed outcomes are both 0 (1).
B (B’) = the total number of women whose observed outcome is 0 (1) but whose predicted outcome is 1 (0).
Step 4. Now, compare the relative frequencies of correct predictions for each subsample; i.e., compare (A+A’) / (A+B+A’+B’) across the subsamples. This tells us how well one model performs relative to the other in terms of predicting the observed outcomes. It is acceptable to compare the correct predictions among the observed 0’s separately from those among the observed 1’s (i.e. to compare A / (A+B) and A’ / (A’+B’) across subsamples), but this is not needed for full marks.
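Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration, not part of the required answer: the function name `hit_miss_table` is hypothetical, and the predicted probabilities and outcomes are made-up numbers standing in for fitted values from the two probit models.

```python
def hit_miss_table(predicted_probs, observed, cutoff=0.5):
    """Cross-tabulate observed outcomes against predicted outcomes.

    Returns counts keyed by (observed, predicted) and the share of
    correct predictions, (A + A') / (A + B + A' + B').
    """
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for p, y in zip(predicted_probs, observed):
        y_hat = 1 if p > cutoff else 0      # Step 2: classification rule
        counts[(y, y_hat)] += 1             # Step 3: tabulate frequencies
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations supplied")
    correct = counts[(0, 0)] + counts[(1, 1)]   # A + A'
    return counts, correct / total

# Made-up predicted probabilities and outcomes, for illustration only.
probs_young = [0.10, 0.35, 0.62, 0.48, 0.71, 0.20]
visit_young = [0, 0, 1, 1, 1, 0]
probs_old = [0.55, 0.30, 0.80, 0.45, 0.65, 0.15]
visit_old = [0, 0, 1, 1, 0, 1]

table_young, share_young = hit_miss_table(probs_young, visit_young)
table_old, share_old = hit_miss_table(probs_old, visit_old)

# Step 4: compare the shares of correct predictions across subsamples.
print("young:", table_young, f"share correct = {share_young:.3f}")
print("old:  ", table_old, f"share correct = {share_old:.3f}")
```

Passing a different `cutoff`, such as the sample mean of the observed outcomes, changes only the classification rule in Step 2; the tabulation and the comparison are unchanged.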