Significance and standard errors in regression models
The workhorse of empirical economics is the classical linear model
$$ y_{i}= x'_{i}\beta+ u_{i}, \quad i=1,\ldots,n. $$
The coefficient vector β is estimated by ordinary least squares (OLS)
$$ \hat{\beta}= \bigl(X'X\bigr)^{-1}X'y $$
and the covariance matrix by
$$ \hat{V}(\hat{\beta}) = \hat{\sigma}^2 \bigl(X'X\bigr)^{-1}, $$
where X is the design matrix and \(\hat{\sigma}^{2}\) the estimated variance of the disturbances. The influence of a regressor, e.g. \(x_{k}\), on the regressand y is called significant at the 5 percent level if \(t=\hat{\beta}_{k}/\sqrt{\hat{V}(\hat{\beta}_{k})}>t_{0.975}\). In empirical papers this result is often documented by an asterisk and implicitly interpreted as a good one, while insignificance is treated as a negative signal. Ziliak and McCloskey (2008) and Krämer (2011) have criticized this procedure, although in many investigations the analysis is extended by robustness tests. Three types of mistakes can lead to a misleading interpretation:

(1) There does not exist any effect, but due to technical inefficiencies a significant effect is reported.

(2) The effect is small, but due to the precision of the estimates a significant effect is determined.

(3) There exists a strong effect, but due to the variability of the estimates the statistical effect cannot be detected.
The consequence cannot be to abandon the instrument of significance. But what can we do? The following proposals may help to clarify why some standard errors are high and others low, why some influences are significant and others not, and whether alternative procedures can reduce the danger of one of the three mistakes:

- Compute robust standard errors.
- Analyze whether variation within clusters is only small in comparison with variation between the clusters.
- Check whether dummies as regressors with high or low probability are responsible for insignificance.
- Test whether outliers induce large standard errors.
- Consider the problem of partially identified parameters.
- Detect whether collinearity is effective.
- Investigate alternative specifications.
- Use subsamples and compare the results.
- Execute sensitivity analyses (Leamer 1985).
- Employ the sniff test (Hamermesh 2000) in order to detect whether econometric results are in accord with economic plausibility.
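The t-ratio from the beginning of this section can be illustrated numerically. The following sketch (simulated data; all names and numbers are mine, not from the text) computes \(\hat{\beta}\), \(\hat{V}(\hat{\beta})\) and the t-statistics exactly as in the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.5, 0.0])          # third coefficient is truly zero
y = X @ beta + rng.normal(size=n)

# OLS: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - K)          # estimated disturbance variance
V_hat = sigma2 * XtX_inv                  # classical covariance matrix
t = beta_hat / np.sqrt(np.diag(V_hat))    # t-statistics per coefficient
```

With the true second coefficient of 0.5 and n=200, its t-statistic is far above the 5 percent critical value; the point of the section is that such a star, by itself, says little.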
Heteroskedasticity-robust standard errors
OLS estimates are inefficient or biased and inconsistent if assumptions of the classical linear model are violated. We need alternatives that are robust to the violation of specific assumptions. In empirical papers we often find the hint that robust standard errors are displayed. This is imprecise. In most cases it means only heteroskedasticity-robust. This should be mentioned, and also that the estimation is based on White's approach. If we know the type of heteroskedasticity, a transformation of the regression model should be preferred, namely
$$ \frac{y_i}{\sigma_i} = \frac{\beta_0}{\sigma_i}+\beta_1 \frac {x_{1i}}{\sigma_i}+ \cdots+\beta_K\frac{x_{Ki}}{\sigma_i} + \frac {u_i}{\sigma_i}, $$
where i=1,…,n. Typically, the individual variances of the error term are unknown. In the case of unknown and unspecified heteroskedasticity White (1980) recommends the following estimation of the covariance matrix
$$\begin{aligned} \hat{V}_{white}(\hat{\beta}) = \bigl(X'X\bigr)^{-1} \Bigl(\sum\hat{u}_i^2x_ix_i' \Bigr) \bigl(X'X\bigr)^{-1}. \end{aligned}$$
Such estimates are asymptotically heteroskedasticity-robust. In many empirical investigations this robust estimator is routinely applied without testing whether heteroskedasticity exists. We should stress that these estimated standard errors are more biased than conventional estimators if the residuals are homoskedastic. As long as there is not too much heteroskedasticity, robust standard errors are also biased downward. In the literature we find some suggestions to modify this estimator, namely to weight the squared residuals \(\hat{u}_{i}^{2}\):
$$\begin{aligned} hc_1=\frac{n}{n-K}\hat{u}^2_i \end{aligned}$$
$$\begin{aligned} hc_j=\frac{1}{(1-c_{ii})^{\delta_j}}\hat{u}_i^2, \end{aligned}$$
where \(j=2,3,4\), \(c_{ii}\) is the main diagonal element of the hat matrix \(X(X'X)^{-1}X'\) and \(\delta_{2}=1\), \(\delta_{3}=2\), \(\delta_{4}=\min[\gamma_{1},(nc_{ii})/K]+\min[\gamma_{2},(nc_{ii})/K]\); \(\gamma_{1}\) and \(\gamma_{2}\) are real positive constants.
The intention is to obtain more efficient estimates. It can be shown for \(hc_{2}\) that under homoskedasticity the mean of \(\hat{u}_{i}^{2}\) is the same as \(\sigma^{2}(1-c_{ii})\). Therefore, we should expect that the \(hc_{2}\) option leads under homoskedasticity to better estimates in small samples than the simple \(hc_{1}\) option. Then \(E(\hat{u}_{i}^{2}/(1-c_{ii}))\) is \(\sigma^{2}\). The second correction is presented by MacKinnon and White (1985). This is an approximation of a more complicated estimator which is based on a jackknife estimator—see Sect. 2.1.2. Applications demonstrate that the standard error increases, starting with OLS, via \(hc_{1}\) and \(hc_{2}\) to the \(hc_{3}\) option. Simulations, however, do not show a clear preference. As one cannot be sure which case is the correct one, a conservative choice is preferable (Angrist and Pischke 2009, p. 302). The estimator with the largest standard error should be chosen. This means the null hypothesis (\(H_{0}\): no influence on the regressand) is retained longer than with other options.
Cribari-Neto and da Silva (2011) suggest \(\gamma_{1}=1\) and \(\gamma_{2}=1.5\) in \(hc_{4}\). The intention is to weaken the effect of influential observations compared with \(hc_{2}\) and \(hc_{3}\), or in other words to enlarge the standard errors. In an earlier version (Cribari-Neto et al. 2007) a slight modification is presented: \(hc_{4}^{*}=1/(1-c_{ii})^{\delta_{4*}}\), where \(\delta_{4*}=\min(4,nc_{ii}/K)\). It is argued that the presence of high-leverage observations is more decisive for the finite-sample behavior of the consistent estimators of \(V(\hat{\beta})\) than the intensity of heteroskedasticity; \(hc_{4}\) and \(hc_{4*}\) aim at discounting for leverage points—see Sect. 2.1.5—more heavily than \(hc_{2}\) and \(hc_{3}\). The same authors formulate a further estimator
$$\begin{aligned} hc_5=\frac{1}{(1-c_{ii})^{\delta_5}}\hat{u}^2_i, \end{aligned}$$
where \(\delta_{5}=\min(\frac{nc_{ii}}{K},\max(4,\frac{nkc_{ii, \max}}{K}))\) and k is a predefined constant; k=0.7 is suggested. In this case the squared residuals are affected by the maximal leverage.
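A minimal numerical sketch of the White sandwich estimator and the \(hc_{1}\)–\(hc_{3}\) weightings discussed above (my own illustration; the simulated heteroskedasticity pattern is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = (1 + 0.5 * np.abs(X[:, 1])) * rng.normal(size=n)   # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
uhat = y - X @ beta_hat
c = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages c_ii of X(X'X)^{-1}X'

def sandwich(w):
    # (X'X)^{-1} (sum_i w_i x_i x_i') (X'X)^{-1}
    return XtX_inv @ ((X * w[:, None]).T @ X) @ XtX_inv

se = [np.sqrt(sandwich(w)[1, 1]) for w in (
    uhat**2,                      # White (1980)
    n / (n - K) * uhat**2,        # hc_1
    uhat**2 / (1 - c),            # hc_2 (delta = 1)
    uhat**2 / (1 - c)**2,         # hc_3 (delta = 2)
)]
```

By construction the \(hc_{1}\) standard error exceeds the unweighted White version, and \(hc_{3}\) exceeds \(hc_{2}\), in line with the ordering reported in the text.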
Resampling procedures
Other possibilities to determine the standard error are the jackknife and the bootstrap estimator. These are resampling procedures; in the jackknife case subsamples with n−1 observations are constructed by sequentially eliminating one observation. The method compares the estimated coefficients of the total sample, \(\hat{\beta}\), with those obtained after eliminating one observation, \(\hat{\beta}_{i}\). The jackknife estimator of the covariance matrix is
$$\begin{aligned} \hat{V}_{\mathrm{jack}}=\frac{n-K}{n}\sum _{i=1}^n(\hat{\beta}_{i}-\hat{\beta}) ( \hat{\beta}_{i}-\hat{\beta})'. \end{aligned}$$
There exist many ways to bootstrap regression estimates. The basic idea is to treat the sample with n elements as the population and to draw B samples of m elements with replacement, where m≤n, although m>n is also feasible. If \(\hat{\beta}_{\mathrm{boot}}'=(\hat {\beta}(1)_{m}', \ldots,\hat{\beta}(B)_{m}')\) are the bootstrap estimators of the coefficients, the asymptotic covariance matrix is
$$\begin{aligned} \hat{V}_{\mathrm{boot}}=\frac{1}{B}\sum ^B_{b=1}\bigl(\hat{\beta}(b)_m-\hat{ \beta}\bigr) \bigl(\hat{\beta}(b)_m-\hat{\beta}\bigr)', \end{aligned}$$
where \(\hat{\beta}\) is the estimator with the original sample size n. Alternatively, \(\hat{\beta}\) can be substituted by \(\bar{\beta}=1/B\sum \hat{\beta}(b)_{m}\). Bootstrap estimates of the standard error are especially helpful when it is difficult to compute standard errors by conventional methods, e.g. for 2SLS estimators under heteroskedasticity or for cluster-robust standard errors when many small clusters or only short panels exist. The jackknife can be viewed as a linear approximation of the bootstrap estimator. A further popular way to estimate standard errors is the delta method. This approach is especially used for nonlinear functions of parameter estimates \(\hat{\gamma}=g(\hat{\beta})\). An asymptotic approximation of the covariance matrix of a vector of such functions is determined. It can be shown that
$$\begin{aligned} n^{1/2}(\hat{\gamma} - \gamma_0) \sim N\bigl(0, G_0V^{\infty}(\hat{\beta})G_0'\bigr), \end{aligned}$$
where \(\gamma_{0}\) is the vector of the true values of γ, \(G_{0}\) is an l×K matrix with typical element \(\partial g_{i}(\beta)/\partial\beta_{j}\), evaluated at \(\beta_{0}\), and \(V^{\infty}\) is the asymptotic covariance matrix of \(n^{1/2}(\hat{\beta} - \beta_{0})\).
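A pairs-bootstrap version of \(\hat{V}_{\mathrm{boot}}\) can be sketched as follows (illustrative only; resampling the \((y_{i}, x_{i})\) pairs is just one of the many possible bootstrap schemes mentioned above, and m=n is used):

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 120, 500
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

def ols(Xs, ys):
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

beta_full = ols(X, y)                       # estimate on the original sample
draws = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, n, size=n)        # draw n pairs with replacement
    draws[b] = ols(X[idx], y[idx])

dev = draws - beta_full                     # deviations from beta_hat
V_boot = dev.T @ dev / B                    # bootstrap covariance matrix
se_boot = np.sqrt(np.diag(V_boot))
```

The bootstrap standard error of the slope is close to the conventional OLS standard error here, as expected under homoskedasticity.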
The Moulton problem
The variance of a regressor is low if this variable varies strongly between groups but only little within groups (Moulton 1986, 1987, 1990). This is especially the case if industry, regional or macroeconomic variables are introduced in a microeconomic model or panel data are considered. In a more general context this is called the problem of cluster sampling. Individuals or establishments are sampled in groups or clusters. A consequence may be a weighted estimation that adjusts for differences in sampling rates. However, weighting is not always necessary, and estimates may understate the true standard errors. Some empirical investigations note that cluster-robust standard errors are displayed but do not mention the cluster variable. If panel data are used, this is usually the identification variable of the individuals or firms. In many specifications more than one cluster variable, e.g. a regional and an industry variable, is incorporated. Then it is misleading if the cluster variable is not mentioned. Furthermore, a sequential determination of a cluster-robust correction is then not appropriate if there is a dependency between the cluster variables. If we can assume that there is a hierarchy of the cluster variables, a multilevel approach can be applied (Raudenbush and Bryk 2002; Goldstein 2003). Cameron and Miller (2010) suggest a two-way clustering procedure. The covariance matrix can be determined by
$$\begin{aligned} \hat{V}_{\mathrm{two\mbox{-}way}}(\hat{\beta})=\hat{V}_1(\hat{\beta})+ \hat{V}_2(\hat{\beta})-\hat{V}_{1\cap2}(\hat{\beta}) \end{aligned}$$
where the three components are computed by
$$\begin{aligned} &{\hat{V}(\hat{\beta})=\bigl(X'X\bigr)^{-1}\hat{B} \bigl(X'X\bigr)^{-1}} \\ &{\hat{B}=\Biggl(\sum_{g=1}^{G}X'_g \hat{u}_g\hat{u}_g'X_g \Biggr).} \end{aligned}$$
Different ways of clustering can be used. Cluster-robust inference asymptotics are based on G→∞. In many applications there are only a few clusters. In this case \(\hat{u}_{g}\) has to be modified. One way is the following transformation
$$\begin{aligned} \tilde{u}_g=\sqrt{\frac{G}{G-1}}\hat{u}_g. \end{aligned}$$
Further methods and suggestions in the literature are presented by Cameron and Miller (2010) and Wooldridge (2003).
A simple and extreme example shall demonstrate the cluster problem.
Example
Assume a data set with 5 observations (n=5) and 4 variables (V1–V4).
i      V1      V2      V3      V4
1      24     123    −234      −8
2     875      87      54       3
3     −12    1234    −876     345
4     231     −87     −65    9808
5      43      34       9    −765

The linear model
$$\begin{aligned} V1=\beta_1+\beta_2V2+\beta_3V3+ \beta_4V4+u \end{aligned}$$
is estimated by OLS using the original data set (1M). Then the data set is doubled (2M), quadrupled (4M) and octuplicated (8M). The following OLS estimates result.

        \(\hat{\beta}\)   \(\hat{\sigma}_{\hat{\beta}}\) (1M)   \(\hat{\sigma}_{\hat{\beta}}\) (2M)   \(\hat{\sigma}_{\hat{\beta}}\) (4M)   \(\hat{\sigma}_{\hat{\beta}}\) (8M)
V2        1.7239     1.7532     0.7158     0.4383     0.2922
V3        2.7941     2.3874     0.9747     0.5969     0.3979
V4        0.0270     0.0618     0.0252     0.0154     0.0103
const   323.2734   270.5781   110.463     67.64452   45.0963

The coefficients of 1M to 8M are the same; the standard errors, however, decrease if the same data set is multiplied. Namely, the variance is only 1/6, 1/16 and 1/36 of the original variance. The general relationship can be shown as follows. For the original data set (\(X_{1}\)) the covariance matrix is
$$\begin{aligned} \hat{V}_1(\hat{\beta}) = \hat{\sigma}_1^2 \bigl(X_1'X_1\bigr)^{-1}. \end{aligned}$$
Using \(X_{1}=\cdots=X_{F}\), the F times enlarged data set with the design matrix \(X'=:(X_{1}'\cdots X_{F}')\) leads to
$$\begin{aligned} \hat{\sigma}_F^2 = \frac{1}{F\cdot n - K}\sum _{i=1}^{F\cdot n}\hat {u}^2_{i} = \frac{F(n-K)}{F\cdot n - K}\hat{\sigma}^2_1 \end{aligned}$$
and
$$\begin{aligned} \hat{V}_F(\hat{\beta}) =& \hat{\sigma}_F^2 \bigl(X'X\bigr)^{-1} = \hat{\sigma }_F^2 \frac{1}{F}\cdot\bigl(X_1'X_1 \bigr)^{-1} \\ =& \frac{n-K}{F\cdot n - K}\hat {V}_1(\hat{\beta}). \end{aligned}$$
K is the number of regressors including the constant term, n is the number of observations in the original data set (number of clusters), F is the number of observations within a cluster. In the numerical example with F=8, K=4, n=5 the Moulton factor MF that indicates the deflation factor of the variance is
$$\begin{aligned} MF = \frac{n-K}{F\cdot n - K} = \frac{1}{36}. \end{aligned}$$
This is exactly what the numerical example demonstrated. Analogously, the estimated values 1/6 and 1/16 can be determined. As multiplying the data set does not add any information to the original data set, not only the coefficients but also the standard errors should be the same. Therefore, it is necessary to correct the covariance matrix. Statistical packages, e.g. Stata, supply cluster-robust estimates
$$\begin{aligned} \hat{V}(\hat{\beta})_C = \Biggl(\sum^C_{c=1}X_c'X_c \Biggr)^{-1}\sum^C_{c=1}X_c' \hat {u}_c\hat{u}_c'X_c\Biggl(\sum ^C_{c=1}X_c'X_c \Biggr)^{-1}, \end{aligned}$$
where C is the number of clusters. In our specific case this is the number of observations n. This approach implicitly assumes that F is small and n→∞. If this assumption does not hold, a degrees-of-freedom correction
$$\begin{aligned} \mathit{df}_C=\frac{F\cdot n-1}{F\cdot n-K}\cdot\frac{n}{n-1} \end{aligned}$$
is helpful. \(\mathit{df}_{C}\cdot\hat{V}(\hat{\beta})_{C}\) is the default option in Stata and corrects for the number of clusters being finite in practice. Nevertheless, this correction eliminates the underestimation of the standard errors only partially. In other words, the corrected t-statistic of the regressor \(x_{k}\) is larger than \(\hat{\beta}_{k}/\sqrt{\hat{V}_{1k}}\).
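The deflation of the conventional variance under F-fold duplication can be verified numerically; a sketch under the settings of the example (n=5, K=4, F=8; the simulated numbers themselves are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, F = 5, 4, 8
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y1 = rng.normal(size=n)

def classical_var(X, y):
    # conventional OLS covariance matrix: sigma_hat^2 (X'X)^{-1}
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return s2 * XtX_inv

V1 = classical_var(X1, y1)
VF = classical_var(np.tile(X1, (F, 1)), np.tile(y1, F))  # data copied F times
ratio = VF[1, 1] / V1[1, 1]
mf = (n - K) / (F * n - K)    # Moulton factor (n-K)/(F*n-K) = 1/36
```

The variance ratio equals the Moulton factor exactly, whatever data are used, since the derivation above holds term by term.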
Large standard errors of dichotomous regressors with small or large mean
Another problem with estimated standard errors can be induced by Bernoulli distributed regressors. Assume a simple twovariable classical regression model
$$\begin{aligned} y = a + b\cdot D + u. \end{aligned}$$
D is a dummy variable and the variance of \(\hat{b}\) is
$$\begin{aligned} V(\hat{b})=\frac{\sigma^2}{n}\cdot\frac{1}{s_D^2}, \end{aligned}$$
where
$$\begin{aligned} s_D^2 =&\hat{P}(D=1)\cdot\hat{P}(D=0)=: \hat{p}(1- \hat{p})\\ =&\frac {(n|D=1)}{n}\cdot\biggl(1-\frac{(n|D=1)}{n}\biggr). \end{aligned}$$
If \(s_{D}^{2}\) is determined by \(\bar{D}=(n|D=1)/n\), we find that \(s_{D}^{2}=\bar{D}(1-\bar{D})\) is at most 0.25. \(V(\hat{b})\) is minimal at given n and \(\sigma^{2}\) when the sample variance of D reaches its maximum, i.e. if \(\bar{D}=0.5\). This result holds only for inhomogeneous models.
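The dependence of \(V(\hat{b})\) on \(\bar{D}\) can be traced directly from the formula; a short sketch (the values of n and \(\sigma^{2}\) are arbitrary):

```python
import numpy as np

sigma2, n = 1.0, 1000                       # arbitrary illustrative values
pbar = np.arange(0.1, 1.0, 0.1)             # alternative means of the dummy D
s2_D = pbar * (1 - pbar)                    # sample variance of D
var_b = sigma2 / (n * s2_D)                 # V(b_hat) = sigma^2 / (n * s_D^2)
# var_b falls until pbar = 0.5 and rises again afterwards
```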
Example
An income variable (\(Y=Y_{0}/10^{7}\)) with 53,664 observations is regressed on a Bernoulli-distributed random variable RV. The coefficient \(\beta_{1}\) of the linear model \(Y=\beta_{0}+\beta_{1}RV+u\) is estimated by OLS, where alternative values of the mean of RV (\(\overline{RV}\)) are assumed (0.1,0.2,…,0.9).
                                     \(\hat{\beta}_{1}\)   std.err.
\(\overline{RV}=0.1\)                   −0.3727     0.6819
\(\overline{RV}=0.2\)                   −0.5970     0.5100
\(\overline{RV}=0.3\)                   −0.4768     0.4455
\(\overline{RV}=0.4\)                    0.3068     0.4170
\(\overline{RV}=\boldsymbol{0.5}\)       0.1338     0.4094
\(\overline{RV}=0.6\)                    0.0947     0.4187
\(\overline{RV}=0.7\)                   −0.0581     0.4479
\(\overline{RV}=0.8\)                   −0.1860     0.5140
\(\overline{RV}=0.9\)                   −0.1010     0.6827

This example confirms the theoretical result. The standard error is smallest if \(\overline{RV}=0.5\) and increases systematically if the mean of RV decreases or increases. An extension to multiple regression models seems possible—see the applications in the Appendix, Tables 11, 12, 13, 14. The more \(\bar{D}\) deviates from 0.5, i.e. the larger or smaller the mean of D, the higher is the tendency towards insignificant effects. A caveat is necessary. The conclusion that the t-value of a dichotomous regressor \(D_{1}\) is always smaller than that of \(D_{2}\) when \(V(D_{1})>V(D_{2})\) is not unavoidable. The basic effect of \(D_{1}\) on y may be larger than that of \(D_{2}\) on y. The theoretical result aims at specific variables and not at the comparison between regressors. In practice, significance is determined by \(t=\hat{b}/\sqrt{\hat{V}(\hat {b})}\). However, we do not find a systematic influence of \(\hat{b}\) on t if \(\bar{D}\) varies. Nevertheless, the random differences in the influence of D on y can dominate the \(\bar{D}\) effect via \(s_{D}^{2}\). The comparison of Table 13 with Table 14 shows that the influence of a works council (WOCO) is stronger than that of a company-level pact (CLP). The coefficients of the former regressor are larger and the standard errors are lower than those of the latter regressor, so that the t-values are larger. In both cases the standard errors increase if the mean of the regressor is reduced. The comparison of line 1 in Table 13 with line 9 in Table 14, where the means of CLP and WOCO are nearly the same, makes clear that the stronger basic effect of WOCO on lnY dominates the mean reduction effect of WOCO. The t-value in line 9 of Table 14 is smaller than that in line 1 of Table 14 but still larger than that in line 1 of Table 13. Not all deviations of the mean of a dummy regressor D from 0.5 induce the described standard error effects. A random variation of \(\bar{D}\) is necessary. An example where this is not the case is matching—see Sect. 2.2 and the application in Sect. 3. There \(\bar{D}\) increases due to the systematic elimination of those observations with D=0 that are dissimilar to those with D=1 in other characteristics.
Outliers and influential observations
Outliers may have strong effects on the estimates of the coefficients and of the dependent variable, on standard errors, and therefore on significance. In the literature we find some suggestions to measure outliers that are due to large or small values of the dependent or of the independent variables. Belsley et al. (1980) use the main diagonal elements \(c_{ii}\) of the hat matrix \(C=X(X'X)^{-1}X'\) to determine the effects of a single observation on the coefficient estimator \(\hat{\beta}\), on the estimated endogenous variable \(\hat{y}_{i}\) and on the variance \(\hat{V}(\hat{y})\). The higher \(c_{ii}\), the higher is the difference between the estimated dependent variable with and without the ith observation. A rule of thumb orients on the relation
$$\begin{aligned} c_{ii}>\frac{2K}{n}. \end{aligned}$$
An observation i is called an influential observation with a strong leverage if this inequality is fulfilled. The effects of the ith observation on \(\hat{\beta}\), \(\hat{y}\) and \(\hat{V}(\hat{\beta})\) and the rules of thumb can be expressed by
$$\begin{aligned} \bigl|\hat{\beta}_{k}-\hat{\beta}_k(i)\bigr|>\frac{2}{\sqrt{n}} \end{aligned}$$
$$\begin{aligned} \biggl|\frac{\hat{y}_i-\hat{y}_{i(i)}}{s(i)\sqrt{c_{ii}}}\biggr|>2\sqrt{\frac{K}{n}} \end{aligned}$$
$$\begin{aligned} \biggl|\frac{\operatorname{det}(s^2(i)(X'(i)X(i))^{-1})}{\operatorname{det}(s^2(X'X)^{-1})} \biggr| > \frac{3K}{n}. \end{aligned}$$
If the inequalities are fulfilled, this indicates a strong influence of observation i where (i) means that observation i is not considered in the estimates. The determination of an outlier is based on externally studentized residuals
$$\begin{aligned} \hat{u}^*_i=\frac{\hat{u}_i}{s(i)\sqrt{1-c_{ii}}} \sim t_{n-K-1}. \end{aligned}$$
Observations which fulfill the inequality \(|\hat{u}^{*}_{i}|>t_{1-\alpha/2;n-K-1}\) are called outliers. Alternatively, a mean shift outlier model can be formulated
$$\begin{aligned} y = X\beta+ A_j\delta+ \epsilon, \end{aligned}$$
where
$$\begin{aligned} A_{j} = \left \{ \begin{array}{l@{\quad }l} 1 &\mbox{if}\ i=j\\ 0 &\mbox{otherwise}. \end{array} \right . \end{aligned}$$
Observation j has a statistical effect on y if δ is significantly different from zero. The estimated t-value is the same as \(\hat{u}^{*}_{j}\). This procedure does not separate whether the outlier j is due to unusual y- or unusual x-values.
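The stated equivalence between the externally studentized residual and the t-value of the mean shift dummy can be checked numerically (my sketch; the planted outlier and all numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, j = 50, 2, 10
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[j] += 8.0                      # plant an outlier at observation j
X = np.column_stack([np.ones(n), x])

def ols_fit(Xs, ys):
    beta = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    resid = ys - Xs @ beta
    s2 = resid @ resid / (len(ys) - Xs.shape[1])
    return beta, s2

# externally studentized residual: s(j) comes from the fit without observation j
beta, _ = ols_fit(X, y)
uhat = y - X @ beta
c = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # leverages c_ii
_, s2_j = ols_fit(np.delete(X, j, 0), np.delete(y, j, 0))
u_star = uhat[j] / np.sqrt(s2_j * (1 - c[j]))

# mean shift outlier model: add the dummy A_j and take the t-value of delta
A = np.zeros((n, 1)); A[j] = 1.0
Xa = np.hstack([X, A])
beta_a, s2_a = ols_fit(Xa, y)
V_a = s2_a * np.linalg.inv(Xa.T @ Xa)
t_delta = beta_a[2] / np.sqrt(V_a[2, 2])
# u_star and t_delta coincide up to floating point
```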
Hadi (1992) proposes an outlier detection with respect to all regressors. The decision whether the design matrix X contains outliers is based on an elliptical distance
$$\begin{aligned} d_i(c,W) = \sqrt{(x_ic)'W(x_ic)}, \end{aligned}$$
where intuitively the classical choices of c and W are the arithmetic mean (\(\bar{x}\)) and the inverse of the sample covariance matrix (\(S^{-1}\)), respectively, so that the Mahalanobis distance follows. If
$$\begin{aligned} d_i\bigl(\bar{x} ,S^{-1}\bigr)^2 > \chi^2_K, \end{aligned}$$
observation i is identified as an outlier. As \(\bar{x}\) and S react sensitively to outliers, it is necessary to estimate an outlier-free mean and sample covariance matrix. For this purpose, only outlier-free observations are considered to determine \(\bar{x}\) and S. Another way to avoid the sensitivity problem is to use more robust estimators of the location and covariance matrix; e.g. the median, but not the mean, is robust to outliers. Finally, an outlier vector MOD (multiple outlier dummy) instead of A is incorporated in the model in order to test whether the identified outlier observations have a significant effect on the dependent variable. A second problem is whether we should eliminate all outliers, only some of them, or none. The situation is obvious if an outlier is induced by measurement errors. Then we should eliminate this observation if we have no information to correct the error. Typically, however, we cannot be sure that an anomalous value is due to measurement errors. Insofar the correct estimation lies between the two extremes: all outliers are considered or all outliers are eliminated. A solution is presented in the next subsection.
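A reduced sketch of the distance-based screening with the classical choices \(c=\bar{x}\) and \(W=S^{-1}\) (the robust re-estimation step discussed above is omitted; the critical value 5.991 is the familiar \(\chi^{2}\) quantile for K=2 at the 5 percent level):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 2
X = rng.normal(size=(n, K))
X[0] = [8.0, -8.0]                       # plant a multivariate outlier

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - xbar
d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # squared Mahalanobis distances

crit = 5.991                             # chi^2 critical value, K = 2, alpha = 0.05
outliers = np.where(d2 > crit)[0]        # candidates for the MOD dummy vector
```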
Partially identified parameters
Assume that some observations are unknown or not exactly measured. The consequence is that a parameter cannot be exactly determined but only within a range. The outlier situation leads to such a partial identification problem. There exist many other similar constellations.
Example
The share of unemployed persons is 8 % but 5 % have not answered the question on the employment status. Therefore, the unemployment rate can only be calculated within certain limits, namely between the two extremes that none or all of the nonrespondents are unemployed. In the first case the unemployment rate is 7.6 % and in the second case 12.6 %.
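The arithmetic of the two extremes (none vs. all of the nonrespondents unemployed) in a short sketch:

```python
# 8 % unemployed among the 95 % who answered, 5 % nonresponse
answered = 0.95
rate_among_answered = 0.08

lower = rate_among_answered * answered          # all nonrespondents employed
upper = rate_among_answered * answered + 0.05   # all nonrespondents unemployed
# lower bound 7.6 %, upper bound 12.6 %
```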
The main methodological focus of partially identified parameters is the search for the best statistical inference. Chernozhukov et al. (2007), Imbens and Manski (2004), Romano and Shaikh (2010), Stoye (2009) and Woutersen (2009) have discussed solutions.
If \(\varTheta_{0}=[\theta_{l},\theta_{u}]\) describes the lower and the upper bound based on the two extreme situations, Stoye (2009) develops the following confidence interval
$$\begin{aligned} CI_{\alpha}=\biggl[\hat{\theta}_l-\frac{{c}_{\alpha}\hat{\sigma}_l}{\sqrt{n}}, \hat{ \theta}_u+\frac{{c}_{\alpha}\hat{\sigma}_u}{\sqrt{n}}\biggr], \end{aligned}$$
where \(\hat{\sigma}_{l}\) (\(\hat{\sigma}_{u}\)) is the standard error of the estimation function \(\hat{\theta}_{l}\) (\(\hat{\theta}_{u}\)). \(c_{\alpha}\) is chosen by
$$\begin{aligned} \varPhi\biggl({c}_{\alpha}+\frac{\sqrt{n}\hat{\Delta}}{ \hat{\sigma}_l}\biggr)-\varPhi(-{c}_{\alpha})=1- \alpha, \end{aligned}$$
where \(\Delta=\theta_{u}-\theta_{l}\). As Δ is unknown, it has to be estimated (\(\hat{\Delta}\)).
Treatment evaluation
The objective of treatment evaluation is the determination of causal effects of economic measures. The simplest form to measure the effect is to estimate α in the linear model
$$\begin{aligned} y=X\beta+ \alpha D + u, \end{aligned}$$
where D is the intervention variable, measured by a dummy: 1 if an individual or an establishment is assigned to treatment; 0 otherwise. Typically, this is not the causal effect. An important reason for this failure is unobserved variables that influence y and D, so that D and u correlate.
In the last 20 years a wide range of methods has been developed to determine the “correct” causal effect. Which approach should be preferred depends on the data, the behavior of the economic agents and the assumptions of the model. The major difficulty is that we have to compare an observed situation with an unobserved one. Depending on the available information the latter is estimated. We have to ask what would have occurred if not D=1 but D=0 had taken place (treatment on the treated). This counterfactual is unknown and has to be estimated. Inversely, if D=0 is observable, we can search for the potential result under D=1 (treatment on the untreated). A further problem is the fixing of the control group. What is the meaning of “otherwise” in the definition of D? Or in other words: What is the causal effect of an unobserved situation? Should we determine the average causal effect or only that of a subgroup?
Neither a before-after comparison \((\bar{y}_{1}|D=1)-(\bar{y}_{0}|D=1)\) nor a cross-section comparison of \((\bar{y}_{t}|D=1)\) and \((\bar{y}_{t}|D=0)\) is usually appropriate. Difference-in-differences estimators (DiD), a combination of these two methods, are very popular in applications
$$\begin{aligned} \bar{\Delta}_1-\bar{\Delta}_0 =& \bigl[( \bar{y}_1|D=1)-(\bar{y}_1|D=0)\bigr]\\ &{} - \bigl[( \bar{y}_0|D=1)-(\bar{y}_0|D=0)\bigr]. \end{aligned}$$
The effect can be determined in the following unconditional model
$$\begin{aligned} y = a_1 + b_1T + b_2D + b_3TD + u, \end{aligned}$$
where T=1 denotes a period after the measure (D=1) takes place and T=0 a period before. In this approach \(\hat{b}_{3}=\bar{\Delta}_{1}-\bar{\Delta}_{0}\) is the causal effect. The equation can be extended by further regressors X. This is called a conditional DiD estimator. Nearly all DiD investigations neglect a potential bias in standard error estimates induced by serial correlation. A further problem results under endogenous intervention variables. Then an instrumental variables estimator should be employed to avoid the endogeneity bias. This procedure will be considered in the quantile regression analysis. If the dependent variable is a dummy, a nonlinear estimator has to be applied. Suggestions are presented by Ai and Norton (2003) and Puhani (2012).
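The unconditional DiD regression can be sketched with simulated data; in the saturated model \(\hat{b}_{3}\) reproduces the difference-in-differences of the four group means exactly (all numbers are mine):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
D = rng.integers(0, 2, size=n)          # treatment group indicator
T = rng.integers(0, 2, size=n)          # post-treatment period indicator
y = 1.0 + 0.5 * T + 0.3 * D + 2.0 * T * D + rng.normal(size=n)

X = np.column_stack([np.ones(n), T, D, T * D])
b = np.linalg.lstsq(X, y, rcond=None)[0]    # b[3] is the DiD estimate

# the same number from the four group means
did = ((y[(T == 1) & (D == 1)].mean() - y[(T == 1) & (D == 0)].mean())
       - (y[(T == 0) & (D == 1)].mean() - y[(T == 0) & (D == 0)].mean()))
```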
Matching procedures were developed with the objective to find a control group that is very similar to the treatment group. Parametric and nonparametric procedures can be employed to determine the control group. Kernel, inverse probability, radius matching, local linear regression, spline smoothing or trimming estimators are possible. Mahalanobis metric matching with or without propensity scores and nearest neighbor matching with or without caliper are typical procedures—see e.g. Guo and Fraser (2010). The Mahalanobis distance is defined by
$$\begin{aligned} (u-v)'S^{-1}(u-v), \end{aligned}$$
where u (v) is a vector that incorporates the values of matching variables of participants (nonparticipants) and S is the empirical covariance matrix from the full set of nontreated participants.
An observed or artificial statistical twin can be determined for each participant. The probability of each nonparticipant to participate in the measure is calculated on the basis of probit estimates (propensity score). The statistical twin j of a participant i is the nonparticipant whose propensity score (\(ps_{j}\)) is nearest to that of the participant. The absolute distance between i and j may not exceed a given value ϵ
$$\begin{aligned} |ps_i-ps_j| < \epsilon, \end{aligned}$$
where ϵ is a predetermined tolerance (caliper). A quarter of a standard deviation of the sample estimated propensity scores is suggested as the caliper size (Rosenbaum and Rubin 1985). If the control group is identified, the causal effect can be estimated using the reduced sample (treatment observations and matched observations). In applications, α from the model y=Xβ+αD+u or \(b_{3}\) from the DiD approach is determined as the causal effect. Both estimators implicitly assume that the causal effect is the same for all subgroups of individuals or firms and that no unobserved variables exist that are correlated with observed variables. Insofar matching procedures suffer from the same problem as OLS estimators.
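Nearest-neighbor propensity score matching with the quarter-standard-deviation caliper, in a deliberately reduced sketch (the scores are given numbers here; in practice they would be probit estimates):

```python
import numpy as np

# illustrative propensity scores (probit-estimated in a real application)
ps_treated = np.array([0.32, 0.57, 0.80])
ps_control = np.array([0.10, 0.30, 0.55, 0.61, 0.95])
caliper = 0.25 * np.std(np.concatenate([ps_treated, ps_control]))

matches = {}
for i, p in enumerate(ps_treated):
    j = int(np.argmin(np.abs(ps_control - p)))   # nearest neighbor
    if abs(ps_control[j] - p) < caliper:         # enforce |ps_i - ps_j| < eps
        matches[i] = j
# the third treated unit (score 0.80) finds no control within the caliper
```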
If the interest is to detect whether and by which amount the effects of intervention variables differ between the percentiles of the distribution of the objective variable y, a quantile regression analysis is an appropriate instrument. The objective is to determine quantile treatment effects (QTE). The distribution effect of a measure can be estimated by the difference Δ of the dependent variable with (\(y^{1}\)) and without (\(y^{0}\)) treatment (D=1; D=0), separately for specific quantiles \(Q^{\tau}\), where 0<τ<1
$$\begin{aligned} \Delta^{\tau} = Q_{y^1}^{\tau}-Q_{y^0}^{\tau}. \end{aligned}$$
The empirical distribution function of the observed situation and that of the counterfactual are identified. From the view of modeling, four major cases are developed in the literature that differ in their assumptions. The measure is assumed exogenous or endogenous, and the effect on y is unconditional or conditional, analogously to DiD.

             Unconditional                   Conditional
Exogenous    (1) Firpo (2007)                (2) Koenker and Bassett (1978)
Endogenous   (3) Frölich and Melly (2012)    (4) Abadie et al. (2002)

In case (1) the quantile treatment effect \(Q_{y^{1}}^{\tau}-Q_{y^{0}}^{\tau}\) is estimated by
$$\begin{aligned} Q_{y^j}^{\tau}=\arg \min_{\alpha_0;\alpha_1}E\bigl[ \rho_{\tau}(y-q_j) (W|D=j)\bigr], \end{aligned}$$
where j=0;1, \(q_{j}=\alpha_{0}+\alpha_{1}(D|D=j)\), and \(\rho_{\tau}(a)=a(\tau-1(a\le0))\) is the check function; a is a real number. The weights are
$$\begin{aligned} W=\frac{D}{p(X)} + \frac{1 - D}{1 - p(X)}. \end{aligned}$$
The estimation is characterized by two stages. First, the propensity score is determined by a large number of regressors X via a nonparametric method—\(\hat{p}(X)\). Second, in \(Q_{y^{j}}^{\tau}\) the probability p(X) is substituted by \(\hat{p}(X)\).
Case (2) follows Koenker and Bassett (1978).
$$\begin{aligned} &\sum _{(i|y_i\ge x_i'\beta)=1}^{n_1} \tau\cdot \bigl|y_i- \alpha(D_i|D_i=j)-x_i'\beta\bigr|\\ &\quad {}+ \sum _{(i|y_i<x_i'\beta)=n_1+1}^{n} (1-\tau)\cdot \bigl|y_i- \alpha(D_i|D_i=j)-x_i'\beta\bigr| \end{aligned}$$
has to be minimized with respect to α and β, where τ is given. In other words,
$$\begin{aligned} Q_{y^j}^{\tau}=\arg \min_{\alpha;\beta}E\bigl[ \rho_{\tau}(y-q_j) (W|D=j)\bigr], \end{aligned}$$
where j=0;1 and \(q_{j}=\alpha(D|D=j)+x'\beta\).
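That the check function \(\rho_{\tau}\) delivers quantiles can be verified directly: minimizing the sample analogue of \(E\,\rho_{\tau}(y-q)\) over a grid recovers the empirical τ-quantile (my sketch; the simulated data are an assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(size=2001)
tau = 0.75

def rho(a, tau):
    # check function rho_tau(a) = a * (tau - 1(a <= 0))
    return a * (tau - (a <= 0))

grid = np.linspace(-3, 3, 601)
losses = [rho(y - q, tau).sum() for q in grid]
q_hat = grid[int(np.argmin(losses))]
# q_hat agrees with the empirical 75% quantile up to the grid resolution
```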
The method of case (3) is developed by Frölich and Melly (2012). Due to the endogeneity of the intervention variable D, an instrumental variables estimator is used with only one instrument Z, which is a dummy. The quantiles follow from
$$\begin{aligned} Q_{y^j|c}^{\tau} = \arg \min_{\alpha_0;\alpha_1} E\bigl[ \rho_{\tau }(y-q_j)\cdot(W|D=j)\bigr], \end{aligned}$$
where j=0;1, \(q_{j}=\alpha_{0}+\alpha_{1}(D|D=j)\), and c means complier. The weights are
$$\begin{aligned} W = \frac{Z-p(X)}{p(X)(1-p(X))}(2D-1). \end{aligned}$$
Abadie et al. (2002) investigate case (4) and suggest a weighted linear quantile regression. The estimator is
$$\begin{aligned} Q_{y^j}^{\tau}=\arg \min_{\alpha,\beta}E\bigl[ \rho_{\tau}\bigl(y-\alpha D- x'\beta \bigr) (W|D=j)\bigr], \end{aligned}$$
where the weights are
$$\begin{aligned} W = 1 - \frac{D(1-Z)}{1-p(Z=1|X)}-\frac{(1-D)Z}{p(Z=1|X)}. \end{aligned}$$
Regression discontinuity (RD) design allows one to determine treatment effects in a special situation. This approach uses information on institutional and legal regulations that are responsible for changes in the effects of economic measures. Thresholds are estimated indicating a discontinuity of the effects. Two forms are distinguished: sharp and fuzzy RD. Either the change of the status is exactly effective at a fixed point, or it is assumed that the probability of a treatment change or the mean of a treatment change is discontinuous.
In the case of sharp RD, individuals or establishments (i=1,…,n) are assigned to the treatment or the control group on the basis of the observed variable S. The latter is a continuous or an ordered categorical variable with many values. If the variable \(S_{i}\) is not smaller than a fixed bound \(\bar{S}\), then i belongs to the treatment group (D=1)
$$\begin{aligned} D_i = 1[S_i\ge\bar{S}]. \end{aligned}$$
The following graph, based on artificial data with n=40, demonstrates the design. Assume we know that an institutional rule changes the conditions if \(S>\bar{S}=2.5\), and we want to determine the causal effect induced by the adoption of the new rule. This effect can be measured by the difference of the two estimated regressions at \(\bar{S}\).
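The design can also be reproduced numerically. The following sketch (Python with numpy; the parameter values and the noise level are illustrative assumptions mimicking the artificial-data setting with n=40 and \(\bar{S}=2.5\)) fits one regression line on each side of the threshold and takes their difference at \(\bar{S}\):

```python
import numpy as np

rng = np.random.default_rng(42)
n, S_bar = 40, 2.5
S = rng.uniform(0, 5, n)
D = (S >= S_bar).astype(float)

# Assumed true model: intercept 1, jump (treatment effect) 2, slope 0.5.
y = 1.0 + 2.0 * D + 0.5 * S + rng.normal(scale=0.3, size=n)

def fit_line(s, v):
    """OLS fit of v = a + b*s, returning (a, b)."""
    X = np.column_stack([np.ones_like(s), s])
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return coef

a0, b0 = fit_line(S[D == 0], y[D == 0])
a1, b1 = fit_line(S[D == 1], y[D == 1])

# Difference of the two fitted regressions at the threshold.
jump = (a1 + b1 * S_bar) - (a0 + b0 * S_bar)
print(jump)   # roughly recovers the assumed jump of 2
```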
In a simple regression model \(y=\beta_{0}+\beta_{1}D+u\) the OLS estimator of \(\beta_{1}\) would be inconsistent when D and u correlate. If, however, the conditional mean \(E(u|S,D)=E(u|S)=f(S)\) is additionally incorporated in the outcome equation (\(y=\beta_{0}+\beta_{1}D+f(S)+\epsilon\), where \(\epsilon=y-E(y|S,D)\)), the OLS estimator of \(\beta_{1}\) is consistent. Assuming \(f(S)=\beta_{2}S\), the estimator of \(\beta_{1}\) corresponds to the difference of the two estimated intercepts of the parallel regressions
$$\begin{aligned} \hat{y}_0 =& \hat{E}(y|D=0)=\hat{\beta}_0+\hat{ \beta}_2S \\ \hat{y}_1 =& \hat{E}(y|D=1)=\hat{\beta}_0+\hat{ \beta}_1+\hat{\beta}_2S. \end{aligned}$$
The sharp RD approach identifies the causal effect by distinguishing between the nonlinear, discontinuous function and the smooth linear function. If, however, a nonlinear function of the general type f(S) is given, modifications have to be considered.
Assume the true function f(S) is a polynomial of order p
$$\begin{aligned} y_i = \beta_0+\beta_1D_i+ \beta_{21}S_i+\beta_{22}S_i^2+ \cdots+\beta _{2p}S_i^p+u_i \end{aligned}$$
but two linear models are estimated. Then the difference between the two intercepts, interpreted as the causal effect, is biased. What looks like a jump is in reality a neglected nonlinear effect.
Another strategy is to determine the treatment effect exactly at the fixed discontinuity point \(\bar{S}\) assuming a local linear regression. Two linear regressions are considered
$$\begin{aligned} y_0 - E(y_0|S=\bar{S}) =& \delta_0(S-\bar{S}) + u_0 \\ y_1 - E(y_1|S=\bar{S}) =& \delta_1(S-\bar{S}) + u_1, \end{aligned}$$
where \(y_{j}=E(y|D=j)\) and j=0;1. In combination with
$$\begin{aligned} y = (1D)y_0+Dy_1 \end{aligned}$$
it follows that
$$\begin{aligned} y =& (1-D) \bigl(E(y_0|S=\bar{S}) + \delta_0(S-\bar{S}) + u_0\bigr) \\ &{}+ D\bigl(E(y_1|S=\bar{S}) + \delta_1(S-\bar{S}) + u_1\bigr). \end{aligned}$$
The linear regression
$$\begin{aligned} y = \gamma_0 + \gamma_1D + \gamma_2(S-\bar{S}) + \gamma_3D(S-\bar{S}) + \tilde{u} \end{aligned}$$
can be estimated, where \(\tilde{u}=u_{0}+D(u_{1}-u_{0})\). This looks like the DiD estimator, but now \(\gamma_{1}=E(y_{1}|S=\bar{S})-E(y_{0}|S=\bar{S})\), and not \(\gamma_{3}\), is of interest. The estimated coefficient \(\hat{\gamma}_{1}\) is a global, not a localized, average treatment effect.
The localized average follows if only a small interval around \(\bar{S}\) is modeled, i.e. \(\bar{S}-\Delta S <S_{i}<\bar{S} + \Delta S\). The treatment effect corresponds to the difference of the two previously determined intercepts, restricted to \(\bar{S}<S_{i}<\bar{S}+\Delta S\) on the one hand and to \(\bar{S}-\Delta S <S_{i}<\bar{S}\) on the other.
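The localized version can be sketched as follows (Python with numpy; the data-generating process, window width, and parameter values are illustrative assumptions). Observations are restricted to a small window around \(\bar{S}\), and the interacted linear model \(y = \gamma_0 + \gamma_1 D + \gamma_2(S-\bar{S}) + \gamma_3 D(S-\bar{S}) + u\) is estimated there, so \(\hat{\gamma}_1\) picks up the jump at the threshold:

```python
import numpy as np

rng = np.random.default_rng(7)
n, S_bar, dS = 20_000, 2.5, 0.5
S = rng.uniform(0, 5, n)
D = (S >= S_bar).astype(float)

# Assumed DGP: jump of 2 at S_bar plus a smooth nonlinear trend in S.
y = 1.0 + 2.0 * D + np.sin(S) + rng.normal(scale=0.3, size=n)

# Keep only observations in a small window around the threshold ...
keep = np.abs(S - S_bar) < dS
s, d, v = S[keep] - S_bar, D[keep], y[keep]

# ... and estimate y = g0 + g1*D + g2*(S - S_bar) + g3*D*(S - S_bar) + u.
X = np.column_stack([np.ones_like(s), d, s, d * s])
coef, *_ = np.linalg.lstsq(X, v, rcond=None)
print(coef[1])   # localized treatment effect, close to the assumed jump of 2
```

Inside the window the smooth trend is locally well approximated by the two linear pieces, so the curvature bias that plagues the global linear fit largely disappears.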
A combination of the latter linear RD model with the DiD approach leads to an extended interaction model. Again, two linear regressions are considered
$$\begin{aligned} y_0 =& \gamma_{00} + \gamma_{10}D + \gamma_{20}(S-\bar{S}) + \gamma _{30}D(S-\bar{S}) + \tilde{u}_0 \\ y_1 =& \gamma_{01} + \gamma_{11}D + \gamma_{21}(S-\bar{S}) + \gamma _{31}D(S-\bar{S}) + \tilde{u}_1, \end{aligned}$$
where the first index of \(\gamma_{jt}\) with j=0;1 refers to the treatment and the second index with t=0;1 refers to the period. In contrast to the pure RD model, where \(y_{j}\) with j=0;1 is considered, the index of y is now a time index, i.e. \(y_{T}\) with T=0;1. Using
$$\begin{aligned} y = (1T)y_0+Ty_1 \end{aligned}$$
it follows that
$$\begin{aligned} y =& \gamma_{00} + \gamma_{10}D + \gamma_{20}(S- \bar{S}) + \gamma_{30}D(S-\bar{S}) \\ &{}+(\gamma_{01}-\gamma_{00})T + (\gamma _{11}- \gamma_{10})DT\\ &{} + (\gamma_{21}-\gamma_{20}) (S-\bar{S})T \\ &{}+ (\gamma_{31}-\gamma _{30})D(S-\bar{S})T + \bigl( \tilde{u}_0 + (\tilde{u}_1-\tilde{u}_0)T\bigr) \\ =:& \beta_0 + \beta_1T + \beta_2D + \beta_3(S-\bar{S}) + \beta_4D(S-\bar{S}) \\ &{}+ \beta_5DT + \beta_6(S-\bar{S})T + \beta_7D(S-\bar{S})T + \tilde{u}. \end{aligned}$$
Now, it is possible to determine whether the treatment effect varies between T=1 and T=0. The difference follows by a DiD approach
$$\begin{aligned} &\bigl[(y_1|D=1)-(y_1|D=0)\bigr] - \bigl[(y_0|D=1)-(y_0|D=0) \bigr] \\ &\quad {}= (\gamma _{11}-\gamma_{10})+(\gamma_{31}- \gamma_{30}) (S-\bar{S}) \\ &\quad {}= \beta _5+\beta_7(S-\bar{S}) \end{aligned}$$
under the assumption that the disturbance term does not change between the periods. The hypothesis of a time-invariant break cannot be rejected if DT and \(D(S-\bar{S})T\) have no statistical influence on y.
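The extended interaction model can be estimated in one pass. In the following sketch (Python with numpy; the data-generating process and parameter values are illustrative assumptions), the treatment effect rises from 1.0 in period T=0 to 1.5 in T=1, so \(\beta_{5}=0.5\) and \(\beta_{7}=0\):

```python
import numpy as np

rng = np.random.default_rng(11)
n, S_bar = 30_000, 2.5
S = rng.uniform(1.5, 3.5, n)
T = (rng.uniform(size=n) < 0.5).astype(float)   # period indicator
D = (S >= S_bar).astype(float)
s = S - S_bar

# Assumed DGP: treatment effect 1.0 in T=0 and 1.5 in T=1, linear trend in s.
y = 0.5 + 0.2 * T + 1.0 * D + 0.5 * D * T + 0.3 * s \
    + rng.normal(scale=0.3, size=n)

# Regressors in the order beta_0, ..., beta_7 of the interaction model.
X = np.column_stack([np.ones(n), T, D, s, D * s, D * T, s * T, D * s * T])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[5], beta[7])   # beta_5 close to 0.5, beta_7 close to 0
```

A test of the joint hypothesis \(\beta_{5}=\beta_{7}=0\) then corresponds to the time-invariant-break hypothesis discussed above.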
The fuzzy RD design assumes that the propensity score function of treatment \(P(D=1|S)\) is discontinuous with a jump at \(\bar{S}\)
$$\begin{aligned} P(D_i=1|S_i) =& \left \{ \begin{array}{l@{\quad }l} g_1(S_i) & \mbox{if}\ S_i\ge\bar{S}\\ g_0(S_i) & \mbox{if}\ S_i< \bar{S}, \end{array} \right . \end{aligned}$$
where it is assumed that \(g_{1}(\bar{S})>g_{0}(\bar{S})\). Therefore, treatment is more likely for \(S_{i}\ge\bar{S}\). In principle, the functions \(g_{1}(S_{i})\) and \(g_{0}(S_{i})\) are arbitrary; e.g. a polynomial of order p can be assumed, but the values have to lie within the interval [0;1], and the two functions must take different values at \(\bar{S}\).
The conditional mean of D that depends on S is
$$\begin{aligned} E(D_i|S_i) =& P(D_i=1|S_i)\\ =& g_0(S_i) + \bigl(g_1(S_i)-g_0(S_i) \bigr)T_i, \end{aligned}$$
where \(T_{i}=1(S_{i}\ge\bar{S})\) is a dummy indicating the point where the mean is discontinuous. If a polynomial of order p is assumed, the interaction variables \(S_{i}T_{i}, S_{i}^{2}T_{i},\ldots,S_{i}^{p}T_{i}\) and the dummy \(T_{i}\) are instruments for \(D_{i}\). The simplest case is to use only \(T_{i}\) as an instrument if \(g_{1}(S_{i})\) and \(g_{0}(S_{i})\) are distinct constants.
We can determine the treatment effect around \(\bar{S}\) as
$$\begin{aligned} \lim_{\Delta\rightarrow0}\frac{E(y_i|\bar{S} < S_i < \bar{S} + \Delta)- E(y_i|\bar{S} - \Delta< S_i < \bar{S})}{E(D_i|\bar{S} < S_i < \bar{S} + \Delta)- E(D_i|\bar{S} - \Delta< S_i < \bar{S})}. \end{aligned}$$
The empirical analogue is the Wald (1940) estimator, which was first developed for the case of measurement errors
$$\begin{aligned} \frac{(\bar{y}|\bar{S}<S_i<\bar{S}+\Delta)-(\bar{y}|\bar{S}-\Delta <S_i<\bar{S})}{ (\bar{D}|\bar{S}<S_i<\bar{S}+\Delta)-(\bar{D}|\bar{S}-\Delta<S_i<\bar{S})}. \end{aligned}$$
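The Wald estimator amounts to a ratio of two mean differences across the threshold. A minimal sketch (Python with numpy; the jump in the treatment probability, the outcome model, and the window width Δ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, S_bar, delta = 50_000, 2.5, 0.25
S = rng.uniform(0, 5, n)
T = (S >= S_bar).astype(float)

# Fuzzy design: treatment probability jumps from 0.2 to 0.7 at S_bar.
D = (rng.uniform(size=n) < np.where(T == 1, 0.7, 0.2)).astype(float)

# Assumed outcome model with a true treatment effect of 2.
y = 1.0 + 2.0 * D + 0.1 * S + rng.normal(scale=0.5, size=n)

above = (S > S_bar) & (S < S_bar + delta)
below = (S > S_bar - delta) & (S < S_bar)

# Wald estimator: outcome jump divided by treatment-probability jump.
wald = (y[above].mean() - y[below].mean()) / (D[above].mean() - D[below].mean())
print(wald)   # close to the assumed treatment effect of 2 for small delta
```

As Δ shrinks, the smooth trend in S cancels out of the numerator and only the discontinuity remains, which is exactly the limit displayed above.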
QTE and RD analysis allow variable causal effects to be determined, though with different intentions. A further possibility is a separate estimation for subgroups, e.g. for industries or regions.