Models with interactions: test type and misinterpretation
Article Type: Tutorial Section From: Journal of Modelling in Management, Volume 8, Issue 1
Models which include interaction terms are common in management and social science research and are frequently reported in the literature. The interpretation of the parameters for such models is quite complex, however, and is a common source of difficulty for those attempting to understand how results apply to the population. In general, there is a lack of knowledge about how parameters associated with variables included in interactions may be interpreted and also about how the estimates for significance are derived. As an example, Table I shows the output from a multi-level regression model of “average point score” for all key-stage 2 pupils in England and Wales. This model, although taken from the field of education, is fairly typical of many models reported in the management literature and provides an example from an easily understood context.

Table I A three-level regression model of “average point score” (APS) for key-stage 2 pupils in England and Wales
Having presented and discussed this particular analysis with students and researchers over a number of years, I have found that the model parameters are easily misunderstood (and, as a direct consequence, so is what the model actually shows). For example, when trying to answer the question “what effect does being eligible for free-school-meals (fsm) have on the average point score?”, readers are often tempted to interpret the fsm main-effect statistic (−1.67224, t=−73.7, p<0.001) and conclude that “those who are eligible for fsm get aps scores that are on average 1.67 marks lower” (a common interpretation is that this is the “effect” of fsm on aps after taking account of its relationship with other variables). This appears to make sense as we would expect children from lower SES backgrounds (i.e. those who are eligible for fsm) to perform relatively poorly. Similarly, when attempting to answer the question “What effect does gender have on the average point score?”, it is tempting to interpret the main-effect statistic for gender (−0.14147, t=−14.9663, p<0.001) as the “effect” of gender after taking into account its relationship with the child’s special-educational-needs status (variable sen; a greater proportion of boys tend to have special educational needs). It looks as though boys get 0.14 marks lower than girls after taking into account other information in the model. This result also appears to “make sense” as there is evidence that girls tend to out-perform boys in academic tests.
Although these interpretations may look reasonable and lead to “believable” conclusions, the parameters for the main effects should not be interpreted in this way at all. The statistics presented for the main effects in this model are, to all intents and purposes, uninterpretable. This is difficult to show with this complicated example, but can be easily demonstrated using the following simple data set (Table II), which is depicted using a scatterplot in Figure 1.

Table II Made up data set to show an interaction between X and gender when predicting Y

Figure 1 A scatterplot showing the relationship between Y, X and gender
Figure 1 shows clearly that it is the combination of gender and X that is important for predicting Y. Gender or X when considered on their own cannot accurately predict Y. A regression model of Y (as Y is continuous, an OLS regression model is applied) accounting for X, gender and the interaction between them is shown in Table III.

Table III An OLS regression model: Y ~ X + gender + X:gender
Table III shows the interaction between X and gender (the “X: Gender” term in the output) and shows that the slope of Y on X is 1.92951 units greater for males than for females (for each unit increase in X, Y increases by about one for males and decreases by about one for females; the relative difference between males and females is, therefore, about two). This difference is also significant, as indicated by the t-statistic (t=28.61, p<2×10−16). Although the parameters for the interaction appear sensible, it is obvious from this output that there may be problems when trying to interpret the main effects for gender and X. This is most clearly demonstrated by the parameter for the gender main effect (−26.33449), which does not relate to any simple difference between males and females: the estimate is far greater in magnitude than the entire range of Y values in the sample. It is also “difficult” to interpret the significance for this “main effect”, as it is shown to be highly significant when we know that gender “on its own” is not highly predictive of Y. When there is an interaction in the model, it is “difficult” to interpret the statistics for the main effects in terms of any hypothesis relating to their “effect” on Y.
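The reason the gender estimate can look so odd may be sketched in a few lines of code. The example below is a minimal illustration in plain Python, not the R analysis reported here: the data are invented (they are not the values in Table II), gender is assumed to be dummy coded (0 = female, 1 = male), and the helper `ols` is a hypothetical name. Under that coding the gender coefficient is the estimated male-female gap at X = 0, a point that may lie well outside the observed data, while the gap at any other X is the gender coefficient plus X times the interaction coefficient.

```python
# Invented data (NOT the article's Table II): Y falls with X for females
# and rises with X for males, so the two lines cross near X = 15.
xs = list(range(10, 21))
noise_f = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.1]
noise_m = [-0.2, 0.4, 0.0, 0.3, -0.3, 0.1, -0.4, 0.2, 0.1, -0.3, 0.2]
rows = [(x, 0, 30 - x + e) for x, e in zip(xs, noise_f)]   # gender 0 = female
rows += [(x, 1, x + e) for x, e in zip(xs, noise_m)]       # gender 1 = male


def ols(design, y):
    """Least-squares coefficients and residual sum of squares (hypothetical helper)."""
    k = len(design[0])
    # Augmented normal equations [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in design) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(design, y))] for i in range(k)]
    for c in range(k):                        # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(design, y))
    return beta, rss


# Model: Y ~ X + gender + X:gender
design = [[1.0, x, g, x * g] for x, g, _ in rows]
y = [r[2] for r in rows]
(b0, b_x, b_gender, b_inter), _ = ols(design, y)

y_range = max(y) - min(y)
gap_at_15 = b_gender + 15 * b_inter           # male-female gap at the centre of X

print("gender coefficient (gap at X = 0):", round(b_gender, 3))
print("interaction coefficient:", round(b_inter, 3))
print("range of Y in the sample:", round(y_range, 3))
print("male-female gap at X = 15:", round(gap_at_15, 3))
```

With these invented data the gender coefficient is larger in magnitude than the whole range of Y, yet the estimated gap between males and females at the centre of the data is close to zero.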
The problem with the estimates of significance for the lower-order terms can be illustrated by considering how the significance values are computed. In the regression output, the significance values are computed by comparing the deviance of nested models. For example, the significance of the interaction term is determined by comparing two nested models - one that includes the interaction and one that does not.
Computing the significance of the interaction term “X: Gender”:
Model 1: Y ~ X + gender (deviance = 699.35)
Model 2: Y ~ X + gender + X:gender (deviance = 23.14)
The difference in deviance between the models above is 676.21 (699.35 − 23.14), which is highly significant (this equates to an F-value of 818.26, which is directly comparable to the t-statistic shown in Table III: the square root of 818.26 is 28.61; see Hutcheson and Moutinho, 2008). Similarly, the significance of the main effect “Gender” is determined by comparing the deviances of a model that includes the lower-order term gender with a model that does not.
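Computing the significance of the main-effect term “Gender”:
Model 1: Y ~ X + X:gender (deviance = 637.73)
Model 2: Y ~ X + gender + X:gender (deviance = 23.14)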
The difference in deviance between these models is 614.59 (637.73 − 23.14), which is highly significant (F=743.69, the square root of which is 27.27, the magnitude of the t-statistic for gender shown in Table III). Although easy to compute, the difference between the two models is difficult to interpret, as both models include information about gender and the effect of gender is realised only as the consequence of an interaction. The status of the gender term on its own is, therefore, difficult to define.
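The nested-model comparisons described above can be reproduced in outline as follows. This is a sketch in plain Python on invented data (not Table II), with gender assumed to be dummy coded and a hypothetical helper `rss` that returns the residual sum of squares (the deviance of an OLS model). Each term is dropped from the full model in turn and the increase in deviance is divided by the full model's residual mean square to give F; for a single-parameter term this F is the square of the t-statistic in the regression output.

```python
# Invented data (NOT Table II): opposite slopes for the two genders.
xs = list(range(10, 21))
noise_f = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.1]
noise_m = [-0.2, 0.4, 0.0, 0.3, -0.3, 0.1, -0.4, 0.2, 0.1, -0.3, 0.2]
rows = [(x, 0, 30 - x + e) for x, e in zip(xs, noise_f)]   # gender 0 = female
rows += [(x, 1, x + e) for x, e in zip(xs, noise_m)]       # gender 1 = male
y = [r[2] for r in rows]


def rss(design, y):
    """Residual sum of squares of a least-squares fit (hypothetical helper)."""
    k = len(design[0])
    a = [[sum(r[i] * r[j] for r in design) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(design, y))] for i in range(k)]
    for c in range(k):                        # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(design, y))


# Full model and the two reduced models, each with one term dropped
full = rss([[1.0, x, g, x * g] for x, g, _ in rows], y)     # Y ~ X + gender + X:gender
no_inter = rss([[1.0, x, g] for x, g, _ in rows], y)        # Y ~ X + gender
no_gender = rss([[1.0, x, x * g] for x, g, _ in rows], y)   # Y ~ X + X:gender

df_resid = len(y) - 4                 # observations minus parameters in the full model
mse = full / df_resid                 # residual mean square of the full model
f_inter = (no_inter - full) / mse     # 1-df test of the interaction
f_gender = (no_gender - full) / mse   # 1-df test of gender, given the interaction

mean_f = sum(v for _, g, v in rows if g == 0) / len(xs)
mean_m = sum(v for _, g, v in rows if g == 1) / len(xs)

print("F for X:gender:", round(f_inter, 1))
print("F for gender (interaction kept in the model):", round(f_gender, 1))
print("difference in mean Y, males minus females:", round(mean_m - mean_f, 3))
```

In these invented data the two gender groups have almost identical mean Y, yet the test of the gender term with the interaction retained is highly significant, which mirrors the difficulty seen in Table III.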
The tests above are known as “type III” ANOVA tests, which test each term in the model after all the others. More readily interpretable are type II ANOVA tests, which test each term in the model after all the others except the term’s higher-order relatives, which are ignored. For our model, type II tests provide the same estimate of significance for the interaction, but test the “main-effect” terms without taking the interaction into account. The type II tests for the model above are shown in Table IV.

Table IV Type II tests for the model: Y ~ X + gender + X:gender
The type II tests in Table IV show that the interaction term is significant. The lower-order terms for gender and X are non-significant when the interaction is ignored, confirming the interpretation expected from the scatterplot in Figure 1. The type II test statistics can easily be verified by comparing the deviances for nested models. For example, the X main effect can be verified by comparing the deviances of the models Y ~ gender and Y ~ X + gender.
The difference between the two is 2.15, the SS statistic shown in Table IV. Table IV provides statistics that are more easily interpretable and clearly shows that it is just the interaction between X and gender that is significant and not X or gender on their own.
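Type II sums of squares can be obtained by the same kind of model comparison. The sketch below uses plain Python and invented data (not Table II), with gender assumed to be dummy coded and a hypothetical helper `rss`; each main effect is tested by comparing the additive model with and without that term, ignoring the interaction. Dividing each sum of squares by the residual mean square of the full model is an assumption of this sketch about how the F-values are formed.

```python
# Invented data (NOT Table II): opposite slopes for the two genders.
xs = list(range(10, 21))
noise_f = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.1]
noise_m = [-0.2, 0.4, 0.0, 0.3, -0.3, 0.1, -0.4, 0.2, 0.1, -0.3, 0.2]
rows = [(x, 0, 30 - x + e) for x, e in zip(xs, noise_f)]   # gender 0 = female
rows += [(x, 1, x + e) for x, e in zip(xs, noise_m)]       # gender 1 = male
y = [r[2] for r in rows]


def rss(design, y):
    """Residual sum of squares of a least-squares fit (hypothetical helper)."""
    k = len(design[0])
    a = [[sum(r[i] * r[j] for r in design) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(design, y))] for i in range(k)]
    for c in range(k):                        # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(design, y))


additive = rss([[1.0, x, g] for x, g, _ in rows], y)        # Y ~ X + gender
full = rss([[1.0, x, g, x * g] for x, g, _ in rows], y)     # Y ~ X + gender + X:gender

# Type II sums of squares: main effects ignore the interaction
ss_x = rss([[1.0, g] for _, g, _ in rows], y) - additive       # Y ~ gender vs Y ~ X + gender
ss_gender = rss([[1.0, x] for x, _, _ in rows], y) - additive  # Y ~ X vs Y ~ X + gender
ss_inter = additive - full                                     # same comparison as type III

mse = full / (len(y) - 4)             # residual mean square of the full model (assumed denominator)
f_x, f_gender, f_inter = ss_x / mse, ss_gender / mse, ss_inter / mse

for name, ss, f in [("X", ss_x, f_x), ("gender", ss_gender, f_gender),
                    ("X:gender", ss_inter, f_inter)]:
    print(f"{name:9s} SS = {ss:10.4f}  F = {f:10.3f}")
```

For these invented data the F-values for X and gender are both below one while the interaction F is very large, the same pattern that Table IV reports for the article's data.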
Returning to the original analysis shown in Table I, it is obvious that the main effects of all the variables which are also included in interaction terms are not directly interpretable, as they are type III tests. In order to answer questions such as “what effect does being eligible for fsm have on the average point score?” and “What effect does gender have on the average point score?”, we would need to have an analysis of deviance table showing the type II test results. Such tables are essential, but are, unfortunately, rarely published along with the results. The analysis as it stands is interesting, but neglects information essential to interpret many of the most interesting research questions.
Conclusions
Analysts need to be careful when interpreting and presenting statistics. It is easy to misinterpret regression analyses, particularly when these include interaction terms. No matter what is written in the body of the paper, many readers (myself included) will directly interpret the models and give only a cursory glance to any warnings. If readers are not familiar with interactions and the way that significance is computed by the software, there is a danger that the models will be misinterpreted. Without the appropriate test statistics, readers are also forced into “making up their own minds” as to the size, direction and significance of certain effects.
Tables of deviance need to be routinely presented along with the regression models using the type of test that is most appropriate for the research questions. In exploratory management research, this will often be type II tests. It is not enough to just reproduce the regression models output by the software, as many of the questions we (and those reading our research) want to ask of our data cannot be answered directly from this output.
Courses in statistical modelling need to make clear the issues surrounding the type of ANOVA tests that are applied. Many standard model outputs provide type III tests which can be difficult to interpret.
Analysts should consider “greying-out” those statistics that should not be directly interpreted. A useful addition to the analysis shown in Table I would be to shade all the main-effect estimates and explicitly note that the significance tests are type III and therefore uninterpretable in the presence of interactions.
Graeme Hutcheson
Manchester University, Manchester, UK
Acknowledgements
All analyses were run in R (R Development Core Team, 2012) via the R-commander graphical interface (Fox, 2005) and the car library (Fox and Weisberg, 2012). All graphics were produced using the TikZ package (Tantau, 2010) in conjunction with the tikzDevice R library (Sharpsteen and Bracken, 2012) and edited with the Qtikz software (www.hackenberger.at/blog/ktikz-editor-for-the-tikz-language/).
