Purpose
People seem to function according to different models, which implies that in business and social sciences, heterogeneity is a rule rather than an exception. Researchers can investigate such heterogeneity through multigroup analysis (MGA). In the context of partial least squares path modeling (PLS-PM), MGA is currently applied to perform multiple comparisons of parameters across groups. However, this approach has significant drawbacks: first, the whole model is not considered when comparing groups, and second, the family-wise error rate is higher than the predefined significance level when the groups are indeed homogenous, leading to incorrect conclusions. Against this background, the purpose of this paper is to present and validate new MGA tests, which are applicable in the context of PLS-PM, and to compare their efficacy to existing approaches.
Design/methodology/approach
The authors propose two tests that adopt the squared Euclidean distance and the geodesic distance to compare the model-implied indicator correlation matrix across groups. The authors employ permutation to obtain the corresponding reference distribution to draw statistical inference about group differences. A Monte Carlo simulation provides insights into the sensitivity and specificity of both permutation tests and their performance, in comparison to existing approaches.
Findings
Both proposed tests provide a considerable degree of statistical power. However, the test based on the geodesic distance outperforms the test based on the squared Euclidean distance in this regard. Moreover, both proposed tests lead to rejection rates close to the predefined significance level in the case of no group differences. Hence, our proposed tests are more reliable than an uncontrolled repeated comparison approach.
Research limitations/implications
Current guidelines on MGA in the context of PLS-PM should be extended by applying the proposed tests in an early phase of the analysis. Beyond our initial insights, more research is required to assess the performance of the proposed tests in different situations.
Originality/value
This paper contributes to the existing PLS-PM literature by proposing two new tests to assess multigroup differences. For the first time, this allows researchers to statistically compare a whole model across groups by applying a single statistical test.
1. Introduction
The empirical testing of theories requires valid statistical methods to allow researchers to derive reliable implications. In the field of information systems (IS) and internet research, partial least squares path modeling (PLS-PM) is a widely used composite-based estimator for structural equation models with latent variables to investigate phenomena such as social networks (Cheung et al., 2015), internet addiction (Lu and Wang, 2008) and mobile banking (Tam and Oliveira, 2017). It was originally developed by Herman A. O. Wold in the 1970s as an alternative estimator for structural equation models (Wold, 1975).
The existing literature on PLS-PM has provided substantial methodological contributions that have increased its application in various disciplines, such as strategic management (Hulland, 1999), IS research (Ringle et al., 2012) and tourism research (Müller et al., 2018). Notable milestones include the proposal of the confirmatory tetrad analysis (Gudergan et al., 2008) and the heterotrait–monotrait ratio of correlations (Henseler et al., 2015) and the development of consistent PLS (PLSc; Dijkstra and Henseler, 2015a), which enhances traditional PLS-PM to consistently estimate structural models containing common factors.
Researchers often assume that data sets in empirical research stem from a single homogeneous population. Contrary to this assumption, data sets used in social sciences are regularly affected by heterogeneity, which implies that the data were collected from different populations, each of which is homogeneous in itself. Ignoring this fact, i.e., not taking heterogeneity into account, leads to questionable conclusions (Jedidi et al., 1997). Hence, a multigroup analysis (MGA) can be conducted to investigate whether and how groups differ.
Heterogeneity has been recognized in the context of PLS-PM (e.g. Huma et al., 2017), and several approaches have been adopted to define groups in the case of unobserved heterogeneity based on genetic algorithm segmentation (Ringle et al., 2014) and iterative reweighted regression (Schlittgen et al., 2016). Moreover, parametric and non-parametric approaches (Keil et al., 2000; Chin and Dibbern, 2010; Henseler, 2012) have been proposed to assess parameter differences and, thus, heterogeneity across groups. However, the existing approaches have serious drawbacks. First, they do not compare the whole model but compare only specific parameters, e.g., path coefficients, to investigate heterogeneity. Second, the employed testing procedures rely on distributional assumptions, e.g., normally distributed data, which are often violated in empirical research. Finally, since the existing approaches rely on multiple comparisons, complex models with numerous relationships and more than two groups substantially increase the number of comparisons. Hence, researchers applying current approaches face the risk of a high family-wise error rate (FWER).
Against this background, this paper proposes two tests that allow for comparing a whole model across groups while maintaining the predefined significance level under the null hypothesis. For that purpose, we consider established distance measures, namely, the geodesic distance and the squared Euclidean distance, to measure the discrepancy of the model-implied correlation matrix of the indicators across groups. To obtain the reference distribution of the corresponding test statistic (distance measure) under the null hypothesis of no group differences, we employ a permutation procedure.
This paper is structured as follows: After the introduction, in Section 2, we review the existing literature on MGA in the context of PLS-PM and emphasize the importance of having a test that allows the comparison of the whole model across groups. In Section 3, we propose two novel MGA tests and show how permutation can be used for significance testing. In Section 4, these new tests are evaluated by means of a Monte Carlo simulation. Finally, we extend current MGA guidelines in the context of PLS-PM by proposing a comprehensive test procedure that integrates our proposed test into existing approaches and discuss opportunities for future research.
2. MGA using PLS-PM
MGA can be used to explore differences across groups defined by group variables. Heterogeneity across groups in MGA occurs if there are significant differences across at least two groups. To address heterogeneity, researchers can estimate separate models per group or control for heterogeneity by means of a categorical moderator variable (Sarstedt, Henseler and Ringle, 2011). If it is ignored, heterogeneity affects the complete underlying research model.
In terms of unobserved heterogeneity, cluster analysis such as k-means clustering has been widely used in the context of PLS-PM to identify partitions that are used for group-specific estimations (Hair et al., 2018; Sarstedt and Mooi, 2014). A major shortcoming of this approach is that the structural model, which is a major aspect in structural equation modeling (SEM), is not taken into account. To overcome this drawback, the literature provides several approaches, such as finite mixture partial least squares (Hahn et al., 2002; Sarstedt, Becker, Ringle and Schwaiger, 2011), the prediction-oriented segmentation in PLS-PM (Becker et al., 2013) and the iterative reweighted regression segmentation method for PLS-PM (Schlittgen et al., 2016). For a more complete overview of techniques, we refer to Hair et al. (2016).
In addition to uncovering unobserved heterogeneity, the previous literature has also suggested different approaches to test for observed heterogeneity across groups (Hair et al., 2018). For two-group scenarios, a repeated application of unpaired sample t-tests has been proposed to identify differences between groups (Chin, 2000; Keil et al., 2000). In doing so, the test statistic is assumed to follow a t-distribution where the standard errors of the parameter estimates are obtained by the bootstrap or jackknife procedure (Keil et al., 2000). To overcome distributional assumptions, the previous literature has also provided a non-parametric test for MGA (Henseler, 2012). Although this test is similar to the former, it evaluates the bootstrap distribution of each group to analyze whether the estimates statistically differ between groups. Similarly, Chin (2003) and Chin and Dibbern (2010) propose a permutation test to evaluate group differences. Group-specific differences are compared with the corresponding reference distribution obtained by the permutation procedure. Apart from the analysis of two groups, approaches for multiple groups have been suggested, for example, the omnibus test of group differences, which is a combinatorial test comprising bootstrapping and permutation to mimic an overall F-test (Sarstedt, Henseler and Ringle, 2011).
A variety of tests thus allow researchers to test for heterogeneity across groups. However, these tests have some significant shortcomings. First, they do not consider the whole model; instead, they focus on specific parameters only, thus excluding information from the model. For instance, the procedure proposed by Chin (2003) and Chin and Dibbern (2010) suggests a simultaneous comparison of path coefficients. We argue that in the early stages of research or in complex models, a researcher might be interested not only in differences between path coefficients but also in comparing whole models across groups. Second, the use of repeated t-tests to investigate differences, e.g., in path coefficients, reintroduces distributional assumptions, whose relaxation is considered a major advantage of PLS-PM. To maintain this advantage, a non-parametric test would be preferable. Finally, the simultaneous comparison of multiple parameters involves the risk of an inflated FWER, which is the probability that at least one of the tests commits a Type I error (falsely rejecting the null hypothesis) (Dudoit et al., 2004). For a single statistical test, the Type I error rate is determined by the significance level; the simultaneous application of multiple tests, however, increases the FWER if it is not controlled. Hence, the aforementioned testing procedures in MGA face the risk of an FWER that is too high, i.e., the null hypothesis of no group differences is rejected too often. This issue is particularly relevant in scenarios with multiple groups and complex models, as the number of comparisons increases significantly. Let p be the number of parameters and G be the number of groups; the number of overall comparisons c is calculated as follows:

$$c = p \cdot \frac{G(G-1)}{2} \qquad (1)$$
For example, investigating whether four groups are heterogeneous with respect to ten parameters requires 60 statistical tests. Assuming a significance level of α=0.05, the probability of falsely rejecting the null hypothesis of homogeneity across groups (the FWER) is 1−(1−α)^c. Without any further corrections, there is a 95.39 percent chance that at least one of the comparisons is statistically significant when the null hypothesis is indeed correct. This issue is also relevant when few parameters are compared (an overview is given in the Appendix, Table AI). Hence, MGA with repeated comparisons must take the FWER into account.
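The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch: the function names `n_comparisons` and `fwer` are ours, and the formula 1−(1−α)^c treats the c tests as independent, which is an assumption of the calculation rather than something the data guarantee.

```python
def n_comparisons(n_params: int, n_groups: int) -> int:
    """Number of pairwise parameter comparisons: c = p * G * (G - 1) / 2."""
    return n_params * n_groups * (n_groups - 1) // 2


def fwer(n_params: int, n_groups: int, alpha: float = 0.05) -> float:
    """Family-wise error rate 1 - (1 - alpha)^c, assuming independent tests."""
    c = n_comparisons(n_params, n_groups)
    return 1.0 - (1.0 - alpha) ** c


# Four groups compared on ten parameters, as in the example above.
c = n_comparisons(10, 4)
rate = fwer(10, 4)
print(c, round(100 * rate, 2))  # 60 95.39
```

Changing the arguments reproduces the remaining rows of the FWER overview, e.g. `fwer(4, 2)` for two groups and four parameters.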
Table AI. FWER in MGA
| Number of groups | Number of parameters | Total number of comparisons (c) | FWER (α=5%) (%) |
|---|---|---|---|
| 2 | 4 | 4 | 18.55 |
| 2 | 8 | 8 | 33.66 |
| 2 | 10 | 10 | 40.13 |
| 3 | 4 | 12 | 45.96 |
| 3 | 8 | 24 | 70.80 |
| 3 | 10 | 30 | 78.54 |
| 4 | 4 | 24 | 70.80 |
| 4 | 8 | 48 | 91.47 |
| 4 | 10 | 60 | 95.39 |
| 5 | 4 | 40 | 87.15 |
| 5 | 8 | 80 | 98.35 |
| 5 | 10 | 100 | 99.41 |
To show how MGA is used in IS research and how the FWER is controlled, we conducted a literature review. We queried the Web of Science database, including publications from nine leading journals from the IS domain (European Journal of Information Systems, Information Systems Journal, Information Systems Research, Internet Research, Journal of the Association for Information Systems, Journal of Information Technology, Journal of Management Information Systems, Journal of Strategic Information Systems and Management Information Systems Quarterly). We used “multigroup” and “group differences” as keywords and included all articles applying PLS-PM. Since MGA is often only one part of a larger analysis and is therefore not mentioned explicitly in a paper’s abstract, we also included illustrative examples referenced in the pertinent literature (Qureshi and Compeau, 2009). For each paper, we identified the grouping variable and its levels, how the significance level was adjusted and the number of path coefficients relevant for the MGA. Since the considered papers do not report any kind of correction of the significance level, we compute the FWER as 1−(1−α)^c with α=0.05, where c is the number of comparisons according to Equation (1). An overview is provided in Table I (sorted by year of publication).
Table I. Multigroup analysis in IS research
| Reference | Grouping variable | Paths | Comparisons | FWER (%) |
|---|---|---|---|---|
| Keil et al. (2000) | Culture (Finland, The Netherlands, Singapore) | 5 | 15 | 53.67 |
| Ahuja and Thatcher (2005) | Gender (male, female) | 5 | 5 | 22.62 |
| Srite and Karahanna (2006) | Cultural values (individualism, collectivism) | 4 | 4 | 18.55 |
| Zhu et al. (2006) | Users (EDI user, non-user) | 16 | 16 | 55.99 |
| Hsieh et al. (2008) | Economically (advantaged, disadvantaged) | 9 | 9 | 36.98 |
| Sia et al. (2009) | Cultural differences (Australia, Hong Kong) | 6 | 6 | 26.49 |
| Shen et al. (2010) | Gender (male, female) | 6 | 6 | 26.49 |
| Yeh et al. (2012) | Gender (male, female) | 4 | 4 | 18.55 |
| Dibbern et al. (2012) | Country (Germany, USA) | 5 | 5 | 22.62 |
| Zhou et al. (2015) | Indulgence (high indulgence, low indulgence) | 4 | 4 | 18.55 |
| Huma et al. (2017) | Organization (private, public) | 6 | 6 | 26.49 |
| Shi et al. (2018) | Gender (male, female) | 3 | 3 | 14.26 |
Our review indicates that interest in MGA has increased. Fundamental papers that paved the way for MGA in the context of PLS-PM (Keil et al., 2000; Qureshi and Compeau, 2009) have undoubtedly contributed to this development. At the same time, this review also shows that issues associated with multiple comparison tests, i.e., the FWER, have received little attention so far. None of the reviewed papers reports any kind of correction (e.g. Bonferroni correction). Hence, we are inclined to assume that a correction was not applied. In conclusion, most papers might be affected by a considerable inflation of the FWER (14.26 percent ≤ FWER ≤ 55.99 percent). This issue particularly affects studies that investigate more than two groups and/or a high number of path coefficients.
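To illustrate what such a correction would do, the following sketch applies the Bonferroni adjustment (each test is run at level α/c) to a study with 15 comparisons. It is only an illustration: the function names are ours, and the FWER formula again assumes independent tests.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_comparisons


def fwer_independent(alpha_per_test: float, n_comparisons: int) -> float:
    """FWER for n independent tests, each run at alpha_per_test."""
    return 1.0 - (1.0 - alpha_per_test) ** n_comparisons


c = 15  # e.g. three groups compared on five path coefficients
uncorrected = fwer_independent(0.05, c)
corrected = fwer_independent(bonferroni_alpha(0.05, c), c)
print(round(100 * uncorrected, 2), round(100 * corrected, 2))
```

With the adjusted per-test level, the FWER stays just below the nominal 5 percent, at the cost of lower power for each individual comparison.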
3. A test for MGA
3.1 Measuring heterogeneity across groups
Here, we propose two new tests to compare whole models across groups to investigate heterogeneity. Similar to the test for overall model fit in PLS-PM (Dijkstra and Henseler, 2015b), which considers the discrepancy between the empirical indicator covariance matrix and its model-implied counterpart, we propose to investigate the distances between the model-implied indicator correlation matrices across groups[1].
To determine the differences between the model-implied indicator correlation matrices across groups, every measure that satisfies the properties of a distance (Deza and Deza, 2016, p. 3) can be used. Consequently, a distance greater than zero implies that two groups differ. If there is a statistically significant distance between the groups, further steps can be conducted to investigate the differences in more depth, e.g., investigation of specific path coefficients.
For the purpose of our research, i.e., assessing the distances between model-implied indicator correlation matrices, we consider two established distance measures: the geodesic distance and the squared Euclidean distance. While the squared Euclidean distance is well known, the geodesic distance deserves a brief description: it belongs to Swain’s (1975) class of fitting functions and can be employed to estimate the model parameters in SEM. Properly scaled, it is asymptotically equal to the fitting function used in the maximum likelihood (ML) estimation for SEM.
In the case of two groups, the geodesic distance between the model-implied correlation matrix of group 1 (Σ(θ1)) and group 2 (Σ(θ2)) is calculated as follows:
$$d_g = \frac{1}{2}\sum_{i=1}^{K}\left(\log \varphi_i\right)^2 \qquad (2)$$

where φi is the i-th eigenvalue of the matrix Σ(θ1)⁻¹Σ(θ2) and K is the number of rows of one of these two matrices. When the two matrices are equal, the geodesic distance is zero, since Σ(θ1)⁻¹Σ(θ2) is then the identity matrix, all eigenvalues of which are one.
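A minimal NumPy sketch of this distance is given below. It assumes the usual form of the geodesic distance, one half of the sum of squared log-eigenvalues of Σ(θ1)⁻¹Σ(θ2); the function name `geodesic_distance` and the 2×2 example matrices are ours and purely illustrative.

```python
import numpy as np


def geodesic_distance(sigma1: np.ndarray, sigma2: np.ndarray) -> float:
    """0.5 * sum(log(phi_i)^2), phi_i being the eigenvalues of sigma1^{-1} sigma2."""
    # Solve sigma1 @ X = sigma2 instead of forming the inverse explicitly.
    product = np.linalg.solve(sigma1, sigma2)
    # For positive definite inputs the eigenvalues are real and positive;
    # np.real drops any numerically negligible imaginary parts.
    phi = np.real(np.linalg.eigvals(product))
    return float(0.5 * np.sum(np.log(phi) ** 2))


identity = np.eye(2)
correlated = np.array([[1.0, 0.5], [0.5, 1.0]])

print(geodesic_distance(identity, identity))    # 0.0
print(geodesic_distance(identity, correlated))  # eigenvalues 0.5 and 1.5
```

Because inverting the matrix product only inverts the eigenvalues, and the logarithms are squared, the result does not depend on which group is labeled 1 and which is labeled 2.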
The squared Euclidean distance between Σ(θ1) and Σ(θ2) is calculated as follows:
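$$d_e = \frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{K}\left(\sigma_{ij,1}-\sigma_{ij,2}\right)^2 \qquad (3)$$

where K is again the number of rows and σij,1 and σij,2 are the elements of the respective matrix. If both matrices are identical, the squared Euclidean distance is zero; otherwise, this distance is greater than zero.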
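The squared Euclidean distance can be sketched in the same way. The scaling by one half is an assumption of this sketch (it mirrors the scaling commonly used for this discrepancy in the PLS-PM literature), and the function name and example matrices are ours.

```python
import numpy as np


def squared_euclidean_distance(sigma1: np.ndarray, sigma2: np.ndarray) -> float:
    """0.5 * sum over all i, j of (sigma1[i, j] - sigma2[i, j])^2."""
    return float(0.5 * np.sum((sigma1 - sigma2) ** 2))


identity = np.eye(2)
correlated = np.array([[1.0, 0.5], [0.5, 1.0]])

print(squared_euclidean_distance(identity, identity))    # 0.0
print(squared_euclidean_distance(identity, correlated))  # 0.25
```

In the second call only the two off-diagonal elements differ, each by 0.5, which gives 0.5 × (0.25 + 0.25) = 0.25.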
Since MGA is often conducted with more than two groups, we calculate the arithmetic mean of the distances of all possible pairs of groups. Note that the total number of group comparisons is G(G−1)/2, where G is the number of groups. Therefore, the average geodesic distance (Dg) for G groups is calculated as follows:
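$$D_g = \frac{2}{G(G-1)}\sum_{g=2}^{G}\sum_{h=1}^{g-1}\frac{1}{2}\sum_{i=1}^{K}\left(\log \varphi_i\right)^2 \qquad (4)$$

where φi is the i-th eigenvalue of the matrix Σ(θg)⁻¹Σ(θh).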
In a similar manner, we calculate the average squared Euclidean distance for more than two groups as follows:
$$D_e = \frac{2}{G(G-1)}\sum_{g=2}^{G}\sum_{h=1}^{g-1}\frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{K}\left(\sigma_{ij,g}-\sigma_{ij,h}\right)^2 \qquad (5)$$

where σij,g and σij,h are the elements of the corresponding model-implied correlation matrix. Since the squared Euclidean distance and the geodesic distance are always larger than or equal to zero, the two proposed average distances cannot be negative. Moreover, these distances are zero if all correlation matrices are equal; otherwise, they are larger than zero.
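Averaging over all G(G−1)/2 pairs of groups can be sketched as follows. The helper `average_distance` is a hypothetical name; it accepts any pairwise distance function, and the squared Euclidean distance with a scaling of one half is used here only as an assumed example.

```python
from itertools import combinations

import numpy as np


def squared_euclidean_distance(sigma1, sigma2):
    """Assumed pairwise distance: 0.5 * sum of squared element differences."""
    return float(0.5 * np.sum((sigma1 - sigma2) ** 2))


def average_distance(matrices, distance):
    """Arithmetic mean of the distance over all G(G-1)/2 pairs of groups."""
    pairs = list(combinations(matrices, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Three illustrative groups: two identical matrices and one that differs.
group1 = np.eye(2)
group2 = np.eye(2)
group3 = np.array([[1.0, 0.5], [0.5, 1.0]])

d_avg = average_distance([group1, group2, group3], squared_euclidean_distance)
print(d_avg)  # (0 + 0.25 + 0.25) / 3
```

The average is zero only if every pair of matrices is identical, which matches the null hypothesis of equal model-implied correlation matrices in all groups.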
In terms of MGA, the considered null hypothesis is as follows: H0: Σ(θ1)=…=Σ(θg)=…=Σ(θG), where Σ(θg) is the model-implied population correlation matrix of the indicators for group g. To obtain the reference distribution of the distance measures including the average distances, we apply a permutation procedure, as described below.
3.2 Permutation tests
Permutation tests were introduced by Sir Ronald Fisher (1935) as a general approach to statistical inference and have been considered the gold standard in medical research (Edgington and Onghena, 2007, p. 9). There are three common types of permutation tests: exact permutation tests, moment–approximation permutation tests and resampling–approximation permutation tests (Berry et al., 2014). All three types share the characteristic that they use permutation to obtain the distribution of the test statistic under the null hypothesis. The exact permutation test obtains the reference distribution by calculating the test statistic for all possible permutations of the original data set. Thus, the number of calculations grows rapidly with an increasing number of observations, and the application of an exact permutation test is not always feasible. The moment–approximation permutation test requires the computation of the exact moments of the test statistic, which are then used to fit a specific distribution. In turn, this distribution is used for the calculation of the p-value. The resampling–approximation permutation test is similar to the exact permutation test; however, the reference distribution of the test statistic is based on only a subset of all possible permutations of the original sample. Due to its feasibility, this test is widely used.
Multiple resampling–approximation permutation tests have been developed in the context of PLS-PM, for example, a permutation test for compositional invariance (Henseler et al., 2016) and a permutation test to compare parameters across groups (Chin and Dibbern, 2010). This type of permutation test has distinct advantages compared to parametric tests. First, it makes no assumptions about the distribution of the test statistic. Since PLS-PM also makes no distributional assumptions, this type of permutation test is perfectly in line with the spirit of PLS-PM. Moreover, such permutation tests have favorable properties for small sample sizes (Ludbrook and Dudley, 1994), and they are robust against extreme values. Therefore, we also choose this type of permutation test to compare the model-implied indicator correlation matrices across groups.
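The logic of a resampling–approximation permutation test can be sketched as follows. This is an illustration under loud assumptions, not the implementation used in the simulation: the statistic is computed from sample correlation matrices as a stand-in for the model-implied matrices (which would require estimating the PLS-PM model in every group and every permutation run), the p-value convention (1 + number of permuted statistics at least as large as the observed one) / (R + 1) is one common choice, and all function names are ours.

```python
from itertools import combinations

import numpy as np


def avg_sq_euclidean(data, labels):
    """Stand-in statistic: average squared Euclidean distance between the
    per-group SAMPLE correlation matrices (not model-implied matrices)."""
    mats = [np.corrcoef(data[labels == g], rowvar=False) for g in np.unique(labels)]
    pairs = list(combinations(mats, 2))
    return sum(0.5 * np.sum((a - b) ** 2) for a, b in pairs) / len(pairs)


def permutation_test(data, labels, statistic, n_perm=499, seed=0):
    """Compare the observed statistic with its distribution under random
    reassignment of observations to groups (group sizes stay fixed)."""
    rng = np.random.default_rng(seed)
    observed = statistic(data, labels)
    permuted = np.array(
        [statistic(data, rng.permutation(labels)) for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(permuted >= observed)) / (n_perm + 1)
    return observed, p_value


# Two illustrative groups of 200 observations with clearly different correlations.
rng = np.random.default_rng(1)
group0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=200)
group1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
data = np.vstack([group0, group1])
labels = np.repeat([0, 1], 200)

observed, p_value = permutation_test(data, labels, avg_sq_euclidean)
print(observed, p_value)
```

Replacing `avg_sq_euclidean` with a function that estimates the model per group and returns the average distance between the model-implied matrices would turn this sketch into the kind of test described here.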
4. Monte Carlo simulation
4.1 Simulation design
We used a Monte Carlo simulation to evaluate the sensitivity (power) and specificity (Parikh et al., 2008) of our two proposed permutation tests, where one is based on the average geodesic distance (Dg) and the other on the average squared Euclidean distance (De). While specificity is the test’s ability to correctly reveal homogeneity across groups, sensitivity is the ability to correctly detect heterogeneity across groups. To compare the performance of our two proposed permutation tests to existing testing procedures, we also included a test procedure based on repeated comparisons of path coefficients (RCPC), i.e., the path coefficients are compared across all groups (Chin, 2003; Chin and Dibbern, 2010).
Similar to the previous literature on MGA (e.g. Qureshi and Compeau, 2009), we used a structural population model with four constructs modeled as composites (Figure 1)[2]. All composites consist of three indicators. The population weights to form composites C2 to C4 are set to 0.3, 0.5 and 0.6. The weights of C1 and the path coefficient β41 vary according to the following five scenarios, in which we compare three groups: (i) the groups are homogeneous; (ii) the groups have small differences among their structural models; (iii) the groups have moderate differences among their structural models; (iv) the population weights of the first composite vary across groups; and (v) in addition to the previous scenario, the structural models also show small differences across groups. Table II presents the manipulated population parameters.
Table II. Population parameters
| Scenario | Dg | De | g | β41 | w11 | w21 | w31 |
|---|---|---|---|---|---|---|---|
| (i) Homogeneity | 0 | 0 | 1 | 0 | 0.30 | 0.50 | 0.60 |
| | | | 2 | 0 | 0.30 | 0.50 | 0.60 |
| | | | 3 | 0 | 0.30 | 0.50 | 0.60 |
| (ii) Small structural difference | 0.0471 | 0.0133 | 1 | 0 | 0.30 | 0.50 | 0.60 |
| | | | 2 | 0.1 | 0.30 | 0.50 | 0.60 |
| | | | 3 | 0.2 | 0.30 | 0.50 | 0.60 |
| (iii) Moderate structural differences | 0.3293 | 0.0266 | 1 | 0 | 0.60 | 0.50 | 0.30 |
| | | | 2 | 0.2 | 0.30 | 0.50 | 0.60 |
| | | | 3 | 0.4 | 0.30 | 0.50 | 0.60 |
| (iv) Different weights^a | 0.2576 | 0.0337 | 1 | 0 | 0.60 | 0.50 | 0.30 |
| | | | 2 | 0 | 0.80 | 0.30 | 0.30 |
| | | | 3 | 0 | 0.38 | 0.38 | 0.66 |
| (v) Structural differences and different weights^a | 0.3138 | 0.0409 | 1 | 0 | 0.60 | 0.50 | 0.30 |
| | | | 2 | 0.1 | 0.80 | 0.30 | 0.30 |
| | | | 3 | 0.2 | 0.38 | 0.38 | 0.66 |
Notes: Group (g); average geodesic distance (Dg); average squared Euclidean distance (De); ^a weights are rounded (two digits)
Furthermore, we varied the sample size per group from 100 to 500 observations in steps of 100. Finally, to investigate the robustness of our tests, we consider normally and non-normally distributed samples. To generate the non-normal data, we multiplied the samples drawn from the multivariate standard normal distribution by a scale factor, as proposed by Dijkstra and Henseler (2015b), leading to a kurtosis of approximately 1.74. We expect the tests to perform slightly worse with non-normally distributed data than with normally distributed data. In total, we consider 50 experimental designs (5 scenarios × 2 distributions × 5 sample sizes). For each design, we conduct 300 runs.
For consistent estimation of the weights, we employed Mode B (Dijkstra, 1981). To obtain the reference distribution of the two test statistics, we used 499 permutation runs. The simulation was implemented in the statistical programming environment R (version 3.4.0; R Core Team, 2017), using the mvrnorm function of the MASS package (version 7.3-47; Ripley et al., 2017) to draw data from the multivariate normal distribution and the matrixpls package (version 1.0.5; Rönkkö, 2017) to estimate the specified model, which has the same structure as the population models.
4.2 Simulation results
The rejection rates of the two permutation tests and of the RCPC procedure are displayed in Table III.
Table III. Rejection rates
| Scenario | n per group | Normal data: Dg (%) | Normal data: De (%) | Normal data: RCPC (%) | Non-normal data: Dg (%) | Non-normal data: De (%) | Non-normal data: RCPC (%) |
|---|---|---|---|---|---|---|---|
| (i) Homogeneity | 100 | 5.7 | 7.0 | 50.3 | 5.0 | 7.3 | 51.7 |
| | 200 | 5.0 | 4.3 | 51.0 | 3.0 | 2.7 | 53.3 |
| | 300 | 2.3 | 4.7 | 52.7 | 6.7 | 5.7 | 54.3 |
| | 400 | 4.7 | 4.7 | 55.0 | 4.3 | 2.3 | 47.0 |
| | 500 | 4.0 | 6.3 | 48.0 | 4.7 | 5.0 | 56.0 |
| (ii) Small structural difference | 100 | 12.3 | 9.0 | 68.7 | 8.7 | 4.7 | 64.7 |
| | 200 | 19.7 | 11.7 | 77.0 | 16.7 | 15.3 | 71.3 |
| | 300 | 32.0 | 22.7 | 84.7 | 22.7 | 18.3 | 78.7 |
| | 400 | 45.7 | 28.3 | 91.0 | 30.7 | 24.3 | 86.7 |
| | 500 | 56.7 | 36.7 | 96.3 | 41.0 | 33.7 | 92.3 |
| (iii) Moderate structural differences | 100 | 70.3 | 25.3 | 91.3 | 41.0 | 18.7 | 85.7 |
| | 200 | 99.0 | 59.3 | 99.7 | 84.3 | 46.7 | 97.0 |
| | 300 | 100.0 | 86.3 | 100.0 | 99.0 | 76.7 | 100.0 |
| | 400 | 100.0 | 96.3 | 100.0 | 99.7 | 86.3 | 100.0 |
| | 500 | 100.0 | 99.3 | 100.0 | 100.0 | 96.3 | 100.0 |
| (iv) Different weights | 100 | 54.3 | 59.0 | 52.7 | 37.3 | 42.7 | 56.3 |
| | 200 | 97.7 | 97.0 | 58.0 | 85.3 | 83.7 | 58.0 |
| | 300 | 100.0 | 100.0 | 62.3 | 99.0 | 99.3 | 60.3 |
| | 400 | 100.0 | 100.0 | 62.0 | 99.7 | 99.7 | 59.3 |
| | 500 | 100.0 | 100.0 | 63.0 | 100.0 | 100.0 | 62.0 |
| (v) Structural differences and different weights | 100 | 71.7 | 70.0 | 64.3 | 51.7 | 56.3 | 61.3 |
| | 200 | 99.7 | 99.0 | 80.7 | 93.0 | 94.7 | 72.3 |
| | 300 | 100.0 | 100.0 | 89.7 | 100.0 | 100.0 | 82.0 |
| | 400 | 100.0 | 100.0 | 93.3 | 100.0 | 100.0 | 89.3 |
| | 500 | 100.0 | 100.0 | 97.0 | 100.0 | 100.0 | 95.3 |
Homogeneity (scenario (i))
The degree of specificity is shown in the first rows of Table III (“Homogeneity”). For this scenario, our two tests maintain the predefined significance level of 5 percent quite well, while the repeated comparison testing procedure rejects the null hypothesis of no group differences far too often (at least 47.0 percent in every condition).
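When judging how close an estimated rejection rate is to the nominal level, it helps to keep the Monte Carlo error of 300 runs in mind. The following sketch computes the standard error of a rejection rate under the null hypothesis and an approximate 95 percent range; the normal approximation and the factor 1.96 are assumptions of this illustration and are not part of the simulation design.

```python
import math


def mc_standard_error(rate: float, n_runs: int) -> float:
    """Standard error of a proportion estimated from n_runs independent runs."""
    return math.sqrt(rate * (1.0 - rate) / n_runs)


alpha, n_runs = 0.05, 300
se = mc_standard_error(alpha, n_runs)
lower, upper = alpha - 1.96 * se, alpha + 1.96 * se
print(round(100 * se, 2), round(100 * lower, 2), round(100 * upper, 2))
```

Under these assumptions, estimated rejection rates between roughly 2.5 and 7.5 percent are compatible with a true rate of 5 percent, whereas rates around 50 percent are far outside this range.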
Structural differences (scenarios (ii) and (iii))
Concerning small structural differences, both new tests are limited in terms of their rejection rates. In most of the conditions, the rejection rates are below the recommended threshold of 80 percent (Cohen, 1988). For moderate structural differences, both permutation tests reliably detect differences in most of the conditions. However, the repeated comparison test produces even higher rejection rates.
Different weights (scenarios (iv) and (v))
The results also confirm that our approach is able to detect heterogeneity in groups with different weights. In situations where only the weights differ, both new tests perform quite well, i.e., they produce high rejection rates and outperform the RCPC approach in most conditions. It is notable that the new tests perform even better if both the structural model and the weights differ across groups.
Sample size and data distribution
Across all conditions, all tests benefit from an increasing sample size, which results in a higher statistical power. Moreover, our results confirm that all tests perform slightly worse once the data are non-normally distributed. However, with a sufficiently large sample size, heterogeneity across groups can still be detected. With regard to moderate structural differences, 200 observations per group are necessary to obtain a sufficient degree of power (Dg: 99.0 percent) for normally distributed data. For non-normal data, 300 observations per group are necessary to achieve a similar level of statistical power (Dg: 99.0 percent).
Summary
Overall, the test based on the average geodesic distance produces higher rejection rates than the test based on the average squared Euclidean distance. The highest rejection rates for structural differences are produced for scenario (iii), i.e., moderate differences in the structural model across groups. Here, the test based on the average geodesic distance provides acceptable results, even if the sample size is small (n=200; 99.0 percent). In contrast, the test based on the average squared Euclidean distance detects heterogeneity in only 59.3 percent of the cases.
5. Discussion
Despite the prevalence of heterogeneity in the social sciences, a test to compare a whole model across groups has not been available thus far. To allow for such a comparison, this study contributes to the existing literature by proposing two novel permutation tests based on the average geodesic distance (Dg) and the average squared Euclidean distance (De).
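The logic of such a permutation test can be illustrated with a minimal sketch. Several points in it are assumptions made for illustration and are not taken from this section: the sample correlation matrix stands in for the model-implied indicator correlation matrix (which would require a PLS-PM estimation per group), the two distances are written in commonly used forms (half the sum of squared element-wise differences, and half the sum of squared log-eigenvalues of A⁻¹B), and the p-value uses the common (count + 1)/(permutations + 1) convention. All function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def squared_euclidean_distance(a, b):
    """Assumed form: half the sum of squared element-wise differences."""
    return 0.5 * np.sum((a - b) ** 2)


def geodesic_distance(a, b):
    """Assumed form: half the sum of squared log-eigenvalues of inv(a) @ b."""
    # With a = L L^T, the symmetric matrix inv(L) @ b @ inv(L).T has the
    # same eigenvalues as inv(a) @ b, which keeps the computation real.
    l_inv = np.linalg.inv(np.linalg.cholesky(a))
    eigenvalues = np.linalg.eigvalsh(l_inv @ b @ l_inv.T)
    return 0.5 * np.sum(np.log(eigenvalues) ** 2)


def average_distance(groups, distance):
    """Average pairwise distance between the groups' correlation matrices.

    ASSUMPTION: the sample correlation matrix is used as a stand-in for the
    model-implied indicator correlation matrix.
    """
    matrices = [np.corrcoef(g, rowvar=False) for g in groups]
    return np.mean([distance(a, b) for a, b in combinations(matrices, 2)])


def permutation_test(groups, distance, n_perm=499, seed=0):
    """Return the observed average distance and a permutation p-value."""
    rng = np.random.default_rng(seed)
    sizes = [len(g) for g in groups]
    pooled = np.vstack(groups)
    observed = average_distance(groups, distance)
    exceed = 0
    for _ in range(n_perm):
        # Randomly reassign observations to groups of the original sizes.
        shuffled = pooled[rng.permutation(len(pooled))]
        parts = np.split(shuffled, np.cumsum(sizes)[:-1])
        if average_distance(parts, distance) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    group_1 = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)
    group_2 = rng.multivariate_normal([0, 0], [[1, 0.2], [0.2, 1]], size=200)
    for name, dist in [("Dg", geodesic_distance), ("De", squared_euclidean_distance)]:
        stat, p_value = permutation_test([group_1, group_2], dist, n_perm=199)
        print(name, round(float(stat), 4), p_value)
```

Because the group labels are exchangeable under the null hypothesis of no group differences, the permuted distances serve as the reference distribution for the observed distance.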
Our simulation study provides initial evidence that the two tests are viable options when the aim is to detect heterogeneity across multiple groups. Most importantly, the two proposed tests are capable of maintaining a predefined significance level. Hence, homogenous groups are falsely classified as heterogeneous only about as often as the predefined significance level permits. This is a major advantage over the RCPC, which yields an inflated FWER when not adjusted properly.
Based on the power results, the test based on the average geodesic distance is superior across the considered conditions. In particular, in situations with only small differences across groups, Dg outperforms De. As expected, an increasing sample size improves the power of all tests. Our simulation results also indicate that 100 observations are not sufficient for acceptable power. Instead, 200 observations are required to exceed the recommended threshold of 80 percent. This is in line with previous studies that highlighted the requirement of a sufficient sample size to detect heterogeneity (Qureshi and Compeau, 2009).
As indicated by the results of the variation in population composite weights, our approach is also able to detect differences within a composite model across groups. This highlights a major strength of this generic approach because it can be used for different purposes and is not limited to structural differences only. However, to apply MGA, it is important to establish measurement invariance in advance. Otherwise, an MGA may yield misleading or incorrect conclusions (Henseler et al., 2016). Therefore, we recommend applying the presented tests after measurement invariance is established.
Against this background, current MGA guidelines in the context of PLS-PM need to be extended. If measurement invariance is established, we recommend initiating MGA by providing the results of one of the proposed tests, preferably the test based on the average geodesic distance. Only if there is a significant difference in the model-implied indicator correlation matrices across groups should existing techniques that allow for the investigation of single effects be applied. In fact, if a grouping variable does not lead to significant differences between groups, a researcher should either reject heterogeneity or respecify the grouping variable before conducting further analyses. Therefore, we propose new MGA guidelines that comprise our proposed tests and existing MGA procedures and consist of the following four steps displayed in Figure 2:
Establish measurement invariance: before conducting an MGA, a researcher should establish measurement invariance (Henseler et al., 2016). Otherwise, an MGA is not meaningful. If measurement invariance is established, the subsequent steps can be applied to test for heterogeneity.
Overall evaluation: testing differences across all groups is considered the starting point for MGA. With this initial test, a researcher is able to determine whether groups differ significantly. This initial effort is particularly important when more than two groups are considered. If the test does not support heterogeneity, a researcher can either reject heterogeneity or respecify the grouping variable.
Pair-wise evaluation: if heterogeneity was found in the previous step, the purpose of this step is to investigate the heterogeneity in more detail. For that purpose, the proposed tests can be used for each pair of groups to examine which groups actually differ.
Effect-wise evaluation: finally, the differences are investigated with regard to specific coefficients such as path coefficients. For that purpose, researchers can draw from parametric approaches (Chin, 2000; Keil et al., 2000) or non-parametric approaches (Henseler, 2012). As a result, a researcher can determine whether there are group differences with respect to a specific effect.
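The gating logic of the overall and pair-wise evaluation steps can be sketched as follows. This is only an illustration under stated assumptions: measurement invariance is taken as already established, `stepwise_mga` is a hypothetical name, and `test` is a placeholder for any function that takes a list of group data sets and returns a p-value (for instance, one of the proposed permutation tests). The effect-wise evaluation is left to the existing procedures cited above.

```python
from itertools import combinations


def stepwise_mga(groups, test, alpha=0.05):
    """Run the overall test and, only if it is significant, all pair-wise tests.

    groups: dict mapping a group label to that group's data set.
    test:   hypothetical callable; takes a list of data sets, returns a p-value.
    """
    # Overall evaluation: compare all groups at once.
    overall_p = test(list(groups.values()))
    result = {"overall_p": overall_p, "pairwise_p": {}}

    # No significant difference: stop here (reject heterogeneity or
    # respecify the grouping variable before any further analysis).
    if overall_p >= alpha:
        return result

    # Pair-wise evaluation: apply the same test to each pair of groups.
    for label_a, label_b in combinations(groups, 2):
        result["pairwise_p"][(label_a, label_b)] = test(
            [groups[label_a], groups[label_b]]
        )
    return result


if __name__ == "__main__":
    # Stub test that always reports a small p-value, for demonstration only.
    demo = stepwise_mga({"g1": [1], "g2": [2], "g3": [3]}, lambda data: 0.01)
    print(demo)
```

Keeping the test as an argument means the same workflow applies whether the geodesic or the squared Euclidean distance is used.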
6. Limitations and outlook
This paper presents initial insights into the efficacy of the tests for the comparison of the model-implied indicator correlation matrices across groups, while other questions remain unanswered and should be addressed by future research. The simulation study should be extended to further investigate the tests’ performance. Important extensions include the consideration of unequal group sizes, a population model with a non-saturated structural model and path coefficients with positive and negative signs. Moreover, since PLS-PM in its current form, i.e., PLSc, is a composite-based estimator that can be used to estimate models containing both composites and factors, future research could further investigate the performance of our proposed tests for that type of model. We argue that the permutation tests should be based on the model-implied indicator correlation matrix to compare the whole model across groups. However, the permutation tests may also be based on other matrices, such as the model-implied construct correlation matrix, so that only differences in the structural model are investigated across groups. We chose to utilize two established distance measures, namely, the squared Euclidean distance and the geodesic distance. Since the literature provides several other distance measures that may also be useful (Deza and Deza, 2016), future research could compare their performance in the context of our proposed testing procedure. Because PLSc (Dijkstra and Henseler, 2015a) encourages the use of PLS-PM for factor models, it also seems necessary to compare the tests’ performance to the performance of tests typically used in factor-based SEM. Finally, although not explicitly emphasized, our tests for multigroup comparison are of a confirmatory nature. Hence, they should be used with caution when applied to groups created by cluster analysis or similar techniques.
Michael Klesel and Björn Niehaves acknowledge the support provided by the German Federal Ministry of Education and Research (BMBF, promotional reference 02L14A011).
Appendix. FWER in multiple comparison scenarios
Assuming that we have a fixed number of groups with a fixed number of parameters, the total number of comparisons can be determined according to Equation (1). Performing multiple comparisons without correction for Type I errors results in an FWER of $1-(1-\alpha)^{c}$, where $c$ is the total number of comparisons and $\alpha$ is the predefined significance level for each comparison.
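A short numerical sketch shows how quickly the uncorrected FWER grows. The FWER formula is the one stated above; the helper `number_of_comparisons` is an assumption, since Equation (1) is not reproduced here: it supposes that every parameter is compared once for every pair of groups. Both function names are hypothetical.

```python
def number_of_comparisons(n_groups, n_parameters):
    """ASSUMED form of Equation (1): each parameter is compared per group pair."""
    n_pairs = n_groups * (n_groups - 1) // 2
    return n_pairs * n_parameters


def familywise_error_rate(alpha, c):
    """FWER of c uncorrected comparisons, each at level alpha: 1 - (1 - alpha)^c."""
    return 1.0 - (1.0 - alpha) ** c


if __name__ == "__main__":
    alpha = 0.05
    for n_groups in (2, 3, 4):
        c = number_of_comparisons(n_groups, n_parameters=4)
        print(n_groups, c, round(familywise_error_rate(alpha, c), 3))
```

Under this assumption, three groups with four parameters already imply 12 comparisons and an uncorrected FWER of roughly 0.46 at a significance level of 0.05.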
Notes
For a better comparison, we consider the model-implied correlation matrix of the indicators instead of their model-implied covariance matrix.
Since PLS-PM is often applied to models with more than four constructs (Ringle et al., 2012), we also ran the simulation with a larger model that includes eight composites. Since the results were comparable, they are not reported here.


