On proper regression modeling in varieties research
Abstract
One particularly prominent methodological development in linguistics is what has been termed the “quantitative turn”: Not only are more and more studies using statistical tools to explore data and to test hypotheses, the complexity of the statistical methods employed is growing as well. This development is particularly prominent in all kinds of corpus-linguistic studies: 20 years ago chi-squared tests, t-tests, and Pearson’s r reigned supreme, but now more and more corpus studies are using multivariate exploratory tools and, for hypothesis testing, multifactorial predictive modeling techniques, in particular regression models (and, increasingly, tree-based methods). However welcome this development is, it, and especially its pace as well as the fact that few places offer rigorous training in statistical methods, comes with its own risks, chief among them that analytical methods are misapplied, which can lead imprecise, incomplete, or wrong analyses. In this paper, I will revisit a recent regression-analytic study in the research area of English varieties (Parviainen & Fuchs 2018 on clause-final also and only in three Asian Englishes) to:
- highlight in particular three fundamental yet frequent mistakes that it exemplifies;
- discuss why and how each of these mistakes should be addressed;
- reanalyze the data (as far as is possible with what is available) and show briefly how that affects the analysis’s results and interpretation.
1 Parviainen & Fuchs (2018): what is explored and how
Their analyses are based on the private-conversation parts of the studied components, for which relevant metadata for the speakers are available. They retrieved all instances of clause-final also/only from these corpus parts and put together a data frame shown here as Table 1 (their Appendix 1 with two cosmetic changes and my bolding to be explained below):
| CASE | PARTICLE | VARIETY | GENDER | AGE | SPKRS | TOKENS | WORDS | TMWpaper | 
|---|---|---|---|---|---|---|---|---|
| C001 | also | HKE | female | 14–25 | 84 | 45 | 107796 | 417 | 
| C002 | also | HKE | female | 26–35 | 3 | 0 | 5304 | 0 | 
| C003 | also | HKE | female | 36–50 | 6 | 0 | 5888 | 0 | 
| C004 | also | HKE | female | >50 | 0 | 0 | 0 | 0 | 
| C005 | also | HKE | male | 14–25 | 15 | 5 | 16599 | 301 | 
| C006 | also | HKE | male | 26–35 | 2 | 0 | 2892 | 0 | 
| C007 | also | HKE | male | 36–50 | 4 | 0 | 3204 | 0 | 
| C008 | also | HKE | male | >50 | 1 | 0 | 902 | 0 | 
| C009 | also | IndE | female | 14–25 | 52 | 74 | 43410 | 1705 | 
| C010 | also | IndE | female | 26–35 | 33 | 33 | 29456 | 1120 | 
| C011 | also | IndE | female | 36–50 | 27 | 40 | 25282 | 1582 | 
| C012 | also | IndE | female | >50 | 9 | 11 | 6757 | 1628 | 
| C013 | also | IndE | male | 14–25 | 18 | 14 | 14236 | 983 | 
| C014 | also | IndE | male | 26–35 | 26 | 24 | 25082 | 957 | 
| C015 | also | IndE | male | 36–50 | 37 | 31 | 34771 | 892 | 
| C016 | also | IndE | male | >50 | 37 | 35 | 31927 | 1096 | 
| C017 | also | PhiE | female | 14–25 | 170 | 37 | 93197 | 397 | 
| C018 | also | PhiE | female | 26–35 | 78 | 9 | 25083 | 359 | 
| C019 | also | PhiE | female | 36–50 | 54 | 6 | 11674 | 514 | 
| C020 | also | PhiE | female | >50 | 0 | 2 | 3613 | 554 | 
| C021 | also | PhiE | male | 14–25 | 69 | 6 | 32386 | 185 | 
| C022 | also | PhiE | male | 26–35 | 75 | 5 | 17283 | 289 | 
| C023 | also | PhiE | male | 36–50 | 106 | 6 | 6190 | 969 | 
| C024 | also | PhiE | male | >50 | 1 | 0 | 3937 | 0 | 
| C025 | only | HKE | female | 14–25 | 84 | 29 | 107796 | 269 | 
| C026 | only | HKE | female | 26–35 | 3 | 2 | 5304 | 377 | 
| C027 | only | HKE | female | 36–50 | 6 | 0 | 5888 | 0 | 
| C028 | only | HKE | female | >50 | 0 | 0 | 0 | 0 | 
| C029 | only | HKE | male | 14–25 | 15 | 1 | 16599 | 60 | 
| C030 | only | HKE | male | 26–35 | 2 | 0 | 2892 | 0 | 
| C031 | only | HKE | male | 36–50 | 4 | 0 | 3204 | 0 | 
| C032 | only | HKE | male | >50 | 1 | 0 | 902 | 0 | 
| C033 | only | IndE | female | 14–25 | 52 | 41 | 43410 | 944 | 
| C034 | only | IndE | female | 26–35 | 33 | 20 | 29456 | 679 | 
| C035 | only | IndE | female | 36–50 | 27 | 17 | 25282 | 672 | 
| C036 | only | IndE | female | >50 | 9 | 0 | 6757 | 0 | 
| C037 | only | IndE | male | 14–25 | 18 | 17 | 14236 | 1194 | 
| C038 | only | IndE | male | 26–35 | 26 | 14 | 25082 | 558 | 
| C039 | only | IndE | male | 36–50 | 37 | 10 | 34771 | 288 | 
| C040 | only | IndE | male | >50 | 37 | 7 | 31927 | 219 | 
| C041 | only | PhiE | female | 14–25 | 170 | 12 | 93197 | 129 | 
| C042 | only | PhiE | female | 26–35 | 78 | 1 | 25083 | 40 | 
| C043 | only | PhiE | female | 36–50 | 54 | 1 | 11674 | 86 | 
| C044 | only | PhiE | female | >50 | 0 | 0 | 3613 | 0 | 
| C045 | only | PhiE | male | 14–25 | 69 | 0 | 32386 | 0 | 
| C046 | only | PhiE | male | 26–35 | 78 | 1 | 17283 | 58 | 
| C047 | only | PhiE | male | 36–50 | 106 | 0 | 6190 | 0 | 
| C048 | only | PhiE | male | >50 | 1 | 1 | 3937 | 254 | 
summary(m.final.also <- stepAIC(
   lm(TMW ~ 1, data=dd$also),
   # bidirectional model selection ...
   direction="both",
   # between these two extremes: lower null & upper full model:
   scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY),
   trace=0)) # do not show all steps
Call:
lm(formula = TMW ~ VARIETY + GENDER + VARIETY:GENDER, data = dd$also)
Residuals:
    Min      1Q  Median      3Q     Max
-388.46  -98.86  -65.25  101.85  608.34
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              1508.8      117.4  12.856 1.65e-10 ***
VARIETYHKE              -1404.4      166.0  -8.462 1.09e-07 ***
VARIETYPhiE             -1052.9      166.0  -6.344 5.61e-06 ***
GENDERmale               -526.8      166.0  -3.174  0.00526 **
VARIETYHKE:GENDERmale     497.7      234.7   2.120  0.04813 *
VARIETYPhiE:GENDERmale    431.9      234.7   1.840  0.08232 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 234.7 on 18 degrees of freedom
Multiple R-squared:  0.8635,    Adjusted R-squared:  0.8256
F-statistic: 22.78 on 5 and 18 DF,  p-value: 3.33e-07summary(m.final.only <- stepAIC(
   lm(TMW ~ 1, data=dd$only), direction="both",
   scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY),
   trace=0))
Call:
lm(formula = TMW ~ VARIETY + AGE + VARIETY:AGE, data = dd$only)
Residuals:
    Min      1Q  Median      3Q     Max
-192.41  -74.38    0.00   74.38  192.41
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            1069.3      106.8  10.011 3.54e-07 ***
VARIETYHKE             -904.7      151.1  -5.989 6.32e-05 ***
VARIETYPhiE           -1004.9      151.1  -6.653 2.35e-05 ***
AGE26–35               -450.7      151.1  -2.984 0.011401 *
AGE36–50               -589.3      151.1  -3.901 0.002105 **
AGE>50                 -959.7      151.1  -6.353 3.65e-05 ***
VARIETYHKE:AGE26–35     474.6      213.6   2.222 0.046281 *
VARIETYPhiE:AGE26–35    435.2      213.6   2.037 0.064288 .
VARIETYHKE:AGE36–50     424.7      213.6   1.988 0.070118 .
VARIETYPhiE:AGE36–50    567.8      213.6   2.658 0.020877 *
VARIETYHKE:AGE>50       795.1      213.6   3.722 0.002917 **
VARIETYPhiE:AGE>50     1022.3      213.6   4.786 0.000444 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 151.1 on 12 degrees of freedom
Multiple R-squared:  0.8935,    Adjusted R-squared:  0.7958
F-statistic: 9.149 on 11 and 12 DF,  p-value: 0.00030062 Problems of the hypotheses/operationalizations
2.1 Lack of precision
2.2 What do between-variety frequencies reveal?
3 Problems of the statistical modeling
3.1 A few minor mistakes
3.1.1 Errors in the data
   CASE PARTICLE VARIETY GENDER   AGE SPKRS TOKENS WORDS       TMW
22 C022     also    PhiE   male 26–35    75      5 17283 289.30163
46 C046     only    PhiE   male 26–35    78      1 17283  57.86033   CASE PARTICLE VARIETY GENDER AGE SPKRS TOKENS WORDS      TMW
20 C020     also    PhiE female >50     0      2  3613 553.55663.1.2 Under-reporting and -analyzing
3.1.3 Over-reporting
3.2 Four fundamental issues
3.2.1 Issue 1: no level-1 predictors
3.2.1.1 General introduction
3.2.1.2 Application to P&F
3.2.2 Issue 2: two models on one data set
3.2.2.1 General introduction
3.2.2.2 Application to P&F
summary(m.final.all <- stepAIC(
   # note the new data argument: all the data from both particles now
   lm(TMW ~ 1, data=d),
   # bidirectional model selection ...
   direction="both",
   # between these 2 extremes: lower null & the new upper/full model:
   scope=list(lower= ~1, upper= ~AGE*GENDER*VARIETY*PARTICLE),
   trace=0)) # do not show all steps3.2.3 Issue 3: not weighting observations
3.2.3.1 General introduction
3.2.3.2 Application to P&F
  CASE VARIETY GENDER   AGE SPKRS TOKENS  WORDS      TMW
1 C001     HKE female 14–25    84     45 107796 417.4552
3 C003     HKE female 36–50     6      0   5888   0.0000
5 C005     HKE   male 14–25    15      5  16599 301.22303.2.4 Issue 4: potentially wrong regression model
4 Re-analyzing their data
4.1 Methods
# this could be fitted on all level-1 uses of also/only, clause-final or not:
m.required <- glmer(
   CLAUSEFINAL ~                # binary response: no vs. yes
   1                          + # intercept
   PARTICLE*AGE*SEX*VARIETY   + # predictors & interactions
   # minimally necessary random-effects structure (more is likely needed)
   (1|SPEAKER),                 # speakers
   family=binomial,             # the response is binary
   data=unaggregated.dataframe) # a new unaggregated data of all also/only# Here are the contrast-coding changes:
# expanding the data
d.expanded <- as.data.frame( # make d.expanded a data frame of
   lapply(d[,-6],            # applying to d (w/out column 6)
   # an anonym. function that repeats each value in each column SPKRS times:
   \(af) rep(af, d[,6])))
# making AGE 'more ordinal'
contrasts(d.expanded$AGE) <- contr.sdif
# planned contrasts for VARIETY
ind_vs_other <- c(2/3, -1/3, -1/3)
hke_vs_phi   <- c(0  ,  1/2, -1/2)
contrasts(d.expanded$VARIETY) <- cbind(ind_vs_other, hke_vs_phi)
Call:
lm(formula = TMW ~ 1 + PARTICLE * (AGE + GENDER + VARIETY)^2,
    data = d.expanded)
Residuals:
     Min       1Q   Median       3Q      Max
-167.840  -26.955    4.813   23.663  240.163
Coefficients:
                                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      622.757     11.746  53.019 < 2e-16 ***
PARTICLEonly                                    -366.174     16.611 -22.044 < 2e-16 ***
AGE26–35-14–25                                  -329.911     13.127 -25.132 < 2e-16 ***
AGE36–50-26–35                                   115.089     15.701   7.330 3.47e-13 ***
AGE>50-36–50                                    -125.960     44.433  -2.835 0.00464 **
GENDERmale                                      -140.878      9.435 -14.932 < 2e-16 ***
VARIETYind_vs_other                             1311.489     16.620  78.909 < 2e-16 ***
VARIETYhke_vs_phi                               -259.573     29.130  -8.911 < 2e-16 ***
AGE26–35-14–25:GENDERmale                        247.732     13.166  18.815 < 2e-16 ***
AGE36–50-26–35:GENDERmale                        207.769     13.876  14.973 < 2e-16 ***
AGE>50-36–50:GENDERmale                         -217.169     29.864  -7.272 5.28e-13 ***
AGE26–35-14–25:VARIETYind_vs_other              -163.289     21.384  -7.636 3.63e-14 ***
AGE36–50-26–35:VARIETYind_vs_other               -43.210     24.007  -1.800 0.07205 .
AGE>50-36–50:VARIETYind_vs_other                 583.758     54.798  10.653 < 2e-16 ***
AGE26–35-14–25:VARIETYhke_vs_phi                -359.560     34.136 -10.533  < 2e-16 ***
AGE36–50-26–35:VARIETYhke_vs_phi                -425.201     40.348 -10.538  < 2e-16 ***
AGE>50-36–50:VARIETYhke_vs_phi                   739.750    105.090   7.039 2.75e-12 ***
GENDERmale:VARIETYind_vs_other                  -590.630     14.789 -39.936 < 2e-16 ***
GENDERmale:VARIETYhke_vs_phi                       7.251     19.437   0.373 0.70917
PARTICLEonly:AGE26–35-14–25                      193.176     18.563  10.406 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35                     -177.554     22.203  -7.997 2.27e-15 ***
PARTICLEonly:AGE>50-36–50                       -175.959     62.837  -2.800 0.00516 **
PARTICLEonly:GENDERmale                           95.218     13.340   7.138 1.38e-12 ***
PARTICLEonly:VARIETYind_vs_other                -861.414     23.504 -36.649 < 2e-16 ***
PARTICLEonly:VARIETYhke_vs_phi                   319.783     41.195   7.763 1.39e-14 ***
PARTICLEonly:AGE26–35-14–25:GENDERmale          -216.984     18.583 -11.676 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35:GENDERmale          -345.374     19.591 -17.630 < 2e-16 ***
PARTICLEonly:AGE>50-36–50:GENDERmale             616.900     42.232  14.607 < 2e-16 ***
PARTICLEonly:AGE26–35-14–25:VARIETYind_vs_other -206.081     30.240  -6.815 1.29e-11 ***
PARTICLEonly:AGE36–50-26–35:VARIETYind_vs_other    1.377     33.948   0.041 0.96766
PARTICLEonly:AGE>50-36–50:VARIETYind_vs_other   -962.302     77.496 -12.417 < 2e-16 ***
PARTICLEonly:AGE26–35-14–25:VARIETYhke_vs_phi    427.858     48.272   8.864 < 2e-16 ***
PARTICLEonly:AGE36–50-26–35:VARIETYhke_vs_phi    179.836     57.056   3.152 0.00165 **
PARTICLEonly:AGE>50-36–50:VARIETYhke_vs_phi     -847.091    148.619  -5.700 1.40e-08 ***
PARTICLEonly:GENDERmale:VARIETYind_vs_other      629.971     20.911  30.126 < 2e-16 ***
PARTICLEonly:GENDERmale:VARIETYhke_vs_phi       -135.348     27.487  -4.924 9.26e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 71.97 on 1781 degrees of freedom
Multiple R-squared:  0.9731,    Adjusted R-squared:  0.9725
F-statistic:  1838 on 35 and 1781 DF,  p-value: < 2.2e-16Single term deletions
Model:
TMW ~ 1 + PARTICLE * (AGE + GENDER + VARIETY)^2
                        Df Sum of Sq      RSS   AIC F value    Pr(>F)
<none>                                9224376 15575
PARTICLE:AGE:GENDER      3   5115524 14339900 16371 329.227 < 2.2e-16 ***
PARTICLE:AGE:VARIETY     6   2963271 12187648 16070  95.356 < 2.2e-16 ***
PARTICLE:GENDER:VARIETY  2   5551651 14776027 16428 535.944 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 14.2 Selected results
5 Concluding remarks
6 References
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base
other attached packages:
[1] MASS_7.3-60.2  lme4_1.1-35.3  Matrix_1.7-0   effects_4.2-2  car_3.1-2
[6] carData_3.0-5  STGmisc_1.0    Rcpp_1.0.12    magrittr_2.0.3
loaded via a namespace (and not attached):
 [1] nlme_3.1-164       cli_3.6.2          knitr_1.46         rlang_1.1.3
 [5] xfun_0.44          estimability_1.5.1 DBI_1.2.2          mitools_2.4
 [9] survey_4.4-2       minqa_1.2.7        jsonlite_1.8.8     colorspace_2.1-0
[13] nnet_7.3-19        htmltools_0.5.8.1  rmarkdown_2.27     grid_4.4.0
[17] evaluate_0.23      abind_1.4-5        fastmap_1.2.0      yaml_2.3.8
[21] insight_0.19.11    htmlwidgets_1.6.4  rstudioapi_0.16.0  lattice_0.22-6
[25] digest_0.6.35      nloptr_2.0.3       splines_4.4.0      tools_4.4.0
[29] boot_1.3-30        survival_3.6-4