Ling 104, session 05: distr. & freq. (key)

Author

Affiliation

UC Santa Barbara & JLU Giessen

Published

04 Feb 2025 12-34-56

Load the package magrittr and the file _input/partplacement.csv into a data frame d. This file contains data from a corpus study on the alternation of particle placement that was introduced in Section 1.3; you can find information about this data set in _input/partplacement.r.

rm(list=ls(all=TRUE)); library(magrittr)
summary(d <- read.delim(
   "_input/partplacement.csv",
   stringsAsFactors=TRUE))

      CASE          CONSTRUCTION     MEDIUM       DO_COMPLX     DO_LENSYLL
 Min.   :  1.00   v_do_prt:100   spoken :100   clausmod:  6   Min.   : 1.00
 1st Qu.: 50.75   v_prt_do:100   written:100   phrasmod: 67   1st Qu.: 2.00
 Median :100.50                                simple  :127   Median : 3.00
 Mean   :100.50                                               Mean   : 4.72
 3rd Qu.:150.25                                               3rd Qu.: 6.00
 Max.   :200.00                                               Max.   :31.00
      DO_ANIM        DO_CONC      PP
 animate  : 27   abstract: 95   no :167
 inanimate:173   concrete:105   yes: 33

1 Exercise 01

Question: Across all verb-particle constructions, are abstract and concrete DOs equally frequent? (This might be interesting because of the diachrony of these constructions as well as because of how children might learn from their input what verb-particle constructions are used for in general.)

1.1 Hypotheses

The

dependent/response variable is DO_CONC;
independent/predictor variable is none because we are not considering any other variables as ‘determining’ the behavior of DO_CONC.

What are the hypotheses?

text hypotheses:
- H₁: The frequencies of abstract and concrete DOs in verb-particle constructions differ;
- H₀: The frequencies of abstract and concrete DOs in verb-particle constructions don’t differ;
statistical hypotheses:
- H₁: Χ²>0;
- H₀: Χ²=0.

1.2 Descriptive stats/visualization

We first describe the data:

(con <- table(d$DO_CONC))


abstract concrete
      95      105

prop.table(con)


abstract concrete
   0.475    0.525

dotchart(main="The frequency distribution of DO_CONC", # the main heading
   xlab="Observed frequency", xlim=c(0, nrow(d)), # x-axis stuff
   ylab="Concreteness of DO", # y-axis label
   x=con, pch=16)             # what's to be plotted & with what
abline(v=mean(con), lty=2) # vertical line at H0
text(con, seq(con), # text at these coordinates
     con,           # the frequencies
     pos=c(2,4))    # one to the left of its point, the other to the right

1.3 Statistical testing

For this question, we would prefer to use a chi-squared test for goodness-of-fit:

(con_chisq <- chisq.test(x=con, p=c(0.5, 0.5)))


    Chi-squared test for given probabilities

data:  con
X-squared = 0.5, df = 1, p-value = 0.4795

Clearly not significant, but were we allowed to test it like this? (Of course we were: both expected frequencies are well above 5 …)

con_chisq$expected

abstract concrete
     100      100

What is the effect size of this ns result? It’s tiny, extremely close to 0:

(max_poss_chisq <- sum(con) * (1-0.5)/0.5) # compute max poss chi-squared

[1] 200

con_chisq$statistic/max_poss_chisq

X-squared
   0.0025
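
To see where these numbers come from, the test statistic, its p-value, and the maximally possible chi-squared value can be recomputed by hand. This is only a sketch: the counts are typed in from the table above rather than read from the file, and all object names (obs, p0, exp_freq, etc.) are made up for this illustration.

```r
# observed counts, copied from the table of DO_CONC above
obs <- c(abstract=95, concrete=105)
p0  <- c(0.5, 0.5)          # H0: both levels are equally likely
exp_freq <- sum(obs) * p0   # expected frequencies under H0

# chi-squared = sum of (observed - expected)^2 / expected
chisq_hand <- sum((obs - exp_freq)^2 / exp_freq)
# p-value: upper tail of the chi-squared distribution with k-1 df
p_hand <- pchisq(chisq_hand, df=length(obs)-1, lower.tail=FALSE)

# the most extreme outcome possible: all 200 DOs in one category
extreme   <- c(abstract=0, concrete=200)
chisq_max <- sum((extreme - exp_freq)^2 / exp_freq)

c(chisq=chisq_hand, p=p_hand, max=chisq_max,
  effect=chisq_hand/chisq_max)
```

The last line puts the observed chi-squared value in relation to the largest value the table could have produced, which is what the effect size above expresses.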

1.4 Write-up

In the verb-particle construction data, abstract and concrete DOs were observed 95 and 105 times respectively; the null hypothesis expectation was a uniform distribution. According to a chi-squared test for goodness-of-fit, the observed data do not differ significantly from the null hypothesis (Χ²=0.5, df=1, p=0.4795; effect size=0.0025).

1.5 Excursus

How might one test this using a simulation approach? Like this:

set.seed(123)  # set the seed of the random number generator (for replicability)
collector <- 0 # set a collector value to 0
for (i in 1:10000) { # do something 10K times, namely
   sampled_dist <- sample( # put into sampled_dist the result of sampling
      c("abstr", "conc"),  # from the vector c("abstr", "conc")
      200,                 # 200 elements
      replace=TRUE)        # with replacement; this is the embodiment of the null hypothesis
   collector <-   # make the new version of collector be
      collector + # the old version of collector plus
      # a logical test of whether there are 105+ concrete DOs, like in the real data
      (sum(sampled_dist=="conc")>=105)
}
collector/10000     # 0.2594

[1] 0.2594

2*(collector/10000) # 0.5188

[1] 0.5188

Conclusion: In the verb-particle construction data, abstract and concrete DOs were observed 95 and 105 times respectively; the null hypothesis expectation was a uniform distribution. According to a simulation study (using 10000 random samples) for goodness-of-fit, the observed data do not differ significantly from the null hypothesis (p=0.5188).
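
The loop above can also be written without a loop, and the simulated p-values can be compared with their exact counterparts from the binomial distribution. The following is a sketch under the same null hypothesis (200 draws, each level with probability 0.5); the object names are made up for this illustration, and because rbinom draws its random numbers differently from sample, the simulated values will not be identical to the ones above.

```r
set.seed(123)    # seed chosen arbitrarily, only for replicability
n_sims <- 10000
# each element: the number of concrete DOs in 200 draws under H0
sim_conc <- rbinom(n_sims, size=200, prob=0.5)

# one-tailed: proportion of samples with 105+ concrete DOs
p_sim_one <- mean(sim_conc >= 105)
# two-tailed: samples at least as far from 100 as the observed 105
p_sim_two <- mean(abs(sim_conc - 100) >= 5)

# exact counterparts from the binomial distribution
p_exact_one <- pbinom(104, size=200, prob=0.5, lower.tail=FALSE)
p_exact_two <- binom.test(105, 200, p=0.5)$p.value

c(sim_one=p_sim_one, exact_one=p_exact_one,
  sim_two=p_sim_two, exact_two=p_exact_two)
```

Counting both tails directly (rather than doubling the one-tailed value) gives the same kind of answer here because the null distribution is symmetric around 100.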

2 Exercise 02

Question: Are the lengths of direct objects in all verb-particle constructions normally distributed? (This would mostly be interesting because many other tests one might compute on the lengths would require that.)

2.1 Hypotheses

The

dependent/response variable is DO_LENSYLL;
independent/predictor variable is none because we are not considering any other variables as ‘determining’ the behavior of DO_LENSYLL.

What are the hypotheses?

text hypotheses:
- H₁: The distribution of the DO lengths in the verb-particle constructions differs from normality;
- H₀: The distribution of the DO lengths in the verb-particle constructions doesn’t differ from normality;
statistical hypotheses:
- H₁: D>0;
- H₀: D=0.

2.2 Descriptive stats/visualization

We first describe the data:

summary(d$DO_LENSYLL)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    2.00    3.00    4.72    6.00   31.00

par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
hist(d$DO_LENSYLL, main="")
hist(d$DO_LENSYLL, main="", breaks=16)
plot(ecdf(d$DO_LENSYLL), verticals=TRUE, main="",
     xlab="DO length in syllables",
     ylab="Cumulative %"); grid()
par(mfrow=c(1, 1)) # reset to default

2.3 Statistical testing

For this question, we would use a Lilliefors test:

nortest::lillie.test(d$DO_LENSYLL)


    Lilliefors (Kolmogorov-Smirnov) normality test

data:  d$DO_LENSYLL
D = 0.19409, p-value < 2.2e-16

2.4 Write-up

According to a Lilliefors test for normality, the DO lengths in the verb-particle constructions differ very significantly from a normal distribution (D=0.1941, p<10^-15).
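
What does D measure? It is the largest vertical gap between the empirical cumulative distribution of the data and the cumulative distribution of a normal distribution with the same mean and sd. The sketch below recomputes it by hand. It assumes that the lengths can be rebuilt from the row sums of the frequency table of DO_LENSYLL and DO_COMPLX shown in Section 4.5; the object names are made up for this illustration, and the code follows the textbook definition of the statistic rather than the internals of nortest::lillie.test.

```r
# DO lengths rebuilt from the row sums of the table in Section 4.5
len_values <- c(1:19, 31)
len_freqs  <- c(40, 38, 26, 20, 18, 14, 6, 8, 5, 5,
                 4,  3,  3,  2,  1,  3, 1, 1, 1, 1)
lens <- rep(len_values, len_freqs)
c(n=length(lens), mean=mean(lens)) # should match the summary above

n <- length(lens)
x <- sort(lens)
# cumulative probabilities under a normal w/ the sample mean & sd
p_norm <- pnorm(x, mean=mean(x), sd=sd(x))
# largest gaps between the empirical & the theoretical distribution
d_plus  <- max(seq_len(n)/n - p_norm)     # ecdf above the normal curve
d_minus <- max(p_norm - (seq_len(n)-1)/n) # ecdf below the normal curve
(D <- max(d_plus, d_minus))
```

If the reconstruction is right, D should come out very close to the value reported by the Lilliefors test above; the p-value, by contrast, requires the test's own reference distribution and is not recomputed here.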

2.5 Excursus

How might one explore/represent this using a visual simulation approach?

This would be how to do this with a histogram approach:

# first, we generate the regular histogram like above
hist(d$DO_LENSYLL, main="", breaks=16,
     freq=FALSE) # but we plot densities rather than frequencies
# we add a density curve to that
lines(density(d$DO_LENSYLL), lwd=3)
set.seed(123) # we set the seed of the random number generator &
for (i in 1:100) { # do the following 100 times:
   lines(col="#FF000008", # draw light red lines
      x=density(  # namely density curves
         rnorm(   # for normally distributed values:
            n=length(d$DO_LENSYLL),  # 200 values with
            mean=mean(d$DO_LENSYLL), # the mean of the lengths &
            sd=sd(d$DO_LENSYLL))))   # the sd of the lengths
}

And this would be how to do this with an ecdf plot kind of approach:

# first, we generate the regular ecdf plot like above
plot(ecdf(d$DO_LENSYLL), verticals=TRUE, main="",
     xlab="DO length in syllables",
     ylab="Cumulative %"); grid()
set.seed(123) # we set the seed of the random number generator &
for (i in 1:100) { # do the following 100 times:
   lines(col="#FF000001", # draw light red lines:
         ecdf(rnorm(      # ecdf curve for normally distributed values
            n=length(d$DO_LENSYLL),  # 200 values with
            mean=mean(d$DO_LENSYLL), # the mean of the lengths &
            sd=sd(d$DO_LENSYLL))))   # the sd of the lengths
}

3 Exercise 03

Question: Does the choice of a verb-particle construction correlate with the complexity of the direct object? (This might be interesting (i) because, if that was so, it might be explainable with processing considerations of a more general type (short-before-long) and (ii) because of how children learn the alternation: they, somewhat astonishingly, use the discontinuous variant v_do_prt first, which might be because adults perhaps prefer that order with short DOs (we will check that) so children, who don’t use complex DOs, would then prefer the verb-particle construction that prefers the only kinds of DOs they can already handle.)

3.1 Hypotheses

The

dependent/response variable is CONSTRUCTION;
independent/predictor variable is DO_COMPLX.

What are the hypotheses?

text hypotheses:
- H₁: The frequencies of the two verb-particle constructions differ across the levels of the DO’s complexity;
- H₀: The frequencies of the two verb-particle constructions don’t differ across the levels of the DO’s complexity;
statistical hypotheses:
- H₁: Χ²>0;
- H₀: Χ²=0.

3.2 Descriptive stats/visualization

We first describe the data:

addmargins(com_con <- table(d$DO_COMPLX, d$CONSTRUCTION))


           v_do_prt v_prt_do Sum
  clausmod        1        5   6
  phrasmod        9       58  67
  simple         90       37 127
  Sum           100      100 200

round(prop.table(com_con, 1), 3)


           v_do_prt v_prt_do
  clausmod    0.167    0.833
  phrasmod    0.134    0.866
  simple      0.709    0.291

mosaicplot(x=com_con,         # a mosaic plot of the table (NOT transposed)
   main="", xlab="Complexity", ylab="Construction", # w/ no heading & these axis labels
   col=c("grey35", "grey75")) # w/ these colors

3.3 Statistical testing

For this question, we would prefer to use a chi-squared test for independence:

(com_con_chisq <- chisq.test(x=com_con, correct=FALSE))


    Pearson's Chi-squared test

data:  com_con
X-squared = 60.621, df = 2, p-value = 6.861e-14

Clearly highly significant, but were we allowed to test it like this?

com_con_chisq$expected


           v_do_prt v_prt_do
  clausmod      3.0      3.0
  phrasmod     33.5     33.5
  simple       63.5     63.5

Obviously not: one third of all expected frequencies are smaller than 5, whereas the usual rule of thumb tolerates this for at most 20% of the cells:

sum(com_con_chisq$expected<5) / length(com_con_chisq$expected)

[1] 0.3333333
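
The expected frequencies are easy to recompute by hand, which also makes clear why the clausmod row is the problem: with only 6 cases in that row, neither of its cells can reach an expected frequency of 5. This is a sketch with the table typed in from the output above; the object name expected is made up for this illustration.

```r
# the observed table, typed in from the output above
com_con <- matrix(c(1, 9, 90, 5, 58, 37), ncol=2,
   dimnames=list(c("clausmod", "phrasmod", "simple"),
                 c("v_do_prt", "v_prt_do")))

# expected frequency = row total * column total / grand total
(expected <- outer(rowSums(com_con), colSums(com_con)) / sum(com_con))

mean(expected < 5)  # proportion of cells w/ expected frequencies < 5
```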

If we had been allowed to use the chi-squared test for independence, we would have continued with the residuals and the effect size Cramer’s V:

com_con_chisq$residuals # compute the residuals (still useful, actually, just not for inference)


            v_do_prt  v_prt_do
  clausmod -1.154701  1.154701
  phrasmod -4.232955  4.232955
  simple    3.325516 -3.325516

sqrt(
   com_con_chisq$statistic /               # numerator
   (sum(com_con) * (min(dim(com_con))-1))) # denominator

X-squared
0.5505479
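
Both the residuals and Cramer's V follow directly from the observed and expected frequencies, so they can be recomputed without chisq.test. This is a sketch with the table typed in from the output above; the object names (expected, resid_hand, etc.) are made up for this illustration.

```r
# the observed table, typed in from the output above
com_con <- matrix(c(1, 9, 90, 5, 58, 37), ncol=2,
   dimnames=list(c("clausmod", "phrasmod", "simple"),
                 c("v_do_prt", "v_prt_do")))
expected <- outer(rowSums(com_con), colSums(com_con)) / sum(com_con)

# Pearson residuals: (observed - expected) / sqrt(expected)
(resid_hand <- (com_con - expected) / sqrt(expected))

# chi-squared is the sum of the squared residuals
(chisq_hand <- sum(resid_hand^2))

# Cramer's V: chi-squared relative to its maximum n * (min(rows, cols) - 1)
(cramers_v <- sqrt(chisq_hand /
   (sum(com_con) * (min(dim(com_con)) - 1))))
```

Squaring the residuals shows which cells drive the result: the phrasmod and simple rows contribute nearly all of the chi-squared value, the sparse clausmod row very little.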

But to be really safe, we need to compute the exact test, which here leads to the same conclusion as the chi-squared test (its p-value is even smaller):

fisher.test(com_con)


    Fisher's Exact Test for Count Data

data:  com_con
p-value = 2.925e-15
alternative hypothesis: two.sided
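
Another option for tables with small expected frequencies, in the spirit of the simulation in Section 1.5, is to let chisq.test simulate its p-value. This is not what this key uses, just a sketch of an alternative: the table is typed in from the output above, and the seed and the number of simulations B are arbitrary choices.

```r
# the observed table, typed in from the output above
com_con <- matrix(c(1, 9, 90, 5, 58, 37), ncol=2,
   dimnames=list(c("clausmod", "phrasmod", "simple"),
                 c("v_do_prt", "v_prt_do")))

set.seed(123) # arbitrary seed, only for replicability
# p-value from B random tables with the same row & column totals
(com_con_sim <- chisq.test(com_con,
   simulate.p.value=TRUE, B=10000))
```

Note that a simulated p-value can never be smaller than 1/(B+1), so with B=10000 it can only show that p is at most about 0.0001; it cannot reproduce the very small p-value of the exact test.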

3.4 Write-up

[Show com_con.] The observed data (and the residuals of a chi-squared test for independence) show that simple DOs prefer the construction where the DO precedes the particle whereas both kinds of modified DOs prefer the construction where the DO follows the particle. A chi-squared test for independence indicated that this pattern is highly significant (Χ²=60.621, df=2, p<10^-13; Cramer’s V=0.55). However, 2 of the 6 expected frequencies were less than 5 (both were 3), which is why an additional Fisher-Yates exact test was conducted, which fully confirmed the result of the chi-squared test (p<10^-14).

3.5 Excursus 1

Remember PRE measures (from session 04)? What is the PRE measure (lambda) for this correlation and how does one compute it efficiently?

error_rate_wout_pred <- com_con %>% colSums %>% prop.table %>% max %>% "-"(1, .)
error_rate_with_pred <- apply(com_con, 1, max) %>% sum %>% "/"(sum(com_con)) %>% "-"(1, .)
PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_pred
setNames(
   c(error_rate_wout_pred, error_rate_with_pred, PRE),
   c("Error rate without predictor", "Error rate with predictor", "Proportional reduction of error"))

   Error rate without predictor       Error rate with predictor
                          0.500                           0.235
Proportional reduction of error
                          0.530

With this, the write-up might be changed to this:

[Show com_con.] The observed data (and the residuals of a chi-squared test for independence) show that simple DOs prefer the construction where the DO precedes the particle whereas both kinds of modified DOs prefer the construction where the DO follows the particle. The default of a chi-squared test for independence was not permitted since 2 of the 6 expected frequencies were less than 5 (both were 3), which is why a Fisher-Yates exact test was conducted; according to this test, the correlation between the complexity of the DO and its position relative to the particle is highly significant (p<10^-14) and comes with a high PRE effect size (Goodman & Kruskal’s lambda=0.53).
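
The lambda computation can also be wrapped into a small base-R function, which makes it easy to see that lambda is asymmetric: it depends on which variable predicts which. This is a sketch; the function name lambda_rows_to_cols is made up for this illustration, and the table is typed in from the output in Section 3.2.

```r
# the observed table, typed in from the output in Section 3.2
com_con <- matrix(c(1, 9, 90, 5, 58, 37), ncol=2,
   dimnames=list(c("clausmod", "phrasmod", "simple"),
                 c("v_do_prt", "v_prt_do")))

# lambda for predicting the column variable from the row variable
lambda_rows_to_cols <- function(tab) {
   n <- sum(tab)
   # error rate when always guessing the overall most frequent column
   err_without <- 1 - max(colSums(tab)) / n
   # error rate when guessing the most frequent column within each row
   err_with <- 1 - sum(apply(tab, 1, max)) / n
   (err_without - err_with) / err_without
}

lambda_rows_to_cols(com_con)    # DO_COMPLX -> CONSTRUCTION
lambda_rows_to_cols(t(com_con)) # CONSTRUCTION -> DO_COMPLX
```

Transposing the table swaps the roles of predictor and response, so the second call answers the reverse question of how much knowing the construction helps to predict the complexity of the DO.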

3.6 Excursus 2

Question: Do the data permit us to not even distinguish between phrasally- and clausally-modified DOs? First, they seem to behave ‘the same’ statistically; look at how similar rows 1 and 2 are in this table:

round(prop.table(com_con, 1), 3)


           v_do_prt v_prt_do
  clausmod    0.167    0.833
  phrasmod    0.134    0.866
  simple      0.709    0.291

Second, from a PRE perspective, distinguishing between phrasally- and clausally-modified DOs certainly makes no difference, given that for each we would always predict v_prt_do.

Finally, it also makes linguistic/conceptual sense to conflate them because they are the two ‘modified’ levels – we would not consider conflation if simple had been statistically very similar to clausmod.

With the right kind of method – regression modeling – you would find out that Occam’s razor says to conflate the two kinds of modified DOs because distinguishing them makes no significant contribution (p=0.8296547).
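
The PRE argument can be checked directly by conflating the two rows and counting correct predictions before and after. This is only a sketch of that one argument, not of the regression-based test mentioned above; the table is typed in from the output in Section 3.2, and the object name merged and the row label modified are made up for this illustration.

```r
# the observed table, typed in from the output in Section 3.2
com_con <- matrix(c(1, 9, 90, 5, 58, 37), ncol=2,
   dimnames=list(c("clausmod", "phrasmod", "simple"),
                 c("v_do_prt", "v_prt_do")))

# add up the two 'modified' rows, keep 'simple' as it is
(merged <- rbind(
   modified=com_con["clausmod", ] + com_con["phrasmod", ],
   simple  =com_con["simple", ]))

# correct predictions when guessing each row's most frequent construction
correct_3_levels <- sum(apply(com_con, 1, max))
correct_2_levels <- sum(apply(merged, 1, max))
c(three_levels=correct_3_levels, two_levels=correct_2_levels)
```

Since both modified levels lead to the same prediction (v_prt_do), the number of correct predictions, and hence lambda, does not change when they are merged.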

4 Exercise 04

Question: Given that the choice of a verb-particle construction is correlated with the complexity of the direct object (see previous exercise), one might wonder whether the general distribution of the DO lengths differs across the two constructions. (This might be expected and interesting for the same reasons as the question in exercise 03 because the length and the complexity of the DO are probably highly correlated in the first place.)

4.1 Hypotheses

The

dependent/response variable is DO_LENSYLL;
independent/predictor variable is CONSTRUCTION.

What are the hypotheses?

text hypotheses:
- H₁: The distributions of the DOs’ lengths differ across the two constructions;
- H₀: The distributions of the DOs’ lengths don’t differ across the two constructions;
statistical hypotheses:
- H₁: D>0;
- H₀: D=0.

4.2 Descriptive stats/visualization

We first describe the data:

par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
with(d, { # look again how I avoid having to type d$ all the time
   hist(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[1]], # plot a histogram for the DO lengths of one construction
        xlim=c(0, 35), ylim=c(0, 70),             # w/ x-axis limits for the range of all values
        main=levels(CONSTRUCTION)[1], xlab="")    # and the relevant level as a heading but no x-axis label
   hist(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[2]], # plot a histogram for the DO lengths of the other construction
        xlim=c(0, 35), ylim=c(0, 70),             # w/ x-axis limits for the range of all values
        main=levels(CONSTRUCTION)[2], xlab="")    # and the relevant level as a heading but no x-axis label
})        # this is where the with(d, { ... is closed!
par(mfrow=c(1, 1)) # reset to default
# or, shorter: with(d, tapply(DO_LENSYLL, CONSTRUCTION, hist, xlim=range(DO_LENSYLL), ylim=c(0, 70)))

A shorter, but slightly less nice, version would be this:

par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
tapply(                      # apply to
   d$DO_LENSYLL,             # these values
   d$CONSTRUCTION,           # a grouping by these values
   hist,                     # then apply hist to each group
   xlim=range(d$DO_LENSYLL), # make this an additional argument to hist
   ylim=c(0, 70))            # make this an additional argument to hist

$v_do_prt
$breaks
[1] 1 2 3 4 5 6 7 8 9

$counts
[1] 63 15 11  5  3  0  2  1

$density
[1] 0.63 0.15 0.11 0.05 0.03 0.00 0.02 0.01

$mids
[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5

$xname
[1] "X[[i]]"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

$v_prt_do
$breaks
[1]  0  5 10 15 20 25 30 35

$counts
[1] 48 32 13  6  0  0  1

$density
[1] 0.096 0.064 0.026 0.012 0.000 0.000 0.002

$mids
[1]  2.5  7.5 12.5 17.5 22.5 27.5 32.5

$xname
[1] "X[[i]]"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

par(mfrow=c(1, 1)) # reset to default

And you could also use ecdf plots for this:

with(d, {
# plot the cumulative frequencies of lengths for one constr
plot(ecdf(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[1]]),
# with nice x-axis limits, vertical lines in blue, no heading, and a grid
   xlim=range(DO_LENSYLL), verticals=TRUE, col="blue", main=""); grid()
# plot the cumulative frequencies of lengths for other constr
lines(ecdf(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[2]]),
      verticals=TRUE, col="red") # with vertical lines in red
legend(10, 0.3,                     # put a legend at these coordinates
       xjust=0.5, yjust=0.5,        # centered along both x- and y-axis
       fill=c("blue", "red"),       # using the colors blue and red
       legend=levels(CONSTRUCTION), # for the two construction labels
       bty="n")                     # and no box
})

4.3 Statistical testing

For this question, we would prefer to use a Kolmogorov-Smirnov test for independence/differences:

ks.test(d$DO_LENSYLL ~ d$CONSTRUCTION)


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  d$DO_LENSYLL by d$CONSTRUCTION
D = 0.54, p-value = 4.335e-13
alternative hypothesis: two-sided

Clearly highly significant …

4.4 Write-up

[Show histograms or ecdf plots.] According to a Kolmogorov-Smirnov test for independence/differences of the lengths of the DOs for each construction, the distributions of lengths are significantly different from each other (D=0.54, p<10^-12): When the construction is v_prt_do, then there are many more longer lengths; when the construction is v_do_prt, then the vast majority of lengths are rather short (nearly 80% of all DO lengths in v_do_prt are 3 syllables or shorter).
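
What the D of the two-sample Kolmogorov-Smirnov test measures can be seen in the ecdf plot: it is the largest vertical distance between the two ecdf curves. The sketch below computes it by hand and compares it with ks.test. Note that the two vectors are invented toy data, NOT the course data (the per-construction lengths cannot be rebuilt from the output shown here), and that all object names are made up for this illustration.

```r
# toy syllable counts, invented for illustration (NOT the course data)
len_a <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 8)
len_b <- c(2, 3, 4, 5, 5, 6, 8, 9, 12, 15)

# all values at which either ecdf curve can jump
grid <- sort(unique(c(len_a, len_b)))
# vertical distances between the two ecdf curves at those values
gaps <- abs(ecdf(len_a)(grid) - ecdf(len_b)(grid))
(D_hand <- max(gaps))

# the same statistic from ks.test (asymptotic p-value because of ties)
D_ks <- unname(ks.test(len_a, len_b, exact=FALSE)$statistic)
c(by_hand=D_hand, ks.test=D_ks)
```

Applied to the real data, the same logic means that D=0.54 is the point where the cumulative proportions of the two constructions are 54 percentage points apart.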

4.5 Excursus

At the beginning of this exercise, we said “the length and the complexity of the DO are probably highly correlated” – are they? Check it (just visually).

par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
spineplot(d$DO_COMPLX  ~ d$DO_LENSYLL)
boxplot  (d$DO_LENSYLL ~ d$DO_COMPLX)
par(mfrow=c(1, 1)) # reset to default

It certainly seems so.

How about when we compute lambdas? (A little awkward here because of the many different numeric values so this is just the dirtiest of heuristics …):

(len_com <- table(d$DO_LENSYLL, d$DO_COMPLX))


     clausmod phrasmod simple
  1         0        0     40
  2         0        1     37
  3         0        3     23
  4         0        9     11
  5         0       12      6
  6         1        8      5
  7         0        4      2
  8         1        6      1
  9         0        5      0
  10        0        5      0
  11        0        4      0
  12        1        2      0
  13        1        1      1
  14        1        1      0
  15        0        1      0
  16        0        2      1
  17        0        1      0
  18        0        1      0
  19        0        1      0
  31        1        0      0

# predicting from DO_LENSYLL to DO_COMPLX
error_rate_wout_pred <- len_com %>% colSums %>% prop.table %>% max %>% "-"(1, .)
error_rate_with_pred <- apply(len_com, 1, max) %>% sum %>% "/"(sum(len_com)) %>% "-"(1, .)
(PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_pred)

[1] 0.5342466

# predicting from DO_COMPLX to DO_LENSYLL
error_rate_wout_pred <- len_com %>% rowSums %>% prop.table %>% max %>% "-"(1, .)
error_rate_with_pred <- apply(len_com, 2, max) %>% sum %>% "/"(sum(len_com)) %>% "-"(1, .)
(PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_pred)

[1] 0.08125

Ways to check this for real, not just with plots, will again include regression models.

5 Homework

To prepare for next session, read (and work through!) SFLWR³: Sections 4.2-4.3.1.

6 Session info

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] STGmisc_1.0    Rcpp_1.0.14    magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] digest_0.6.37     fastmap_1.2.0     xfun_0.50         nortest_1.0-4
 [5] knitr_1.49        htmltools_0.5.8.1 rmarkdown_2.29    cli_3.6.3
 [9] rstudioapi_0.17.1 tools_4.4.2       evaluate_1.0.3    yaml_2.3.10
[13] rlang_1.1.5       jsonlite_1.8.9    htmlwidgets_1.6.4 MASS_7.3-64