Load the package magrittr and the file _input/partplacement.csv into a data frame d. This file contains data from a corpus study on the alternation of particle placement that was introduced in Section 1.3; you can find information about this data set in _input/partplacement.r.
rm(list=ls(all=TRUE)); library(magrittr)
summary(d <- read.delim(
   "_input/partplacement.csv",
   stringsAsFactors=TRUE))
      CASE          CONSTRUCTION      MEDIUM       DO_COMPLX      DO_LENSYLL
 Min.   :  1.00   v_do_prt:100   spoken :100   clausmod:  6   Min.   : 1.00
 1st Qu.: 50.75   v_prt_do:100   written:100   phrasmod: 67   1st Qu.: 2.00
 Median :100.50                                simple  :127   Median : 3.00
 Mean   :100.50                                               Mean   : 4.72
 3rd Qu.:150.25                                               3rd Qu.: 6.00
 Max.   :200.00                                               Max.   :31.00
      DO_ANIM         DO_CONC       PP
 animate  : 27   abstract: 95   no :167
 inanimate:173   concrete:105   yes: 33
1 Exercise 01
Question: Across all verb-particle constructions, are abstract and concrete DOs equally frequent? (This might be interesting because of the diachrony of these constructions as well as because of how children might learn from their input what verb-particle constructions are used for in general.)
1.1 Hypotheses
The
dependent/response variable is DO_CONC;
independent/predictor variable is none because we are not considering any other variables as ‘determining’ the behavior of DO_CONC.
What are the hypotheses?
text hypotheses:
H1: The frequencies of abstract and concrete DOs in verb-particle constructions differ;
H0: The frequencies of abstract and concrete DOs in verb-particle constructions don’t differ;
statistical hypotheses:
H1: Χ²>0;
H0: Χ²=0.
1.2 Descriptive stats/visualization
We first describe the data:
(con <- table(d$DO_CONC))
abstract concrete
95 105
prop.table(con)
abstract concrete
0.475 0.525
dotchart(main="The frequency distribution of DO_CONC", # the main heading
   xlab="Observed frequency", xlim=c(0, nrow(d)),      # x-axis stuff
   ylab="Concreteness of DO",                          # y-axis label
   x=con, pch=16)                     # what's to be plotted & with what
abline(v=mean(con), lty=2)            # vertical line at H0
text(con, seq(con),                   # text at these coordinates
   con,                               # the frequencies
   pos=c(2,4))                        # one to the left of its point, the other to the right
1.3 Statistical testing
For this question, we would prefer to use a chi-squared test for goodness-of-fit:
(con_chisq <- chisq.test(x=con, p=c(0.5, 0.5)))
Chi-squared test for given probabilities
data: con
X-squared = 0.5, df = 1, p-value = 0.4795
Clearly not significant, but were we allowed to test it like this? (Of course we were: both expected frequencies are well above 5.)
con_chisq$expected
abstract concrete
100 100
What is the effect size of this non-significant result? It’s tiny, extremely close to 0:
(max_poss_chisq <- sum(con) * (1-0.5)/0.5) # compute max poss chi-squared
[1] 200
con_chisq$statistic/max_poss_chisq
X-squared
0.0025
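The arithmetic behind these numbers can be retraced by hand. The following is a minimal sketch that uses only the two observed counts reported above (95 and 105); the object names are illustrative and not part of the original script.

```r
# observed frequencies, typed in from the table above
obs <- c(abstract=95, concrete=105)
# expected frequencies under H0 (a uniform distribution over 2 levels)
exp_h0 <- sum(obs) * c(0.5, 0.5)
# chi-squared: the sum of (observed - expected)^2 / expected
chisq_manual <- sum((obs - exp_h0)^2 / exp_h0)
# p-value from the chi-squared distribution with df = 2 - 1 = 1
p_manual <- pchisq(chisq_manual, df=length(obs)-1, lower.tail=FALSE)
# effect size: chi-squared relative to the largest value it could take,
# which arises when all 200 observations fall into a single category
max_poss <- sum(obs) * (1-0.5)/0.5
c(chisq=chisq_manual, p=p_manual, effect=chisq_manual/max_poss)
```

The three values should agree with the output of chisq.test and with the effect size computed above.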
1.4 Write-up
In the verb-particle construction data, abstract and concrete DOs were observed 95 and 105 times respectively; the null hypothesis expectation was a uniform distribution. According to a chi-squared test for goodness-of-fit, the observed data do not differ significantly from the null hypothesis (Χ²=0.5, df=1, p=0.4795; effect size=0.0025).
1.5 Excursus
How might one test this using a simulation approach? Like this:
set.seed(123)          # set the seed of the random number generator
collector <- 0         # set a collector value to 0
for (i in 1:10000) {   # do something 10K times, namely
   sampled_dist <- sample( # put into sampled_dist the result of sampling
      c("abstr", "conc"),  # from the vector c("abstr", "conc")
      200,                 # 200 elements
      replace=TRUE)        # with replacement; this is the embodiment of the null hypothesis
   collector <-        # make the new version of collector be
      collector +      # the old version of collector plus
      # a logical test of whether there are 105+ concrete's, like in the real data
      (sum(sampled_dist=="conc")>=105)
}
collector/10000 # 0.2594
[1] 0.2594
2*(collector/10000) # 0.5188
[1] 0.5188
Conclusion: In the verb-particle construction data, abstract and concrete DOs were observed 95 and 105 times respectively; the null hypothesis expectation was a uniform distribution. According to a simulation study (using 10000 random samples) for goodness-of-fit, the observed data do not differ significantly from the null hypothesis (p=0.5188).
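The loop above can also be written without a loop, and its result can be checked against the exact binomial probability. The following is a sketch under the same null hypothesis (200 draws, each concrete with probability 0.5); the object names are illustrative. Because rbinom consumes random numbers differently from sample, the simulated value will not be identical to the 0.2594 above even with the same seed, only close to it.

```r
n_sims <- 10000   # number of simulated samples
n_obs  <- 200     # sample size, as in the real data
set.seed(123)
# each value is the number of concrete DOs in one random sample under H0
sim_concrete <- rbinom(n_sims, size=n_obs, prob=0.5)
# simulated one-tailed p-value: share of samples with 105+ concrete DOs
p_sim_one_tailed <- mean(sim_concrete >= 105)
# exact one-tailed p-value: P(X >= 105) = 1 - P(X <= 104)
p_exact_one_tailed <- pbinom(104, size=n_obs, prob=0.5, lower.tail=FALSE)
# exact two-tailed p-value from the binomial test
p_exact_two_tailed <- binom.test(105, n_obs, p=0.5)$p.value
c(simulated=p_sim_one_tailed,
  exact_one_tailed=p_exact_one_tailed,
  exact_two_tailed=p_exact_two_tailed)
```

Since the null distribution is symmetric here, doubling the one-tailed value is a reasonable way to arrive at a two-tailed p-value.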
2 Exercise 02
Question: Are the lengths of direct objects in all verb-particle constructions normally distributed? (This would mostly be interesting because many other tests one might compute on the lengths would require that.)
2.1 Hypotheses
The
dependent/response variable is DO_LENSYLL;
independent/predictor variable is none because we are not considering any other variables as ‘determining’ the behavior of DO_LENSYLL.
What are the hypotheses?
text hypotheses:
H1: The distribution of the DO lengths in the verb-particle constructions differs from normality;
H0: The distribution of the DO lengths in the verb-particle constructions doesn’t differ from normality;
statistical hypotheses:
H1: D>0;
H0: D=0.
2.2 Descriptive stats/visualization
We first describe the data:
summary(d$DO_LENSYLL)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 3.00 4.72 6.00 31.00
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
hist(d$DO_LENSYLL, main="")
hist(d$DO_LENSYLL, main="", breaks=16)
plot(ecdf(d$DO_LENSYLL), verticals=TRUE, main="",
   xlab="DO length in syllables",
   ylab="Cumulative %"); grid()
par(mfrow=c(1, 1)) # reset to default
2.3 Statistical testing
For this question, we would use a Lilliefors test:
nortest::lillie.test(d$DO_LENSYLL)
Lilliefors (Kolmogorov-Smirnov) normality test
data: d$DO_LENSYLL
D = 0.19409, p-value < 2.2e-16
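To see what the test statistic D measures, it can be computed by hand as the largest vertical distance between the ecdf of the data and the cdf of a normal distribution with the sample's mean and standard deviation. The following is a sketch on a hypothetical right-skewed sample generated with rexp (an assumption for illustration, not the course data); only base R is used.

```r
set.seed(123)
# hypothetical right-skewed sample (NOT the DO lengths from the data file)
x <- sort(rexp(200, rate=1/4))
n <- length(x)
# cumulative probabilities under a normal distribution whose
# mean and sd are estimated from the sample itself
p_norm <- pnorm(x, mean=mean(x), sd=sd(x))
# largest distance of the ecdf above the normal cdf ...
D_plus  <- max((1:n)/n - p_norm)
# ... and largest distance of the ecdf below the normal cdf
D_minus <- max(p_norm - (0:(n-1))/n)
# D is the larger of the two
(D_manual <- max(D_plus, D_minus))
```

The statistic is the same as that of a one-sample Kolmogorov-Smirnov test; what the Lilliefors test changes is the p-value, which takes into account that the mean and the standard deviation were estimated from the data.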
2.4 Write-up
According to a Lilliefors test for normality, the DO lengths in the verb-particle constructions differ very significantly from a normal distribution (D=0.1941, p<10⁻¹⁵).
2.5 Excursus
How might one explore/represent this using a visual simulation approach?
This would be how to do this with a histogram approach:
# first, we generate the regular histogram like above
hist(d$DO_LENSYLL, main="", breaks=16,
   freq=FALSE) # but we make the y-axis show densities, not frequencies
# we add a density curve to that
lines(density(d$DO_LENSYLL), lwd=3)
set.seed(123)       # we set the seed of the random number generator &
for (i in 1:100) {  # do the following 100 times:
   lines(col="#FF000008",           # draw light red lines
      x=density(                    # namely density curves
         rnorm(                     # for normally distributed values:
            n=length(d$DO_LENSYLL), # 200 values with
            mean=mean(d$DO_LENSYLL), # the mean of the lengths &
            sd=sd(d$DO_LENSYLL))))  # the sd of the lengths
}
And this would be how to do this with an ecdf plot kind of approach:
# first, we generate the regular ecdf plot like above
plot(ecdf(d$DO_LENSYLL), verticals=TRUE, main="",
   xlab="DO length in syllables",
   ylab="Cumulative %"); grid()
set.seed(123)       # we set the seed of the random number generator &
for (i in 1:100) {  # do the following 100 times:
   lines(col="#FF000001",        # draw light red lines:
      ecdf(rnorm(                # ecdf curve for normally distributed values
         n=length(d$DO_LENSYLL), # 200 values with
         mean=mean(d$DO_LENSYLL), # the mean of the lengths &
         sd=sd(d$DO_LENSYLL))))  # the sd of the lengths
}
3 Exercise 03
Question: Does the choice of a verb-particle construction correlate with the complexity of the direct object? (This might be interesting (i) because, if that was so, it might be explainable with processing considerations of a more general type (short-before-long) and (ii) because of how children learn the alternation: they, somewhat astonishingly, use the discontinuous variant v_do_prt first, which might be because adults perhaps prefer that order with short DOs (we will check that) so children, who don’t use complex DOs, would then prefer the verb-particle construction that prefers the only kinds of DOs they can already handle.)
3.1 Hypotheses
The
dependent/response variable is CONSTRUCTION;
independent/predictor variable is DO_COMPLX.
What are the hypotheses?
text hypotheses:
H1: The frequencies of the two verb-particle constructions differ across the levels of the DO’s complexity;
H0: The frequencies of the two verb-particle constructions don’t differ across the levels of the DO’s complexity;
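statistical hypotheses:
H1: Χ²>0;
H0: Χ²=0.
3.2 Descriptive stats/visualization
We first describe the data:
addmargins(com_con <- table(d$DO_COMPLX, d$CONSTRUCTION))
round(prop.table(com_con, 1), 3)
mosaicplot(x=com_con, # a mosaic plot of the table (NOT transposed)
   main="", xlab="Complexity", ylab="Construction", # w/ no heading & these axis labels
   col=c("grey35", "grey75")) # w/ these colors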
3.3 Statistical testing
For this question, we would prefer to use a chi-squared test for independence:
(com_con_chisq <- chisq.test(x=com_con, correct=FALSE))
Clearly highly significant, but were we allowed to test it like this?
com_con_chisq$expected
Obviously not: one third of all expected frequencies are smaller than 5:
sum(com_con_chisq$expected<5) / length(com_con_chisq$expected)
If we had been allowed to use the chi-squared test for independence, we would have continued with the residuals and the effect size Cramer’s V:
com_con_chisq$residuals # compute the residuals (still useful, actually, just not for inference)
sqrt(
   com_con_chisq$statistic /                # numerator
   (sum(com_con) * (min(dim(com_con))-1)))  # denominator
But to be really safe, we need to compute the exact test, which actually returns a p-value that is very similar to that of the chi-squared test.
fisher.test(com_con)
Fisher's Exact Test for Count Data
data: com_con
p-value = 2.925e-15
alternative hypothesis: two.sided
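The steps of this section (expected frequencies, chi-squared, residuals, Cramer's V) can be retraced without the data file. The following is a sketch on illustrative counts: the cell values are an assumption, chosen so that the marginals (6/67/127 and 100/100) and the statistics reported in this key are reproduced; they were not read from _input/partplacement.csv. The name com_con_demo is likewise only illustrative.

```r
# illustrative counts (an assumption, see above):
# rows = DO_COMPLX, columns = CONSTRUCTION
com_con_demo <- matrix(c(1, 9, 90, 5, 58, 37), ncol=2,
   dimnames=list(DO_COMPLX=c("clausmod", "phrasmod", "simple"),
                 CONSTRUCTION=c("v_do_prt", "v_prt_do")))
# expected frequency = row total * column total / grand total
expected <- outer(rowSums(com_con_demo), colSums(com_con_demo)) /
   sum(com_con_demo)
# share of expected frequencies below 5 (should not exceed 20%)
share_small <- mean(expected < 5)
# chi-squared and the Pearson residuals it is built from
pearson_resid <- (com_con_demo - expected) / sqrt(expected)
chisq_manual  <- sum(pearson_resid^2)
# Cramer's V: sqrt(chi-squared / (n * (smaller table dimension - 1)))
cramers_v <- sqrt(chisq_manual /
   (sum(com_con_demo) * (min(dim(com_con_demo)) - 1)))
round(c(chisq=chisq_manual, V=cramers_v, share_small=share_small), 3)
```

With these counts, the two clausmod cells have an expected frequency of 3 each, which is why one third of the six expected frequencies fall below 5.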
3.4 Write-up
[Show com_con.] The observed data (and the residuals of a chi-squared test for independence) show that simple DOs prefer the construction where the DO precedes the particle whereas both kinds of modified DOs prefer the construction where the DO follows the particle. A chi-squared test for independence indicated that this pattern is highly significant (Χ²=60.621, df=2, p<10⁻¹³; Cramer’s V=0.55). However, 2 of the 6 expected frequencies were less than 5 (both were 3), which is why an additional Fisher-Yates exact test was conducted, which fully confirmed the result of the chi-squared test (p<10⁻¹⁴).
3.5 Excursus 1
Remember PRE measures (from session 04)? What is the PRE measure (lambda) for this correlation and how does one compute it efficiently?
error_rate_wout_pred <- com_con %>% colSums %>% prop.table %>% max %>% "-"(1, .)
error_rate_with_pred <- apply(com_con, 1, max) %>% sum %>% "/"(sum(com_con)) %>% "-"(1, .)
PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_pred
setNames(
   c(error_rate_wout_pred, error_rate_with_pred, PRE),
   c("Error rate without predictor", "Error rate with predictor", "Proportional reduction of error"))
Error rate without predictor Error rate with predictor
0.500 0.235
Proportional reduction of error
0.530
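The same computation can be wrapped into a small base-R function without pipes. The following is a sketch: the function name lambda_rows_predict_cols is hypothetical, and the counts are illustrative (an assumption chosen to match the marginals and statistics reported in this key, not read from the data file).

```r
# lambda for predicting the column variable from the row variable
lambda_rows_predict_cols <- function(tab) {
   # without the predictor, we always guess the most frequent column level
   err_without <- 1 - max(colSums(tab)) / sum(tab)
   # with the predictor, we guess the most frequent column level per row
   err_with <- 1 - sum(apply(tab, 1, max)) / sum(tab)
   # proportional reduction of error
   (err_without - err_with) / err_without
}
# illustrative counts (an assumption, see above):
# rows = DO_COMPLX, columns = CONSTRUCTION
com_con_demo <- matrix(c(1, 9, 90, 5, 58, 37), ncol=2,
   dimnames=list(DO_COMPLX=c("clausmod", "phrasmod", "simple"),
                 CONSTRUCTION=c("v_do_prt", "v_prt_do")))
lambda_rows_predict_cols(com_con_demo)
# lambda is asymmetric: transposing the table reverses the direction
lambda_rows_predict_cols(t(com_con_demo))
```

Because lambda is directional, it matters which variable is treated as the predictor; the two calls will generally return different values.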
With this, the write-up might be changed to this:
[Show com_con.] The observed data (and the residuals of a chi-squared test for independence) show that simple DOs prefer the construction where the DO precedes the particle whereas both kinds of modified DOs prefer the construction where the DO follows the particle. The default of a chi-squared test for independence was not permitted since 2 of the 6 expected frequencies were less than 5 (both were 3), which is why a Fisher-Yates exact test was conducted; according to this test, the correlation between the complexity of the DO and its position relative to the particle is highly significant (p<10⁻¹⁴) and comes with a high PRE effect size (Goodman & Kruskal’s lambda=0.53).
3.6 Excursus 2
Question: Do the data permit us to not even distinguish between phrasally- and clausally-modified DOs? First, it seems like they behave ‘the same’ statistically: look at how similar rows 1 and 2 are in this table:
round(prop.table(com_con, 1), 3)
Second, from a PRE perspective, distinguishing between phrasally- and clausally-modified DOs certainly makes no difference, given that for each we would always predict v_prt_do.
Finally, it also makes linguistic/conceptual sense to conflate them because they are the two ‘modified’ levels; we would not consider conflation if simple had been statistically very similar to clausmod.
With the right kind of method – regression modeling – you would find out that Occam’s razor says to conflate the two kinds of modified DOs because distinguishing them makes no significant contribution (p=0.8296547).
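What such a regression-based comparison might look like can be sketched as follows. This is not necessarily how the p-value above was obtained: the counts are illustrative (an assumption chosen to match the marginals and statistics reported in this key, not read from the data file), and all object names are hypothetical. The idea is to fit one binary logistic regression with all three complexity levels and one with the two modified levels conflated, and then to compare the two models with a likelihood ratio test.

```r
# illustrative counts (an assumption, see above), one row per table cell
cells <- data.frame(
   DO_COMPLX    = rep(c("clausmod", "phrasmod", "simple"), times=2),
   CONSTRUCTION = rep(c("v_do_prt", "v_prt_do"), each=3),
   FREQ         = c(1, 9, 90, 5, 58, 37),
   stringsAsFactors=TRUE)
# expand to one row per observation (200 rows)
d_demo <- cells[rep(seq_len(nrow(cells)), cells$FREQ), c("DO_COMPLX", "CONSTRUCTION")]
# create a copy of the predictor in which both modified levels are merged
d_demo$DO_COMPLX_confl <- d_demo$DO_COMPLX
levels(d_demo$DO_COMPLX_confl) <- c("modified", "modified", "simple")
# model with the conflated predictor vs. model with all three levels
m_confl <- glm(CONSTRUCTION ~ DO_COMPLX_confl, family=binomial, data=d_demo)
m_full  <- glm(CONSTRUCTION ~ DO_COMPLX,       family=binomial, data=d_demo)
# likelihood ratio test: does distinguishing the modified levels help?
model_comparison <- anova(m_confl, m_full, test="Chisq")
(p_conflation <- model_comparison[["Pr(>Chi)"]][2])
```

A large p-value here means that the more complex model does not fit significantly better, so the simpler model with the conflated predictor would be preferred.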
4 Exercise 04
Question: Given that the choice of a verb-particle construction is correlated with the complexity of the direct object (see previous exercise), one might wonder whether the general distribution of the DO lengths differs across the two constructions. (This might be expected and interesting for the same reasons as the question in exercise 03 because the length and the complexity of the DO are probably highly correlated in the first place.)
4.1 Hypotheses
The
dependent/response variable is DO_LENSYLL;
independent/predictor variable is CONSTRUCTION.
What are the hypotheses?
text hypotheses:
H1: The distributions of the DOs’ lengths differ across the two constructions;
H0: The distributions of the DOs’ lengths don’t differ across the two constructions;
statistical hypotheses:
H1: D>0;
H0: D=0.
4.2 Descriptive stats/visualization
We first describe the data:
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
with(d, { # look again how I avoid having to type d$ all the time
   hist(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[1]], # plot a histogram for the DO lengths of one construction
      xlim=c(0, 35), ylim=c(0, 70),          # w/ axis limits that cover the range of all values
      main=levels(CONSTRUCTION)[1], xlab="") # and the relevant level as a heading but no x-axis label
   hist(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[2]], # plot a histogram for the DO lengths of the other construction
      xlim=c(0, 35), ylim=c(0, 70),          # w/ axis limits that cover the range of all values
      main=levels(CONSTRUCTION)[2], xlab="") # and the relevant level as a heading but no x-axis label
}) # this is where the with(d, { ... is closed!
par(mfrow=c(1, 1)) # reset to default
# or, shorter: tapply(DO_LENSYLL, CONSTRUCTION, hist, xlim=range(DO_LENSYLL), ylim=c(0, 70))
A shorter, but slightly less nice, version would be this:
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
tapply(                      # apply to
   d$DO_LENSYLL,             # these values
   d$CONSTRUCTION,           # a grouping by these values
   hist,                     # then apply hist to each group
   xlim=range(d$DO_LENSYLL), # make this an additional argument to hist
   ylim=c(0, 70))            # make this an additional argument to hist
par(mfrow=c(1, 1)) # reset to default
And you could also use ecdf plots for this:
with(d, {
   # plot the cumulative frequencies of lengths for one constr
   plot(ecdf(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[1]]),
      # with nice x-axis limits, vertical lines in blue, no heading, and a grid
      xlim=range(DO_LENSYLL), verticals=TRUE, col="blue", main=""); grid()
   # plot the cumulative frequencies of lengths for other constr
   lines(ecdf(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[2]]),
      verticals=TRUE, col="red") # with vertical lines in red
   legend(10, 0.3,                 # put a legend at these coordinates
      xjust=0.5, yjust=0.5,        # centered along both x- and y-axis
      fill=c("blue", "red"),       # using the colors blue and red
      legend=levels(CONSTRUCTION), # for the two construction labels
      bty="n")                     # and no box
})
4.3 Statistical testing
For this question, we would prefer to use a Kolmogorov-Smirnov test for independence/differences:
ks.test(d$DO_LENSYLL ~ d$CONSTRUCTION)
Asymptotic two-sample Kolmogorov-Smirnov test
data: d$DO_LENSYLL by d$CONSTRUCTION
D = 0.54, p-value = 4.335e-13
alternative hypothesis: two-sided
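The statistic D of the two-sample test is the largest vertical distance between the two ecdf curves, which can be verified by hand. The following is a sketch on two hypothetical samples generated with rexp (an assumption for illustration, not the DO lengths of the two constructions); the object names are illustrative.

```r
set.seed(123)
# two hypothetical samples (NOT the course data): one with mostly
# short values, one with many longer values
len_a <- rexp(100, rate=1/2)
len_b <- rexp(100, rate=1/6)
# evaluate both ecdfs at every observed value ...
pooled <- sort(c(len_a, len_b))
ecdf_a <- ecdf(len_a)(pooled)
ecdf_b <- ecdf(len_b)(pooled)
# ... and take the largest absolute difference between them
(D_manual <- max(abs(ecdf_a - ecdf_b)))
# the value at which the two curves are furthest apart
pooled[which.max(abs(ecdf_a - ecdf_b))]
```

Locating the point of maximal distance in this way can help describe where along the scale the two distributions differ most.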
Clearly highly significant …
4.4 Write-up
[Show histograms or ecdf plots.] According to a Kolmogorov-Smirnov test for independence/differences of the lengths of the DOs for each construction, the distributions of lengths are significantly different from each other (D=0.54, p<10⁻¹²): When the construction is v_prt_do, then there are many more longer DOs; when the construction is v_do_prt, then the vast majority of DOs are rather short (nearly 80% of all DO lengths in v_do_prt are 3 syllables or shorter).
4.5 Excursus
At the beginning of this exercise, we said “the length and the complexity of the DO are probably highly correlated” – are they? Check it (just visually).
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
spineplot(d$DO_COMPLX ~ d$DO_LENSYLL)
boxplot  (d$DO_LENSYLL ~ d$DO_COMPLX)
par(mfrow=c(1, 1)) # reset to default
It certainly seems so.
How about when we compute lambdas? (A little awkward here because of the many different numeric values so this is just the dirtiest of heuristics …):
# cross-tabulate lengths (rows) and complexity levels (columns)
len_com <- table(d$DO_LENSYLL, d$DO_COMPLX)
# predicting from DO_LENSYLL to DO_COMPLX
error_rate_wout_pred <- len_com %>% colSums %>% prop.table %>% max %>% "-"(1, .)
error_rate_with_pred <- apply(len_com, 1, max) %>% sum %>% "/"(sum(len_com)) %>% "-"(1, .)
(PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_pred)
[1] 0.5342466
# predicting from DO_COMPLX to DO_LENSYLL
error_rate_wout_pred <- len_com %>% rowSums %>% prop.table %>% max %>% "-"(1, .)
error_rate_with_pred <- apply(len_com, 2, max) %>% sum %>% "/"(sum(len_com)) %>% "-"(1, .)
(PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_pred)
[1] 0.08125
Ways to check this for real, not just with plots, will again include regression models.
5 Homework
To prepare for next session, read (and work through!) SFLWR3: Sections 4.2-4.3.1.
6 Session info
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] STGmisc_1.0 Rcpp_1.0.14 magrittr_2.0.3
loaded via a namespace (and not attached):
[1] digest_0.6.37 fastmap_1.2.0 xfun_0.50 nortest_1.0-4
[5] knitr_1.49 htmltools_0.5.8.1 rmarkdown_2.29 cli_3.6.3
[9] rstudioapi_0.17.1 tools_4.4.2 evaluate_1.0.3 yaml_2.3.10
[13] rlang_1.1.5 jsonlite_1.8.9 htmlwidgets_1.6.4 MASS_7.3-64
Source Code
---title: "Ling 104, session 05: distr. & freq. (key)"author: - name: "[Stefan Th. Gries](https://www_stgries.info)" affiliation: "UC Santa Barbara & JLU Giessen" orcid: 0000-0002-6497-3958date: "2025-02-04 12:34:56"date-format: "DD MMM YYYY HH-mm-ss"editor: sourceformat: html: page-layout: full code-fold: false code-link: true code-copy: true code-tools: true code-line-numbers: true code-overflow: scroll number-sections: true smooth-scroll: true toc: true toc-depth: 4 number-depth: 4 toc-location: left monofont: lucida console tbl-cap-location: top fig-cap-location: bottom fig-width: 6 fig-height: 6 fig-format: png fig-dpi: 300 fig-align: center embed-resources: trueexecute: cache: false echo: true eval: true warning: false---Load the package `magrittr` and the file [_input/partplacement.csv](_input/partplacement.csv) into a data frame `d`. This file contains data from a corpus study on the alternation of particle placement that was introduced in Section 1.3; you can find information about this data set in [_input/partplacement.r](_input/partplacement.r).```{r prepworkspace}rm(list=ls(all=TRUE)); library(magrittr)summary(d <- read.delim( "_input/partplacement.csv", stringsAsFactors=TRUE))```# Exercise 01Question: Across all verb-particle constructions, are abstract and concrete DOs equally frequent? 
(This might be interesting because of the diachrony of these constructions as well as because of how children might learn from their input what verb-particle constructions are used for in general.)## HypothesesThe* dependent/response variable is `DO_CONC`;* independent/predictor variable is none because we are not considering any other variables as 'determining' the behavior of `DO_CONC`.What are the hypotheses?* text hypotheses: + H~1~: The frequencies of abstract and concrete DOs in verb-particle constructions differ; + H~0~: The frequencies of abstract and concrete DOs in verb-particle constructions don't differ;* statistical hypotheses: + H~1~: *Χ*^2^>0; + H~0~: *Χ*^2^=0.## Descriptive stats/visualizationWe first describe the data:```{r exercise01desc}#| fig-height: 4(con <- table(d$DO_CONC))prop.table(con)dotchart(main="The frequency distribution of DO_CONC", # the main heading xlab="Observed frequency", xlim=c(0, nrow(d)), # x-axis stuff ylab="Concreteness of DO", # y-axis label x=con, pch=16) # what's to be plotted & with whatabline(v=mean(con), lty=2) # vertical line at H0text(con, seq(con), # text at these coordinates con, # the frequencies pos=c(2,4)) # one below, the other above```## Statistical testingFor this question, we would prefer to use a chi-squared test for goodness-of-fit:```{r exercise01testa}(con_chisq <- chisq.test(x=con, p=c(0.5, 0.5)))```Clearly not significant, but were we allowed to test it like this? (Of course we were ...)```{r exercise01testb}con_chisq$expected```What is the effect size of this ns result? It's tiny, extremely close to 0:```{r exercise01testc}(max_poss_chisq <- sum(con) * (1-0.5)/0.5) # compute max poss chi-squaredcon_chisq$statistic/max_poss_chisq```## Write-upIn the verb-particle construction data, abstract and concrete DOs were observed 95 and 105 times respectively; the null hypothesis expectation was a uniform distribution. 
According to a chi-squared test for goodness-of-fit, the observed data do not differ significantly from the null hypothesis (*Χ*^2^=`r con_chisq$statistic`, *df*=`r con_chisq$parameter`, *p*=`r round(con_chisq$p.value, 4)`; effect size=`r con_chisq$statistic/max_poss_chisq`).## ExcursusHow might one test this using a simulation approach? Like this:```{r exercise01excursus}set.seed(123) # set a random number generatorcollector <- 0 # set a collector value to 0for (i in 1:10000) { # do something 10K times, namely sampled_dist <- sample( # put into sampled_dist the result of sampling c("abstr", "conc"), # from the vector c("abstr", "conc") 200, # 200 elements replace=TRUE) # with replacement; this is the embodiment of the null hypothesis collector <- # make the new version of collector be collector + # the old version of collector plus # a logical test of whether there are 105+ concrete's, like in the real data (sum(sampled_dist=="conc")>=105)}collector/10000 # 0.25942*(collector/10000) # 0.5188``````{r exercise01excursusb}#| echo: false#| eval: falselibrary(boot)bootstrapping_props <- function (dataframe2bootstrap, elements2bootstrap) { dataframe2bootstrap[elements2bootstrap,"DO_CONC"] %>% table %>% prop.table}set.seed(123); boot_results <- boot( data=d, statistic=bootstrapping_props, R=2000)lapply(1:2, \(af) boot.ci(boot_results, type=c("perc", "bca"), index=af))```Conclusion: In the verb-particle construction data, abstract and concrete DOs were observed 95 and 105 times respectively; the null hypothesis expectation was a uniform distribution. According to a simulation study (using 10000 random samples) for goodness-of-fit, the observed data do not differ significantly from the null hypothesis (*p*=`r 2*(collector/10000)`).# Exercise 02Question: Are the lengths of direct objects in all verb-particle constructions normally distributed? 
(This would mostly be interesting because many other tests one might compute on the lengths would require that.)## HypothesesThe* dependent/response variable is `DO_LENSYLL`;* independent/predictor variable is none because we are not considering any other variables as 'determining' the behavior of `DO_LENSYLL`.What are the hypotheses?* text hypotheses: + H~1~: The distribution of the DO lengths in the verb-particle constructions differs from normality; + H~0~: The distribution of the DO lengths in the verb-particle constructions doesn't differ from normality;* statistical hypotheses: + H~1~: *D*>0; + H~0~: *D*=0.## Descriptive stats/visualizationWe first describe the data:```{r exercise02desc}#| fig-width: 9#| fig-show: holdsummary(d$DO_LENSYLL)par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columnshist(d$DO_LENSYLL, main="")hist(d$DO_LENSYLL, main="", breaks=16)plot(ecdf(d$DO_LENSYLL), verticals=TRUE, main="", xlab="DO length in syllables", ylab="Cumulative %"); grid()par(mfrow=c(1, 1)) # reset to default```## Statistical testingFor this question, we would use a Lilliefors test:```{r exercise02test}nortest::lillie.test(d$DO_LENSYLL)```## Write-upAccording to a Lilliefors test for normality, the DO lengths in the verb-particle constructions differ very significantly from a normal distribution (*D*=0.1941, *p*<10^-15^).## ExcursusHow might one explore/represent this using a visual simulation approach?This would be how to do this with a histogram approach:```{r exercise02excursus1}# first, we generate the regular histogram like abovehist(d$DO_LENSYLL, main="", breaks=16, freq=FALSE) # but we make it a density curve# we add a density curve to thatlines(density(d$DO_LENSYLL), lwd=3)set.seed(123) # we set a random number generator &for (i in 1:100) { # do the following 100 times: lines(col="#FF000008", # draw light red lines x=density( # namely density curves rnorm( # for normally distributed values: n=length(d$DO_LENSYLL), # 200 values with 
mean=mean(d$DO_LENSYLL), # the mean of the lengths & sd=sd(d$DO_LENSYLL)))) # the sd of the lengths}```And this would be how to do this with an ecdf plot kind of approach:```{r exercise02excursus2}# first, we generate the regular ecdf plot like aboveplot(ecdf(d$DO_LENSYLL), verticals=TRUE, main="", xlab="DO length in syllables", ylab="Cumulative %"); grid()set.seed(123) # we set a random number generator &for (i in 1:100) { # do the following 100 times: lines(col="#FF000001", # draw light red lines: ecdf(rnorm( # ecdf curve for normally distributed values n=length(d$DO_LENSYLL), # 200 values with mean=mean(d$DO_LENSYLL), # the mean of the lengths & sd=sd(d$DO_LENSYLL)))) # the sd of the lengths}```# Exercise 03Question: Does the choice of a verb-particle construction correlate with the complexity of the direct object? (This might be interesting (i) because, if that was so, it might be explainable with processing considerations of a more general type (short-before-long) and (ii) because of how children learn the alternation: they, somewhat astonishingly, use the discontinuous variant *v_do_prt* first, which might be because adults perhaps prefer that order with short DOs (we will check that) so children, who don't use complex DOs, would then prefer the verb-particle construction that prefers the only kinds of DOs they can already handle.)## HypothesesThe* dependent/response variable is `CONSTRUCTION`;* independent/predictor variable is `DO_COMPLX`.What are the hypotheses?* text hypotheses: + H~1~: The frequencies of the two verb-particle constructions differ across the levels of the DO's complexity; + H~0~: The frequencies of the two verb-particle constructions don't differ across the levels of the DO's complexity;* statistical hypotheses: + H~1~: *Χ*^2^>0; + H~0~: *Χ*^2^=0.## Descriptive stats/visualizationWe first describe the data:```{r exercise03desc}addmargins(com_con <- table(d$DO_COMPLX, d$CONSTRUCTION))round(prop.table(com_con, 1), 3)mosaicplot(x=com_con, # 
a mosaic plot of the table (NOT transposed) main="", xlab="Complexity", ylab="Construction", # w/ no heading & these axis labels col=c("grey35", "grey75")) # w/ these colors```## Statistical testingFor this question, we would prefer to use a chi-squared test for independence:```{r exercise03testa}(com_con_chisq <- chisq.test(x=com_con, correct=FALSE))```Clearly highly significant, but were we allowed to test it like this?```{r exercise03testb}com_con_chisq$expected```Obviously not: one third of all expected frequencies are smaller than 5:```{r exercise03testc}sum(com_con_chisq$expected<=5) / length(com_con_chisq$expected)```If we had been allowed to use the chi-squared test for independence, we would have continued with the residuals and the effect size Cramer's *V*:```{r exercise03testd}com_con_chisq$residuals # compute the residuals (still useful, actually, just not for inference)sqrt( com_con_chisq$statistic / # numerator (sum(com_con) * (min(dim(com_con))-1))) # denominator```But to be really safe, we need to compute the exact test, which actually returns a *p*-value that is very similar to that of the chi-squared test.```{r exercise03teste}fisher.test(com_con)```## Write-up[Show `com_con`.] The observed data (and the residuals of a chi-squared test for independence) show that simple DOs prefer the construction where the DO precedes the particle whereas both kinds of modified DOs prefer the construction where the DO follows the particle. A chi-squared test for independence indicated that this pattern is highly significant (*Χ*^2^=`r round(com_con_chisq$statistic, 3)`, *df*=`r com_con_chisq$parameter` , *p*<10^-13^; Cramer's *V*=0.55). However, 2 of the 6 expected frequencies were less than 5 (3), which is why an additional Fisher-Yates exact test was conducted, which fully confirmed the result of the chi-squared test (*p*<10^-14^).## Excursus 1Remember PRE measures (from session 04)? 
What is the PRE measure (*lambda*) for this correlation and how does one compute it efficiently?```{r exercise03excursus1}error_rate_wout_pred <- com_con %>% colSums %>% prop.table %>% max %>% "-"(1, .)error_rate_with_pred <- apply(com_con, 1, max) %>% sum %>% "/"(sum(com_con)) %>% "-"(1, .)PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_predsetNames( c(error_rate_wout_pred, error_rate_with_pred, PRE), c("Error rate without predictor", "Error rate with predictor", "Proportional reduction of error"))``````{r}#| echo: false#| eval: falseGoodmanKruskal.lambda(com_con)$`rows helping to predict columns`GoodmanKruskal.gamma(com_con)```With this, the write-up might be changed to this:[Show `com_con`.] The observed data (and the residuals of a chi-squared test for independence) show that simple DOs prefer the construction where the DO precedes the particle whereas both kinds of modified DOs prefer the construction where the DO follows the particle. The default of a chi-squared test for independence was not permitted since 2 of the 6 expected frequencies were less than 5 (3), which is why an Fisher-Yates exact test was conducted; according to this test, the correlation between the complexity of the DO and its position relative to the particle is highly significant (*p*<10^-14^) and comes with a high PRE effect size (Goodman & Kruskal's lambda=0.53).## Excursus 2Question: Do the data permit us to not even distinguish between phrasally- and clausally-modified DOs? 
It seems like they behave 'the same' statistically, look at how similar rows 1 and 2 are in this table:```{r exercise03excursus2a}round(prop.table(com_con, 1), 3)```Second, from a PRE perspective, distinguishing between phrasally- and clausally-modified DOs certainly makes no difference, given that for each we would always predict *v_prt_do*.Finally, it also makes linguistic/conceptual sense to conflate them because they are the the two 'modified' levels -- we would not consider conflation if `simple` had been statistically very similar to `clausmod`.```{r exercise03excursus2b}#| echo: falsed$DO_COMPLX_confl <- d$DO_COMPLX levels(d$DO_COMPLX_confl) <- c("modified", "modified", "simple")# table(d$DO_COMPLX, d$DO_COMPLX_confl)m_01 <- glm(CONSTRUCTION ~ DO_COMPLX , family=binomial, data=d)m_02 <- glm(CONSTRUCTION ~ DO_COMPLX_confl, family=binomial, data=d)qwe <- anova(m_01, m_02, test="Chisq")```With the right kind of method -- regression modeling -- you would find out that Occam's razor says to conflate the two kinds of modified DOs because distinguishing them makes no significant contribution (*p*=`r qwe[["Pr(>Chi)"]][2]`).# Exercise 04Question: Given that the choice of a verb-particle construction is correlated with the complexity of the direct object (see previous exercise), one might wonder whether the general distribution of the DO lengths differ across the two constructions. 
(This might be expected and interesting for the same reasons as the question in exercise 03 because the length and the complexity of the DO are probably highly correlated in the first place.)

## Hypotheses

The

* dependent/response variable is `DO_LENSYLL`;
* independent/predictor variable is `CONSTRUCTION`.

What are the hypotheses?

* text hypotheses:
    + H~1~: The distributions of the DOs' lengths differ across the two constructions;
    + H~0~: The distributions of the DOs' lengths don't differ across the two constructions;
* statistical hypotheses:
    + H~1~: *D*>0;
    + H~0~: *D*=0.

## Descriptive stats/visualization

We first describe the data:

```{r exercise04desca}
#| fig-show: hold
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
with(d, { # look again how I avoid having to type d$ all the time
   hist(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[1]], # plot a histogram for the DO lengths of one construction
      xlim=c(0, 35), ylim=c(0, 70),                        # w/ x-axis limits for the range of all values
      main=levels(CONSTRUCTION)[1], xlab="")               # and the relevant level as a heading but no x-axis label
   hist(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[2]], # plot a histogram for the DO lengths of the other construction
      xlim=c(0, 35), ylim=c(0, 70),                        # w/ x-axis limits for the range of all values
      main=levels(CONSTRUCTION)[2], xlab="")               # and the relevant level as a heading but no x-axis label
}) # this is where the with(d, { ... is closed!
par(mfrow=c(1, 1)) # reset to default
# or, shorter: tapply(DO_LENSYLL, CONSTRUCTION, hist, xlim=range(DO_LENSYLL), ylim=c(0, 70))
```

A shorter, but slightly less nice, version would be this:

```{r exercise04descb}
#| fig-show: hold
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
tapply(                      # apply to
   d$DO_LENSYLL,             # these values
   d$CONSTRUCTION,           # a grouping by these values
   hist,                     # then apply hist to each group
   xlim=range(d$DO_LENSYLL), # make this an additional argument to hist
   ylim=c(0, 70))            # make this an additional argument to hist
par(mfrow=c(1, 1)) # reset to default
```

And you could also use ecdf plots for this:

```{r exercise04descc}
with(d, {
   # plot the cumulative frequencies of lengths for one constr
   plot(ecdf(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[1]]),
      # with nice x-axis limits, vertical lines in blue, no heading, and a grid
      xlim=range(DO_LENSYLL), verticals=TRUE, col="blue", main=""); grid()
   # plot the cumulative frequencies of lengths for other constr
   lines(ecdf(DO_LENSYLL[CONSTRUCTION==levels(CONSTRUCTION)[2]]),
      verticals=TRUE, col="red") # with vertical lines in red
   legend(10, 0.3,                 # put a legend at these coordinates
      xjust=0.5, yjust=0.5,        # centered along both x- and y-axis
      fill=c("blue", "red"),       # using the colors blue and red
      legend=levels(CONSTRUCTION), # for the two construction labels
      bty="n")                     # and no box
})
```

## Statistical testing

For this question, we would prefer to use a Kolmogorov-Smirnov test for independence/differences:

```{r exercise04test}
ks.test(d$DO_LENSYLL ~ d$CONSTRUCTION)
```

Clearly highly significant ...

## Write-up

[Show histograms or ecdf plots.]
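The two kinds of numbers that a write-up for this exercise reports, the share of short DOs per construction and *D*, can also be computed by hand. What follows is only a minimal sketch on made-up toy vectors: `toy_len`, `toy_con`, `share_short`, and `D_by_hand` are hypothetical names, and the toy values are not the values in `d`. It relies on the fact that the two-sided two-sample *D* is the largest vertical distance between the two ecdf curves:

```r
# toy data, made up for illustration only (NOT the values in d)
toy_len <- c(1, 2, 3, 5,   2, 4, 6, 9)                     # DO lengths in syllables
toy_con <- factor(rep(c("v_do_prt", "v_prt_do"), each=4))  # construction of each DO

# share of DOs that are at most 3 syllables long, per construction
share_short <- tapply(toy_len <= 3, toy_con, mean)

# D by hand: evaluate both ecdfs at every observed length and
# take the largest absolute difference between the two curves
ecdf_1 <- ecdf(toy_len[toy_con == levels(toy_con)[1]])
ecdf_2 <- ecdf(toy_len[toy_con == levels(toy_con)[2]])
all_lengths <- sort(unique(toy_len))
D_by_hand <- max(abs(ecdf_1(all_lengths) - ecdf_2(all_lengths)))

share_short
D_by_hand
```

With `d` loaded, replacing `toy_len` and `toy_con` with `d$DO_LENSYLL` and `d$CONSTRUCTION` should give the corresponding values for the real data.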
According to a Kolmogorov-Smirnov test for independence/differences of the lengths of the DOs for each construction, the distributions of lengths are significantly different from each other (*D*=0.54, *p*<10^-12^): When the construction is *v_prt_do*, there are many more long DOs; when the construction is *v_do_prt*, the vast majority of DOs are rather short (nearly 80% of all DO lengths in *v_do_prt* are 3 syllables or shorter).

## Excursus

At the beginning of this exercise, we said "the length and the complexity of the DO are probably highly correlated" -- are they? Check it (just visually).

```{r exercise04excursus}
#| fig-show: hold
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
spineplot(d$DO_COMPLX ~ d$DO_LENSYLL)
boxplot  (d$DO_LENSYLL ~ d$DO_COMPLX)
par(mfrow=c(1, 1)) # reset to default
```

It certainly seems so.

How about when we compute lambdas? (This is a little awkward here because of the many different numeric values, so it is just the dirtiest of heuristics ...):

```{r}
(len_com <- table(d$DO_LENSYLL, d$DO_COMPLX))
# predicting from DO_LENSYLL to DO_COMPLX
error_rate_wout_pred <- len_com %>% colSums %>% prop.table %>% max %>% "-"(1, .)
error_rate_with_pred <- apply(len_com, 1, max) %>% sum %>% "/"(sum(len_com)) %>% "-"(1, .)
(PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_pred)
# predicting from DO_COMPLX to DO_LENSYLL
error_rate_wout_pred <- len_com %>% rowSums %>% prop.table %>% max %>% "-"(1, .)
error_rate_with_pred <- apply(len_com, 2, max) %>% sum %>% "/"(sum(len_com)) %>% "-"(1, .)
(PRE <- (error_rate_wout_pred-error_rate_with_pred) / error_rate_wout_pred)
```

```{r}
#| echo: false
#| eval: false
GoodmanKruskal.lambda(len_com)
GoodmanKruskal.tau(len_com)
```

Ways to check this for real, not just with plots, will again include regression models.

# Homework

To prepare for next session, read (and work through!) SFLWR^3^: Sections 4.2-4.3.1.

# Session info

```{r sessionInfo}
sessionInfo()
```