Ling 104, session 04: descr. stats 2 (key)

Author

Affiliation

UC Santa Barbara & JLU Giessen

Published

01 Apr 2026 12-34-56

rm(list=ls(all=TRUE)); library(magrittr)

1 Exercise 14

Ten bilingual students (English/German) took one dictation in English and one in German. They made the following numbers of mistakes in English and German respectively:

ENGLISH <- c(29, 20, 10, 16, 12, 15, 25, 22, 20, 23)
GERMAN  <- c(21, 19, 28, 28, 26, 18, 16, 22, 20, 28)

Compute a measure of correlation to quantify the association between the numbers of errors.

cor(ENGLISH, GERMAN)

[1] -0.4611367
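As a cross-check, Pearson's r can also be computed by hand. The following is a minimal sketch, not part of the original key; the object names r_by_hand and r_from_z are purely illustrative:

```r
# data from the exercise
ENGLISH <- c(29, 20, 10, 16, 12, 15, 25, 22, 20, 23)
GERMAN  <- c(21, 19, 28, 28, 26, 18, 16, 22, 20, 28)
# Pearson's r = covariance / (product of the two standard deviations)
r_by_hand <- cov(ENGLISH, GERMAN) / (sd(ENGLISH) * sd(GERMAN))
# equivalently: sum of the products of the z-scores, divided by n-1
r_from_z <- sum(scale(ENGLISH) * scale(GERMAN)) / (length(ENGLISH) - 1)
c(r_by_hand, r_from_z, cor(ENGLISH, GERMAN)) # all three should agree
```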

Illustrate the correlation in a graph and interpret the results (in one sentence).

Note: the assignment says “the correlation”, which is bidirectional: cor(a, b) is the same as cor(b, a). But for plotting, you need to decide what to put on the y-axis, which is traditionally reserved for the response variable. The safest way is therefore to plot both directions:

op <- par(mar=c(4, 4, 2, 1)) # customize the plotting margins
par(mfrow=c(1,2)) # make the plotting window have 1 row & 2 columns
plot(ENGLISH ~ GERMAN, # plot English as a function of German mistakes
     xlab="German dictation" , xlim=c(0, 35), # x-axis stuff
     ylab="English dictation", ylim=c(0, 35)) # y-axis stuff
abline(lm(ENGLISH ~ GERMAN), col="blue"); grid() # add regression line in blue & a grid
plot(GERMAN ~ ENGLISH, # plot German as a function of English mistakes
     xlab="English dictation", xlim=c(0, 35), # x-axis stuff
     ylab="German dictation" , ylim=c(0, 35)) # y-axis stuff
abline(lm(GERMAN ~ ENGLISH), col="blue"); grid() # add regression line in blue & a grid
par(op) # reset to defaults

There is a moderate negative correlation (r≈-0.461): higher numbers of mistakes in one dictation tend to go with lower numbers of mistakes in the other.
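The reason the two panels show different lines even though there is only one r can be sketched as follows: the correlation is symmetric, but the two regressions have different slopes, and the product of those slopes equals r². The object names below are illustrative only:

```r
# data from the exercise
ENGLISH <- c(29, 20, 10, 16, 12, 15, 25, 22, 20, 23)
GERMAN  <- c(21, 19, 28, 28, 26, 18, 16, 22, 20, 28)
r <- cor(ENGLISH, GERMAN)
all.equal(r, cor(GERMAN, ENGLISH)) # the correlation is symmetric
# but the regression slopes depend on which variable is the response
slope_E_on_G <- unname(coef(lm(ENGLISH ~ GERMAN))[2])
slope_G_on_E <- unname(coef(lm(GERMAN ~ ENGLISH))[2])
c(slope_E_on_G, slope_G_on_E) # two different slopes
all.equal(slope_E_on_G * slope_G_on_E, r^2) # their product is r squared
```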

2 Exercise 15

Compute the number of mistakes expected from a student in the German dictation, if that student made 12 mistakes in the English dictation.

op <- par(mar=c(4, 4, 2, 1)) # customize the plotting margins
m <- lm(GERMAN ~ ENGLISH) # compute a regression model German as a function of English
predict(m,       # predict from that model
   newdata=list( # what happens when
   ENGLISH=12))  # the number of English mistakes is 12

       1 
25.14358

plot(GERMAN ~ ENGLISH, # plot German as a function of English mistakes
     xlab="English dictation", xlim=c(0, 35), # x-axis stuff
     ylab="German dictation" , ylim=c(0, 35)) # y-axis stuff
abline(lm(GERMAN ~ ENGLISH), col="blue"); grid() # add regression line in blue & a grid
# draw the vertical & horizontal dashed lines
segments(12, par("usr")[3], 12, 25.14, col="blue", lty=2)    # vertical: from the bottom of the plot (usr[3] = y-min)
segments(12, 25.14, par("usr")[1], 25.14, col="blue", lty=2) # horizontal: to the left edge of the plot (usr[1] = x-min)

par(op) # reset to defaults
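To see what predict does here, the same value can be computed directly from the model's coefficients (intercept + slope × 12). This is a minimal sketch; pred_by_hand and pred_by_predict are illustrative names:

```r
# data from the exercise
ENGLISH <- c(29, 20, 10, 16, 12, 15, 25, 22, 20, 23)
GERMAN  <- c(21, 19, 28, 28, 26, 18, 16, 22, 20, 28)
m <- lm(GERMAN ~ ENGLISH) # same model as above
b <- coef(m)              # named vector: "(Intercept)" & "ENGLISH"
# prediction = intercept + slope * number of English mistakes
pred_by_hand <- unname(b["(Intercept)"] + b["ENGLISH"] * 12)
pred_by_hand
# the same value via predict
pred_by_predict <- unname(predict(m, newdata=data.frame(ENGLISH=12)))
all.equal(pred_by_hand, pred_by_predict)
```

Storing the prediction in an object such as pred_by_hand also makes it possible to pass that value to segments() instead of typing 25.14 by hand.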

3 Exercise 16

Now you also obtained the sexes of the students: students 2 to 6 were girls, the rest boys.

Enter this into R.

SEX <- factor(c("m", "f", "f", "f", "f", "f", "m", "m", "m", "m"))

Compute the average numbers of errors in the German dictation for boys and girls.

tapply(GERMAN, # apply to the values of German
       SEX,    # a grouping by sex & then
       mean)   # make those grouped values the 1st argum. to the function mean

   f    m 
23.8 21.4
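Two other base-R ways to get the same group means, shown only as a sketch of alternatives to tapply (the object names are illustrative):

```r
# data from the exercises
GERMAN <- c(21, 19, 28, 28, 26, 18, 16, 22, 20, 28)
SEX <- factor(c("m", "f", "f", "f", "f", "f", "m", "m", "m", "m"))
# split the values by sex into a list, then apply mean to each list element
means_split <- sapply(split(GERMAN, SEX), mean)
means_split
# or use the formula interface of aggregate, which returns a data frame
means_aggr <- aggregate(GERMAN ~ SEX, data=data.frame(GERMAN, SEX), FUN=mean)
means_aggr
```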

Represent the numbers of mistakes in the German dictation as a function of the sex of the students graphically.

op <- par(mar=c(4, 4, 2, 1)) # customize the plotting margins
boxplot(GERMAN ~ SEX, # generate a boxplot of GERMAN as a function of / per SEX
   notch=TRUE,        # with notches
   ylim=c(0, 35),     # and these y-axis limits
   xlab="Sex", ylab="Number of errors") # and these axis labels

par(op) # reset to defaults

4 Exercise 17

Standardize the numbers of errors in both dictations.

(ENGLISH-mean(ENGLISH))/sd(ENGLISH) # or scale(ENGLISH)

 [1]  1.6497080  0.1346700 -1.5487055 -0.5386802 -1.2120304 -0.7070177
 [7]  0.9763578  0.4713451  0.1346700  0.6396827

(GERMAN -mean(GERMAN)) /sd(GERMAN)  # or scale(GERMAN)

 [1] -0.3515752 -0.7910443  1.1865664  1.1865664  0.7470974 -1.0107788
 [7] -1.4502479 -0.1318407 -0.5713098  1.1865664
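A quick sanity check on standardized values is that they have a mean of 0 and a standard deviation of 1. The sketch below also shows that scale returns a one-column matrix rather than a plain vector; z_manual and z_scale are illustrative names:

```r
# data from the exercise
ENGLISH <- c(29, 20, 10, 16, 12, 15, 25, 22, 20, 23)
z_manual <- (ENGLISH - mean(ENGLISH)) / sd(ENGLISH)
# scale() returns a 1-column matrix w/ attributes, so convert it to a vector
z_scale <- as.numeric(scale(ENGLISH))
all.equal(z_manual, z_scale) # both approaches agree
# z-scores have mean 0 & sd 1 (up to floating-point error)
round(c(mean=mean(z_manual), sd=sd(z_manual)), 10)
```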

5 Exercise 18

50 students took a statistics exam, 80% passed. What is the 95%-confidence interval for this result?

round(binom.test(    # round the confidence interval of a binomial test for
   x=40,             # this number of successes, here passes
   n=50,             # this number of trials, here the 50 students
   conf.level=0.95)$ # the confidence level we want; 0.95 = default
   conf.int, 3)      # return only the confidence interval, round to 3 decimals

[1] 0.663 0.900
attr(,"conf.level")
[1] 0.95
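The interval that binom.test reports is the exact (Clopper-Pearson) interval, which can be reproduced from quantiles of the beta distribution. The following is a minimal sketch under that assumption; ci_lower and ci_upper are illustrative names:

```r
x <- 40; n <- 50 # passes & students
alpha <- 0.05    # 1 - confidence level
# Clopper-Pearson limits expressed as beta quantiles
ci_lower <- qbeta(alpha/2,     x,     n - x + 1)
ci_upper <- qbeta(1 - alpha/2, x + 1, n - x)
round(c(ci_lower, ci_upper), 3)
# compare w/ the interval returned by binom.test
all.equal(c(ci_lower, ci_upper), as.numeric(binom.test(x, n)$conf.int))
```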

But of course you could also use a (percentile) bootstrapping approach, which returns results that are fairly comparable to the binom.test results (and that are also fairly comparable to those of a more advanced bootstrapping approach):

collector <- rep(NA, 2000) # set up a collector vector
set.seed(123); for (i in 1:2000) { # do something 2K times, namely
   RESULTS_sampled <- sample(
           c("fail", "pass"),
      prob=c( 0.2  ,  0.8),
      size=50, replace=TRUE)
   collector[i] <- sum(RESULTS_sampled=="pass")/50
}
quantile(collector, probs=c(0.025, 0.975)) # extract 'CI'

 2.5% 97.5% 
 0.68  0.90
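Because each resample only counts passes out of 50 draws with a pass probability of 0.8, the loop can also be sketched in vectorized form with rbinom. Note that this uses the random number generator differently from sample, so the quantiles need not be identical to the ones above even with the same seed; props is an illustrative name:

```r
set.seed(123)
# 2000 simulated exams w/ 50 students each, converted to proportions of passes
props <- rbinom(n=2000, size=50, prob=0.8) / 50
quantile(props, probs=c(0.025, 0.975)) # extract 'CI'
```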

6 Exercise 19

Load the file _input/partplacement.csv into a data frame d. This file contains data from a corpus study on the alternation of particle placement that was introduced in Section 1.3; you can find information about this data set in _input/partplacement.r.

summary(d <- read.delim(
   "_input/partplacement.csv",
   stringsAsFactors=TRUE))

      CASE          CONSTRUCTION     MEDIUM       DO_COMPLX     DO_LENSYLL   
 Min.   :  1.00   v_do_prt:100   spoken :100   clausmod:  6   Min.   : 1.00  
 1st Qu.: 50.75   v_prt_do:100   written:100   phrasmod: 67   1st Qu.: 2.00  
 Median :100.50                                simple  :127   Median : 3.00  
 Mean   :100.50                                               Mean   : 4.72  
 3rd Qu.:150.25                                               3rd Qu.: 6.00  
 Max.   :200.00                                               Max.   :31.00  
      DO_ANIM        DO_CONC      PP     
 animate  : 27   abstract: 95   no :167  
 inanimate:173   concrete:105   yes: 33

7 Exercise 20

Represent the correlation between the choice of construction and the complexity of the direct object graphically.

op <- par(mar=c(4, 4, 2, 1)) # customize the plotting margins
mosaicplot(            # show a mosaic plot
   main="",            # w/out a main heading
   x=table(            # of the table of 
      d$DO_COMPLX,     # DO_COMPLX &
      d$CONSTRUCTION), # CONSTRUCTION
   col=c("grey35", "grey75")) # w/ these colors
par(op) # reset to defaults
# plot(d$DO_COMPLX, d$CONSTRUCTION) # or
# plot(table(d$DO_COMPLX, d$CONSTRUCTION)) # or
# plot(d$CONSTRUCTION ~ d$DO_COMPLX)

8 Exercise 21

Create a table representing the correlation between the choice of construction and the complexity of the direct object and briefly summarize the result.

table(
   d$DO_COMPLX,
   d$CONSTRUCTION) # or table(d$CONSTRUCTION, d$DO_COMPLX)

          
           v_do_prt v_prt_do
  clausmod        1        5
  phrasmod        9       58
  simple         90       37

# this is how you could avoid the d$...:
# nested functions
with(d, table(
   DO_COMPLX,
   CONSTRUCTION))

          CONSTRUCTION
DO_COMPLX  v_do_prt v_prt_do
  clausmod        1        5
  phrasmod        9       58
  simple         90       37

# w/ the pipe
d %>% with(table(
   DO_COMPLX,
   CONSTRUCTION))

          CONSTRUCTION
DO_COMPLX  v_do_prt v_prt_do
  clausmod        1        5
  phrasmod        9       58
  simple         90       37

Simple direct objects prefer the construction where the particle follows the direct object, but phrasally and clausally modified objects prefer the construction where the particle precedes the direct object. (Note: This is not yet a significance test so it is not clear yet whether this preference is in fact significant.)
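The preferences are easier to read off row proportions than off raw counts. The sketch below rebuilds the table from the counts printed above, so that it runs without the input file; with d loaded, prop.table could be applied to the table directly. The names counts and row_props are illustrative:

```r
# the counts from the table above, entered row by row
counts <- matrix(c( 1,  5,
                    9, 58,
                   90, 37), ncol=2, byrow=TRUE,
   dimnames=list(DO_COMPLX=c("clausmod", "phrasmod", "simple"),
                 CONSTRUCTION=c("v_do_prt", "v_prt_do")))
# proportions of each construction per level of DO_COMPLX (rows sum to 1)
row_props <- round(prop.table(counts, margin=1), 3)
row_props
```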

9 Exercise 22

Represent the correlation between the choice of construction and the length of the direct object graphically and briefly summarize the result.

This would be the best solution:

op <- par(mar=c(4, 4, 2, 1)) # customize the plotting margins
spineplot(d$CONSTRUCTION ~ d$DO_LENSYLL)

spineplot(d$CONSTRUCTION ~ d$DO_LENSYLL, breaks=12)

par(op) # reset to defaults

The shorter the direct object, the more often the particle is positioned after the direct object. (Note: This is not yet a significance test so it is not clear yet whether this preference is in fact significant.)

10 Exercise 23

Investigate whether the choice of construction depends on the animacy of the referent of the direct objects and the presence/absence of a directional prepositional phrase.

This question is in fact ambiguous: did I mean

1. ‘whether CONSTRUCTION varies as
   a. a function of DO_ANIM and, separately,
   b. a function of PP’?
2. or did I mean ‘whether CONSTRUCTION varies as a function of DO_ANIM and PP together’?

The answers to these questions are different; this is what you would do for 1a) and 1b):

table(d$CONSTRUCTION, d$DO_ANIM) # question 1a

          
           animate inanimate
  v_do_prt      19        81
  v_prt_do       8        92

table(d$CONSTRUCTION, d$PP)      # question 1b

          
           no yes
  v_do_prt 75  25
  v_prt_do 92   8

And this is what you would do for 2):

table(d$CONSTRUCTION, d$DO_ANIM, d$PP) # question 2

, ,  = no

          
           animate inanimate
  v_do_prt      13        62
  v_prt_do       7        85

, ,  = yes

          
           animate inanimate
  v_do_prt       6        19
  v_prt_do       1         7

Although this would be better:

ftable(             # a flat contingency table w/
   d$DO_ANIM, d$PP, # these row variables
   d$CONSTRUCTION)  # this column variable

               v_do_prt v_prt_do
                                
animate   no         13        7
          yes         6        1
inanimate no         62       85
          yes        19        7

And this might be even better:

round(digits=3,            # round to 3 digits
   prop.table(margin=1,    # the row percentages of
      x=ftable(            # a flat contingency table w/
         d$DO_ANIM, d$PP,  # these row variables
         d$CONSTRUCTION))) # this column variable

               v_do_prt v_prt_do
                                
animate   no      0.650    0.350
          yes     0.857    0.143
inanimate no      0.422    0.578
          yes     0.731    0.269

But this one would actually benefit from the pipe, I think, both in terms of brevity/readability and in terms of the output:

with(d, ftable(DO_ANIM, PP, CONSTRUCTION)) %>%
   prop.table(1) %>% round(3)

              CONSTRUCTION v_do_prt v_prt_do
DO_ANIM   PP                                
animate   no                  0.650    0.350
          yes                 0.857    0.143
inanimate no                  0.422    0.578
          yes                 0.731    0.269

11 Homework

To prepare for next week, read (and work through!) SFLWR³: Section 4.1.

12 Session info

sessionInfo()

R version 4.5.2 Patched (2025-11-09 r88994 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods  
[8] base     

other attached packages:
[1] STGmisc_1.06   Rcpp_1.1.1     magrittr_2.0.4

loaded via a namespace (and not attached):
 [1] digest_0.6.39     fastmap_1.2.0     xfun_0.57         knitr_1.51       
 [5] htmltools_0.5.9   rmarkdown_2.30    cli_3.6.5         rstudioapi_0.18.0
 [9] tools_4.5.2       evaluate_1.0.5    yaml_2.3.12       otel_0.2.0       
[13] htmlwidgets_1.6.4 rlang_1.1.7       jsonlite_2.0.0    MASS_7.3-65