Ling 105: all assignments

Author: Stefan Th. Gries

Affiliations: UC Santa Barbara; JLU Giessen

Published: 01 Apr 2024 12-34-56

1 Assignment 01

Central question: How many X does the phrase some X next to a Y refer to? Your predictors are

OBJECT: the sizes of the objects X: large vs. small;
REFPOINT: the sizes of the reference points Y: large vs. small.

Analyze the data properly with a regression model and summarize the results (briefly). [Difficulty level: 1]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/quantifyingsome.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors

      CASE         OBJECT   REFPOINT    ESTIMATE
 Min.   : 1.00   large:8   large:8   Min.   : 2.0
 1st Qu.: 4.75   small:8   small:8   1st Qu.:38.5
 Median : 8.50                       Median :44.0
 Mean   : 8.50                       Mean   :51.5
 3rd Qu.:12.25                       3rd Qu.:73.0
 Max.   :16.00                       Max.   :91.0

2 Assignment 02

Central question: What determines the number of praises in child-caretaker interaction? The data come from recordings of different children and contain the following variables:

PRAISES: the response variable, the number of times the children are praised by their caretakers;
CHILD: the name of each child;
SEX: the sex of each child;
CAN: the number of verb phrases where the caretakers use can when speaking about actions of the child;
WANT: the number of verb phrases where the caretakers use want when speaking about actions of the child;
SHOULD_SHALL: the number of verb phrases where the caretakers use should/shall when speaking about actions of the child;
DIRECTIVE: the number of verb phrases where the caretakers use a directive when speaking about actions of the child;
SUCCESS: the number of times the child does something as intended;
FAILURE: the number of times the child does something not as intended.

You now want to determine to what degree the number of praises is a function of

all predictors as main effects
and the interaction of each predictor with SEX.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 3]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/praises.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors

     CHILD    SEX       PRAISES          CAN              WANT
 aRetha : 1   f:15   Min.   : 0.0   Min.   : 0.000   Min.   : 0.00
 aRnold : 1   m:13   1st Qu.: 2.0   1st Qu.: 1.000   1st Qu.: 0.75
 baRbara: 1          Median : 5.0   Median : 4.000   Median : 2.00
 beRnard: 1          Mean   : 5.5   Mean   : 4.321   Mean   : 3.25
 chRis  : 1          3rd Qu.: 7.5   3rd Qu.: 5.250   3rd Qu.: 6.00
 chRissy: 1          Max.   :13.0   Max.   :18.000   Max.   :10.00
 (Other):22
  SHOULD_SHALL      DIRECTIVE        SUCCESS          FAILURE
 Min.   :0.0000   Min.   : 0.00   Min.   : 0.000   Min.   :0.000
 1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 4.000   1st Qu.:1.000
 Median :0.0000   Median :12.00   Median : 6.500   Median :3.000
 Mean   :0.8929   Mean   :15.61   Mean   : 7.679   Mean   :3.286
 3rd Qu.:1.2500   3rd Qu.:19.50   3rd Qu.:10.000   3rd Qu.:5.250
 Max.   :6.0000   Max.   :46.00   Max.   :18.000   Max.   :8.000
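
The model specification above can be written as a single R formula. The following is only a sketch: it assumes the task means every predictor as a main effect plus each predictor's interaction with SEX, and it merely expands the formula to list its terms without fitting anything. The Poisson model in the final comment is one candidate for a count response, not a prescribed solution.

```r
# Formula for all main effects plus the interaction of every predictor with SEX
# (assumption: this is the intended reading of the task description)
f <- PRAISES ~ SEX * (CAN + WANT + SHOULD_SHALL + DIRECTIVE + SUCCESS + FAILURE)

# List the terms R derives from the formula: 7 main effects, 6 interactions
term_labels <- attr(terms(f), "term.labels")
print(term_labels)

# Since PRAISES is a count, one candidate first model (an assumption) would be
# m <- glm(f, family=poisson, data=d)
```

Listing the terms before fitting is a cheap way to check that the formula contains exactly the interactions one intends and no higher-order ones.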

3 Assignment 03

Central question: is the choice of of- vs. s-genitives (the car of my father vs. my father’s car) dependent in some way on the animacy of the possessor (my father) and/or the possessed (the car)? Your predictors are

POSSESSOR: the animacy of the possessor: abstract vs. animate vs. concrete;
POSSESSED: the animacy of the possessed: abstract vs. animate vs. concrete.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/genitivesem.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors

      CASE        GENITIVE    POSSESSOR      POSSESSED
 Min.   :  1.00   of:150   abstract:139   abstract:206
 1st Qu.: 75.75   s :150   animate :118   animate : 20
 Median :150.50            concrete: 43   concrete: 74
 Mean   :150.50
 3rd Qu.:225.25
 Max.   :300.00

4 Assignment 04

Central question: is the choice of try to- vs. try and-constructions (I’m gonna try to fix this problem vs. I’m gonna try and fix this problem, which is in the column TRY) dependent in some way on the following 3 predictors and all their interactions:

MODE: whether the data represent spoken (spk) or written (wrt) English;
VARIETY: whether the data represent American (amer) or British English (brit);
CLAUSE: whether the clause in which try is used with to or and already involves another to (to, as in we’re going -> to <- try and beat this thing) or not (other).

(Source: Hommerberg, Charlotte & Gunnel Tottie. 2007. Try to or Try and? Verb complementation in British and American English. ICAME Journal 31. 45-64.)

Analyze the data like we discussed and summarize the results (briefly). [Difficulty level: 1]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/tryandtryto.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors

      CASE       TRY       VARIETY      MODE        CLAUSE
 Min.   :   1   and:1631   amer:1187   spk:2257   other:1662
 1st Qu.: 808   to :1598   brit:2042   wrt: 972   to   :1567
 Median :1615
 Mean   :1615
 3rd Qu.:2422
 Max.   :3229

5 Assignment 05

Central question: is the choice of I vs. you, which is represented in the column MATCH, dependent in some way on the following 3 predictors and all their pairwise interactions:

SEX: whether the speaker is female or male;
SENTENCE: where in the file I or you was used on a scale from 0 (first sentence) to 1 (last sentence);
DISTANCE: where in the sentence I or you was used on a scale from 0 (first character) to ≈1 (last character).

The following loads the data and prepares the variable DISTANCE:

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(     # summarize d, the result of loading
   file="_input/IvsYou.csv", # this file
   stringsAsFactors=FALSE))  # don't change categorical variables into factors (!)

      CASE           FILE             SPEAKER              SEX
 Min.   :    1   Length:21102       Length:21102       Length:21102
 1st Qu.: 5276   Class :character   Class :character   Class :character
 Median :10552   Mode  :character   Mode  :character   Mode  :character
 Mean   :10552
 3rd Qu.:15827
 Max.   :21102
    SENTENCE       PRECEDING            MATCH            SUBSEQUENT
 Min.   :0.0000   Length:21102       Length:21102       Length:21102
 1st Qu.:0.2394   Class :character   Class :character   Class :character
 Median :0.5147   Mode  :character   Mode  :character   Mode  :character
 Mean   :0.5014
 3rd Qu.:0.7573
 Max.   :1.0000

d$SENTLENGTH <- nchar(d$PRECEDING)  +
                nchar(d$MATCH)      +
                nchar(d$SUBSEQUENT)
d$DISTANCE <- nchar(d$PRECEDING)/d$SENTLENGTH
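d <- d[,c(1:3,7,4:5,9:10)]             # keep/reorder: CASE, FILE, SPEAKER, MATCH, SEX, SENTENCE, SENTLENGTH, DISTANCE
d[,2:5] <- lapply(d[,2:5], as.factor)  # change FILE, SPEAKER, MATCH, & SEX into factors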
summary(d)

      CASE            FILE         SPEAKER      MATCH       SEX
 Min.   :    1   KRL    :4610   PS5VN  : 1248   i  :    2    : 1043
 1st Qu.: 5276   KRH    :3590   PS62L  :  852   I  :11637   f: 6676
 Median :10552   KRT    :3093   PS63K  :  785   you: 8619   m:12480
 Mean   :10552   KRP    :1997   PS5T8  :  655   You:  844   u:  903
 3rd Qu.:15827   KR0    :1445   PS5VL  :  647
 Max.   :21102   KRG    :1385   PS59B  :  632
                 (Other):4982   (Other):16283
    SENTENCE        SENTLENGTH        DISTANCE
 Min.   :0.0000   Min.   :   1.0   Min.   :0.0000
 1st Qu.:0.2394   1st Qu.:  65.0   1st Qu.:0.0351
 Median :0.5147   Median : 141.0   Median :0.2453
 Mean   :0.5014   Mean   : 181.3   Mean   :0.3197
 3rd Qu.:0.7573   3rd Qu.: 250.0   3rd Qu.:0.5600
 Max.   :1.0000   Max.   :1353.0   Max.   :0.9978

Analyze the data properly and summarize the results (briefly). [Difficulty level: 4]
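
The summary above shows that MATCH has four levels that differ only in capitalization and that SEX has an empty level and a level u. A minimal sketch of one possible clean-up follows, using a small toy data frame with hypothetical values. It assumes that the empty string and u both mean 'sex unknown' and that such cases are to be excluded; whether that is appropriate for the real data is part of the assignment.

```r
# Toy stand-in for d (hypothetical values, same level inventory as above)
toy <- data.frame(
   MATCH=c("I", "i", "you", "You", "I", "you"),
   SEX  =c("f", "m", "",    "u",   "m", "f"),
   stringsAsFactors=TRUE)

# Conflate spelling variants of the response: i/I -> i, you/You -> you
toy$MATCH <- factor(tolower(as.character(toy$MATCH)))

# Keep only cases whose speaker sex is known (assumption: "" and "u" = unknown)
toy <- droplevels(toy[toy$SEX %in% c("f", "m"), ])
summary(toy)
```

Calling droplevels after subsetting removes the now-empty factor levels so that they do not show up as empty cells in tables or models.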

6 Assignment 06

Central question: Do n-grams returned early by an algorithm (BINRANK: early) get rated better (ordinal response: RATING) than n-grams returned late by that algorithm (BINRANK: late) if one controls for the length of the n-gram (SIZE)? The data frame contains the following variables:

RATING: the response variable, integers from 1 to 7;
SIZE: the number of parts of each n-gram;
BINRANK: the main predictor as per the above.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/MERGErating.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors

      CASE             GRAM       PARTICIPANT       SCORE            SIZE
 Min.   :   1.0   GRAM001:   5   A1     :  80   Min.   :1.000   Min.   :2.00
 1st Qu.: 400.8   GRAM002:   5   A2     :  80   1st Qu.:1.000   1st Qu.:2.75
 Median : 800.5   GRAM003:   5   A3     :  80   Median :3.000   Median :3.50
 Mean   : 800.5   GRAM004:   5   A4     :  80   Mean   :3.758   Mean   :3.50
 3rd Qu.:1200.2   GRAM005:   5   A5     :  80   3rd Qu.:7.000   3rd Qu.:4.25
 Max.   :1600.0   GRAM006:   5   B1     :  80   Max.   :7.000   Max.   :5.00
                  (Other):1570   (Other):1120
  BINRANK
 early:800
 late :800

7 Assignment 07

Central question: Are results on subordinate clause ordering from the studies of Hampe and Diessel comparable/compatible? Here are the data:

CASE: the usual numbering column;
STUDY: a column indicating to which study a data point in a row belongs: diessel vs. hampe;
ORDER: the response variable in each study, the order of main and subordinate clause (and you know this response from another study in the book);
CONJ: the predictor in each study, the subordinate conjunction used in the subordinate clause:

rm(list=ls(all.names=TRUE))
d <- data.frame(
   STUDY=rep(c("diessel", "hampe"), 8),
   ORDER=rep(c("sc-mc", "mc-sc"), each=8),
   CONJ =rep(rep(c("after", "before", "once", "until"), each=2), 2),
   FREQ =c(27, 82, 6, 105, 77, 236, 5, 41, 70, 200, 81, 425, 21, 74, 94, 346))
d <- data.frame(lapply(d[, -4], \(af) { rep(af, d$FREQ) }))
d <- data.frame(lapply(d, as.factor))
summary(d <- cbind(CASE=seq(nrow(d)), d))

      CASE            STUDY        ORDER          CONJ
 Min.   :   1.0   diessel: 381   mc-sc:1311   after :379
 1st Qu.: 473.2   hampe  :1509   sc-mc: 579   before:617
 Median : 945.5                               once  :408
 Mean   : 945.5                               until :486
 3rd Qu.:1417.8
 Max.   :1890.0

Are Hampe’s and Diessel’s findings ‘the same’? Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]
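
One way to operationalize 'the same' is to ask whether the effect of CONJ on ORDER differs between the two studies, i.e. whether the interaction CONJ:STUDY is needed. The sketch below rebuilds the data from the frequencies given above and compares a model with that interaction to one without it. It is one possible approach under that assumption, not necessarily the intended solution; the object names m_int and m_main are made up for this illustration.

```r
# Rebuild the data frame from the frequencies given above
d <- data.frame(
   STUDY=rep(c("diessel", "hampe"), 8),
   ORDER=rep(c("sc-mc", "mc-sc"), each=8),
   CONJ =rep(rep(c("after", "before", "once", "until"), each=2), 2),
   FREQ =c(27, 82, 6, 105, 77, 236, 5, 41, 70, 200, 81, 425, 21, 74, 94, 346))
d <- d[rep(seq(nrow(d)), d$FREQ), c("STUDY", "ORDER", "CONJ")]  # one row per case
d[] <- lapply(d, as.factor)

# Proportions of each clause order per conjunction, separately for each study
print(round(prop.table(table(d$CONJ, d$ORDER, d$STUDY), margin=c(1, 3)), 2))

# Binary logistic regression: does the effect of CONJ differ between studies?
m_int  <- glm(ORDER ~ CONJ * STUDY, family=binomial, data=d)
m_main <- glm(ORDER ~ CONJ + STUDY, family=binomial, data=d)
print(anova(m_main, m_int, test="Chisq"))  # likelihood-ratio test of CONJ:STUDY
```

Because ORDER is a factor with the levels mc-sc and sc-mc, glm models the probability of the second level, sc-mc.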

8 Assignment 08

Central question: What determines how speakers rate the acceptability (the 7-level response variable RATING) of to- vs. -ing complementation (as in I like to swim vs. I like swimming) in an experiment?

CXNOW: whether the current experimental stimulus is a to or an -ing construction;
VNOW_PREF: whether the verb in the current experimental stimulus generally prefers to appear in a to or an -ing construction;
CXPREV: whether the previous experimental stimulus was a to or an -ing construction;
any interactions of these predictors.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/toingpriming.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors

      CASE           RATING        CXNOW     VNOW_PREF CXPREV
 Min.   :  1.0   Min.   :-3.0000   ing:270   ing:280   ing:278
 1st Qu.:139.8   1st Qu.:-1.0000   to :286   to :276   to :278
 Median :278.5   Median : 0.0000
 Mean   :278.5   Mean   : 0.3705
 3rd Qu.:417.2   3rd Qu.: 2.0000
 Max.   :556.0   Max.   : 3.0000
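
Note that RATING is read in as a numeric variable ranging from -3 to 3. Functions for ordinal regression, such as polr from the package MASS, expect the response to be an ordered factor. The following sketch shows the conversion on a toy vector with hypothetical values; the commented model call at the end is an assumption about how one might proceed, not the required analysis.

```r
# Toy ratings on the same -3 to 3 scale (hypothetical values)
rating <- c(-3, 0, 2, 3, -1, 0)

# Convert to an ordered factor, listing all 7 scale points explicitly so that
# the order of the scale is unambiguous
rating_ord <- factor(rating, levels=-3:3, ordered=TRUE)
print(table(rating_ord))

# With the real data (column names as in the summary above), this might become
# d$RATING <- factor(d$RATING, levels=-3:3, ordered=TRUE)
# m <- MASS::polr(RATING ~ CXNOW * VNOW_PREF * CXPREV, data=d, Hess=TRUE)
```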

9 Assignment 09

Central question: Do children and their caretakers exhibit different correlations (measured in Cramer’s V values) between tense (past vs. non-past) and aspect (perfective vs. imperfective) such that

adults’ correlation values don’t change over time anymore;
children’s correlation values change over time and approximate the adults’ value(s).

You have data from a corpus study and these are the variables in the data frame:

AGE: the age of the child at recording time: YEAR;MONTH.DAY;
KID: the Cramer’s V value for the child’s tense-aspect correlation in this recording;
CARETAKER: the Cramer’s V value for the caretaker’s tense-aspect correlation in this recording.

Note: Whatever graphs involving time you use, the axis representing the age of the child must of course be proportional to the age, not just to the position of an age in the vector of ages. I don’t care how you do that; if you do it in spreadsheet software, that’s fine, too.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 3]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(    # summarize d, the result of loading
   file="_input/russaspect.csv", # this file
   stringsAsFactors=FALSE)) # don't change categorical variables into factors (!)

      CASE           AGE                 KID            CARETAKER
 Min.   : 1.00   Length:80          Min.   :0.01645   Min.   :0.1627
 1st Qu.:20.75   Class :character   1st Qu.:0.31861   1st Qu.:0.3004
 Median :40.50   Mode  :character   Median :0.44217   Median :0.3554
 Mean   :40.50                      Mean   :0.45170   Mean   :0.3640
 3rd Qu.:60.25                      3rd Qu.:0.57247   3rd Qu.:0.4355
 Max.   :80.00                      Max.   :1.00000   Max.   :0.5586
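
Since AGE is loaded as a character vector, it has to be converted to a number before it can serve as a proportional time axis. Here is a minimal sketch of one way to do that in R. The helper age_to_years is hypothetical, the example ages are made up, and the code assumes that every age has exactly the form YEAR;MONTH.DAY with integer parts; the conversion of months and days to fractions of a year is approximate.

```r
# Convert ages of the form YEAR;MONTH.DAY into decimal years
# (assumption: months and days are plain integers, e.g. "2;03.15")
age_to_years <- function(age) {
   parts  <- strsplit(age, "[;.]")               # split at ";" and "."
   years  <- as.numeric(sapply(parts, "[", 1))
   months <- as.numeric(sapply(parts, "[", 2))
   days   <- as.numeric(sapply(parts, "[", 3))
   years + months/12 + days/365.25               # approximate decimal age
}

ages <- c("1;06.00", "2;00.15", "2;03.00")       # hypothetical example values
print(age_to_years(ages))

# With the real data, something like the following would give a proportional axis:
# d$AGE_NUM <- age_to_years(d$AGE); plot(KID ~ AGE_NUM, data=d)
```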

10 Assignment 10

Central question: what factors co-determine how English changed from a 3rd-person singular -th (e.g., He giveth) to the current 3rd-person singular -s (e.g., He gives)? You have data from a corpus study on how the third person singular form in English changed across five time periods (from P1 at about 1480 to P5 at about 1700). This data set contains annotation for third person singular verbs (extracted from letters) with regard to the following variables:

VARIANT: the response variable: the third person singular form as found in the corpus file: es vs. th;
TIME5: the time period: P1 vs. P2 vs. P3 vs. P4 vs. P5;
SENGEND: the sex of the sender of the letter: female vs. male;
RECGEND: the sex of the recipient of the letter: female vs. male;
CLOSEFAM: whether the recipient of the letter is a close family member of the sender or not: no vs. yes;
FINSYB: whether the verb stem ends in a sibilant: no (e.g., see) vs. yes (e.g., seize);
FOLFRIC: what the word following the third person singular form begins with: s (e.g., he sees seagulls) vs. th (e.g., he sees the seagulls) vs. other (e.g., he sees many seagulls);
GRAM: whether the verb in question is used as a grammatical or a lexical verb: yes (grammatical, i.e. be, do and aux. have) vs. no (lexical/other).

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(    # summarize d, the result of loading
   file="_input/thirdpers.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors

 VARIANT    AUTH_GEND    REC_SAME_GEND CLOSE_FAM  VNCPERIOD FIN_SYB
 es:1524   female: 784   no :1210      no :1917   P1: 505   no :3953
 th:2619   male  :3359   yes:2933      yes:2226   P2:  99   yes: 190
                                                  P3:1508
                                                  P4:1096
                                                  P5: 935
  FOL_FRIC     GRAM
 es   : 189   no :2867
 other:3666   yes:1276
 th   : 288

You want to characterize how the predictors and their pairwise interactions with TIME are correlated with the change from -(e)th to -(e)s. Analyze the data properly and summarize the results (briefly). Note: you must conflate the 3 early time periods into one, but once you’re done with everything, you should figure out why. [Difficulty level: 4]
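
Conflating factor levels, as required for the three early time periods, can be done by assigning the same name to several levels. The sketch below shows this on a toy factor with hypothetical values; in the summary above the period column appears as VNCPERIOD, and the new level name P1-3 is an arbitrary choice made for this illustration.

```r
# Toy period vector (hypothetical values); in the summary above, the period
# column appears as VNCPERIOD
period <- factor(c("P1", "P2", "P3", "P4", "P5", "P3"))

# Merge P1, P2, and P3 by giving them the same level name
period_3 <- period
levels(period_3)[levels(period_3) %in% c("P1", "P2", "P3")] <- "P1-3"
print(table(period, period_3))   # check the mapping of old to new levels
```

Cross-tabulating the old against the new factor is a quick check that every original level ended up where intended.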