Ling 105: all assignments

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

01 Apr 2024 12-34-56

1 Assignment 01

Central question: How many X does the phrase some X next to a Y refer to? Your predictors are

  • OBJECT: the sizes of the objects X: large vs. small;
  • REFPOINT: the sizes of the reference points Y: large vs. small.

Analyze the data properly with a regression model and summarize the results (briefly). [Difficulty level: 1]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/quantifyingsome.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE         OBJECT   REFPOINT    ESTIMATE   
 Min.   : 1.00   large:8   large:8   Min.   : 2.0  
 1st Qu.: 4.75   small:8   small:8   1st Qu.:38.5  
 Median : 8.50                       Median :44.0  
 Mean   : 8.50                       Mean   :51.5  
 3rd Qu.:12.25                       3rd Qu.:73.0  
 Max.   :16.00                       Max.   :91.0  
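
A minimal sketch of one way to start, assuming that a linear model with both predictors and their interaction is the kind of regression model meant here. The data frame toy and its ESTIMATE values are simulated placeholders (only the 2-by-2 design with 8 cases per level follows the summary above), so the resulting numbers mean nothing:

```r
# Toy stand-in for d: same design as in the summary above, but ESTIMATE
# is simulated and NOT the real data from _input/quantifyingsome.csv
set.seed(1)
toy <- expand.grid(
   OBJECT=c("large", "small"),
   REFPOINT=c("large", "small"),
   REP=1:4, stringsAsFactors=TRUE)
toy$ESTIMATE <- round(runif(nrow(toy), min=2, max=91))

# Linear model with both main effects and their interaction
m_01 <- lm(ESTIMATE ~ 1 + OBJECT*REFPOINT, data=toy)
drop1(m_01, test="F")   # is the interaction needed?
```

With the real data, the same two calls would be applied to d instead of toy.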

2 Assignment 02

Central question: What determines the number of praises in child-caretaker interaction? The data come from recordings of different children and contain the following variables:

  • PRAISES: the response variable, the number of times the children are praised by their caretakers;
  • CHILD: the name of each child;
  • SEX: the sex of each child;
  • CAN: the number of verb phrases where the caretakers use can when speaking about actions of the child;
  • WANT: the number of verb phrases where the caretakers use want when speaking about actions of the child;
  • SHOULD_SHALL: the number of verb phrases where the caretakers use should/shall when speaking about actions of the child;
  • DIRECTIVE: the number of verb phrases where the caretakers use a directive when speaking about actions of the child;
  • SUCCESS: the number of times the child does something as intended;
  • FAILURE: the number of times the child does something not as intended.

You now want to determine to what degree the number of praises is a function of

  • all predictors as main effects
  • and each predictor's interaction with SEX.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 3]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/praises.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
     CHILD    SEX       PRAISES          CAN              WANT      
 aRetha : 1   f:15   Min.   : 0.0   Min.   : 0.000   Min.   : 0.00  
 aRnold : 1   m:13   1st Qu.: 2.0   1st Qu.: 1.000   1st Qu.: 0.75  
 baRbara: 1          Median : 5.0   Median : 4.000   Median : 2.00  
 beRnard: 1          Mean   : 5.5   Mean   : 4.321   Mean   : 3.25  
 chRis  : 1          3rd Qu.: 7.5   3rd Qu.: 5.250   3rd Qu.: 6.00  
 chRissy: 1          Max.   :13.0   Max.   :18.000   Max.   :10.00  
 (Other):22                                                         
  SHOULD_SHALL      DIRECTIVE        SUCCESS          FAILURE     
 Min.   :0.0000   Min.   : 0.00   Min.   : 0.000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 4.000   1st Qu.:1.000  
 Median :0.0000   Median :12.00   Median : 6.500   Median :3.000  
 Mean   :0.8929   Mean   :15.61   Mean   : 7.679   Mean   :3.286  
 3rd Qu.:1.2500   3rd Qu.:19.50   3rd Qu.:10.000   3rd Qu.:5.250  
 Max.   :6.0000   Max.   :46.00   Max.   :18.000   Max.   :8.000  
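
The model structure described above can be written as a formula sketch. This assumes that the task means the interaction of each predictor with SEX; the names f_02 and term_labels are only for illustration, and the Poisson model in the comment is one candidate for a count response, not necessarily the intended one:

```r
# Every predictor as a main effect plus its interaction with SEX:
# (a + b)*SEX expands to a + b + SEX + a:SEX + b:SEX
f_02 <- PRAISES ~ 1 + (CAN + WANT + SHOULD_SHALL + DIRECTIVE +
                       SUCCESS + FAILURE) * SEX

# Inspect which terms the formula contains (no data needed for this)
term_labels <- attr(terms(f_02), "term.labels")
print(term_labels)

# PRAISES is a count, so with d loaded one could then try, for example:
# m_02 <- glm(f_02, data=d, family=poisson)
```

Checking the expanded term labels before fitting is a cheap way to make sure the formula contains exactly the main effects and interactions you intended.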
                                                                  

3 Assignment 03

Central question: is the choice of of- vs. s-genitives (the car of my father vs. my father’s car) dependent in some way on the animacy of the possessor (my father) and/or the possessed (the car)? Your predictors are

  • POSSESSOR: the animacy of the possessor: abstract vs. animate vs. concrete;
  • POSSESSED: the animacy of the possessed: abstract vs. animate vs. concrete.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/genitives.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE      GENITIVE  SPEAKER       MODALITY      POR_LENGTH    
 Min.   :   2   of:2720   nns:2666   spoken :1685   Min.   :  1.00  
 1st Qu.:1006   s : 880   ns : 934   written:1915   1st Qu.:  8.00  
 Median :2018                                       Median : 11.00  
 Mean   :2012                                       Mean   : 14.58  
 3rd Qu.:3017                                       3rd Qu.: 17.00  
 Max.   :4040                                       Max.   :204.00  
   PUM_LENGTH         POR_ANIMACY   POR_FINAL_SIB        POR_DEF    
 Min.   :  2.00   animate   : 920   absent :2721   definite  :2349  
 1st Qu.:  6.00   collective: 607   present: 879   indefinite:1251  
 Median :  9.00   inanimate :1671                                   
 Mean   : 10.35   locative  : 243                                   
 3rd Qu.: 13.00   temporal  : 159                                   
 Max.   :109.00                                                     

4 Assignment 04

Central question: is the choice of try to- vs. try and-constructions (I’m gonna try to fix this problem vs. I’m gonna try and fix this problem, which is in the column TRY) dependent in some way on the following 3 predictors and all their interactions:

  • MODE: whether the data represent spoken (spk) or written (wrt) English;
  • VARIETY: whether the data represent American (amer) or British English (brit);
  • CLAUSE: does the clause in which try is used with to or and already involve another to (as in we’re going -> to <- try and beat this thing) or not (other)?

(Source: Hommerberg, Charlotte & Gunnel Tottie. 2007. Try to or Try and? Verb complementation in British and American English. ICAME Journal 31. 45-64.)

Analyze the data like we discussed and summarize the results (briefly). [Difficulty level: 1]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/tryandtryto.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE       TRY       VARIETY      MODE        CLAUSE    
 Min.   :   1   and:1631   amer:1187   spk:2257   other:1662  
 1st Qu.: 808   to :1598   brit:2042   wrt: 972   to   :1567  
 Median :1615                                                 
 Mean   :1615                                                 
 3rd Qu.:2422                                                 
 Max.   :3229                                                 

5 Assignment 05

Central question: is the choice of I vs. you, which is represented in the column MATCH, dependent in some way on the following 3 predictors and all their pairwise interactions:

  • SEX: whether the speaker is female or male;
  • SENTENCE: where in the file I or you was used on a scale from 0 (first sentence) to 1 (last sentence);
  • DISTANCE: where in the sentence I or you was used on a scale from 0 (first character) to ≈1 (last character).

The following loads the data and prepares the variable DISTANCE:

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(     # summarize d, the result of loading
   file="_input/IvsYou.csv", # this file
   stringsAsFactors=FALSE))  # don't change categorical variables into factors (!)
      CASE           FILE             SPEAKER              SEX           
 Min.   :    1   Length:21102       Length:21102       Length:21102      
 1st Qu.: 5276   Class :character   Class :character   Class :character  
 Median :10552   Mode  :character   Mode  :character   Mode  :character  
 Mean   :10552                                                           
 3rd Qu.:15827                                                           
 Max.   :21102                                                           
    SENTENCE       PRECEDING            MATCH            SUBSEQUENT       
 Min.   :0.0000   Length:21102       Length:21102       Length:21102      
 1st Qu.:0.2394   Class :character   Class :character   Class :character  
 Median :0.5147   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.5014                                                           
 3rd Qu.:0.7573                                                           
 Max.   :1.0000                                                           
d$SENTLENGTH <- nchar(d$PRECEDING)  +
                nchar(d$MATCH)      +
                nchar(d$SUBSEQUENT)
d$DISTANCE <- nchar(d$PRECEDING)/d$SENTLENGTH
d <- d[,c(1:3,7,4:5,9:10)]; d[,2:5] <- lapply(d[,2:5], as.factor)
summary(d)
      CASE            FILE         SPEAKER      MATCH       SEX      
 Min.   :    1   KRL    :4610   PS5VN  : 1248   i  :    2    : 1043  
 1st Qu.: 5276   KRH    :3590   PS62L  :  852   I  :11637   f: 6676  
 Median :10552   KRT    :3093   PS63K  :  785   you: 8619   m:12480  
 Mean   :10552   KRP    :1997   PS5T8  :  655   You:  844   u:  903  
 3rd Qu.:15827   KR0    :1445   PS5VL  :  647                        
 Max.   :21102   KRG    :1385   PS59B  :  632                        
                 (Other):4982   (Other):16283                        
    SENTENCE        SENTLENGTH        DISTANCE     
 Min.   :0.0000   Min.   :   1.0   Min.   :0.0000  
 1st Qu.:0.2394   1st Qu.:  65.0   1st Qu.:0.0351  
 Median :0.5147   Median : 141.0   Median :0.2453  
 Mean   :0.5014   Mean   : 181.3   Mean   :0.3197  
 3rd Qu.:0.7573   3rd Qu.: 250.0   3rd Qu.:0.5600  
 Max.   :1.0000   Max.   :1353.0   Max.   :0.9978  
                                                   

Analyze the data properly and summarize the results (briefly). [Difficulty level: 4]
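
The second summary above shows that MATCH still has four levels (i, I, you, You) and that SEX has an empty level and a level u. The following sketch shows one possible way to tidy that up before modeling; it uses a small made-up data frame toy, and dropping the speakers of unknown sex is an assumption about what to do with them, not a given:

```r
# Toy stand-in with the same kinds of values as in the summary above
toy <- data.frame(
   MATCH=c("I", "you", "You", "i", "I", "you"),
   SEX  =c("f", "m", "", "u", "m", "f"),
   stringsAsFactors=FALSE)

# Collapse capitalization variants so the response has only 2 levels
toy$MATCH <- factor(tolower(toy$MATCH))

# Keep only speakers whose sex is known (an assumption, see above)
toy <- droplevels(toy[toy$SEX %in% c("f", "m"), ])
toy$SEX <- factor(toy$SEX)
summary(toy)

# 3 predictors and all their pairwise (but no 3-way) interactions
f_05 <- MATCH ~ 1 + (SEX + SENTENCE + DISTANCE)^2
attr(terms(f_05), "term.labels")
```

With the real data, the same steps would be applied to d, and f_05 could then go into a binary logistic regression.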

6 Assignment 06

Central question: Do n-grams returned early by an algorithm (BINRANK: early) get rated better (ordinal response: RATING) than n-grams returned late by that algorithm (BINRANK: late) if one controls for the length of the n-gram (SIZE)? The data frame contains the following variables:

  • RATING: the response variable, integers from 1 to 7;
  • SIZE: the number of parts of each n-gram;
  • BINRANK: the main predictor as per the above.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/MERGErating.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE             GRAM       PARTICIPANT       SCORE            SIZE     
 Min.   :   1.0   GRAM001:   5   A1     :  80   Min.   :1.000   Min.   :2.00  
 1st Qu.: 400.8   GRAM002:   5   A2     :  80   1st Qu.:1.000   1st Qu.:2.75  
 Median : 800.5   GRAM003:   5   A3     :  80   Median :3.000   Median :3.50  
 Mean   : 800.5   GRAM004:   5   A4     :  80   Mean   :3.758   Mean   :3.50  
 3rd Qu.:1200.2   GRAM005:   5   A5     :  80   3rd Qu.:7.000   3rd Qu.:4.25  
 Max.   :1600.0   GRAM006:   5   B1     :  80   Max.   :7.000   Max.   :5.00  
                  (Other):1570   (Other):1120                                 
  BINRANK   
 early:800  
 late :800  
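
A minimal sketch of an ordinal regression for this kind of question, assuming that the ratings are in the column SCORE shown in the summary above and that polr() from the MASS package is an acceptable tool. The data frame toy is simulated, and the sketch ignores that the real data contain repeated measurements per GRAM and PARTICIPANT:

```r
library(MASS)   # for polr(); MASS ships with standard R installations

# Toy stand-in for d: simulated ratings, NOT the real data
set.seed(1)
toy <- data.frame(
   SCORE  =sample(1:7, 200, replace=TRUE),
   SIZE   =sample(2:5, 200, replace=TRUE),
   BINRANK=factor(rep(c("early", "late"), each=100)))

# The response must be an ordered factor for an ordinal model
toy$SCORE <- factor(toy$SCORE, ordered=TRUE)

# Ordinal logistic regression: BINRANK while controlling for SIZE
m_06 <- polr(SCORE ~ 1 + SIZE + BINRANK, data=toy, Hess=TRUE)
summary(m_06)
```

Because the toy ratings are random, the coefficients here carry no information; the point is only the ordered-factor conversion and the model call.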
            
            
            
            
            

7 Assignment 07

Central question: Are results on subordinate clause ordering from the studies of Hampe and Diessel comparable/compatible? Here are the data:

  • CASE: the usual numbering column;
  • STUDY: a column indicating to which study a data point in a row belongs: diessel vs. hampe;
  • ORDER: the response variable in each study, the order of main and subordinate clause (and you know this response from another study in the book);
  • CONJ: the predictor in each study, the subordinate conjunction used in the subordinate clause:

rm(list=ls(all.names=TRUE))
d <- data.frame(
   STUDY=rep(c("diessel", "hampe"), 8),
   ORDER=rep(c("sc-mc", "mc-sc"), each=8),
   CONJ =rep(rep(c("after", "before", "once", "until"), each=2), 2),
   FREQ =c(27, 82, 6, 105, 77, 236, 5, 41, 70, 200, 81, 425, 21, 74, 94, 346))
d <- data.frame(lapply(d[, -4], \(af) { rep(af, d$FREQ) }))
d <- data.frame(lapply(d, as.factor))
summary(d <- cbind(CASE=seq(nrow(d)), d))
      CASE            STUDY        ORDER          CONJ    
 Min.   :   1.0   diessel: 381   mc-sc:1311   after :379  
 1st Qu.: 473.2   hampe  :1509   sc-mc: 579   before:617  
 Median : 945.5                               once  :408  
 Mean   : 945.5                               until :486  
 3rd Qu.:1417.8                                           
 Max.   :1890.0                                           

Are Hampe’s and Diessel’s findings ‘the same’? Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]
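
One possible starting point, sketched under the assumption that 'the same' can be operationalized as the absence of an interaction between CONJ and STUDY. The sketch keeps the frequencies from the code above in aggregated form (the name agg is only for illustration) and uses them as weights:

```r
# Same frequencies as in the data frame above, kept in aggregated form
agg <- data.frame(
   STUDY=rep(c("diessel", "hampe"), 8),
   ORDER=rep(c("sc-mc", "mc-sc"), each=8),
   CONJ =rep(rep(c("after", "before", "once", "until"), each=2), 2),
   FREQ =c(27, 82, 6, 105, 77, 236, 5, 41, 70, 200, 81, 425, 21, 74, 94, 346),
   stringsAsFactors=TRUE)

# Observed proportions of each ORDER per conjunction, separately per study
ftable(prop.table(xtabs(FREQ ~ STUDY + CONJ + ORDER, data=agg), 1:2))

# Binary logistic regression on the aggregated counts: if the two studies
# agree, the interaction CONJ:STUDY should not be needed
m_07 <- glm(ORDER ~ 1 + CONJ*STUDY, data=agg, weights=FREQ, family=binomial)
drop1(m_07, test="Chisq")
```

Fitting the same formula to the case-by-case data frame d gives the same coefficients; the aggregated version is just more compact.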

8 Assignment 08

Central question: What determines how speakers rate the acceptability (the 7-level response variable RATING) of to- vs. -ing complementation (as in I like to swim vs. I like swimming) in an experiment?

  • CX_NOW: whether the current experimental stimulus is a to or an -ing construction;
  • VNOW_PREF: whether the verb in the current experimental stimulus generally prefers to appear with a to or an -ing construction;
  • CX_PRV: whether the previous experimental stimulus was a to or an -ing construction;
  • any interactions of these predictors.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/toingpriming.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE           RATING        CXNOW     VNOW_PREF CXPREV   
 Min.   :  1.0   Min.   :-3.0000   ing:270   ing:280   ing:278  
 1st Qu.:139.8   1st Qu.:-1.0000   to :286   to :276   to :278  
 Median :278.5   Median : 0.0000                                
 Mean   :278.5   Mean   : 0.3705                                
 3rd Qu.:417.2   3rd Qu.: 2.0000                                
 Max.   :556.0   Max.   : 3.0000                                

9 Assignment 09

Central question: Do children and their caretakers exhibit different correlations (measured in Cramer’s V values) between tense (past vs. non-past) and aspect (perfective vs. imperfective) such that

  • adults’ correlation values don’t change over time anymore;
  • children’s correlation values change over time and approximate the adults’ value(s).

You have data from a corpus study and these are the variables in the data frame:

  • AGE: the age of the child at recording time: YEAR;MONTH.DAY;
  • KID: the Cramer’s V value for the child’s tense-aspect correlation in this recording;
  • CARETAKER: the Cramer’s V value for the caretaker’s tense-aspect correlation in this recording.

Note: In any graph involving time, the axis representing the age of the child must of course be proportional to the age, not just to the position of an age in the vector of ages. I don’t care how you do that; if you do it in spreadsheet software, that’s fine, too.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 3]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(    # summarize d, the result of loading
   file="_input/russaspect.csv", # this file
   stringsAsFactors=FALSE)) # don't change categorical variables into factors (!)
      CASE           AGE                 KID            CARETAKER     
 Min.   : 1.00   Length:80          Min.   :0.01645   Min.   :0.1627  
 1st Qu.:20.75   Class :character   1st Qu.:0.31861   1st Qu.:0.3004  
 Median :40.50   Mode  :character   Median :0.44217   Median :0.3554  
 Mean   :40.50                      Mean   :0.45170   Mean   :0.3640  
 3rd Qu.:60.25                      3rd Qu.:0.57247   3rd Qu.:0.4355  
 Max.   :80.00                      Max.   :1.00000   Max.   :0.5586  
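
Since AGE is loaded as a character vector, it needs to be converted into a number before it can serve as a proportional time axis. A minimal sketch of one way to do that, assuming every value follows the YEAR;MONTH.DAY format exactly; the three example ages and the names ages, parts, and age_months are made up for illustration:

```r
# Hypothetical example ages in the YEAR;MONTH.DAY format described above
ages <- c("1;6.15", "2;0.3", "2;11.28")

# Split each age at ";" or "." into its year, month, and day parts
parts <- do.call(rbind, lapply(strsplit(ages, "[;.]"), as.numeric))

# Approximate age in months (assuming 30.4375 days per month on average)
age_months <- parts[,1]*12 + parts[,2] + parts[,3]/30.4375
age_months

# With d loaded, the same steps would apply to d$AGE; a plot such as
# plot(age_months, d$KID) then has a time axis proportional to age
```

If some ages in the real data lack the day part, the splitting step would need to handle that case separately.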

10 Assignment 10

Central question: what factors co-determine how English changed from a 3rd-person singular -th (e.g., He giveth) to the current 3rd-person singular -s (e.g., He gives)? You have data from a corpus study on how the third person singular form in English changed across five time periods (from P1 at about 1480 to P5 at about 1700). This data set contains annotation for third person singular verbs (extracted from letters) with regard to the following variables:

  • VARIANT: the response variable: the third person singular form as found in the corpus file: es vs. th;
  • TIME5: the time period: P1 vs. P2 vs. P3 vs. P4 vs. P5;
  • SENGEND: the sex of the sender of the letter: female vs. male;
  • RECGEND: the sex of the recipient of the letter: female vs. male;
  • CLOSEFAM: whether the recipient of the letter is a close family member of the sender or not: no vs. yes;
  • FINSYB: whether the verb stem ends in a sibilant: no (e.g., see) vs. yes (e.g., seize);
  • FOLFRIC: what the word following the third person singular form begins with: s (e.g., he sees seagulls) vs. th (e.g., he sees the seagulls) vs. other (e.g., he sees many seagulls);
  • GRAM: whether the verb in question is used as a grammatical or a lexical verb: yes (grammatical, i.e. be, do and aux. have) vs. no (lexical/other).

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(    # summarize d, the result of loading
   file="_input/thirdpers.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
 VARIANT    AUTH_GEND    REC_SAME_GEND CLOSE_FAM  VNCPERIOD FIN_SYB   
 es:1524   female: 784   no :1210      no :1917   P1: 505   no :3953  
 th:2619   male  :3359   yes:2933      yes:2226   P2:  99   yes: 190  
                                                  P3:1508             
                                                  P4:1096             
                                                  P5: 935             
  FOL_FRIC     GRAM     
 es   : 189   no :2867  
 other:3666   yes:1276  
 th   : 288             
                        
                        

You want to characterize how the predictors and their pairwise interactions with TIME are correlated with the change from -(e)th to -(e)s. Analyze the data properly and summarize the results (briefly). Note: you must conflate the 3 early time periods into one, but once you’re done with everything, you should figure out why. [Difficulty level: 4]
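
Conflating factor levels can be done by assigning the same name to several levels. A minimal sketch on a made-up vector; the names period and period_conf and the new level label P123 are arbitrary choices for illustration:

```r
# Toy stand-in for the period column (VNCPERIOD in the summary above)
period <- factor(c("P1", "P2", "P3", "P4", "P5", "P3", "P1"))

# Assigning the same new name to several levels merges them
period_conf <- period
levels(period_conf) <- c("P123", "P123", "P123", "P4", "P5")

# Cross-tabulate old against new to check the recoding
table(period, period_conf)
```

Keeping the original variable and cross-tabulating it against the conflated one makes it easy to verify that no data points ended up in the wrong period.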