Assignment files for modeling case studies

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

03 Aug 2025 12-34-56

1 <datives.csv>

This data set is concerned with the dative alternation in English, i.e. whether a speaker says [NP-Agent John] gave [NP-Recipient Mary] [NP-Patient a dead cat] (ditransitive) or [NP-Agent John] gave [NP-Patient a dead cat] [PP to [NP-Recipient Mary]] (prepositional dative). The question is which factors govern this linguistic/structural choice. This is the corpus-based data set:

rm(list=ls(all.names=TRUE))
summary(x <- read.delim("_input/datives.csv", stringsAsFactors=TRUE))
      CASE             CONSTRUCTION    V_CHANGPOSS    AGENT_ACT   
 Min.   :  1.0   ditransitive:200   change   :252   Min.   :0.00  
 1st Qu.:100.8   prep_dative :200   no_change:146   1st Qu.:2.00  
 Median :200.5                      NA's     :  2   Median :4.00  
 Mean   :200.5                                      Mean   :4.38  
 3rd Qu.:300.2                                      3rd Qu.:7.00  
 Max.   :400.0                                      Max.   :9.00  
    REC_ACT        PAT_ACT     
 Min.   :0.00   Min.   :0.000  
 1st Qu.:2.00   1st Qu.:2.000  
 Median :5.00   Median :4.000  
 Mean   :4.63   Mean   :4.407  
 3rd Qu.:7.00   3rd Qu.:7.000  
 Max.   :9.00   Max.   :9.000  

These are the variables in this data set (see here for the roxygenized comments):

  • CASE: a case number (can be ignored);
  • CONSTRUCTION: the response variable encoding which construction a speaker used: ditransitive or prep_dative;
  • V_CHANGPOSS: a predictor encoding whether the verb in the clause encodes a change of possession of the patient from the agent to the recipient (change, e.g., give or hand) or not (no_change, e.g., promise);
  • AGENT_ACT: a predictor encoding how discourse-given the referent of the agent is (John in the above example):
    • 0 means ‘the referent of the agent is completely new to the conversation’;
    • 9 means ‘the referent of the agent was mentioned in the immediately preceding clause’;
  • REC_ACT: a predictor encoding how discourse-given the referent of the recipient is (Mary in the above example):
    • 0 means ‘the referent of the recipient is completely new to the conversation’;
    • 9 means ‘the referent of the recipient was mentioned in the immediately preceding clause’;
  • PAT_ACT: a predictor encoding how discourse-given the referent of the patient is (a dead cat in the above example):
    • 0 means ‘the referent of the patient is completely new to the conversation’;
    • 9 means ‘the referent of the patient was mentioned in the immediately preceding clause’.
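
Since the response CONSTRUCTION is binary and V_CHANGPOSS contains two missing values, one possible first modeling step is a binary logistic regression on the complete cases. The following is only a minimal sketch under stated assumptions, not necessarily the analysis the assignment is after: the data frame toy contains made-up values that merely mimic the column layout shown in the summary above, and fit_datives is a hypothetical helper name.

```r
# Toy stand-in for the real file (made-up values; only the column layout
# follows the summary above). With the real data one would instead use
# x <- read.delim("_input/datives.csv", stringsAsFactors=TRUE)
set.seed(1)
n <- 60
toy <- data.frame(
   CASE         = seq_len(n),
   CONSTRUCTION = factor(sample(c("ditransitive", "prep_dative"), n, replace=TRUE)),
   V_CHANGPOSS  = factor(sample(c("change", "no_change"), n, replace=TRUE)),
   AGENT_ACT    = sample(0:9, n, replace=TRUE),
   REC_ACT      = sample(0:9, n, replace=TRUE),
   PAT_ACT      = sample(0:9, n, replace=TRUE))
toy$V_CHANGPOSS[c(3, 17)] <- NA   # mimic the 2 missing values

# Hypothetical helper: drop incomplete cases, then fit a binary logistic
# regression; glm predicts the 2nd factor level of the response (prep_dative)
fit_datives <- function(d) {
   d <- droplevels(d[complete.cases(d), ])
   glm(CONSTRUCTION ~ V_CHANGPOSS + AGENT_ACT + REC_ACT + PAT_ACT,
       data=d, family=binomial)
}

m_toy <- fit_datives(toy)
nobs(m_toy)   # 58: the 2 rows with a missing V_CHANGPOSS were dropped
```

With the real data, the call would be fit_datives(x); CASE is deliberately left out of the formula because it is only a row identifier.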

2 <toingpriming.csv>

This data set is concerned with the to-/-ing alternation in English, i.e. whether a speaker says I like to swim or I like swimming. In an experiment, native speakers of English were presented with two sentences: a prime/context sentence that already involved a to-/-ing alternation, and a target sentence involving another to-/-ing alternation, which the subjects were supposed to rate with regard to its acceptability on a 7-point scale from -3 to +3. This is the experimental data set:

rm(list=ls(all.names=TRUE))
summary(x <- read.delim("_input/toingpriming.csv", stringsAsFactors=TRUE))
      CASE           RATING        CXPREV    CXNOW     VNOW_PREF
 Min.   :  1.0   Min.   :-3.0000   ing:278   ing:270   ing:280  
 1st Qu.:139.8   1st Qu.:-1.0000   to :278   to :286   to :276  
 Median :278.5   Median : 0.0000                                
 Mean   :278.5   Mean   : 0.3705                                
 3rd Qu.:417.2   3rd Qu.: 2.0000                                
 Max.   :556.0   Max.   : 3.0000                                

These are the variables in this data set (see here for the roxygenized comments):

  • CASE: a case number (can be ignored);
  • CXPREV: a predictor encoding whether the prime sentence was a to or an ing construction;
  • CXNOW: a predictor encoding whether the target sentence to be rated was a to or an ing construction;
  • VNOW_PREF: a predictor encoding whether the target sentence contained a verb that is known to prefer to-constructions or a verb that is known to prefer ing-constructions;
  • RATING: the response variable encoding a native speaker's acceptability judgment of the target sentence on the 7-point scale:
    • -3 means ‘the speaker considered the sentence completely unacceptable’;
    • 0 means ‘the speaker considered the sentence intermediately (un)acceptable’;
    • +3 means ‘the speaker considered the sentence perfectly acceptable’.
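
In a priming design like this one, the effect of interest is typically whether the rating of a target construction depends on the construction of the prime, i.e. the interaction of CXPREV and CXNOW. The following sketch shows one way to look at that; it rests on several assumptions: the data frame toy contains made-up values that only mimic the column layout shown in the summary above, and treating the 7-point rating as numeric in a linear model is a simplification (an ordinal model would be an alternative).

```r
# Toy stand-in for the real file (made-up values; only the column layout
# follows the summary above). With the real data one would instead use
# x <- read.delim("_input/toingpriming.csv", stringsAsFactors=TRUE)
set.seed(2)
n <- 80
toy <- data.frame(
   CASE      = seq_len(n),
   RATING    = sample(-3:3, n, replace=TRUE),
   CXPREV    = factor(sample(c("ing", "to"), n, replace=TRUE)),
   CXNOW     = factor(sample(c("ing", "to"), n, replace=TRUE)),
   VNOW_PREF = factor(sample(c("ing", "to"), n, replace=TRUE)))

# Mean rating for each prime-target combination (a 2x2 table)
cell_means <- with(toy,
   tapply(RATING, list(CXPREV=CXPREV, CXNOW=CXNOW), mean))
cell_means

# Linear model in which the CXPREV:CXNOW interaction represents priming
m_toy <- lm(RATING ~ CXPREV * CXNOW + VNOW_PREF, data=toy)
summary(m_toy)$coefficients
```

Because the toy ratings are random, the coefficients here are meaningless; the point is only the model structure.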

3 <thirdpers.csv>

This data set is concerned with the third-person singular suffix in diachronic English, specifically whether speakers wrote the old form (giveth) or the newer and now contemporary form (gives). The question is what made letter writers choose which form (because speakers were not consistently using one and the same form even in the same letter). This is the corpus-based data set (based on letters written between 1400 and 1700):

rm(list=ls(all.names=TRUE))
summary(x <- read.delim("_input/thirdpers.csv", stringsAsFactors=TRUE))
 VARIANT    AUTH_GEND    REC_SAME_GEND CLOSE_FAM  VNCPERIOD FIN_SYB   
 es:1524   female: 784   no :1210      no :1917   P1: 505   no :3953  
 th:2619   male  :3359   yes:2933      yes:2226   P2:  99   yes: 190  
                                                  P3:1508             
                                                  P4:1096             
                                                  P5: 935             
  FOL_FRIC     GRAM     
 es   : 189   no :2867  
 other:3666   yes:1276  
 th   : 288             
                        
                        

These are the variables in this data set (see here for the roxygenized comments):

  • VARIANT: the response variable: th (as in giveth) vs. es (as in gives);
  • AUTH_GEND: a predictor encoding the sex of the writer of the letter (female vs. male);
  • REC_SAME_GEND: a predictor encoding the sex of the recipient of the letter: the same as that of the writer (yes) or not (no);
  • CLOSE_FAM: a predictor encoding whether the recipient of the letter was a close family member of the writer of the letter (no vs. yes);
  • VNCPERIOD: a predictor encoding the time period: lower numbers indicate earlier times (P1 begins at 1400) and higher numbers indicate later times (P5 ends at 1700);
  • FIN_SYB: a predictor encoding whether the stem of the verb used with a third person singular ends in a sibilant: yes (as in promises) vs. no (as in does);
  • FOL_FRIC: a predictor encoding whether the word after the verb used with a third person singular begins with an s (as in promises sleeping cats for everyone), a th (as in promises three cats for everyone), or something else (other, as in promises many cats for everyone);
  • GRAM: a predictor encoding whether the verb used with a third person singular is a grammatical verb (yes, i.e. it is a form of be, do, or have) or not (no, i.e. it is a lexical verb).
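
Because the data span five periods, a natural first look is the share of each variant per period, followed by a binary logistic regression for VARIANT. The following is a minimal sketch under stated assumptions: the data frame toy contains made-up values for just three of the columns listed above, and the choice of predictors is purely illustrative.

```r
# Toy stand-in for the real file (made-up values; column names and levels
# follow the summary above). With the real data one would instead use
# x <- read.delim("_input/thirdpers.csv", stringsAsFactors=TRUE)
set.seed(3)
n <- 200
toy <- data.frame(
   VARIANT   = factor(sample(c("es", "th"), n, replace=TRUE)),
   VNCPERIOD = factor(sample(paste0("P", 1:5), n, replace=TRUE)),
   GRAM      = factor(sample(c("no", "yes"), n, replace=TRUE)))

# Proportion of each variant within each period (each row sums to 1)
props <- prop.table(table(toy$VNCPERIOD, toy$VARIANT), margin=1)
round(props, 2)

# Binary logistic regression; glm predicts the 2nd factor level of the
# response, which is th given the alphabetical level order (es, th)
m_toy <- glm(VARIANT ~ VNCPERIOD + GRAM, data=toy, family=binomial)
summary(m_toy)$coefficients
```

With the default treatment contrasts, P1 serves as the reference period, so each VNCPERIOD coefficient compares a later period against the earliest one.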