Ling 104, session 02: R basics (key)
1 Exercise 01
Generate a data frame abc that contains the letters from “a” to “j” in the first column and the integers from 10 to 1 in the second column. Make sure the first column is called “LETTER” and the second “NUMBER” and that columns with categorical variables are factors.
Here is the most stepwise way to do this: We first set up the vectors …
… and then put them in the data frame:
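A minimal sketch of both steps, assuming the two vectors are simply named after the target columns so that `data.frame()` picks those names up automatically:

```r
# Step 1: set up the two vectors
LETTER <- factor(letters[1:10])  # "a" to "j", stored as a factor (categorical)
NUMBER <- 10:1                   # the integers from 10 down to 1

# Step 2: combine them; the vector names become the column names
abc <- data.frame(LETTER, NUMBER)
str(abc)                         # check column names and types
```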
Here’s how to do this in one go, without creating LETTER and NUMBER separately first:
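One way this could look, as a sketch: the columns are named inside the call, and `stringsAsFactors = TRUE` turns the character column into a factor.

```r
abc <- data.frame(
   LETTER = letters[1:10],   # "a" to "j"
   NUMBER = 10:1,            # 10 down to 1
   stringsAsFactors = TRUE)  # character columns become factors
str(abc)
```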
2 Exercise 02
Load the text file _input/dataframe1.csv into a data frame example such that the first row is recognized as containing the column names and columns with categorical variables are factors.
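A hedged sketch of the loading step. It assumes the file is tab-separated despite its .csv extension (for a comma-separated file, `read.csv()` would be the equivalent). So that the snippet runs on its own, it first writes a small stand-in file containing four of the rows shown in this key; the temporary file is not part of the course materials.

```r
# Stand-in for _input/dataframe1.csv (assumption: tab-separated, header in row 1)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
   "CASE\tGRMRELATION\tLENGTH\tDEFINITENESS",
   "1\tobj\t2\tdef",
   "4\tobj\t6\tindef",
   "7\tsubj\t3\tdef",
   "10\tsubj\t9\tindef"), tmp)

# header = TRUE: first row holds the column names
# stringsAsFactors = TRUE: categorical (character) columns become factors
example <- read.delim(tmp, header = TRUE, stringsAsFactors = TRUE)
str(example)
```

With the real file, the call would presumably be `read.delim("_input/dataframe1.csv", stringsAsFactors = TRUE)`.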
'data.frame': 12 obs. of 4 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ GRMRELATION : Factor w/ 2 levels "obj","subj": 1 1 1 1 1 1 2 2 2 2 ...
$ LENGTH : int 2 2 10 6 7 4 3 9 9 9 ...
$ DEFINITENESS: Factor w/ 2 levels "def","indef": 1 1 1 2 2 2 1 1 1 2 ...
3 Exercise 03
Extract from this data frame example
- the second and third column;
[1] obj obj obj obj obj obj subj subj subj subj subj subj
Levels: obj subj
[1] 2 2 10 6 7 4 3 9 9 9 7 9
GRMRELATION LENGTH
1 obj 2
2 obj 2
3 obj 10
4 obj 6
5 obj 7
6 obj 4
7 subj 3
8 subj 9
9 subj 9
10 subj 9
11 subj 7
12 subj 9
- the third and fourth row.
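Both extractions can be sketched with square-bracket indexing (rows before the comma, columns after it). The data frame is rebuilt here from the values printed in this key so that the snippet is self-contained.

```r
# example, rebuilt from the output shown in this key
example <- data.frame(
   CASE         = 1:12,
   GRMRELATION  = rep(c("obj", "subj"), each = 6),
   LENGTH       = c(2, 2, 10, 6, 7, 4, 3, 9, 9, 9, 7, 9),
   DEFINITENESS = rep(rep(c("def", "indef"), each = 3), times = 2),
   stringsAsFactors = TRUE)

example[, 2:3]                         # second and third column, by position
example[, c("GRMRELATION", "LENGTH")]  # the same columns, by name
example[3:4, ]                         # third and fourth row
```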
4 Exercise 04
Split the data frame example up according to the content of the second column (enter ?split at the R prompt for help) and call the result example.split.
example.split <- split( # make example.split the result of splitting up
example, # the data frame example
example$GRMRELATION) # depending on the values of the column GRMRELATION
# separate manual alternatives:
subset( # show a subset
example, # of the data frame example, namely
example$GRMRELATION=="obj") # when GRMRELATION is "obj"
CASE GRMRELATION LENGTH DEFINITENESS
1 1 obj 2 def
2 2 obj 2 def
3 3 obj 10 def
4 4 obj 6 indef
5 5 obj 7 indef
6 6 obj 4 indef
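subset( # show a subset
example, # of the data frame example, namely
example$GRMRELATION=="subj") # when GRMRELATION is "subj"
CASE GRMRELATION LENGTH DEFINITENESS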
7 7 subj 3 def
8 8 subj 9 def
9 9 subj 9 def
10 10 subj 9 indef
11 11 subj 7 indef
12 12 subj 9 indef
5 Exercise 05
Change the value at the intersection of the third row and the fourth column into “indef” and save the changed data frame into _output/dataframe2.csv so that you can easily load and edit it in spreadsheet software.
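A possible solution, sketched under two assumptions: "easy to load in a spreadsheet" is taken to mean tab-separated without quotes or row names, and a temporary file stands in for _output/dataframe2.csv so that the snippet runs anywhere.

```r
# example, rebuilt from the output shown in this key
example <- data.frame(
   CASE         = 1:12,
   GRMRELATION  = rep(c("obj", "subj"), each = 6),
   LENGTH       = c(2, 2, 10, 6, 7, 4, 3, 9, 9, 9, 7, 9),
   DEFINITENESS = rep(rep(c("def", "indef"), each = 3), times = 2),
   stringsAsFactors = TRUE)

# row 3, column 4; this works directly because "indef" is already a factor level
example[3, 4] <- "indef"

out <- tempfile(fileext = ".csv")  # stand-in for "_output/dataframe2.csv"
write.table(example, file = out,
   sep = "\t",         # tab-separated columns
   quote = FALSE,      # no quotes around strings
   row.names = FALSE)  # no row-number column
```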
6 Exercise 06
Generate the following data frame and call it EPP (for English personal pronouns):
| PRONOUN | PERSON | NUMBER |
|---|---|---|
| I | 1 | sg |
| you | 2 | sg |
| he | 3 | sg |
| she | 3 | sg |
| it | 3 | sg |
| we | 1 | pl |
| you | 2 | pl |
| they | 3 | pl |
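The table can be typed in column by column. In this sketch PERSON is kept numeric and the two categorical columns become factors; treating PERSON as a factor as well would be an equally defensible choice.

```r
EPP <- data.frame(
   PRONOUN = c("I", "you", "he", "she", "it", "we", "you", "they"),
   PERSON  = c(1, 2, 3, 3, 3, 1, 2, 3),
   NUMBER  = rep(c("sg", "pl"), c(5, 3)),  # 5 singular rows, then 3 plural rows
   stringsAsFactors = TRUE)
EPP
```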
PRONOUN PERSON NUMBER
1 I 1 sg
2 you 2 sg
3 he 3 sg
4 she 3 sg
5 it 3 sg
6 we 1 pl
7 you 2 pl
8 they 3 pl
7 Exercise 07
Extract from this data frame
- the value of the 4th row and the 2nd column;
- the values of the 3rd to 4th rows and the 1st to 2nd columns;
- the rows that have plural pronouns in them;
PRONOUN PERSON NUMBER
6 we 1 pl
7 you 2 pl
8 they 3 pl
- the rows with 1st and 3rd person pronouns.
PRONOUN PERSON NUMBER
1 I 1 sg
3 he 3 sg
4 she 3 sg
5 it 3 sg
6 we 1 pl
8 they 3 pl
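The four extractions above can be sketched as follows; where two lines are given for the same task, they are alternatives that return the same rows.

```r
EPP <- data.frame(
   PRONOUN = c("I", "you", "he", "she", "it", "we", "you", "they"),
   PERSON  = c(1, 2, 3, 3, 3, 1, 2, 3),
   NUMBER  = rep(c("sg", "pl"), c(5, 3)),
   stringsAsFactors = TRUE)

EPP[4, 2]                         # 4th row, 2nd column
EPP[3:4, 1:2]                     # rows 3 to 4, columns 1 to 2

EPP[EPP$NUMBER == "pl", ]         # plural pronouns, logical indexing
subset(EPP, NUMBER == "pl")       # the same with subset()

EPP[EPP$PERSON %in% c(1, 3), ]    # 1st and 3rd person
EPP[EPP$PERSON != 2, ]            # the same, by excluding 2nd person
```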
8 Exercise 08
Generate a vector FREQS of the frequencies with which the personal pronouns in EPP occurred in a small corpus: I: 8426, you: 9462, he: 6394, she: 4234, it: 6040, we: 2305, you: 8078, they: 2998. Then, make this vector the fourth column of EPP.
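A short sketch: the frequencies are entered in the row order of EPP and then attached as a new column.

```r
EPP <- data.frame(
   PRONOUN = c("I", "you", "he", "she", "it", "we", "you", "they"),
   PERSON  = c(1, 2, 3, 3, 3, 1, 2, 3),
   NUMBER  = rep(c("sg", "pl"), c(5, 3)),
   stringsAsFactors = TRUE)

# frequencies in the same order as the rows of EPP
FREQS <- c(8426, 9462, 6394, 4234, 6040, 2305, 8078, 2998)

EPP$FREQS <- FREQS  # alternative: EPP <- cbind(EPP, FREQS)
EPP
```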
9 Exercise 09
Save the data frame into _output/dataframe3.csv so that you can easily load and edit it in spreadsheet software.
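A sketch of the saving step with a round-trip check. As before, a tab-separated file without quotes or row names is an assumption about what "easy to load" means here, and a temporary file stands in for _output/dataframe3.csv.

```r
EPP <- data.frame(
   PRONOUN = c("I", "you", "he", "she", "it", "we", "you", "they"),
   PERSON  = c(1, 2, 3, 3, 3, 1, 2, 3),
   NUMBER  = rep(c("sg", "pl"), c(5, 3)),
   FREQS   = c(8426, 9462, 6394, 4234, 6040, 2305, 8078, 2998),
   stringsAsFactors = TRUE)

out <- tempfile(fileext = ".csv")  # stand-in for "_output/dataframe3.csv"
write.table(EPP, file = out, sep = "\t", quote = FALSE, row.names = FALSE)

# reading the file back in is a quick way to confirm it was written as intended
back <- read.delim(out, stringsAsFactors = TRUE)
str(back)
```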
10 Exercise 10
The file _input/dataframe4.csv contains data for the VERB into VERBing construction in the BNC (e.g., He [V1 forced] him into [V2 speaking] about it). For each instance of one such construction, the file contains
- a column called BNC: the file where the instance was found (A06 in the first case);
- a column called VERB_LEMMA: the lemma of the finite verb (force);
- a column called ING_FORM: the gerund (speaking);
- a column called ING_LEMMA: the lemma of the gerund (speak);
- a column called ING_TAG: the part-of-speech tag of the gerund (VVG in the first case).
Load this file into a data frame COV and display the first six rows of COV. Correct the typo in row 3, where ING_LEMMA is tak (it should be take).
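A sketch of the three steps. Because the real file is not available here, a six-row stand-in is built from the first rows printed in this key; the commented-out `read.delim()` call shows what loading the real file might look like, assuming it is tab-separated.

```r
# with the real file (assumption: tab-separated, header in row 1):
# COV <- read.delim("_input/dataframe4.csv", stringsAsFactors = TRUE)

# stand-in: the first six rows as printed in this key
COV <- data.frame(
   BNC        = c("A06", "A08", "A0C", "A0F", "A0H", "A0H"),
   VERB_LEMMA = c("force", "nudge", "talk", "bully", "influence", "delude"),
   ING_FORM   = c("speaking", "being", "taking", "taking", "trying", "thinking"),
   ING_LEMMA  = c("speak", "be", "tak", "take", "try", "think"),
   ING_TAG    = c("VVG", "VBG", "VVG", "VVG", "VVG", "VVG"),
   stringsAsFactors = TRUE)

head(COV)                    # first six rows

# fix the typo; this works directly because "take" is already a level of ING_LEMMA
COV$ING_LEMMA[3] <- "take"
head(COV)
```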
BNC VERB_LEMMA ING_FORM ING_LEMMA ING_TAG
HH3 : 11 force : 101 thinking : 146 think : 147 VVG :1239
K5D : 11 trick : 92 believing: 104 believe: 104 NN1-VVG: 158
CBG : 10 fool : 77 making : 62 make : 62 AJ0-VVG: 108
EUU : 10 talk : 62 giving : 54 give : 54 VDG : 49
HGM : 10 mislead: 57 accepting: 51 accept : 51 VBG : 23
HXE : 10 coerce : 52 doing : 49 do : 49 VHG : 15
(Other):1538 (Other):1159 (Other) :1134 (Other):1133 (Other): 8
BNC VERB_LEMMA ING_FORM ING_LEMMA ING_TAG
1 A06 force speaking speak VVG
2 A08 nudge being be VBG
3 A0C talk taking tak VVG
4 A0F bully taking take VVG
5 A0H influence trying try VVG
6 A0H delude thinking think VVG
But something’s missing here – pay attention when you do the next exercise!
11 Exercise 11
What
- is the quickest way of identifying the numbers of verb lemma types and -ing lemma types?
'data.frame': 1600 obs. of 5 variables:
$ BNC : Factor w/ 929 levels "A06","A08","A0C",..: 1 2 3 4 5 5 6 6 7 8 ...
$ VERB_LEMMA: Factor w/ 208 levels "activate","aggravate",..: 76 126 186 26 96 51 75 149 152 186 ...
$ ING_FORM : Factor w/ 422 levels "abandoning","abdicating",..: 354 49 382 382 395 387 387 133 209 175 ...
$ ING_LEMMA : Factor w/ 417 levels "abandon","abdicate",..: 349 41 378 378 390 383 383 378 207 173 ...
$ ING_TAG : Factor w/ 10 levels "AJ0-NN1","AJ0-VVG",..: 10 7 10 10 10 10 10 1 10 10 ...
[1] 208
[1] 208
[1] 417
[1] 416
Why are the results for ING_LEMMA conflicting?
[1] 416
[1] 416
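The mismatch is what one would expect if the misspelled lemma still exists as a factor level after its only occurrence was corrected: `nlevels()` counts all defined levels, used or not, while `length(unique())` counts only the values that actually occur. A small sketch with a toy factor (not the corpus data):

```r
# toy factor with the same kind of typo as in COV
ING_LEMMA <- factor(c("speak", "be", "tak", "take", "try", "think"))
ING_LEMMA[3] <- "take"       # correct the typo

nlevels(ING_LEMMA)           # 6: the now-unused level "tak" is still defined
length(unique(ING_LEMMA))    # 5: only values that actually occur

ING_LEMMA <- droplevels(ING_LEMMA)  # discard unused levels
nlevels(ING_LEMMA)           # 5: both counts now agree
```

Applied to the data frame, `COV <- droplevels(COV)` would remove unused levels from all factor columns at once.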
- is the most frequent verb lemma?
BNC VERB_LEMMA ING_FORM ING_LEMMA ING_TAG
HH3 : 11 force : 101 thinking : 146 think : 147 VVG :1239
K5D : 11 trick : 92 believing: 104 believe: 104 NN1-VVG: 158
CBG : 10 fool : 77 making : 62 make : 62 AJ0-VVG: 108
EUU : 10 talk : 62 giving : 54 give : 54 VDG : 49
HGM : 10 mislead: 57 accepting: 51 accept : 51 VBG : 23
HXE : 10 coerce : 52 doing : 49 do : 49 VHG : 15
(Other):1538 (Other):1159 (Other) :1134 (Other):1133 (Other): 8
force
101
# but this is the best because it returns exactly what was asked for:
names( # show the names of
table(COV$VERB_LEMMA))[ # the frequency table of VERB_LEMMA, but only those
which( # where
table(COV$VERB_LEMMA) == # the frequency table of VERB_LEMMA is
max(table(COV$VERB_LEMMA)) # the max of that table
) # end of which()
] # end of subset
[1] "force"
This is a good example to introduce the pipe (%>%) from the package magrittr that we loaded at the top. Here is a simpler example that uses the tail(1) approach:
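A sketch of what such a pipe might look like, using a toy vector rather than the corpus data: each `%>%` hands the result on its left to the function on its right as the first argument.

```r
library(magrittr)

# toy stand-in for COV$VERB_LEMMA
VERB_LEMMA <- c("force", "trick", "force", "fool", "force", "trick")

most.frequent <- VERB_LEMMA %>%  # take the verb lemmas,
   table %>%                     # tabulate them,
   sort %>%                      # sort the frequencies in ascending order,
   tail(1)                       # and keep the last (= highest) one
most.frequent
```

This is equivalent to the nested call `tail(sort(table(VERB_LEMMA)), 1)`, but reads left to right.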
… and here’s the one that would be able to handle situations where more than one verb lemma has the same highest frequency; check out this:
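One way to write this with the pipe, as a sketch on toy data: inside a magrittr pipe, `.` stands for whatever arrives from the left, so `.[. == max(.)]` keeps every table entry that equals the maximum.

```r
library(magrittr)

# toy stand-in for COV$VERB_LEMMA
VERB_LEMMA <- c("force", "trick", "force", "fool", "force", "trick")

top <- VERB_LEMMA %>%
   table %>%              # frequency table
   .[. == max(.)] %>%     # keep all entries with the highest frequency
   names                  # return their names only
top
```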
Here’s the proof that it indeed works in such situations:
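A small demonstration on toy data in which two verb lemmas tie for first place; the vector is invented for illustration and is not taken from the corpus.

```r
library(magrittr)

# toy data: "force" and "trick" both occur twice
VERB_LEMMA <- c("force", "trick", "force", "trick", "fool")

# tail(1) can only ever return one of the tied lemmas
one <- VERB_LEMMA %>% table %>% sort %>% tail(1) %>% names

# the max-based version returns all of them
both <- VERB_LEMMA %>% table %>% .[. == max(.)] %>% names

one
both
```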
- is the most frequent -ing lemma with this verb lemma?
[1] "make"
[1] "make"
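A sketch of the logic on toy data (the five rows below are invented for illustration): restrict ING_LEMMA to the rows where VERB_LEMMA is "force", tabulate it, and return the name or names with the highest count.

```r
# toy stand-in for COV, not the corpus data
COV <- data.frame(
   VERB_LEMMA = c("force", "force", "force", "trick", "trick"),
   ING_LEMMA  = c("make", "make", "accept", "believe", "think"),
   stringsAsFactors = TRUE)

# frequencies of the -ing lemmas that occur with "force"
ing.with.force <- table(COV$ING_LEMMA[COV$VERB_LEMMA == "force"])

# the lemma(s) with the highest frequency
names(ing.with.force)[ing.with.force == max(ing.with.force)]
```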
12 Exercise 12
Changing and saving COV:
- Delete the column with the corpus files; the new data frame is to be called COV.2.
- Delete the rows with the four rarest tags; the new data frame is to be called COV.3.
.
AJ0-NN1 CJS UNC NN1 VHG VBG
1 1 2 4 15 23
# step 2: determine the vector of deletees
deletees <- which(COV.2$ING_TAG # the deletees are where the value for ING_TAG
%in% # is a member of this set:
c("AJ0-NN1", "CJS", "UNC", "NN1"))
# much better than this:
# deletees <- which( # the deletees are where
# COV.2$ING_TAG=="AJ0-NN1" | # COV.2$ING_TAG is "AJ0-NN1" or where
# COV.2$ING_TAG=="CJS" | # COV.2$ING_TAG is "CJS" or where
# COV.2$ING_TAG=="UNC" | # COV.2$ING_TAG is "UNC" or where
# COV.2$ING_TAG=="NN1") # COV.2$ING_TAG is "NN1"
# step 3: delete
COV.3 <- COV.2[-deletees,]
- From COV.3, create a new data frame COV.4 which is sorted according to, first, the column VERB_LEMMA (ascending) and, second, the column ING_LEMMA (descending).
- Save the changed data frame into a text file _output/dataframe5.csv; use tab stops as separators, newlines as line breaks, and make sure there are no row numbers and no quotes.
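The column deletion, sorting, and saving steps can be sketched as follows on a four-row toy data frame (invented for illustration). Since the toy data contain no rare tags, COV.3 is simply set to COV.2 here, and a temporary file stands in for _output/dataframe5.csv. The descending sort of a factor uses `-xtfrm()`, which negates the factor's integer codes; this is one of several ways to do it.

```r
# toy stand-in for COV
COV <- data.frame(
   BNC        = c("A06", "A08", "A0C", "A0F"),
   VERB_LEMMA = c("talk", "force", "force", "bully"),
   ING_LEMMA  = c("take", "speak", "think", "take"),
   stringsAsFactors = TRUE)

COV.2 <- COV[, -1]  # drop the first column (BNC); or: COV[, names(COV) != "BNC"]
COV.3 <- COV.2      # placeholder: the rare-tag rows would be removed here

# sort by VERB_LEMMA ascending, then ING_LEMMA descending
COV.4 <- COV.3[order(COV.3$VERB_LEMMA, -xtfrm(COV.3$ING_LEMMA)), ]
COV.4

out <- tempfile(fileext = ".csv")  # stand-in for "_output/dataframe5.csv"
write.table(COV.4, file = out,
   sep = "\t",         # tab stops as separators
   eol = "\n",         # newlines as line breaks
   row.names = FALSE,  # no row numbers
   quote = FALSE)      # no quotes
```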
13 Session info
R version 4.5.3 (2026-03-11)
Platform: x86_64-pc-linux-gnu
Running under: Linux Mint 22.3
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] STGmisc_1.06 Rcpp_1.1.1-1 magrittr_2.0.5
loaded via a namespace (and not attached):
[1] digest_0.6.39 fastmap_1.2.0 xfun_0.57 knitr_1.51
[5] htmltools_0.5.9 rmarkdown_2.31 cli_3.6.6 rstudioapi_0.18.0
[9] tools_4.5.3 evaluate_1.0.5 yaml_2.3.12 otel_0.2.0
[13] htmlwidgets_1.6.4 rlang_1.2.0 jsonlite_2.0.0 MASS_7.3-65