Intro to descriptive statistics
1 Data frames & other data structures
Data frames are the most important data structure we will deal with directly; they correspond to ‘rectangular’ spreadsheets, i.e. tables where
- all rows have the same number of columns;
- all columns have the same number of rows.
We clear memory and load a very useful package:
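The package isn't named above; since the pipe %>% is used below, magrittr is a reasonable guess:
rm(list=ls(all=TRUE)) # clear memory
library(magrittr)     # load the package providing the pipe operator %>% (assumed here)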
1.1 Creating data frames
Creating data frames within R, which we will hardly ever do, can be done in two ways:
- we can first create each column that is supposed to go into the data frame and then we create a data frame by putting all those columns together into one structure;
- we create both the columns and the data frame together at the same time.
Here's an example of approach 1. Creating a data structure like a column involves what is called an assignment, which consists of
- something that is created (PARTOFSPEECH in line 1);
- the assignment operator (<-);
- what to put into the thing that is created; in line 1 here that is a data structure called a vector, of which we will use three types:
  - character vectors (as in lines 1-2);
  - numeric vectors (as in lines 3-4);
  - logical vectors (see Section 1.3.2 below).
All vectors can be created with the function c (for 'combine'?):
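The code chunk itself isn't reproduced above; a sketch that matches the data frame shown below (and the line numbers referred to in the text):
PARTOFSPEECH <- c("ADJ", "ADV", "N", "CONJ", "PREP")          # line 1: a character vector
CLASS        <- c("open", "open", "open", "closed", "closed") # line 2: a character vector
TOKENFREQS   <- c(421, 337, 1411, 458, 455)                   # line 3: a numeric vector
TYPEFREQS    <- c(271, 103, 735, 18, 37)                      # line 4: a numeric vector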
From those 4 vectors, we can then put together a data frame with the function data.frame. Note:
- if we do an assignment within parentheses, the assigned object is printed;
- we use an additional argument called stringsAsFactors=TRUE, which converts the character vectors in the data frame into another kind of data structure that is called a factor, which has some advantages for some statistical methods (and some disadvantages for data wrangling …):
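A sketch of that call (the name x for the resulting data frame is chosen here just for illustration):
(x <- data.frame(PARTOFSPEECH, CLASS, TOKENFREQS, TYPEFREQS, # combine the 4 vectors &
   stringsAsFactors=TRUE))                                   # make the character vectors factors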
PARTOFSPEECH CLASS TOKENFREQS TYPEFREQS
1 ADJ open 421 271
2 ADV open 337 103
3 N open 1411 735
4 CONJ closed 458 18
5 PREP closed 455 37
If the data frame you want to create consists of all combinations of several variables, you can use expand.grid, something we will likely use next week:
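A minimal sketch of how expand.grid works (the variables and levels here are made up for illustration):
expand.grid(                 # all combinations of
   FAM=c("lo", "med", "hi"), # these levels &
   IMA=c("lo", "hi"))        # these levels: 3*2=6 rows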
1.2 Loading data frames
Nearly always, we will load those from some spreadsheet software such as LibreOffice Calc. For example, here is a data frame that was created in a spreadsheet software like LibreOffice Calc and then saved from it as a .csv (comma-separated values) file, but actually using a tab stop between columns. Such files can be opened with any text editor, any spreadsheet software, or any statistics software. The most versatile way to load a data frame is the function read.table, which can take a lot of arguments:
- file: a character vector with the path to a file (either one you typed like "_input/reactiontimes.csv" or an interactively chosen one with file.choose());
- header: a logical vector that states whether the data frame's first row are the column headings; this should nearly always be TRUE;
- sep: a character vector that says what's between columns; this should nearly always be "\t";
- stringsAsFactors: a logical vector that states whether columns with character vectors should be made factors; this should nearly always be TRUE;
- quote: a character vector that tells R what characters indicate quotes; this should nearly always be "";
- comment.char: a character vector that tells R what characters indicate comments; this should nearly always be "".
Here you see it in practice:
rm(list=ls(all=TRUE)) # clear memory
rts <- read.table( # make rts the result of loading
"_input/reactiontimes.csv", # data from this file path
header=TRUE, # the data frame's first row are the column headings
sep="\t", # the columns are separated by tab stops
stringsAsFactors=TRUE, # make the columns that contain strings factors
quote="", # don't interpret any quote characters as special characters
comment.char="") # don't interpret pound signs as preceding comments
# note: using a German operating system, you may have to add the argument dec="."
head(rts) # just show the first six rows
CASE RT FREQ FAM IMA MEAN
1 almond 650.9947 2 <NA> <NA> NA
2 ant 589.4347 7 med hi 415
3 apple 523.0493 10 hi hi 451
4 apricot 642.3342 2 lo lo NA
5 asparagus 696.2092 2 med lo 442
6 avocado 621.6382 12 <NA> <NA> NA
Here's what the columns are/mean. Note: a cell containing NA or <NA> means 'there is no data available for that cell'.
My recommendation: always make your data frame ‘simple’ such that when you prepare your data frames for statistical analysis in some spreadsheet software,
- you should always use the case-by-variable / long format:
- one row for each measurement of the dependent variable;
- one column for each variable (independent, dependent, confound, moderators, …);
- there should be no empty cells (use NA for those);
- use all caps for the column headers and all small letters for factor levels;
- avoid spaces and maybe even periods in variable names and levels.
If your data frame looks like that, you can make reading it in much simpler using read.delim:
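For instance, the reaction time data could be loaded like this (a sketch, consistent with how the same file is read further below):
rts <- read.delim(             # read a tab-separated file with headers
   "_input/reactiontimes.csv", # from this path
   stringsAsFactors=TRUE)      # & make string columns factors
head(rts)                      # show the first six rows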
CASE RT FREQ FAM IMA MEAN
1 almond 650.9947 2 <NA> <NA> NA
2 ant 589.4347 7 med hi 415
3 apple 523.0493 10 hi hi 451
4 apricot 642.3342 2 lo lo NA
5 asparagus 696.2092 2 med lo 442
6 avocado 621.6382 12 <NA> <NA> NA
Here’re two functions that tell you a lot about a data frame:
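Presumably these are summary and str; a sketch of the calls behind the output below:
summary(rts) # per-column summary statistics
str(rts)     # the structure of the data frame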
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :12 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :31 NA's:24
apricot : 1 Mean :631.9 Mean : 9.26 NA's:22
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00
avocado : 1 Max. :794.5 Max. :118.00
(Other) :71
MEAN
Min. :315.0
1st Qu.:409.8
Median :437.5
Mean :436.0
3rd Qu.:466.2
Max. :553.0
NA's :29
'data.frame': 77 obs. of 6 variables:
$ CASE: Factor w/ 77 levels "almond","ant",..: 1 2 3 4 5 6 7 8 9 10 ...
$ RT : num 651 589 523 642 696 ...
$ FREQ: int 2 7 10 2 2 12 5 19 4 12 ...
$ FAM : Factor w/ 3 levels "hi","lo","med": NA 3 1 2 3 NA 1 3 3 1 ...
$ IMA : Factor w/ 2 levels "hi","lo": NA 1 1 2 2 NA 1 2 1 1 ...
$ MEAN: int NA 415 451 NA 442 NA 449 412 412 441 ...
And this is how you get row and column numbers and names (if defined):
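A sketch of the calls behind the output below:
nrow(rts)     # the number of rows
rownames(rts) # the row names
ncol(rts)     # the number of columns
colnames(rts) # the column names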
[1] 77
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
[61] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74" "75"
[76] "76" "77"
[1] 6
[1] "CASE" "RT" "FREQ" "FAM" "IMA" "MEAN"
Some quick exercise(s): Load the data in _input/assignments.csv into a data frame ass and determine how many rows and columns it has:
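One possible solution (a sketch):
ass <- read.delim("_input/assignments.csv", stringsAsFactors=TRUE) # load the data
str(ass)  # check the structure
nrow(ass) # the number of rows
ncol(ass) # the number of columns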
'data.frame': 300 obs. of 6 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ ASSIGNMENT: Factor w/ 3 levels "lab_report","oral_exam",..: 2 2 2 2 2 2 2 2 2 2 ...
$ SEX : Factor w/ 2 levels "female","male": 1 1 1 2 1 2 2 1 1 1 ...
$ REGION : Factor w/ 3 levels "central_europ",..: 1 1 1 1 1 3 1 2 1 1 ...
$ WORKHOURS : num 20.1 21.3 13.5 19.5 21.8 21.7 15.5 15.9 21.5 13.1 ...
$ MISTAKES : int 18 18 13 12 12 11 11 11 11 10 ...
[1] 300
[1] 6
1.3 Accessing parts of data frames
1.3.1 Picking things yourself
You can look at the top 6 or bottom 6 rows (or other numbers) of a data frame with head and tail:
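A sketch of the calls behind the output below:
head(rts)    # the first 6 rows
head(rts, 3) # the first 3 rows
tail(rts)    # the last 6 rows
tail(rts, 3) # the last 3 rows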
CASE RT FREQ FAM IMA MEAN
1 almond 650.9947 2 <NA> <NA> NA
2 ant 589.4347 7 med hi 415
3 apple 523.0493 10 hi hi 451
4 apricot 642.3342 2 lo lo NA
5 asparagus 696.2092 2 med lo 442
6 avocado 621.6382 12 <NA> <NA> NA
CASE RT FREQ FAM IMA MEAN
1 almond 650.9947 2 <NA> <NA> NA
2 ant 589.4347 7 med hi 415
3 apple 523.0493 10 hi hi 451
CASE RT FREQ FAM IMA MEAN
72 tomato 611.0240 5 hi hi 489
73 tortoise 733.0323 4 lo lo 403
74 walnut 663.5908 12 med lo 468
75 wasp 725.7056 3 <NA> <NA> NA
76 whale 609.9745 1 med hi 474
77 woodpecker 686.3439 2 <NA> <NA> NA
CASE RT FREQ FAM IMA MEAN
75 wasp 725.7056 3 <NA> <NA> NA
76 whale 609.9745 1 med hi 474
77 woodpecker 686.3439 2 <NA> <NA> NA
You can access specific cells using a coordinate system of the type dataframe[rows, columns]:
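A sketch of calls consistent with the output below:
rts[1, 2]            # row 1, column 2 (almond's RT)
rts[1, 2:3]          # row 1, columns 2 & 3
rts[1:2, c(1, 2, 4)] # rows 1 & 2, columns 1, 2, & 4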
[1] 650.9947
RT FREQ
1 650.9947 2
CASE RT FAM
1 almond 650.9947 <NA>
2 ant 589.4347 med
If you specify no columns or no rows, you get all of them:
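For example:
rts[2, ] # row 2, all columns
rts[, 2] # all rows, column 2 (RT)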
CASE RT FREQ FAM IMA MEAN
2 ant 589.4347 7 med hi 415
[1] 650.9947 589.4347 523.0493 642.3342 696.2092 621.6382 601.8914 589.6712
[9] 661.5156 567.0827 751.0928 638.0350 597.6108 672.3081 635.4823 636.0095
[17] 620.8766 567.6625 614.0209 588.0462 667.2140 699.0151 617.7367 583.6823
[25] 618.1155 581.6936 596.3469 616.3894 655.2160 614.8879 592.9443 776.6582
[33] 559.7334 576.6969 568.4144 660.0190 773.9089 624.4767 600.9778 682.1382
[41] 549.3614 634.6535 588.3237 745.4629 596.4692 655.4345 593.6087 620.0387
[49] 604.3841 602.2479 633.9500 578.9176 544.8413 637.3747 745.1318 571.8937
[57] 579.4982 559.0134 636.7750 670.7467 678.3105 754.3662 585.1791 556.1345
[65] 609.8235 562.3294 672.0730 711.7317 794.4522 612.7960 606.9915 611.0240
[73] 733.0323 663.5908 725.7056 609.9745 686.3439
Typically, though, we access columns by their names:
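For example (both calls return the same thing):
rts$CASE      # the column CASE, by name
rts[, "CASE"] # the same thing, with bracket notation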
[1] almond ant apple apricot asparagus avocado
[7] banana bat beaver bee beetroot blackberry
[13] broccoli bunny butterfly camel carrot cat
[19] cherry chicken clove crocodile cucumber dog
[25] dolphin donkey eagle eggplant elephant fox
[31] frog gherkin goat goose grape gull
[37] hedgehog horse kiwi leek lemon lettuce
[43] lion magpie melon mole monkey moose
[49] mouse mushroom mustard olive orange owl
[55] paprika peanut pear pig pineapple potato
[61] radish reindeer shark sheep snake spider
[67] squid squirrel stork strawberry swan tomato
[73] tortoise walnut wasp whale woodpecker
77 Levels: almond ant apple apricot asparagus avocado banana bat beaver ... woodpecker
[1] almond ant apple apricot asparagus avocado
[7] banana bat beaver bee beetroot blackberry
[13] broccoli bunny butterfly camel carrot cat
[19] cherry chicken clove crocodile cucumber dog
[25] dolphin donkey eagle eggplant elephant fox
[31] frog gherkin goat goose grape gull
[37] hedgehog horse kiwi leek lemon lettuce
[43] lion magpie melon mole monkey moose
[49] mouse mushroom mustard olive orange owl
[55] paprika peanut pear pig pineapple potato
[61] radish reindeer shark sheep snake spider
[67] squid squirrel stork strawberry swan tomato
[73] tortoise walnut wasp whale woodpecker
77 Levels: almond ant apple apricot asparagus avocado banana bat beaver ... woodpecker
1.3.2 Having R pick things
Often, we want R to do the leg work and find things for us. We can do that in two ways, both of which involve what is called logical expressions, i.e. expressions that return logical vectors containing only either TRUEs or FALSEs. Here's way 1: we first use a logical expression to create a logical vector (note the ==, two equals signs) …
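A sketch of that step (the vector name freqIS3 is chosen here for illustration):
rts$FREQ==3              # TRUE wherever FREQ is 3, FALSE elsewhere
(freqIS3 <- rts$FREQ==3) # the same expression, now also assigned (& printed)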
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE TRUE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE TRUE FALSE FALSE
… and then use that for what is called subsetting:
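For example (both of these return the same rows):
rts[freqIS3, ]     # the rows for which freqIS3 is TRUE, all columns
rts[rts$FREQ==3, ] # the same, with the logical expression used directly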
CASE RT FREQ FAM IMA MEAN
15 butterfly 635.4823 3 med hi 424
50 mushroom 602.2479 3 <NA> <NA> NA
54 owl 637.3747 3 med lo 397
55 paprika 745.1318 3 <NA> <NA> NA
66 spider 562.3294 3 med lo 392
75 wasp 725.7056 3 <NA> <NA> NA
CASE RT FREQ FAM IMA MEAN
15 butterfly 635.4823 3 med hi 424
50 mushroom 602.2479 3 <NA> <NA> NA
54 owl 637.3747 3 med lo 397
55 paprika 745.1318 3 <NA> <NA> NA
66 spider 562.3294 3 med lo 392
75 wasp 725.7056 3 <NA> <NA> NA
Way 2 is that we again create a logical vector first, but then also use the function which to get a numeric vector that says where in the logical vector the TRUEs are. Important: For reasons that are not relevant right now, this numeric vector way of doing this is the better way!
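A sketch of that step (again with an illustrative name):
rts$FREQ==3                     # the logical vector again ...
(freqIS3 <- which(rts$FREQ==3)) # ... & the positions of its TRUEs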
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE TRUE FALSE FALSE
[1] 15 50 54 55 66 75
Thus, we can do this:
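A sketch:
freqIS3        # the positions of the matches ...
rts[freqIS3, ] # ... used for subsetting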
[1] 15 50 54 55 66 75
CASE RT FREQ FAM IMA MEAN
15 butterfly 635.4823 3 med hi 424
50 mushroom 602.2479 3 <NA> <NA> NA
54 owl 637.3747 3 med lo 397
55 paprika 745.1318 3 <NA> <NA> NA
66 spider 562.3294 3 med lo 392
75 wasp 725.7056 3 <NA> <NA> NA
We can use the usual mathematical operators for this kind of expression: ==, !=, >, and <. The first two also apply to character vectors and factors:
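For example:
rts[which(rts$FAM=="hi"), ] # all rows where FAM is "hi"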
CASE RT FREQ FAM IMA MEAN
3 apple 523.0493 10 hi hi 451
7 banana 601.8914 5 hi hi 449
10 bee 567.0827 12 hi hi 441
18 cat 567.6625 24 hi hi 487
20 chicken 588.0462 38 hi hi 462
24 dog 583.6823 76 hi hi 553
38 horse 624.4767 118 hi hi 471
42 lettuce 634.6535 1 hi hi 423
53 orange 544.8413 24 hi hi 467
57 pear 579.4982 7 hi lo 494
60 potato 670.7467 16 hi hi 484
72 tomato 611.0240 5 hi hi 489
Finally, we can ask for multiple conditions at the same time in two ways. One involves using & for 'and' and | for 'or' and is usually applied to multiple different variables:
rts[ # show rts, but only
which(rts$FAM=="hi" & rts$IMA=="hi") # rows meeting these conditions
,] # & then all columns
CASE RT FREQ FAM IMA MEAN
3 apple 523.0493 10 hi hi 451
7 banana 601.8914 5 hi hi 449
10 bee 567.0827 12 hi hi 441
18 cat 567.6625 24 hi hi 487
20 chicken 588.0462 38 hi hi 462
24 dog 583.6823 76 hi hi 553
38 horse 624.4767 118 hi hi 471
42 lettuce 634.6535 1 hi hi 423
53 orange 544.8413 24 hi hi 467
60 potato 670.7467 16 hi hi 484
72 tomato 611.0240 5 hi hi 489
rts[ # show rts, but only
which(rts$FAM=="hi" | rts$FREQ>20) # rows meeting these conditions
,] # & then all columns
CASE RT FREQ FAM IMA MEAN
3 apple 523.0493 10 hi hi 451
7 banana 601.8914 5 hi hi 449
10 bee 567.0827 12 hi hi 441
18 cat 567.6625 24 hi hi 487
20 chicken 588.0462 38 hi hi 462
24 dog 583.6823 76 hi hi 553
38 horse 624.4767 118 hi hi 471
42 lettuce 634.6535 1 hi hi 423
51 mustard 633.9500 21 med lo 388
53 orange 544.8413 24 hi hi 467
57 pear 579.4982 7 hi lo 494
60 potato 670.7467 16 hi hi 484
64 sheep 556.1345 24 med lo 444
65 snake 609.8235 45 med hi 458
72 tomato 611.0240 5 hi hi 489
The other way applies only to one variable at a time and involves the operator %in%:
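A sketch (the name famHIorLO is chosen here for illustration):
(famHIorLO <- which(rts$FAM %in% c("hi", "lo"))) # positions where FAM is "hi" or "lo"
rts[famHIorLO, ]                                 # the corresponding rows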
[1] 3 4 7 10 16 18 20 21 22 24 27 29 38 40 42 45 52 53 57 60 61 69 72 73
CASE RT FREQ FAM IMA MEAN
3 apple 523.0493 10 hi hi 451
4 apricot 642.3342 2 lo lo NA
7 banana 601.8914 5 hi hi 449
10 bee 567.0827 12 hi hi 441
16 camel 636.0095 2 lo lo 372
18 cat 567.6625 24 hi hi 487
20 chicken 588.0462 38 hi hi 462
21 clove 667.2140 2 lo lo 315
22 crocodile 699.0151 2 lo hi 387
24 dog 583.6823 76 hi hi 553
27 eagle 596.3469 6 lo hi 457
29 elephant 655.2160 8 lo hi 440
38 horse 624.4767 118 hi hi 471
40 leek 682.1382 1 lo lo NA
42 lettuce 634.6535 1 hi hi 423
45 melon 596.4692 2 lo <NA> NA
52 olive 578.9176 6 lo lo 351
53 orange 544.8413 24 hi hi 467
57 pear 579.4982 7 hi lo 494
60 potato 670.7467 16 hi hi 484
61 radish 678.3105 9 lo <NA> NA
69 stork 794.4522 1 lo lo NA
72 tomato 611.0240 5 hi hi 489
73 tortoise 733.0323 4 lo lo 403
# in this specific example, there is also this shorter option:
(famnotmed <- which( # which
rts$FAM!="med")) # values of rts$FAM are NOT "med"
[1] 3 4 7 10 16 18 20 21 22 24 27 29 38 40 42 45 52 53 57 60 61 69 72 73
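The corresponding rows are then presumably retrieved with:
rts[famnotmed, ] # the rows whose FAM value is not "med"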
CASE RT FREQ FAM IMA MEAN
3 apple 523.0493 10 hi hi 451
4 apricot 642.3342 2 lo lo NA
7 banana 601.8914 5 hi hi 449
10 bee 567.0827 12 hi hi 441
16 camel 636.0095 2 lo lo 372
18 cat 567.6625 24 hi hi 487
20 chicken 588.0462 38 hi hi 462
21 clove 667.2140 2 lo lo 315
22 crocodile 699.0151 2 lo hi 387
24 dog 583.6823 76 hi hi 553
27 eagle 596.3469 6 lo hi 457
29 elephant 655.2160 8 lo hi 440
38 horse 624.4767 118 hi hi 471
40 leek 682.1382 1 lo lo NA
42 lettuce 634.6535 1 hi hi 423
45 melon 596.4692 2 lo <NA> NA
52 olive 578.9176 6 lo lo 351
53 orange 544.8413 24 hi hi 467
57 pear 579.4982 7 hi lo 494
60 potato 670.7467 16 hi hi 484
61 radish 678.3105 9 lo <NA> NA
69 stork 794.4522 1 lo lo NA
72 tomato 611.0240 5 hi hi 489
73 tortoise 733.0323 4 lo lo 403
Some quick exercise(s): Which words have an RT of less than 560?
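One possible solution (a sketch):
rts[which(rts$RT<560), ]       # the full rows ...
rts$CASE[which(rts$RT<560)]    # ... or just the words
rts[which(rts$RT<560), "CASE"] # (the same words, with bracket notation)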
CASE RT FREQ FAM IMA MEAN
3 apple 523.0493 10 hi hi 451
33 goat 559.7334 7 med lo 402
41 lemon 549.3614 19 med hi 470
53 orange 544.8413 24 hi hi 467
58 pig 559.0134 9 med hi NA
64 sheep 556.1345 24 med lo 444
[1] apple goat lemon orange pig sheep
77 Levels: almond ant apple apricot asparagus avocado banana bat beaver ... woodpecker
[1] apple goat lemon orange pig sheep
77 Levels: almond ant apple apricot asparagus avocado banana bat beaver ... woodpecker
Use the function sort and return the sorted reaction times for when both FAM is lo and IMA is hi:
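One possible solution (a sketch; no output is shown for this one):
sort(rts$RT[which(rts$FAM=="lo" & rts$IMA=="hi")]) # sorted RTs where FAM is "lo" & IMA is "hi"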
Use the function mean to compute the mean reaction time for all words with a frequency in the interval [5, 10]:
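Two possible solutions (sketches), both returning the value shown below:
mean(rts$RT[which(rts$FREQ>=5 & rts$FREQ<=10)]) # one way ...
mean(rts$RT[which(rts$FREQ %in% 5:10)])         # ... & another (FREQ contains integers)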
[1] 598.8744
[1] 598.8744
1.4 Changing values in data frames
1.4.1 Adding columns
You can add a new column to a data frame just by specifying that new column. Here’s an example where we add a logical vector based on information that’s already in the data frame:
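A sketch of such an addition; the threshold used in the original isn't stated, but a cut-off of 10 is consistent with the output below:
rts$FREQHIGH <- rts$FREQ>=10 # TRUE for the more frequent words (assumed cut-off)
head(rts)                    # check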
CASE RT FREQ FAM IMA MEAN FREQHIGH
1 almond 650.9947 2 <NA> <NA> NA FALSE
2 ant 589.4347 7 med hi 415 FALSE
3 apple 523.0493 10 hi hi 451 TRUE
4 apricot 642.3342 2 lo lo NA FALSE
5 asparagus 696.2092 2 med lo 442 FALSE
6 avocado 621.6382 12 <NA> <NA> NA TRUE
But you can add any kind of information. Here, we're adding as many random numbers between -10 and +10 to the data frame as rts has rows:
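A sketch consistent with the output below (the random values will of course differ on every run):
rts$SOMECRAP <- runif(nrow(rts), -10, 10) # as many random numbers from [-10, 10] as rts has rows
head(rts)                                 # check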
CASE RT FREQ FAM IMA MEAN FREQHIGH SOMECRAP
1 almond 650.9947 2 <NA> <NA> NA FALSE -0.5993406
2 ant 589.4347 7 med hi 415 FALSE -0.4795214
3 apple 523.0493 10 hi hi 451 TRUE -6.7335936
4 apricot 642.3342 2 lo lo NA FALSE -9.2649223
5 asparagus 696.2092 2 med lo 442 FALSE -2.8901470
6 avocado 621.6382 12 <NA> <NA> NA TRUE 2.1543869
But let’s get rid of these two columns again with negative subsetting:
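For example:
rts <- rts[, -c(7, 8)] # drop columns 7 & 8 (FREQHIGH & SOMECRAP)
head(rts)              # check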
CASE RT FREQ FAM IMA MEAN
1 almond 650.9947 2 <NA> <NA> NA
2 ant 589.4347 7 med hi 415
3 apple 523.0493 10 hi hi 451
4 apricot 642.3342 2 lo lo NA
5 asparagus 696.2092 2 med lo 442
6 avocado 621.6382 12 <NA> <NA> NA
And here we add a factor to the data frame that groups the values of MEAN not just into hi and lo, but into several user-defined groups:
rts$MEANBIN <- cut(rts$MEAN, # make rts$MEANBIN the cut-up version of rts$MEAN
breaks=c(300, 350, 400, 450, 600), # cut up into 4 groups with these 5 boundaries ...
# note: boundary 1 is < the min of MEAN, boundary 5 is > the max of MEAN
labels=c("lo", "med", "hi", "superhi")) # ... and with these names
rts[15:25,]
CASE RT FREQ FAM IMA MEAN MEANBIN
15 butterfly 635.4823 3 med hi 424 hi
16 camel 636.0095 2 lo lo 372 med
17 carrot 620.8766 2 med lo 425 hi
18 cat 567.6625 24 hi hi 487 superhi
19 cherry 614.0209 7 med lo 497 superhi
20 chicken 588.0462 38 hi hi 462 superhi
21 clove 667.2140 2 lo lo 315 lo
22 crocodile 699.0151 2 lo hi 387 med
23 cucumber 617.7367 1 med hi 428 hi
24 dog 583.6823 76 hi hi 553 superhi
25 dolphin 618.1155 2 <NA> <NA> NA <NA>
If you don't provide labels for the factor levels, check out the Wikipedia entry on interval notation:
- () = 'excludes the boundary';
- [] = 'includes the boundary'.
1.4.2 Changing cell values
Changing cell values is easy/obvious for numeric variables (numeric vectors) (and also for logical vectors):
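A sketch of the steps behind the output below:
head(rts, 3)        # check the current values
rts$FREQ[1] <- 1000 # change almond's FREQ to 1000
head(rts, 3)        # check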
CASE RT FREQ FAM IMA MEAN MEANBIN
1 almond 650.9947 2 <NA> <NA> NA <NA>
2 ant 589.4347 7 med hi 415 hi
3 apple 523.0493 10 hi hi 451 superhi
CASE RT FREQ FAM IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA> NA <NA>
2 ant 589.4347 7 med hi 415 hi
3 apple 523.0493 10 hi hi 451 superhi
It’s also easy/obvious for categorical variables (character vectors or factors) when you change something to an existing level:
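A sketch of the steps behind the output below:
head(rts, 3)       # check
rts$FAM[2] <- "hi" # change ant's FAM from "med" to the existing level "hi"
head(rts, 3)       # check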
CASE RT FREQ FAM IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA> NA <NA>
2 ant 589.4347 7 med hi 415 hi
3 apple 523.0493 10 hi hi 451 superhi
CASE RT FREQ FAM IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA> NA <NA>
2 ant 589.4347 7 hi hi 415 hi
3 apple 523.0493 10 hi hi 451 superhi
Where it’s less easy/obvious is for categorical variables that are factors when you change something to a new level:
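A sketch of the steps behind the output below:
summary(rts)            # check the data frame
rts$FAM[2] <- "superlo" # try to assign a level FAM doesn't have: R warns & inserts NA
head(rts, 3)            # check: ant's FAM is now NA
summary(rts)            # check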
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :13 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :30 NA's:24
apricot : 1 Mean :631.9 Mean : 22.22 NA's:22
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00
avocado : 1 Max. :794.5 Max. :1000.00
(Other) :71
MEAN MEANBIN
Min. :315.0 lo : 1
1st Qu.:409.8 med : 8
Median :437.5 hi :21
Mean :436.0 superhi:18
3rd Qu.:466.2 NA's :29
Max. :553.0
NA's :29
CASE RT FREQ FAM IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA> NA <NA>
2 ant 589.4347 7 <NA> hi 415 hi
3 apple 523.0493 10 hi hi 451 superhi
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :12 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :30 NA's:24
apricot : 1 Mean :631.9 Mean : 22.22 NA's:23
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00
avocado : 1 Max. :794.5 Max. :1000.00
(Other) :71
MEAN MEANBIN
Min. :315.0 lo : 1
1st Qu.:409.8 med : 8
Median :437.5 hi :21
Mean :436.0 superhi:18
3rd Qu.:466.2 NA's :29
Max. :553.0
NA's :29
The solution involves two steps:
- We let R know that the factor should allow that new level, too:
levels(rts$FAM) <- c( # make the levels of rts$FAM be
levels(rts$FAM), # the old levels of rts$FAM,
"superlo") # but now also "superlo"
summary(rts) # check
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :12 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :30 NA's:24
apricot : 1 Mean :631.9 Mean : 22.22 superlo: 0
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00 NA's :23
avocado : 1 Max. :794.5 Max. :1000.00
(Other) :71
MEAN MEANBIN
Min. :315.0 lo : 1
1st Qu.:409.8 med : 8
Median :437.5 hi :21
Mean :436.0 superhi:18
3rd Qu.:466.2 NA's :29
Max. :553.0
NA's :29
- Now we can assign the new level:
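A sketch:
rts$FAM[2] <- "superlo" # now the assignment works
head(rts, 3)            # check
summary(rts)            # check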
CASE RT FREQ FAM IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA> NA <NA>
2 ant 589.4347 7 superlo hi 415 hi
3 apple 523.0493 10 hi hi 451 superhi
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :12 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :30 NA's:24
apricot : 1 Mean :631.9 Mean : 22.22 superlo: 1
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00 NA's :22
avocado : 1 Max. :794.5 Max. :1000.00
(Other) :71
MEAN MEANBIN
Min. :315.0 lo : 1
1st Qu.:409.8 med : 8
Median :437.5 hi :21
Mean :436.0 superhi:18
3rd Qu.:466.2 NA's :29
Max. :553.0
NA's :29
But note what can happen (e.g., when we had introduced the new level or when we now change things back):
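A sketch of the steps behind the output below:
rts$FAM[2] <- "med" # change ant's FAM back to "med"
head(rts, 3)        # check
summary(rts)        # check: "superlo" is still a level, now with 0 occurrences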
CASE RT FREQ FAM IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA> NA <NA>
2 ant 589.4347 7 med hi 415 hi
3 apple 523.0493 10 hi hi 451 superhi
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :12 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :31 NA's:24
apricot : 1 Mean :631.9 Mean : 22.22 superlo: 0
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00 NA's :22
avocado : 1 Max. :794.5 Max. :1000.00
(Other) :71
MEAN MEANBIN
Min. :315.0 lo : 1
1st Qu.:409.8 med : 8
Median :437.5 hi :21
Mean :436.0 superhi:18
3rd Qu.:466.2 NA's :29
Max. :553.0
NA's :29
To avoid levels with no occurrences, it's safest to always apply droplevels to a data frame before you start your statistical exploration:
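For example:
rts <- droplevels(rts) # drop all unused factor levels
summary(rts)           # check: "superlo" is gone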
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :12 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :31 NA's:24
apricot : 1 Mean :631.9 Mean : 22.22 NA's:22
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00
avocado : 1 Max. :794.5 Max. :1000.00
(Other) :71
MEAN MEANBIN
Min. :315.0 lo : 1
1st Qu.:409.8 med : 8
Median :437.5 hi :21
Mean :436.0 superhi:18
3rd Qu.:466.2 NA's :29
Max. :553.0
NA's :29
Some quick exercise(s): Create a data frame qwe that contains only the rows of rts where MEAN < 400.
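One possible solution (a sketch):
(qwe <- rts[which(rts$MEAN<400), ]) # only the rows of rts where MEAN is less than 400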
CASE RT FREQ FAM IMA MEAN MEANBIN
16 camel 636.0095 2 lo lo 372 med
21 clove 667.2140 2 lo lo 315 lo
22 crocodile 699.0151 2 lo hi 387 med
46 mole 655.4345 5 med lo 372 med
51 mustard 633.9500 21 med lo 388 med
52 olive 578.9176 6 lo lo 351 med
54 owl 637.3747 3 med lo 397 med
66 spider 562.3294 3 med lo 392 med
68 squirrel 711.7317 2 med hi 397 med
What is the highest FREQ value when the RT is <600?
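One possible solution (a sketch; no output is shown for this one):
max(rts$FREQ[which(rts$RT<600)]) # the highest FREQ among words with an RT below 600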
How many FAM values are med when IMA is lo?
famISmed <- which(rts$FAM=="med")       # positions where FAM is "med"
imaISlo <- which(rts$IMA=="lo")         # positions where IMA is "lo"
both <- famISmed[famISmed %in% imaISlo] # the positions that are in both
nrow(rts[both,])                        # how many rows is that?
[1] 15
2 Statistical exploration: 1 variable at a time
Let’s load a different data set and here’s what the columns are/mean:
rm(list=ls(all=TRUE)) # clear memory
clo <- read.delim("_input/clauseorders.csv", stringsAsFactors=TRUE)
head(clo) # look at the first 6 rows of the data frame
CASE ORDER SUBORDTYPE CONJ LENMC LENSC LENDIFF MORETHAN2CL
1 4777 sc-mc temp als/when 4 10 -6 no
2 1698 mc-sc temp als/when 7 6 1 no
3 953 sc-mc temp als/when 12 7 5 yes
4 1681 mc-sc temp als/when 6 15 -9 no
5 4055 sc-mc temp als/when 9 5 4 yes
6 967 sc-mc temp als/when 9 5 4 yes
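The summary below was presumably produced with:
summary(clo) # summarize every column of the data frame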
CASE ORDER SUBORDTYPE CONJ LENMC
Min. : 2 mc-sc:275 caus:199 als/when : 93 Min. : 2.000
1st Qu.:1655 sc-mc:128 temp:204 bevor/before : 46 1st Qu.: 6.000
Median :2756 nachdem/after: 65 Median : 8.000
Mean :2764 weil/because :199 Mean : 9.266
3rd Qu.:4106 3rd Qu.:11.000
Max. :5053 Max. :31.000
LENSC LENDIFF MORETHAN2CL
Min. : 2.000 Min. :-32.00000 no :185
1st Qu.: 5.000 1st Qu.: -4.00000 yes:218
Median : 8.000 Median : 0.00000
Mean : 9.362 Mean : -0.09677
3rd Qu.:12.000 3rd Qu.: 3.00000
Max. :36.000 Max. : 25.00000
When it comes to deciding what statistical method to use, even just for descriptive statistics, the most important first decision to make is always "what's the dependent/response variable and what is its information level?" Then you can use this graph together with this document to guide you.
The following sections cover this ‘flow chart’.
2.1 Nominal/categorical data
For categorical data, it is usually best to begin with the descriptive numeric statistics and then visualize those with plots.
2.1.1 Descriptive stats
We can count levels and provide their proportions/percentages with table and prop.table(table(...)):
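A sketch of the calls behind the output below:
table(clo$CONJ)             # the frequencies of the levels of CONJ
prop.table(table(clo$CONJ)) # their proportions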
als/when bevor/before nachdem/after weil/because
93 46 65 199
als/when bevor/before nachdem/after weil/because
0.2307692 0.1141439 0.1612903 0.4937965
You should only provide a useful number of decimals, which for frequencies in the range of hundreds like here is probably 3-4 decimals:
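For example:
round(prop.table(table(clo$CONJ)), 3) # the proportions, rounded to 3 decimals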
als/when bevor/before nachdem/after weil/because
0.231 0.114 0.161 0.494
The measure of central tendency for categorical data is the mode, the most frequent level:
Can you imagine why I don’t just write this?
One measure of dispersion for categorical data is normalized entropy, for which no function is available in base R, which is why we write it ourselves:
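The original implementation isn't reproduced above; a minimal sketch of such a function (the name norm.entropy is chosen here for illustration):
norm.entropy <- function(some.table) {          # takes a table/vector of frequencies
   ps <- prop.table(some.table)                 # convert the frequencies to proportions
   ps <- ps[ps>0]                               # drop zeros (0*log2(0) is defined as 0)
   -sum(ps*log2(ps)) / log2(length(some.table)) # entropy, divided by its possible maximum
}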
Now we can use this function:
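For example (a sketch, using the function name assumed above):
norm.entropy(table(clo$CONJ)) # the normalized entropy of the distribution of CONJ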
This is how normalized entropy behaves with extreme distributions:
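A sketch of such a demonstration:
norm.entropy(c(25, 25, 25, 25)) # a perfectly even distribution: returns 1
norm.entropy(c(100, 0, 0, 0))   # a maximally uneven distribution: returns 0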
2.1.2 Visualization
The best visualization really would be dot charts:
dotchart( # draw a dot chart
table(clo$CONJ), # of the frequencies of the levels of CONJ
xlim=c(0, 403), # with these x-axis limits (min, max)
pch=16) # using filled circles
abline( # draw a straight line
v=403/4, # vertical, at x=403/4 (even distribution)
lty=3) # make it dotted
The function dotchart is a bit annoying because of the somewhat unintuitive ordering of the columns of the table: down-to-up. We might therefore want to reverse the order of the levels:
dotchart( # draw a dot chart
table(clo$CONJ)[4:1], # of the frequencies of the levels of CONJ, now reversed
xlim=c(0, 403), # with these x-axis limits (min, max)
pch=16) # using filled circles
abline( # draw a straight line
v=403/4, # vertical, at x=403/4 (even distribution)
lty=3) # make it dotted
A slightly less good alternative is a bar chart:
qwe <- barplot( # create a bar chart (& save the mids of the bars into qwe)
table(clo$ORDER), # of this table
ylim=c(0, 300)) # with these y-axis limits
text(qwe, # plot text at these x-coordinates
table(clo$ORDER)/2, # and these y-coordinates
table(clo$ORDER)) # namely this text (the frequencies)
Here are several refinements:
qwe <- barplot(
table(clo$ORDER), main="Frequency of clause orders",
xlab="Order of clauses",
ylab="Observed frequency", ylim=c(0, 300))
text(qwe, # plot text at these x-coordinates
table(clo$ORDER), # and these y-coordinates
table(clo$ORDER), # namely this text (the frequencies)
pos=3) # but slightly higher than that
abline( # draw a straight line
h=mean(table(clo$ORDER)), # horizontal at the mean freq
lty=2) # make it dashed
text(qwe[2], # plot text above the 2nd bar
mean(table(clo$ORDER)), # and at the mean frequency
"mean frequency", # these words
pos=3, # but slightly higher than that
cex=0.75) # reduce font size by 25%
BTW, it is often useful to be able to save those graphs into files, especially for people who still use unspeakable things like Microsoft Word. This is how you do this for the previous plot:
png("_output/barplot_freqs.of.ORDER.png",
height=6, width=6, # dimensions of the plot
units="in", # in inches
res=300) # resolution in dpi, 300 is recommended
# now the plotting
qwe <- barplot(table(clo$ORDER), main="Frequency of clause orders",
xlab="Order of clauses", ylab="Observed frequency", ylim=c(0, 300))
text(qwe, table(clo$ORDER), table(clo$ORDER), pos=3)
abline(h=mean(table(clo$ORDER)), lty=2)
text(qwe[2], mean(table(clo$ORDER)), "mean frequency", pos=3, cex=0.75)
# now you close the plotting device:
dev.off()
png
2
This can now be cropped and then inserted into whatever software you use.
Some quick exercise(s): Generate a frequency table, a proportion table, and a dot chart for clo$ORDER, and state the mode and the normalized entropy:
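One possible solution for the tables (a sketch; the object name qwe is also used by the dot chart code below):
(qwe <- table(clo$ORDER))    # the frequency table (saved for the plot below)
prop.table(qwe)              # the proportion table ...
prop.table(table(clo$ORDER)) # ... (the same thing, computed directly)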
mc-sc sc-mc
275 128
mc-sc sc-mc
0.6823821 0.3176179
mc-sc sc-mc
0.6823821 0.3176179
dotchart( # draw a dot chart
qwe, # of the frequencies of the levels of ORDER
xlim=c(0, 403), # with these x-axis limits (min, max)
pch=16) # using filled circles
abline( # draw a straight line
v=403/2, # vertical, at x=403/2 (even distribution)
lty=3) # make it dotted
text( # plot text
qwe, # at these x-coordinates
1:2, # at these y-coordinates
qwe, # namely the observed frequencies
pos=3) # plot the text above the points
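And the mode's frequency and the normalized entropy (a sketch, using the norm.entropy function assumed above):
max(qwe)          # the frequency of the mode, "mc-sc"
norm.entropy(qwe) # the normalized entropy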
[1] 275
[1] 0.9017721
2.2 Numeric data
For categorical data, we began with the descriptive numeric statistics and then visualized those with plots – for numeric data, it is better to begin with the visualization because, as you saw in Figure 1 above, for numeric variables the choice of statistic depends on whether the variable is normally/uniformly distributed or not. That’s why, here, we start with visualization and then compute the descriptive statistics.
2.2.1 Visualization
The easiest visualization is a histogram:
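For example:
hist(clo$LENSC) # a histogram of LENSC with R's default bins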
Since R decides on the bin width, you always want to do a second one where you change the bin width, e.g. like this:
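For example (the Freedman-Diaconis rule, as used further below, yields more and narrower bins):
hist(clo$LENSC, breaks="FD") # the same histogram with a different bin width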
Here are some other plot options: a box plot, a strip chart, and the best of all, an ecdf plot (for empirical cumulative distribution function):
par(mfrow=c(1, 3)) # set up a plotting panel with 1 row and 3 columns
boxplot(clo$LENSC, notch=TRUE); grid() # boxplot
stripchart(clo$LENSC, method="jitter"); grid() # strip chart
plot(ecdf(clo$LENSC), verticals=TRUE); grid() # ecdf plot
par(mfrow=c(1, 1)) # reset to default
All of them indicate that the variable is not normally distributed (see discussion in class and the textbook). What about the variable LENDIFF?
par(mfrow=c(2, 2)) # set up a plotting panel with 2 rows and 2 columns
clo$LENDIFF %>% hist(main="")
clo$LENDIFF %>% hist(breaks="FD", main="")
boxplot(clo$LENDIFF, notch=TRUE); grid() # boxplot
plot(ecdf(clo$LENDIFF), verticals=TRUE); grid() # ecdf plot
par(mfrow=c(1, 1)) # reset to default
All of these are at least broadly compatible with LENDIFF being normally distributed.
2.2.2 Descriptive stats
For numeric data, the measure of central tendency one provides depends on the shape of the distribution. For non-normal data, we use the median as the measure of central tendency and quantiles/the interquartile range as measures of dispersion:
# central tendency & dispersion for ordinal data
# (or for non-normal data or data with outliers)
median(clo$LENSC) # median
## [1] 8
quantile(clo$LENSC) # quartiles
## 0% 25% 50% 75% 100%
## 2 5 8 12 36
IQR(clo$LENSC) # interquartile range
## [1] 7
quantile(clo$LENSC, probs=seq(0, 1, 0.1)) # deciles
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 2 4 5 6 7 8 9 11 13 17 36
For normal data, we use the mean as the measure of central tendency and the standard deviation/variance as measures of dispersion. Here’s a nice way in which we can represent multiple values in one vector:
setNames( # create a data structure w/ named elements
c(mean(clo$LENDIFF), sd(clo$LENDIFF), var(clo$LENDIFF)), # the elements
c("mean", "std dev", "variance")) # the names
mean std dev variance
-0.09677419 6.85188462 46.94832290
The standard deviation is computed like this:
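A sketch of that computation 'by hand' (it returns the same value as sd(clo$LENDIFF) above):
sqrt(                                       # the square root of
   sum((clo$LENDIFF-mean(clo$LENDIFF))^2) / # the sum of the squared deviations from the mean,
      (length(clo$LENDIFF)-1))              # divided by n-1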
A nicer way to represent this uses two new things that are extremely useful:
- in R, even things that don't look like functions – like +, -, *, /, ^ – are functions, so instead of writing 1+2 one can write "+"(1, 2), or instead of writing 2^3 one can write "^"(2, 3);
- instead of writing function(argument), one can write argument %>% function; the pipe (%>%) makes argument the first (!) argument of function:
And if function has more than one argument, you can provide them later:
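The original examples aren't reproduced above; here is one set of expressions consistent with the output below (each returns 2):
4 %>% sqrt   # the same as sqrt(4)
sqrt(4)      # 'normal' notation
8 %>% "/"(4) # 8 becomes the 1st argument of "/": 8/4
"/"(8, 4)    # the operator / used as a 'normal' function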
[1] 2
[1] 2
[1] 2
[1] 2
With that, we can compute the standard deviation as follows:
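A sketch of that computation with the pipe (again the same value as sd(clo$LENDIFF)):
(clo$LENDIFF-mean(clo$LENDIFF))^2 %>% # the squared deviations from the mean,
   sum %>%                            # summed up,
   "/"(length(clo$LENDIFF)-1) %>%     # divided by n-1,
   sqrt                               # & square-rooted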
2.3 Ordinal data
Ordinal data are sometimes a bit of a tricky issue. One recommendation to deal with them would be this:
- if the number of ordinal levels you have is high (meaning, maybe ≥6), you might
- use descriptive statistics for categorical data, but maybe
- visualizations for numeric data
- if the number of ordinal levels you have is low (meaning, maybe <5), you probably want to treat them as categorical data.
But it’s often a judgment call, and often you may need to try both and see which representation of the data is most instructive.
Some quick exercise(s): The file _input/assignments.csv contains data regarding the question of which course-final assignment was chosen by what kind of student in a large international student exchange program; here's what the data contain/represent; the response variable is ASSIGNMENT:
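A sketch of the code behind the summary and the first rows shown below (memory was cleared above, so the file is loaded again):
ass <- read.delim("_input/assignments.csv", stringsAsFactors=TRUE) # (re)load the data
summary(ass) # summarize all columns
head(ass)    # look at the first 6 rows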
CASE ASSIGNMENT SEX REGION
Min. : 1.00 lab_report:100 female:145 central_europ: 99
1st Qu.: 75.75 oral_exam :100 male :155 hispanic :100
Median :150.50 thesis :100 mid_east :101
Mean :150.50
3rd Qu.:225.25
Max. :300.00
WORKHOURS MISTAKES
Min. :12.10 Min. :10.00
1st Qu.:19.77 1st Qu.:14.75
Median :29.90 Median :17.00
Mean :27.34 Mean :17.12
3rd Qu.:33.50 3rd Qu.:19.00
Max. :39.90 Max. :25.00
CASE ASSIGNMENT SEX REGION WORKHOURS MISTAKES
1 1 oral_exam female central_europ 20.1 18
2 2 oral_exam female central_europ 21.3 18
3 3 oral_exam female central_europ 13.5 13
4 4 oral_exam male central_europ 19.5 12
5 5 oral_exam female central_europ 21.8 12
6 6 oral_exam male mid_east 21.7 11
Explore the variable ass$WORKHOURS:
par(mfrow=c(2,2))
hist(ass$WORKHOURS, main="")
hist(ass$WORKHOURS, main="", breaks="FD")
boxplot(ass$WORKHOURS) # see? bad: doesn't show what's going on well
plot(ecdf(ass$WORKHOURS), verticals=TRUE); grid() # ecdf plot
par(mfrow=c(1,1))
setNames(
c(mean(ass$WORKHOURS), sd(ass$WORKHOURS), median(ass$WORKHOURS), IQR(ass$WORKHOURS)),
c("mean" , "std dev" , "median" , "interq. range"))
mean std dev median interq. range
27.337667 8.014261 29.900000 13.725000
3 Statistical exploration: 2+ variables at a time
Once we have two or more variables to consider at a time, things become a bit more complex – but just a bit: the overall logic will often remain the same.
3.1 Categorical ~ categorical
3.1.1 Descriptive stats
For combinations of categorical variables, descriptive statistics usually just involve cross-tabulation with table again, and now
- the variable provided as the 1st argument goes into the rows;
- the variable provided as the 2nd argument goes into the columns:
table( # tabulate such that the
clo$ORDER, # 1st variable goes into the rows
clo$CONJ) # 2nd variable goes into the columns
als/when bevor/before nachdem/after weil/because
mc-sc 37 28 26 184
sc-mc 56 18 39 15
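And the other way around:
table(clo$CONJ, clo$ORDER) # the same cross-tabulation, transposed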
mc-sc sc-mc
als/when 37 56
bevor/before 28 18
nachdem/after 26 39
weil/because 184 15
Often it's useful to add row/column sums/totals with addmargins:
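For example:
addmargins(table(clo$ORDER, clo$CONJ)) # add row & column totals ...
addmargins(table(clo$CONJ, clo$ORDER)) # ... for both orientations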
als/when bevor/before nachdem/after weil/because Sum
mc-sc 37 28 26 184 275
sc-mc 56 18 39 15 128
Sum 93 46 65 199 403
mc-sc sc-mc Sum
als/when 37 56 93
bevor/before 28 18 46
nachdem/after 26 39 65
weil/because 184 15 199
Sum 275 128 403
Tables of proportions with prop.table are also possible again, but now you need to decide 'proportions of what?':
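A sketch of the three variants behind the output below:
prop.table(table(clo$ORDER, clo$CONJ))    # proportions of the grand total
prop.table(table(clo$ORDER, clo$CONJ), 1) # proportions within rows
prop.table(table(clo$ORDER, clo$CONJ), 2) # proportions within columns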
als/when bevor/before nachdem/after weil/because
mc-sc 0.09181141 0.06947891 0.06451613 0.45657568
sc-mc 0.13895782 0.04466501 0.09677419 0.03722084
als/when bevor/before nachdem/after weil/because
mc-sc 0.13454545 0.10181818 0.09454545 0.66909091
sc-mc 0.43750000 0.14062500 0.30468750 0.11718750
als/when bevor/before nachdem/after weil/because
mc-sc 0.39784946 0.60869565 0.40000000 0.92462312
sc-mc 0.60215054 0.39130435 0.60000000 0.07537688
Usually, it is best to compute proportions that add up to 100% for the levels of the response variable, i.e. here for ORDER. That means,
- if your dependent variable is in the rows, you need column percentages;
- if your dependent variable is in the columns, you need row percentages.
Also: remember to provide only useful decimals (here with the pipe):
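For example:
table(clo$ORDER, clo$CONJ) %>% # the cross-tabulation, then
   prop.table(2) %>%           # column proportions, then
   round(3)                    # rounded to 3 decimals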
als/when bevor/before nachdem/after weil/because
mc-sc 0.398 0.609 0.400 0.925
sc-mc 0.602 0.391 0.600 0.075
You can also do higher-dimensional cross-tabulation: a third variable goes into separate ‘slices’ of such an array …
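For example:
table(clo$MORETHAN2CL, # the row variable
   clo$CONJ,           # the column variable
   clo$ORDER)          # the variable defining the 'slices'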
, , = mc-sc
als/when bevor/before nachdem/after weil/because
no 19 18 14 87
yes 18 10 12 97
, , = sc-mc
als/when bevor/before nachdem/after weil/because
no 22 6 14 5
yes 34 12 25 10
… and the notation separating the slices also gives you a clue that you can access them using the same subsetting mechanisms as before:
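A sketch (the name qwe is chosen here for illustration):
(qwe <- table(clo$MORETHAN2CL, clo$CONJ, clo$ORDER)) # the 3-dimensional table
qwe["no",,]       # the slice for MORETHAN2CL being "no"
qwe[,"als/when",] # the slice for CONJ being "als/when"
qwe[,,"mc-sc"]    # the slice for ORDER being "mc-sc"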
, , = mc-sc
als/when bevor/before nachdem/after weil/because
no 19 18 14 87
yes 18 10 12 97
, , = sc-mc
als/when bevor/before nachdem/after weil/because
no 22 6 14 5
yes 34 12 25 10
mc-sc sc-mc
als/when 19 22
bevor/before 18 6
nachdem/after 14 14
weil/because 87 5
mc-sc sc-mc
no 19 22
yes 18 34
als/when bevor/before nachdem/after weil/because
no 19 18 14 87
yes 18 10 12 97
For such cases, a function called ftable might be better, because it creates flat contingency tables, putting the last argument into the columns:
ftable( # create a flat contingency table with
clo$MORETHAN2CL, # this as the 1st row variable
clo$CONJ, # this as the 2nd row variable
clo$ORDER) # this as the column variable
mc-sc sc-mc
no als/when 19 22
bevor/before 18 6
nachdem/after 14 14
weil/because 87 5
yes als/when 18 34
bevor/before 10 12
nachdem/after 12 25
weil/because 97 10
This, together with proportions, is often very useful:
ftable( # create a flat contingency table with
clo$MORETHAN2CL, # this as the 1st row variable
clo$CONJ, # this as the 2nd row variable
clo$ORDER) %>% # this as the column variable, then
prop.table(1) %>% round(3) # make this rounded %s
mc-sc sc-mc
no als/when 0.463 0.537
bevor/before 0.750 0.250
nachdem/after 0.500 0.500
weil/because 0.946 0.054
yes als/when 0.346 0.654
bevor/before 0.455 0.545
nachdem/after 0.324 0.676
weil/because 0.907 0.093
3.1.2 Visualization
For 2-dimensional tables, dot charts are again very useful:
als/when bevor/before nachdem/after weil/because
mc-sc 37 28 26 184
sc-mc 56 18 39 15
dotchart( # draw a dot chart
table(clo$ORDER, clo$CONJ), # of these frequencies
xlim=c(0, 403), # with these x-axis limits (min, max)
pch=16) # using filled circles
abline( # draw a straight line
v=403/8, # vertical, at x=403/8 (even distribution)
lty=3) # make it dotted
And now dotchart is even more annoying than before because
- the ordering of the column levels is still 'reversed', it rises from bottom to top, so we fix that again with manual reordering;
- the nesting of the variables in the plot does not correspond to the table (which is easy to fix: we transpose with t).
als/when bevor/before nachdem/after weil/because
mc-sc 37 28 26 184
sc-mc 56 18 39 15
dotchart( # draw a dot chart
t(table(clo$ORDER, clo$CONJ))[4:1,], # of these frequencies, now transposed & reversed!
xlim=c(0, 403), # with these x-axis limits (min, max)
pch=16) # using filled circles
abline( # draw a straight line
v=403/8, # vertical, at x=403/8 (even distribution)
lty=3) # make it dotted
The version I would most likely use, though, is this one (because now both the levels of CONJ and ORDER are presented in alphabetical order):
als/when bevor/before nachdem/after weil/because
mc-sc 37 28 26 184
sc-mc 56 18 39 15
dotchart( # draw a dot chart
table(clo$ORDER, clo$CONJ)[2:1,], # of these frequencies, now reversed!
xlim=c(0, 403), # with these x-axis limits (min, max)
pch=16) # using filled circles
abline( # draw a straight line
v=403/8, # vertical, at x=403/8 (even distribution)
lty=3) # make it dotted
Bar charts are again an only slightly inferior alternative, but the most basic version is absolutely bare-bones:
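For example:
barplot(table(clo$ORDER, clo$CONJ)) # the most basic (stacked) bar chart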
Totally useless … With the current plot size, we can't even read the bar labels. Before we tweak this more comprehensively, here's a nice quick and dirty fix for that: we let R abbreviate the labels to, here, 6 characters:
barplot( # a bar chart of
table(clo$ORDER, clo$CONJ), # this table
names.arg=abbreviate( # but abbreviate
levels(clo$CONJ), # the levels of CONJ
6)) # to 6 characters
But now we make it generally better (and increase the size of the plot so that more of the labels fit):
barplot( # make a bar chart of
table(clo$ORDER, clo$CONJ), # this table
# x-axis stuff and names.arg:
xlab="Conjunctions", names.arg=abbreviate(levels(clo$CONJ), 8),
ylab="Frequency", ylim=c(0, 200)) # y-axis stuff
legend(2, 150, # x- and y-coordinates
title="Clause orders", # heading of legend
fill=gray.colors(2), # colors of legend
legend=levels(clo$ORDER), # text of legend
xjust=0.5, yjust=0.5, # centered around coordinates
ncol=2, # number of columns of legend
bty="n") # box type is 'none'
And a maybe even better version with a lot of refinements:
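The proportion table p.table used in the plotting code below was presumably created along these lines:
(p.table <- prop.table(table(clo$ORDER, clo$CONJ), 2)) # column %s: ORDER within each CONJ
(p.table <- p.table[, c(4, 2, 3, 1)]) # reorder the conjunctions: causal first, then temporal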
als/when bevor/before nachdem/after weil/because
mc-sc 0.39784946 0.60869565 0.40000000 0.92462312
sc-mc 0.60215054 0.39130435 0.60000000 0.07537688
weil/because bevor/before nachdem/after als/when
mc-sc 0.92462312 0.60869565 0.40000000 0.39784946
sc-mc 0.07537688 0.39130435 0.60000000 0.60215054
# now the plotting
qwe <- barplot(p.table, # make a bar chart of the proportion table
beside=TRUE, # w/ bars within CONJ side-by-side
ylim=c(0, 1.1), # & these y-axis limits (leaving room at top)
# this heading and these axis labels:
main="The relation between conjunction and order",
xlab="Conjunction", ylab="Observed % order per conjunction",
col=c("darkgreen", "darkred"), # & these colors
width=prop.table(table(clo$ORDER)))
text(qwe, p.table, # add at the bars' centers & their y-axis values
round(p.table, 3), # the observed proportions
cex=0.75, pos=3, # small font, a bit above
col=c("darkgreen", "darkred")) # in the same colors
legend(x=mean(qwe), y=1, title="Clause orders", # add legend at top of plot
fill=c("darkgreen", "darkred"),
legend=c("mc-sc", "sc-mc"),
xjust=0.5, yjust=0.5, ncol=2, bty="n")
abline(v=1.84, lty=2) # vertical line between first 2 sets of bars
text(1.84, 0.9, "causal" , srt=90, adj=c(0.5, -0.5)) # add vertical labels
text(1.84, 0.9, "temporal", srt=90, adj=c(0.5, 1.25)) # to the vert. line
Another option would be mosaic plots, the most basic version of which again is pretty basic; note: I immediately transpose so the orientation of the plot corresponds to that of the table:
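A sketch of that basic version:
table(clo$ORDER, clo$CONJ)                # the table, as a reminder of its orientation
mosaicplot(t(table(clo$ORDER, clo$CONJ))) # the most basic mosaic plot (transposed)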
als/when bevor/before nachdem/after weil/because
mc-sc 37 28 26 184
sc-mc 56 18 39 15
A useful addition to this is the argument shade=TRUE, because then blue and red shading indicates what's over- and what's underrepresented respectively:
als/when bevor/before nachdem/after weil/because
mc-sc 37 28 26 184
sc-mc 56 18 39 15
mosaicplot(main="ORDER as a function of CONJ",
table(clo$CONJ, clo$ORDER),
shade=TRUE,
xlab="Conjunctions", ylab="Clause orders")
Some quick exercise(s): Cross-tabulate the response variable clo$ORDER with the predictor clo$MORETHAN2CL (frequencies and percentages) and represent the results in a dot chart and a mosaic plot:
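One possible solution (a sketch):
(qwe <- table(clo$MORETHAN2CL, clo$ORDER)) # the frequencies
prop.table(qwe, 1)                         # row percentages (ORDER is in the columns)
dotchart(qwe, xlim=c(0, 403), pch=16)      # a dot chart of the frequencies
mosaicplot(qwe, shade=TRUE)                # a mosaic plot with shading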
mc-sc sc-mc
no 138 47
yes 137 81
mc-sc sc-mc
no 0.7459459 0.2540541
yes 0.6284404 0.3715596
3.2 Categorical ~ ordinal/numeric
3.2.1 Descriptive stats
This is a quite frequent scenario, where we have a categorical (often binary) response and a numeric predictor. In some sense, this is often dealt with in a way that 'converts' the numeric variable temporarily into a categorical/ordinal variable with a variety of levels, for each of which we then compute the distribution of the response variable. The splitting up of the numeric variable into levels – the binning – we can do again with cut, where we leave the breaks up to R's function hist; we just suppress the plotting of hist:
(qwe <- table( # make qwe the table of
cut( # the factor that results from cutting up
clo$LENDIFF, # LENDIFF
breaks=hist(clo$LENDIFF, # at the same breaks hist ...
plot=FALSE)$breaks, # would use (no plotting)
include.lowest=TRUE), # include the smallest value as well
clo$ORDER)) # the factor ORDER
mc-sc sc-mc
[-35,-30] 1 0
(-30,-25] 0 0
(-25,-20] 2 1
(-20,-15] 4 2
(-15,-10] 13 4
(-10,-5] 49 12
(-5,0] 91 31
(0,5] 78 54
(5,10] 25 13
(10,15] 8 7
(15,20] 4 3
(20,25] 0 1
But of course you can also manually choose a number of breaks, in which case R creates that many equally wide bins:
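A sketch with a manually chosen number of bins:
table(cut(clo$LENDIFF, breaks=10), # cut LENDIFF into 10 equally wide bins &
   clo$ORDER)                      # cross-tabulate with ORDER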
3.2.2 Visualization
The corresponding visualization is called a spineplot, essentially a stacked bar plot for a series of grouped numeric values, which we can generate with the function spineplot:
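For example:
spineplot(clo$LENDIFF, clo$ORDER) # the numeric variable first, the categorical response second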
However, the probably more frequent use of this function uses something very important, the so-called formula notation, where we have
- the so-called lhs, which is the 'left-hand side': the response variable (here ORDER);
- the tilde ~, which means 'as a function of';
- the so-called rhs, which is the 'right-hand side': the one or more predictor variables (here LENDIFF):
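For example:
spineplot(ORDER ~ LENDIFF, data=clo) # the response 'as a function of' the predictor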
3.3 Ordinal/numeric ~ categorical
3.3.1 Descriptive stats
This section introduces one of the greatest things in R, the way in which functions from the apply family allow you to avoid what other programming languages would often do with loops. Given everything you know so far, how would you compute the mean length differences (between main and subordinate clauses) for each conjunction? Probably like this:
setNames(
c(mean(clo$LENDIFF[clo$CONJ=="als/when"]),
mean(clo$LENDIFF[clo$CONJ=="bevor/before"]),
mean(clo$LENDIFF[clo$CONJ=="nachdem/after"]),
mean(clo$LENDIFF[clo$CONJ=="weil/because"])),
levels(clo$CONJ))
als/when bevor/before nachdem/after weil/because
-0.04301075 2.52173913 0.41538462 -0.89447236
It doesn't get more annoying than that and we even have to label our results ourselves … Thus, we rather do this using tapply ('tabulate apply'):
tapply( # apply to
clo$LENDIFF, # (the values of) LENDIFF
clo$CONJ, # a grouping by CONJ
FUN=mean) # and compute the mean for each group
als/when bevor/before nachdem/after weil/because
-0.04301075 2.52173913 0.41538462 -0.89447236
This can be extended to more than one grouping variable:
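For example:
tapply(clo$LENDIFF,           # apply to (the values of) LENDIFF
   list(clo$CONJ, clo$ORDER), # a grouping by CONJ & ORDER
   FUN=mean)                  # & compute the mean for each group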
3.3.2 Visualization
We can visualize these kinds of results in the same way as we did when we had only 1 variable; we now just don't plot those results for all the data but for the groups defined by the grouping variable(s):
boxplot( # make a box plot for
clo$LENDIFF ~ clo$ORDER, # this formula
notch=TRUE, # include notches
varwidth=TRUE, # box width is the number of data points
xlab="Clause orders", ylab="Length differences mc-sc")
abline(h=0, lty=2) # dashed line at y=0
I am not showing the strip charts, but the ecdf plot is still probably the best way to go:
plot( # make an ecdf plot
ecdf(clo$LENDIFF), # of this variable, but
col="white", # plot it in white (i.e. invisibly)
# use no heading & these axis labels:
main="", xlab="Length differences mc-sc", ylab="Cumulative %")
grid() # add a grid
lines( # draw lines for the ecdf when ORDER is "mc-sc":
ecdf(clo$LENDIFF[clo$ORDER=="mc-sc"]),
# pick 2 very distinctive colors, then use the 1st one for "mc-sc"
col=rainbow(2)[1],
verticals=TRUE) # connect points also vertically
lines( # draw lines for the ecdf when ORDER is "sc-mc":
ecdf(clo$LENDIFF[clo$ORDER=="sc-mc"]),
# pick 2 very distinctive colors, then use the 2nd one for "sc-mc"
col=rainbow(2)[2],
verticals=TRUE) # connect points also vertically
legend(title="Clause orders",
x=-25, xjust=0.5,
y=0.7, yjust=0.5,
fill=rainbow(2),
legend=levels(clo$ORDER),
ncol=2, bty="n")
3.4 Ordinal/numeric ~ ordinal/numeric
3.4.1 Descriptive stats
We go back to the reaction time data for this; remember what the columns mean:
rm(list=ls(all=TRUE)) # clear memory
summary(rts <- read.delim("_input/reactiontimes.csv", stringsAsFactors=TRUE))
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :12 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :31 NA's:24
apricot : 1 Mean :631.9 Mean : 9.26 NA's:22
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00
avocado : 1 Max. :794.5 Max. :118.00
(Other) :71
MEAN
Min. :315.0
1st Qu.:409.8
Median :437.5
Mean :436.0
3rd Qu.:466.2
Max. :553.0
NA's :29
Correlations between numeric variables are expressed with correlation coefficients, which usually fall into the interval [-1, 1] such that
- values close to +1 indicate a strong positive correlation, which can be expressed with 'the more/less …, the more/less …';
- values around 0 indicate no correlation;
- values close to -1 indicate a strong negative correlation, which can be expressed with 'the less/more …, the more/less …'.
Note: two variables are correlated if knowing the value of one variable helps you predict/narrow down the value range to expect for the other variable.
If you have ‘well-behaved’ and normal variables you want to correlate, you can compute the Pearson product-moment correlation coefficient r:
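For example (no output is shown for this one):
cor(rts$FREQ, rts$RT) # Pearson's r (cor's default method)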
On the other hand, if your data are not 'well-behaved' and normal, you can compute either Spearman's rho or Kendall's tau:
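A sketch of the calls behind the two values below:
cor(rts$FREQ, rts$RT, method="spearman") # Spearman's rho
cor(rts$FREQ, rts$RT, method="kendall")  # Kendall's tau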
[1] -0.4851569
[1] -0.3473737
But how do you know whether your data are ‘well-behaved’ and normal? Visualization, which means you should actually nearly always visualize before you compute statistics!
3.4.2 Visualization
For such data, a scatter plot is the useful default. Minimally, that could be either one of these:
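For example:
plot(rts$FREQ, rts$RT)  # x-values first, y-values second ...
plot(rts$RT ~ rts$FREQ) # ... or with formula notation: y ~ x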
But let’s add something you should always add: a non-parametric smoother, i.e. a (potentially quite curvy) line that tries to summarize the point cloud:
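For example:
plot(rts$RT ~ rts$FREQ); grid()             # the scatter plot again, with a grid
lines(lowess(rts$RT ~ rts$FREQ), col="red") # add a lowess smoother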
Clearly, these are not ‘well-behaved’ let alone normal data; this is confirmed by plotting quick histograms:
par(mfrow=c(1, 2)) # set up a plotting panel with 1 row & 2 columns
hist(rts$FREQ, main="")
hist(rts$RT , main="")
par(mfrow=c(1, 1)) # reset to default
In such situations, using Pearson's r is completely wrong – you either have to use one of the other correlation coefficients or you try to transform the offending variable (here, FREQ) into a shape that is more 'well-behaved' and normal. For the shape that FREQ has, often the log transformation works well:
rts$FREQLOG <- log2(rts$FREQ)
par(mfrow=c(1, 2)) # set up a plotting panel with 1 row & 2 columns
hist(rts$FREQLOG, main="")
plot(rts$RT ~ rts$FREQLOG); grid()
lines(lowess(rts$RT ~ rts$FREQLOG), col="red")
par(mfrow=c(1, 1)) # reset to default
In this case, the log transformation helps: Yes, the histogram is still clearly non-normal, but the scatterplot is much better than before. Thus, we re-compute the correlation coefficients:
setNames(
c(cor(rts$FREQLOG, rts$RT),
cor(rts$FREQLOG, rts$RT, method="spearman"),
cor(rts$FREQLOG, rts$RT, method="kendall")),
c("Pearson's r", "Spearman's rho", "Kendall's tau"))
Pearson's r Spearman's rho Kendall's tau
-0.4679032 -0.4851569 -0.3473737
Now, a much more refined version is this:
plot(main=paste("RT ~ log frequency; (tau=", round(cor(rts$FREQLOG, rts$RT, method="kendall"), 3), " ***)", sep=""),
xlab="Frequency (logged)", xlim=c(1, 7) , x=rts$FREQLOG,
ylab="Reaction time in ms", ylim=c(500, 900), y=rts$RT,
pch=16, cex=2, col="#00AA0040")
grid()
lines(lowess(rts$RT ~ rts$FREQLOG), lwd=5, col="darkgreen") # thick locally-weighted smoother
abline(v=median(rts$FREQLOG), lty=2) # add a vertical line at the median of FREQLOG
abline(h=median(rts$RT), lty=2) # add a horizontal line at the median of RT
text(x=median(rts$FREQLOG),
y=850,
labels="median frequency",
srt=90, adj=c(0.5, -0.5))
text(6, median(rts$RT), "median RT", adj=c(NA, -0.5))
text(mean(rts$FREQLOG), median(rts$RT), "+", font=2)
text(median(rts$FREQLOG), mean(rts$RT), "+", font=2)
qwe <- tapply(rts$RT, list(rts$RT>median(rts$RT), rts$FREQLOG>median(rts$FREQLOG)), length)
text(c(1, 1 , 7, 7), c(500, 900, 500, 900), qwe, col="darkgreen", cex=0.5+(qwe/median(qwe)))
The way the color is defined for the points involves a code of the form "RRGGBBOO":
- the first two characters state how much red you want, the next two how much green, and the next two how much blue;
- the last two characters state the opacity of the point, so 00 means the point is actually completely transparent and FF means the point is fully opaque;
- that in turn might already tell you that these are values on the hexadecimal scale (where the digits 0 to F correspond to the values 0 to 15), meaning each pair goes from 00 (for 0) to FF (for 16×15 + 15 = 255). For example,
- “#FF0000FF” is bright red, fully opaque whereas “#0000FFFF” is bright blue, fully opaque;
- “#FFFF00FF” is red plus green, i.e. yellow, fully opaque;
- “#FF00FF80” is red plus blue, i.e. purple, half opaque.
Another often useful option is to plot the words, and this one I will save into a file to remind you of how that’s done:
png("_output/scatterplot_FREQ&RT.png",
height=6, width=6, # dimensions of the plot
units="in", # in inches
res=300) # resolution in dpi, 300 is recommended
# now the plotting
plot(rts$FREQLOG, rts$RT, type="n"); grid() # set up empty coordinate system!
text(rts$FREQLOG, rts$RT, # plot text at these coordinates
rts$CASE, # namely the words
cex=rts$MEAN/475) # at sizes reflecting the MEAN
lines(lowess(rts$RT ~ rts$FREQLOG), lwd=5, col="#00000030") # add a thick transparent grey smoother
# now you close the plotting device:
dev.off()
png
2
You can find the result at the path shown above.
Some quick exercise(s): R has a function called nchar, which returns the number(s) of letters of a character vector:
rts$CASE %>% # take the factor rts$CASE
as.character %>% # make it a character vector
nchar # count the letters of each word
[1] 6 3 5 7 9 7 6 3 6 3 8 10 8 5 9 5 6 3 6 7 5 9 8 3 7
[26] 6 5 8 8 3 4 7 4 5 5 4 8 5 4 4 5 7 4 6 5 4 6 5 5 8
[51] 7 5 6 3 7 6 4 3 9 6 6 8 5 5 5 6 5 8 5 10 4 6 8 6 4
[76] 5 10
Use this function to generate a scatter plot that represents the correlation between the lengths of words and their reaction times in a plot like the one above for frequency:
rts$LEN <- rts$CASE %>% as.character %>% nchar
# set up coordinate system
plot(rts$LEN, rts$RT, type="n"); grid()
text( # plot text at
rts$LEN, # these x-coordinates
rts$RT, # these y-coordinates
rts$CASE, # namely these words
cex=0.8 + rts$FREQLOG/7) # at these sizes
# add smoother
lines(lowess(rts$RT ~ rts$LEN), lwd=5, col="#00000030")
4 Comments on workflow
Here are some comments on how you would ideally organize your workflow for any empirical study. First, the one thing you do not want to do is dump all files of all your projects, unsorted, into your Documents folder – but that is what most people do; I am not kidding.
Actually, that may even be too charitable, because it implies that people use their Documents folder at all. What most people really seem to do is a combination of two ‘approaches’. One is to keep all their files in the Downloads folder – I have seen Downloads folders on students’ laptops with literally >20,000 files in them … The other ‘approach’ is probably just as common.
Maybe, just maybe, we can avoid this? Here’s what you should do:
- create a separate folder with a meaningful name for whatever empirical study you plan on doing, e.g. <2024_MAthesis>;
- create at least two subfolders in it:
  - <_input>: this folder will contain all the data files that you (most likely) prepare in spreadsheet software and read into R; typically these will be
    - spreadsheet files created with Excel or LibreOffice Calc and .txt/.csv/.tsv versions of those files for import into R;
    - R-specific data files (.rda files);
    - note: ideally, every data file comes with a separate text file that explains its structure in detail, like this file explains this data file; if you know you will use R more comprehensively, you should familiarize yourself with the Roxygen2 format I use for describing the data;
  - <_output>: this folder will contain all files that result from what you do in R; most likely these will be
    - plots (probably as .png files, maybe as .svg files);
    - other results files, which could be .txt/.csv/.tsv files or R-specific .rda files;
- maybe you will also want to have a folder for .pdf files of papers, books, etc.; the small sketch after this list shows how such a folder skeleton can be created from within R.
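In case you prefer to set this up from within R rather than in your file manager, a minimal sketch (using the folder names suggested above) could look like this:
dir.create("2024_MAthesis")         # the project folder
dir.create("2024_MAthesis/_input")  # for the data files you will read into R
dir.create("2024_MAthesis/_output") # for the plots & results files created in R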
For your actual R analyses, there are two suggestions: a basic one and a more advanced (but also way better) one if you’re serious about this. The basic one is to create your analysis in a simple .r/.R file in RStudio but to use a ton of pound signs, for two reasons:
- to add lots of commentary into the file to say what you’re doing, why you’re doing it, what you see in the results, etc. This way, your R analysis file is already half your methods section/chapter, which you later need anyway.
- to create a structure that you can use as an outline for quick navigation (with the outline button in RStudio and the navigator at the bottom).
Here’s an example:
# First analysis: something amazing #############################
## Loading file #################################################
some code
## Exploration and preparation ##################################
some code # I plot this to see whether the data are normal
# turns out they are not
some more code # log-transform the data
some plotting # I plot this to see whether the data are now normal
# yes, they are
### Excursus: why not try THIS? #################################
## Compute correlations #########################################
some code # good, this is high
## Plot this, too ###############################################
ton of plotting code
ton of plotting code
ton of plotting code
ton of plotting code
# Second analysis: something even more amazing ##################
...
It might be useful to number your code files like <01a_analysis_of_RTs.r> and have all output files that such a script creates begin with the same number so it’s easier to keep track of everything; the sketch below shows the idea.
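A hypothetical illustration (the file names are invented for the example): inside a script called <01a_analysis_of_RTs.r>, every file that gets written out starts with the same "01a_" prefix:
png("_output/01a_scatterplot_FREQ&RT.png", # plot file w/ the script's prefix
   height=6, width=6, units="in", res=300)
plot(rts$FREQLOG, rts$RT, pch=16); grid()
dev.off()
cat(cor(rts$FREQLOG, rts$RT, method="kendall"), # results file w/ the same prefix
   file="_output/01a_correlation_FREQ&RT.txt")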
The serious suggestion is to learn RMarkdown or, better still, its newer successor, which is called Quarto. There is no better way to do replicable data science than the kind of literate programming these tools allow you to do, and it’s not nearly as hard as you might think it is.
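For orientation, a minimal Quarto file (.qmd) could look roughly like the sketch below (a sketch only; see the Quarto documentation for details): prose and R code chunks live in the same file, and rendering it (e.g. with the Render button in RStudio or with quarto render on the command line) produces a report in which code, output, and commentary are interleaved.
---
title: "Reaction times & frequencies"
format: html
---

First, we load the data:

```{r}
rts <- read.delim("_input/reactiontimes.csv", stringsAsFactors=TRUE)
```

The correlation between (logged) frequency and reaction time:

```{r}
cor(rts$FREQLOG, rts$RT, method="kendall")
```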
5 Exercises
5.1 The ASSIGNMENT data
The file _input/assignments.csv contains data regarding the question of which course-final assignment was chosen by what kind of student in a large international student exchange program. Here’s what the data contain/represent; the response variable is ASSIGNMENT:
rm(list=ls(all=TRUE)) # clear memory
summary(ass <- read.delim("_input/assignments.csv", stringsAsFactors=TRUE))
CASE ASSIGNMENT SEX REGION
Min. : 1.00 lab_report:100 female:145 central_europ: 99
1st Qu.: 75.75 oral_exam :100 male :155 hispanic :100
Median :150.50 thesis :100 mid_east :101
Mean :150.50
3rd Qu.:225.25
Max. :300.00
WORKHOURS MISTAKES
Min. :12.10 Min. :10.00
1st Qu.:19.77 1st Qu.:14.75
Median :29.90 Median :17.00
Mean :27.34 Mean :17.12
3rd Qu.:33.50 3rd Qu.:19.00
Max. :39.90 Max. :25.00
CASE ASSIGNMENT SEX REGION WORKHOURS MISTAKES
1 1 oral_exam female central_europ 20.1 18
2 2 oral_exam female central_europ 21.3 18
3 3 oral_exam female central_europ 13.5 13
4 4 oral_exam male central_europ 19.5 12
5 5 oral_exam female central_europ 21.8 12
6 6 oral_exam male mid_east 21.7 11
Your task is to explore the data in ways that would ultimately prepare for and support an analysis of the response variable, i.e. do the usual exploration of these variable(s) and their combinations:
- response/dependent variable: ASSIGNMENT;
- predictors/independent variables (pick a subset of 4-6 variables you find interesting);
- predictors/independent variables in that subset w/ each other;
- predictors/independent variables in that subset w/ the response/dependent variable.
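In case you are unsure what ‘the usual exploration’ might look like, here is a minimal sketch for the response and two of the predictors (not a full solution – picking the appropriate statistics and plots is part of the task):
table(ass$ASSIGNMENT)                   # frequencies of the response
table(ass$REGION, ass$ASSIGNMENT)       # a categorical predictor w/ the response
boxplot(ass$WORKHOURS ~ ass$ASSIGNMENT) # a numeric predictor w/ the response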
5.2 The COMPLEMENTIZER data
The file _input/thatcomplementation.csv contains data regarding the question of whether a speaker would realize that as a complementizer or not as in examples (1) and (2):
(1) a. The problem was Ø the Vorlons refused to help in the war.
b. The problem was that the Vorlons refused to help in the war.
(2) a. I thought Ø the Vorlons could have helped earlier.
b. I thought that the Vorlons could have helped earlier.
Here’s what the data contain/represent; the response variable is COMPLEMENTIZER:
rm(list=ls(all=TRUE)) # clear memory
summary(tha <- read.delim("_input/thatcomplementation.csv", stringsAsFactors=TRUE))
CASE CORPUS FILE MATCHLEMMA COMPLEMENTIZER
Min. : 1 ICE-GB :3669 GE042 : 58 think :2683 absent :3769
1st Qu.:1544 ICLE :1260 SP024 : 48 say : 895 present:2405
Median :3088 LINDSEI:1245 GE009 : 42 be : 586
Mean :3088 S1A-022: 39 mean : 428
3rd Qu.:4631 S1B-039: 35 know : 359
Max. :6174 S1A-054: 32 hope : 153
(Other):5920 (Other):1070
TYPE L1 REGISTER L_COMP L_COMPSUBJ
obj:5524 English:3669 spoken :4206 Min. : 4.00 Min. : 1.000
sub: 650 German :1369 written:1968 1st Qu.: 27.25 1st Qu.: 2.000
Spanish:1136 Median : 48.00 Median : 3.000
Mean : 63.66 Mean : 8.479
3rd Qu.: 82.00 3rd Qu.: 8.000
Max. :500.00 Max. :189.000
L_COMPREST L_MATRV2CC L_MATRBE4S L_MATRSUBJ
Min. : 1.00 Min. : 0.00000 Min. : 0.000 Min. : 1.000
1st Qu.: 22.00 1st Qu.: 0.00000 1st Qu.: 0.000 1st Qu.: 1.000
Median : 39.00 Median : 0.00000 Median : 0.000 Median : 1.000
Mean : 55.18 Mean : 0.04244 Mean : 2.864 Mean : 5.795
3rd Qu.: 72.00 3rd Qu.: 0.00000 3rd Qu.: 0.000 3rd Qu.: 4.000
Max. :463.00 Max. :69.00000 Max. :202.000 Max. :175.000
L_MATRS2V DP_CW DP_WC SURPRISAL
Min. : 0.0000 Min. :-0.113264 Min. :-0.521334 Min. : 1.343
1st Qu.: 0.0000 1st Qu.:-0.003243 1st Qu.:-0.007668 1st Qu.: 4.094
Median : 0.0000 Median : 0.035430 Median : 0.154076 Median : 5.045
Mean : 0.5011 Mean : 0.153483 Mean : 0.127166 Mean : 6.100
3rd Qu.: 0.0000 3rd Qu.: 0.374783 3rd Qu.: 0.421268 3rd Qu.: 7.020
Max. :119.0000 Max. : 0.374783 Max. : 0.480777 Max. :20.007
CASE CORPUS FILE MATCHLEMMA COMPLEMENTIZER TYPE L1 REGISTER L_COMP
1 1 ICE-GB S2A-068 accept present obj English spoken 115
2 2 ICE-GB S1B-029 accept absent obj English spoken 33
3 3 ICLE SPM04051 accept present obj Spanish written 43
4 4 ICE-GB S1B-063 accept present obj English spoken 55
5 5 ICE-GB S1B-065 accept present obj English spoken 81
6 6 ICE-GB S1B-056 accept present obj English spoken 324
L_COMPSUBJ L_COMPREST L_MATRV2CC L_MATRBE4S L_MATRSUBJ L_MATRS2V DP_CW
1 14 101 0 0 1 0 -0.00676608
2 12 21 0 0 3 0 -0.00676608
3 14 29 0 3 2 0 -0.00676608
4 2 53 0 0 1 0 -0.00676608
5 1 80 0 0 1 0 -0.00676608
6 2 322 0 0 2 0 -0.00676608
DP_WC SURPRISAL
1 -0.4659455 6.424231
2 -0.4659455 13.319049
3 -0.4659455 12.319049
4 -0.4659455 9.511694
5 -0.4659455 9.412159
6 -0.4659455 4.560826
Your task is to explore the data in ways that would ultimately prepare for and support an analysis of the response variable, i.e. do the usual exploration of these variable(s) and their combinations:
- response/dependent variable: COMPLEMENTIZER;
- predictors/independent variables (pick a subset of 4-6 variables you find interesting);
- predictors/independent variables in that subset w/ each other;
- predictors/independent variables in that subset w/ the response/dependent variable.
5.3 The DURATION data
The file _input/wdurations.csv contains data regarding the question of how long the pronunciation of a word lasts in a spoken turn with 10 words in it. Here’s what the data contain/represent; the response variable is DURATION:
rm(list=ls(all=TRUE)) # clear memory
summary(dur <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE))
CASE CONV SPEAKER SEX DURATION
Min. : 1 C08 : 873 S15 : 478 female:3203 Min. : 14.0
1st Qu.:1596 C05 : 831 S10 : 455 male :3180 1st Qu.: 110.0
Median :3192 C06 : 800 S11 : 414 Median : 162.0
Mean :3192 C07 : 772 S14 : 410 Mean : 194.5
3rd Qu.:4788 C04 : 737 S08 : 401 3rd Qu.: 245.0
Max. :6383 C09 : 557 S16 : 395 Max. :1405.0
(Other):1813 (Other):3830
POSINTURN LENGTH FREQ SURPRISAL
Min. : 1.000 Min. : 1.000 Min. : 0.00 Min. : 0.00
1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.:12.50 1st Qu.: 4.00
Median : 6.000 Median : 3.000 Median :15.00 Median : 6.00
Mean : 5.524 Mean : 2.768 Mean :14.04 Mean : 6.17
3rd Qu.: 8.000 3rd Qu.: 3.000 3rd Qu.:17.00 3rd Qu.: 8.00
Max. :10.000 Max. :14.000 Max. :18.00 Max. :21.00
CLASS VOWELS
closed :4021 one :3698
number : 125 two+:2685
open :2086
propname: 110
unclear : 41
CASE CONV SPEAKER SEX DURATION POSINTURN LENGTH FREQ SURPRISAL CLASS VOWELS
1 1 C01 S01 male 105 2 3 16 5 closed one
2 2 C01 S01 male 89 3 3 16 6 closed one
3 3 C01 S01 male 78 4 2 18 3 closed one
4 4 C01 S01 male 100 5 1 15 3 closed one
5 5 C01 S01 male 368 6 6 11 8 open two+
6 6 C01 S01 male 251 7 2 16 7 closed two+
Your task is to explore the data in ways that would ultimately prepare for and support an analysis of the response variable, i.e. do the usual exploration of these variable(s) and their combinations:
- response/dependent variable: DURATION;
- predictors/independent variables (pick a subset of 4-6 variables you find interesting);
- predictors/independent variables in that subset w/ each other;
- predictors/independent variables in that subset w/ the response/dependent variable.
6 The final assignment: the Boston housing data
The file _input/boston.csv contains data on the values of real estate properties in the greater Boston area; the question of interest is which variables best explain how valuable a house/lot is. Here’s what the data contain/represent; the response variable is MEDIANVALC. Your task is to explore the data in ways that would ultimately prepare for and support an analysis of the response variable, i.e. do the usual exploration of these variable(s) and their combinations:
- response/dependent variable: MEDIANVALC;
- predictors/independent variables (pick a subset of 4-6 variables you find interesting);
- predictors/independent variables in that subset w/ each other;
- predictors/independent variables in that subset w/ the response/dependent variable.
For each variable (combination), do some exploration and write a short interpretation. Make sure your analysis is parsimonious, i.e. don’t dump all descriptive statistics you know into the code. For example, for something numeric, don’t just mindlessly compute the mean and the median and the standard deviation and the interquartile range: pick the one measure of central tendency and the corresponding measure of dispersion that is supported by the data; same with the plots.
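As a hypothetical illustration of what ‘parsimonious’ means here (the vector x is made up for the example): check the distribution first, then report only the measures it supports.
x <- c(10, 12, 12, 13, 14, 15, 95) # a made-up, heavily skewed variable
hist(x)                            # the distribution is clearly skewed, so ...
median(x); IQR(x)                  # ... report median & IQR ...
# ... and do not also report mean & sd 'just in case'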
R version 4.5.1 Patched (2025-06-20 r88332 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] STGmisc_1.01 Rcpp_1.1.0 magrittr_2.0.3
loaded via a namespace (and not attached):
[1] digest_0.6.37 fastmap_1.2.0 xfun_0.52 knitr_1.50
[5] htmltools_0.5.8.1 rmarkdown_2.29 cli_3.6.5 rstudioapi_0.17.1
[9] tools_4.5.1 evaluate_1.0.4 yaml_2.3.10 rlang_1.1.6
[13] jsonlite_2.0.0 htmlwidgets_1.6.4 MASS_7.3-65