Intro to descriptive statistics

Author

Affiliations: UC Santa Barbara & JLU Giessen

Published: 01 Aug 2025

1 Data frames & other data structures

Data frames are the most important data structure we will deal with directly. They correspond to ‘rectangular’ spreadsheets, i.e. tables where

  • all rows have the same number of columns;
  • all columns have the same number of rows.

Here is an example of such a data frame (as R displays it):
  PARTOFSPEECH  CLASS TOKENFREQS TYPEFREQS
1          ADJ   open        421       271
2          ADV   open        337       103
3            N   open       1411       735
4         CONJ closed        458        18
5         PREP closed        455        37

We clear memory and load a very useful package:

rm(list=ls(all=TRUE))
library(magrittr)

1.1 Creating data frames

Creating data frames within R, which we will hardly ever do, can be done in two ways:

  1. we can first create each column that is supposed to go into the data frame and then we create a data frame by putting all those columns together into one structure;
  2. we create both the columns and the data frame together at the same time.

Here’s an example of approach 1 (a sketch of approach 2 follows further below). Creating a data structure such as a column involves what is called an assignment, which consists of

  • something that is created (PARTOFSPEECH in line 1);
  • the assignment operator (<-);
  • what to put into the thing that is created; in line 1 here that is a data structure called a vector, of which we will use three types:
    • character vectors (as in lines 1-2);
    • numeric vectors (as in lines 3-4);
    • logical vectors (see Section 1.3.2 below).

All vectors can be created with the function c (for ‘combine’):
PARTOFSPEECH <- c("ADJ", "ADV", "N", "CONJ", "PREP")
CLASS <- c("open", "open", "open", "closed", "closed")
TOKENFREQS <- c(421, 337, 1411, 458, 455)
TYPEFREQS <- c(271, 103, 735, 18, 37)

From those 4 vectors, we can then put together a data frame with the function data.frame. Note:

  • if we do an assignment within parentheses, the assigned object is printed;
  • we use an additional argument called stringsAsFactors=TRUE, which converts the character vectors in the data frame into another kind of data structure that is called a factor, which has some advantages for some statistical methods (and some disadvantages for data wrangling …):
(qwe <- data.frame(
   PARTOFSPEECH, CLASS, TOKENFREQS, TYPEFREQS,
   stringsAsFactors=TRUE))
  PARTOFSPEECH  CLASS TOKENFREQS TYPEFREQS
1          ADJ   open        421       271
2          ADV   open        337       103
3            N   open       1411       735
4         CONJ closed        458        18
5         PREP closed        455        37
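
For completeness, here is a minimal sketch of approach 2 mentioned above (not spelled out elsewhere in this section): the columns are created inside the call to data.frame itself, which yields the same data frame as before:

qwe <- data.frame( # create the columns and the data frame in one go
   PARTOFSPEECH=c("ADJ", "ADV", "N", "CONJ", "PREP"),
   CLASS=c("open", "open", "open", "closed", "closed"),
   TOKENFREQS=c(421, 337, 1411, 458, 455),
   TYPEFREQS=c(271, 103, 735, 18, 37),
   stringsAsFactors=TRUE)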

If the data frame you want to create consists of all combinations of several variables, you can use expand.grid, something we will likely use next week:

# data frame of all combinations of all vectors
(qwe <- expand.grid(
   PARTOFSPEECH=c("n", "v"),
   TOKENFREQS=1:3))
  PARTOFSPEECH TOKENFREQS
1            n          1
2            v          1
3            n          2
4            v          2
5            n          3
6            v          3

1.2 Loading data frames

Nearly always, we will load data frames from files created in some spreadsheet software such as LibreOffice Calc. For example, here is a data frame that was created in a spreadsheet program and then saved from it as a .csv (comma-separated values) file, but actually using a tab stop between columns. Such files can be opened with any text editor, any spreadsheet software, or any statistics software. The most versatile way to load a data frame is the function read.table, which can take a lot of arguments:

  • file: a character vector with the path to a file (either one you typed like “_input/reactiontimes.csv” or an interactively chosen one with file.choose());
  • header: a logical vector that states whether the data frame’s first row contains the column headings; this should nearly always be TRUE;
  • sep: a character vector that says what’s between columns; this should nearly always be "\t";
  • stringsAsFactors: a logical vector that states whether columns with character vectors should be made factors; this should nearly always be TRUE;
  • quote: a character vector that tells R what characters indicate quotes; this should nearly always be "";
  • comment.char: a character vector that tells R what characters indicate comments; this should nearly always be "".

Here you see it in practice:

rm(list=ls(all=TRUE)) # clear memory
rts <- read.table(        # make rts the result of loading
   "_input/reactiontimes.csv", # data from this file path
   header=TRUE,           # the data frame's first row are the column headings
   sep="\t",              # the columns are separated by tab stops
   stringsAsFactors=TRUE, # make the columns that contain strings factors
   quote="",              # don't interpret any quote characters as special characters
   comment.char="")       # don't interpret pound signs as preceding comments
# note: using a German operating system, you may have to add the argument dec="."
head(rts) # just show the first six rows
       CASE       RT FREQ  FAM  IMA MEAN
1    almond 650.9947    2 <NA> <NA>   NA
2       ant 589.4347    7  med   hi  415
3     apple 523.0493   10   hi   hi  451
4   apricot 642.3342    2   lo   lo   NA
5 asparagus 696.2092    2  med   lo  442
6   avocado 621.6382   12 <NA> <NA>   NA

Here’s what the columns are/mean. Note: a cell containing NA or <NA> means ‘there is no data available for that cell’.

My recommendation: always keep your data frames ‘simple’. That means, when you prepare your data frames for statistical analysis in some spreadsheet software,

  • you should always use the case-by-variable / long format:
    • one row for each measurement of the dependent variable;
    • one column for each variable (independent, dependent, confound, moderators, …);
  • there should be no empty cells (use NA for those);
  • use all caps for the column headers and all lowercase letters for factor levels;
  • avoid spaces and maybe even periods in variable names and levels.

If your data frame looks like that, you can make reading it in much simpler using read.delim:

head(rts <- read.delim("_input/reactiontimes.csv", stringsAsFactors=TRUE))
       CASE       RT FREQ  FAM  IMA MEAN
1    almond 650.9947    2 <NA> <NA>   NA
2       ant 589.4347    7  med   hi  415
3     apple 523.0493   10   hi   hi  451
4   apricot 642.3342    2   lo   lo   NA
5 asparagus 696.2092    2  med   lo  442
6   avocado 621.6382   12 <NA> <NA>   NA

Here are two functions that tell you a lot about a data frame:

summary(rts) # get a summary of each column of the data frame
        CASE          RT             FREQ          FAM       IMA    
 almond   : 1   Min.   :523.0   Min.   :  1.00   hi  :12   hi  :30  
 ant      : 1   1st Qu.:589.4   1st Qu.:  2.00   lo  :12   lo  :23  
 apple    : 1   Median :617.7   Median :  4.00   med :31   NA's:24  
 apricot  : 1   Mean   :631.9   Mean   :  9.26   NA's:22            
 asparagus: 1   3rd Qu.:663.6   3rd Qu.: 10.00                      
 avocado  : 1   Max.   :794.5   Max.   :118.00                      
 (Other)  :71                                                       
      MEAN      
 Min.   :315.0  
 1st Qu.:409.8  
 Median :437.5  
 Mean   :436.0  
 3rd Qu.:466.2  
 Max.   :553.0  
 NA's   :29     
str(rts) # get an overview of the data frame
'data.frame':   77 obs. of  6 variables:
 $ CASE: Factor w/ 77 levels "almond","ant",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ RT  : num  651 589 523 642 696 ...
 $ FREQ: int  2 7 10 2 2 12 5 19 4 12 ...
 $ FAM : Factor w/ 3 levels "hi","lo","med": NA 3 1 2 3 NA 1 3 3 1 ...
 $ IMA : Factor w/ 2 levels "hi","lo": NA 1 1 2 2 NA 1 2 1 1 ...
 $ MEAN: int  NA 415 451 NA 442 NA 449 412 412 441 ...

And this is how you get row and column numbers and names (if defined):

nrow(rts)     # number of rows of the data frame
[1] 77
rownames(rts) # names of the rows of the data frame
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
[61] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74" "75"
[76] "76" "77"
ncol(rts)     # number of columns of the data frame
[1] 6
colnames(rts) # names of the columns of the data frame
[1] "CASE" "RT"   "FREQ" "FAM"  "IMA"  "MEAN"

Some quick exercise(s): Load the data in _input/assignments.csv into a data frame ass and determine how many rows and columns it has:

str(ass <- read.delim("_input/assignments.csv", stringsAsFactors=TRUE))
'data.frame':   300 obs. of  6 variables:
 $ CASE      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ASSIGNMENT: Factor w/ 3 levels "lab_report","oral_exam",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ SEX       : Factor w/ 2 levels "female","male": 1 1 1 2 1 2 2 1 1 1 ...
 $ REGION    : Factor w/ 3 levels "central_europ",..: 1 1 1 1 1 3 1 2 1 1 ...
 $ WORKHOURS : num  20.1 21.3 13.5 19.5 21.8 21.7 15.5 15.9 21.5 13.1 ...
 $ MISTAKES  : int  18 18 13 12 12 11 11 11 11 10 ...
nrow(ass)
[1] 300
ncol(ass)
[1] 6

1.3 Accessing parts of data frames

1.3.1 Picking things yourself

You can look at the top 6 or bottom 6 rows (or other numbers) of a data frame with head and tail:

head(rts)    # the default: 6 rows
       CASE       RT FREQ  FAM  IMA MEAN
1    almond 650.9947    2 <NA> <NA>   NA
2       ant 589.4347    7  med   hi  415
3     apple 523.0493   10   hi   hi  451
4   apricot 642.3342    2   lo   lo   NA
5 asparagus 696.2092    2  med   lo  442
6   avocado 621.6382   12 <NA> <NA>   NA
head(rts, 3) # only 3 rows
    CASE       RT FREQ  FAM  IMA MEAN
1 almond 650.9947    2 <NA> <NA>   NA
2    ant 589.4347    7  med   hi  415
3  apple 523.0493   10   hi   hi  451
tail(rts)    # the default: 6 rows
         CASE       RT FREQ  FAM  IMA MEAN
72     tomato 611.0240    5   hi   hi  489
73   tortoise 733.0323    4   lo   lo  403
74     walnut 663.5908   12  med   lo  468
75       wasp 725.7056    3 <NA> <NA>   NA
76      whale 609.9745    1  med   hi  474
77 woodpecker 686.3439    2 <NA> <NA>   NA
tail(rts, 3) # only 3 rows
         CASE       RT FREQ  FAM  IMA MEAN
75       wasp 725.7056    3 <NA> <NA>   NA
76      whale 609.9745    1  med   hi  474
77 woodpecker 686.3439    2 <NA> <NA>   NA

You can access specific cells using a coordinate system of the type dataframe[rows, columns]:

rts[1,2]            # only row(s) 1  , only column(s) 2
[1] 650.9947
rts[1,2:3]          # only row(s) 1  , only column(s) 2-3
        RT FREQ
1 650.9947    2
rts[1:2,c(1, 2, 4)] # only row(s) 1-2, only column(s) 1-2, 4
    CASE       RT  FAM
1 almond 650.9947 <NA>
2    ant 589.4347  med

If you specify no columns or no rows, you get all of them:

rts[2,] # only row(s) 2   , only column(s): all
  CASE       RT FREQ FAM IMA MEAN
2  ant 589.4347    7 med  hi  415
rts[,2] # only row(s): all, only column(s): 2
 [1] 650.9947 589.4347 523.0493 642.3342 696.2092 621.6382 601.8914 589.6712
 [9] 661.5156 567.0827 751.0928 638.0350 597.6108 672.3081 635.4823 636.0095
[17] 620.8766 567.6625 614.0209 588.0462 667.2140 699.0151 617.7367 583.6823
[25] 618.1155 581.6936 596.3469 616.3894 655.2160 614.8879 592.9443 776.6582
[33] 559.7334 576.6969 568.4144 660.0190 773.9089 624.4767 600.9778 682.1382
[41] 549.3614 634.6535 588.3237 745.4629 596.4692 655.4345 593.6087 620.0387
[49] 604.3841 602.2479 633.9500 578.9176 544.8413 637.3747 745.1318 571.8937
[57] 579.4982 559.0134 636.7750 670.7467 678.3105 754.3662 585.1791 556.1345
[65] 609.8235 562.3294 672.0730 711.7317 794.4522 612.7960 606.9915 611.0240
[73] 733.0323 663.5908 725.7056 609.9745 686.3439

Typically, though, we access columns by their names:

rts[,"CASE"] # only row(s): all, only column(s): "CASE"
 [1] almond     ant        apple      apricot    asparagus  avocado   
 [7] banana     bat        beaver     bee        beetroot   blackberry
[13] broccoli   bunny      butterfly  camel      carrot     cat       
[19] cherry     chicken    clove      crocodile  cucumber   dog       
[25] dolphin    donkey     eagle      eggplant   elephant   fox       
[31] frog       gherkin    goat       goose      grape      gull      
[37] hedgehog   horse      kiwi       leek       lemon      lettuce   
[43] lion       magpie     melon      mole       monkey     moose     
[49] mouse      mushroom   mustard    olive      orange     owl       
[55] paprika    peanut     pear       pig        pineapple  potato    
[61] radish     reindeer   shark      sheep      snake      spider    
[67] squid      squirrel   stork      strawberry swan       tomato    
[73] tortoise   walnut     wasp       whale      woodpecker
77 Levels: almond ant apple apricot asparagus avocado banana bat beaver ... woodpecker
rts$CASE     # only row(s): all, only column(s): "CASE"
 [1] almond     ant        apple      apricot    asparagus  avocado   
 [7] banana     bat        beaver     bee        beetroot   blackberry
[13] broccoli   bunny      butterfly  camel      carrot     cat       
[19] cherry     chicken    clove      crocodile  cucumber   dog       
[25] dolphin    donkey     eagle      eggplant   elephant   fox       
[31] frog       gherkin    goat       goose      grape      gull      
[37] hedgehog   horse      kiwi       leek       lemon      lettuce   
[43] lion       magpie     melon      mole       monkey     moose     
[49] mouse      mushroom   mustard    olive      orange     owl       
[55] paprika    peanut     pear       pig        pineapple  potato    
[61] radish     reindeer   shark      sheep      snake      spider    
[67] squid      squirrel   stork      strawberry swan       tomato    
[73] tortoise   walnut     wasp       whale      woodpecker
77 Levels: almond ant apple apricot asparagus avocado banana bat beaver ... woodpecker

1.3.2 Having R pick things

Often, we want R to do the legwork and find things for us. We can do that in two ways, both of which involve what are called logical expressions, i.e. expressions that return logical vectors, which contain only TRUEs and FALSEs. Here’s way 1: we first use a logical expression to create a logical vector (note the ==, two equals signs) …

tail(rts, 3) # check the bottom
         CASE       RT FREQ  FAM  IMA MEAN
75       wasp 725.7056    3 <NA> <NA>   NA
76      whale 609.9745    1  med   hi  474
77 woodpecker 686.3439    2 <NA> <NA>   NA
rts$FREQ==3  # checking for each FREQ whether it is 3
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE  TRUE FALSE FALSE
(freqof3 <- rts$FREQ==3) # here, that gets assigned to freqof3
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE  TRUE FALSE FALSE

… and then use that for what is called subsetting:

rts[rts$FREQ==3,] # subsetting with that logical vector
        CASE       RT FREQ  FAM  IMA MEAN
15 butterfly 635.4823    3  med   hi  424
50  mushroom 602.2479    3 <NA> <NA>   NA
54       owl 637.3747    3  med   lo  397
55   paprika 745.1318    3 <NA> <NA>   NA
66    spider 562.3294    3  med   lo  392
75      wasp 725.7056    3 <NA> <NA>   NA
rts[freqof3,] # subsetting with that logical vector
        CASE       RT FREQ  FAM  IMA MEAN
15 butterfly 635.4823    3  med   hi  424
50  mushroom 602.2479    3 <NA> <NA>   NA
54       owl 637.3747    3  med   lo  397
55   paprika 745.1318    3 <NA> <NA>   NA
66    spider 562.3294    3  med   lo  392
75      wasp 725.7056    3 <NA> <NA>   NA

Way 2: we again create a logical vector first, but then also use the function which to get a numeric vector that says where in the logical vector the TRUEs are. Important: for reasons that are not relevant right now, this approach with a numeric vector is the better one (a brief preview of one such reason follows below)!

rts$FREQ==3 # repeated from above
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE  TRUE FALSE FALSE
which(rts$FREQ==3) # and now w/ which
[1] 15 50 54 55 66 75

Thus, we can do this:

(freqof3 <- which(rts$FREQ==3)) # now a numeric vector
[1] 15 50 54 55 66 75
 rts[freqof3,] # subsetting with that numeric vector
        CASE       RT FREQ  FAM  IMA MEAN
15 butterfly 635.4823    3  med   hi  424
50  mushroom 602.2479    3 <NA> <NA>   NA
54       owl 637.3747    3  med   lo  397
55   paprika 745.1318    3 <NA> <NA>   NA
66    spider 562.3294    3  med   lo  392
75      wasp 725.7056    3 <NA> <NA>   NA
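
A brief preview of one such reason (an aside that goes slightly beyond the text here): if the column you test contains NAs, the comparison returns NA for those rows; plain logical subsetting then returns those rows as all-NA rows, whereas which simply drops them. A minimal sketch with the MEAN column (which does contain NAs):

rts[rts$MEAN<400,]        # also returns an all-NA row for every word whose MEAN is NA
rts[which(rts$MEAN<400),] # which() drops the NAs, so only the real matches remain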

We can use the usual comparison operators for this kind of expression: ==, !=, >, >=, <, and <=. The first two also apply to character vectors and factors:

rts[which(rts$FAM=="hi"),]
      CASE       RT FREQ FAM IMA MEAN
3    apple 523.0493   10  hi  hi  451
7   banana 601.8914    5  hi  hi  449
10     bee 567.0827   12  hi  hi  441
18     cat 567.6625   24  hi  hi  487
20 chicken 588.0462   38  hi  hi  462
24     dog 583.6823   76  hi  hi  553
38   horse 624.4767  118  hi  hi  471
42 lettuce 634.6535    1  hi  hi  423
53  orange 544.8413   24  hi  hi  467
57    pear 579.4982    7  hi  lo  494
60  potato 670.7467   16  hi  hi  484
72  tomato 611.0240    5  hi  hi  489

Finally, we can ask for multiple conditions at the same time in two ways. One involves using & for ‘and’ and | for ‘or’ and is usually applied to multiple different variables:

rts[                                    # show rts, but only 
   which(rts$FAM=="hi" & rts$IMA=="hi") # rows meeting these conditions
   ,]                                   # & then all columns
      CASE       RT FREQ FAM IMA MEAN
3    apple 523.0493   10  hi  hi  451
7   banana 601.8914    5  hi  hi  449
10     bee 567.0827   12  hi  hi  441
18     cat 567.6625   24  hi  hi  487
20 chicken 588.0462   38  hi  hi  462
24     dog 583.6823   76  hi  hi  553
38   horse 624.4767  118  hi  hi  471
42 lettuce 634.6535    1  hi  hi  423
53  orange 544.8413   24  hi  hi  467
60  potato 670.7467   16  hi  hi  484
72  tomato 611.0240    5  hi  hi  489
rts[                                  # show rts, but only 
   which(rts$FAM=="hi" | rts$FREQ>20) # rows meeting these conditions
   ,]                                 # & then all columns
      CASE       RT FREQ FAM IMA MEAN
3    apple 523.0493   10  hi  hi  451
7   banana 601.8914    5  hi  hi  449
10     bee 567.0827   12  hi  hi  441
18     cat 567.6625   24  hi  hi  487
20 chicken 588.0462   38  hi  hi  462
24     dog 583.6823   76  hi  hi  553
38   horse 624.4767  118  hi  hi  471
42 lettuce 634.6535    1  hi  hi  423
51 mustard 633.9500   21 med  lo  388
53  orange 544.8413   24  hi  hi  467
57    pear 579.4982    7  hi  lo  494
60  potato 670.7467   16  hi  hi  484
64   sheep 556.1345   24 med  lo  444
65   snake 609.8235   45 med  hi  458
72  tomato 611.0240    5  hi  hi  489

The other way applies only to one variable at a time and involves the operator %in%:

(famnotmed <- which( # which
   rts$FAM %in%      # values of rts$FAM are in
   c("lo", "hi")))   # this vector?
 [1]  3  4  7 10 16 18 20 21 22 24 27 29 38 40 42 45 52 53 57 60 61 69 72 73
rts[famnotmed,]
        CASE       RT FREQ FAM  IMA MEAN
3      apple 523.0493   10  hi   hi  451
4    apricot 642.3342    2  lo   lo   NA
7     banana 601.8914    5  hi   hi  449
10       bee 567.0827   12  hi   hi  441
16     camel 636.0095    2  lo   lo  372
18       cat 567.6625   24  hi   hi  487
20   chicken 588.0462   38  hi   hi  462
21     clove 667.2140    2  lo   lo  315
22 crocodile 699.0151    2  lo   hi  387
24       dog 583.6823   76  hi   hi  553
27     eagle 596.3469    6  lo   hi  457
29  elephant 655.2160    8  lo   hi  440
38     horse 624.4767  118  hi   hi  471
40      leek 682.1382    1  lo   lo   NA
42   lettuce 634.6535    1  hi   hi  423
45     melon 596.4692    2  lo <NA>   NA
52     olive 578.9176    6  lo   lo  351
53    orange 544.8413   24  hi   hi  467
57      pear 579.4982    7  hi   lo  494
60    potato 670.7467   16  hi   hi  484
61    radish 678.3105    9  lo <NA>   NA
69     stork 794.4522    1  lo   lo   NA
72    tomato 611.0240    5  hi   hi  489
73  tortoise 733.0323    4  lo   lo  403
# in this specific example, there is also this shorter option:
(famnotmed <- which( # which
   rts$FAM!="med"))  # values of rts$FAM are NOT "med"
 [1]  3  4  7 10 16 18 20 21 22 24 27 29 38 40 42 45 52 53 57 60 61 69 72 73
rts[famnotmed,]
        CASE       RT FREQ FAM  IMA MEAN
3      apple 523.0493   10  hi   hi  451
4    apricot 642.3342    2  lo   lo   NA
7     banana 601.8914    5  hi   hi  449
10       bee 567.0827   12  hi   hi  441
16     camel 636.0095    2  lo   lo  372
18       cat 567.6625   24  hi   hi  487
20   chicken 588.0462   38  hi   hi  462
21     clove 667.2140    2  lo   lo  315
22 crocodile 699.0151    2  lo   hi  387
24       dog 583.6823   76  hi   hi  553
27     eagle 596.3469    6  lo   hi  457
29  elephant 655.2160    8  lo   hi  440
38     horse 624.4767  118  hi   hi  471
40      leek 682.1382    1  lo   lo   NA
42   lettuce 634.6535    1  hi   hi  423
45     melon 596.4692    2  lo <NA>   NA
52     olive 578.9176    6  lo   lo  351
53    orange 544.8413   24  hi   hi  467
57      pear 579.4982    7  hi   lo  494
60    potato 670.7467   16  hi   hi  484
61    radish 678.3105    9  lo <NA>   NA
69     stork 794.4522    1  lo   lo   NA
72    tomato 611.0240    5  hi   hi  489
73  tortoise 733.0323    4  lo   lo  403

Some quick exercise(s): Which words have an RT of less than 560?

rts[rts$RT<560,]     # TMI
     CASE       RT FREQ FAM IMA MEAN
3   apple 523.0493   10  hi  hi  451
33   goat 559.7334    7 med  lo  402
41  lemon 549.3614   19 med  hi  470
53 orange 544.8413   24  hi  hi  467
58    pig 559.0134    9 med  hi   NA
64  sheep 556.1345   24 med  lo  444
rts[rts$RT<560,1]    # better
[1] apple  goat   lemon  orange pig    sheep 
77 Levels: almond ant apple apricot asparagus avocado banana bat beaver ... woodpecker
rts$CASE[rts$RT<560] # better
[1] apple  goat   lemon  orange pig    sheep 
77 Levels: almond ant apple apricot asparagus avocado banana bat beaver ... woodpecker

Use the function sort and return the sorted reaction times for when both FAM is lo and IMA is hi:

sort(rts$RT[rts$FAM=="lo" & rts$IMA=="hi"])
[1] 596.3469 655.2160 699.0151

Use the function mean to compute the mean reaction time for all words with a frequency in the interval [5, 10]:

mean(rts$RT[rts$FREQ>=5 & rts$FREQ<=10]) # or
[1] 598.8744
mean(rts$RT[rts$FREQ %in% 5:10])
[1] 598.8744

1.4 Changing values in data frames

1.4.1 Adding columns

You can add a new column to a data frame just by specifying that new column. Here’s an example where we add a logical vector based on information that’s already in the data frame:

rts$FREQHIGH <- rts$FREQ>mean(rts$FREQ) # mean is 9.25974
head(rts)
       CASE       RT FREQ  FAM  IMA MEAN FREQHIGH
1    almond 650.9947    2 <NA> <NA>   NA    FALSE
2       ant 589.4347    7  med   hi  415    FALSE
3     apple 523.0493   10   hi   hi  451     TRUE
4   apricot 642.3342    2   lo   lo   NA    FALSE
5 asparagus 696.2092    2  med   lo  442    FALSE
6   avocado 621.6382   12 <NA> <NA>   NA     TRUE

But you can add any kind of information. Here, we’re adding as many random numbers between -10 and +10 to the data frame as rts has rows:

rts$SOMECRAP <- runif(nrow(rts), -10, 10)
head(rts)
       CASE       RT FREQ  FAM  IMA MEAN FREQHIGH   SOMECRAP
1    almond 650.9947    2 <NA> <NA>   NA    FALSE -0.5993406
2       ant 589.4347    7  med   hi  415    FALSE -0.4795214
3     apple 523.0493   10   hi   hi  451     TRUE -6.7335936
4   apricot 642.3342    2   lo   lo   NA    FALSE -9.2649223
5 asparagus 696.2092    2  med   lo  442    FALSE -2.8901470
6   avocado 621.6382   12 <NA> <NA>   NA     TRUE  2.1543869

But let’s get rid of these two columns again with negative subsetting:

rts <- rts[,-(7:8)]
head(rts)
       CASE       RT FREQ  FAM  IMA MEAN
1    almond 650.9947    2 <NA> <NA>   NA
2       ant 589.4347    7  med   hi  415
3     apple 523.0493   10   hi   hi  451
4   apricot 642.3342    2   lo   lo   NA
5 asparagus 696.2092    2  med   lo  442
6   avocado 621.6382   12 <NA> <NA>   NA

And here we add a factor to the data frame that bins the values of MEAN not just into two groups such as hi vs. lo, but into several user-defined groups:

rts$MEANBIN <- cut(rts$MEAN, # make rts$MEANBIN the cut-up version of # rts$MEAN
   breaks=c(300, 350, 400, 450, 600), # cut up into 4 groups with these 5 boundaries ...
   # note: boundary 1 is < the min of MEAN, boundary 5 is > the max of MEAN
   labels=c("lo", "med", "hi", "superhi")) # ... and with these names
rts[15:25,]
        CASE       RT FREQ  FAM  IMA MEAN MEANBIN
15 butterfly 635.4823    3  med   hi  424      hi
16     camel 636.0095    2   lo   lo  372     med
17    carrot 620.8766    2  med   lo  425      hi
18       cat 567.6625   24   hi   hi  487 superhi
19    cherry 614.0209    7  med   lo  497 superhi
20   chicken 588.0462   38   hi   hi  462 superhi
21     clove 667.2140    2   lo   lo  315      lo
22 crocodile 699.0151    2   lo   hi  387     med
23  cucumber 617.7367    1  med   hi  428      hi
24       dog 583.6823   76   hi   hi  553 superhi
25   dolphin 618.1155    2 <NA> <NA>   NA    <NA>

If you don’t provide labels for the factor levels, the levels are named using mathematical interval notation (check out the Wikipedia entry on interval notation; see also the small illustration after this list):

  • () = ‘excludes the boundary’;
  • [] = ‘includes the boundary’.
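
A quick illustration (not part of the original code): without the labels argument, cut builds the level names from the boundaries in exactly this interval notation:

levels(cut(rts$MEAN, breaks=c(300, 350, 400, 450, 600)))
# should return "(300,350]" "(350,400]" "(400,450]" "(450,600]"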

1.4.2 Changing cell values

Changing cell values is easy/obvious for numeric variables (numeric vectors) (and also for logical vectors):

head(rts, 3)
    CASE       RT FREQ  FAM  IMA MEAN MEANBIN
1 almond 650.9947    2 <NA> <NA>   NA    <NA>
2    ant 589.4347    7  med   hi  415      hi
3  apple 523.0493   10   hi   hi  451 superhi
rts$FREQ[1] <- 1000
head(rts, 3)
    CASE       RT FREQ  FAM  IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA>   NA    <NA>
2    ant 589.4347    7  med   hi  415      hi
3  apple 523.0493   10   hi   hi  451 superhi

It’s also easy/obvious for categorical variables (character vectors or factors) when you change something to an existing level:

head(rts, 3)
    CASE       RT FREQ  FAM  IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA>   NA    <NA>
2    ant 589.4347    7  med   hi  415      hi
3  apple 523.0493   10   hi   hi  451 superhi
rts$FAM[2] <- "hi"
head(rts, 3)
    CASE       RT FREQ  FAM  IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA>   NA    <NA>
2    ant 589.4347    7   hi   hi  415      hi
3  apple 523.0493   10   hi   hi  451 superhi

Where it’s less easy/obvious is for categorical variables that are factors when you change something to a new level:

summary(rts)
        CASE          RT             FREQ           FAM       IMA    
 almond   : 1   Min.   :523.0   Min.   :   1.00   hi  :13   hi  :30  
 ant      : 1   1st Qu.:589.4   1st Qu.:   2.00   lo  :12   lo  :23  
 apple    : 1   Median :617.7   Median :   4.00   med :30   NA's:24  
 apricot  : 1   Mean   :631.9   Mean   :  22.22   NA's:22            
 asparagus: 1   3rd Qu.:663.6   3rd Qu.:  10.00                      
 avocado  : 1   Max.   :794.5   Max.   :1000.00                      
 (Other)  :71                                                        
      MEAN          MEANBIN  
 Min.   :315.0   lo     : 1  
 1st Qu.:409.8   med    : 8  
 Median :437.5   hi     :21  
 Mean   :436.0   superhi:18  
 3rd Qu.:466.2   NA's   :29  
 Max.   :553.0               
 NA's   :29                  
rts$FAM[2] <- "superlo"
head(rts, 3)
    CASE       RT FREQ  FAM  IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA>   NA    <NA>
2    ant 589.4347    7 <NA>   hi  415      hi
3  apple 523.0493   10   hi   hi  451 superhi
summary(rts)
        CASE          RT             FREQ           FAM       IMA    
 almond   : 1   Min.   :523.0   Min.   :   1.00   hi  :12   hi  :30  
 ant      : 1   1st Qu.:589.4   1st Qu.:   2.00   lo  :12   lo  :23  
 apple    : 1   Median :617.7   Median :   4.00   med :30   NA's:24  
 apricot  : 1   Mean   :631.9   Mean   :  22.22   NA's:23            
 asparagus: 1   3rd Qu.:663.6   3rd Qu.:  10.00                      
 avocado  : 1   Max.   :794.5   Max.   :1000.00                      
 (Other)  :71                                                        
      MEAN          MEANBIN  
 Min.   :315.0   lo     : 1  
 1st Qu.:409.8   med    : 8  
 Median :437.5   hi     :21  
 Mean   :436.0   superhi:18  
 3rd Qu.:466.2   NA's   :29  
 Max.   :553.0               
 NA's   :29                  

The solution involves two steps:

  1. We let R know that that factor should allow that new level, too:
levels(rts$FAM) <- c( # make the levels of rts$FAM be
   levels(rts$FAM),   # the old levels of rts$FAM,
   "superlo")         # but now also "superlo"
summary(rts) # check
        CASE          RT             FREQ              FAM       IMA    
 almond   : 1   Min.   :523.0   Min.   :   1.00   hi     :12   hi  :30  
 ant      : 1   1st Qu.:589.4   1st Qu.:   2.00   lo     :12   lo  :23  
 apple    : 1   Median :617.7   Median :   4.00   med    :30   NA's:24  
 apricot  : 1   Mean   :631.9   Mean   :  22.22   superlo: 0            
 asparagus: 1   3rd Qu.:663.6   3rd Qu.:  10.00   NA's   :23            
 avocado  : 1   Max.   :794.5   Max.   :1000.00                         
 (Other)  :71                                                           
      MEAN          MEANBIN  
 Min.   :315.0   lo     : 1  
 1st Qu.:409.8   med    : 8  
 Median :437.5   hi     :21  
 Mean   :436.0   superhi:18  
 3rd Qu.:466.2   NA's   :29  
 Max.   :553.0               
 NA's   :29                  
  2. Now we can assign the new level:
rts$FAM[2] <- "superlo"
head(rts, 3)
    CASE       RT FREQ     FAM  IMA MEAN MEANBIN
1 almond 650.9947 1000    <NA> <NA>   NA    <NA>
2    ant 589.4347    7 superlo   hi  415      hi
3  apple 523.0493   10      hi   hi  451 superhi
summary(rts)
        CASE          RT             FREQ              FAM       IMA    
 almond   : 1   Min.   :523.0   Min.   :   1.00   hi     :12   hi  :30  
 ant      : 1   1st Qu.:589.4   1st Qu.:   2.00   lo     :12   lo  :23  
 apple    : 1   Median :617.7   Median :   4.00   med    :30   NA's:24  
 apricot  : 1   Mean   :631.9   Mean   :  22.22   superlo: 1            
 asparagus: 1   3rd Qu.:663.6   3rd Qu.:  10.00   NA's   :22            
 avocado  : 1   Max.   :794.5   Max.   :1000.00                         
 (Other)  :71                                                           
      MEAN          MEANBIN  
 Min.   :315.0   lo     : 1  
 1st Qu.:409.8   med    : 8  
 Median :437.5   hi     :21  
 Mean   :436.0   superhi:18  
 3rd Qu.:466.2   NA's   :29  
 Max.   :553.0               
 NA's   :29                  

But note what can happen (e.g., right after we had introduced the new level, or now, when we change things back): a level can remain in the factor even when it no longer occurs:

rts$FAM[2] <- "med"
head(rts, 3)
    CASE       RT FREQ  FAM  IMA MEAN MEANBIN
1 almond 650.9947 1000 <NA> <NA>   NA    <NA>
2    ant 589.4347    7  med   hi  415      hi
3  apple 523.0493   10   hi   hi  451 superhi
summary(rts)
        CASE          RT             FREQ              FAM       IMA    
 almond   : 1   Min.   :523.0   Min.   :   1.00   hi     :12   hi  :30  
 ant      : 1   1st Qu.:589.4   1st Qu.:   2.00   lo     :12   lo  :23  
 apple    : 1   Median :617.7   Median :   4.00   med    :31   NA's:24  
 apricot  : 1   Mean   :631.9   Mean   :  22.22   superlo: 0            
 asparagus: 1   3rd Qu.:663.6   3rd Qu.:  10.00   NA's   :22            
 avocado  : 1   Max.   :794.5   Max.   :1000.00                         
 (Other)  :71                                                           
      MEAN          MEANBIN  
 Min.   :315.0   lo     : 1  
 1st Qu.:409.8   med    : 8  
 Median :437.5   hi     :21  
 Mean   :436.0   superhi:18  
 3rd Qu.:466.2   NA's   :29  
 Max.   :553.0               
 NA's   :29                  

To avoid levels with no occurrences, it’s safest to always apply droplevels to a data frame before you start your statistical exploration:

rts <- droplevels(rts)
summary(rts)
        CASE          RT             FREQ           FAM       IMA    
 almond   : 1   Min.   :523.0   Min.   :   1.00   hi  :12   hi  :30  
 ant      : 1   1st Qu.:589.4   1st Qu.:   2.00   lo  :12   lo  :23  
 apple    : 1   Median :617.7   Median :   4.00   med :31   NA's:24  
 apricot  : 1   Mean   :631.9   Mean   :  22.22   NA's:22            
 asparagus: 1   3rd Qu.:663.6   3rd Qu.:  10.00                      
 avocado  : 1   Max.   :794.5   Max.   :1000.00                      
 (Other)  :71                                                        
      MEAN          MEANBIN  
 Min.   :315.0   lo     : 1  
 1st Qu.:409.8   med    : 8  
 Median :437.5   hi     :21  
 Mean   :436.0   superhi:18  
 3rd Qu.:466.2   NA's   :29  
 Max.   :553.0               
 NA's   :29                  
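
An alternative sketch (not the approach used above): temporarily convert the factor to a character vector, make the change, and turn the result back into a factor; factor() then recomputes the set of levels from whatever values are present:

rts$FAM <- as.character(rts$FAM) # factor -> character vector
rts$FAM[2] <- "superlo"          # now any value can be assigned
rts$FAM <- factor(rts$FAM)       # back to a factor; the levels are recomputed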

Some quick exercise(s): Create a data frame qwe that contains only the rows of rts where MEAN < 400.

(qwe <- rts[which(rts$MEAN<400),])
        CASE       RT FREQ FAM IMA MEAN MEANBIN
16     camel 636.0095    2  lo  lo  372     med
21     clove 667.2140    2  lo  lo  315      lo
22 crocodile 699.0151    2  lo  hi  387     med
46      mole 655.4345    5 med  lo  372     med
51   mustard 633.9500   21 med  lo  388     med
52     olive 578.9176    6  lo  lo  351     med
54       owl 637.3747    3 med  lo  397     med
66    spider 562.3294    3 med  lo  392     med
68  squirrel 711.7317    2 med  hi  397     med

What is the highest FREQ value when the RT is <600?

qwe <- rts[which(rts$RT<600),]; max(qwe$FREQ)
[1] 76

How many FAM values are med when IMA is lo?

famISmed <- which(rts$FAM=="med")
imaISlo  <- which(rts$IMA=="lo")
both <- famISmed[famISmed %in% imaISlo]
nrow(rts[both,])
[1] 15
# nrow(rts[intersect(famISmed, imaISlo),])
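
A shorter alternative (a sketch, not part of the original solution) exploits the fact that TRUEs count as 1 when summed; na.rm=TRUE discards the NAs that the comparisons return for missing values:

sum(rts$FAM=="med" & rts$IMA=="lo", na.rm=TRUE) # should also return 15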

2 Statistical exploration: 1 variable at a time

Let’s load a different data set; here’s what the columns are/mean:

rm(list=ls(all=TRUE)) # clear memory
clo <- read.delim("_input/clauseorders.csv", stringsAsFactors=TRUE)
head(clo)    # look at the first 6 rows of the data frame
  CASE ORDER SUBORDTYPE     CONJ LENMC LENSC LENDIFF MORETHAN2CL
1 4777 sc-mc       temp als/when     4    10      -6          no
2 1698 mc-sc       temp als/when     7     6       1          no
3  953 sc-mc       temp als/when    12     7       5         yes
4 1681 mc-sc       temp als/when     6    15      -9          no
5 4055 sc-mc       temp als/when     9     5       4         yes
6  967 sc-mc       temp als/when     9     5       4         yes
summary(clo) # look at the summary of each column of the data frame
      CASE        ORDER     SUBORDTYPE            CONJ         LENMC       
 Min.   :   2   mc-sc:275   caus:199   als/when     : 93   Min.   : 2.000  
 1st Qu.:1655   sc-mc:128   temp:204   bevor/before : 46   1st Qu.: 6.000  
 Median :2756                          nachdem/after: 65   Median : 8.000  
 Mean   :2764                          weil/because :199   Mean   : 9.266  
 3rd Qu.:4106                                              3rd Qu.:11.000  
 Max.   :5053                                              Max.   :31.000  
     LENSC           LENDIFF          MORETHAN2CL
 Min.   : 2.000   Min.   :-32.00000   no :185    
 1st Qu.: 5.000   1st Qu.: -4.00000   yes:218    
 Median : 8.000   Median :  0.00000              
 Mean   : 9.362   Mean   : -0.09677              
 3rd Qu.:12.000   3rd Qu.:  3.00000              
 Max.   :36.000   Max.   : 25.00000              

When it comes to deciding which statistical method to use, even just for descriptive statistics, the most important first decision is always "what is the dependent/response variable and what is its information level?" Then you can use the graph in Figure 1 together with this document to guide you.

Figure 1: Overview of descriptive statistics

The following sections cover this ‘flow chart’.

2.1 Nominal/categorical data

For categorical data, it is usually best to begin with the descriptive numeric statistics and then visualize those with plots.

2.1.1 Descriptive stats

We can count levels and provide their proportions/percentages with table and prop.table(table(...)):

table(clo$CONJ) # frequency table

     als/when  bevor/before nachdem/after  weil/because 
           93            46            65           199 
prop.table(table(clo$CONJ)) # proportion table, = table(clo$CONJ)/length(clo$CONJ)

     als/when  bevor/before nachdem/after  weil/because 
    0.2307692     0.1141439     0.1612903     0.4937965 

You should only provide a useful number of decimals; for proportions based on frequencies in the range of hundreds, like here, 3-4 decimals are probably enough:

round(prop.table(table(clo$CONJ)), 3)

     als/when  bevor/before nachdem/after  weil/because 
        0.231         0.114         0.161         0.494 

The measure of central tendency for categorical data is the mode, the most frequent level:

table(clo$CONJ)[which(table(clo$CONJ)==max(table(clo$CONJ)))]
weil/because 
         199 

Can you imagine why I don’t just write this?

max(table(clo$CONJ))
[1] 199
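
If one only wants the name of the mode (an alternative sketch, not the author's code above), which.max returns the position of the (first) maximum together with its name:

names(which.max(table(clo$CONJ))) # should return "weil/because" (only the first level in case of ties)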

One measure of dispersion for categorical data is normalized entropy. No function for it is available in base R, which is why we write one ourselves:

norm.entropy <- function (some.vector) {    # some.vector: a vector/table of frequencies
   percs <- some.vector/sum(some.vector)    # convert the frequencies into proportions
   sum(percs * -log2(percs), na.rm=TRUE) /  # Shannon entropy (na.rm=TRUE handles proportions of 0) ...
   log2(length(some.vector))                # ... divided by the maximum possible entropy
}

Now we can use this function:

norm.entropy(table(clo$CONJ)) # very high uncertainty
[1] 0.886415

This is how normalized entropy behaves with extreme distributions:

norm.entropy(c(0, 0, 99))   # 0 (i.e. no uncertainty)
[1] 0
norm.entropy(c(33, 33, 33)) # 1 (i.e. max uncertainty)
[1] 1

2.1.2 Visualization

The best visualization really would be dot charts:

dotchart(          # draw a dot chart
   table(clo$CONJ),  # of the frequencies of the levels of CONJ
   xlim=c(0, 403), # with these x-axis limits (min, max)
   pch=16)         # using filled circles
abline(            # draw a straight line
   v=403/4,        # vertical, at x=403/4 (even distribution)
   lty=3)          # make it dotted
Figure 2: A dot chart of the frequencies of the levels of CONJ

The function dotchart is a bit annoying because it plots the entries of the table in a somewhat unintuitive order: from bottom to top. We might therefore want to reverse the order of the levels:

dotchart(          # draw a dot chart
   table(clo$CONJ)[4:1],  # of the frequencies of the levels of CONJ, now reversed
   xlim=c(0, 403), # with these x-axis limits (min, max)
   pch=16)         # using filled circles
abline(            # draw a straight line
   v=403/4,        # vertical, at x=403/4 (even distribution)
   lty=3)          # make it dotted
Figure 3: A dot chart of the frequencies of the levels of CONJ

Bar charts are a slightly less good alternative:

qwe <- barplot(    # create a bar chart (& save the mids of the bars into qwe)
   table(clo$ORDER), # of this table
   ylim=c(0, 300))   # with these y-axis limits
text(qwe,              # plot text at these x-coordinates
   table(clo$ORDER)/2, # and these y-coordinates
   table(clo$ORDER))   # namely this text (the frequencies)
Figure 4: A bar chart of the frequencies of the levels of ORDER

Here are several refinements:

qwe <- barplot(
   table(clo$ORDER), main="Frequency of clause orders",
   xlab="Order of clauses",
   ylab="Observed frequency", ylim=c(0, 300))
text(qwe,            # plot text at these x-coordinates
   table(clo$ORDER), # and these y-coordinates
   table(clo$ORDER), # namely this text (the frequencies)
   pos=3)            # but slightly higher than that
abline(                      # draw a straight line
   h=mean(table(clo$ORDER)), # horizontal at the mean freq
   lty=2)                    # make it dashed
text(qwe[2],               # plot text above the 2nd bar
   mean(table(clo$ORDER)), # and at the mean frequency
   "mean frequency",       # these words
   pos=3,                  # but slightly higher than that
   cex=0.75)               # reduce font size by 25%
Figure 5: A bar chart of the frequencies of the levels of ORDER

BTW, it is often useful to be able to save those graphs into files, especially for people who still use unspeakable things like Microsoft Word. This is how you do this for the previous plot:

png("_output/barplot_freqs.of.ORDER.png",
    height=6, width=6, # dimensions of the plot
    units="in",        # in inches
    res=300)           # resolution in dpi, 300 is recommended
# now the plotting
qwe <- barplot(table(clo$ORDER), main="Frequency of clause orders",
   xlab="Order of clauses", ylab="Observed frequency", ylim=c(0, 300))
text(qwe, table(clo$ORDER), table(clo$ORDER), pos=3)
abline(h=mean(table(clo$ORDER)), lty=2)
text(qwe[2], mean(table(clo$ORDER)), "mean frequency", pos=3, cex=0.75)
# now you close the plotting device:
dev.off()
png 
  2 

This can now be cropped and then inserted into whatever software you use.
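
The same logic works for other graphics devices; here is a minimal sketch for a vector format (the file name is made up):

pdf("_output/barplot_freqs.of.ORDER.pdf",
    height=6, width=6) # dimensions in inches; vector formats need no resolution
# ... the same plotting commands as above ...
dev.off()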


Some quick exercise(s): Generate a frequency table, a proportion table, and a dot chart for clo$ORDER, and state the mode and the normalized entropy:

(qwe <- table(clo$ORDER))

mc-sc sc-mc 
  275   128 
prop.table(qwe) # or

    mc-sc     sc-mc 
0.6823821 0.3176179 
qwe %>% prop.table

    mc-sc     sc-mc 
0.6823821 0.3176179 
dotchart(          # draw a dot chart
   qwe,  # of the frequencies of the levels of ORDER
   xlim=c(0, 403), # with these x-axis limits (min, max)
   pch=16)         # using filled circles
abline(            # draw a straight line
   v=403/2,        # vertical, at x=403/2 (even distribution)
   lty=3)          # make it dotted
text(     # plot text
   qwe,   # at these x-coordinates
   1:2,   # at these y-coordinates
   qwe,   # namely the observed frequencies
   pos=3) # plot the text above the points

max(qwe)          # 'compute' the mode
[1] 275
norm.entropy(qwe) # compute the normalized entropy
[1] 0.9017721

2.2 Numeric data

For categorical data, we began with the descriptive numeric statistics and then visualized those with plots. For numeric data, it is better to begin with the visualization because, as you saw in Figure 1 above, for numeric variables the choice of statistic depends on whether the variable is normally/uniformly distributed or not. That’s why, here, we start with visualization and then compute the descriptive statistics.

2.2.1 Visualization

The easiest visualization is a histogram:

clo$LENSC %>% hist(main="") # same as hist(clo$LENSC, main="")
Figure 6: A histogram of LENSC

Since R decides on the bin width, you always want to draw a second histogram in which you change the bin width, e.g. like this:

clo$LENSC %>% hist(breaks="FD", main="") # same as hist(clo$LENSC, breaks="FD", main="")
Figure 7: A histogram of LENSC

Here are some other plot options: a box plot, a strip chart, and, best of all, an ecdf plot (for empirical cumulative distribution function):

par(mfrow=c(1, 3)) # set up a plotting panel with 1 row and 3 columns
boxplot(clo$LENSC, notch=TRUE); grid()         # boxplot
stripchart(clo$LENSC, method="jitter"); grid() # strip chart
plot(ecdf(clo$LENSC), verticals=TRUE); grid()  # ecdf plot
par(mfrow=c(1, 1)) # reset to default
Figure 8: Three plots of LENSC

All of them indicate that the variable is not normally distributed (see discussion in class and the textbook). What about the variable LENDIFF?

par(mfrow=c(2, 2)) # set up a plotting panel with 2 rows and 2 columns
clo$LENDIFF %>% hist(main="")
clo$LENDIFF %>% hist(breaks="FD", main="")
boxplot(clo$LENDIFF, notch=TRUE); grid()         # boxplot
plot(ecdf(clo$LENDIFF), verticals=TRUE); grid()  # ecdf plot
par(mfrow=c(1, 1)) # reset to default
Figure 9: Four plots of LENDIFF

All of these are at least broadly compatible with LENDIFF being normally distributed.
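
Another common visual check (an aside, not among the plots used above) is a normal quantile-quantile plot; points falling close to the reference line are compatible with a normal distribution:

qqnorm(clo$LENDIFF) # sample quantiles against theoretical normal quantiles
qqline(clo$LENDIFF) # add the reference line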

2.2.2 Descriptive stats

For numeric data, the measure of central tendency one provides depends on the shape of the distribution. For non-normal data, we use the median as the measure of central tendency and quantiles/the interquartile range as measures of dispersion:

# central tendency & dispersion for ordinal data
# (or for non-normal data or data with outliers)
median(clo$LENSC)                         # median
[1] 8
quantile(clo$LENSC)                       # quartiles
  0%  25%  50%  75% 100% 
   2    5    8   12   36 
IQR(clo$LENSC)                            # interquartile range
[1] 7
quantile(clo$LENSC, probs=seq(0, 1, 0.1)) # deciles
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
   2    4    5    6    7    8    9   11   13   17   36 

For normal data, we use the mean as the measure of central tendency and the standard deviation/variance as measures of dispersion. Here’s a nice way in which we can represent multiple values in one vector:

setNames(                # create a data structure w/ named elements
   c(mean(clo$LENDIFF), sd(clo$LENDIFF), var(clo$LENDIFF)), # the elements
   c("mean", "std dev", "variance"))                  # the names
       mean     std dev    variance 
-0.09677419  6.85188462 46.94832290 
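
An equivalent, slightly shorter spelling (just an aside): the names can also be given directly inside c:

c(mean=mean(clo$LENDIFF), "std dev"=sd(clo$LENDIFF), variance=var(clo$LENDIFF))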

The standard deviation is computed like this:

sqrt(
   sum((clo$LENDIFF-mean(clo$LENDIFF))^2) /
   (length(clo$LENDIFF)-1))
[1] 6.851885

A nicer way to represent this uses two new things that are extremely useful:

  1. in R, even things that don’t look like functions (like +, -, *, /, ^) are functions, so instead of writing 1+2 one can write "+"(1, 2), and instead of writing 2^3 one can write "^"(2, 3);
  2. instead of writing function(argument), one can write argument %>% function: the pipe (%>%) makes argument the first (!) argument of function:
sqrt(9) # the same as:
[1] 3
9 %>% sqrt
[1] 3

And if function has more than one argument, you can provide them later:

log(100, base=10) # the same as
[1] 2
100 %>% log(base=10) # 100 becomes the 1st argument of log, 10 becomes the base
[1] 2
log(100, 10) # the same as
[1] 2
100 %>% log(10) # 100 becomes the 1st argument of log, 10 becomes the 2nd
[1] 2
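
A small aside (not needed for this course): since version 4.1, base R also has a native pipe |>, which behaves much the same in simple cases but requires the parentheses of the function call:

9 |> sqrt()         # same as sqrt(9)
100 |> log(base=10) # same as log(100, base=10)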

With that, we can compute the standard deviation as follows:

"/"(                                   # divide ...
   (clo$LENDIFF-mean(clo$LENDIFF)) %>% "^"(2) %>% sum,   # this numerator by
   length(clo$LENDIFF)-1               # this denominator, then
   ) %>% sqrt %>% round(3)             # take the square root of that & round
[1] 6.852

2.3 Ordinal data

Ordinal data are sometimes a bit of a tricky issue. One recommendation to deal with them would be this:

  • if the number of ordinal levels you have is high (meaning, maybe ≥6), you might
    • use descriptive statistics for categorical data, but maybe
    • visualizations for numeric data
  • if the number of ordinal levels you have is low (meaning, maybe <5), you probably want to treat them as categorical data.

But it’s often a judgment call, and you may need to try both and see which representation of the data is most instructive (see the sketch below for how an ordinal variable can be declared in R).
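
A minimal sketch (using a made-up toy vector, since the clause-order data contain no ordinal variable) of how an ordinal variable can be declared in R and then summarized either way:

ratings <- factor(            # a hypothetical ordinal variable
   c("lo", "med", "med", "hi", "lo", "med"),
   levels=c("lo", "med", "hi"), ordered=TRUE)
table(ratings)                # treated like categorical data: frequencies
median(as.numeric(ratings))   # or, with many levels, numeric summaries of the ranks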


Some quick exercise(s): The file _input/assignments.csv contains data regarding the question of which course-final assignment was chosen by what kind of student in a large international student exchange program; here’s what the data contain/represent; the response variable is ASSIGNMENT:

summary(ass <- read.delim("_input/assignments.csv", stringsAsFactors=TRUE))
      CASE             ASSIGNMENT      SEX                REGION   
 Min.   :  1.00   lab_report:100   female:145   central_europ: 99  
 1st Qu.: 75.75   oral_exam :100   male  :155   hispanic     :100  
 Median :150.50   thesis    :100                mid_east     :101  
 Mean   :150.50                                                    
 3rd Qu.:225.25                                                    
 Max.   :300.00                                                    
   WORKHOURS        MISTAKES    
 Min.   :12.10   Min.   :10.00  
 1st Qu.:19.77   1st Qu.:14.75  
 Median :29.90   Median :17.00  
 Mean   :27.34   Mean   :17.12  
 3rd Qu.:33.50   3rd Qu.:19.00  
 Max.   :39.90   Max.   :25.00  
head(ass) # look at the first 6 rows of the data frame
  CASE ASSIGNMENT    SEX        REGION WORKHOURS MISTAKES
1    1  oral_exam female central_europ      20.1       18
2    2  oral_exam female central_europ      21.3       18
3    3  oral_exam female central_europ      13.5       13
4    4  oral_exam   male central_europ      19.5       12
5    5  oral_exam female central_europ      21.8       12
6    6  oral_exam   male      mid_east      21.7       11

Explore the variable ass$WORKHOURS:

par(mfrow=c(2,2))
hist(ass$WORKHOURS, main="")
hist(ass$WORKHOURS, main="", breaks="FD")
boxplot(ass$WORKHOURS) # see? bad: doesn't show what's going on well
plot(ecdf(ass$WORKHOURS), verticals=TRUE); grid()  # ecdf plot

par(mfrow=c(1,1))
setNames(
   c(mean(ass$WORKHOURS), sd(ass$WORKHOURS), median(ass$WORKHOURS), IQR(ass$WORKHOURS)),
   c("mean"             , "std dev"        , "median"             , "interq. range"))
         mean       std dev        median interq. range 
    27.337667      8.014261     29.900000     13.725000 

3 Statistical exploration: 2+ variables at a time

Once we have two or more variables to consider at a time, things become a bit more complex – but just a bit: the overall logic will often remain the same.

3.1 Categorical ~ categorical

3.1.1 Descriptive stats

For combinations of categorical variables, descriptive statistics usually just involve cross-tabulation with table again, and now

  • the variable provided as the 1st argument goes into the rows;
  • the variable provided as the 2nd argument goes into the columns:
table(        # tabulate such that the 
   clo$ORDER, # 1st variable goes into the rows
   clo$CONJ)  # 2nd variable goes into the columns
       
        als/when bevor/before nachdem/after weil/because
  mc-sc       37           28            26          184
  sc-mc       56           18            39           15
table(clo$CONJ, clo$ORDER) # 1st variable -> rows, 2nd -> columns
               
                mc-sc sc-mc
  als/when         37    56
  bevor/before     28    18
  nachdem/after    26    39
  weil/because    184    15

Often it’s useful to add row/column sums/totals with addmargins:

addmargins(table(clo$ORDER, clo$CONJ)) # 1st variable -> rows, 2nd -> columns
       
        als/when bevor/before nachdem/after weil/because Sum
  mc-sc       37           28            26          184 275
  sc-mc       56           18            39           15 128
  Sum         93           46            65          199 403
table(clo$CONJ, clo$ORDER) %>% # 1st variable -> rows, 2nd -> columns
   addmargins                  # then add totals
               
                mc-sc sc-mc Sum
  als/when         37    56  93
  bevor/before     28    18  46
  nachdem/after    26    39  65
  weil/because    184    15 199
  Sum             275   128 403

Tables of proportions with prop.table are also possible again, but now you need to decide ‘proportions of what?’:

prop.table(table(clo$ORDER, clo$CONJ)) # proportions of totals
       
          als/when bevor/before nachdem/after weil/because
  mc-sc 0.09181141   0.06947891    0.06451613   0.45657568
  sc-mc 0.13895782   0.04466501    0.09677419   0.03722084
prop.table(table(clo$ORDER, clo$CONJ), 1) # row proportions
       
          als/when bevor/before nachdem/after weil/because
  mc-sc 0.13454545   0.10181818    0.09454545   0.66909091
  sc-mc 0.43750000   0.14062500    0.30468750   0.11718750
prop.table(table(clo$ORDER, clo$CONJ), 2) # col proportions
       
          als/when bevor/before nachdem/after weil/because
  mc-sc 0.39784946   0.60869565    0.40000000   0.92462312
  sc-mc 0.60215054   0.39130435    0.60000000   0.07537688

Usually, it is best to compute proportions that add up to 100% over the levels of the response variable, i.e. here over ORDER. That means,

  • if your dependent variable is in the rows, you need column percentages;
  • if your dependent variable is in the columns, you need row percentages.

Also: remember to provide only useful decimals (here with the pipe):

table(clo$ORDER, clo$CONJ) %>% prop.table(2) %>% round(3)
       
        als/when bevor/before nachdem/after weil/because
  mc-sc    0.398        0.609         0.400        0.925
  sc-mc    0.602        0.391         0.600        0.075

You can also do higher-dimensional cross-tabulation: a third variable goes into separate ‘slices’ of such an array …

table(clo$MORETHAN2CL, clo$CONJ, clo$ORDER)
, ,  = mc-sc

     
      als/when bevor/before nachdem/after weil/because
  no        19           18            14           87
  yes       18           10            12           97

, ,  = sc-mc

     
      als/when bevor/before nachdem/after weil/because
  no        22            6            14            5
  yes       34           12            25           10

… and the notation separating the slices also gives you a clue that you can access them using the same subsetting mechanisms as before:

(qwe <- table(clo$MORETHAN2CL, clo$CONJ, clo$ORDER))
, ,  = mc-sc

     
      als/when bevor/before nachdem/after weil/because
  no        19           18            14           87
  yes       18           10            12           97

, ,  = sc-mc

     
      als/when bevor/before nachdem/after weil/because
  no        22            6            14            5
  yes       34           12            25           10
qwe[1,,] # maybe a little confusing
               
                mc-sc sc-mc
  als/when         19    22
  bevor/before     18     6
  nachdem/after    14    14
  weil/because     87     5
qwe[,1,] # maybe a little confusing
     
      mc-sc sc-mc
  no     19    22
  yes    18    34
qwe[,,1] # straightforward
     
      als/when bevor/before nachdem/after weil/because
  no        19           18            14           87
  yes       18           10            12           97

For such cases, a function called ftable might be better, because it creates flat contingency tables, putting the last argument into the columns:

ftable( # create a flat contingency table with
   clo$MORETHAN2CL, # this as the 1st row variable
   clo$CONJ,        # this as the 2nd row variable
   clo$ORDER)
                   mc-sc sc-mc
                              
no  als/when          19    22
    bevor/before      18     6
    nachdem/after     14    14
    weil/because      87     5
yes als/when          18    34
    bevor/before      10    12
    nachdem/after     12    25
    weil/because      97    10

This, together with proportions, is often very useful:

ftable( # create a flat contingency table with
   clo$MORETHAN2CL, # this as the 1st row variable
   clo$CONJ,        # this as the 2nd row variable
   clo$ORDER) %>%   # this as the column variable, then
   prop.table(1) %>% round(3) # make this rounded %s
                   mc-sc sc-mc
                              
no  als/when       0.463 0.537
    bevor/before   0.750 0.250
    nachdem/after  0.500 0.500
    weil/because   0.946 0.054
yes als/when       0.346 0.654
    bevor/before   0.455 0.545
    nachdem/after  0.324 0.676
    weil/because   0.907 0.093

3.1.2 Visualization

For 2-dimensional tables, dot charts are again very useful:

table(clo$ORDER, clo$CONJ)
       
        als/when bevor/before nachdem/after weil/because
  mc-sc       37           28            26          184
  sc-mc       56           18            39           15
dotchart(          # draw a dot chart
   table(clo$ORDER, clo$CONJ), # of these frequencies
   xlim=c(0, 403), # with these x-axis limits (min, max)
   pch=16)         # using filled circles
abline(            # draw a straight line
   v=403/8,        # vertical, at x=403/8 (even distribution)
   lty=3)          # make it dotted
Figure 10: A dot chart of the frequencies of the levels of ORDER per CONJ

And now dotchart is even more annoying than before because

  • the ordering of the column levels is still ‘reversed’ (it rises from bottom to top), so we fix that again with manual reordering;
  • the nesting of the variables in the plot does not correspond to the table (which is easy to fix: we transpose the table with t):
table(clo$ORDER, clo$CONJ)
       
        als/when bevor/before nachdem/after weil/because
  mc-sc       37           28            26          184
  sc-mc       56           18            39           15
dotchart(          # draw a dot chart
   t(table(clo$ORDER, clo$CONJ))[4:1,], # of these frequencies, now transposed & reversed!
   xlim=c(0, 403), # with these x-axis limits (min, max)
   pch=16)         # using filled circles
abline(            # draw a straight line
   v=403/8,        # vertical, at x=403/8 (even distribution)
   lty=3)          # make it dotted
Figure 11: A dot chart of the frequencies of the levels of CONJ per ORDER

The version I would most likely use, though, is this one (because now both the levels of CONJ and ORDER are presented in alphabetical order):

table(clo$ORDER, clo$CONJ)
       
        als/when bevor/before nachdem/after weil/because
  mc-sc       37           28            26          184
  sc-mc       56           18            39           15
dotchart(          # draw a dot chart
   table(clo$ORDER, clo$CONJ)[2:1,], # of these frequencies, now reversed!
   xlim=c(0, 403), # with these x-axis limits (min, max)
   pch=16)         # using filled circles
abline(            # draw a straight line
   v=403/8,        # vertical, at x=403/8 (even distribution)
   lty=3)          # make it dotted

Bar charts are again an only slightly inferior alternative, but the most basic version is absolutely bare-bones:

barplot(table(clo$ORDER, clo$CONJ)) # simplest version
Figure 12: A bar chart of the frequencies of the levels of ORDER per CONJ

Totally useless … With the current plot size, we can’t even read the bar labels. Before we tweak this more comprehensively, here’s a nice quick-and-dirty fix: we let R abbreviate the labels (here, to 6 characters):

barplot(                   # a bar chart of
   table(clo$ORDER, clo$CONJ), # this table
   names.arg=abbreviate(   # but abbreviate
      levels(clo$CONJ),    # the levels of CONJ
      6))                  # to 6 characters
Figure 13: A bar chart of the frequencies of the levels of ORDER per CONJ
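A different quick fix, which I am adding here as an alternative (it is not used elsewhere in this document), is to rotate the axis labels so that they are perpendicular to the axes with las=2; the labels then need vertical space instead:

barplot(                       # a bar chart of
   table(clo$ORDER, clo$CONJ), # this table
   las=2)                      # w/ axis labels perpendicular to the axes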

But now we make it generally better (and increase the size of the plot so that more of the labels fit):

barplot(                   # make a bar chart of
   table(clo$ORDER, clo$CONJ), # this table
   # x-axis stuff and names.arg:
   xlab="Conjunctions", names.arg=abbreviate(levels(clo$CONJ), 8),
   ylab="Frequency", ylim=c(0, 200)) # y-axis stuff
legend(2, 150,               # x- and y-coordinates
   title="Clause orders",    # heading of legend
   fill=gray.colors(2),      # colors of legend
   legend=levels(clo$ORDER), # text of legend
   xjust=0.5, yjust=0.5,     # centered around coordinates
   ncol=2,                   # number of columns of legend
   bty="n")                  # box type is 'none'
Figure 14: A bar chart of the frequencies of the levels of ORDER per CONJ

And a maybe even better version with a lot of refinements:

(p.table <- prop.table(table(clo$ORDER, clo$CONJ), 2))
       
          als/when bevor/before nachdem/after weil/because
  mc-sc 0.39784946   0.60869565    0.40000000   0.92462312
  sc-mc 0.60215054   0.39130435    0.60000000   0.07537688
(p.table <- p.table[,order(p.table[2,])]) # order in ascending order of sc-mc
       
        weil/because bevor/before nachdem/after   als/when
  mc-sc   0.92462312   0.60869565    0.40000000 0.39784946
  sc-mc   0.07537688   0.39130435    0.60000000 0.60215054
# now the plotting
qwe <- barplot(p.table, # make a bar chart of the proportion table
   beside=TRUE,         # w/ bars within CONJ side-by-side
   ylim=c(0, 1.1),      # & these y-axis limits (leaving room at top)
   # this heading and these axis labels:
   main="The relation between conjunction and order",
   xlab="Conjunction", ylab="Observed % order per conjunction",
   col=c("darkgreen", "darkred"), # & these colors
   width=prop.table(table(clo$ORDER))) # & bar widths reflecting the overall proportions of ORDER
text(qwe, p.table,    # add at the bars' centers & their y-axis values
   round(p.table, 3),              # the observed proportions
    cex=0.75, pos=3,               # small font, a bit above
    col=c("darkgreen", "darkred")) # in the same colors
legend(x=mean(qwe), y=1, title="Clause orders", # add legend at top of plot
       fill=c("darkgreen", "darkred"),
       legend=c("mc-sc", "sc-mc"),
       xjust=0.5, yjust=0.5, ncol=2, bty="n")
abline(v=1.84, lty=2) # vertical line between first 2 sets of bars
text(1.84, 0.9, "causal"  , srt=90, adj=c(0.5, -0.5)) # add vertical labels
text(1.84, 0.9, "temporal", srt=90, adj=c(0.5, 1.25)) # to the vert. line
Figure 15: A bar chart of the proportions of the levels of ORDER per CONJ

Another option would be mosaic plots, the most basic version of which is again pretty bare-bones; note: I immediately transpose so that the orientation of the plot corresponds to that of the table:

table(clo$ORDER, clo$CONJ)
       
        als/when bevor/before nachdem/after weil/because
  mc-sc       37           28            26          184
  sc-mc       56           18            39           15
mosaicplot(t(table(clo$ORDER, clo$CONJ)))
Figure 16: A mosaic plot of the proportions of the levels of ORDER per CONJ

A useful addition to this is the argument shade=TRUE, because then blue and red shading indicates what’s over- and what’s underrepresented respectively:

table(clo$ORDER, clo$CONJ)
       
        als/when bevor/before nachdem/after weil/because
  mc-sc       37           28            26          184
  sc-mc       56           18            39           15
mosaicplot(main="ORDER as a function of CONJ",
   table(clo$CONJ, clo$ORDER),
   shade=TRUE,
   xlab="Conjunctions", ylab="Clause orders")
Figure 17: A mosaic plot of the proportions of the levels of ORDER per CONJ

Some quick exercise(s): Cross-tabulate the response variable clo$ORDER with the predictor clo$MORETHAN2CL (frequencies and percentages) and represent the results in a dot chart and a mosaic plot:

(qwe <- table(clo$MORETHAN2CL, clo$ORDER))
     
      mc-sc sc-mc
  no    138    47
  yes   137    81
prop.table(qwe, 1)
     
          mc-sc     sc-mc
  no  0.7459459 0.2540541
  yes 0.6284404 0.3715596
par(mfrow=c(1,2))
dotchart(t(qwe)[2:1,])
mosaicplot(t(qwe), shade=TRUE)
par(mfrow=c(1,1))


3.2 Categorical ~ ordinal/numeric

3.2.1 Descriptive stats

This is a quite frequent scenario, where we have a categorical (often binary) response and a numeric predictor. In some sense, this is often dealt with in a way that ‘converts’ the numeric variable temporarily into a categorical/ordinal variable with a variety of levels, for each of which we then compute the distribution of the response variable. The splitting up of the numeric variable into levels – the binning – we can do again with cut where we leave the breaks up to R’s function hist; we just suppress the plotting of hist:

(qwe <- table(                 # make qwe the table of
   cut(                        # the factor that results from cutting up
      clo$LENDIFF,                # LENDIFF
      breaks=hist(clo$LENDIFF,    # at the same breaks hist ...
           plot=FALSE)$breaks,    # would use (no plotting)
      include.lowest=TRUE),       # include the smallest value as well
   clo$ORDER))                 # the factor ORDER
           
            mc-sc sc-mc
  [-35,-30]     1     0
  (-30,-25]     0     0
  (-25,-20]     2     1
  (-20,-15]     4     2
  (-15,-10]    13     4
  (-10,-5]     49    12
  (-5,0]       91    31
  (0,5]        78    54
  (5,10]       25    13
  (10,15]       8     7
  (15,20]       4     3
  (20,25]       0     1

But of course you can also manually choose a number of breaks, in which case R chooses equally wide bins:

(qwe <- table(                 # make qwe the table of
   cut(clo$LENDIFF, breaks=5), # the factor from cutting up LENDIFF into 5 bins
   clo$ORDER))                 # the factor ORDER
               
                mc-sc sc-mc
  (-32.1,-20.6]     2     1
  (-20.6,-9.2]     18     6
  (-9.2,2.2]      183    68
  (2.2,13.6]       66    49
  (13.6,25.1]       6     4
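If you prefer bins that contain (roughly) equal numbers of data points rather than bins of equal width, you can also hand cut the quartiles of the variable as break points; this is a sketch I am adding here, not part of the original analysis:

(qwe <- table(                 # make qwe the table of
   cut(clo$LENDIFF,            # the factor from cutting up LENDIFF
      breaks=quantile(clo$LENDIFF, probs=0:4/4), # at its quartiles
      include.lowest=TRUE),    # include the smallest value as well
   clo$ORDER))                 # the factor ORDER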

3.2.2 Visualization

The corresponding visualization is called a spineplot, essentially a stacked bar plot for a series of grouped numeric values, which we can generate with the function spineplot:

spineplot(x=clo$LENDIFF, y=clo$ORDER)
Figure 18: A spine plot of the proportions of the levels of ORDER against LENDIFF

However, the probably more frequent use of this function involves something very important, the so-called formula notation, where we have

  • the so-called lhs, which is the ‘left-hand side’, the response variable (here ORDER);
  • the tilde ~, which means ‘as a function of’;
  • the so-called rhs, which is the ‘right-hand side’, the one or more predictor variables (here LENDIFF):
spineplot(                  # make a spine plot
   clo$ORDER ~ clo$LENDIFF, # for this formula
   # w/ these axis labels:
   xlab="Length difference: main - subordinate clause",
   ylab="Clause orders")
Figure 19: A spine plot of the proportions of the levels of ORDER against LENDIFF
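The formula method of spineplot also accepts a data argument, which saves us the repeated clo$ prefixes; here is a minimal sketch of the same plot using that variant of the call:

spineplot(          # make a spine plot
   ORDER ~ LENDIFF, # for this formula,
   data=clo,        # w/ the variables looked up in clo
   # w/ these axis labels:
   xlab="Length difference: main - subordinate clause",
   ylab="Clause orders")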

3.3 Ordinal/numeric ~ categorical

3.3.1 Descriptive stats

This section introduces one of the greatest things in R, the way in which functions from the apply family allow you to avoid what other programming languages would often do with loops. Given everything you know so far, how would you compute the mean length differences (between main and subordinate clauses) for each conjunction? Probably like this:

setNames(
   c(mean(clo$LENDIFF[clo$CONJ=="als/when"]),
     mean(clo$LENDIFF[clo$CONJ=="bevor/before"]),
     mean(clo$LENDIFF[clo$CONJ=="nachdem/after"]),
     mean(clo$LENDIFF[clo$CONJ=="weil/because"])),
   levels(clo$CONJ))
     als/when  bevor/before nachdem/after  weil/because 
  -0.04301075    2.52173913    0.41538462   -0.89447236 

It doesn’t get more annoying than that, and we even have to label our results ourselves … Thus, we instead do this using tapply (‘tabulate apply’):

tapply(         # apply to
   clo$LENDIFF, # (the values of) LENDIFF
   clo$CONJ,    # a grouping by CONJ
   FUN=mean)    # and compute the mean for each group
     als/when  bevor/before nachdem/after  weil/because 
  -0.04301075    2.52173913    0.41538462   -0.89447236 

This can be extended to more than one grouping variable:

tapply(         # apply to
   clo$LENDIFF, # (the values of) LENDIFF
   list(                # a grouping by
      clo$CONJ,         # CONJ and also
      clo$MORETHAN2CL), # MORETHAN2CL
   FUN=median)  # and compute the median for each group
                no  yes
als/when       1.0  0.0
bevor/before   1.5  2.5
nachdem/after  1.5 -1.0
weil/because  -1.0  0.0
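An alternative worth knowing, which I am adding here for comparison (it is not used in the rest of this document), is aggregate, which returns the same group-wise statistics as a data frame rather than a matrix:

aggregate(                       # compute group-wise statistics (as a data frame)
   LENDIFF ~ CONJ + MORETHAN2CL, # LENDIFF as a function of these two factors
   data=clo,                     # w/ the variables looked up in clo
   FUN=median)                   # and compute the median for each group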

3.3.2 Visualization

We can visualize these kinds of results in the same way as we did when we had only 1 variable; we just don’t plot those results for all the data but for the groups defined by the grouping variable(s):

boxplot(                    # make a box plot for
   clo$LENDIFF ~ clo$ORDER, # this formula
   notch=TRUE,              # include notches
   varwidth=TRUE,           # box widths reflect the numbers of data points
   xlab="Clause orders", ylab="Length differences mc-sc")
abline(h=0, lty=2) # dashed line at y=0
Figure 20: A box plot of LENDIFF against ORDER

I am not showing the strip charts, but the ecdf plot is still probably the best way to go:

plot(                 # make an ecdf plot
   ecdf(clo$LENDIFF), # of this variable, but
   col="white",       # plot it in white (i.e. invisibly)
   # use no heading & these axis labels:
   main="", xlab="Length differences mc-sc", ylab="Cumulative %")
grid() # add a grid
lines( # draw lines for the ecdf when ORDER is "mc-sc":
   ecdf(clo$LENDIFF[clo$ORDER=="mc-sc"]),
   # pick 2 very distinctive colors, then use the 1st one for "mc-sc"
   col=rainbow(2)[1],
   verticals=TRUE) # connect points also vertically
lines( # draw lines for the ecdf when ORDER is "sc-mc":
   ecdf(clo$LENDIFF[clo$ORDER=="sc-mc"]),
   # pick 2 very distinctive colors, then use the 2nd one  for "sc-mc"
   col=rainbow(2)[2],
   verticals=TRUE) # connect points also vertically
legend(title="Clause orders",
   x=-25, xjust=0.5,
   y=0.7, yjust=0.5,
   fill=rainbow(2),
   legend=levels(clo$ORDER),
   ncol=2, bty="n")
Figure 21: An ecdf plot of LENDIFF against ORDER

3.4 Ordinal/numeric ~ ordinal/numeric

3.4.1 Descriptive stats

We go back to the reaction time data for this; remember what the columns mean:

rm(list=ls(all=TRUE)) # clear memory
summary(rts <- read.delim("_input/reactiontimes.csv", stringsAsFactors=TRUE))
        CASE          RT             FREQ          FAM       IMA    
 almond   : 1   Min.   :523.0   Min.   :  1.00   hi  :12   hi  :30  
 ant      : 1   1st Qu.:589.4   1st Qu.:  2.00   lo  :12   lo  :23  
 apple    : 1   Median :617.7   Median :  4.00   med :31   NA's:24  
 apricot  : 1   Mean   :631.9   Mean   :  9.26   NA's:22            
 asparagus: 1   3rd Qu.:663.6   3rd Qu.: 10.00                      
 avocado  : 1   Max.   :794.5   Max.   :118.00                      
 (Other)  :71                                                       
      MEAN      
 Min.   :315.0  
 1st Qu.:409.8  
 Median :437.5  
 Mean   :436.0  
 3rd Qu.:466.2  
 Max.   :553.0  
 NA's   :29     

Correlations between numeric variables are expressed with correlation coefficients which usually fall into the interval [-1, 1] such that

  • values close to +1 indicate a strong positive correlation, which can be expressed with ‘the more/less …, the more/less …’;
  • values around 0 indicate no correlation;
  • values close to -1 indicate a strong negative correlation, which can be expressed with ‘the less/more …, the more/less …’.

Note: two variables are correlated if knowing the value of one variable helps you predict/narrow down the value range to expect for the other variable.
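Here is a minimal toy illustration of those three cases (with made-up vectors, not the reaction time data):

x <- 1:10
cor(x,  2 * x + 3)             # exactly  1: a perfect positive correlation
cor(x, -2 * x + 3)             # exactly -1: a perfect negative correlation
set.seed(1); cor(x, rnorm(10)) # close to 0: (practically) no correlation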

If you have ‘well-behaved’ and normal variables you want to correlate, you can compute the Pearson product-moment correlation coefficient r:

cor(rts$FREQ, rts$RT) # r
[1] -0.2377259

On the other hand, if your data are not ‘well-behaved’ and normal, you can compute either Spearman’s rho or Kendall’s tau:

cor(rts$FREQ, rts$RT, method="spearman") # rho
[1] -0.4851569
cor(rts$FREQ, rts$RT, method="kendall") # tau
[1] -0.3473737

But how do you know whether your data are ‘well-behaved’ and normal? Visualization, which means you should actually nearly always visualize before you compute statistics!

3.4.2 Visualization

For such data, a scatter plot is the useful default. Minimally, that could be either one of these:

plot(x=rts$FREQ, y=rts$RT); grid()
# plot(rts$RT ~ rts$FREQ); grid() # formula notation
Figure 22: RT as a function of FREQ

But let’s add something you should always add: a non-parametric smoother, i.e. a (potentially quite curvy) line that tries to summarize the point cloud:

plot(rts$RT ~ rts$FREQ); grid()
lines(lowess(rts$RT ~ rts$FREQ), col="red")
Figure 23: RT as a function of FREQ

Clearly, these are not ‘well-behaved’ let alone normal data; this is confirmed by plotting quick histograms:

par(mfrow=c(1, 2)) # set up a plotting panel with 1 row & 2 columns
hist(rts$FREQ, main="")
hist(rts$RT  , main="")
par(mfrow=c(1, 1)) # reset to default
Figure 24: Histograms of FREQ and RT

In such situations, using Pearson’s r is completely wrong – you either have to use one of the other correlation coefficients or you try to transform the offending variable (here, FREQ) into a shape that is more ‘well-behaved’ and normal. For the shape that FREQ has, often the log transformation works well:

rts$FREQLOG <- log2(rts$FREQ)
par(mfrow=c(1, 2)) # set up a plotting panel with 1 row & 2 columns
hist(rts$FREQLOG, main="")
plot(rts$RT ~ rts$FREQLOG); grid()
lines(lowess(rts$RT ~ rts$FREQLOG), col="red")
par(mfrow=c(1, 1)) # reset to default
Figure 25: Histogram of logged FREQ and its relation to RT

In this case, the log transformation helps: yes, the histogram is still clearly non-normal, but the scatterplot is much better than before. Thus, we re-compute the correlation coefficients:

setNames(
   c(cor(rts$FREQLOG, rts$RT),
     cor(rts$FREQLOG, rts$RT, method="spearman"),
     cor(rts$FREQLOG, rts$RT, method="kendall")),
   c("Pearson's r", "Spearman's rho", "Kendall's tau"))
   Pearson's r Spearman's rho  Kendall's tau 
    -0.4679032     -0.4851569     -0.3473737 
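One practical caveat I should add (it is not an issue for these data, where the minimum of FREQ is 1): log2(0) is -Inf, so a frequency variable containing zeros would need a workaround such as adding 1 before logging; a toy illustration:

freqs <- c(0, 1, 4, 118) # a made-up frequency vector containing a zero
log2(freqs)              # the first value becomes -Inf
log2(freqs + 1)          # a common workaround: add 1 before logging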

Now, a much more refined version is this:

plot(main=paste("RT ~ log frequency; (tau=", round(cor(rts$FREQLOG, rts$RT, method="kendall"), 3), " ***)", sep=""),
   xlab="Frequency (logged)",  xlim=c(1, 7)    , x=rts$FREQLOG,
   ylab="Reaction time in ms", ylim=c(500, 900), y=rts$RT,
   pch=16, cex=2, col="#00AA0040")
grid()
lines(lowess(rts$RT ~ rts$FREQLOG), lwd=5, col="darkgreen") # thick locally-weighted smoother
abline(v=median(rts$FREQLOG), lty=2) # add a vertical line at the median of FREQLOG
abline(h=median(rts$RT), lty=2)      # add a horizontal line at the median of RT
text(x=median(rts$FREQLOG),
     y=850,
     labels="median frequency",
     srt=90, adj=c(0.5, -0.5))
text(6, median(rts$RT), "median RT", adj=c(NA, -0.5))
text(mean(rts$FREQLOG), median(rts$RT), "+", font=2)
text(median(rts$FREQLOG), mean(rts$RT), "+", font=2)
qwe <- tapply(rts$RT, list(rts$RT>median(rts$RT), rts$FREQLOG>median(rts$FREQLOG)), length)
text(c(1, 1 , 7, 7), c(500, 900, 500, 900), qwe, col="darkgreen", cex=0.5+(qwe/median(qwe)))
Figure 26: RT as a function of FREQ

The way the color of the points is defined involves a code that you read as “RRGGBBOO” (the sketch after this list shows how to generate such codes with rgb):

  • the first two characters (here 00) encode how much red you want, the next two (AA) how much green, and the next two (00) how much blue;
  • the last two characters (here 40) encode the opacity of the point, so 00 means the point is completely transparent and FF means it is fully opaque;
  • that in turn might already tell you that these are values on the hexadecimal scale (where the digits 0 to F correspond to the values 0 to 15), meaning each pair goes from 00 (for 0) to FF (for 15×16 + 15 = 255). For example,
    • “#FF0000FF” is bright red, fully opaque whereas “#0000FFFF” is bright blue, fully opaque;
    • “#FFFF00FF” is red plus green, i.e. yellow, fully opaque;
    • “#FF00FF80” is red plus blue, i.e. purple, half opaque.
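If you do not want to write such hex codes by hand, the function rgb can build them for you; here is a minimal sketch, which reproduces the point color used above:

rgb(0, 170, 0, 64,     # this much red, green, blue, & opacity on a 0-255 scale
    maxColorValue=255) # returns "#00AA0040"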

Another often useful option is to plot the words, and this one I will save into a file to remind you of how that’s done:

png("_output/scatterplot_FREQ&RT.png",
    height=6, width=6, # dimensions of the plot
    units="in",        # in inches
    res=300)           # resolution in dpi, 300 is recommended
# now the plotting
plot(rts$FREQLOG, rts$RT, type="n"); grid() # set up empty coordinate system!
text(rts$FREQLOG, rts$RT, # plot text at these coordinates
   rts$CASE,              # namely the words
   cex=rts$MEAN/475)      # at sizes reflecting the MEAN
lines(lowess(rts$RT ~ rts$FREQLOG), lwd=5, col="#00000030") # add a thick transparent grey smoother
# now you close the plotting device:
dev.off()
png 
  2 

You can find the result at the path shown above.
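If you need a vector format instead, the workflow is the same with the corresponding device functions; a minimal sketch with pdf (the output path is just an example):

pdf("_output/scatterplot_FREQ&RT.pdf", # open a pdf graphics device
    height=6, width=6)                 # dimensions in inches (pdf's default unit)
plot(rts$FREQLOG, rts$RT)              # (any plotting code goes here)
dev.off()                              # close the plotting device again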


Some quick exercise(s): R has a function called nchar, which returns the number(s) of characters of the elements of a character vector:

rts$CASE        %>% # take the factor rts$CASE
   as.character %>% # make it a character vector
   nchar            # count the letters of each word
 [1]  6  3  5  7  9  7  6  3  6  3  8 10  8  5  9  5  6  3  6  7  5  9  8  3  7
[26]  6  5  8  8  3  4  7  4  5  5  4  8  5  4  4  5  7  4  6  5  4  6  5  5  8
[51]  7  5  6  3  7  6  4  3  9  6  6  8  5  5  5  6  5  8  5 10  4  6  8  6  4
[76]  5 10

Use this function to generate a scatter plot that represents the correlation between the lengths of words and their reaction times in a plot like the one above for frequency:

rts$LEN <- rts$CASE %>% as.character %>% nchar
# set up coordinate system
plot(rts$LEN, rts$RT, type="n"); grid()
text(                       # plot text at
   rts$LEN,                 # these x-coordinates
   rts$RT,                  # these y-coordinates
   rts$CASE,                # namely these words 
   cex=0.8 + rts$FREQLOG/7) # at these sizes
# add smoother
lines(lowess(rts$RT ~ rts$LEN), lwd=5, col="#00000030")


4 Comments on workflow

Here are some comments on how you would ideally organize your workflow for any empirical study. First, the one thing you don’t want to do is this, but this is what most people do – I am not kidding:

Figure 27: The average workflow 1

Actually, the above may not be what most people do, because it implies that people use their Documents folder at all. Instead, what most people really seem to do is a combination of two ‘approaches’. One is that they keep all their files in the Downloads folder; I have seen Downloads folders on students’ laptops with literally >20,000 files in them …

The other thing, probably just as common:

Figure 28: The average workflow 2

Maybe, just maybe we can avoid this? Here’s what you should do (a small R sketch for creating such a folder skeleton follows the list):

  1. create a separate folder with a meaningful name for whatever empirical study you plan on doing, e.g. <2024_MAthesis>.
  2. create at least two subfolders in it:
    1. <_input>: this folder will contain all data files that you likely prepare in spreadsheet software and read into R; most likely, these will be
      1. spreadsheet files created with Excel or LibreOffice Calc and .txt/.csv/.tsv versions of those files for import into R;
      2. R-specific data files (.rda files);
      3. Note: ideally, every data file comes with a separate text file that explains its structure in detail, like this file explains this data file. If you know you will use R more comprehensively, you should familiarize yourself with the Roxygen2 format I use for describing the data;
    2. <_output>: this folder will contain all files that result from what you did in R; most likely these will be
      1. plots (probably as .png files, maybe as .svg files);
      2. other results files, which could be .txt/.csv/.tsv files or R-specific .rda files.
  3. maybe you will also want to have a folder for .pdf files of papers, books, etc.
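If you want to create such a folder skeleton from within R, here is a minimal sketch (the folder name is just an example):

dir.create("2024_MAthesis")         # the project folder
dir.create("2024_MAthesis/_input")  # the subfolder for input data files
dir.create("2024_MAthesis/_output") # the subfolder for output/results files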

For your actual R analyses, there are two suggestions: a basic one and a more advanced (but also way better) one if you’re serious about this. The basic one is to create your analysis in a simple .r/.R file in RStudio, but to use a ton of pound signs for two reasons:

  • to add lots of commentary into the file to say what you’re doing, why you’re doing it, what you see in the results, etc. This way, your R analysis file is already half your methods section/chapter, which you later need anyway.
  • to create a structure that you can use as an outline for quick navigation (with the outline button in RStudio and the navigator at the bottom).

Here’s an example:

# First analysis: something amazing #############################
## Loading file #################################################
some code

## Exploration and preparation ##################################
some code # I plot this to see whether the data are normal
# turns out they are not

some more code # log-transform the data
some plotting # I plot this to see whether the data are now normal
# yes, they are

### Excursus: why not try THIS? #################################
## Compute correlations #########################################
some code # good, this is high

## Plot this, too ###############################################
ton of plotting code
ton of plotting code
ton of plotting code
ton of plotting code

# Second analysis: something even more amazing ##################
...

It might be useful to number your code files like <01a_analysis_of_RTs.r> and have all output files they create begin with the same numbering so it’s easier to keep track of everything. Here are two examples:

A possible use of file names

Another possible use of file names

The serious suggestion is to learn RMarkdown (e.g., here or here) or, even better, its newer extension, Quarto. There is no better way to do replicable data science than the kind of literate programming this allows you to do, and it’s not nearly as hard as you might think it is.


5 Exercises

5.1 The ASSIGNMENT data

The file _input/assignments.csv contains data regarding the question of which course-final assignment was chosen by what kind of student in a large international student exchange program. Here’s what the data contain/represent; the response variable is ASSIGNMENT:

rm(list=ls(all=TRUE)) # clear memory
summary(ass <- read.delim("_input/assignments.csv", stringsAsFactors=TRUE))
      CASE             ASSIGNMENT      SEX                REGION   
 Min.   :  1.00   lab_report:100   female:145   central_europ: 99  
 1st Qu.: 75.75   oral_exam :100   male  :155   hispanic     :100  
 Median :150.50   thesis    :100                mid_east     :101  
 Mean   :150.50                                                    
 3rd Qu.:225.25                                                    
 Max.   :300.00                                                    
   WORKHOURS        MISTAKES    
 Min.   :12.10   Min.   :10.00  
 1st Qu.:19.77   1st Qu.:14.75  
 Median :29.90   Median :17.00  
 Mean   :27.34   Mean   :17.12  
 3rd Qu.:33.50   3rd Qu.:19.00  
 Max.   :39.90   Max.   :25.00  
head(ass) # look at the first 6 rows of the data frame
  CASE ASSIGNMENT    SEX        REGION WORKHOURS MISTAKES
1    1  oral_exam female central_europ      20.1       18
2    2  oral_exam female central_europ      21.3       18
3    3  oral_exam female central_europ      13.5       13
4    4  oral_exam   male central_europ      19.5       12
5    5  oral_exam female central_europ      21.8       12
6    6  oral_exam   male      mid_east      21.7       11

Your task is to explore the data in ways that would ultimately prepare for and support an analysis of the response variable, i.e. do the usual exploration of these variable(s) and their combinations:

  • response/dependent variable: ASSIGNMENT;
  • predictors/independent variables (pick a subset of 4-6 variables you find interesting);
  • predictors/independent variables in that subset w/ each other;
  • predictors/independent variables in that subset w/ the response/dependent variable.
# replace this with your answer(s)/code

5.2 The COMPLEMENTIZER data

The file _input/thatcomplementation.csv contains data regarding the question of whether a speaker would realize that as a complementizer or not as in examples (1) and (2):

(1) a. The problem was Ø    the Vorlons refused to help in the war.
    b. The problem was that the Vorlons refused to help in the war.
(2) a. I thought Ø    the Vorlons could have helped earlier.
    b. I thought that the Vorlons could have helped earlier.

Here’s what the data contain/represent; the response variable is COMPLEMENTIZER:

rm(list=ls(all=TRUE)) # clear memory
summary(tha <- read.delim("_input/thatcomplementation.csv", stringsAsFactors=TRUE))
      CASE          CORPUS          FILE        MATCHLEMMA   COMPLEMENTIZER
 Min.   :   1   ICE-GB :3669   GE042  :  58   think  :2683   absent :3769  
 1st Qu.:1544   ICLE   :1260   SP024  :  48   say    : 895   present:2405  
 Median :3088   LINDSEI:1245   GE009  :  42   be     : 586                 
 Mean   :3088                  S1A-022:  39   mean   : 428                 
 3rd Qu.:4631                  S1B-039:  35   know   : 359                 
 Max.   :6174                  S1A-054:  32   hope   : 153                 
                               (Other):5920   (Other):1070                 
  TYPE            L1          REGISTER        L_COMP         L_COMPSUBJ     
 obj:5524   English:3669   spoken :4206   Min.   :  4.00   Min.   :  1.000  
 sub: 650   German :1369   written:1968   1st Qu.: 27.25   1st Qu.:  2.000  
            Spanish:1136                  Median : 48.00   Median :  3.000  
                                          Mean   : 63.66   Mean   :  8.479  
                                          3rd Qu.: 82.00   3rd Qu.:  8.000  
                                          Max.   :500.00   Max.   :189.000  
                                                                            
   L_COMPREST       L_MATRV2CC         L_MATRBE4S        L_MATRSUBJ     
 Min.   :  1.00   Min.   : 0.00000   Min.   :  0.000   Min.   :  1.000  
 1st Qu.: 22.00   1st Qu.: 0.00000   1st Qu.:  0.000   1st Qu.:  1.000  
 Median : 39.00   Median : 0.00000   Median :  0.000   Median :  1.000  
 Mean   : 55.18   Mean   : 0.04244   Mean   :  2.864   Mean   :  5.795  
 3rd Qu.: 72.00   3rd Qu.: 0.00000   3rd Qu.:  0.000   3rd Qu.:  4.000  
 Max.   :463.00   Max.   :69.00000   Max.   :202.000   Max.   :175.000  
                                                                        
   L_MATRS2V            DP_CW               DP_WC             SURPRISAL     
 Min.   :  0.0000   Min.   :-0.113264   Min.   :-0.521334   Min.   : 1.343  
 1st Qu.:  0.0000   1st Qu.:-0.003243   1st Qu.:-0.007668   1st Qu.: 4.094  
 Median :  0.0000   Median : 0.035430   Median : 0.154076   Median : 5.045  
 Mean   :  0.5011   Mean   : 0.153483   Mean   : 0.127166   Mean   : 6.100  
 3rd Qu.:  0.0000   3rd Qu.: 0.374783   3rd Qu.: 0.421268   3rd Qu.: 7.020  
 Max.   :119.0000   Max.   : 0.374783   Max.   : 0.480777   Max.   :20.007  
                                                                            
head(tha) # look at the first 6 rows of the data frame
  CASE CORPUS     FILE MATCHLEMMA COMPLEMENTIZER TYPE      L1 REGISTER L_COMP
1    1 ICE-GB  S2A-068     accept        present  obj English   spoken    115
2    2 ICE-GB  S1B-029     accept         absent  obj English   spoken     33
3    3   ICLE SPM04051     accept        present  obj Spanish  written     43
4    4 ICE-GB  S1B-063     accept        present  obj English   spoken     55
5    5 ICE-GB  S1B-065     accept        present  obj English   spoken     81
6    6 ICE-GB  S1B-056     accept        present  obj English   spoken    324
  L_COMPSUBJ L_COMPREST L_MATRV2CC L_MATRBE4S L_MATRSUBJ L_MATRS2V       DP_CW
1         14        101          0          0          1         0 -0.00676608
2         12         21          0          0          3         0 -0.00676608
3         14         29          0          3          2         0 -0.00676608
4          2         53          0          0          1         0 -0.00676608
5          1         80          0          0          1         0 -0.00676608
6          2        322          0          0          2         0 -0.00676608
       DP_WC SURPRISAL
1 -0.4659455  6.424231
2 -0.4659455 13.319049
3 -0.4659455 12.319049
4 -0.4659455  9.511694
5 -0.4659455  9.412159
6 -0.4659455  4.560826

Your task is to explore the data in ways that would ultimately prepare for and support an analysis of the response variable, i.e. do the usual exploration of these variable(s) and their combinations:

  • response/dependent variable: COMPLEMENTIZER;
  • predictors/independent variables (pick a subset of 4-6 variables you find interesting);
  • predictors/independent variables in that subset w/ each other;
  • predictors/independent variables in that subset w/ the response/dependent variable.
# replace this with your answer(s)/code

5.3 The DURATION data

The file _input/wdurations.csv contains data regarding the question of how long the pronunciation of a word lasts in a spoken turn with 10 words in it. Here’s what the data contain/represent; the response variable is DURATION:

rm(list=ls(all=TRUE)) # clear memory
summary(dur <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE))
      CASE           CONV         SPEAKER         SEX          DURATION     
 Min.   :   1   C08    : 873   S15    : 478   female:3203   Min.   :  14.0  
 1st Qu.:1596   C05    : 831   S10    : 455   male  :3180   1st Qu.: 110.0  
 Median :3192   C06    : 800   S11    : 414                 Median : 162.0  
 Mean   :3192   C07    : 772   S14    : 410                 Mean   : 194.5  
 3rd Qu.:4788   C04    : 737   S08    : 401                 3rd Qu.: 245.0  
 Max.   :6383   C09    : 557   S16    : 395                 Max.   :1405.0  
                (Other):1813   (Other):3830                                 
   POSINTURN          LENGTH            FREQ         SURPRISAL    
 Min.   : 1.000   Min.   : 1.000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 3.000   1st Qu.: 2.000   1st Qu.:12.50   1st Qu.: 4.00  
 Median : 6.000   Median : 3.000   Median :15.00   Median : 6.00  
 Mean   : 5.524   Mean   : 2.768   Mean   :14.04   Mean   : 6.17  
 3rd Qu.: 8.000   3rd Qu.: 3.000   3rd Qu.:17.00   3rd Qu.: 8.00  
 Max.   :10.000   Max.   :14.000   Max.   :18.00   Max.   :21.00  
                                                                  
      CLASS       VOWELS    
 closed  :4021   one :3698  
 number  : 125   two+:2685  
 open    :2086              
 propname: 110              
 unclear :  41              
                            
                            
head(dur) # look at the first 6 rows of the data frame
  CASE CONV SPEAKER  SEX DURATION POSINTURN LENGTH FREQ SURPRISAL  CLASS VOWELS
1    1  C01     S01 male      105         2      3   16         5 closed    one
2    2  C01     S01 male       89         3      3   16         6 closed    one
3    3  C01     S01 male       78         4      2   18         3 closed    one
4    4  C01     S01 male      100         5      1   15         3 closed    one
5    5  C01     S01 male      368         6      6   11         8   open   two+
6    6  C01     S01 male      251         7      2   16         7 closed   two+

Your task is to explore the data in ways that would ultimately prepare for and support an analysis of the response variable, i.e. do the usual exploration of these variable(s) and their combinations:

  • response/dependent variable: DURATION;
  • predictors/independent variables (pick a subset of 4-6 variables you find interesting);
  • predictors/independent variables in that subset w/ each other;
  • predictors/independent variables in that subset w/ the response/dependent variable.

6 The final assignment: the Boston housing data

The file _input/boston.csv contains data on the values of real estate properties in the greater Boston area, addressing the question of which variables best explain how valuable a house/lot is. Here’s what the data contain/represent; the response variable is MEDIANVALC. Your task is to explore the data in ways that would ultimately prepare for and support an analysis of the response variable, i.e. do the usual exploration of these variable(s) and their combinations:

  • response/dependent variable: MEDIANVALC;
  • predictors/independent variables (pick a subset of 4-6 variables you find interesting);
  • predictors/independent variables in that subset w/ each other;
  • predictors/independent variables in that subset w/ the response/dependent variable.

For each variable (combination), do some exploration and write a short interpretation. Make sure your analysis is parsimonious, i.e. don’t dump all descriptive statistics you know into the code. For example, for something numeric, don’t just mindlessly compute the mean and the median and the standard deviation and the interquartile range: pick the one measure of central tendency and the corresponding measure of dispersion that is supported by the data; same with the plots.

sessionInfo()
R version 4.5.1 Patched (2025-06-20 r88332 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods  
[8] base     

other attached packages:
[1] STGmisc_1.01   Rcpp_1.1.0     magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] digest_0.6.37     fastmap_1.2.0     xfun_0.52         knitr_1.50       
 [5] htmltools_0.5.8.1 rmarkdown_2.29    cli_3.6.5         rstudioapi_0.17.1
 [9] tools_4.5.1       evaluate_1.0.4    yaml_2.3.10       rlang_1.1.6      
[13] jsonlite_2.0.0    htmlwidgets_1.6.4 MASS_7.3-65