Intro to descriptive statistics, part 1

UC Santa Barbara

JLU Giessen

27 Jul 2025 12-34-56

Introduction

The scientific method

Figure 1: Scientific method

A few self-evident objectives of empirical scientific inquiry

  • Description, answering the question “what happens/happened?”
  • explanation, answering the question “why does x happen?”
  • prediction, answering the question “what will happen with x if …?”
  • control, answering the question “how can x be influenced?”

But why use statistics for this?

  • To describe, explain, and predict
    • objectively
    • precisely
    • comparably
    • concisely
  • to cope with variability and to generalize: different samples even from the same population will yield different results
  • thus, we need to be able to
    • quantify this variability
    • separate random from systematic/meaningful variability
  • to assess the robustness of one’s generalizations
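The point about sampling variability can be made concrete with a small simulation. This is only a sketch: the population (normal, mean 100, sd 15), the sample size of 30, and all object names are assumptions chosen for illustration, not data from any study.

```r
# Assumed, simulated population; not real data
set.seed(1)
population <- rnorm(100000, mean = 100, sd = 15)

# Draw five random samples of 30 items each from the same population
# and compute the mean of each sample
sample.means <- replicate(5, mean(sample(population, size = 30)))
round(sample.means, 2)

# The spread of the sample means is one way to quantify
# how much results vary from sample to sample
sd(sample.means)
```

Each sample comes from the same population, yet the five means differ; statistics is needed to decide whether an observed difference exceeds this kind of random variation.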

Three absolutely central notions

  • Objectivity: independence of personal opinions
  • reliability: precision (in the sense of ‘re-test reliability’)
  • validity: one measures what one wants to measure; in a sense, this is probably the most important one

Pitfalls you can avoid with proper quantitative analysis

Two English verbs verb1 and verb2

  • A published study discussed the complementation preferences of verb1 and verb2 with regard to two grammatical patterns on the basis of the following data:
addmargins(example.1 <-matrix(c(295, 131, 104, 35), ncol=2,
   dimnames=list(VERB=1:2, PATTERN=1:2)))
     PATTERN
VERB    1   2 Sum
  1   295 104 399
  2   131  35 166
  Sum 426 139 565
    PATTERN
VERB    1    2
   1 0.74 0.26
   2 0.79 0.21
  • conclusion drawn from this: “[c]omparing the postverbal elements in the two verbs, we can see that the proportion of [pattern1] for [verb2] is higher than for [verb1]” …
  • yes, 79% > 74%, but a chi-squared test would have shown that the distribution is not significantly different from chance:

    Pearson's Chi-squared test

data:  example.1
X-squared = 1.5679, df = 1, p-value = 0.2105
  • thus, with this test, the author would have avoided making an incorrect overgeneralization.
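The row percentages and the test result for these data can be reproduced as follows. This is a sketch, not necessarily the code behind the slide: the use of prop.table() and the setting correct = FALSE are assumptions, chosen because they reproduce the values displayed above; example.1.test is a name introduced here.

```r
# Data as reported above: rows = verbs, columns = patterns
example.1 <- matrix(c(295, 131, 104, 35), ncol = 2,
   dimnames = list(VERB = 1:2, PATTERN = 1:2))

# Row proportions: how each verb is distributed over the two patterns
round(prop.table(example.1, margin = 1), 2)

# Chi-squared test of independence; correct = FALSE switches off the
# continuity correction that R applies to 2x2 tables by default
example.1.test <- chisq.test(example.1, correct = FALSE)
example.1.test
```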

Two English verbs verb1 and verb2

  • Another study on two English verbs verb1 and verb2 discussed their complementation preferences with regard to 5 kinds of XPs on the basis of the following data:
addmargins(example.2 <- matrix(c(302,73, 8,0, 145,5, 19,3, 8,0), ncol=5,
   dimnames=list(VERB=1:2, PATTERN=c("NP", "PP", "VP", "AdjP", "AdvP"))))
     PATTERN
VERB   NP PP  VP AdjP AdvP Sum
  1   302  8 145   19    8 482
  2    73  0   5    3    0  81
  Sum 375  8 150   22    8 563
  • “[i]f we look at the distribution of x before major constituents we find that (a) [verb1] is more common before noun-phrases than before other constituents” …
  • yes, 302 is the largest figure in the first row, or even in the whole table, but the focus of much of the study was on verb1 vs. verb2, and compared to verb2, verb1 actually disprefers occurring before NPs (as shown by the residuals of a chi-squared test):
    PATTERN
VERB    NP    PP    VP  AdjP  AdvP
   1 -1.06  0.44  1.46  0.04  0.44
   2  2.59 -1.07 -3.57 -0.09 -1.07
  • Thus, with this test, the author would have avoided their oversight.
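The residuals shown above match the Pearson residuals of a chi-squared test on this table. The slides do not state how they were computed, so the following is a sketch under that assumption; example.2.test is a name introduced here.

```r
# Data as reported above: rows = verbs, columns = kinds of XPs
example.2 <- matrix(c(302, 73, 8, 0, 145, 5, 19, 3, 8, 0), ncol = 5,
   dimnames = list(VERB = 1:2, PATTERN = c("NP", "PP", "VP", "AdjP", "AdvP")))

# chisq.test() may warn that the chi-squared approximation is unreliable
# because some expected frequencies are small; the warning is suppressed
# here since only the residuals are of interest
example.2.test <- suppressWarnings(chisq.test(example.2))

# Pearson residuals: (observed - expected) / sqrt(expected);
# negative values mark cells with fewer cases than expected
round(example.2.test$residuals, 2)
```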

Avoiding complete surprises 1

Figure 2: A correlation between two variables XX and YY

Avoiding complete surprises 2

Figure 3: A correlation between two variables XX and YY, controlled for a 3rd variable FF
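One way such a surprise can arise is when a third variable separates the data into groups. The sketch below uses simulated data only; it does not reproduce the figures, and the group labels, effect sizes, and the name within.groups are assumptions made for illustration.

```r
# Simulated data (assumed values, not the data behind the figures)
set.seed(1)
FF <- factor(rep(c("a", "b"), each = 50))

# Group "b" is shifted upwards on both XX and YY
group.shift <- ifelse(FF == "a", 0, 10)
XX <- group.shift + rnorm(100)

# Within each group, YY decreases as XX increases
YY <- group.shift - 0.8 * (XX - group.shift) + rnorm(100, sd = 0.5)

# Correlation when FF is ignored
cor(XX, YY)

# Correlations computed separately for each level of FF
within.groups <- sapply(split(data.frame(XX, YY), FF),
   function(d) cor(d$XX, d$YY))
within.groups
```

In this simulation the overall correlation is positive, while the correlation within each level of FF is negative.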

Caveats: note, however:

  • statistics don’t provide content; the researcher does
  • statistics are only useful to the extent that the researcher has been successful in
    • operationalizing their variables appropriately
    • eliciting/collecting the data correctly
    • choosing the right statistical technique

The phases of empirical quantitative studies

The phases of an empirical study

  • reconnaissance
  • hypotheses (text and statistical forms)
  • data collection ((operationalizations of) variables)
  • evaluation of hypotheses given the data
    • effect sizes
    • graphs
    • significance tests (p-values)

Phase 1 and 2: variables

  • Variables: symbols of sets of characteristics an item can exhibit
  • they can be distinguished in terms of
    • their role in an investigation
      • predictor/independent
      • response/dependent
      • confounds (controlled, accounted for, or residualized out)
      • moderators (accounted for by interactions with additional variables)
    • their information value
      • categorical: different values → different properties
      • ordinal: categorical + different values → different ranks
      • numeric (interval/ratio): categorical + ordinal + different values → sizes of differences

Example of information levels

  • Here are the fictitious results of an Olympic 100m dash
  • What is the information level of each variable in a column?
TIME   RANK  NAME       NUMBER  MEDAL
9.86   1     S. Davis   453473  1
9.91   2     J. White   563456  1
10.01  3     S. Hendry  756675  1
20.02  4     C. Lewis   585821  0
  • TIME: num
  • RANK: ord
  • NAME/NUMBER: cat
  • MEDAL: kinda depends …
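In R, these information levels can be encoded in the type of each column. A minimal sketch, assuming the table above is entered by hand; the data frame name dash is introduced here, and MEDAL is deliberately left as entered because its level is open to interpretation.

```r
# The fictitious results from the table above
dash <- data.frame(
   TIME   = c(9.86, 9.91, 10.01, 20.02),               # numeric
   RANK   = factor(1:4, ordered = TRUE),               # ordinal
   NAME   = factor(c("S. Davis", "J. White",
                     "S. Hendry", "C. Lewis")),        # categorical
   NUMBER = factor(c(453473, 563456, 756675, 585821)), # categorical
   MEDAL  = c(1, 1, 1, 0))                             # left as entered
str(dash)

# Numeric: sizes of differences are meaningful
diff(dash$TIME)

# Ordinal: only the order is meaningful
dash$RANK[1] < dash$RANK[2]
```

NUMBER consists of digits but is stored as an unordered factor, since the numbers only identify runners.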

Phase 2: What are hypotheses?

  • What are hypotheses?
    • universal statement (going beyond a singular event)
    • implicit structure of a conditional sentence
      • if …, then …
      • the more/less …, the more/less …
    • empirically testable and potentially falsifiable
    • statements postulating a distribution of one or more response variables

Phase 2: Kinds of hypotheses

  • alternative hypothesis H1: a statement postulating
    • a particular distribution of a (response) variable (goodness-of-fit)
    • a relation between 1+ predictors & 1+ response variables (independence/difference(s))
      • stipulating some difference, but not its direction: non-directional/2-tailed
      • stipulating a difference and its direction: directional/1-tailed
  • null hypothesis H0:
    • the logical counterpart to H1
    • the alternative hypothesis with a “not” in it
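The difference between non-directional and directional hypotheses shows up in how a p-value is computed. The sketch below uses a made-up scenario (60 successes in 100 trials, H0: the probability of success is 0.5); the counts and the object names are assumptions, not data from the slides.

```r
# Non-directional H1: the probability of success is not 0.5
two.tailed <- binom.test(60, 100, p = 0.5, alternative = "two.sided")

# Directional H1: the probability of success is greater than 0.5
one.tailed <- binom.test(60, 100, p = 0.5, alternative = "greater")

# Compare the two p-values
c(two.tailed = two.tailed$p.value, one.tailed = one.tailed$p.value)
```

The directional test yields the smaller p-value here because the observed deviation goes in the predicted direction; the direction has to be fixed before looking at the data.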

Phase 2: Operationalization

  • Definition 1: pinpointing and fleshing out the notions that the text hypotheses refer to
  • definition 2: translating the text hypotheses into something that involves numbers (i.e., can be counted, measured, …)
  • most frequent statistical measures
    • counts/frequencies
    • distributions
    • averages/means
    • dispersions
    • correlations
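Each of these measures corresponds to a basic R function. A minimal sketch with small made-up vectors; grp, x, and y are hypothetical names and values.

```r
# Made-up data for illustration only
grp <- factor(c("a", "b", "a", "a", "b", "b"))
x   <- c(2, 4, 4, 5, 7, 8)
y   <- c(1, 3, 2, 5, 6, 9)

table(grp)              # counts/frequencies
prop.table(table(grp))  # distribution as proportions
mean(x)                 # average: arithmetic mean
median(x)               # average: median
sd(x)                   # dispersion: standard deviation
cor(x, y)               # correlation (Pearson's r by default)
```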

Phase 3: Data storage rules

  • Store the data in the so-called case-by-variable format:
    • each data point (i.e., measurement of the dependent variable) has a row on its own
    • every variable or every other characteristic of a data point has a column on its own
    • the very first row contains the names of all variables (header)
    • missing data are marked as NA; do not use empty cells!
    • do not use numbers for categorical variables

Phase 3: Data storage (wrong)

  • Imagine you hypothesize that subjects are shorter than objects in English and collected the following data set:
Table 1: Terribly wrong format
SENTENCE                                       SUBJ  OBJ
The younger bachelors ate the nice little cat  3     4
He was locking the door                        1     2
The quick brown fox hit the lazy dog           4     3
  • how many variables? 2:
    • the response LEN
    • a predictor GRAMREL
  • how many data points? 6, but …
  • … each row has 2 data points (of LEN), not one
  • each of the columns 2 and 3 represents the levels of a variable (GRAMREL), not a variable

Phase 3: Data storage (right)

  • The last two columns of this table show the correct format:
Table 2: Correct format
CASE  SENTNO  SENTENCE                                       GRAMREL  LEN
1     1       The younger bachelors ate the nice little cat  subj     3
2     1       The younger bachelors ate the nice little cat  obj      4
3     2       He was locking the door                        subj     1
4     2       He was locking the door                        obj      2
5     3       The quick brown fox hit the lazy dog           subj     4
6     3       The quick brown fox hit the lazy dog           obj      3
  • how many variables? 2, and that’s how many columns we have there
  • how many data points? 6, and that’s how many rows we have
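The conversion from the wrong format to the correct one can be done in R. This is one possible sketch in base R, not the only way to reshape data; the data frame names wide and long are introduced here.

```r
# The wrong format: one row per sentence, one column per level of GRAMREL
wide <- data.frame(
   SENTENCE = c("The younger bachelors ate the nice little cat",
                "He was locking the door",
                "The quick brown fox hit the lazy dog"),
   SUBJ = c(3, 1, 4),
   OBJ  = c(4, 2, 3),
   stringsAsFactors = FALSE)

# The case-by-variable format: one row per measurement of LEN
long <- data.frame(
   CASE     = 1:6,
   SENTNO   = rep(1:3, each = 2),
   SENTENCE = rep(wide$SENTENCE, each = 2),
   GRAMREL  = factor(rep(c("subj", "obj"), times = 3)),
   # t() puts SUBJ and OBJ in rows, so as.vector() reads the lengths
   # sentence by sentence: subj, obj, subj, obj, ...
   LEN      = as.vector(t(wide[, c("SUBJ", "OBJ")])))
long
```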

Phase 3: Data storage: direct comparison