Intro to descriptive statistics, part 1

Stefan Th. Gries

UC Santa Barbara

JLU Giessen

27 Jul 2025 12-34-56

Introduction

The scientific method

Figure 1: Scientific method

A few self-evident objectives of empirical scientific inquiry

Description, answering the question “what happens/happened?”
explanation, answering the question “why does x happen?”
prediction, answering the question “what will happen with x if …?”
control, answering the question “how can x be influenced?”

But why use statistics for this?

To describe, explain, and predict
- objectively
- precisely
- comparably
- concisely
to cope with variability and to generalize: different samples even from the same population will yield different results
thus, we need to be able to
- quantify this variability
- separate random from systematic/meaningful variability
to assess the robustness of one’s generalizations

Three absolutely central notions

Objectivity: independence of personal opinions
reliability: precision (in the sense of ‘re-test reliability’)
validity: one measures what one wants to measure; in a sense, this is probably the most important one

Pitfalls you can avoid with proper quantitative analysis

Two English verbs verb₁ and verb₂

A published study discussed the complementation preferences of verb₁ and verb₂ with regard to two grammatical patterns on the basis of the following data:

# observed frequencies (filled column-wise), shown with row/column totals
addmargins(example.1 <- matrix(c(295, 131, 104, 35), ncol=2,
   dimnames=list(VERB=1:2, PATTERN=1:2)))

     PATTERN
VERB    1   2 Sum
  1   295 104 399
  2   131  35 166
  Sum 426 139 565

    PATTERN
VERB    1    2
   1 0.74 0.26
   2 0.79 0.21

conclusion drawn from this: “[c]omparing the postverbal elements in the two verbs, we can see that the proportion of [pattern₁] for [verb₂] is higher than for [verb₁]” …
yes, 79% > 74%, but a chi-squared test would have shown that the distribution is not significantly different from chance:


    Pearson's Chi-squared test

data:  example.1
X-squared = 1.5679, df = 1, p-value = 0.2105

thus, with this test, the author would have avoided making an incorrect overgeneralization.
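
The row percentages and the test output above can be reproduced along the following lines. This is a minimal sketch, not necessarily the code behind the slide: the names row.props and test.1 are made up here, and correct=FALSE is an assumption based on the output header, which shows no continuity correction.

```r
# the observed frequencies from the slide, filled column-wise
example.1 <- matrix(c(295, 131, 104, 35), ncol=2,
   dimnames=list(VERB=1:2, PATTERN=1:2))

# row proportions: how each verb distributes over the two patterns
row.props <- prop.table(example.1, margin=1)
round(row.props, 2)

# chi-squared test of independence without Yates's continuity correction
test.1 <- chisq.test(example.1, correct=FALSE)
test.1
```

With p > 0.05, the 5-percentage-point difference between the two verbs is compatible with chance variation.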

Two English verbs verb₁ and verb₂

Another study on two English verbs verb₁ and verb₂ discussed their complementation preferences with regard to 5 kinds of XPs on the basis of the following data:

addmargins(example.2 <- matrix(c(302,73, 8,0, 145,5, 19,3, 8,0), ncol=5,
   dimnames=list(VERB=1:2, PATTERN=c("NP", "PP", "VP", "AdjP", "AdvP"))))

     PATTERN
VERB   NP PP  VP AdjP AdvP Sum
  1   302  8 145   19    8 482
  2    73  0   5    3    0  81
  Sum 375  8 150   22    8 563

“[i]f we look at the distribution of x before major constituents we find that (a) [verb₁] is more common before noun-phrases than before other constituents” …
yes, 302 is the largest figure in the first row, or even the whole table, but the focus of much of the study was on verb₁ vs. verb₂, and compared to verb₂, verb₁ actually disprefers to occur before NPs (as shown by the Pearson residuals of a chi-squared test):

    PATTERN
VERB    NP    PP    VP  AdjP  AdvP
   1 -1.06  0.44  1.46  0.04  0.44
   2  2.59 -1.07 -3.57 -0.09 -1.07

Thus, with this test, the author would have avoided their oversight.
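
The residuals above are consistent with the Pearson residuals, (observed - expected) / sqrt(expected), that chisq.test returns. A minimal sketch follows; the name test.2 is made up here, and the warning is suppressed only because some expected frequencies are small, which matters for the p-value but not for reading off the residuals.

```r
# the observed frequencies from the slide, filled column-wise
example.2 <- matrix(c(302,73, 8,0, 145,5, 19,3, 8,0), ncol=5,
   dimnames=list(VERB=1:2, PATTERN=c("NP", "PP", "VP", "AdjP", "AdvP")))

# chisq.test warns that the approximation may be incorrect (small expected
# frequencies); we only want the residuals here, so the warning is silenced
test.2 <- suppressWarnings(chisq.test(example.2))

# positive: observed more often than expected; negative: less often
round(test.2$residuals, 2)
```

A negative residual for verb₁ with NP next to a positive one for verb₂ is what 'disprefers, compared to verb₂' means here.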

Avoiding complete surprises 1

Figure 2: A correlation between two variables XX and YY

Avoiding complete surprises 2

Figure 3: A correlation between two variables XX and YY, controlled for a 3rd variable FF
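
One way such a surprise can arise is sketched below with simulated data. These are not the data behind Figures 2 and 3; the values and the helper names shift, overall, and by.group are invented for illustration. The overall correlation between XX and YY is positive, but within every level of FF it is negative.

```r
# simulated data (an assumption, NOT the data shown in the figures)
FF <- factor(rep(c("a", "b", "c"), each=5))   # the 3rd variable
shift <- rep(c(0, 10, 20), each=5)            # each group sits higher on both axes
XX <- shift + rep(0:4, times=3)               # rises within each group
YY <- shift + rep(4:0, times=3)               # falls within each group

# correlation when FF is ignored
overall <- cor(XX, YY)

# correlation within each level of FF
by.group <- sapply(split(data.frame(XX, YY), FF),
   function(d) cor(d$XX, d$YY))

overall
by.group
```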

Caveats: note, however:

statistics don’t provide content; the researcher does
statistics are only useful to the extent that the researcher has been successful in
- operationalizing their variables appropriately
- eliciting/collecting the data correctly
- choosing the right statistical technique

The phases of empirical quantitative studies

The phases of an empirical study

reconnaissance
hypotheses (text and statistical forms)
data collection ((operationalizations of) variables)
evaluation of hypotheses given the data
- effect sizes
- graphs
- significance tests (p-values)

Phase 1 and 2: variables

Variables: symbols of sets of characteristics an item can exhibit
they can be distinguished in terms of
- their role in an investigation
  - predictor/independent
  - response/dependent
  - confounds (controlled, accounted for, or residualized out)
  - moderators (accounted for by interactions with additional variables)

- their information value
  - categorical: different values → different properties
  - ordinal: categorical + different values → different ranks
  - numeric (interval/ratio): categorical + ordinal + different values → sizes of differences
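
In R, the three information levels map roughly onto factors, ordered factors, and numeric vectors. The following is a minimal sketch with made-up values; the variable names verb.class, rating, and len.in.words are hypothetical.

```r
# categorical: values only say 'same' or 'different'
verb.class <- factor(c("transitive", "intransitive", "transitive"))

# ordinal: values can additionally be ranked
rating <- factor(c("low", "high", "mid"),
   levels=c("low", "mid", "high"), ordered=TRUE)

# numeric: differences between values are meaningful
len.in.words <- c(3, 1, 4)

verb.class[1] == verb.class[3]         # identity comparisons work for categories
rating[1] < rating[2]                  # rank comparisons work for ordered factors
len.in.words[3] - len.in.words[2]      # differences only make sense for numbers
```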

Example of information levels

Here are the fictitious results of an Olympic 100m dash
What is the information level of the variable in each column?

TIME	RANK	NAME	NUMBER	MEDAL
9.86	1	S. Davis	453473	1
9.91	2	J. White	563456	1
10.01	3	S. Hendry	756675	1
20.02	4	C. Lewis	585821	0

TIME: num
RANK: ord
NAME/NUMBER: cat
MEDAL: kinda depends …
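
The table and the answers above could be encoded in R as follows. This is a sketch under the assumption that one wants R's data types to mirror the information levels; dash and MEDAL.CAT are names made up here. MEDAL is stored twice to show why it 'depends': as a count it is numeric, as 'medal vs. no medal' it is categorical.

```r
dash <- data.frame(
   TIME   = c(9.86, 9.91, 10.01, 20.02),                    # numeric
   RANK   = factor(1:4, ordered=TRUE),                      # ordinal
   NAME   = c("S. Davis", "J. White", "S. Hendry", "C. Lewis"),
   NUMBER = factor(c(453473, 563456, 756675, 585821)),      # categorical despite digits
   MEDAL  = c(1, 1, 1, 0))                                  # numeric reading: a count

# categorical reading of MEDAL: two unordered classes
dash$MEDAL.CAT <- factor(dash$MEDAL, levels=c(0, 1), labels=c("no", "yes"))

str(dash)
```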

Phase 2: What are hypotheses?

What are hypotheses?
- universal statement (going beyond a singular event)
- implicit structure of a conditional sentence
  - if …, then …
  - the more/less …, the more/less …
- empirically testable and potentially falsifiable
- statements postulating a distribution of one or more response variables

Phase 2: Kinds of hypotheses

alternative hypothesis H₁: a statement postulating
- a particular distribution of a (response) variable (goodness-of-fit)
- a relation between 1+ predictors & 1+ response variables (independence/difference(s))
  - stipulating some difference, but not its direction: non-directional/2-tailed
  - stipulating a difference and its direction: directional/1-tailed
null hypothesis H₀:
- the logical counterpart to H₁
- the alternative hypothesis with a ‘not’ in it
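
The difference between a non-directional and a directional H₁ shows up in how the p-value is computed. A minimal sketch with made-up counts (60 of 100 trials show some outcome; H₀: the probability is 0.5); the numbers and the names successes, trials, two.tailed, and one.tailed are hypothetical and serve only to illustrate the alternative argument of binom.test.

```r
# hypothetical data: 60 'successes' in 100 trials
successes <- 60
trials <- 100

# non-directional H1: the probability differs from 0.5 (2-tailed)
two.tailed <- binom.test(successes, trials, p=0.5,
   alternative="two.sided")$p.value

# directional H1: the probability is greater than 0.5 (1-tailed)
one.tailed <- binom.test(successes, trials, p=0.5,
   alternative="greater")$p.value

c(two.tailed=two.tailed, one.tailed=one.tailed)
```

When the observed effect goes in the predicted direction, the 1-tailed p-value is the smaller one, which is why the direction must be fixed before looking at the data.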

Phase 2: Operationalization

Definition 1: pinpointing and fleshing out the notions that the text hypotheses refer to
definition 2: translating the text hypotheses into something that involves numbers (i.e., can be counted, measured, …)
most frequent statistical measures
- counts/frequencies
- distributions
- averages/means
- dispersions
- correlations
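
Each of these measures has a base R counterpart. The sketch below reuses the fictitious 100m results from the information-level example; which function suits which variable depends on its information level, so the pairing shown here is one reasonable choice, not the only one.

```r
TIME  <- c(9.86, 9.91, 10.01, 20.02)
RANK  <- 1:4
MEDAL <- c(1, 1, 1, 0)

table(MEDAL)                          # counts/frequencies
prop.table(table(MEDAL))              # distribution as proportions
mean(TIME)                            # average: arithmetic mean
median(TIME)                          # average: median, robust to the slowest time
sd(TIME)                              # dispersion: standard deviation
cor(TIME, RANK, method="spearman")    # rank correlation
```

The mean is pulled well above the median here by the single slowest time.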

Phase 3: Data storage rules

Store the data in the so-called case-by-variable format:
- each data point (i.e., measurement of the dependent variable) has a row on its own
- every variable or every other characteristic of a data point has a column on its own
- the very first row contains the names of all variables (header)
- missing data are marked as NA; do not use empty cells!
- do not use numbers for categorical variables
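
A file that follows these rules loads into R without extra cleaning. The sketch below reads a small tab-separated table from a string instead of a file; the four rows, including the NA, are made up for illustration, and the names raw and d are hypothetical.

```r
# header row first, one data point per row, NA for the missing value,
# and strings (not numbers) for the categorical variable
raw <- paste(
   "CASE\tGRAMREL\tLEN",
   "1\tsubj\t3",
   "2\tobj\tNA",
   "3\tsubj\t1",
   "4\tobj\t2",
   sep="\n")

d <- read.delim(text=raw, stringsAsFactors=TRUE)
str(d)

sum(is.na(d$LEN))            # how many measurements are missing
mean(d$LEN, na.rm=TRUE)      # NA must be excluded explicitly
```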

Phase 3: Data storage (wrong)

Imagine you hypothesized that subjects are shorter than objects in English and collected the following data set:

Table 1: Terribly wrong format

SENTENCE	SUBJ	OBJ
The younger bachelors ate the nice little cat	3	4
He was locking the door	1	2
The quick brown fox hit the lazy dog	4	3

how many variables? 2:
- the response LEN
- a predictor GRAMREL
how many data points? 6, but …
… each row has 2 data points (of LEN), not one
columns 2 and 3 each represent one level of a variable (GRAMREL), not a variable of their own

Phase 3: Data storage (right)

The last two columns of this would be the correct format:

Table 2: Correct format

CASE	SENTNO	SENTENCE	GRAMREL	LEN
1	1	The younger bachelors ate the nice little cat	subj	3
2	1	The younger bachelors ate the nice little cat	obj	4
3	2	He was locking the door	subj	1
4	2	He was locking the door	obj	2
5	3	The quick brown fox hit the lazy dog	subj	4
6	3	The quick brown fox hit the lazy dog	obj	3

how many variables? 2, and that’s how many columns we have there
how many data points? 6, and that’s how many rows we have
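
Going from the wrong format to the right one can be done in base R. This is a minimal sketch, one of several possible ways; the data frame names wide and long are made up here, while the values are those of Tables 1 and 2.

```r
# Table 1: one row per sentence, two measurements per row
wide <- data.frame(
   SENTENCE = c("The younger bachelors ate the nice little cat",
                "He was locking the door",
                "The quick brown fox hit the lazy dog"),
   SUBJ = c(3, 1, 4),
   OBJ  = c(4, 2, 3))

# Table 2: one row per measurement of LEN
long <- data.frame(
   CASE     = 1:6,
   SENTNO   = rep(1:3, each=2),
   SENTENCE = rep(wide$SENTENCE, each=2),
   GRAMREL  = factor(rep(c("subj", "obj"), times=3)),
   LEN      = c(rbind(wide$SUBJ, wide$OBJ)))   # interleaves subj, obj per sentence

long[, c("GRAMREL", "LEN")]

# with one variable per column, the comparison is a one-liner
tapply(long$LEN, long$GRAMREL, mean)
```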
