Corrections to Nelson (2023):

DP(norm) and KLD(norm) are not wrong on pi at all

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

14 Feb 2024, 12:19:01

1 Introduction

In February 2023, the Journal of Quantitative Linguistics published Nelson (2023), a “review article” discussing dispersion measures with an emphasis on my recent work involving, in particular, the measures of

  • Deviation of Proportions DP/DPnorm (raw and normalized), as discussed, among other places, in Gries (2008, 2020) and Lijffijt & Gries (2012);
  • the Kullback-Leibler divergence DKLnorm, as discussed in Gries (2020) and in more detail in my forthcoming book Gries (to appear).

While there is a lot to be discussed there, this small paper is intended as a correction of several aspects of Nelson’s paper, focusing for the most part on two false computations which produce results that make DPnorm and DKLnorm seem unable to correctly quantify the distribution/dispersion of the first one million digits of pi. Here is the relevant quote from Nelson (2023:158):


consider a corpus made from the first 1,000,000 digits of pi. Intuition suggests that if we treat this as a corpus of the digits 0 through 9, every digit should get a score near 0 as there are (famously) no predictable biases in the distribution of the digits of pi. In fact, when this corpus is partitioned into 1,000 bins of 1,000 digits, every digit receives a DPnorm score of 0.495 – nearly the midpoint of the range of the score – while receiving a DKLnorm score of 0.9. Figure 2 shows why. In the left frame of the figure, the heights of the bars show the number of times ‘7’ occurs in a part while the numbers along the bottom show the position (in pi) of the part. The heights of the bars resemble a random walk between 80 and 130, an impression reinforced by the symmetry of their distribution in the right frame. This random noise alone is enough to hold these scores far from any value that might intuitively (and probably empirically) be interpreted as unbiased and even.


As Nelson states correctly, both DP/DPnorm and DKLnorm are scaled such that they fall into the interval [0, 1] and are oriented such that,

  • if words in a corpus are distributed evenly/regularly, their values should be at the bottom end of that scale;
  • if words in a corpus are distributed unevenly/clumpily, their values should be at the top end of that scale.

Therefore, it indeed follows that, when presented with 1 million ‘words’/digits of pi that are distributed with “no predictable biases” over 1000 ‘corpus files’/bins, both measures should return values that are close to their theoretical minima, which also means that the results Nelson reports would, if true, probably lead anyone to never consider DPnorm and DKLnorm as measures of dispersion again. The main purpose of this correction is to demonstrate what the real results are and how that makes at least this part of Nelson’s argumentation collapse. To make my points as transparent and replicable as possible, this paper already contains the R code with which everyone can replicate the results for themselves – all one needs is an installation of R, the package magrittr, and internet access; readers can access a rendered Quarto document with all content, code, and my session info at https://www.stgries.info/research/2024_STG_OnRNN_JQL.html.
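To see these two orientations concretely before turning to pi, here is a minimal toy sketch (with my own illustrative numbers, not data from Nelson’s paper): DP for a ‘word’ distributed over 4 equally-sized corpus parts, once perfectly evenly and once maximally clumpily.

```r
# toy sketch of DP's orientation (illustrative values, 4 equal-sized parts)
exp.prior.toy <- rep(0.25, 4)          # expected: each part is 1/4 of the corpus
obs.even      <- c(0.25, 0.25, 0.25, 0.25) # word spread perfectly evenly
obs.clump     <- c(1, 0, 0, 0)             # word concentrated in one part
DP <- function(obs, exp) sum(abs(obs-exp))/2 # DP as in Gries (2008)
DP(obs.even , exp.prior.toy) # 0: bottom of the scale (even)
DP(obs.clump, exp.prior.toy) # 0.75 = 1-min(exp.prior.toy): top of the scale
```

Dividing by 1-min(exp.prior.toy) then yields the normalized DPnorm values of 0 and 1, the endpoints of the intended scale.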

rm(list=ls(all=TRUE)); library(magrittr)

2 The overall distribution of the digits of pi

To demonstrate the falsity of both of Nelson’s results (and some additional small problems), let’s first download the first one million digits of pi, here from a website from the University of Exeter:

str(pi.1m <-
   "http://newton.ex.ac.uk/research/qsystems/collabs/pi/pi6.txt" %>% 
   scan(what="", quiet=TRUE))
 chr [1:100101] "3." "\f" "1415926535" "8979323846" "2643383279" ...

Given the way pi is provided there, we need to prepare the data a bit to get a vector of 1 million digits:

pi.1m <- pi.1m        %>% # make pi.1m the result of
   grep(                  # retaining only the elements
      "^\\d+$", .,        # that contain only digits
      perl=TRUE,          # using PCRE
      value=TRUE)     %>% # returning the strings w/ digits, then
   paste(collapse="") %>% # pasting it all into 1 string
   strsplit(split="") %>% # splitting up at every digit
   unlist                 # making that a vector
head(pi.1m, 25) # check the result
 [1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3"

Just to make absolutely sure that these are the right numbers, let’s check these digits against those from another website:

str(pi.1m.check <- scan("https://assets.angio.net/pi1000000.txt", what="", quiet=TRUE))
 chr "3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865"| __truncated__

There, too, we need a little bit of prep to get all the digits into a single vector:

pi.1m.check <-            # make pi.1m.check the result of
   sub(                   # replacing
      "^..", "",          # the 1st two characters by nothing
      pi.1m.check,        # in pi.1m.check
      perl=TRUE)      %>% # using PCRE, then
   strsplit(split="") %>% # splitting up at every digit
   unlist                 # making that a vector
head(pi.1m.check, 25) # check the result
 [1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3"

Crucially, both vectors contain the same digits so we can proceed with just pi.1m for the rest of the paper; for one sanity check later, we also compute the frequencies of the digits:

all(pi.1m==pi.1m.check) # both vectors are identical
[1] TRUE
rm(pi.1m.check); gc() # memory management
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  659182 35.3    1410772 75.4  1410772 75.4
Vcells 2211269 16.9    8388608 64.0  6846256 52.3
table(pi.1m) # frequencies of digits
pi.1m
     0      1      2      3      4      5      6      7      8      9 
 99959  99758 100026 100229 100230 100359  99548  99800  99985 100106 

Nelson then “partitioned [this corpus] into 1,000 bins of 1,000 digits” but unfortunately does not say how exactly this was done, so for our replication we will have to guess and explore the two most straightforward options. This is the first, which repeats the sequence of the 1000 bin names 1000 times, …

bins.times <- formatC( # make bins.times a character vector:
   # repeat the sequence 1:1000 a thousand times
   rep(1:1000, times=1000),
   width=4,            # make the elements 4 chars wide
   flag="0")           # with leading 0s
bins.times[c(1:3, 999998:1000000)]
[1] "0001" "0002" "0003" "0998" "0999" "1000"

… and this is the second, which repeats the name of each of the 1000 bins 1000 times:

bins.each <- formatC( # make bins.each a character vector:
   # repeat each of the numbers 1:1000 a thousand times
   rep(1:1000, each=1000),
   width=4,           # make the elements 4 chars wide
   flag="0")          # with leading 0s
bins.each[c(1:3, 999998:1000000)]
[1] "0001" "0001" "0001" "1000" "1000" "1000"

Many dispersion measures are straightforwardly computable from what, with regular corpora, would be term-document matrices and what, here, are digit-bin matrices. For each way of defining the bins, we compute one version with observed absolute frequencies and one with row proportions (because these latter ones will represent how much of each word as a proportion of all its uses is observed in each corpus bin):

# compute the term-document / digit-bin matrices
PI.tdm.times.abs <- # make PI.tdm.times.abs the result of
   table(                    # cross-tabulating
      PIsDIGITS =pi.1m,      # pi.1m with
      CORPUSBINS=bins.times) # the bins
PI.tdm.times.abs[,1:4] # check result
         CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
        0  104   83  107   96
        1  102  100  106  100
        2   93   88  102   97
        3  102  113   84   87
        4  106  110   97   92
        5  103  106   97  113
        6  100   95  110  103
        7  115  103  109  101
        8   93  104   96  102
        9   82   98   92  109
PI.tdm.times.rel <- # make PI.tdm.times.rel the result of
   PI.tdm.times.abs %>% # taking PI.tdm.times.abs and
   prop.table(margin=1) # convert to row percentages
PI.tdm.times.rel[,1:4] # check result
         CORPUSBINS
PIsDIGITS         0001         0002         0003         0004
        0 0.0010404266 0.0008303404 0.0010704389 0.0009603938
        1 0.0010224744 0.0010024259 0.0010625714 0.0010024259
        2 0.0009297583 0.0008797713 0.0010197349 0.0009697479
        3 0.0010176695 0.0011274182 0.0008380808 0.0008680123
        4 0.0010575676 0.0010974758 0.0009677741 0.0009178889
        5 0.0010263155 0.0010562082 0.0009665302 0.0011259578
        6 0.0010045405 0.0009543135 0.0011049946 0.0010346767
        7 0.0011523046 0.0010320641 0.0010921844 0.0010120240
        8 0.0009301395 0.0010401560 0.0009601440 0.0010201530
        9 0.0008191317 0.0009789623 0.0009190258 0.0010888458
# compute the term-document / digit-bin matrices
PI.tdm.each.abs <- # make PI.tdm.each.abs the result of
   table(                   # cross-tabulating
      PIsDIGITS =pi.1m,     # pi.1m with
      CORPUSBINS=bins.each) # the bins
PI.tdm.each.abs[,1:4] # check result
         CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
        0   93   89   77  103
        1  116   96   97  120
        2  103  104   96  105
        3  102   86   77  103
        4   93  102  123   87
        5   97  108  110  102
        6   94  106  102   96
        7   95  102   90   90
        8  101  101  108   95
        9  106  106  120   99
PI.tdm.each.rel <- # make PI.tdm.each.rel the result of
   PI.tdm.each.abs %>%  # taking PI.tdm.each.abs and
   prop.table(margin=1) # convert to row percentages
PI.tdm.each.rel[,1:4] # check result
         CORPUSBINS
PIsDIGITS         0001         0002         0003         0004
        0 0.0009303815 0.0008903650 0.0007703158 0.0010304225
        1 0.0011628140 0.0009623288 0.0009723531 0.0012029110
        2 0.0010297323 0.0010397297 0.0009597505 0.0010497271
        3 0.0010176695 0.0008580351 0.0007682407 0.0010276467
        4 0.0009278659 0.0010176594 0.0012271775 0.0008680036
        5 0.0009665302 0.0010761367 0.0010960651 0.0010163513
        6 0.0009442681 0.0010648130 0.0010246313 0.0009643589
        7 0.0009519038 0.0010220441 0.0009018036 0.0009018036
        8 0.0010101515 0.0010101515 0.0010801620 0.0009501425
        9 0.0010588776 0.0010588776 0.0011987293 0.0009889517

For example, the 93 in PI.tdm.each.abs’s top left cell and the 0.0009303815 in the same cell of PI.tdm.each.rel reflect that the 93 instances of ‘0’ in bin “0001” correspond to 0.0009303815 of the 99959 instances of ‘0’ in the 1m digits of pi.

For both dispersion measures, we then also need the expected, or prior, distribution. In this example, where we created the corpus files/bins as above, these are of course just one thousand instances of 0.001:

exp.prior <-      # make exp.prior the result of
   bins.times %>% # taking the bins,
   table      %>% # tabulating them, &
   prop.table     # making them proportions
# same as rep(0.001, 1000)
exp.prior %>% table
.
0.001 
 1000 

Nelson then shows the “noisy but unbiased dispersion of ‘7’ in pi” in his Figure 2 (also p. 158). The right panel of his Figure 2 is fine and is replicated here as Figure 1:

hist(                          # plot a histogram of the
   PI.tdm.times.rel["7",]*1E5, # freqs of 7 in the bins of pi
   breaks=50,           # approximating N's bin width
   main="",             # suppress a main heading
   xlab="Count of '7'", # replicate his ...
   ylab="N Bins")       # axis labels

Figure 1: Nelson’s Figure 2 (right)

However, the left panel of his Figure 2 also seems wrong. Its x-axis ranges from 1 to 1 million, presumably because the x-axis represents the one million digits of pi, even though – check the quote at the beginning of this paper – Nelson says “the numbers along the bottom show the position (in pi) of the part” (my emphasis), meaning the x-axis should actually go from 1 to 1000. This x-axis labeling error seems confirmed by the fact that Nelson’s y-axis is labeled “Count of ‘7’” and ranges from 0 to 140, suggesting that those are the frequencies of ‘7’ in every x-axis bin (as he described in the quote) and that the vertical lines shown in the plot are the frequencies of ‘7’ summarized in the histogram. But one cannot have it both ways: only one of the two following plots would be correct:

  • plot 1, where the x-axis goes from 1 to 1m (for each word): This plot would be the kind of dispersion plot many apps provide, where each slot on the x-axis is for one word and each vertical line indicates whether the word in that corpus slot is a word in question (like, here, ‘7’). However, in such a plot, the y-axis would go from 0 (for FALSE, the digit is not ‘7’) to 1 (for TRUE, the digit is ‘7’), not from 0 to 140. Or,
  • plot 2, where the x-axis goes from 1 to 1000 (for each bin) and which, from his description, seems to be the intended one: this plot would indicate the frequency of ‘7’ in each bin and then, yes, the y-axis needs to go from 0 to 140. In Figure 2, I am showing the plot he meant to show (but with a main heading that already anticipates the averages of the results we will obtain below).
plot(type="h",
   PI.tdm.times.abs["7",],
   main=expression(paste("'7' in ", pi, ": ", italic(DP)[norm], "≈0.0367, ", italic(D)[KLnorm], "≈0.0042")),
   xlab="[This time with the correct x-axis]",
   ylab="Count of '7'", ylim=c(0, 140))

Figure 2: Nelson’s Figure 2 (left), one corrected version

But his plot mixes up the axes: it has the x-axis from plot 1 and the y-axis from plot 2. Thus, it pretends that there are 1 million counts of ‘7’ between 80 and 140, i.e. around 100 million data points rather than the 1 million digits of pi he had us consider.
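For concreteness, here is a sketch of what ‘plot 1’ would look like – with a small vector of simulated digits rather than the downloaded ones, so that the snippet is self-contained:

```r
# sketch of 'plot 1': one x-axis slot per digit, y = 0/1 (digit is '7' or not);
# simulated digits stand in for the downloaded ones (my own toy data)
set.seed(42)
digits.sim <- sample(0:9, 1000, replace=TRUE)
is.seven   <- as.numeric(digits.sim==7) # 1 iff the digit is '7', else 0
plot(type="h", is.seven, ylim=c(0, 1),
   xlab="Position in the (simulated) digit sequence",
   ylab="Digit is '7'?")
```

Note how the y-axis of such a plot can only go from 0 to 1, never to 140.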

3 Computing DPnorm correctly

From the digit-bin matrices, we can compute the DP-values in a very R way: by applying an anonymous function to each row/digit (MARGIN=1) of PI.tdm.times.rel (and PI.tdm.each.rel). The function takes the row as an argument internally called obs.post (for ‘observed/posterior’) and then computes half of the sum of all absolute pairwise differences between obs.post and exp.prior, which is the formula for DP that Nelson also provided:

DP.times <- apply(                 # make DP.times the result of applying
   PI.tdm.times.rel, MARGIN=1,     # to PI.tdm.times.rel, namely each row
   FUN=\(obs.post)                 # an anonymous function that computes DP:
   sum(abs(obs.post-exp.prior))/2) # sum(|O-E|) / 2 (see Nelson, eq. (1))
DP.each <- apply(                  # make DP.each the result of applying
   PI.tdm.each.rel, MARGIN=1,      # to PI.tdm.each.rel, namely each row
   FUN=\(obs.post)                 # an anonymous function that computes DP:
   sum(abs(obs.post-exp.prior))/2) # sum(|O-E|) / 2 (see Nelson, eq. (1))

These two sets of DP-values can then be normalized to the final DPnorm-values:1

(DP.times / (1-min(exp.prior))) %>% round(4) # one kind of bins
     0      1      2      3      4      5      6      7      8      9 
0.0381 0.0373 0.0380 0.0384 0.0380 0.0388 0.0371 0.0376 0.0373 0.0394 
(DP.each / (1-min(exp.prior))) %>% round(4) # other kind of bins
     0      1      2      3      4      5      6      7      8      9 
0.0364 0.0383 0.0385 0.0370 0.0377 0.0375 0.0387 0.0358 0.0388 0.0373 

This is a result that makes a lot of sense given that, as per Nelson himself, “there are (famously) no predictable biases in the distribution of the digits of pi”: the very low DPnorm values are perfectly compatible with that. However, it is certainly not the case that “every digit receives a DPnorm score of 0.495 – nearly the midpoint of the range of the score”; that is not even close to the truth. The DPnorm results he discusses on p. 158 and in Table 1 on p. 161 are wrong: the histogram in Figure 1 and the plot in Figure 2 are in fact perfectly compatible with a “value that might intuitively (and probably empirically) be interpreted as unbiased and even.”
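One can also verify that values of this magnitude are exactly what sampling noise alone produces: the following small simulation (my own sketch, independent of the pi data) draws one million unbiased random digits and computes DPnorm the same way as above.

```r
set.seed(1)                                  # for replicability
digits.sim <- sample(0:9, 1e6, replace=TRUE) # 1m unbiased random 'digits'
bins.sim   <- rep(1:1000, each=1000)         # 1000 bins of 1000 digits
obs.sim    <- prop.table(                    # row proportions of the
   table(digits.sim, bins.sim), margin=1)    # digit-bin matrix
exp.sim    <- rep(0.001, 1000)               # every bin is 1/1000 of the corpus
DPnorm.sim <- apply(obs.sim, MARGIN=1,       # DPnorm for each digit
   FUN=\(obs.post) sum(abs(obs.post-exp.sim))/2) / (1-min(exp.sim))
round(range(DPnorm.sim), 4) # values near 0.037-0.04, just as with pi
```

In other words, even a generator that is unbiased by construction yields DPnorm values in the 0.035-0.04 range for this corpus size and binning, nowhere near Nelson’s 0.495.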

4 Computing DKLnorm correctly

First and for the record, Nelson makes another small mistake in his definition of DKLnorm. On p. 155 of his article, he defines the Kullback-Leibler Divergence with his equation (4), which is shown here:

\[D_{KLnorm} = D(O \| E) = \sum_{i=1}^{N} O_i \ln \frac{O_i}{E_i} \tag{1}\]

To that, he adds “Where (as above) O is a vector of observed proportions and E is the vector of expected proportions” (also p. 155). That is of course not correct: his equation (4), shown here as Equation 1, does not compute the normalized DKL but the regular, unnormalized DKL – something Nelson implicitly admits in his very next sentence, where he says “InGries [sic] (2022), a normalization of this measure’s score, \(D_{KLnorm} = 1 – e^{−Dkl}\), is proposed to fit the outputs to the unit interval.” In other words, his equation (4) is not yet the normalized version; the normalized one we only get after also applying \(D_{KLnorm} = 1 – e^{−D_{KL}}\).2
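The effect of that normalization is easy to see with a few hypothetical raw DKL values (my own illustration, not Nelson’s or my published data): it maps DKL’s range [0, ∞) onto the unit interval [0, 1).

```r
# a few hypothetical raw D_KL values (illustration only)
dkl.raw <- c(0, 0.0045, 0.5, 5)
round(1-exp(-dkl.raw), 4) # 0.0000 0.0045 0.3935 0.9933
# a D_KL of 0 stays 0; large divergences approach (but never reach) 1
```

Note also that, for very small divergences, the normalized value is nearly identical to the raw one, which is why the tables below look so similar before and after normalization.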

Let us now compute the DKLnorm-values from the digit-bin matrices. For that, we first define a function that computes DKL, which I here call KLDs; to try to replicate Nelson’s procedure, I use the natural log in the function, as he did.

KLDs <- function (obs.post, exp.prior) {
   sum(obs.post * log(obs.post/exp.prior))
}
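As a quick sanity check of this function (with made-up proportions of my own, and with the function repeated so this snippet runs on its own): the KLD of a distribution against itself is 0, and it grows as the distributions diverge.

```r
KLDs <- function (obs.post, exp.prior) { # as defined above
   sum(obs.post * log(obs.post/exp.prior))
}
p.even  <- c(0.25, 0.25, 0.25, 0.25) # hypothetical 'observed' proportions
p.clump <- c(0.70, 0.10, 0.10, 0.10) # a more clumped alternative
KLDs(p.even , p.even) # 0: identical distributions do not diverge
KLDs(p.clump, p.even) # ~0.4458: clearly greater than 0
```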

Then, we again apply this function to every row of PI.tdm.times.rel and PI.tdm.each.rel:

KLD.times <- apply(            # make KLD.times the result of applying
   PI.tdm.times.rel, MARGIN=1, # to PI.tdm.times.rel, namely each row
   KLDs, exp.prior)            # the function computing the KLD
KLD.times %>% round(4) # check the result
     0      1      2      3      4      5      6      7      8      9 
0.0045 0.0044 0.0045 0.0046 0.0045 0.0048 0.0043 0.0044 0.0044 0.0049 
KLD.each <- apply(            # make KLD.each the result of applying
   PI.tdm.each.rel, MARGIN=1, # to PI.tdm.each.rel, namely each row
   KLDs, exp.prior)           # the function computing the KLD
KLD.each %>% round(4) # check the result
     0      1      2      3      4      5      6      7      8      9 
0.0043 0.0044 0.0046 0.0042 0.0045 0.0043 0.0047 0.0040 0.0047 0.0044 

Any reader who does not trust my function can double-check the result for, say, ‘7’, with the following code (for PI.tdm.each.rel):

apply(PI.tdm.each.rel, MARGIN=1,
      entropy::KL.plugin, exp.prior) %>% round(4)
     0      1      2      3      4      5      6      7      8      9 
0.0043 0.0044 0.0046 0.0042 0.0045 0.0043 0.0047 0.0040 0.0047 0.0044 

And then we normalize using the above-mentioned strategy:

(1-exp(-KLD.times)) %>% round(4)
     0      1      2      3      4      5      6      7      8      9 
0.0045 0.0044 0.0045 0.0046 0.0045 0.0048 0.0043 0.0044 0.0044 0.0049 
(1-exp(-KLD.each)) %>% round(4)
     0      1      2      3      4      5      6      7      8      9 
0.0043 0.0044 0.0046 0.0042 0.0045 0.0043 0.0047 0.0040 0.0047 0.0044 

As before with DPnorm, this result is exactly what one would expect DKL or DKLnorm to return for the digits of pi – what Nelson (2023:158) himself formulated as an expectation, namely values close to 0, because that is the kind of “value that might intuitively (and probably empirically) be interpreted as unbiased and even”. But here Nelson’s result is, in a sense, not even ‘only’ half off: he claims results of 0.9, at the opposite end of the interval of values that DKLnorm can assume, while the real values are very close to 0.

5 Summary

In sum, Nelson (2023) commits five mistakes:

  • the left panel of Figure 2 misrepresents the data studied;
  • equation (4) labels DKL as DKLnorm;
  • the results reported for measuring the dispersion of the first 1m digits of pi are false for
    • DPnorm (p. 158);
    • DKLnorm (p. 158);
  • the DPnorm results in Table 1 are false.

To reiterate: given the unpatterned first million digits of pi and contra Nelson, both DPnorm and DKLnorm return exactly the results the measures are sensibly claimed to return: values close to 0, indicating even distributions/dispersions. This does of course not prove that either measure works perfectly for all realistic combinations of corpus sizes (in words and in parts) and word frequencies: we definitely need more studies that (i) explore how the dispersion measures we already have ‘react’ to different scenarios, (ii) develop new measures, and, importantly, (iii) validate them against other, external data (see, e.g., Biber et al. 2016, Burch et al. 2017, Gries 2022). Given the importance of dispersion as a corpus-linguistic quantity, it would certainly be great if corpus linguistics as a field devoted as much energy to dispersion measures as it has to association measures, but this of course requires that the computations are done correctly and replicably, which was the main point of this short note.

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods  
[8] base     

other attached packages:
[1] STGmisc_1.0    Rcpp_1.0.12    magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] digest_0.6.34     fastmap_1.1.1     xfun_0.42         knitr_1.45       
 [5] htmltools_0.5.7   rmarkdown_2.25    cli_3.6.2         rstudioapi_0.15.0
 [9] tools_4.3.2       entropy_1.3.1     evaluate_0.23     yaml_2.3.8       
[13] rlang_1.1.3       jsonlite_1.8.8    htmlwidgets_1.6.4

6 References

Footnotes

  1. In this very specific case, where all digit bins are equally large – 1/1000 – the normalization formulae from Gries (2008) and Lijffijt & Gries (2012) amount to the same results; the R code I give here is the one that would work in the more general case where not all corpus files are equally large.↩︎

  2. Also, the paper of mine he cites for DKLnorm as a dispersion measure used the binary log, not the natural log Nelson’s equation uses.↩︎