Benford's law

Masquerouge (Deity · joined Jun 3, 2002 · 17,790 messages · Mountain View, CA)
So I recently learned about Benford's law.

I suggest this:
http://en.wikipedia.org/wiki/Benford's_law

and this:
http://www.math.gatech.edu/~hill/publications/cv.dir/1st-dig.pdf

for a detailed explanation, but the general idea is that the first digits of random numbers drawn from random sets are not uniformly distributed between 1 and 9. On the contrary:
Leading digit - Probability
1 ----------- 30.1%
2 ----------- 17.6%
3 ----------- 12.5%
4 ----------- 9.7%
5 ----------- 7.9%
6 ----------- 6.7%
7 ----------- 5.8%
8 ----------- 5.1%
9 ----------- 4.6%

This means that if, for instance, you decided to collect all the numbers on the front page of various newspapers, thus ending up with random numbers from random sets (lottery numbers, temperatures, casualties, etc.), then on average 30.1% of these numbers would start with a 1, 17.6% would start with a 2, and so on...
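(Editor's note: the percentages in the table match the standard Benford formula, P(d) = log10(1 + 1/d) for d = 1..9; a minimal sketch, not part of the original thread, that reproduces them:)

```python
import math

# Benford's law: the probability that the leading digit is d
# is log10(1 + 1/d), for d = 1..9.
def benford_probability(d: int) -> float:
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
# Prints 1: 30.1%, 2: 17.6%, ..., 9: 4.6% -- the table above.
# The nine probabilities sum to exactly 1.
```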

The thing I do not understand is: why? I can explain what the law says, but I don't understand why it works. Could someone please explain it to me?
 
Because not all sets go as high as the nineties?
 
I think it's because the size of the measuring unit is arbitrarily chosen. Like, say, a foot (based on the human foot) versus, I dunno, the lengths of various animals. You should expect a quasi-uniform distribution of the logarithms of the values, over a range of many orders of magnitude.

Suppose the animals' lengths in feet were a uniform linear distribution, instead. That would just be really weird. It would mean that most animals would have to be within an order of magnitude of the size of a blue whale.

This has always made intuitive sense to me because I think in terms of geometric progressions, orders of magnitude, etc. rather than linear progressions. But I'm not sure I can explain why it makes sense. I read something interesting by a philosopher on this, once, I'll see if I can dig that up.
 
Our counting system starts with 1, so it's obvious this will be used more than 2, etc. For example, if the street you live in counts up to #325, then the first digits occur as follows:

1 - 111x (1, 10-19, 100-199)
2 - 111x (2, 20-29, 200-299)
3 - 37x (3, 30-39, 300-325)
4 - 11x (4, 40-49)
5 - 11x
6 - 11x
7 - 11x
8 - 11x

etc
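(Editor's note: the street-number tally above is easy to verify with a few lines; a sketch counting the leading digit of every house number from 1 to 325:)

```python
from collections import Counter

# Tally the leading decimal digit of every house number from 1 to 325.
counts = Counter(int(str(n)[0]) for n in range(1, 326))

print(counts)
# Digit 1: 111 times (1, 10-19, 100-199), digit 2: 111 times,
# digit 3: 37 times (3, 30-39, 300-325), digits 4-9: 11 times each.
```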
 
I think it's because the size of the measuring unit is arbitrarily chosen. Like, say, a foot (based on the human foot) versus, I dunno, the lengths of various animals. You should expect a quasi-uniform distribution of the logarithms of the values, over a range of many orders of magnitude.

Okay, I think that's what they're saying, and that's exactly what I don't get. Why should I expect a quasi-uniform distribution of the logarithms, and why does having a quasi-uniform distribution of the logarithms of the values mean that I will end up with 30% of the numbers starting with 1, 17% starting with 2, etc.?



Our counting system starts with 1, so it's obvious this will be used more than 2, etc. For example, if the street you live in counts up to #325, then the first digits occur as follows:

I understand your example, but I'm not sure that's what the explanation is - and that's too bad, because this one I understand :)
 
Pick some random numbers of arbitrary size. List the integers going up from 1 to each of those numbers. You'll get sequences looking something like this:

*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
*1, 2, 3
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

(I picked 4 random numbers up to 100)

Numbers produced that begin with 1: 30.
Numbers produced that begin with 2: 24.
Numbers produced that begin with 3: 18.
et cetera...
Numbers produced that begin with 9: 3.

If you pick a more skewed cutoff, such as "up to 199", the pattern is even stronger. If a lottery has 20,000 tickets, more than half of them have a number beginning with a 1.

Put another way, the count of numbers between 1 and N inclusive that begin with the digit "1" is always greater than or equal to the count of numbers between 1 and N inclusive that begin with any other digit.

Does that help?
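(Editor's note: the experiment above is easy to replicate; a sketch using the same four cutoffs from the post, 33, 3, 47 and 15:)

```python
from collections import Counter

cutoffs = [33, 3, 47, 15]  # the four "random numbers up to 100" from the post

# For each cutoff N, list the integers 1..N and tally their leading digits.
counts = Counter()
for n_max in cutoffs:
    for n in range(1, n_max + 1):
        counts[int(str(n)[0])] += 1

print(counts[1], counts[2], counts[3], counts[9])  # 30 24 18 3
```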
 
Pick some random numbers of arbitrary size. List the integers going up from 1 to each of those numbers. You'll get sequences looking something like this:

*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
*1, 2, 3
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

(I picked 4 random numbers up to 100)

Numbers produced that begin with 1: 30.
Numbers produced that begin with 2: 24.
Numbers produced that begin with 3: 18.
et cetera...
Numbers produced that begin with 9: 3.

If you pick a more skewed cutoff, such as "up to 199", the pattern is even stronger. If a lottery has 20,000 tickets, more than half of them have a number beginning with a 1.

Put another way, the count of numbers between 1 and N inclusive that begin with the digit "1" is always greater than or equal to the count of numbers between 1 and N inclusive that begin with any other digit.

Does that help?

Yes, it does help, and that was my intuitive explanation, but it doesn't seem to be what the papers are saying.
For instance, Wiki says:
This (meaning the fact that the leading digit is 1 almost one third of the time, and larger numbers occur as the leading digit with less and less frequency as they grow in magnitude) is based on the observation that real-world measurements are generally distributed logarithmically, thus the logarithm of a set of real-world measurements is generally distributed uniformly.

I don't understand why real-world measurements are distributed logarithmically, and I don't understand what it means to be distributed logarithmically. Is that a fancy way of saying what you just explained?

The law can be explained by the fact that, if it is indeed true that the first digits have a particular distribution, it must be independent of the measuring units used. For example, this means that if one converts from e.g. feet to yards (multiplication by a constant), the distribution must be unchanged — it is scale invariant, and the only distribution that fits this is logarithmic.

And here I'm completely lost. I don't understand why a logarithmic distribution is scale invariant, and why being scale invariant means that the first digit will be 1 30% of the time.
 
I don't understand why real-world measurements are distributed logarithmically, and I don't understand what it means to be distributed logarithmically. Is that a fancy way of saying what you just explained?
AFAIK, yes. The "why" of being distributed logarithmically is "because they're distributed in the above way, and that's logarithmic".



And here I'm completely lost. I don't understand why a logarithmic distribution is scale invariant, and why being scale invariant means that the first digit will be 1 30% of the time.
Saying that a logarithmic distribution is scale invariant is a bit like saying that the slope of a pyramid is invariant under translation but not under rotation: a pyramid keeps the same angle if you slide it up or down, and a logarithmic distribution keeps the same pattern if you multiply everything by some factor.
The 30% figure (and the others, let me quote) come from the following pattern:
1 ----------- 30.1%
2 ----------- 17.6%
3 ----------- 12.5%
4 ----------- 9.7%
5 ----------- 7.9%
6 ----------- 6.7%
7 ----------- 5.8%
8 ----------- 5.1%
9 ----------- 4.6%
Log 2 = 0.301029996
Log 3 = 0.477121255
0.477121255 - 0.301029996 = 0.176091259
Etc...
Log 9 = 0.954242509
1 - 0.954242509 = 0.045757491
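(Editor's note: put generally, the probability of leading digit d is log10(d+1) - log10(d), and since log10(10) = 1, the nine gaps partition the interval [0, 1]; a quick sketch reproducing the arithmetic above:)

```python
import math

# The table's percentages are the gaps between consecutive base-10 logs:
# P(d) = log10(d + 1) - log10(d)
gaps = [math.log10(d + 1) - math.log10(d) for d in range(1, 10)]

print([round(g, 3) for g in gaps])
# [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
# The sum telescopes to log10(10) - log10(1) = 1.
```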
 
AFAIK, yes. The "why" of being distributed logarithmically is "because they're distributed in the above way, and that's logarithmic".

Thanks man :) I was frustrated because I understood very well how it worked, but not why - and when you explain it to someone else it helps :)



Saying that a logarithmic distribution is scale invariant is a bit like saying that the slope of a pyramid is invariant under translation but not under rotation: a pyramid keeps the same angle if you slide it up or down, and a logarithmic distribution keeps the same pattern if you multiply everything by some factor.
The 30% figure (and the others, let me quote) come from the following pattern:

Log 2 = 0.301029996
Log 3 = 0.477121255
0.477121255 - 0.301029996 = 0.176091259
Etc...
Log 9 = 0.954242509
1 - 0.954242509 = 0.045757491

:eek: That's awesome! Thanks again! :)
 
The law can be explained by the fact that, if it is indeed true that the first digits have a particular distribution, it must be independent of the measuring units used. For example, this means that if one converts from e.g. feet to yards (multiplication by a constant), the distribution must be unchanged — it is scale invariant, and the only distribution that fits this is logarithmic.

And here I'm completely lost. I don't understand why a logarithmic distribution is scale invariant, and why being scale invariant means that the first digit will be 1 30% of the time.

Funny, that explanation you quoted sounds a lot like what that philosopher I mentioned was saying.

I think Erik answered the 2nd half of your question. To answer the first, take several large arrays of numbers and play around with them - multiply each by a constant.
Here are some examples of sequences that are uniformly distributed on a logarithmic scale (geometric progressions):
1 2 4 8 16 32 64 ...
1 3 9 27 81 243 729 ...
And an example of a uniform linear distribution (an arithmetic progression):
1 2 3 4 5 6 ...
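(Editor's note: a sketch of the contrast in base-10 logs: the logs of a geometric progression are evenly spaced, while the logs of a linear progression bunch up as the values grow:)

```python
import math

geometric = [2 ** k for k in range(7)]   # 1 2 4 8 16 32 64
linear = list(range(1, 8))               # 1 2 3 4 5 6 7

geo_logs = [math.log10(x) for x in geometric]
lin_logs = [math.log10(x) for x in linear]

# Gaps between consecutive logs: constant for the geometric sequence,
# shrinking for the linear one.
geo_gaps = [round(b - a, 3) for a, b in zip(geo_logs, geo_logs[1:])]
lin_gaps = [round(b - a, 3) for a, b in zip(lin_logs, lin_logs[1:])]

print(geo_gaps)  # [0.301, 0.301, 0.301, 0.301, 0.301, 0.301]
print(lin_gaps)  # [0.301, 0.176, 0.125, 0.097, 0.079, 0.067] -- shrinking
```

Notice that the shrinking gaps of the linear sequence are exactly the Benford percentages from earlier in the thread.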
 
But if we switch from decimal to binary numbers, every binary number will start with 1! Thus, the first digit is 1 in 100% of cases!
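(Editor's note: right, and that matches the generalized law in base b, P(d) = log_b(1 + 1/d): in base 2 the only possible leading digit is 1, and log2(1 + 1/1) = 1. A one-line check:)

```python
# Every positive integer's binary representation starts with a 1,
# consistent with P(1) = log2(1 + 1/1) = 1 in base 2.
assert all(format(n, "b")[0] == "1" for n in range(1, 1000))
print("every binary number starts with 1")
```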
 
The applications in Accounting fraud are really amazing! :wow: I shall have to find some way of using this to my advantage, perhaps at work :hmm:

EDIT: Incidentally, I think the key to the "why" is in this paragraph:
Multiple probability distributions

Note that for numbers drawn from many distributions, for example IQ scores, human heights or other variables following normal distributions, the law is not valid. However, if one "mixes" numbers from those distributions, for example by taking numbers from newspaper articles, Benford's law reappears. This can be proven mathematically: if one repeatedly "randomly" chooses a probability distribution and then randomly chooses a number according to that distribution, the resulting list of numbers will obey Benford's law.[8][3]
Having read [3] and skimmed [8] (it speaks in maths which I can't really understand), the "mixing" described above is what produces the logarithmic distribution of digits (the proof is in [8]).
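(Editor's note: the "mix of distributions" result can be checked empirically. The sketch below is my own illustration, not the construction from [8]: it repeatedly picks a uniform distribution whose scale is itself random over several orders of magnitude, draws one number from it, and tallies leading digits. Seeded for reproducibility.)

```python
import random
from collections import Counter

random.seed(42)

def leading_digit(x: float) -> int:
    # First significant decimal digit, read off the scientific notation.
    return int(f"{x:e}"[0])

counts = Counter()
for _ in range(20000):
    scale = 10 ** random.uniform(0, 6)  # random scale from 1 to 1,000,000
    x = random.uniform(0, scale)        # draw from that distribution
    if x > 0:
        counts[leading_digit(x)] += 1

freq_1 = counts[1] / sum(counts.values())
print(f"leading digit 1: {freq_1:.1%} (Benford predicts 30.1%)")
```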
 
Yeah, logarithms answer the question.

EDIT: That was directed to the OP rather than illram. Don't buy a lottery ticket :lol:
 
The applications in Accounting fraud are really amazing! :wow: I shall have to find some way of using this to my advantage, perhaps at work :hmm:
How do you see such an application?
 
The reason that the distribution is logarithmic is because

The probability that a number is between 100 and 1000 (logarithm between 2 and 3)
= The probability that a number is between 10,000 and 100,000 (logarithm between 4 and 5)

Why?

Well, this is obviously not always true.. but for many sets of numbers it is a reasonable assumption.. especially for sets of numbers that grow exponentially, like incomes, and stock prices, and sets of numbers we encounter in daily life.

Why?

Because the systems we use to measure things are arbitrary. Take the distribution of all incomes of all people who live in the U.S. You're going to get a whole bunch of things at the bottom ($0 - $10,000), then a smaller amount of things a bit higher up ($10,000 - $20,000), then an even smaller amount of things a bit higher ($20,000 - $30,000), and so on, and so on.

But wait! What if you expressed all these incomes in Zimbabwean dollars? or Polish Zloty? Or Euros? Or yen? Well, you'd get the exact same type of distribution.

I realize that it's not totally obvious why that makes things logarithmic, but that's how it makes sense to me.
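(Editor's note: one way to see the scale invariance concretely: take values whose logs are evenly spread over several full decades, multiply them all by an arbitrary constant, like an exchange rate, and the leading-digit frequencies barely move. A sketch; the constant 3.7 is arbitrary:)

```python
from collections import Counter

def leading_digit(x: float) -> int:
    return int(f"{x:e}"[0])

# Values whose base-10 logs are evenly spaced across 3 full decades.
values = [10 ** (k / 1000) for k in range(3000)]

def digit_freqs(xs):
    c = Counter(leading_digit(x) for x in xs)
    return {d: c[d] / len(xs) for d in range(1, 10)}

before = digit_freqs(values)
after = digit_freqs([x * 3.7 for x in values])  # "change of units"

print(f"digit 1 before: {before[1]:.3f}, after: {after[1]:.3f}")
# Both stay near 0.301: multiplying by a constant only shifts the logs,
# which leaves their spacing (and hence the digit pattern) unchanged.
```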
 
How do you see such an application?

It is used in forensic auditing. For example, suppose someone is making up false invoices and then having the company pay them: people would tend to make lots of fake invoices that are small enough not to arouse suspicion (e.g. less than $1000) but large enough to make it worthwhile (i.e. go for several hundred rather than one hundred).

The "real" invoices will likely follow Benford's law, while the fake ones will distort it, because they are not 'random' (or even pseudo-random).
 
For me, the interesting part isn't that "exponential things" have first digits distributed logarithmically (that's quite obvious when you're told it!), or that lots of every day things are exponential. The interesting part, for me, is that, when you take a random number from a random distribution -- even ones that don't obey Benford's law, such as uniform distributions or normal distributions -- you end up with a distribution that obeys Benford's law. I find that quite incredible.

EDIT: It's unfortunate that the wiki only dedicates a single paragraph to this fact, and spends much more time explaining the "exponential" and the "measurement" things, neither of which explain how taking disparate numbers from different newspapers will result in a Benford-distributed set of first digits.

I can't follow the proof (not being well versed in Statistics, or even Maths anymore), so if anyone has a more "intuitive" description of the proof, I'd love to hear it!

@Birdjaguar: Do you mean at my work or in Accounting fraud?
 