# Benford's law

Discussion in 'Science & Technology' started by Masquerouge, May 30, 2007.

1. ### Masquerouge (Chieftain)

Joined:
Jun 3, 2002
Messages:
17,790
Location:
Mountain View, CA
So I recently learned about Benford's law.

I suggest this:
http://en.wikipedia.org/wiki/Benford's_law

and this:
http://www.math.gatech.edu/~hill/publications/cv.dir/1st-dig.pdf

for a detailed explanation, but the general idea is that the first digits of random numbers from random sets are not uniformly distributed between 1 and 9. On the contrary,
1------------30.1%
2----------- 17.6%
3----------- 12.5%
4----------- 9.7%
5----------- 7.9%
6----------- 6.7%
7----------- 5.8%
8----------- 5.1%
9----------- 4.6%

This means that if, for instance, you collected all the numbers on the front pages of various newspapers, thus ending up with random numbers from random sets (lottery numbers, temperatures, casualties, etc.), then on average 30.1% of these numbers would start with a 1, 17.6% would start with a 2, and so on.

The thing I do not understand is why? I can explain what the law is about, but I don't understand why it works. Could someone please explain it to me?

2. ### History_Buff (Knight of Cydonia)

Joined:
Aug 12, 2001
Messages:
6,529
Location:
Calgary, Alberta
Because not all sets go as high as the nineties?

3. ### Ayatollah So (the spoof'll set you free)

Joined:
Feb 20, 2002
Messages:
4,387
Location:
SE Michigan
I think it's because the size of the measuring unit is arbitrarily chosen. Like, say, a foot (based on the human foot) versus, I dunno, the lengths of various animals. You should expect a quasi-uniform distribution of the logarithms of the values, over a range of many orders of magnitude.

Suppose the animals' lengths in feet were a uniform linear distribution, instead. That would just be really weird. It would mean that most animals would have to be within an order of magnitude of the size of a blue whale.

This has always made intuitive sense to me because I think in terms of geometric progressions, orders of magnitude, etc. rather than linear progressions. But I'm not sure I can explain why it makes sense. I read something interesting by a philosopher on this, once, I'll see if I can dig that up.

4. ### angeleyes (mood indigo)

Joined:
Oct 12, 2005
Messages:
2,300
Location:
The Netherlands
Our counting system starts with 1, so it's obvious that 1 will be used more than 2, and so on. For example, if the house numbers on your street go up to #325, then the first digits are:

1 - 111x (1, 10-19, 100-199)
2 - 111x (2, 20-29, 200-299)
3 - 37x (3, 30-39, 300-325)
4 - 11x (4, 40-49)
5 - 11x
6 - 11x
7 - 11x
8 - 11x

etc
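A quick way to double-check a tally like this is to let a few lines of Python do the counting. This is only a sketch: the function name `leading_digit_counts` is made up for illustration, and 325 is the street length from the example above.

```python
from collections import Counter


def leading_digit_counts(last_house_number):
    """Count how often each leading digit occurs among 1..last_house_number."""
    return Counter(int(str(n)[0]) for n in range(1, last_house_number + 1))


counts = leading_digit_counts(325)
for digit in range(1, 10):
    print(f"{digit} - {counts[digit]}x")
```

Changing 325 to other street lengths shows how much the split depends on where the numbering happens to stop.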

5. ### Masquerouge (Chieftain)

Okay, I think that's what they're saying, and that's exactly what I don't get. Why should I expect a quasi-uniform distribution of the logarithms, and why does a quasi-uniform distribution of the logarithms of the values mean that I will end up with 30% of the numbers starting with 1, 17% starting with 2, etc.?

I understand your example, but I'm not sure that's what the explanation is - and that's too bad, because this one I understand

6. ### Erik Mesoy (Core Tester / Intern)

Joined:
Mar 25, 2002
Messages:
10,949
Location:
Oslo, Norway
Pick some random numbers of arbitrary size. List the integers going up from 1 to each of those numbers. You'll get sequences looking something like this:

*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
*1, 2, 3
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

(I picked 4 random numbers up to 100)

Amount of numbers produced that begin with 1: 30.
Amount of numbers produced that begin with 2: 24.
Amount of numbers produced that begin with 3: 18.
et cetera...
Amount of numbers produced that begin with 9: 3.

If you pick from a more lopsided range, such as "up to 199", the pattern is stronger. If a lottery has 20000 tickets, more than half of them have a number beginning with a 1.

Put another way, the amount of numbers between 1 and N inclusive that begin with the digit "1" is always equal to or greater than the amount of numbers between 1 and N inclusive that begin with another digit.
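For anyone who wants to re-run the counts, here is a small sketch. The four upper limits (33, 3, 47, 15) are the ones from the sequences above; the helper name `leading_digit` is just an illustrative choice.

```python
from collections import Counter


def leading_digit(n):
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


upper_limits = [33, 3, 47, 15]  # the four limits used in the post

# Pool the integers 1..limit for every limit and tally their first digits.
counts = Counter(
    leading_digit(n)
    for limit in upper_limits
    for n in range(1, limit + 1)
)
for digit in range(1, 10):
    print(f"begin with {digit}: {counts[digit]}")

# The lottery example: tickets numbered 1..20000.
tickets_starting_with_1 = sum(
    1 for n in range(1, 20001) if leading_digit(n) == 1
)
print(f"{tickets_starting_with_1} of 20000 tickets begin with 1")
```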

Does that help?

7. ### Masquerouge (Chieftain)

Yes, it does help, and that was my intuitive explanation, but it doesn't seem to be what the papers are saying.
For instance, Wiki says:
I don't understand why real-world measurements are distributed logarithmically, and I don't understand what it means to be distributed logarithmically. Is that a fancy way of saying what you just explained?

And here I'm completely lost. I don't understand why a logarithmic distribution is scale invariant, and why being scale invariant means that the first digit will be 1 30% of the time.

8. ### Erik Mesoy (Core Tester / Intern)

AFAIK, yes. The "why" of being distributed logarithmically is "because they're distributed in the above way, and that's logarithmic".

Saying that a logarithmic distribution is scale invariant is a bit like saying that the slope of a pyramid is translation invariant, but not under rotation. Pyramids stay the same angle if you move up or down; logarithmic distributions stay patterned if you multiply by some factor.
The 30% figure (and the others, let me quote) are from the following pattern:
Log 2 = 0.301029996
Log 3 = 0.477121255.
0.477121255-0.301029996 = 0.176091259
Etc...
Log 9 = 0.954242509
1 - 0.954242509 = 0.045757491
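The pattern in those differences is log10(d + 1) - log10(d), which is the same as log10(1 + 1/d). A minimal sketch that reproduces all nine percentages (the function name `benford_probability` is only illustrative):

```python
import math


def benford_probability(d):
    """Share of numbers expected to start with digit d (1..9) in base 10."""
    return math.log10(1 + 1 / d)


for d in range(1, 10):
    print(f"{d}: {100 * benford_probability(d):.1f}%")

# The nine shares telescope to log10(10) - log10(1) = 1.
total = sum(benford_probability(d) for d in range(1, 10))
print(f"total: {total:.6f}")
```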

9. ### Masquerouge (Chieftain)

Thanks man, I was frustrated because I understood very well how it worked, but not why - and when you explain it to someone else it helps.

That's awesome! Thanks again!

10. ### Ayatollah So (the spoof'll set you free)

Funny, that explanation you quoted sounds a lot like what that philosopher I mentioned was saying.

I think Erik answered the 2nd half of your question. To answer the first, take several large arrays of numbers and play around with them - multiply each by a constant.
Here are some examples of uniform logarithmic distributions:
1 2 4 8 16 32 64 ....
1 3 9 27 81 243 729 ...
And an example of a uniform linear distribution:
1 2 3 4 5 6 ...
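Here is one way to do the "play around with them" experiment in Python. It is a sketch under my own assumptions: 1000 terms per sequence and a multiplier of 3 are arbitrary choices, and `share_starting_with_1` is a made-up helper name.

```python
def share_starting_with_1(values):
    """Fraction of positive integers in `values` whose first digit is 1."""
    values = list(values)
    return sum(1 for v in values if str(v)[0] == "1") / len(values)


geometric = [2**k for k in range(1000)]  # 1 2 4 8 16 ...
linear = list(range(1, 1000))            # 1 2 3 4 5 ...
factor = 3                               # arbitrary constant to multiply by

geo_before = share_starting_with_1(geometric)
geo_after = share_starting_with_1(factor * v for v in geometric)
lin_before = share_starting_with_1(linear)
lin_after = share_starting_with_1(factor * v for v in linear)

print(f"geometric: {geo_before:.3f} -> {geo_after:.3f}")
print(f"linear:    {lin_before:.3f} -> {lin_after:.3f}")
```

The geometric sequence keeps roughly the same share of leading 1s after multiplying, while the linear sequence's share jumps around depending on the constant.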

11. ### macko (Chieftain)

Joined:
Mar 17, 2009
Messages:
1
But if we switch from decimal to binary numbers, every binary number will start with 1! Thus, the first digit is 1 in 100% of cases!
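That actually fits the law rather than breaking it. The usual generalization to base b gives the leading digit d a share of log_b(1 + 1/d), and in base 2 the only possible leading digit is 1, with share log_2(2) = 1. A small sketch (the function name `benford_in_base` is hypothetical):

```python
import math


def benford_in_base(base):
    """Leading-digit shares for digits 1..base-1 in the given base."""
    return {d: math.log(1 + 1 / d, base) for d in range(1, base)}


print(benford_in_base(2))       # only digit 1 is possible in binary
print(benford_in_base(10)[1])   # the familiar decimal share for digit 1
```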

12. ### Mise (isle of lucy)

Joined:
Apr 13, 2004
Messages:
28,495
Location:
London, UK
The applications in Accounting fraud are really amazing! I shall have to find some way of using this to my advantage, perhaps at work

EDIT: Incidentally, I think the key to the "why" is in this paragraph:
Having read [3] and skimmed [8] (it speaks in maths which I can't really understand), the above results in the logarithmic distribution of digits (the proof of this is [8]).

13. ### illram (Moderator)

Joined:
Dec 25, 2005
Messages:
9,217
Location:
San Francisco
So can I use this to up my chances of winning the lottery?

Joined:
Apr 4, 2007
Messages:
21,810
Location:
Liverpool, home of Everton FC

EDIT: That was directed to the OP rather than illram. Don't buy a lottery ticket

15. ### Birdjaguar (Entangled, Retired Moderator, Supporter)

Joined:
Dec 24, 2001
Messages:
30,229
Location:
Albuquerque, NM
How do you see such an application?

16. ### warpus (In pork I trust)

Joined:
Aug 28, 2005
Messages:
46,611
Location:
Stamford Bridge
The reason that the distribution is logarithmic is because

The probability that a number is between 100 and 1000 (logarithm between 2 and 3)
= The probability that a number is between 10,000 and 100,000 (logarithm between 4 and 5)

Why?

Well, this is obviously not always true, but for many sets of numbers it is a reasonable assumption, especially for sets of numbers that grow exponentially, like incomes and stock prices, and for sets of numbers we encounter in daily life.

Why?

Because the systems we use to measure things are arbitrary. Take the distribution of all incomes of all people who live in the U.S. You're going to get a whole bunch of things at the bottom (\$0 - \$10,000), then a smaller amount of things a bit higher up (\$10,000 - \$20,000), then an even smaller amount of things a bit higher (\$20,000 - \$30,000), and so on.

But wait! What if you expressed all these incomes in Zimbabwean dollars? or Polish Zloty? Or Euros? Or yen? Well, you'd get the exact same type of distribution.

I realize that it's not totally obvious why that makes things logarithmic, but that's how it makes sense to me.
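To make that a bit more concrete, here is a simulation sketch. Everything in it is an assumption for illustration: the "incomes" are synthetic values whose logarithm is uniform between 2 and 6 (so every power-of-ten band is equally likely), and the exchange rate 0.0087 is invented.

```python
import random
from collections import Counter


def leading_digit(x):
    """First significant digit of a positive float, read off scientific notation."""
    return int(f"{x:.15e}"[0])


def digit_shares(values):
    """Fractions of values starting with 1, 2, ..., 9 (in that order)."""
    counts = Counter(leading_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]


def share_between(values, low, high):
    """Fraction of values in the half-open interval [low, high)."""
    return sum(1 for v in values if low <= v < high) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
incomes = [10 ** rng.uniform(2, 6) for _ in range(100_000)]  # synthetic data
rate = 0.0087  # hypothetical exchange rate
converted = [x * rate for x in incomes]

print("100-1000:       ", share_between(incomes, 100, 1000))
print("10,000-100,000: ", share_between(incomes, 10_000, 100_000))
print("original: ", [f"{s:.3f}" for s in digit_shares(incomes)])
print("converted:", [f"{s:.3f}" for s in digit_shares(converted)])
```

Both currency versions should show roughly the same first-digit shares, which is the "same type of distribution" point from the post.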

17. ### Knight-Dragon (Unhidden Dragon, Retired Moderator)

Joined:
Jun 25, 2001
Messages:
19,958
Location:
Singapore
Moderator Action: Moved to S/T.

Joined:
Oct 5, 2001
Messages:
30,060
It is used in forensic auditing. For example, suppose someone is making up false invoices and then having the company pay them. People tend to make lots of fake invoices that are small enough not to arouse suspicion (e.g. less than \$1000) but large enough to make it worthwhile (i.e. several hundred rather than one hundred).

The "real" invoices will likely follow Benford's law, while the fake ones will distort it, because they are not 'random' (or even pseudo-random).
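A rough sketch of what such a check could look like, not how any real audit tool works. Both invoice lists are synthetic: the "genuine" amounts are drawn log-uniformly between \$10 and \$100,000, and the "fabricated" ones uniformly between \$300 and \$999, matching the "several hundred" scenario above. The score is a plain chi-square statistic against the Benford shares.

```python
import math
import random
from collections import Counter

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def chi_square_vs_benford(amounts):
    """Chi-square distance between observed first digits and Benford's law."""
    counts = Counter(int(f"{a:.15e}"[0]) for a in amounts)
    n = len(amounts)
    return sum(
        (counts[d] - n * p) ** 2 / (n * p)
        for d, p in zip(range(1, 10), BENFORD)
    )


rng = random.Random(1)  # fixed seed so the run is repeatable
genuine = [10 ** rng.uniform(1, 5) for _ in range(2000)]   # synthetic invoices
fabricated = [rng.uniform(300, 999) for _ in range(2000)]  # synthetic fakes

genuine_score = chi_square_vs_benford(genuine)
fabricated_score = chi_square_vs_benford(fabricated)
print(f"genuine:    {genuine_score:.1f}")
print(f"fabricated: {fabricated_score:.1f}")
```

With 8 degrees of freedom, a score well above roughly 15.5 (the usual 5% cutoff) suggests the first digits do not look like Benford's law. That is a reason to look closer, not proof of fraud.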

19. ### Mise (isle of lucy)

For me, the interesting part isn't that "exponential things" have first digits distributed logarithmically (that's quite obvious when you're told it!), or that lots of everyday things are exponential. The interesting part, for me, is that, when you take a random number from a random distribution -- even ones that don't obey Benford's law, such as uniform distributions or normal distributions -- you end up with a distribution that obeys Benford's law. I find that quite incredible.

EDIT: It's unfortunate that the wiki only dedicates a single paragraph to this fact, and spends much more time explaining the "exponential" and the "measurement" things, neither of which explains how taking disparate numbers from different newspapers will result in a Benford-distributed set of first digits.

I can't follow the proof (not being well versed in Statistics, or even Maths anymore), so if anyone has a more "intuitive" description of the proof, I'd love to hear it!
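Not a proof, but a simulation sketch of the effect. The setup is my own assumption, not the construction from the paper: each draw first picks a random scale (log-uniform over six powers of ten), then picks either a uniform or a normal distribution at that scale, then takes one number from it. A single fixed uniform distribution is included for comparison.

```python
import random


def leading_digit(x):
    """First significant digit of a positive float, read off scientific notation."""
    return int(f"{x:.15e}"[0])


def share_of_ones(values):
    """Fraction of values whose first significant digit is 1."""
    return sum(1 for v in values if leading_digit(v) == 1) / len(values)


rng = random.Random(2)  # fixed seed so the run is repeatable


def draw_from_random_distribution():
    """Pick a distribution at random, then draw one number from it."""
    scale = 10 ** rng.uniform(0, 6)      # assumed: scales spread over 6 decades
    if rng.random() < 0.5:
        return rng.uniform(0, scale)     # a uniform distribution on [0, scale]
    return abs(rng.gauss(0, scale))      # magnitude of a normal with sd = scale


single = [rng.uniform(0, 500) for _ in range(100_000)]
mixed = [draw_from_random_distribution() for _ in range(100_000)]

single_share = share_of_ones(single)
mixed_share = share_of_ones(mixed)
print(f"one uniform distribution: {single_share:.3f}")
print(f"random distributions:     {mixed_share:.3f}")
```

The single uniform distribution misses the 30.1% mark, while the mixture lands close to it. The catch is that the random choice of scale does a lot of the work here, so this only illustrates the idea for one particular way of picking distributions.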

@Birdjaguar: Do you mean at my work or in Accounting fraud?

20. ### warpus (In pork I trust)

Mise, once you understand that the probabilities have a logarithmic distribution and why, what else is there to understand?