View Full Version : Benford's law


Masquerouge
May 30, 2007, 05:18 PM
So I recently learned about Benford's law.

I suggest this:
http://en.wikipedia.org/wiki/Benford%27s_law

and this:
http://www.math.gatech.edu/~hill/publications/cv.dir/1st-dig.pdf

for a detailed explanation, but the general idea is that the first digits of random numbers from random sets are not uniformly distributed between 1 and 9. On the contrary:
Leading digit   Probability
1               30.1%
2               17.6%
3               12.5%
4               9.7%
5               7.9%
6               6.7%
7               5.8%
8               5.1%
9               4.6%

This means that if, for instance, you decided to collect all the numbers on the front page of various newspapers, thus ending up with random numbers from random sets (lottery numbers, temperatures, casualties, etc.), then on average 30.1% of these numbers would start with a 1, 17.6% would start with a 2, and so on...
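
One quick way to see the pattern is to tally the first digits of a sequence that is known to follow the law, such as the powers of 2. A minimal Python sketch:

# A sketch: the powers of 2 are a classic Benford-distributed sequence,
# so tallying their first digits reproduces the table above.
from collections import Counter

counts = Counter(str(2 ** n)[0] for n in range(1, 2001))
for d in "123456789":
    print(d, f"{100 * counts[d] / 2000:.1f}%")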

The thing I do not understand is why. I can explain what the law is about, but I don't understand why it works. Could someone please explain it to me?

History_Buff
May 30, 2007, 06:54 PM
Because not all sets go as high as the nineties?

Ayatollah So
May 30, 2007, 08:22 PM
I think it's because the size of the measuring unit is arbitrarily chosen. Like, say, a foot (based on the human foot) versus, I dunno, the lengths of various animals. You should expect a quasi-uniform distribution of the logarithms of the values, over a range of many orders of magnitude.

Suppose the animals' lengths in feet were a uniform linear distribution, instead. That would just be really weird. It would mean that most animals would have to be within an order of magnitude of the size of a blue whale.

This has always made intuitive sense to me because I think in terms of geometric progressions, orders of magnitude, etc. rather than linear progressions. But I'm not sure I can explain why it makes sense. I read something interesting by a philosopher on this, once, I'll see if I can dig that up.

angeleyes
May 31, 2007, 12:45 AM
Our counting system starts with 1, so it's natural that it gets used more than 2, etc. For example, if the street you live on counts up to #325, then the first-digit counts are (see the sketch below):

1 - 111x (1, 10-19, 100-199)
2 - 111x (2, 20-29, 200-299)
3 - 37x (3, 30-39, 300-325)
4 - 11x (4, 40-49)
5 - 11x
6 - 11x
7 - 11x
8 - 11x

etc
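
A Python sketch to check those counts:

# A sketch counting the leading digits of house numbers 1..325.
from collections import Counter

counts = Counter(str(n)[0] for n in range(1, 326))
for d in "123456789":
    print(d, counts[d])   # 1 and 2 appear 111x, 3 appears 37x, 4-9 appear 11x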

Masquerouge
May 31, 2007, 09:34 AM
I think it's because the size of the measuring unit is arbitrarily chosen. Like, say, a foot (based on the human foot) versus, I dunno, the lengths of various animals. You should expect a quasi-uniform distribution of the logarithms of the values, over a range of many orders of magnitude.

Okay, I think that's what they're saying, and that's exactly what I don't get. Why should I expect a quasi-uniform distribution of the logarithms, and why does having a quasi-uniform distribution of the logarithms of the values mean that I will end up with 30% of the numbers starting with 1, 17% starting with 2, etc.?



Our counting system starts with 1, so it's natural that it gets used more than 2, etc. For example, if the street you live on counts up to #325, then the first-digit counts are:


I understand your example, but I'm not sure that's what the explanation is - and that's too bad, because this one I understand :)

Erik Mesoy
May 31, 2007, 10:16 AM
Pick some random numbers of arbitrary size. List the integers going up from 1 to each of those numbers. You'll get sequences looking something like this:

*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
*1, 2, 3
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

(I picked 4 random numbers up to 100)

Count of numbers produced that begin with 1: 30.
Count of numbers produced that begin with 2: 24.
Count of numbers produced that begin with 3: 18.
et cetera...
Count of numbers produced that begin with 9: 3.

If you pick from a more lopsided range, such as "up to 199", the pattern is even stronger. If a lottery has 20000 tickets, more than half of them have a number beginning with a 1.

Put another way, the count of numbers between 1 and N inclusive that begin with the digit "1" is always equal to or greater than the count of numbers between 1 and N inclusive that begin with any other digit.
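
A Python sketch of the same experiment at scale (the upper bounds are assumed uniform on 1..100, as above); the 1s dominate, though the exact Benford percentages only emerge from broader mixtures:

# A sketch: draw many random upper bounds, list 1..upper for each,
# and tally the leading digits of everything produced.
import random
from collections import Counter

random.seed(1)                 # fixed seed so the run is repeatable
counts = Counter()
for _ in range(10000):
    upper = random.randint(1, 100)   # assumed range, as in the post
    counts.update(str(n)[0] for n in range(1, upper + 1))
total = sum(counts.values())
for d in "123456789":
    print(d, f"{100 * counts[d] / total:.1f}%")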

Does that help?

Masquerouge
May 31, 2007, 10:46 AM
Pick some random numbers of arbitrary size. List the integers going up from 1 to each of those numbers. You'll get sequences looking something like this:

*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
*1, 2, 3
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
*1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

(I picked 4 random numbers up to 100)

Count of numbers produced that begin with 1: 30.
Count of numbers produced that begin with 2: 24.
Count of numbers produced that begin with 3: 18.
et cetera...
Count of numbers produced that begin with 9: 3.

If you pick from a more lopsided range, such as "up to 199", the pattern is even stronger. If a lottery has 20000 tickets, more than half of them have a number beginning with a 1.

Put another way, the count of numbers between 1 and N inclusive that begin with the digit "1" is always equal to or greater than the count of numbers between 1 and N inclusive that begin with any other digit.

Does that help?

Yes, it does help, and that was my intuitive explanation, but it doesn't seem to be what the papers are saying.
For instance, Wiki says:
This (meaning the fact that the leading digit is 1 almost one third of the time, and larger numbers occur as the leading digit with less and less frequency as they grow in magnitude) is based on the observation that real-world measurements are generally distributed logarithmically, thus the logarithm of a set of real-world measurements is generally distributed uniformly.

I don't understand why real-world measurements are distributed logarithmically, and I don't understand what it means to be distributed logarithmically. Is that a fancy way of saying what you just explained?

The law can be explained by the fact that, if it is indeed true that the first digits have a particular distribution, it must be independent of the measuring units used. For example, this means that if one converts from e.g. feet to yards (multiplication by a constant), the distribution must be unchanged — it is scale invariant, and the only distribution that fits this is logarithmic.

And here I'm completely lost. I don't understand why a logarithmic distribution is scale invariant, and why being scale invariant means that the first digit will be 1 30% of the time.

Erik Mesoy
May 31, 2007, 10:52 AM
I don't understand why real-world measurements are distributed logarithmically, and I don't understand what it means to be distributed logarithmically. Is that a fancy way of saying what you just explained?

AFAIK, yes. The "why" of being distributed logarithmically is "because they're distributed in the above way, and that's logarithmic".



And here I'm completely lost. I don't understand why a logarithmic distribution is scale invariant, and why being scale invariant means that the first digit will be 1 30% of the time.
Saying that a logarithmic distribution is scale invariant is a bit like saying that the slope of a pyramid is translation invariant, but not under rotation. Pyramids stay the same angle if you move up or down; logarithmic distributions stay patterned if you multiply by some factor.
The 30% figure (and the others, let me quote) are from the following pattern:
1   30.1%
2   17.6%
3   12.5%
4   9.7%
5   7.9%
6   6.7%
7   5.8%
8   5.1%
9   4.6%
Log 2 = 0.301029996
Log 3 = 0.477121255
0.477121255-0.301029996 = 0.176091259
Etc...
Log 9 = 0.954242509
1 - 0.954242509 = 0.045757491
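
In general, the gap between successive logs gives each probability: P(d) = log10(d + 1) - log10(d) = log10(1 + 1/d). A Python sketch that regenerates the whole table:

# A sketch computing Benford's probabilities from the log formula.
import math

for d in range(1, 10):
    print(d, f"{100 * math.log10(1 + 1 / d):.1f}%")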

Masquerouge
May 31, 2007, 10:58 AM
AFAIK, yes. The "why" of being distributed logarithmically is "because they're distributed in the above way, and that's logarithmic".

Thanks man :) I was frustrated because I understood very well how it worked, but not why - and when you explain it to someone else it helps :)




Saying that a logarithmic distribution is scale invariant is a bit like saying that the slope of a pyramid is translation invariant, but not under rotation. Pyramids stay the same angle if you move up or down; logarithmic distributions stay patterned if you multiply by some factor.
The 30% figure (and the others, let me quote) are from the following pattern:

Log 2 = 0.301029996
Log 3 = 0.477121255
0.477121255-0.301029996 = 0.176091259
Etc...
Log 9 = 0.954242509
1 - 0.954242509 = 0.045757491

:eek: that's awesome! Thanks again! :)

Ayatollah So
May 31, 2007, 11:03 AM
The law can be explained by the fact that, if it is indeed true that the first digits have a particular distribution, it must be independent of the measuring units used. For example, this means that if one converts from e.g. feet to yards (multiplication by a constant), the distribution must be unchanged — it is scale invariant, and the only distribution that fits this is logarithmic.

And here I'm completely lost. I don't understand why a logarithmic distribution is scale invariant, and why being scale invariant means that the first digit will be 1 30% of the time.

Funny, that explanation you quoted sounds a lot like what that philosopher I mentioned was saying.

I think Erik answered the 2nd half of your question. To answer the first, take several large arrays of numbers and play around with them - multiply each by a constant (there's a sketch of this below).
Here are some examples of distributions that are uniform on a log scale (geometric progressions):
1 2 4 8 16 32 64 ....
1 3 9 27 81 243 729 ...
And an example of a uniform linear distribution:
1 2 3 4 5 6 ...
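
A Python sketch of that rescaling (the growth ratio 1.07 and the feet-to-yards factor are just illustrative choices):

# A sketch of scale invariance: take a geometric progression (uniform on
# a log scale) and rescale it by a unit change; the first-digit tallies
# barely move, and both columns are roughly Benford.
from collections import Counter

def leading_digit(x):
    # first significant digit of a positive number
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

lengths_in_feet = [1.07 ** n for n in range(500)]   # arbitrary geometric data
feet = Counter(leading_digit(x) for x in lengths_in_feet)
yards = Counter(leading_digit(x / 3) for x in lengths_in_feet)
for d in range(1, 10):
    print(d, feet[d], yards[d])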

macko
Mar 17, 2009, 03:14 AM
But if we switch from decimal to binary numbers, every binary number will start with 1! Thus, the first digit is 1 in 100% of cases!

Mise
Mar 17, 2009, 04:39 AM
The applications in Accounting fraud are really amazing! :wow: I shall have to find some way of using this to my advantage, perhaps at work :hmm:

EDIT: Incidentally, I think the key to the "why" is in this paragraph:
Multiple probability distributions

Note that for numbers drawn from many distributions, for example IQ scores, human heights or other variables following normal distributions, the law is not valid. However, if one "mixes" numbers from those distributions, for example by taking numbers from newspaper articles, Benford's law reappears. This can be proven mathematically: if one repeatedly "randomly" chooses a probability distribution and then randomly chooses a number according to that distribution, the resulting list of numbers will obey Benford's law.[8][3]
Having read [3] and skimmed [8] (it speaks in maths which I can't really understand), the above mixing is what results in the logarithmic distribution of digits (the proof of this is [8]).
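
A rough Python sketch of that mixing experiment (my assumption here: the scale of each uniform distribution is itself random over six orders of magnitude); the tallies land close to the 30.1%/17.6%/... table:

# A sketch: repeatedly "randomly choose a distribution" (a uniform with
# a random scale), then draw one number from it, and tally first digits.
import random
from collections import Counter

random.seed(0)

def leading_digit(x):
    return int(f"{x:e}"[0])   # first digit via scientific notation

counts = Counter()
for _ in range(100000):
    scale = 10 ** random.uniform(0, 6)   # assumed log-uniform scale
    x = random.uniform(0, scale)         # one draw from that distribution
    if x > 0:
        counts[leading_digit(x)] += 1
total = sum(counts.values())
for d in range(1, 10):
    print(d, f"{100 * counts[d] / total:.1f}%")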

illram
Mar 17, 2009, 07:57 PM
So can I use this to up my chances of winning the lottery?

ParadigmShifter
Mar 17, 2009, 08:25 PM
Yeah, logarithms answer the question.

EDIT: That was directed to the OP rather than illram. Don't buy a lottery ticket :lol:

Birdjaguar
Mar 17, 2009, 10:07 PM
The applications in Accounting fraud are really amazing! :wow: I shall have to find some way of using this to my advantage, perhaps at work :hmm:

How do you see such an application?

warpus
Mar 17, 2009, 11:17 PM
The reason that the distribution is logarithmic is because

The probability that a number is between 100 and 1000 (logarithm between 2 and 3)
= The probability that a number is between 10,000 and 100,000 (logarithm between 4 and 5)

Why?

Well, this is obviously not always true.. but for many sets of numbers it is a reasonable assumption.. especially for sets of numbers that grow exponentially, like incomes, and stock prices, and sets of numbers we encounter in daily life.

Why?

Because the systems we use to measure things are arbitrary.. Take the distribution of all incomes of all people who live in the U.S. You're going to get a whole bunch of things at the bottom ($0 - $10,000), then a smaller amount of things a bit higher up ($10,000 - $20,000), then an even smaller amount of things a bit higher ($20,000 - $30,000), and so on, and so on.

But wait! What if you expressed all these incomes in Zimbabwean dollars? or Polish Zloty? Or Euros? Or yen? Well, you'd get the exact same type of distribution.
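
A Python sketch of that currency point (the lognormal income parameters and the zloty exchange rate are invented):

# A sketch: made-up lognormal "incomes" in dollars, then the same
# incomes converted to another currency; the first-digit tallies track
# each other closely, and both are roughly Benford.
import random
from collections import Counter

random.seed(2)

def leading_digit(x):
    return int(f"{x:e}"[0])

incomes_usd = [random.lognormvariate(10.5, 2.0) for _ in range(100000)]
usd = Counter(leading_digit(x) for x in incomes_usd)
zloty = Counter(leading_digit(x * 3.9) for x in incomes_usd)  # assumed rate
for d in range(1, 10):
    print(d, usd[d], zloty[d])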

I realize that it's not totally obvious why that makes things logarithmic, but that's how it makes sense to me.

Knight-Dragon
Mar 17, 2009, 11:34 PM
Moved to S/T.

ainwood
Mar 18, 2009, 03:18 AM
How do you see such an application?

It is used in forensic auditing. For example, if someone is making up false invoices and then having the company pay them: people would tend to (say) make lots of fake invoices that are small enough not to arouse suspicion (e.g. less than $1000) but large enough to make it worthwhile (i.e. go for several hundred rather than one hundred).

The "real" invoices will likely follow Benford's law, while the fake ones will distort it, because they are not 'random' (or even pseudo-random).
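
A Python sketch of the kind of screen an auditor might run; the "invoice" amounts below are entirely made up:

# A sketch: compare observed leading-digit counts to Benford's
# expectations with a chi-square statistic. Values well above 15.5 (the
# 5% cutoff for 8 degrees of freedom) are worth a closer look.
import math
import random
from collections import Counter

def leading_digit(x):
    return int(f"{x:e}"[0])

def benford_chi2(amounts):
    counts = Counter(leading_digit(a) for a in amounts)
    chi2 = 0.0
    for d in range(1, 10):
        expected = len(amounts) * math.log10(1 + 1 / d)
        chi2 += (counts[d] - expected) ** 2 / expected
    return chi2

random.seed(3)
honest = [random.lognormvariate(5, 1.5) for _ in range(2000)]     # made-up ledger
faked = honest + [random.uniform(600, 999) for _ in range(400)]   # injected fakes
print(benford_chi2(honest), benford_chi2(faked))  # the padded set scores far higher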

Mise
Mar 18, 2009, 03:19 AM
For me, the interesting part isn't that "exponential things" have first digits distributed logarithmically (that's quite obvious when you're told it!), or that lots of every day things are exponential. The interesting part, for me, is that, when you take a random number from a random distribution -- even ones that don't obey Benford's law, such as uniform distributions or normal distributions -- you end up with a distribution that obeys Benford's law. I find that quite incredible.

EDIT: It's unfortunate that the wiki only dedicates a single paragraph to this fact, and spends much more time explaining the "exponential" and the "measurement" things, neither of which explain how taking disparate numbers from different newspapers will result in a Benford-distributed set of first digits.

I can't follow the proof (not being well versed in Statistics, or even Maths anymore), so if anyone has a more "intuitive" description of the proof, I'd love to hear it!

@Birdjaguar: Do you mean at my work or in Accounting fraud?

warpus
Mar 18, 2009, 11:07 AM
Mise, once you understand that the probabilities have a logarithmic distribution and why, what else is there to understand?

Mise
Mar 18, 2009, 12:22 PM
Well you might as well do away with the rest of the article, and just leave "the probabilities have a logarithmic distribution" then...

That the first digits of a random sample, drawn from random distributions that don't individually follow Benford's law, end up following Benford's law is not obvious.

ParadigmShifter
Mar 18, 2009, 12:56 PM
It's because we use base 10.

If we used binary, every number except zero would have first significant digit 1.
If we used base 3, the non-zero numbers would have 50% starting with 1, the rest with 2.

Assuming we generate numbers at random of course (which is where picking numbers from different distributions helps out).

Mise
Mar 18, 2009, 01:35 PM
Sorry, PS, I don't follow.

If we use Base 2, 100% start with 1.
If we use Base 3, 50% start with 1.
If we use Base 4, 33% start with 1.
If we use Base 5, 25% start with 1.
If we use Base 6, 20% start with 1.
... etc etc ...
If we use Base 10, 11% start with 1.

I don't see how this explains why 30% of first digits selected at random from random distributions are 1s?

BTW I've uploaded an excel spreadsheet that demonstrates this, if anyone's interested...

ParadigmShifter
Mar 18, 2009, 01:46 PM
Yeah I wasn't that clear. The pattern breaks down for bases higher than 4.

Let's say we know the distributions are from uniform distributions with range [0,X] where X is a random variable in the range [1,10000].

Half of the random distributions will be from ranges smaller than [0,5000] so that adversely affects the number of 6s, 7s, 8s and 9s in the leading digits.

And the limiting behaviour for X tends to the observed distribution.
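
A Python sketch of exactly that set-up; the tallies lean heavily toward 1 and fall away toward 9, though hitting Benford's exact percentages takes a broader mixture of ranges:

# A sketch: X uniform on [1, 10000], then one draw from [0, X], repeated.
import random
from collections import Counter

random.seed(4)

def leading_digit(x):
    return int(f"{x:e}"[0])

counts = Counter()
for _ in range(200000):
    X = random.uniform(1, 10000)   # random range, as described
    n = random.uniform(0, X)       # then a draw from [0, X]
    if n > 0:
        counts[leading_digit(n)] += 1
total = sum(counts.values())
for d in range(1, 10):
    print(d, f"{100 * counts[d] / total:.1f}%")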

Mise
Mar 18, 2009, 01:50 PM
Yeah I wasn't that clear. The pattern breaks down for bases higher than 4.

Let's say we know the distributions are from uniform distributions with range [0,X] where X is a random variable in the range [1,10000].

Half of the random distributions will be from ranges smaller than [0,5000] so that adversely affects the number of 6s, 7s, 8s and 9s in the leading digits.

And the limiting behaviour for X tends to the observed distribution.
THANK YOU! This makes sense :)

EDIT: I guess it's similar for Normal distributions, because the standard deviation is generally of the same order of magnitude as (and therefore in some way proportional to) the mean.

ParadigmShifter
Mar 18, 2009, 02:08 PM
No, the mean and standard deviation are independent in Normal distributions. But the values will be grouped around the mean, which again follows the rule because of base-10 arithmetic.

The major breaking case is non-finite standard deviations I think.

EDIT: I think ;) I only skimmed the paper.

Mise
Mar 18, 2009, 02:21 PM
I'm confused again! If the mean is a random number between 0 and 10,000, and the s.d. is random between 0 and 10,000, then surely that means that the uniform distribution logic doesn't work? As in, 50% will have a mean between 0 and 5000, but then at least as many of those distributions will have an s.d. wide enough to encompass 5000-10000, as means between 5000-10000 having sd's to exclude numbers between 0 and 5000?

I might try to rephrase that...

ParadigmShifter
Mar 18, 2009, 02:24 PM
Standard deviations tend to be small though in real world data. I suppose you are right that they are in some way correlated to the order of magnitude of the mean, otherwise it would be a bad fit for data.

I'm just saying the Normal distribution doesn't impose any relation between the sd and mean, it's a function of 2 variables.

Birdjaguar
Mar 18, 2009, 06:34 PM
It is used in forensic auditing. For example, if someone is making up false invoices and then having the company pay them: people would tend to (say) make lots of fake invoices that are small enough not to arouse suspicion (e.g. less than $1000) but large enough to make it worthwhile (i.e. go for several hundred rather than one hundred).

The "real" invoices will likely follow Benford's law, while the fake ones will distort it, because they are not 'random' (or even pseudo-random).



@Birdjaguar: Do you mean at my work or in Accounting fraud?

I was thinking accounting fraud, but now am curious about what you do and how it applies.

Thanks ainwood.

warpus
Mar 18, 2009, 11:26 PM
Well you might as well do away with the rest of the article, and just leave "the probabilities have a logarithmic distribution" then...

That the first digits of a random sample, drawn from random distributions that don't individually follow Benford's law, end up following Benford's law is not obvious.

well.. i think you're just thinking too hard about this.

the key is that 1) we use arbitrary measuring methods and 2) the quantities we encounter tend to be exponential in nature, which both imply 3) logarithmic magic happening, which implies 4) Benford's law

the leap from 1 and 2 to 3 is the tough one, from 3 to 4 is a bit more obvious.

Mise
Mar 19, 2009, 04:09 AM
Except that Benford's law applies to sets of fictitious normal and uniform distributions, for example those in the excel spreadsheet above.

In the spreadsheet, I've got 2 uniform distributions and 5 normal distributions. The normal distributions have randomly selected means between 0 and 100,000, and randomly selected standard deviations between 0 and 100,000. These are far from real world measuring systems, and are certainly not exponential. The uniform distributions are just random numbers between 0 and 100,000 and 0 and 100 respectively. The spreadsheet then selects a random distribution from these 7, and picks out the 1st digit. It does this 200 times (but I've extended it to 1000 since uploading it), and the distribution of 1st digits follows Benford's law.
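
Here's a rough Python analogue of that spreadsheet set-up (same assumed parameters):

# A sketch: 2 uniform and 5 normal distributions with randomly chosen
# parameters up to 100,000, then 1000 picks from randomly chosen
# distributions. With so few distributions and draws it's noisy, but the
# tallies lean Benford-ward.
import random
from collections import Counter

random.seed(5)

def leading_digit(x):
    return int(f"{x:e}"[0])

dists = [lambda: random.uniform(0, 100000),
         lambda: random.uniform(0, 100)]
for _ in range(5):
    mu = random.uniform(0, 100000)
    sd = random.uniform(0, 100000)
    dists.append(lambda mu=mu, sd=sd: random.gauss(mu, sd))

counts = Counter()
for _ in range(1000):
    x = random.choice(dists)()
    if x > 0:                      # skip the occasional negative normal draw
        counts[leading_digit(x)] += 1
for d in range(1, 10):
    print(d, counts[d])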

Neither (1) arbitrary measuring methods, nor (2) naturally exponential things like pop distributions or stock prices, explains why normal or uniform distributions, with no reference to measuring methods whatsoever, produce Benford-distributed 1st digits.

The stuff what ParadigmShifter sez about Uniform distributions makes sense to me. But I can't understand why randomly selected normal distributions with random parameters follows Benford's law.

(I should remind folks that neither Normal distributions nor Uniform distributions themselves follow Benford's law.)

Perfection
Mar 23, 2009, 07:58 AM
Consider the case when the standard deviation is zero:

The thing collapses into Benford's law?

:confused:

Mise
Mar 23, 2009, 12:48 PM
No, it doesn't follow Benford's law if s.d. is 0. When s.d. = 0, the normal distribution is just a straight vertical line, with 1 value (the mean). If you had a hundred such distributions, and picked randomly from them the first digits, you'd simply end up picking randomly from the means, which are distributed uniformly. Uniform distributions don't obey Benford's law.

However, if s.d. is very large (orders of magnitude larger than the mean), the normal distribution looks rather like a uniform distribution; PS has already shown why randomly selecting uniform distributions leads to Benford's law. What I suspect is that s.d.'s in real life for normal distributions are sufficiently large to make random selections of random distributions follow Benford's law. However, I am yet to be convinced by this explanation! There's gotta be a better one...

Lord Olleus
Mar 23, 2009, 01:46 PM
No, it doesn't follow Benford's law if s.d. is 0. When s.d. = 0, the normal distribution is just a straight vertical line, with 1 value (the mean). If you had a hundred such distributions, and picked randomly from them the first digits, you'd simply end up picking randomly from the means, which are distributed uniformly. Uniform distributions don't obey Benford's law.

However, if s.d. is very large (orders of magnitude larger than the mean), the normal distribution looks rather like a uniform distribution; PS has already shown why randomly selecting uniform distributions leads to Benford's law. What I suspect is that s.d.'s in real life for normal distributions are sufficiently large to make random selections of random distributions follow Benford's law. However, I am yet to be convinced by this explanation! There's gotta be a better one...

Surely if the sd is 0 then the only number you can pick is the mean. It hardly seems fair to call this a uniform distribution as it isn't distributed at all - it's not even a random variable. In that instance Benford's law doesn't apply, because every single number from that 'distribution' will be the same and therefore begin with the same digit!

Mise
Mar 23, 2009, 02:02 PM
I'm not calling one normal distribution with s.d. of zero a uniform distribution. I said that selecting a distribution randomly from a hundred such distributions (with randomly (uniformly) distributed means) will yield a uniform distribution.

Put another way: generate 100 normal distributions with mean X and s.d. 0, where X is a random number taken from a uniform distribution. Then, select randomly from these distributions. Obviously, this will simply generate 100 random numbers from a uniform distribution.

Lord Olleus
Mar 23, 2009, 02:13 PM
Ah I see, sorry I didn't realise that that was what you meant.

Anarchist
Jan 06, 2011, 03:03 AM
Fantastic. Thank you for confirming my belief that Civ players are brighter than the average!

I've just been shown an article in New Scientist that makes out that Benford's law is something magical.

To me it seems obvious that if you have a random sample set, starting at zero (OK, that is a bit unlikely, but most sets probably do), the chance of the first significant digit being 1 increases as the first sig fig of the maximum size of the sample set drops toward 1. Once it reaches there, the chance diminishes again until you reach 9, and it starts again.

At any point, you will never have a situation that the chance of the first sig fig is less than 1/9, so that is the minimum. The maximum would be at, say 1 to 199, where the chances are (1/9 + 1)/2 (roughly). This is a range of 11% to 56%. If I simplistically assume an average of these, I get 33%, which is pretty close to what Benford says it is, at 30.1%.
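
A Python sketch of that see-saw:

# A sketch: track the share of numbers up to N whose first digit is 1;
# it swings between roughly 1/9 and 5/9 as N grows.
ones = 0
for N in range(1, 2000):
    if str(N)[0] == "1":
        ones += 1
    if N in (19, 99, 199, 999, 1999):
        print(N, f"{100 * ones / N:.0f}%")   # 58%, 11%, 56%, 11%, 56%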

I see comment (elsewhere) that state: “Everyone knows that our number system uses the digits 1 through 9 and that the odds of randomly obtaining any one of them as the first significant digit in a number is 1/9. ”
And that appears immediately false to me.

And people marvelling that phone numbers and lottery numbers do not follow this law. Well, duh! They have a fixed number of digits, so they follow a roughly uniform distribution, is all.

Benford's law seemed obvious to me as a perfectly natural thing to occur after about 10 seconds thinking about it.

Thanks, guys!

Souron
Jan 06, 2011, 01:38 PM
So can I use this to up my chances of winning the lottery?

The best way to up your chances of winning the lottery is not to play.

Perfection
Jan 08, 2011, 12:52 PM
The best way to up your chances of winning the lottery is not to play.
That's false. The best way to up your chances of winning the lottery is to play as much as possible.

ParadigmShifter
Jan 08, 2011, 12:53 PM
Depends if you define "winning" by maximising expected income or not.

Souron
Jan 08, 2011, 01:06 PM
Depends if you define "winning" by maximising expected income or not.

This. It's a win if you have more net money from the lottery than the average guy playing.

Mise
Jan 09, 2011, 03:16 PM
I define "winning the lottery" as "winning the lottery" :p

Perfection
Jan 09, 2011, 03:42 PM
Me too. The question wasn't how to maximize your bux.

ParadigmShifter
Jan 09, 2011, 04:28 PM
I define "winning the lottery" as "winning the lottery" :p

That's why I'm a mathematician, and you are an economist ;)

Perfection
Jan 09, 2011, 04:35 PM
You can't win the lottery without winning the lottery. :crazyeye:

ParadigmShifter
Jan 10, 2011, 12:07 AM
A strange game. The only winning move is not to play

See... playing the lottery and global thermonuclear war are very similar.