Wednesday 17 March 2021

Obey Benford's - It's The Law (an introduction to my Benford's Law project)

Introduction:

Some years ago, I read the book “How Long Is a Piece of String?: More Hidden Mathematics of Everyday Life” by Rob Eastaway (as reviewed here), and one chapter fascinated me.  The chapter was chapter 12 - “Is it a fake?”, and the section that particularly caught my interest was about Benford’s Law.  To oversimplify: in naturally occurring numbers, the leading digits follow a distinct pattern rather than being evenly distributed, with smaller leading digits turning up more often than larger ones.

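The expected percentage of numbers starting with each leading digit can be seen in the table below:

Leading digit    Expected %
1                30.1
2                17.6
3                12.5
4                 9.7
5                 7.9
6                 6.7
7                 5.8
8                 5.1
9                 4.6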
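
The expected percentages for Benford’s Law come from the formula P(d) = log10(1 + 1/d), where d is the leading digit (1 to 9).  As a minimal sketch, the Python below prints them; the function name benford_expected is only an illustrative choice, not something from the book or from this project.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected proportion of numbers whose leading digit is `digit` (1 to 9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Print the expected percentage for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_expected(d) * 100:.1f}%")
```

The nine proportions add up to 1, because the terms log10((d + 1) / d) telescope to log10(10).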

If you have a large naturally occurring data set that doesn’t conform to this, it tells you either that there are constraints on it, so that the data doesn’t cover all of the possibilities (e.g. human heights in metres will start with a 1 or a 2; no one has ever been 4 m tall), or that something else is going on.
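
One common way to put a number on “doesn’t conform” is a chi-squared goodness-of-fit test against the Benford percentages.  The sketch below is only one possible approach, not necessarily the check used in this project: the function names are hypothetical, and 15.507 is the standard table value for 8 degrees of freedom at the 5% significance level.

```python
import math
from collections import Counter

# Chi-squared critical value for 8 degrees of freedom at the 5% level
# (nine digit categories minus one), taken from standard tables.
CHI2_CRITICAL_DF8_5PCT = 15.507


def leading_digit(value: float) -> int:
    """Return the first non-zero digit of a non-zero number."""
    # Scientific notation always puts the leading digit first, e.g. '4.2e+01'.
    return int(f"{abs(value):.15e}"[0])


def benford_chi_squared(values) -> float:
    """Chi-squared statistic comparing leading digits with Benford's Law."""
    non_zero = [v for v in values if v != 0]
    counts = Counter(leading_digit(v) for v in non_zero)
    total = len(non_zero)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic


# Powers of 2 are a classic Benford-like sequence, so the statistic should be small.
powers_of_two = [2 ** n for n in range(1, 201)]
print(benford_chi_squared(powers_of_two) < CHI2_CRITICAL_DF8_5PCT)

# The numbers 1 to 999 have evenly spread leading digits, so the statistic
# should be well above the threshold.
print(benford_chi_squared(range(1, 1000)) > CHI2_CRITICAL_DF8_5PCT)
```

A statistic above the threshold only says the digits don’t look Benford-like; it can’t say whether that is down to constraints on the data or to something else.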

Testing this theory:

I wanted to test this out on *something*.  Problem was, what?  Most sports data is possibility-limited, e.g. in football fewer goals will be scored in the 9th or 9xth minute than in the 8th or 8xth minute, not because of the minute itself, but because the game stops at the 90th minute.  Other data sets aren’t big enough.  I needed a source of numbers that was large and unlimited.

Eventually, possibly in a fit of cynicism, I decided to try the leading digits of numbers reported in the news.  Advantages to this plan - I can use a single, traceable data source - one article a day from the BBC news website.  The BBC doesn’t tend to delete pages, so if someone wanted to double-check my numbers, I could give them the links.

Disadvantages to this plan - when I first attempted it, Article 50 was in the news, and every mention of it added another leading 5, skewing my results.

Having looked at the results and spotted this problem, along with a few methodological errors, and going a bit stir-crazy because of lockdown 3, I decided to try it again.

Attempt Number 2:

These were the rules I developed to try to avoid that and similar pitfalls:

1 - no numbers in names e.g. 19 in COVID-19 does not count as a leading digit

2 - no numbers from dates (I had done this originally, but worth restating)

3 - only numbers written as digits.  This threw up an unexpected problem - the BBC has somewhat intermittent editorial control on whether numbers under 10 are written as words or digits, and this may skew results.  I’ve saved the links to the articles I’ve used to put the project together so I can go through them again if I want to (or if someone else wants to look at them).
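
For anyone wanting to automate a first pass over an article, the three rules above could be roughly approximated with regular expressions.  This is a hypothetical sketch rather than the method used for this project: the pattern and function names are made up for illustration, it only recognises dates written with a month name, it leaves bare years in, and it can’t spot names such as “Article 50” where the number stands on its own.

```python
import re
from collections import Counter

MONTHS = (
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)"
)

# Rule 2 (assumed date formats): "1 February 2021", "1st of February", "March 2021".
DATE_PATTERN = re.compile(
    rf"\b\d{{1,2}}(?:st|nd|rd|th)?\s+(?:of\s+)?{MONTHS}(?:\s+\d{{4}})?\b"
    rf"|\b{MONTHS}\s+\d{{4}}\b",
    re.IGNORECASE,
)

# Rules 1 and 3: only digit strings, and none glued to a name, so the 19 in
# "COVID-19" and the 117 in "B117" are skipped, while "3,200" and "0.7" count.
NUMBER_PATTERN = re.compile(
    r"(?<![A-Za-z0-9.,])(?<![A-Za-z]-)\d+(?:,\d{3})*(?:\.\d+)?"
)


def count_leading_digits(article_text: str) -> Counter:
    """Tally the first non-zero digit of each number found in the text."""
    without_dates = DATE_PATTERN.sub(" ", article_text)
    counts = Counter()
    for match in NUMBER_PATTERN.finditer(without_dates):
        first_non_zero = next(
            (ch for ch in match.group() if ch in "123456789"), None
        )
        if first_non_zero is not None:  # skips a plain "0"
            counts[int(first_non_zero)] += 1
    return counts


sample = (
    "On 1 February 2021, COVID-19 cases rose by 3,200 in 45 areas, "
    "costing £0.7bn, said two officials."
)
print(count_leading_digits(sample))
```

In the sample sentence the date, the 19 in COVID-19 and the word “two” are all ignored, leaving one leading 3, one leading 4 and one leading 7.  Anything a pattern like this gets wrong would still need checking by hand against the rules.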

I started on the 1st of February 2021, and will carry on till the 1st of February 2022 (barring disaster).  The other advantage of this system is that if I miss a day, I can make up for it with an extra day at the end.  I will give monthly updates and running totals, plus some commentary if I have any.
