Wednesday, June 4, 2014

Yet another power-law tail, explained

At the next Boston Python user group meeting, participants will present their solutions to a series of puzzles, posted here.  One of the puzzles lends itself to a solution that uses Python iterators, which is something I was planning to get more familiar with it.  So I took on this puzzle, by John Bohannon, (who says he was inspired by this programming problem).  Here's John's version:
Consider these base-10 digits: 123456789. If you insert spaces between them, you get various sequences of numbers:
1 2 3 4 5 6 7 8 9
12 3 4 5 67 8 9
1 2 34 5 6 7 89
12 34 56 78 9
1 23456 78 9
12345 6789
1) Write a program that generates all possible combinations of those digits.
How many are there?
Now let's insert a maximum of 8 addition or subtraction operators between the numbers, like this:
Notice that those arithmetic expressions equate to different values:
1+2+3+4+5+6+7-8+9 = 29
12-3+4+5-67-8+9 = -48
1+2+34+5-6-7-89 = -60
12-34+56+78+9 = 121
1+23456-78-9 = 23370
12345+6789 = 19134
2) Write a program that generates all possible expressions in this way.
How many sum to 100?
3) Write a program that finds all such expressions for any sum.
Which sum is the most popular, i.e. has the most expressions?
4) Bonus: We can haz pretty data viz?
Like how about a histogram of the number of expressions with sums from -23456788 to 123456789. (A log scale might help. Maybe binning, too.)
I put my solution in an iPython notebook, which you can view here.  Some conclusions:

  1. The distribution of values you can generate by arranging these digits and operators has a power-law tail.
  2. The distribution is a mixture of (approximately) uniform distributions, where each element of the mixture is characterized by the length of the largest term in the expression.  That is, if the biggest term has 3 digits, the values tend to be in the 100s.  If the biggest number has 4 digits, the value is usually in the 1000s, etc.
  3. The reason for the power law is that each element of the mixture is an order of magnitude bigger, but about 0.6 orders of magnitude more rare.  So the parameter of the power tail is about 0.6.
I also ran the analysis with hexadecimal numbers and confirmed that the pattern persists.

For future exploration: what do you think happens if you add multiplication to the operator mix?  Download the notebook and try it!