Numbers-Only Hash

Maybe two years ago, I encountered something that made me think. (I mean, that wasn’t the only time, I’m just… look, never mind.) I was committing something to some git repository, and the truncated commit hash (e.g. f151b8e) was all numbers. Not one of the possible letters a through f appeared in it. Of course, the full hash (e.g. f151b8e71a17fd7c7e074bc4c1500f2dc05da742) had plenty of letters, but by chance those first 7 nibbles were all less than a.

Hmm, I thought. How likely is that? Well, that’s a pretty easy problem. Let’s assume all hashes are equally likely. That means we can consider each character of the hash independently. Each character can be one of sixteen possible values (0 through f), but we want the cases where each character is a member of the subset 0 through 9. That means that the probability of getting a digit for a given character in the hash is given by the following.

Simple enough, but that’s just for one character. To get all seven characters as digits, we need to raise this value to the power 7.

So about 1 out of every 27 commits or so should contain nothing but digits in the first seven characters. To my mind, that’s infrequently enough to be interesting while still frequently enough to be fun.

Other Interesting Parameters

What other neat constraints can we put on?

Well, an obvious one is to expand our digits-only hash from 7 characters to the full 40. In that case, the probability is considerably smaller.

That’s about 1 in 146 million, so we can probably say that there might be a commit out there that only has digits. I bet a focused effort could even find it, but I leave that as an exercise for the reader (or maybe me in the future).

What about a short hash that’s in ascending order? We’ll include letters again, so it could be something like 24569bf. Notice that each character’s value is less than the one after it.1

How could we go about solving this? Well, somebody told me once that probability is just fancy counting, so let’s count. We know there are sixteen possible values and seven spots for them to go in. When we pick the first character, it can be any character up to 9, inclusive, since anything more wouldn’t leave us enough room to fit strictly-increasing characters.

That means that the only case for 9 looks like this: 9abcdef. For 8, there can be one “jump” in the sequence, and that jump can be anywhere, so 8abcdef and 89abcde are both valid. That jump can be in any of seven positions (after the first, after the second, etc.), so for 8 there are seven cases.

7 is where it starts to get more difficult. Here, there can be two jumps of value 1 or one jump of value 2. We know from 8 that the jump of value 2 can be in any of seven positions (e.g. 789cdef). A better way to count these along with the jumps of value 1 might be to think of the jumps as additive. That is, jumps are always of value 1, but if more than one happen to occupy the same position, they “stack up” and become a jump of a larger value. This makes the counting easier, because now we can consider the jumps individually.

Let’s try it. Each jump can occupy any of seven positions, and for 7 there are two jumps. That’s a total of $7^{2} = 49$ cases. For 6, there are three possible jumps, so that’s $7^{3} = 343$. In general, we can sum everything up using the following expression.

As you might expect, most of the cases come from the last term of the sum, when $n=9$.

Now, all we need to do is count up the number of possible arrangements of any sort and divide the two values to come up with the probability. Since there are sixteen possible values for seven different positions, finding the total number of possibilities is pretty straightforward.

And so is dividing the two values.

This value is surprisingly high to me. I wouldn’t have guessed that nearly 1 in 5 truncated commit hashes meet my strictly-increasing criterion, but apparently they do.

We could continue on like this forever.

There are endless criteria we could place on commit hashes (or other nearly-meaningless strings). I encourage the reader to set one or more and try to work out the probability. It’s good for the soul, or whatever.

  1. Now, some of you may be smugly thinking that you know the term for this property—as I did mere moments before writing these words—but you’re (probably) wrong. The parameter I’ve arbitrarily stated explicitly disallows repeating the same value, but a monotonic function can have plenty of repeated values if it wants, as long as the values never start going back the way they came.