On entropy, randomness and shuffling a playlist

Randomness is a surprisingly physical idea.

One of the many introductory Octave commands students are taught in our physics course is the rand() command. All this does is generate a random number; this number could be real so if you want an integer you can simply round() it off. This seemingly trivial capability is present in almost all computing languages[^ At least in all languages I’ve come across.]. But mentioning it in class recently brought back a recurring question to my mind: just how does a computer generate random numbers?

Think about this for a moment: if a computer could in fact generate a random number based purely on chance we would be in a rather dangerous world. The only reason you, as a human, can come up with a random number is because you have some sort of free will. A computer is, by definition, a dumb machine that follows instructions[^ That it can be given specific instructions to ultimately accomplish ‘smart’ tasks is a different issue altogether.] so it should never be able to think up something at random—unless it had free will or some equivalent skill of unbridled thinking. Randomness is proof of free thinking.

This was also the subject of a conversation with a colleague when we were wondering if, given an exhaustive playlist and ample time, we could determine the ‘randomness’ of our phones shuffling our music. After all this is simply a form of random number generation going from the 6th song to the 24th song to the 18th song et cetera using randomly generated numbers 6, 24, 18 and so on. But determinism and randomness are opposites; if we could determine the pattern our songs were playing after all it would mean that our devices were not truly shuffling something but were, instead, following a complex, layered—but periodic or otherwise identifiable—pattern in the process.

Getting started with pseudorandom numbers

The simplest way to generate a random set of numbers is via a cipher. Here is a little exercise I had tried out myself back when I was in school. Let us keep our limits[^ Following the publication of this article some of you wrote back to me wondering why limits were needed. To maintain generality you are correct that no limits are necessary; but you may need them in some situations: if you wanted to shuffle a playlist with 50 songs and the random number generator picked 137, what song would it play? In such a case having 1 and 50 as our limits to drive the program makes sense.] at 1 and 10, so we need to generate a random number between these.

Start off with a top-secret substitution cipher; say we have something like 3–1–8–2–4–7–6–9–5. Use the familiar sequence of natural numbers to compare this with. So 1 is 3; 2 is 1; 3 is 8; 4 is 2; 5 is 4 and so on. Next, pick a seed: say the user inputs 8, then, according to the cipher, you output 9.

At this point you are one-to-one with the seed. You could choose to keep this up and require a seed every time but you can also become a little more independent by using your own output as the next seed. In our example 9 becomes the next seed so the subsequent outputs will be 5, which will be followed in suit by 4, 2, 1, 3 and 8.

This is good enough for a kid starting out at school but there are some obvious problems with this algorithm. For example it would, perhaps, take a user three tries to figure out the pattern here[^ Of course there is nothing special about the number three: quick users will figure it out in two repetitions (the minimum) while slow ones will take many more (there is no upper limit).]. The randomness is instantly broken.

If you chose to go one-to-one a user could figure out your game simply by seeding the same number a minimum of twice and end up getting the same output both times. From then on it would only be a matter of seeding each of the remaining eight numbers and finding out their appropriate substitution.

At this point the user can game your system: if he wanted seven as his output he would simply seed six. This is only slightly more cumbersome than choosing the seventh song manually but it is outright pointless from the perspective of a random number generator.

Surprisingly enough this is exactly how random number generators work on a fundamental level. This is why we normally call these pseudorandom numbers; the numbers chosen thusly are random but not quite. The entire idea can now be re-stated as follows: if there are enough complexities in the number generation process to make the substitution hard to identify the program will appear, for all intents and purposes, like a generator of truly random numbers.

Boosting the complexity, part 1

The most efficient way to boost the complexity of our algorithm is to identify its weakest links and strengthen them. The first one is the length of the cipher[^ Imagine the same algorithm as before but with a billion numbers. If you chained the generator to use past outputs as new seeds automatically it would take considerably longer for users to figure out the entire billion-strong cipher.] but we will not dwell on this because a truly random number generator should work on any sets of limits, including something as small as one to ten.

While we are on the topic of lengthening our cipher itself, which is somewhat of a brute force attack, here is a quick and dirty way of thinking about how this could actually solve our ten-digit cipher problem. Every number can be reduced to a single digit number by simply adding up its digits. This is not real mathematics but so long as we can separate the digits of a number (weight them out by tens) you can add the digits up, e.g. 754 is simply 7 times 100 plus 5 times 10 plus 4 times one, so we ignore the weights of tens and add the multipliers to get 7+5+4=16 and repeat this process to get 7 in the end. Along the same lines if you have a billion-strong substitution cipher you can simply resort to choosing a number by the same means as above and adding its digits up over and over again till you get something within your range of interest (0–9 in our case). This is sufficiently random in the short run and will take users quite long to break (unless they have a computer).

So where do you really introduce complexity in all this? One of the problems with using a cipher is the secrecy involved (remember I said we need a ‘top-secret’ cipher—that was no joke). Of course you can never fully do away with this requirement but you can lessen the burden: instead of protecting a sequence of billion numbers drop the sequence and go for a formula instead. The great thing about this—besides the fact that it is simpler to handle a formula than a billion numbers—is that the thing need not make any sense whatsoever.

Say your formula is something like this: for a given number square it, subtract the previous output (or twelve if there is none), take the modulus, multiply by the previous output (or fourteen if there is none), compute its square-root and round it up to the nearest integer, then, if it is beyond a required range, keep adding up its digits repeatedly until you are within the range.

Of course the formula in this new method is complete gibberish but it does not matter. Nobody can figure it out[^ If there are fifteen possible mathematical operations (say) the permutation to pick seven out of those—which is the number of operations in our formula—is 217,945,728,000 possibilities. At the rate of one permutation per second this would take a computer over 6,000 years to figure out unless it gets incredibly lucky, so we are quite safe.] unless you give away the formula to someone yourself. Why does the gibberish work, though? It works for the same reason why messy passwords that randomly combine upper and lower cases, numbers and symbols are safer than password123—randomness makes things harder to crack (this is not a universal rule, as xkcd once pointed out sometime back in 2011). Still, the whole idea is simple: you keep a human-randomised formula to generate random numbers so that there is greater randomness even if it is not absolute.

For the curious, the use of twelve and fourteen in the method above is to ensure the algorithm works on the first run. These numbers are chosen at random—you could have put in anything else there too—and from the second run it uses neither choosing, instead, to use the previous outputs as mentioned. The twelve and fourteen, in other words, are simply random fallbacks.

Boosting the complexity, part 2

There is a second method of making our number generator random. Notice that regardless of our generator we need a seed. Those are the only two ways of hacking into our approach: either figure out the cipher/formula or figure out the seed. Neither is a complete break-in by itself because you need both to predict the coming sequence of (pseudo)random numbers, but knowing either considerably reduces the effectiveness of the generator because if you know the seed you can game the first output and if you know the formula or cipher you can game all subsequent outputs.

The question then is how to strengthen the seed. If the user picks the seed there is no point discussing security or randomness anymore. The user just has to restart the system and use a brute force attack to figure out which seed gives them their desired output. Our ideal seed, therefore, should be something that changes every time the program starts and which does not require the user’s intervention.

Here is the most obvious solution to this: what number changes everyday, nay every fraction of every second? The time. If our program can take the current time and add up its digits it can make up its own seed. From here on you could choose to use the past output as the next seed (as we did before) or simply keep regenerating a seed because the time changes every second. The former is good enough until the cipher/formula is broken but is still the safer choice because if the user realises that the time is being used they can use a computer to start the program at exactly the right time to successfully game[^ One might have wondered by now why I dwell so much on the importance of not being able to game the system. For starters the whole point of a random number generator is randomness; more important, these form the basis of a lot of technology these days on which the economy works, such as predicting stock markets, gambling and even encrypting your communications and payments on the internet. Being able to game these systems and predict the randomness can have devastating real world consequences.] the system. For our daily use, and for shuffling musing in your car, this sort of random number generator will do. For one you probably have no intention of cracking the sequence so long as it is not obviously predictable; and secondly the stakes are not that high when your iPhone shuffles music.

Taking things further with entropy

To generate a nearly perfectly random number one can turn to physics[^ At this point I am tempted to use the cringeworthy expression ‘duh’.]. Quite literally, you need the entropy of the physical world to help you out here. Purely as an example let us start with an outlandish idea[^ This is not so outlandish to be honest: Cloudflare’s Singapore office actually uses this method in real life to generate random encryption keys for computers.]. The key is a radioactive source (one of the reasons I picked this example is because the students at our department—we started with them, remember?—perform an experiment where they use a radioactive source to show randomness as the deviation from a gaussian peak using a Geiger counter).

What we are getting at here is that if you used a radioactive source with a counter for preset time intervals you will end up with completely random counts of emitted radiation. With enough counts, say a hundred minute-long sessions, you will end up with a gaussian curve, so there is always going to be a number that is more likely than another but the probability of any single number being generated is perfectly random. You can take the count itself or you can take its deviation from the mean (the most likely value) but the point is you now have a perfectly random number generator. Use this as the seed for the random number generator on your computer and you have a pretty solid program. True, your cipher/formula is now the weakest link in the chain but if you can afford to simply generate random numbers based on the radioactive counts things should turn out just fine.

But of course not everyone has access to radioactive sources. But you can actually come up with a more everyday method. How you will feed this into your computer is secondary and is not something I will dwell on here; the physical source of the randomness is what we are currently interested in. If you flip a coin you will end up with either heads or tails. Let us call these one and zero bits respectively. Although the likelihood of either a one or a zero is half-and-half we know that whatever happens on your next toss is in fact completely random. You can speed up the process by tossing four coins simultaneously (again, how you do this is left to you—bring a friend along, perhaps) and your outputs, say H–T–T–H or 1–0–0–1 form a nibble. Toss eight coins and you have a byte. If necessary convert this binary number into a decimal and you have your seed. Around 2010 at the University of Hagen the same strategy was used but with flip flop circuitry to generate random numbers.

There are yet other ways, perhaps the most interesting of which is the lava lamp set-up that Cloudflare[^ Cloudflare is an excellent cached data edge server and I use their services for this website too.] uses. Lava lamps generate bubbles randomly[^ As a physicist I feel I must point out here that, knowing the pressure and temperature conditions and the viscosity of the fluid inside the lava lamp, the bubbles are perfectly predictable. However the effort that goes into such a complex, evolving calculation is crazy enough that I doubt anyone would every bother with it. Also you would need to program this into a computer if you intend to keep up with the lava lamp.]. They constantly make pictures of these lamp set-ups and convert these pictures into streams of binary digits. Naturally no two pictures are ever going to similar so eternal randomness is guaranteed.

You can think of innumerable examples[^ For a more construction-worthy example take a look at Giorgio Vazzana’s random number generator built atop a chua circuit.] of this: a pair of dice (or, better yet, a dozen of them) or skipping stones on water (the skip count is going to be random) or even crumpling a few pieces of paper and counting the number of crumples on each. These are just a few examples I can think of while I write this piece but the point should be clear by now: we need to mix the digital and the physical world[^ At least until we have free-thinking, sentient robots {laughs nervously}.] for true randomness and anything short of the physical world results in pseudorandom sequences.

Choose any phenomenon in nature that involves randomness and use it as your seed digitally. For absolute security keep redoing this so your digital output is exactly as random as your physical one. And for efficiency with some security trade-offs use a cipher/formula based on a new physically generated seed every once in a while. Of course if the randomness has to do with just your iPhone shuffling songs weirdly, you can always choose to overlook the fact that every time you click shuffle on a certain playlist Daniel Boone starts singing ‘Beautiful Sunday’.