In one of Quarkslab's projects, we came across the issue of randomizing a large set of integers, described as a list of disjoint intervals. These intervals can be represented as a sorted list of integers couples, like this one: . The idea is to randomly and uniquely select numbers across these intervals, giving a shuffled list of numbers that belong to them. For instance, is a possible output. Moreover, each possible permutation of the integers set should have equal probability of appearance. If you're just interested in the final library that "do the job", go directly to the implementation section to download the leeloo C++ open-source library on Github !
Trivial algorithm
The not-so-trivial (but still) algorithm is to generate an array containing all the original sorted integers, and then apply a shuffle algorithm (like Fisher–Yates [1]) that uses a common Pseudo Random Number Generator (PRNG). As an example, in C++, std::shuffle can be used to do that.
The main issue is that a buffer of n integers is required. For instance, with 32-bit integers, one needs a buffer of 8GB, which is not acceptable in our situation.
Other trivial algorithm
Another approach to reduce the memory footprint is to randomly select numbers between (using a classical PRNG), and keep a bitfield of already returned candidates not to return twice the same. When we start to reach too many times the same numbers, we change the algorithm:
With the remaining numbers of candidates to find, get a random number and find the position of the R-th bit not set in the bitfield. That can be optimized thanks to SSE instructions ;
Set that bit and return the value ;
Go on with until .
There are multiple drawbacks with this algorithm:
It still needs memory bytes (even if it is less than the previous algorithm) ;
The final stage can be really slow if R is such that the remaining bitfield does not fit in cache.
See also [5] for a description of a similar algorithm.
Problem reduction
Thus, the main issue is to generate a list of unique random numbers between a given interval, with good performances (say about 50 million numbers per seconds on a Core i7 3rd gen) and a small memory footprint ().
So, the final problem is to be able to choose in an equiprobable manner a permutation of among the ones ( being the number of permutations of distinct numbers [4]), using only bytes of memory (keeping in mind the performance criteria).
The first question is to understand if this is even feasible, and the second issue is to figure out an efficient method to achieve this.
Formalization
Let's do some math to formalize this problem.
Some context: let and the numbers' set to shuffle ; is always chosen as a prime number.
The choice of as a prime induces interesting properties (as it is shown below), but the careful reader would notice that we won't always have a prime number of integers to generate. However, we can still live with that.
Indeed, let the original number of integers to generate and chosen as:
is prime ;
;
, is not a prime.
Or, in other words, is the smallest prime number greater than or equals to .
That way, we will produce numbers between , and not . It is not really an issue because:
when a number in is generated, just discard it and compute the next one until it belongs to ;
the density of prime numbers in allows us to do that, as the maximal gap between two consecutive prime numbers is 354 [2].
We will now work in . being a prime, is a division ring [3] (that's the great property).
Then, with the set of permutations of , our problem is equivalent to choose with equal probability a permutation in .
(Partial) resolution
All of that theory is nice, but it does not change a lot of things in concrete. Let's go now in the crux of the issue, and understand what can be done with :)
Permutation polynomial
One can notice that every application can be written as a polynomial of . Indeed, for instance, given an application, a chebytchev polynomial equivalent to can always be found.
Thus, every element of can be described as a polynomial of .
Trivial algorithm
That way, one (still not-so-trivial ;)) algorithm would be:
Generate a random polynomial of . This is equivalent to compute random coefficients ;
Check if this polynom represents a permutation ;
If not, go back to the first step.
But wait... There are multiple issues here.
First, these coefficients need to be stored in memory, giving a memory footprint of bytes.
Moreover, the problem of checking whether a polynomial is a permutation one or not can be somehow complex and slow. Probabilistic methods exist (shown in [6]), but it still leaves us with some potential errors. The performance cost of all of this could be important. We didn't take the time to benchmark this algorithm as it suffers from the memory issue...
And finally, left to generate a buffer of integers, we could just stick to the first "trivial" algorithm described at the beginning of this paper.
The real great stuff
We need to find a better way to generate these polynomials. We will use the division ring properties.
Indeed, it can be demonstrated that, in , every permutation is a bijection, and every bijection is a permutation.
Thus, a whole set of polynomials can be described:
for every is a bijection, and thus belongs to (Equation 1)
for every such as , is also a bijection [6], and belongs to (Equation 2).
Moreover, as the combination of two bijection functions is a bijection, combining these two sets of polynomials will produce new ones.
What's even more interesting is that it can be demonstrated that, for every is a generator of the group, using the composition law. [6]
So, the final result is that, theoretically, every permutation of can be defined as a combination of these polynomials aforementioned.
Entropy and equiprobability: how random is random?
For the following, we will defines three sets :
, the values that can take a in (Equation 1) ;
, the values that can take b in (Equation 1) ;
, the values that can take in (Equation 2).
Let's define these two applications:
The first idea coming to our mind is to randomly combine the polynomials generated by these applications.
Let's define ( stands for 'seed part') as .
For instance, let's randomly choose and . The couple can be considered as the seed of our random number generator.
We know that is a permutation polynomial, is another one, also, etc...
(Note: is the function composition, which means that, for instance, )
Thus, every 'seed' values that belongs to can produce a set of permutations.
Unfortunately, there are main issues with this approach, that we will call "entropy reduction". Indeed, we know that we can create permutations by composing and , but :
can also be expressed as . In other words, a combination of affine functions is an affine function). Moreover, as shown in Appendix A, choosing independently and in and computing is equivalent to randomly choose a number in . The same goes with . Thus, if we choose one seed and compute , this is equivalent to choose a seed ;
The same issue comes with , which is equals to ;
Even by combining , then with and , (giving ), the following question must be answered:
is there any couple such as ?
Another way to formalize this problem is as follows: given a seed taking values in a space integers from ( unknown), is:
a bijective function?
Now, let's demonstrate a somehow intuitive result.
If is a bijection, then , which gives . This means that, in order to generate a random permutation of , we must choose a seed number between the ones. In other words, we must choose unique random numbers. Well, this has just sent us back to the beginning of this article.
Our method for compromises
But the game is not yet finished, we haven't gone this far for nothing. So let's work a bit with our results.
We now understand that, somehow, some compromises have to be made. We know that the size of the seed must be reduced. By doing this, we know that we won't be able to uniquely generate all the possible permutations of . Moreover, we want to do this in such a way that these properties will be conserved the best way:
we still reach a fairly "reasonable" amount of permutations among ;
all these permutations are unique (or a "lot of" them) ;
all of this has still "good" performances (we haven't talk yet a lot about this one, but we don't forget it :)).
At this point, we decided to study the following UPRNG (named ):
This choice is made because it produces a function that can be easily computed, and can still give interesting results.
Number of generated permutations
If is randomly chosen in , then the number of generated permutations with this method, is : (with the Euler totient function [3]).
As we've seen above, the number of unique generated permutations may be inferior to this.
Thus, if we have for instance , we can search for the set of for which same UPRNG is the same.
Let
;
;
.
We need to resolve:
The complete resolution of this equation being a bit human-time consuming, we'll do it with , and using mathematical software, we can find these solutions :
obviously, ;
and .
Which means than, when , only half the numbers of possible values for will give a unique permutation. By the way, this proves the fact that our UPRNG function isn't bijective.
We can test this easily with . being equals to 1, we can define:
And this python code:
def l(x,a,b,p): return (a*x+b)%p
def g(x,c,p): return (x**c)%p
def lgn(x,a,b,c,p,n):
for i in xrange(0,n):
x = g(l(x,a,b,p),c,p)
return x
list_x0 = list()
list_x1 = list()
p = 17
a = 5
b = 0
c = 3
n = 2
for x in range(0,p):
list_x0.append(lgn(x, a, b, c, p, 2))
list_x1.append(lgn(x, p-a, b, c, p, 2))
print(list_x0)
print(list_x1)
Which gives:
[0, 4, 8, 5, 16, 14, 10, 6, 15, 2, 11, 7, 3, 1, 12, 9, 13]
[0, 4, 8, 5, 16, 14, 10, 6, 15, 2, 11, 7, 3, 1, 12, 9, 13]
One possible solution is to reduce the space of , for instance with . But, without the full resolution of the (equation 1), this is just a partial resolution.
Going further with another UPRNG
If we want to reduce the "entropy reduction", we need to improve the size of the seed values. For instance, this UPRNG could be defined as:
As above, we try to find such as .
Using mathematical resolution software, with (for instance), this gives the following solutions:
Let
;
.
We have:
(trivial) ;
.
which gives constraints on the choice of our constants in order to try and have bijective.
The resolution of the full system is left for further work on the subject ;)
Implementation and benchmarks
The implementaion of what's described here (and more) has been done in the C++ "leeloo" open-source library that you can find on github here : https://github.com/quarkslab/libleeloo. It also provides python bindings for python fans around here.
The library allows to manage integer intervals, aggregate them and randomly sort the elements as described in the introduction. It also provides an IPv4 range parser for convenience usage.
There are two main UPRNG implemented:
one that uses the method described here [5]. This one is historical, optimised with SSE/AVX instructions and "fast" (see figures below) ;
one that uses . It is 8 to 14 times slower that the original one (due to the modular exponentation), but provides a larger possible set of permutations.
Moreover, each UPRNG can be instantiated in "atomic" mode, which makes them thread-safe.
Some figures about performances : on a Core i7-3770 (3.4GHz, 4 cores with Hyperthreading), we obtain:
with the first UPRNG, we can generate, with the SSE/AVX and parallelised version, about 290 millions of 32-bit numbers per second. This makes a memory bandwidth of about 1.2GB/s, making this generator CPU-bound (for now) ;
with the second UPRNG, we can generate, with the parallelised version, about 30/n million numbers/s ('n' being the part of the seed that defines the number of compositon of .). This is because the performance of this generator is limited mainly by the modular exponentation computations. This generator is also clearly CPU-bound.
C++ and Python usage samples can be found on github at : https://github.com/quarkslab/libleeloo/tree/master/tools and https://github.com/quarkslab/libleeloo/tree/master/tests.
Conclusion
Giving some compromises, we find a solution to our original problem that is actually good enough for our project needs. We still are a bit frustrated not to have the actual time to go further in this subject.
There exists other ways that haven't been studied here to generate permutation polynomial, as for instance described on this wikipedia page : http://en.wikipedia.org/wiki/Permutation_polynomial. This is also another interesting work that could be done :)
It can also be mentioned that some people already looked into the subject and published articles. For instance, [5] uses the quadratic residues with prime numbers. It can be noticed that the permutation given by this method can also be expressed as a permutation polynomial (but involves more computations).
Finally, thanks to Sebastien Kaczmarek (@deesse_k) for the original talks on the subject (and other ideas), to Ninon Eyrolles for her help on the redaction and some of the mathematics here and Kévin Szkudlapski for his advices.
Going further
For the reader that might want to go further, here are some ideas:
Work on UPRNG2 ;
For the described UPRNGs, find out the number of unique permutations ;
Benchmark and analyze other ways to generate permutation polynmials ;
Something that would be nice: given a seed space of size , find out and a subset of such as:
is a bijection.
You're welcome to send us feedbacks :).
Appendix A
Let:
a prime number ;
(which is a division ring) ;
two independant random variables of .
First, we have:
We have, for every ,
Let and , such as , Thus, can have different values, and is unique for a fixed . The number of such as is then .
So,
which means that choosing independently two random numbers in and sum the two of them is equivalent to only choose one random number in .
Let's demonstrate the same result with .
Let and such as ,
Thus, can have different values. As is a division ring, we have which is unique for a given . So, the number of such as , and
If we take these two results, and let ,
Let and , then exists and is unique for a given and . Thus, the number of such as is , and:
References
[1] | https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle |
[2] | https://en.wikipedia.org/wiki/Prime_gap#Numerical_results |
[3] | (1, 2) https://en.wikipedia.org/wiki/Euler%27s_totient_function |
[4] | https://en.wikipedia.org/wiki/Permutation |
[5] | (1, 2, 3) http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers |
[6] | (1, 2, 3) http://tel.archives-ouvertes.fr/docs/00/43/87/65/PDF/LaigleChapuy_these.pdf |