Algorithms for Integer Factorization
In this post, we will explore the problem of factorizing integers (in particular semi-primes, i.e., products of two primes). We will implement and contrast two different methods to factorize an integer: the brute-force way with the sieve of Eratosthenes, and Pollard’s rho algorithm.
The problem of integer factorization constitutes a fundamental problem in number theory and provides the security basis of multiple modern encryption schemes.
For example, the RSA asymmetric encryption standard for communication between two parties uses a (public key, private key) pair. In simple terms, RSA computes a large integer as the product of two large primes (n = p.q) and derives the public and private keys from the prime factors. In particular, the public key e is chosen such that gcd(e, (p-1).(q-1)) = 1, and the private key d is the inverse of e modulo (p-1).(q-1). The RSA scheme assumes that n and e are public while the prime factors and the private key are kept secret. It’s easy to see from these formulas that, given n’s prime factors and the public key, one can easily derive the private key. Hence, it is of utmost importance to the security of the algorithm that it is “hard” to factorize a large integer; otherwise, the security of the scheme is broken. Given that the security of most asymmetric encryption standards (which are widely used today as part of symmetric key exchange and certificate authority digital signatures) is fundamentally based on the difficulty of integer factorization, a lot of research has been dedicated to improving the efficiency of integer factorization algorithms.
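To make that point concrete, here is a minimal sketch of how knowing the factorization of n immediately yields the private key. The toy primes and exponent below are illustrative only (real RSA uses primes of 1024 bits or more); Python’s built-in pow(e, -1, phi) computes the modular inverse directly (Python 3.8+).

from math import gcd

# Toy primes for illustration only; real RSA uses much larger primes.
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)    # (p-1).(q-1)

e = 17                     # public key, must satisfy gcd(e, phi) == 1
assert gcd(e, phi) == 1

# Knowing the factorization of n makes deriving the private key a one-liner:
d = pow(e, -1, phi)        # inverse of e modulo (p-1).(q-1)

# Quick round trip: encrypt with the public key, decrypt with the private key.
message = 42
ciphertext = pow(message, e, n)
assert pow(ciphertext, d, n) == message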
We will limit ourselves to the problem of factoring large numbers of the form n = p.q, where p and q are large prime numbers, which is the form used by RSA. Numbers of this form (a product of exactly two prime numbers) are called semi-primes.
Note that if we are able to decompose a number into the product of two other numbers, we can iterate this algorithm over the newly found factors, combined with a primality check (which runs in time polynomial in log(n)), to obtain the complete prime decomposition of any integer n. Therefore, the above restriction can easily be lifted to solve the general problem of integer factorization, as sketched below. Note the asymmetry: while primality testing can be done in time polynomial in log(n), no classical polynomial-time algorithm for integer factorization is known.
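As a sketch of that reduction (the names split, is_probable_prime, and full_factorization are ours; the splitter is assumed to be any routine that returns a non-trivial divisor of a composite number, such as the two implementations developed below), the complete factorization can be written recursively on top of a Miller–Rabin primality test:

import random

def is_probable_prime(m, rounds=20):
    # Miller-Rabin probabilistic primality test.
    if m < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if m % small == 0:
            return m == small
    # Write m - 1 as 2^s * d with d odd.
    d, s = m - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, m - 1)
        x = pow(a, d, m)
        if x in (1, m - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, m)
            if x == m - 1:
                break
        else:
            return False  # a is a witness that m is composite
    return True

def full_factorization(n, split):
    # split(n) is assumed to return a non-trivial divisor of a composite n,
    # e.g. the naive or rho functions defined later in this post.
    if n == 1:
        return []
    if is_probable_prime(n):
        return [n]
    d = split(n)
    return full_factorization(d, split) + full_factorization(n // d, split)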
Brute-Force Factorization with the Sieve of Eratosthenes
We will start with a naïve implementation of our factorization algorithm: list all primes less than n and check which prime in that list divides n. An immediate improvement is to only check primes less than or equal to the square root of n: since the number is a product of two primes, the smaller prime must be at most the square root of n.
We will use the sieve of Eratosthenes to find the primes in this range. The sieve goes through all numbers from 2 up to the end of the range and, for each number it keeps, progressively removes all of its multiples. We end up with a list of primes, and all that is left is to find the first one that divides n.
import math

def naive(n):
    # Exact integer square root; avoids float rounding issues on large inputs.
    sqrt_n = math.isqrt(n)
    # Sieve of Eratosthenes over [0, sqrt_n].
    isPrime = [True for i in range(0, sqrt_n + 1)]
    primes = []
    for p in range(2, sqrt_n + 1):
        if isPrime[p]:
            primes.append(p)
            for x in range(p**2, sqrt_n + 1, p):
                isPrime[x] = False
    # The smaller prime factor of a semi-prime must appear in this list.
    prime1 = 0
    for p in primes:
        if n % p == 0:
            prime1 = p
            break
    return prime1
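As a quick usage sketch on a small semi-prime (the input here is illustrative):

n = 8051                 # 83 * 97
p = naive(n)
print(p, n // p)         # prints the smaller factor and its cofactor: 83 97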
Analysis
The code above factorizes the integer n by building up the list of primes up to the square root of n and checking whether any of these primes divides n. Since the sieve always runs to completion, the algorithm has an every-case time complexity on the order of √n · log(log(n)), i.e., essentially log-linear in the length √n of the sieved range. A sketch of the derivation is shown below:
Order notation of the brute-force algorithm
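For reference, the standard sieve bound can be sketched as follows (our derivation using Mertens’ theorem, not necessarily the one from the original figure): each prime p up to √n crosses out about √n / p entries, so

T(n) \;\approx\; \sum_{p \le \sqrt{n}} \frac{\sqrt{n}}{p} \;=\; \sqrt{n} \sum_{p \le \sqrt{n}} \frac{1}{p} \;=\; \Theta\!\left(\sqrt{n}\,\log\log n\right),

since Mertens’ theorem gives \sum_{p \le m} 1/p = \ln\ln m + O(1). The subsequent trial-division pass performs at most about √n / log(n) divisions and does not change the bound.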
Its space complexity is on the order of the square root of n, since we have to keep track of all numbers up to the square root of n.
Pollard’s Rho
The algorithm is very simple to implement but relies on a pseudorandom sequence. This sequence is generated by a polynomial function, generally chosen as g(x) = (x²+1) mod n. The algorithm proceeds by repeatedly applying the polynomial to a starting seed value (usually 2) and checking whether two values of the sequence, while different modulo n, have become congruent modulo some (still unknown) prime factor of n. Such a collision is detected when the gcd of the absolute difference of the two values with n is greater than 1, and that gcd is then a non-trivial factor of n.
This approach can be thought of as a form of branch and bound in which the state space consists of all pairs of numbers (x, y); a pair is considered promising once the absolute value of its difference is no longer co-prime with n.
import math

def g(x, n):
    # Pseudorandom polynomial map driving the sequence.
    return (x**2 + 1) % n

def rho(n):
    # Floyd's cycle detection: x advances one step per iteration, y advances two.
    x = 2
    y = x
    d = 1
    while d == 1:
        x = g(x, n)
        y = g(g(y, n), n)
        d = math.gcd(abs(x - y), n)
    # d is a non-trivial factor of n unless d == n, in which case this run failed.
    return d
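As a quick usage sketch (the input is illustrative; with the seed 2 and polynomial above, this particular run finds the factor 97):

n = 8051                 # 83 * 97
d = rho(n)
print(d, n // d)         # 97 83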
Analysis
The algorithm depends heavily on the pseudorandom behaviour of the polynomial function modulo n. If this randomness is assumed, we can base our analysis on the birthday paradox: the counterintuitive fact that it only takes 23 people for the probability that two of them share a birthday to exceed 50%. This is because any two people could share a birthday, so each added person is compared against everyone already in the group, and the number of possible pairings grows quadratically with the size of the group.
Similarly, Pollard’s Rho keeps track of two pseudorandom numbers. Its analysis depends on the implicit sequence { x mod p }, where p is a non-trivial prime factor of n. Assuming randomness, we can expect two computed values that are different modulo n but equal modulo p after roughly the square root of p iterations, i.e., on the order of n^(1/4) for the smaller factor of a semi-prime.
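A small experiment can make this plausible (our own sketch; the instrumented loop and the sample semi-primes below are illustrative): count the iterations until the gcd becomes non-trivial and compare against the square root of the smaller factor.

import math

def rho_iterations(n, seed=2):
    # Same loop as rho(), but also counts the number of iterations performed.
    x = y = seed
    d, count = 1, 0
    while d == 1:
        x = (x**2 + 1) % n
        y = (((y**2 + 1) % n)**2 + 1) % n
        d = math.gcd(abs(x - y), n)
        count += 1
    return d, count

# Illustrative semi-primes p * q; the expectation is count on the order of sqrt(p).
for p, q in [(101, 103), (1009, 1013), (10007, 10009)]:
    n = p * q
    d, count = rho_iterations(n)
    status = "failed (d == n)" if d == n else "factor " + str(d)
    print(n, status, "after", count, "iterations; sqrt(p) ~", math.isqrt(p))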
The space complexity is O(1), since we only keep track of two numbers generated by the pseudorandom function and of whether they are promising (i.e., whether their difference is no longer co-prime with n).
Implementation Analysis
After implementing both algorithms in Python and running them on a test file containing semi-primes from 8 to 58 bits in size (with values from 143 to 149470864377634489), we measure their running time and memory usage and obtain the dataset described below:
Note that to obtain the memory used by a Python script, we use the tracemalloc module as follows:
import tracemalloc
tracemalloc.start()
# Call the function to analyze
mem = tracemalloc.get_traced_memory()
tracemalloc.stop()
The variable mem will contain a tuple with the memory currently allocated by the script as well as the peak memory the script used. Hence, we take mem[1] as the memory needed by the function to execute.
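Putting the pieces together, a measurement harness along these lines (our sketch, not the exact script behind the results below; note that tracemalloc adds overhead, so timing is ideally measured in a separate run) records wall-clock time and peak memory per input:

import time
import tracemalloc

def measure(factorize, n):
    # Returns (factor, elapsed seconds, peak bytes) for one call to factorize(n).
    tracemalloc.start()
    start = time.perf_counter()
    factor = factorize(n)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return factor, elapsed, peak

# Small demo input; the actual test file goes up to 58-bit semi-primes.
for f in (naive, rho):
    factor, elapsed, peak = measure(f, 8051)
    print(f.__name__, factor, round(elapsed, 6), "s", peak, "bytes")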
As we can see from the output, the average times of the two algorithms are drastically different: the brute-force implementation takes around 22.5 seconds on average to factorize an integer, while Pollard’s Rho only takes 0.008764 seconds. The time taken by the naïve implementation also ranges from 0.00003 to 394.77549 seconds as the size of the integer increases, while the range is only 0.00002 to 0.08404 seconds for Pollard’s Rho across all inputs.
The amount of memory used is also considerably different. The brute-force implementation used between 336 and 4.1913 x 10⁹ bytes of memory, with an average of 2.365 x 10⁸ bytes. This is considerably more memory-intensive than Pollard’s Rho, which only used between 84 and 504 bytes of memory, with an average of 223.38 bytes.
While the brute-force implementation was able to factorize every input it was given (a success rate of 100%), Pollard’s Rho failed on 4 of the 205 integers, for a success rate of 98%. This could be resolved by rerunning Pollard’s Rho with a different starting value for x, as sketched below. However, we didn’t rerun the algorithm, in order to stay faithful to the basic version and to highlight the limitations that come with its randomized nature.
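A simple retry wrapper could look like the following (our sketch; the seed range and attempt cap are arbitrary choices, and this was not part of the experiment above):

import math

def rho_with_seed(n, seed):
    # Same loop as rho(), but with a configurable starting value.
    x = y = seed
    d = 1
    while d == 1:
        x = (x**2 + 1) % n
        y = (((y**2 + 1) % n)**2 + 1) % n
        d = math.gcd(abs(x - y), n)
    return d

def rho_retry(n, max_attempts=10):
    # Retry with different seeds until a run yields a non-trivial factor.
    for seed in range(2, 2 + max_attempts):
        d = rho_with_seed(n, seed)
        if d != n:
            return d
    return None  # give up; a different polynomial g(x) could also be tried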
The analysis above was borne out: we correctly predicted that Pollard’s Rho would be more time-efficient than the brute force approach. We also correctly determined that Pollard’s Rho’s space requirement is constant, independent of the input size, while the brute-force implementation requires more space as the input grows in order to store the list of primes.
We can also plot the execution time with respect to the input N and to its size, as shown below (the blue plot is the naïve implementation and the red one is Pollard’s Rho):
Plotting execution time against the input (right) and its size (left)
For the naïve implementation, this is exponential in the size of N (measured as the number of bits in the binary representation of n, i.e., ~ log(n)), even though it grows only on the order of the square root of the value N itself.
Conclusion
In conclusion, we can see how two different approaches to the problem of integer factorization result in very different time and space complexities. While the brute force algorithm is deterministic and computes a list of primes as a by-product of solving the problem, it is very inefficient in terms of time and space requirements. On the other hand, Pollard’s Rho is very efficient in terms of time and space, but it is based on a pseudorandom sequence and offers no absolute guarantee of factorizing a given integer. In fact, rigorously proving the time complexity of Pollard’s Rho is still an open problem in mathematics and computer science.
While the problem of large integer factorization has been explored for decades, no efficient algorithm for it has been found on classical computers yet. This has led cryptographic schemes to rely on the difficulty of integer factoring for the confidentiality of the encrypted data. However, efficient quantum algorithms for integer factorization, such as Shor’s algorithm, have been shown to exist. And as quantum computers evolve to be more reliable and faster, the hardness of factoring an integer might become a thing of the past.