Posted on February 13, 2015

Sieves in Haskell (part 1)

Ever since Project Euler’s problem 3 I have wanted to experiment with calculating prime numbers using sieves. It’s really not a complicated task, but I guess sometimes the problems in Project Euler are so easily solvable with a one-liner that I just kept postponing it.

With problem 7, which asks for the 10001st prime number, it becomes clear that it is not sustainable to keep relying on trial division to find prime numbers, so the time has come to implement and play around with sieves.

So this is a five-day story about prime numbers and sieves, and about how optimizing programs can become so addictive.

Trial division

We should still try to solve this with trial division, at least to have the baseline for all the other algorithms:

isPrime n 
    | n == 2             = True
    | n < 2 || even n    = False
    | otherwise          = not (any (\i -> n `mod` i == 0) [3..n-1])

trialAndDivision limit = [n | n <- [2..limit], isPrime n]
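
A quick sanity check in GHCi (my own check, not one of the timed runs) - there are 168 primes up to 1000, so we should get:

ghci> length (trialAndDivision 1000)
168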

This primality test is very basic (it’s the one I used for problem 3) and has very few optimizations - I never try division by 2, and even numbers return False immediately.

So, how will these tests be done?

I’m running this on my laptop (a 2.66 GHz Intel Core i7 MacBook Pro from 2010, with 8 gigs of RAM) and apart from the command line only Firefox and emacs are running.

All test cases are compiled with -O2 and they never print anything to the screen except for the size of the generated primes list. The app is always something like:

import Problem007

main = print . length $ trialAndDivision 1000
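
Every run is timed with something along these lines (the profiling flags that show up later in the post are left out here):

$ ghc --make -O2 Try.hs && time ./Try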

So, back to the unoptimized version:

25 seconds for primes up to 100000 is already way too much, but there are a couple of small optimizations we can do: instead of testing for < 2 and for evenness inside isPrime (which happens for every number coming out of the list comprehension) we can make the lazy list go only over odd numbers:

isPrime2 n 
     | n == 2         = True
     | otherwise      = not (any (\i -> n `mod` i == 0) [3..n-1])

trialAndDivision2 limit = 2:3:5:[n | n <- [7,9..limit], isPrime2 n]

And now:

So there is a difference indeed – it now takes us 1 second more! So let’s leave trial division behind and go over to the sieves.

Eratosthenes

Ok, so here’s my first implementation of the Sieve of Eratosthenes:

eratosthenes1 limit = eSieve1 [2..limit] []

eSieve1 [] primes = primes
eSieve1 (x:xs) primes = eSieve1
            ([y | y <- xs, not (y `isMultiple` x)])
            (primes++[x])

isMultiple n = \x -> elem n [x,x+x..n]

The first implementation was very naive, and only meant to see how long a typical calculation would take. It was also a direct copy of what the definition on Wikipedia says:

"…enumerate its multiples by counting to n in increments of p, and mark them in the list…" So, even though this enumeration is just a test for a number being a multiple of another, I went with an implementation which really created a list with all those multiples.

So, one could say that the sieves aren’t really living up to their promises right? But no, it’s just my implementation that sucks really.

Let’s start with the optimizations. It is obvious that the function isMultiple is not really needed - instead I can simply check whether the remainder of the division is 0 and, if so, the number is a multiple. Even though this requires doing several divisions, it’s easy to see that the number of operations will be far lower than in the first implementation.

eratosthenes2 limit = eSieve2 [2..limit] []

eSieve2 [] primes = primes
eSieve2 (x:xs) primes = eSieve2
            ([y | y <- xs, y `mod` x /= 0])
            (primes++[x])

Now that is a lot better - more than twice as fast as trial division for the same limit, mainly by avoiding the creation of so many lists. So let’s try to beat that with a couple of final tweaks: iterate only over the odd candidates (seeding the result with 2) and build the primes list with a cons instead of an append, since appending to the end of a list costs O(n) every time (the result comes out in descending order, but we only ever look at its length).

So let’s give it another try:

eratosthenes3 limit = eSieve3 [3,5..limit] [2]

eSieve3 [] primes = primes
eSieve3 (x:xs) primes = eSieve3
            ([y | y <- xs, y `mod` x /= 0])
            (x:primes)

Only 2 seconds faster. I think that’s enough optimization, so let’s try a final run for 1 million.

17 minutes!!! I was hoping to get to something below 1 minute, to be honest… but let’s try that with the next sieves then.

Sundaram

Sundaram also created a sieve which is said to be better than Eratosthenes’, so let’s see if that’s true:

import Data.List

sundaram1 limit = let limit' = (limit `div` 2) in
            2 : (map (\n -> 2*n+1)
                      ([1..limit']
                      \\ 
                      [i + j + 2*i*j | i <- [1..limit'], 
                                       j <- [i..limit']]))

sundaram1 1000 runs in 9.59 secs and it is absurdly slow, even worse than Eratosthenes. So I must be doing something wrong.

Let’s see now, there are a few things I don’t like in this version, chief amongst them the fact that I used \\ to do something like a set difference on the sieve, when instead I think I can use a list comprehension, which should decrease the amount of operations.

sundaram2 limit = let limit' = (limit `div` 2) in
                2:[2*n+1 | n <- [1..limit'],
                                not (n `elem` [i + j + 2*i*j | i <- [1..limit'],
                                                                     j <- [i..limit']])]

Still not good. I could also extract the initial sieve creation outside of the calculations:

sundaram3 limit = let limit' = (limit `div` 2) 
                      sieve  = [i + j + 2*i*j | i <- [1..limit'],
                                                j <- [i..limit']] in
                  2:[2*n+1 | n <- [1..limit'],
                             not (n `elem` sieve)]

So the answer has to be in the mathematics. Let’s look at the bounds on i and j in i+j+2ij when we only keep values up to n: since j starts at i, the smallest value produced for a given i is 2i + 2i^2 (at j = i), so i never needs to go above roughly sqrt(n/2); and for a fixed i, keeping i+j+2ij below n means j never needs to go above (n-i)/(2i+1). So let’s do those changes and limit the amount of numbers we generate in the sieve.

sundaram4 :: Int -> [Int]
sundaram4 limit = let limit' = (limit `div` 2) in
            let sieve = [i + j + 2*i*j | let n' = fromIntegral limit',
                                         i <- [1..floor (sqrt (n' / 2))],
                                         let i' = fromIntegral i,
                                         j <- [i..floor( (n'-i')/(2*i'+1))]] in
            2:[2*n+1 | n <- [1..limit'],
                       not (n `elem` sieve)]
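
As a quick sanity check that the tighter bounds don’t drop any composites (again just a GHCi check of mine, not one of the timed runs), sundaram4 should produce exactly the same list as sundaram3:

ghci> sundaram4 500 == sundaram3 500
True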

So it’s still lousy to be honest, since it is slower than the sieve of Eratosthenes when it supposedly should be faster. So let’s try one more optimization: get rid of the elem call, since that forces Haskell to walk through the whole list, and rely on recursion over two sorted lists instead:

initialSundaramSieve limit = let topi = floor (sqrt ((fromIntegral limit) / 2)) in
                       [i + j + 2*i*j | i <- [1..topi],
                                        j <- [i..floor((fromIntegral(limit-i)) / fromIntegral(2*i+1))]]

sundaram5 limit = let halfLimit = (limit `div` 2) in
                2:removeComposites ([1..halfLimit]) (sort $ initialSundaramSieve halfLimit) 

removeComposites []     _                  = []
removeComposites sieve  []                 = map (\s -> 2*s+1) sieve  -- leftover candidates are primes too
removeComposites (s:ss) (c:cs) | s == c    = removeComposites ss cs
                               | s > c     = removeComposites (s:ss) cs
                               | otherwise = 2*s+1 : (removeComposites ss (c:cs))

Now that’s unbelievably faster!!!! So let’s see what happens with 1 million:

And this gives me 78498 primes under 1 million, which is correct according to Wolfram Alpha. So that was cool! Let’s see if the Sieve of Atkin can beat that:

Atkin

The Sieve of Atkin has a lot more steps and follows a somewhat different logic. There are three mandatory passes over the initial sieve - one for each of the quadratic forms 4x^2 + y^2, 3x^2 + y^2 and 3x^2 - y^2 - and each of them may flip a given entry between prime and not-prime more than once.

So I created a few more helper functions and this is the result:

initialAtkinSieve limit = sort $ zip
                  [n | n <- [60 * w + x | w <- [0..limit `div` 60],
                                          x <- [1,7,11,13,17,19,23,29,31,37,41,43,47,49,53,59]],
                       n <= limit]
                  (take limit [0,0..])

aFlip :: (x,Int) -> (x,Int)
aFlip (x,1) = (x,0)
aFlip (x,0) = (x,1)

flipAll :: [Int] -> [(Int, Int)] -> [(Int, Int)]
flipAll _      []     = []
flipAll []     ss     = ss
flipAll (f:fs) ((s,b):ss) = if s < f
                            then (s,b): (flipAll (f:fs) ss)
                            else if s == f
                                 then flipAll fs ((aFlip (s,b)):ss)
                                 else flipAll fs ((s,b):ss)

firstStep sieve = let size = (fst $ last sieve)
                      topx = floor(sqrt(fromIntegral (size `div` 4)))
                      topy = floor(sqrt(fromIntegral size)) in
                  flipAll (sort [n | n <- [4*x^2 + y^2 | x <- [1..topx],
                                                         y <- [1,3..topy]],
                                     n `mod` 60 `elem` [1,13,17,29,37,41,49,53]])
                          sieve
    
secondStep sieve = let size = (fst $ last sieve)
                       topx = floor(sqrt(fromIntegral (size `div` 3)))
                       topy = floor(sqrt(fromIntegral size)) in
                   flipAll (sort [n | n <- [3*x^2 + y^2 | x <- [1,3..topx],
                                                          y <- [2,4..topy]],
                                  n `mod` 60 `elem` [7,19,31,43]])
                           sieve
    
thirdStep sieve = let size = (fst $ last sieve)
                      topx = floor(sqrt(fromIntegral size)) in
                  flipAll (sort [n | n <- [3*x^2 - y^2 | x <- [1..topx],
                                                         y <- [(x-1),(x-3)..1],
                                                         x>y],
                                 n `mod` 60 `elem` [11,23,47,59]])
                           sieve

unmarkMultiples n sieve =
    let limit = (fst $ last sieve) in
        unmarkAll [n | n <-[n,n+n..limit]] sieve
    
unmarkAll _        []         = []
unmarkAll []       sieve      = sieve
unmarkAll (np:nps) ((s,b):ss) = if np == s
                                then (unmarkAll nps ((s,0):ss))
                                else if   np < s
                                     then unmarkAll nps ((s,b):ss)
                                     else (s,b) : (unmarkAll (np:nps) ss)
    
atkin1 limit = aSieve1 (thirdStep . secondStep . firstStep $ initialAtkinSieve limit) [(5,1),(3,1),(2,1)]
    
aSieve1 []         primes = primes
aSieve1 ((x,b):xs) primes = if   b == 1
                            then aSieve1 (unmarkMultiples (x^2) xs) ((x,b): primes)
                            else aSieve1 xs                         primes

So let’s see:

And for the big one:

atkin1 1000000 = 42.98s user 0.32s system 99% cpu 43.455 total

So that performance, although really nice overall, sucks when compared to the sieve of Sundaram. So let’s try to do some improvements.

One thing I don’t like is the fact that I have to compose three functions to flip the prime status of the entries in the initial list. I would prefer to do this in one go: concatenate the flip lists into one, sort it once, and flip everything in a single pass.
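
Just to make that idea concrete (this is only a sketch, not something I benchmarked - firstFlips, secondFlips, thirdFlips and flipOnce are hypothetical variants of the functions above):

-- Hypothetical variants of the three steps that only produce their flip
-- lists; the flipping itself then happens once, over a single sorted list.
firstFlips size =
    let topx = floor (sqrt (fromIntegral (size `div` 4)))
        topy = floor (sqrt (fromIntegral size)) in
    [n | n <- [4*x^2 + y^2 | x <- [1..topx], y <- [1,3..topy]],
         n `mod` 60 `elem` [1,13,17,29,37,41,49,53]]

secondFlips size =
    let topx = floor (sqrt (fromIntegral (size `div` 3)))
        topy = floor (sqrt (fromIntegral size)) in
    [n | n <- [3*x^2 + y^2 | x <- [1,3..topx], y <- [2,4..topy]],
         n `mod` 60 `elem` [7,19,31,43]]

thirdFlips size =
    let topx = floor (sqrt (fromIntegral size)) in
    [n | n <- [3*x^2 - y^2 | x <- [1..topx], y <- [(x-1),(x-3)..1], x > y],
         n `mod` 60 `elem` [11,23,47,59]]

-- One sort, one pass: sort keeps duplicates, so every sieve entry still gets
-- flipped exactly as many times as with the three separate passes.
flipOnce sieve =
    let size = fst (last sieve) in
    flipAll (sort (firstFlips size ++ secondFlips size ++ thirdFlips size)) sieve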

I think that would make the whole thing go faster. However, maybe this is a good opportunity to actually use a profiler and see if that is really the bottleneck of my code. Those tools come out of the box with GHC; we just need to change the compilation and execution commands a little bit:

luis at Curry in ~/dev/euler
$ ghc --make -O2 Try.hs -prof -auto-all && time ./Try +RTS -p
[1 of 2] Compiling Problem007       ( Problem007.hs, Problem007.o )
[2 of 2] Compiling Main             ( Try.hs, Try.o )
Linking Try ...
78498
./Try +RTS -p  75.80s user 1.31s system 99% cpu 1:17.47 total
    
luis at Curry in ~/dev/euler
$ more Try.prof
        Tue Feb 10 09:00 2015 Time and Allocation Profiling Report  (Final)
    
           Try +RTS -p -RTS
    
        total time  =       74.58 secs   (74577 ticks @ 1000 us, 1 processor)
        total alloc = 2,906,769,880 bytes  (excludes profiling overheads)
    
COST CENTRE           MODULE     %time %alloc
    
unmarkMultiples.limit Problem007  96.7    0.0
unmarkAll             Problem007   2.7   85.6
firstStep             Problem007   0.2    3.8
thirdStep             Problem007   0.1    2.5
secondStep            Problem007   0.1    2.4
flipAll               Problem007   0.1    2.0
initialAtkinSieve     Problem007   0.0    3.1
    
(...)
    

So the verdict is very clear then: firstStep, secondStep and thirdStep are responsible for only 0.4% of the total time, while unmarkMultiples takes ~97% of it - in particular the limit calculation, which makes sense since it relies on last, and finding the last element of a linked list of size n is O(n).

So let’s just pass the size around from the very first function, including into the three steps:

firstStep size sieve =
      let topx = floor(sqrt(fromIntegral (size `div` 4)))
          topy = floor(sqrt(fromIntegral size)) in
      flipAll (sort [n | n <- [4*x^2 + y^2| x <- [1..topx],
                                            y <- [1,3..topy]],
                         n `mod` 60 `elem` [1,13,17,29,37,41,49,53]])
               sieve

secondStep size sieve =
       let topx = floor(sqrt(fromIntegral (size `div` 3)))
           topy = floor(sqrt(fromIntegral size)) in
           flipAll (sort [n | n <- [3*x^2 + y^2 | x <- [1,3..topx],
                                                  y <- [2,4..topy]],
                              n `mod` 60 `elem` [7,19,31,43]])
                    sieve

thirdStep size sieve =
      let topx = floor(sqrt(fromIntegral size)) in
          flipAll (sort [n | n <- [3*x^2 - y^2 | x <- [1..topx],
                                                 y <- [(x-1),(x-3)..1],
                                                 x > y],
                             n `mod` 60 `elem` [11,23,47,59]])
                  sieve

unmarkMultiples limit n sieve =
            let nonPrimes = [y | y <-[n,n+n..limit]] in
            unmarkAll nonPrimes sieve

unmarkAll _        []         = []
unmarkAll []       sieve      = sieve
unmarkAll (np:nps) ((s,b):ss) 
      | np == s        = unmarkAll nps ss
      | np < s         = unmarkAll nps ((s,b):ss)
      | otherwise      = (s,b) : (unmarkAll (np:nps) ss)

atkin1 limit =
       aSieve1 limit
                  ((thirdStep limit) .
                   (secondStep limit) .
                   (firstStep limit) $
                   initialAtkinSieve limit)
                  [(5,1),(3,1),(2,1)]

aSieve1 _     []         primes = primes
aSieve1 limit ((x,b):xs) primes
           | b == 1       = aSieve1 limit (unmarkMultiples limit (x^2) xs) ((x,b): primes)
           | otherwise    = aSieve1 limit xs primes

So, let’s see again what the profiler says for atkin1 1000000:

luis at Curry in ~/dev/euler
$ ghc --make -O2 Try.hs -prof -auto-all && time ./Try +RTS -p && more Try.prof
[1 of 2] Compiling Problem007       ( Problem007.hs, Problem007.o )
[2 of 2] Compiling Main             ( Try.hs, Try.o )
Linking Try ...
78498
./Try +RTS -p  1.96s user 0.11s system 98% cpu 2.098 total
        Fri Feb 13 08:04 2015 Time and Allocation Profiling Report  (Final)
    
           Try +RTS -p -RTS
    
        total time  =        1.43 secs   (1433 ticks @ 1000 us, 1 processor)
        total alloc = 2,785,514,448 bytes  (excludes profiling overheads)
    
COST CENTRE       MODULE     %time %alloc
    
unmarkAll         Problem007  72.6   85.0
firstStep         Problem007   8.7    3.9
secondStep        Problem007   5.7    2.5
flipAll           Problem007   4.5    2.1
thirdStep         Problem007   4.3    2.6
initialAtkinSieve Problem007   2.4    3.3
aSieve1           Problem007   1.0    0.2
    
 (...)
 

1 second!!!!! And actually, 1.43 seconds, which is less than what it took for sundaram5.

Now, collectively, the three steps take 18.7% of the total time, so maybe this could also be an opportunity for some performance improvement. However, I think this is a good moment to stop with the improvements. Let’s just see how the best versions of the two algorithms behave with 1 million, 10 million and 100 million:

atkin1 1000000 —> 1.14s user 0.06s system 98% cpu 1.224 total
sundaram5 1000000 —> 1.67s user 0.11s system 98% cpu 1.806 total

atkin1 10000000 —> 31.68s user 0.97s system 99% cpu 32.778 total
sundaram5 10000000 —> 26.02s user 1.36s system 99% cpu 27.458 total

I tried both with 100 million and it still hadn’t finished after 42 seconds, so maybe we can leave that for future work. :-)

Future work

Stay tuned for version 2 of this post. :-)