Happy π approximation day/night (in assembly) !


//   pi_x64.s - calculates Pi using the Leibniz formula.
//              Each iteration prints a closer approximation to 50 digits.
//              This is not an optimal implementation and it runs forever.
//
//  x86-64/SSE3 with for Linux, Intel, gnu assembler, gcc
//
//  assemble: as pi_x64.s -o pi_x64.o
//  link:     gcc -o pi_x64 pi_x64.o
//  run:      ./pi_x64
//  output: 3.14159264858204423376264458056539297103881835937500
//          3.14159265108366625440794450696557760238647460937500
//          3.14159265191852199450295302085578441619873046875000
//          3.14159265233600137889879988506436347961425781250000
//          .... and on forever ...

.section .data
        .align 16
denom:  .double  1.0, 3.0
numer:  .double  4.0, -4.0
add4:   .double  4.0, 4.0
zero:   .double  0.0, 0.0
msg:    .string  "%1.50fn"

.section .text
        .globl main
        .type main, @function
        .align 64
main:
        pushq   %rbp
        movq    %rsp, %rbp

        movdqa  (numer), %xmm2
        movdqa  (denom), %xmm6
        movdqa  (add4), %xmm3
        movdqa  %xmm2, %xmm4
        movdqa  (zero), %xmm5
        movq    $100000000, %r12

loop:
        divpd  %xmm6, %xmm2
        addpd  %xmm2, %xmm5
        movdqa %xmm4, %xmm2
        addpd  %xmm3, %xmm6

        subq $1, %r12
        jnz loop

        movq   $100000000, %r12
        movdqa %xmm5, %xmm0
        movdqa %xmm6, %xmm1
        haddpd %xmm0, %xmm0

        movq  $1, %rax
        movq $msg, %rdi
        call printf

        movdqa (add4), %xmm3

        jmp loop

        movq $0, %rax
        popq %rbp
        ret

Generating π in Haskell

Haskell beats CL quite comfortably using the same algorithm :

module Main( main ) where

import System( getArgs )


arccot :: Integer -> Integer -> Integer
arccot x unity =
    arccot' x unity 0 start 1 1
      where start = unity `div` x
            arccot' x unity sum xpower n sign | xpower `div` n == 0 = sum
                                              | otherwise           =
                arccot' x unity (sum + sign*term) (xpower `div` (x*x)) (n+2) (-sign)
                  where term = xpower `div` n

machin_pi :: Integer -> Integer
machin_pi digits = 
    pi' `div` (10 ^ 10)
      where unity = 10 ^ (digits+10)
            pi' = 4 * (4 * arccot 5 unity - arccot 239 unity)

main :: IO ()
main = do
    args <- getArgs
    putStrLn (show (machin_pi (read (head args) :: Integer)))

The first 10000 digits.

> time ./machin_pi 10000

31415926535897932384626433832795028841971693993751058209749445923078164062862089 ...

real    0m0.381s
user    0m0.290s
sys     0m0.061s




Saintly Man? – That Mitchell & Webb Look – BBC Two

Generating π in CL (faster)

Thanks to metacircular for pointing out that (floor (/ x y)) can be written as (floor x y) while avoiding
the intermediate rational.

(defun machin-pi (digits)
  "Calculates PI digits using fixed point arithmetic and Machin's formula with double recursion"
  (labels 
      ((arccot-minus (xsq n xpower)
         (let ((term (floor xpower n)))
           (if (= term 0)
               0
               (- (arccot-plus xsq (+ n 2) (floor xpower xsq))
                  term))))       
       (arccot-plus (xsq n xpower)
         (let ((term (floor (/ xpower n))))
           (if (= term 0)
               0
               (+ (arccot-minus xsq (+ n 2) (floor xpower xsq))
                  term))))
       (arccot (x unity)
         (let ((xpower (floor (/ unity x))))
           (arccot-plus (* x x) 1 xpower))))
    
    (let* ((unity (expt 10 (+ digits 10)))
           (thispi (* 4 (- (* 4 (arccot 5 unity)) (arccot 239 unity)))))
      (floor thispi (expt 10 10)))))

The first 10000 digits again.

* (time (machin-pi 10000))

Evaluation took:
  0.662 seconds of real time
  0.634038 seconds of total run time (0.495454 user, 0.138584 system)
  [ Run times consist of 0.233 seconds GC time, and 0.402 seconds non-GC time. ]
  95.77% CPU
  1,491,387,858 processor cycles
  109,530,592 bytes consed
  
31415926535897932384626433832795028841971693993751058209749445923078164062862089 ...

Algorithmic optimizations would take us much further. For example the Gauss–Legendre or Salamin–Brent formula.
Then there is the fastest known(at the turn of the millenium), Chudnovsky’s formula :

Generating π in CL

Update 2009-07-23 : Faster version in CL and a Haskell version.
——————————————————————————–

A trivial approximation using the Leibniz formula.

(defun leibniz-pi()
  (labels ((local-pi(sum n)
             (if (&lt; n (expt 10 6))
                 (local-pi
                  (+ sum (/ (expt -1 n) (+ (* 2 n) 1)))
                  (+ n 1))
                 sum)))
    (* 4 (local-pi 0f0 0f0))))

And here’s a longer version(but faster and more precise) using Machin’s formula with fixed point arithmetic to x digits.

(defun machin-pi (digits)
  "Calculates PI digits using fixed point arithmetic and Machin's formula with double recursion"
  (labels
      ((arccot-minus (xsq n xpower)
         (let ((term (floor (/ xpower n))))
           (if (= term 0)
               0
               (- (arccot-plus xsq (+ n 2) (floor (/ xpower xsq)))
                  term))))
       (arccot-plus (xsq n xpower)
         (let ((term (floor (/ xpower n))))
           (if (= term 0)
               0
               (+ (arccot-minus xsq (+ n 2) (floor (/ xpower xsq)))
                  term))))
       (arccot (x unity)
         (let ((xpower (floor (/ unity x))))
           (arccot-plus (* x x) 1 xpower))))

    (let* ((unity (expt 10 (+ digits 10)))
           (thispi (* 4 (- (* 4 (arccot 5 unity)) (arccot 239 unity)))))
      (floor (/ thispi (expt 10 10))))))

The first 10000 digits.

* (time (pidigits 10000))

Evaluation took:
  2.496 seconds of real time
  1.149998 seconds of total run time (0.974647 user, 0.175351 system)
  [ Run times consist of 0.325 seconds GC time, and 0.825 seconds non-GC time. ]
  46.07% CPU
  5,626,335,988 processor cycles
  217,893,792 bytes consed

31415926535897932384626433832795028841971693993751058209749445923078164062862089 ...