Peter-Lawrence . Montgomery on Wed, 4 Apr 2001 23:02:28 +0200 (MET DST)


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: parallelization


> From: "Mills, D. Dr.              MATH" <ad3943@exmail.usma.army.mil>

> I'm using PARI on a supercomputer, and am curious to know any programming
> "tricks" or "tips" to aid me in efficiently parallelizing my programs. If it
> helps, I am performing a search algorithm on sets of polynomials over finite
> fields in order to see which ones satisfy certain properties, and the search
> is being performed on sets of coefficients (a1,a2,a3,...,ak) via a for loop.
> I am also running programs which factor sets of numbers, none of which are
> too large (no larger than 10^12 at the moment), but since I'm looking at
> sets of such numbers the runtime is of course slow, and I'm hoping to speed
> things up.

      As long as your numbers are around 10^12, the trial division
algorithm needs a table of primes up to about 10^6.
There are fewer than 80000 primes below 10^6.  You can afford
a table with these primes and any convenient information about them.

      If you have 64-bit integer multiply, then 
the data structure might have

          p        (32-bit integer)
          inv64    (2-adic inverse, 1/p mod 2^64)
          qbound   (FLOOR((2^64 - 1)/p)

(one four-byte field, two 8-byte fields, per odd prime)

          uint64 q = n*inv64;  /* mod 2^64 -- exact quotient if p divides n */
          if (q <= qbound) then
               p divides n
          else
               p does not divide n
          end if

(multiply, compare, branch, plus memory references and loop overhead).
The proof notes that the computed q satisfies p*q == n (mod 2^64).
The equality p*q = n will hold precisely when p*q <= 2^64 - 1.
This is equivalent to q <= (2^64 - 1)/p.

    Since the main loop needs only inv64 and qbound, not p,
you might store the two 8-byte entities in one array and
p in another, to improve cache utilization.

    As long as n remains around 10^12, you might test (q < n) instead
of (q <= qbound).  This will reduce memory references.
The test (q < n) will pass only 10^12 times in 2^64, about one in 10^7.
You can afford the overhead to check for false hits.

     If you lack 64-bit integer multiplication, you might 
approximate the quotient by floating point arithmetic
(storing the reciprocals of the primes), and
round this to an integer (avoiding integer overflow
on 32-bit machines when the quotient > 2^31), and
check whether your approximate quotient is close to that
integer.  This is several more instructions.

     SQUFOF is good for factoring integers below 10^18,
after doing trial division up to the cube root
(so only two factors remain).
It needs arithmetic only on integers up to 2*sqrt(n)
(whereas Pollard Rho has intermediate results near n^2).
In the rare case where SQUFOF fails, resort to trial division.

      Good luck!