Ilya Zakharevich on Thu, 12 Sep 2024 15:04:47 +0200


Re: [pre-announcement of a PATCH 2.16.1-alpha] 90x speedup of forprime() above 2^64


  All the timings below are on a CPU at the low end of the mid tier: 13th Gen Intel(R) Core(TM) i5-13500T.

SUMMARY: sieving turns out to be much quicker than (even) I thought. 

On Thu, Aug 29, 2024 at 11:13:57PM -0700, Ilya Zakharevich wrote:
> With this patch, I can test where is the REAL crossover (since,
> INDEED, eventually nextprime() IS going to become quicker!).

Well, this turned out to be very short-sighted to write!

>   ANSWER: on my computer, I can allocate about 12GB for a prime table,
>   	  and (with primediff) this supports prime table up to 320
> 	  billions, so one can sieve up to 10^23.
> 
> 	  The current implementation is not fully optimized, but still
> 	  it performs about 18 times quicker than the stock forprime()
> 	  even at 10^23.  (With 128-bit numbers closer to 2^64 it
> 	  performs up to >90 times quicker!)
> 
> Moreover, the dependence of speed on the size of numbers “follows a
> simple law”.  Extrapolating this law suggests that the crossover¹⁾
> happens about 0.9·10^26.

>   CONCLUSION: it makes sense to support prime tables up to 10^13.
> 
>     ¹⁽ This is the crossover with the CURRENT IMPLEMENTATION of
>        forprime()-via-nextprime().  As I explained earlier
>          http://megrez.math.u-bordeaux.fr/archives/pari-dev-2401/msg00049.html
>        I expect that this may be sped up about 5 times (see the line
>        in the middle of the attached image).
> 
>        If such an optimization is possible, the crossover happens at ∼2.5·10^24.

OOPS!  Thinking about this more: when investigating these questions, I
did not try²⁾ to inspect how the speed of sieving depends on the size
of the sieving arena!

  ²⁾ My logic must have been that the larger the arena (which is
     forced to be larger than the L3 cache due to the slowness of
     computing remainders in the loop), the slower the memory access
     would be.  WRONG!
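
To make the amortization point concrete, here is a minimal Python
sketch (an illustration only, not the PARI code; `primes_upto`,
`sieve_arena`, and the flat list of small primes are my own choices) of
sieving one arena [start, start+width) against a table of small primes.
The only remainder involving the large `start` is taken once per small
prime per arena; in the real >2⁶⁴ setting that remainder is the
expensive step, so a larger arena amortizes it over more candidates.

def primes_upto(n):
    """Plain Eratosthenes sieve for the 'small' primes up to n."""
    flags = bytearray([1]) * (n + 1)
    flags[:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if flags[p]:
            flags[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return [i for i in range(2, n + 1) if flags[i]]

def sieve_arena(start, width, small_primes):
    """Primes in [start, start + width), given all primes up to sqrt(start + width)."""
    is_candidate = bytearray([1]) * width
    for p in small_primes:
        # First multiple of p inside the arena, but never below p*p so
        # that p itself is not crossed out.  The division below is the
        # only place a remainder of the large `start` is needed.
        first = max(p * p, (start + p - 1) // p * p) - start
        for i in range(first, width, p):
            is_candidate[i] = 0
    return [start + i for i in range(width) if is_candidate[i] and start + i > 1]

if __name__ == "__main__":
    start, width = 10 ** 12, 1000
    small = primes_upto(int((start + width) ** 0.5) + 1)
    print(sieve_arena(start, width, small)[:5])   # first few primes above 10^12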

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ KIND OF SHORT SUMMARY (“theoretical”)

In a “theoretical” situation, where the available memory is unbounded
(but we assume that memory access performance is not hindered by
this!), sieving is always going to be WAY quicker than nextprime():
when running through a region of numbers near N, the time per 10^9
numbers (“a giganumber”) is:

  the measured timing for forprime␣via␣nextprime() is >400 sec/giganumber;
  it is about
     0.025 * log10(N)^2.5 + 310  sec/giganumber  (N ≫ 10^19);

  the measured³⁾ timing for forprime␣via␣sieving() is about
     0.19 * log10(N)^1.6  + 2    sec/giganumber  (10^19 ≤ N ≤ 10^23)
  (with very raw optimizations only).  This means⁴⁾ 5.4 … 6.6
  sec/giganumber in the range above.

     Note the smaller power and the smaller constant term (relevant since log10(N) ≫ 10).

        ³⁾ For 10^23 a bit of extrapolation in the arena size was needed.
	
        ⁴⁾ … not counting sieving the initial run of primes up to √N.
           This takes a bit more than 1 sec/giganumber (since these
           primes are much smaller!).  (About 6min for N ∼ 10²³.)

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ VERY Practical flavor I (with 10GB for the sieving arena)

This way one can sieve via chunks of 160 giganumbers (see the
arithmetic check below).  This gives an extra slowdown, but it is only
about 1 sec/giganumber at ∼4·10²³.
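
Where the 160 comes from, assuming (this reading is mine) that the
α/16-bytes convention from the “main term” section below means 16
numbers per arena byte:

# Chunk size implied by a 10GB sieving arena.  The 16-numbers-per-byte
# figure is the alpha/16-bytes convention from later in this message;
# reading it as "one bit per odd candidate" is my assumption.
ARENA_BYTES = 10 * 10 ** 9          # 10 GB, decimal
NUMBERS_PER_BYTE = 16
print(ARENA_BYTES * NUMBERS_PER_BYTE / 10 ** 9, "giganumbers")  # -> 160.0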

This is going to be quicker than forprime␣via␣nextprime up to the
crossover at ∼5·10²⁹ (OR⁵⁾ at ∼1.6·10²⁷).

  ⁵⁾ This second number assumes a significant rewriting of the
     algorithm for forprime␣via␣nextprime (which I explained
     elsewhere).  IN ALL THESE COMPARISONS WE ASSUME that there is
     enough memory for this and for the table of primes up to √N
     (requiring about 2√N/log N bytes with diffprimes).
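
As a quick sanity check of the 2√N/log N estimate (the helper name
below is mine; the two N values are the ones used in the 10%- and
20%-arena sections further down):

# Approximate size of the diffprimes table of the primes up to sqrt(N):
# one byte per prime gap, and pi(sqrt(N)) ~ 2*sqrt(N)/log(N) primes.
import math

def diffprimes_bytes(N):
    return 2 * math.sqrt(N) / math.log(N)

for N in (1e25, 9e25):
    print(f"N = {N:.0e}: ~{diffprimes_bytes(N) / 1e9:.0f} GB of diffprimes")
# -> about 110 GB and about 317 GB, i.e. the 110GB / 320GB figures used below.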

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Same with 60GB for the sieving arena

The slowdown due to the finite arena (1 teranumber) is about
1 sec/giganumber at ∼9·10²⁴.  The crossover with
forprime␣via␣nextprime is at ∼1.3·10³¹ (OR⁵⁾ at ∼4·10²⁸).

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Same with a sieving arena which is 10% of the size of the diffprimes table

(The slowdown is always >1 sec/giganumber.)  The slowdown stays <4 sec/GN
(speed 11.3 sec/GN) up to 10²⁵.  (This takes 110GB for diffprimes and
11GB for the arena.)

The crossover with forprime␣via␣nextprime happens at ∼7·10⁷¹ (OR⁵⁾ at
∼2·10⁴⁵).  (Caveat: this is with an unphysical amount of memory!)

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Same with a sieving arena which is 20% of the size of the diffprimes table

Slowdown is 1 sec/giganumber already at 4·10¹⁹.

The slowdown is 2.6 sec/GN (speed 10 sec/GN) at 9·10²⁵ (this takes
320GB for diffprimes and 64GB for the arena).

The crossover with forprime␣via␣nextprime happens at ∼10⁷⁸ (OR⁵⁾ 10⁵¹)
(with the same caveat).

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Same with a sieving arena which is 100% of the size of the diffprimes table

The slowdown is 1 sec/giganumber at ∼10²⁸ (this requires 0.1 petabytes
for the diffprimes table!).  Hence, for all practical purposes, the
slowdown is negligible.

The crossover with forprime␣via␣nextprime happens at ∼3·10⁹¹ (OR⁵⁾ 10⁶⁴)
(with the same caveat).

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ The main term of the slowdown

With an arena of size α (requiring α/16 bytes of memory) the slowdown in
sec/GN is ∼ 0.04 · α^(-0.78) · N^(0.44).  An alternative way to describe this:

  In the range 10²³ … 10⁴⁰ the overhead of small arenas reaches 1 sec/GN
  at the arena size α = 0.01 · N^(0.56) ≈ 0.2 · √N · (N/5e21)^(1/17).
  For smaller arenas the overhead scales as const · α^(-0.78), i.e. it
  grows as the arena shrinks.
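
To tie this back to the 10GB and 60GB sections above, here is a small
check of the 1 sec/GN rule (the constants are copied from the fit
above; the function name and the decimal-gigabyte reading are my
assumptions, and only rough agreement should be expected, since the
exponents are rounded):

# Check of alpha = 0.01 * N^0.56 (arena size, in numbers, at which the
# slowdown is ~1 sec/giganumber) against the two "practical" sections
# above.  One arena byte covers 16 numbers, as stated in this section.

def arena_for_1sec(N):
    return 0.01 * N ** 0.56

GB = 10 ** 9                      # decimal gigabytes, as in the text
for ram_gb, N in ((10, 4e23), (60, 9e24)):
    alpha = ram_gb * GB * 16      # numbers covered by that much arena
    print(f"{ram_gb} GB arena: alpha = {alpha:.1e};"
          f" 1 sec/GN predicted at alpha = {arena_for_1sec(N):.1e} (N = {N:.0e})")
# -> alpha = 1.6e+11 vs predicted 1.6e+11 at N = 4e23, and
#    9.6e+11 vs 9.4e+11 at N = 9e24: the fit reproduces the ~1 sec/GN
#    figures quoted in the two sections above.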

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ YET ANOTHER CONCLUSION

For primes above 2⁶⁴, sieving gives speeds up to 2 orders of magnitude
better than forprime␣via␣nextprime, and (if one believes my
extrapolations) the order(s)-of-magnitude advantage may be preserved at
least up to 10³⁰, given the (possibly insane) amount of memory needed
to store the table of “small” primes, i.e. those up to √N.

Hope this helps,
Ilya