Ilya Zakharevich on Thu, 12 Sep 2024 15:49:01 +0200


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: [PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64


On Sun, Sep 08, 2024 at 03:24:53PM +0200, Bill Allombert wrote:
> On Sun, Aug 18, 2024 at 06:17:50AM -0700, Ilya Zakharevich wrote:
> > On a contemporary low-midlevel CPU (13th Gen Intel(R) Core(TM) i5-13500T), with
> > 
> >     timeit(s,f) = my(t=gettime(), o=f()); if(#o, o=Str("\t",o)); printf("%s%7.3f%s\n",s,gettime()/1000,o);
> >     my(c, L=2*10^9); for(k=8,19, c=0; timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c)))
> > 
> > SUMMARY: with this patch, the overhead of forprime in the range
> > 2^32..2^64 is a constant overhead of “1.5 increments (as in c++)” per
> > a found prime, plus the overhead of “√(N/10^19) increments” for
> > sieving near N.
> 
> Could explain what rem_half does ?

Half-width divll() — less storing the result of division.  (I did not
check — maybe the “whole ½-width analogue” of divll() would be
compiled into the same assembler code — by optimizing away the whole
bruhaha with hiremainder…  Just to be safe, I modified the version
from x86 assembler to return only the remainder.)

> If I understand correctly, 
> you are suggesting to change is optimal_chunk:
> ulong chunk = 0x80000UL;
> to
> ulong chunk = 0x800000UL;
> 
> Is it correct ?

No.  The main bug was in “what are the units of the ‘chunk’ variable”.
The code in one place thinks that it counts “numbers-to-sieve”, in the
other that it counts “bytes-in-the-arena”.  (These differ 16 times, which
led to ∼ order of magnitude slowdown. — Which in turn led
to — ridiculous — claims about no-need-for-longer-prime-tables…  And
nobody thought that “extraordinary claims require extraordinary proofs
;―) .) 

Yours,
Ilya