| Ilya Zakharevich on Thu, 12 Sep 2024 15:49:01 +0200 |
[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]
| Re: [PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64 |
On Sun, Sep 08, 2024 at 03:24:53PM +0200, Bill Allombert wrote:
> On Sun, Aug 18, 2024 at 06:17:50AM -0700, Ilya Zakharevich wrote:
> > On a contemporary low-midlevel CPU (13th Gen Intel(R) Core(TM) i5-13500T), with
> >
> > timeit(s,f) = my(t=gettime(), o=f()); if(#o, o=Str("\t",o)); printf("%s%7.3f%s\n",s,gettime()/1000,o);
> > my(c, L=2*10^9); for(k=8,19, c=0; timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c)))
> >
> > SUMMARY: with this patch, the overhead of forprime in the range
> > 2^32..2^64 is a constant overhead of “1.5 increments (as in c++)” per
> > a found prime, plus the overhead of “√(N/10^19) increments” for
> > sieving near N.
>
> Could explain what rem_half does ?
Half-width divll() — less storing the result of division. (I did not
check — maybe the “whole ½-width analogue” of divll() would be
compiled into the same assembler code — by optimizing away the whole
bruhaha with hiremainder… Just to be safe, I modified the version
from x86 assembler to return only the remainder.)
> If I understand correctly,
> you are suggesting to change is optimal_chunk:
> ulong chunk = 0x80000UL;
> to
> ulong chunk = 0x800000UL;
>
> Is it correct ?
No. The main bug was in “what are the units of the ‘chunk’ variable”.
The code in one place thinks that it counts “numbers-to-sieve”, in the
other that it counts “bytes-in-the-arena”. (These differ 16 times, which
led to ∼ order of magnitude slowdown. — Which in turn led
to — ridiculous — claims about no-need-for-longer-prime-tables… And
nobody thought that “extraordinary claims require extraordinary proofs
;―) .)
Yours,
Ilya