Ilya Zakharevich on Thu, 12 Sep 2024 16:04:49 +0200


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: [PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64


On Thu, Sep 12, 2024 at 06:48:57AM -0700, Ilya Zakharevich wrote:
> > Could explain what rem_half does ?
> 
> Half-width divll() — less storing the result of division.  (I did not
> check — maybe the “whole ½-width analogue” of divll() would be
> compiled into the same assembler code — by optimizing away the whole
> bruhaha with hiremainder…  Just to be safe, I modified the version
> from x86 assembler to return only the remainder.)

Forgot to add: on several processors at hand it gives up to 1.8 times
speedup comparing to (ulong x) / (ulong y).

> > If I understand correctly, 
> > you are suggesting to change is optimal_chunk:
> > ulong chunk = 0x80000UL;
> > to
> > ulong chunk = 0x800000UL;
> > 
> > Is it correct ?
> 
> No.  The main bug was in “what are the units of the ‘chunk’ variable”.
> The code in one place thinks that it counts “numbers-to-sieve”, in the
> other that it counts “bytes-in-the-arena”.  (These differ 16 times, which

Got carried away and forgot to focus on the content — sorry!

OK, so there are

  • possible up-to-1.8 speedup due to 64:32 division;
  • the corrected default arena allocation.

(The effects of these are in the table in my original post: compare
two left columns with the “no␣options” column on the right.)

  • Customizable arena size.

To see effects of this, compare the “no␣options” column with two
columns on the right of it.

Hope this helps,
Ilya