Ilya Zakharevich on Thu, 12 Sep 2024 16:04:49 +0200 |
[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]
Re: [PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64 |
On Thu, Sep 12, 2024 at 06:48:57AM -0700, Ilya Zakharevich wrote: > > Could explain what rem_half does ? > > Half-width divll() — less storing the result of division. (I did not > check — maybe the “whole ½-width analogue” of divll() would be > compiled into the same assembler code — by optimizing away the whole > bruhaha with hiremainder… Just to be safe, I modified the version > from x86 assembler to return only the remainder.) Forgot to add: on several processors at hand it gives up to 1.8 times speedup comparing to (ulong x) / (ulong y). > > If I understand correctly, > > you are suggesting to change is optimal_chunk: > > ulong chunk = 0x80000UL; > > to > > ulong chunk = 0x800000UL; > > > > Is it correct ? > > No. The main bug was in “what are the units of the ‘chunk’ variable”. > The code in one place thinks that it counts “numbers-to-sieve”, in the > other that it counts “bytes-in-the-arena”. (These differ 16 times, which Got carried away and forgot to focus on the content — sorry! OK, so there are • possible up-to-1.8 speedup due to 64:32 division; • the corrected default arena allocation. (The effects of these are in the table in my original post: compare two left columns with the “no␣options” column on the right.) • Customizable arena size. To see effects of this, compare the “no␣options” column with two columns on the right of it. Hope this helps, Ilya