| Ilya Zakharevich on Fri, 30 Aug 2024 07:36:18 +0200 |
| Re: [PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64 (try III) |
⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ REMARK ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜
The main body of this message was already sent to this list, but
all previous attempts were lost.
The differences w.r.t. the first version:
• it gives a link to a conjectured 5x speedup of forprime-via-nextprime;
• it emphasizes that the added “<<4” corrects a bug.
⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜
On Sun, Aug 18, 2024 at 06:17:51AM -0700, Ilya Zakharevich wrote:
> ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ I will comment on this patch in a separate email.
> timeit(s,f) = my(t=gettime(), o=f()); if(#o, o=Str("\t",o)); printf("%s%7.3f%s\n",s,gettime()/1000,o);
> my(c, L=2*10^9); for(k=8,19, c=0; timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c)))
>
> one can get the following timings¹⁾ ²⁾ (of sieving an interval of length
> 2 billion starting at the given numbers) in seconds:
>
> 2.16-2-beta | patched and -p 3.2G | found
> default | -p 370M | no␣options | ar=900K | ar=24M | primes
> —————————————————————————————————————————————————————————————————————————
> 10^8 8.045 7.768 | 5.200 4.885 5.016 | 97125071
> 10^9 7.828 7.826 | 5.005 4.744 4.828 | 93602003
> 10^10 7.503 7.494 | 6.902 6.668 8.150 | 86503340
> 10^11 7.281 7.306 | 6.513 6.291 7.979 | 78934825
> 10^12 7.915 7.917 | 6.219 5.991 7.812 | 72381054
> 10^13 140.761 9.682 | 6.058 5.804 7.713 | 66815381
> 10^14 148.895 13.751 | 6.244 5.728 7.633 | 62036118
> 10^15 157.631 24.830 | 7.093 6.344 7.587 | 57901748
> 10^16 163.787 56.966 | 8.792 7.498 7.588 | 54290341
> 10^17 170.536 151.720 | 13.064 10.115 7.672 | 51100445
> 10^18 177.302 177.388 | 26.452 17.955 8.401 | 48254877
> 10^19 185.912 185.914 | 79.479 48.287 10.218 | 45708757
• This is part of a large data-collection effort, done separately for
  various ranges of the arena size “ar”, e.g.
forstep(d=20,40,5,my(s=(1+2001\/d)*10^6/16);set_optimize(5,s);print("\n\n\t",s);my(c,L=2*10^9);for(k=8,19,c=0;timeit(Strprintf("10^%d\t",k),(()->forprime(p=10^k,10^k+L,c++);c))))
• Up to 10^12, the speed does not depend much on the arena size — even
  down to 32K arenas.  (It seems that the processor’s predictive
  auto-prefetching works fine!)
• The speed is quite consistent if one does not change “the
  environment”.  However, relocations of the PARI stack can lead to
  variations as large as 15%.
• The arenas are allocated on the PARI stack.  So with larger
  arenas, one needs to increase the stack (I do not know why, but it
  seems that an extra 25% of overhead w.r.t. “ar” is required; so for a
  24M arena I need
      allocatemem(30*10^6)
  ).
> Here “default” means stock 2.16.2-beta without any argument. In the
> next column, -p 370000000 is the optimized value for the number of
> precalculated primes used in sieving. — This is the break-even point
  (also known as the “crossover point”)
> between the two possible algorithms used by forprime() with the stock
> (buggy) code of forprime(). (I did discuss it in my message of a
> couple of days ago.)
> The opposite happens with -p 3200000000: then all the rows in the
> table above correspond to the sieving method used by forprime().
The two methods used by PARI are “sieve-by-a-chunk-of-the-given-range” and
“slightly optimized²⁾ nextprime()”.
²⁾ This means pre-sieving using primes ≤ 7 (i.e., mod 210).  Earlier
   this year I wrote that (conjecturally) pre-sieving using primes up
   to 10⁶ would lead³⁾ to a 5x speedup:
   http://megrez.math.u-bordeaux.fr/archives/pari-dev-2401/msg00049.html
   A C sketch of the mod-210 idea follows these footnotes.
³⁾ Such a speedup would bring the crossover point down 25x.
   Still, sieving is going to be ∼3.5 times quicker above 2^64,
   and the advantage would persist up to 2^67.
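
To make the mod-210 remark in footnote ²⁾ concrete, here is a minimal,
self-contained C sketch of the idea (my illustration, not PARI’s code and
not part of the patch): among odd numbers, divisibility by 3, 5 and 7
repeats with period 105 slots (= 210/2), so a sieve window can be
initialized by copying a precomputed pattern instead of crossing off the
multiples of 3, 5 and 7 one by one.

    #include <stdio.h>
    #include <string.h>

    /* Among odd numbers, divisibility by 3, 5, 7 depends only on the
     * residue mod 210, i.e. it is periodic with period 105 odd slots.
     * Build that pattern once, then initialize each window by copying it. */
    #define WHEEL 105
    static unsigned char pattern[WHEEL];   /* 1 = coprime to 3*5*7, 0 = not */

    static void init_pattern(void)
    {
      int i;
      for (i = 0; i < WHEEL; i++) {
        unsigned long n = 2UL*i + 1;       /* slot i stands for N = n (mod 210) */
        pattern[i] = (n % 3 && n % 5 && n % 7);
      }
    }

    /* flags[i] describes the odd number start + 2*i (start must be odd);
     * afterwards flags[i] = 1 iff that number is coprime to 3, 5 and 7. */
    static void presieve_window(unsigned char *flags, unsigned long start,
                                unsigned long len)
    {
      unsigned long done = 0, phase = (start % 210 - 1) / 2;
      while (done < len) {
        unsigned long take = WHEEL - phase;
        if (take > len - done) take = len - done;
        memcpy(flags + done, pattern + phase, take);
        done += take; phase = 0;
      }
    }

    int main(void)
    {
      unsigned char flags[40];
      unsigned long start = 1000000000001UL, i;   /* 10^12 + 1, odd */
      init_pattern();
      presieve_window(flags, start, 40);
      for (i = 0; i < 40; i++)
        if (flags[i]) printf("%lu survives the mod-210 pre-sieve\n", start + 2*i);
      return 0;
    }

This is only meant to fix the idea; the actual kernel code is, of course,
organized differently.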
> (It is a draft only.)
This means:
• The “#if 1” chunk should go into the assembler part of the kernel.
• The printf()s below are commented out.  In the final version, they
  should be removed.
> --- src/language/forprime.c-ini 2024-07-09 15:42:01.000000000 -0700
> +++ src/language/forprime.c 2024-08-18 02:27:53.952923105 -0700
…………
> +#if 1
> +#define rem_half(a,b) \
> +__extension__ ({ ulong __arg1 = (a); unsigned int __value, __hiremainder, __arg2 = (b); \
> + __asm__ ("divl %k4" \
> + : "=a" /* %eax */ (__value), "=d" /* %edx */ (__hiremainder) \
> + : "ka" /* %eax */ (__arg1), "kd" /* %edx */ (__arg1>>32), "kmr" (__arg2)); \
> + __hiremainder; \
> +}) /* call only if the return must fit into 32 bits!!! */
> +# define HAVE_rem_half
> +#endif /* !( 1 ) */
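
For readers who do not have x86 in their head: the macro computes the
remainder of a 64-bit value divided by a 32-bit value with a single divl,
whose dividend is edx:eax.  A portable equivalent (my sketch, not part of
the patch) is just a 64-by-32 remainder; the asm version additionally
requires the quotient to fit into 32 bits, otherwise divl raises a divide
error.

    /* Portable stand-in for the rem_half() macro above (a sketch of mine,
     * not part of the patch).  The divl-based version has the extra
     * precondition that the quotient a/b fits into 32 bits; otherwise
     * the instruction faults. */
    #include <assert.h>

    typedef unsigned long ulong;    /* 64-bit on the targets addressed here */

    static unsigned int rem_half_c(ulong a, unsigned int b)
    {
      assert(b != 0 && a / b <= 0xFFFFFFFFUL);  /* mirror the asm precondition */
      return (unsigned int)(a % b);
    }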
> /* s is odd; prime (starting from 3 = known_primes[2]), terminated by a 0 byte.
> * Checks n odd numbers starting at 'start', setting bytes to 0 (composite)
> * or 1 (prime), starting at data */
This (not mine!) comment is completely bonkers, and so are many comments
in the body of the related functions.  They say “bytes” both where words
and where bits are meant, they mention primediff, and the meanings of the
byte values 0 and 1 are inverted.
Moreover, it seems that some code assumes that the prime 3 is at offset 2,
and some that it is at offset 3.
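
To pin down the vocabulary such comments should be using, here is a toy
window sieve (a sketch of mine, not PARI’s code) with the conventions
stated explicitly: one byte per odd candidate, 1 = still possibly prime,
0 = crossed off, and all sieving primes assumed smaller than the start of
the window.

    #include <string.h>

    /* Toy sieve of one window of odd numbers (an illustration, not PARI's
     * code).  flags[i] is one BYTE for the odd number start + 2*i;
     * 1 means "not yet crossed off", 0 means "composite".  'primes' is a
     * 0-terminated list of odd primes, all assumed to be < start. */
    static void sieve_window(unsigned char *flags, unsigned long n,
                             unsigned long start, const unsigned long *primes)
    {
      memset(flags, 1, n);                      /* everything starts as 1 */
      for (; *primes; primes++) {
        unsigned long p = *primes, i;
        unsigned long m = start % p ? start + (p - start % p) : start;
                                                /* first multiple of p >= start */
        if (!(m & 1)) m += p;                   /* make it odd (p is odd) */
        for (i = (m - start) / 2; i < n; i += p)
          flags[i] = 0;                         /* odd multiples are p slots apart */
      }
    }

Whether the real code uses bytes, words, or bits per candidate is exactly
what those comments fail to state consistently.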
> @@ -493,7 +530,7 @@ optimize_chunk(ulong a, ulong b)
> {
> /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large
> * as to force recalculating too often). */
> - ulong chunk = 0x80000UL;
> + ulong chunk = (cache_model.bigarena ? cache_model.bigarena : 0x80000UL)<<4; /* bigarena is in bytes, we use bits, and only odds */
> ulong tmp = (b - a) / chunk + 1;
>
> if (tmp == 1)
It would be better to reimplement this as a default value of 0x800000UL
for “bigarena”.
!!! Moreover: PAY ATTENTION to the added “<<4”.  THIS is the bug fix that
!!! accounts for the major part of the speedup of the patch I discuss.
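
To spell the factor of 16 out (my note, not part of the patch): per the
comment added in the hunk, bigarena is a size in bytes, and the sieve uses
one bit per odd number, so B bytes give 8*B bits, i.e. an interval of
16*B = B<<4 integers.

    #include <stdio.h>

    /* The factor 16 behind "<<4": bytes -> bits (x8), then one bit per
     * odd number only (x2).  With the old hard-coded 0x80000UL this
     * reproduces the chunk size the patched line computes. */
    int main(void)
    {
      unsigned long bigarena = 0x80000UL;       /* bytes (the old default)   */
      unsigned long bits     = bigarena << 3;   /* 8 bits per byte           */
      unsigned long interval = bits << 1;       /* each bit = one odd number */
      printf("%lu bytes -> interval of %lu integers (= 0x%lx = bigarena<<4)\n",
             bigarena, interval, interval);
      return 0;
    }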
Hope this helps,
Ilya