Ilya Zakharevich on Sun, 18 Aug 2024 15:17:57 +0200



[PATCH 2.16-2-beta]: about 20x speedup of forprime() in 10^12 .. 2^64


On a contemporary low-midlevel CPU (13th Gen Intel(R) Core(TM) i5-13500T), with

    timeit(s,f) = my(t=gettime(), o=f()); if(#o, o=Str("\t",o)); printf("%s%7.3f%s\n",s,gettime()/1000,o);
    my(c, L=2*10^9); for(k=8,19, c=0; timeit(Strprintf("10^%d\t",k), (()->forprime(p=10^k,10^k+L,c++);c)))

one can get the following timings¹⁾ ²⁾ (of sieving an interval of length 2
billion starting at the given numbers), in seconds:

            2.16-2-beta    |       patched and -p 3.2G       |   found
    	default  | -p 370M | no␣options  | ar=900K  | ar=24M |    primes
—————————————————————————————————————————————————————————————————————————
10^8      8.045      7.768 |      5.200      4.885     5.016 |  97125071
10^9      7.828      7.826 |      5.005      4.744     4.828 |  93602003
10^10     7.503      7.494 |      6.902      6.668     8.150 |  86503340
10^11     7.281      7.306 |      6.513      6.291     7.979 |  78934825
10^12     7.915      7.917 |      6.219      5.991     7.812 |  72381054
10^13   140.761      9.682 |      6.058      5.804     7.713 |  66815381
10^14   148.895     13.751 |      6.244      5.728     7.633 |  62036118
10^15   157.631     24.830 |      7.093      6.344     7.587 |  57901748
10^16   163.787     56.966 |      8.792      7.498     7.588 |  54290341
10^17   170.536    151.720 |     13.064     10.115     7.672 |  51100445
10^18   177.302    177.388 |     26.452     17.955     8.401 |  48254877
10^19   185.912    185.914 |     79.479     48.287    10.218 |  45708757

  ¹⁾ The overhead of
     	   my(c); for(k=1,66000000,c++)
     is 3.9 sec.  Throwing in another “c++” increases it to 6.9 sec.
     In other words: the timings at the start of the table are
     “essentially 0”, and the timings below 7 sec should be
     characterized as “practically free”.
  ²⁾ Currently, the patch handles forprime() only in the range 2^32 .. 2^64.

Here “default” means stock 2.16-2-beta without any arguments.  In the
next column, -p 370000000 is the optimized count of pre-calculated
primes used in sieving: this is the break-even point between the two
possible algorithms used by forprime() with the stock (buggy) code of
forprime().  (I discussed this in my message from a couple of days
ago.)

The opposite happens with -p 3200000000: then all the rows in the
table above correspond to the sieving method used by forprime().  (We
do not show what happens with this option and stock 2.16-2-beta: with
this parameter the sqrt-like growth observed with -p 370M up to 10^17
just continues to the end of the table.)

For the patched version we show only the results with -p 3200000000:
pre-calculating primes up to 3.2 billion.  We show the timings with
the default arena size (now 512K; it was set to 32K before due to a
bug), as well as with the arena size tuned to the size of the L2
cache³⁾ and/or the L3 cache.

  ³⁾ lscpu shows the L2 size as 1.3M per core.  I found that the
     arena works slightly better if it assumes a 900K cache.
     (Code/data cache conflicts?)

Although on this CPU the L3 cache is shared between cores, on an idle
machine this value works best (of the values above 1.3M).

With the patch, one can tune the arena size by
  install(set_optimize,lLDG);
  set_optimize(5, 24<<20);

SUMMARY: with this patch, the overhead of forprime() in the range
2^32..2^64 is a constant overhead of “1.5 increments (as in c++)” per
found prime, plus an overhead of “√(N/10^19) increments” per prime
for sieving near N.

EXTRAPOLATION: it is natural to assume that crossing the 2^64 barrier
would increase the sqrt()-like overhead about 6 times (3 times due to
the 3-step “naive remainder” for 2-word BigInts, and 2 times due to
the need to use the “honest” 64-bit⁴⁾ % processor instructions).

  ⁴⁾ Vs. the 64/32-bit remainder which the patch uses most of the
     time.

Given that near 10^22 forprime()'s overhead is about 300 “increments”,
with this patch the break-even point between the two strategies of
doing forprime() moves from 370 million to 150 billion.

CONCLUSION: there is every reason to have a framework for storing
pre-calculated prime numbers way above 2^32: this can improve the
speed of forprime() by “up to” more than an order of magnitude⁵⁾ in
the range 2^64 .. 2^70.  The only practical way I know is the
“primediffs-with-hiccups” scheme, as it was⁶⁾ actually used in PARI
for decades.

  ⁵⁾ This is the expected advantage just above 2^64.  It would
     gradually decrease due to the sqrt() overhead (the competing
     method has almost constant overhead).
  ⁶⁾ For reasons unfathomable to me, this functionality was removed
     from PARI.
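To make the “primediffs-with-hiccups” idea concrete, here is a toy
sketch of such an encoding.  This is my own illustration, not the
historical PARI table format: consecutive odd primes are stored as
half-gaps in single bytes, and the reserved value 255 is a “hiccup”
that adds 510 to the running value without ending the record, so gaps
wider than 508 remain representable:

```c
#include <assert.h>
#include <stddef.h>

/* Encode gaps between consecutive odd primes ps[0..n-1] into out[];
 * returns the number of bytes written.  Gaps between odd primes are
 * even, so we store gap/2 per byte. */
static size_t encode_gaps(const unsigned long *ps, size_t n, unsigned char *out)
{
    size_t i, k = 0;
    for (i = 1; i < n; i++) {
        unsigned long half = (ps[i] - ps[i-1]) >> 1;
        while (half >= 255) { out[k++] = 255; half -= 255; }  /* hiccup */
        out[k++] = (unsigned char)half;
    }
    return k;
}

/* Decode: starting from the odd prime p0, recover the subsequent
 * primes into ps[]; returns the number of primes written. */
static size_t decode_gaps(unsigned long p0, const unsigned char *in,
                          size_t len, unsigned long *ps)
{
    size_t i, k = 0;
    unsigned long p = p0;
    for (i = 0; i < len; i++) {
        p += 2UL * in[i];
        if (in[i] != 255) ps[k++] = p;  /* 255 = hiccup: keep accumulating */
    }
    return k;
}
```

One byte per prime (plus rare hiccups) is what makes such a table
practical way above 2^32, where storing full 64-bit values would cost
8x the memory.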

I will comment on this patch in a separate email.

  (It is a draft only.)

Hope this helps,
Ilya
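(For readers not fluent in x86 inline assembly: the rem_half() macro
in the patch below computes the remainder of a 64-bit value modulo a
32-bit divisor with a single divl instruction, which is valid only
when the quotient fits in 32 bits.  A portable statement of that
contract, as I understand it — the function below is my illustration,
not part of the patch:)

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of 64-bit a modulo 32-bit b.  The caller must guarantee
 * a / b < 2^32, exactly what the patch ensures via its 'swtch'
 * bounds; the hardware divl would raise #DE otherwise. */
static uint32_t rem_half_portable(uint64_t a, uint32_t b)
{
    assert(b != 0 && a / b <= UINT32_MAX);
    return (uint32_t)(a % b);
}
```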

--- src/language/forprime.c-ini	2024-07-09 15:42:01.000000000 -0700
+++ src/language/forprime.c	2024-08-18 02:27:53.952923105 -0700
@@ -173,9 +173,10 @@ typedef struct {
     ulong arena;
     double power;
     double cutoff;
+    ulong bigarena;
 } cache_model_t;
 
-static cache_model_t cache_model = { CACHE_ARENA, CACHE_ALPHA, CACHE_CUTOFF };
+static cache_model_t cache_model = { CACHE_ARENA, CACHE_ALPHA, CACHE_CUTOFF, 0 };
 
 /* Assume that some calculation requires a chunk of memory to be
    accessed often in more or less random fashion (as in sieving).
@@ -282,7 +283,8 @@ good_arena_size(ulong slow2_size, ulong
 /* Use as in
     install(set_optimize,lLDG)          \\ Through some M too?
     set_optimize(2,1) \\ disable dependence on limit
-    \\ 1: how much cache usable, 2: slowdown of setup, 3: alpha, 4: cutoff
+    \\ 1: how much cache usable, 2: slowdown of setup, 3: alpha, 4: cutoff,
+    \\ 5: cache size (typically whole L2 or L3) in bytes to use in forprime() 
     \\ 2,3,4 are in units of 0.001
 
     { time_primes_arena(ar,limit) =     \\ ar = arena size in K
@@ -312,6 +314,9 @@ set_optimize(long what, GEN g)
   case 4:
     ret = (long)(cache_model.cutoff * 1000);
     break;
+  case 5:
+    ret = (long)(cache_model.bigarena);
+    break;
   default:
     pari_err_BUG("set_optimize");
     break;
@@ -324,25 +329,46 @@ set_optimize(long what, GEN g)
     case 2: slow2_in_roots     = (double)val / 1000.; break;
     case 3: cache_model.power  = (double)val / 1000.; break;
     case 4: cache_model.cutoff = (double)val / 1000.; break;
+    case 5: cache_model.bigarena = val; break;
     }
   }
   return ret;
 }
 
+#if 1
+#define rem_half(a,b) \
+__extension__ ({ ulong __arg1 = (a); unsigned int __value, __hiremainder, __arg2 = (b); \
+   __asm__ ("divl %k4" \
+        : "=a" /* %eax */ (__value), "=d" /* %edx */ (__hiremainder) \
+        : "ka" /* %eax */ (__arg1), "kd" /* %edx */ (__arg1>>32), "kmr" (__arg2)); \
+   __hiremainder; \
+})				/* call only if the return must fit into 32 bits!!! */
+#  define HAVE_rem_half
+#endif	/* !( 1 ) */ 
+
 /* s is odd; prime (starting from 3 = known_primes[2]), terminated by a 0 byte.
  * Checks n odd numbers starting at 'start', setting bytes to 0 (composite)
  * or 1 (prime), starting at data */
 static void
 sieve_chunk(pari_prime *known_primes, ulong s, byteptr data, ulong n)
 {
-  ulong p, cnt = n-1, start = s;
+  ulong p, cnt = n-1, start = s, swtch;
   pari_prime *q;
 
   memset(data, 0, n);
   start >>= 1;  /* (start - 1)/2 */
   start += n; /* Corresponds to the end */
   /* data corresponds to start, q runs over primediffs */
+#ifdef HAVE_rem_half
+//  printf("   \tst=%lu\tn=%lu\n", start, n);   fflush(stdout);
+  swtch = (2*start)/((((unsigned long)1)<<33)-1); /* For p less than swtch, the operation %
+						   * taken below has quotient <2^32. */
+//  printf("sw=%lu\tst=%lu\tn=%lu\n", swtch, start, n);   fflush(stdout);
+  
+  for (q = known_primes + 1, p = 3; p && p < swtch; p = *++q)
+#else
   for (q = known_primes + 1, p = 3; p; p = *++q)
+#endif
   { /* first odd number >= start > p and divisible by p
        = last odd number <= start + 2p - 2 and 0 (mod p)
        = p + last number <= start + p - 2 and 0 (mod 2p)
@@ -351,6 +377,15 @@ sieve_chunk(pari_prime *known_primes, ul
     long off = cnt - ((start+(p>>1)) % p);
     while (off >= 0) { data[off] = 1; off -= p; }
   }
+#ifdef HAVE_rem_half
+//  printf("   \tp=%lu\n", p);   fflush(stdout);
+  for (; p; p = *++q)
+  {
+    long off = cnt - rem_half((start+(p>>1)), p);
+    while (off >= 0) { data[off] = 1; off -= p; }
+  }
+//  printf("   \tp=%lu\n", p);   fflush(stdout);
+#endif
 }
 
 static void
@@ -404,6 +439,8 @@ initprimes0(ulong maxnum, long *lenp, pa
    * thus we do not include it in fixed_to_cache */
   asize = good_arena_size((ulong)(rootnum * slow2_in_roots), remains+1, 0,
                           &cache_model) - 1;
+//  printf("...Arena size=%lu", asize);
+  
   /* enough room on the stack ? */
   alloced = (((byteptr)avma) <= ((byteptr)bot) + asize);
   p = (byteptr)(alloced? pari_malloc(asize+1): stack_malloc(asize+1));
@@ -493,7 +530,7 @@ optimize_chunk(ulong a, ulong b)
 {
   /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large
    * as to force recalculating too often). */
-  ulong chunk = 0x80000UL;
+  ulong chunk = (cache_model.bigarena ? cache_model.bigarena : 0x80000UL)<<4; /* bigarena is in bytes, we use bits, and only odds */
   ulong tmp = (b - a) / chunk + 1;
 
   if (tmp == 1)
@@ -724,6 +761,11 @@ static void
 sieve_block(ulong a, ulong b, ulong maxpos, unsigned char* sieve)
 {
   ulong i, lim = usqrt(b), sz = (b-a) >> 1;
+#ifdef HAVE_rem_half
+  ulong swtch = a>>32; /* If p > swtch, then a/p < 2^32 */
+//  printf("  a=%lu\tb=%lu\tsw=%lu\n", a, b, swtch);
+#endif	/* defined HAVE_rem_half */
+  
   (void)memset(sieve, 0, maxpos+1);
   for (i = 2;; i++)
   { /* p is odd */
@@ -731,6 +773,11 @@ sieve_block(ulong a, ulong b, ulong maxp
     if (p > lim) break;
 
     /* solve a + 2k = 0 (mod p) */
+#ifdef HAVE_rem_half
+    if (p > swtch)
+      r = rem_half(a, p);
+    else
+#endif	/* defined HAVE_rem_half */ 
     r = a % p;
     if (r == 0)
       k = 0;