Ilya Zakharevich on Tue, 01 Oct 2024 10:04:04 +0200


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

[very-early-bird-PATCH 2.16.1-alpha] Re: NNNx speedup of forprime() etc.


This is a very early variant of the (working) patch for 2.1.16-alpha primediff support.  As I already explained, the code shows:

 • Up to 20x speedup of sieving-for-primes up to 2^64.
 • More than 1000x speedup of looking for primes between 2^64 to primelimit².
 • More than 50x (and up to 90x) speedup of looking for pseudoprimes in the same range.
 • primelimit is only memory-limited, with large prime tables taking¹⁾ about 1.01 bytes per prime.

     ¹⁾ However, the current values of macros to lead about 1.07 bytes per prime
        (for optimizations below).  For smaller tables of primes, the overhead is
        (currently) larger — but this may be fixed later.

In addition to this,

 • Speedup up to²⁾ 5x of looping through primes (or pseudoprimes) above primelimit².

     ²⁾ However, such speedup requires big tables of primes (11GB tested) and sieving for
        long regions (like 10^9) of very large primes (such as 10^203).  For smaller values,
        the speedup about 2–2.5x is more typical.

 • Possibilities  to loop through primes, or pseudoprimes, or N-rough numbers (with N≤primelimit).

 • Sieving within the region of known primes is “ALMOST” as quick-starting as with 2.16.2-beta.

      Essentially, to startup sieving in the latter case, one needs more-or-less the equivalent of
      precprime().  With 2.16.2-beta (inside the region of known-primes) it takes about 55 ns
      when repeated on random values, 38 ns when called chained, and 12 ns when called repeatedly
      on the same value.  (And the 2.16-1-alpha code is glacially slow!)  Dpending on parameters,
      the new code does

	  (Numbers in ns)		Random	Chained	Repeated

        		2.16.2-beta	 55	 38	 12
        new with 1.07 bytes per prime	350	 70	 20
        new with 1.01 bytes per prime	470	120	120

     These timings are on a particular machine.  For reference: it takes about 70 ns per the
     assignment m=n, and about 30 ns per n++.  (So these numbers are relevant for very short
     runs of forprime() only.)

       (With such quick code so dependent on memory caching, it is hard to design a benchmark
       strategy close to real-usage scenarios.  But with numbers for the starup delay this low,
       they would matter only with an extremely quick execute-per-prime code, where the
       benchmark numbers have a better chance to reflect the reality.)

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Caveat

The current code of forprime() is EXTREMELY over-engineered.  It could have been 
implemented (as the function in this patch) as a sequence of directives (first go this
way, then that way etc.).  Instead, there are dozens of intertwining C subroutines
implementing different strategies without a transparent way to add/modify strategies.

So: first, the new code to make forprime() to run through the table of known primes IS
integrated into forprime().  HOWEVER, the updated sieving code is not integrated into
the mess described above (since I have no clue how to do it nicely!), but exists as a
separate entry point.

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Caveat²: not tested on 32-bit machines

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ New (temporary) entry point: biggap_prime_le_ge(n,do_ge,do_p)

When do_p is  TRUE: returns precprime(n)/nextprime(n)          for do_ge = 0 or 1.
When do_p is FALSE: returns primepi(precprime(n)/nextprime(n)) for do_ge = 0 or 1.

Fails out of range 3..primelimit.  Import (with defaults of 0,0):

  install(biggap_primepi_le_ge,"uUD0,U,D0,U,");

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ New (temporary) entry point: sieve_block2(start,end-start,lim,is_pure);

  install(sieve_block2,"GUD0,U,D0,U,");

Start should be odd; end-start should fit in one word.  The intent is to be as
forprime(), but currently cannot take the “code to execute” as an argument.  (Instead,
returns the count of covered primes.)

Sieves the range start…end using the list of known primes — unless lim>1.  If lim is
FALSE, fails if this list is too short.  If lim>1, uses primes up to lim, possibly
post-checking the numbers left (i.e., not divisible by the primes in the list).

The post-processing depends on is_pure: if false, post-selects pseudo-primes.
If positive, post-selects “honest” primes.  If negative, there is no post-selection,
so the lim-rough numbers are counted.

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ Remarks on the (early-bird) patch

The code is fuzzed for tens of millions of examples (in a wide range of primelimit⸣s).

 • Only the file forprime.c is touched.
 • The assembler (enclosed by #if 1) should go to x64 assembler file.
 • The macro NEXT_PRIME_VIA_DIFF_BIGGAPS() should go into the global definition file.
 • Currently there is no PREV_PRIME_VIA_DIFF_BIGGAPS() macro.  It is easy to writae
   though, since the “hiccup” sequences are easy to recognize: their last byte is odd.
 • There is a lot of //-commmented debugging code.
 • No tuning for speed is done on the macros controlling allocation of ancillary-lists
   (FAKE_SKIP_EVERY_BEFORE and nearby).
 • No attempt to integrate biggap_prime_le_ge() into prec/nextprime()/primepi() is made.
 • The code to cover the functionality of prime() would be very similar to
   biggap_prime_le_ge() — but this flavor is not written yet.
 • No attempt to merge with the 2.16.2-beta code is done.
 • I did not try to proof-read through the text of the diffs (yet).

The patch is included AND attached.

Enjoy,
Ilya

--- src/language/forprime.c.orig0	2023-11-14 07:25:17.000000000 -0800
+++ src/language/forprime.c	2024-09-30 16:23:36.912528591 -0700
@@ -22,7 +22,9 @@ Foundation, Inc., 51 Franklin Street, Fi
 /**********************************************************************/
 
 static ulong _maxprime = 0;
+static ulong _maxprime_biggaps = 0;
 static ulong _maxprimelim = 0;
+static ulong _maxprimelim_biggaps = 0;
 static ulong diffptrlen;
 static GEN _prodprimes,_prodprimes_addr;
 
@@ -174,9 +176,10 @@ typedef struct {
     ulong arena;
     double power;
     double cutoff;
+    ulong bigarena;
 } cache_model_t;
 
-static cache_model_t cache_model = { CACHE_ARENA, CACHE_ALPHA, CACHE_CUTOFF };
+static cache_model_t cache_model = { CACHE_ARENA, CACHE_ALPHA, CACHE_CUTOFF, 0 };
 
 /* Assume that some calculation requires a chunk of memory to be
    accessed often in more or less random fashion (as in sieving).
@@ -200,10 +203,18 @@ static ulong
 good_arena_size(ulong slow2_size, ulong total, ulong fixed_to_cache,
                 cache_model_t *cache_model)
 {
+  static char *s;  
   ulong asize, cache_arena = cache_model->arena;
   double Xmin, Xmax, A, B, C1, C2, D, V;
   double alpha = cache_model->power, cut_off = cache_model->cutoff;
 
+  if (!s)
+  {
+    s = getenv("PARI_ARENA");
+    if (s)  cache_model->arena = cache_arena = atol(s);
+    else    s = (char*)-1;
+  }
+  
   /* Estimated relative slowdown,
      with overhead = max((fixed_to_cache+arena)/cache_arena - 1, 0):
 
@@ -283,7 +294,8 @@ good_arena_size(ulong slow2_size, ulong
 /* Use as in
     install(set_optimize,lLDG)          \\ Through some M too?
     set_optimize(2,1) \\ disable dependence on limit
-    \\ 1: how much cache usable, 2: slowdown of setup, 3: alpha, 4: cutoff
+    \\ 1: how much cache usable, 2: slowdown of setup, 3: alpha, 4: cutoff,
+    \\ 5: cache size (typically whole L2 or L3) in bytes to use in forprime() 
     \\ 2,3,4 are in units of 0.001
 
     { time_primes_arena(ar,limit) =     \\ ar = arena size in K
@@ -313,6 +325,9 @@ set_optimize(long what, GEN g)
   case 4:
     ret = (long)(cache_model.cutoff * 1000);
     break;
+  case 5:
+    ret = (long)(cache_model.bigarena);
+    break;
   default:
     pari_err_BUG("set_optimize");
     break;
@@ -325,25 +340,83 @@ set_optimize(long what, GEN g)
     case 2: slow2_in_roots     = (double)val / 1000.; break;
     case 3: cache_model.power  = (double)val / 1000.; break;
     case 4: cache_model.cutoff = (double)val / 1000.; break;
+    case 5: cache_model.bigarena = val; break;
     }
   }
   return ret;
 }
 
+#define DELTA_GUARD	0xff
+#define ToBIGGAP_SZ	7 /* The largest gaps-between-gaps up to 16e12 are the first 4;
+			   * below 5e7 after 16.5e9. */
+#define offsetOfGapBIGGGAP	2
+#define setGapBIGGAP(p,n,nn)				\
+    (*(p)++ = (n)&0xff, *(p)++ = (((n)&0xff00)>>8), *(p)++ = (((n)&0xff0000)>>16), *(p)++ = (((n)&0xff000000)>>24), *(p)++ = (nn)&0xff, *(p)++ = (((nn)&0xff00)>>8), *(p)++ = (((nn)&0xff0000)>>16))
+#define getGapBIGGAP(p)		\
+    ((p)[0] + ((p)[1]<<8) + ((p)[2]<<16) + ((p)[3]<<24)) /* Speed irrelevant */
+#define offsetGapBIGGAP(p)	((p)[4] + ((p)[5]<<8) + ((p)[6]<<16))
+#define NEXT_PRIME_VIADIFF_BIGGAPS(p,d) STMT_START			\
+  { if (!*(d)) {(p) += ((*++(d))<<8)-1; (d)+=ToBIGGAP_SZ+1;} (p) += *(d)++; } STMT_END /* Overshoots if overflows */
+#define UNDO_NEXT_PRIME_VIADIFF(p,d)	(p) -= *--(d)
+#define SIEVE_ROUND1		(1ul<<17)
+#define FAKE_SKIP_EVERY_BEFORE	200 // 400000	/* 45000 seems to be reasonable, but slows down large values too much */
+#define FAKE_SKIP_EVERY_SWCH	(50000*FAKE_SKIP_EVERY_BEFORE)   /* In units of primes after the 1st biggap */
+#define FAKE_SKIP_EVERY_AFTER	FAKE_SKIP_EVERY_BEFORE
+#define STORE_EVERY_BIGGAP	5 /* Every N-th fake/biggap stored; decreases overhead to 2+ToBIGGAP_SZ+3*sizeof(ulong)/STORE_EVERY_BIGGAP */
+#define STORE_EVERY_SMALLPRIME	1  /* pi(436273289) = 23163293) */
+#define STORE_EVERY_MIDPRIME_ST	82025 /* primepi(1<<20); at least up to SIEVE_ROUND1 */
+#define STORE_EVERY_MIDPRIME	1000
+
+static ulong primes_store_mx, primes_store_last, *primes_store,
+    *primes_store_n;
+byteptr *primes_store_off;
+
+#if 1
+#define rem_half(a,b) \
+__extension__ ({ ulong __arg1 = (a); unsigned int __value, __hiremainder, __arg2 = (b); \
+   __asm__ ("divl %k4" \
+        : "=a" /* %eax */ (__value), "=d" /* %edx */ (__hiremainder) \
+        : "ka" /* %eax */ (__arg1), "kd" /* %edx */ (__arg1>>32), "kmr" (__arg2)); \
+   __hiremainder; \
+})				/* call only if the return must fit into 32 bits!!! */
+#define bsf64(x) \
+__extension__ ({ ulong __arg = (x); \
+   long trailing_one_position; \
+  __asm__ ("bsfq %1,%0" : "=r" (trailing_one_position) : "rm" (__arg) : "cc"); \
+  trailing_one_position; \
+})
+#  define HAVE_rem_half
+#  define bsf__maybe	bsf64
+#else	/* !( 1 ) */ 
+#  define bsf__maybe(x)	0
+#endif	/* !( 1 ) */ 
+
 /* s is odd; prime differences (starting from 5-3=2) start at known_primes[2],
   terminated by a 0 byte. Checks n odd numbers starting at 'start', setting
-  bytes starting at data to 0 (composite) or 1 (prime) */
+  bytes starting at data to 0 (composite) or 1 (prime)/
+  Assumes that no big gaps are in the used range addressed by known_primes. */
 static void
 sieve_chunk(byteptr known_primes, ulong s, byteptr data, ulong n)
-{
-  ulong p, cnt = n-1, start = s, delta = 1;
-  byteptr q;
+{	/* Good for small chunks: Writes 1 byte per an odd number to sieve, no reads */
+  ulong p = 3, cnt = n-1, start = s, delta = 1, swtch;
+  byteptr q = known_primes + 1;
 
   memset(data, 0, n);
   start >>= 1;  /* (start - 1)/2 */
   start += n; /* Corresponds to the end */
   /* data corresponds to start, q runs over primediffs */
-  for (q = known_primes + 1, p = 3; delta; delta = *++q, p += delta)
+ while (1)
+ {
+  if (!delta)	return;		/* TRUE unless the first round */
+#ifdef HAVE_rem_half
+//  printf("   \tst=%lu\tn=%lu\n", start, n);   fflush(stdout);
+  swtch = start>>32; /* If p > swtch, then start/p < 2^32 */
+//  printf("sw=%lu\tst=%lu\tn=%lu\n", swtch, start, n);   fflush(stdout);
+
+  for (; delta && p <= swtch; delta = *++q, p += delta)
+#else
+  for (; delta; delta = *++q, p += delta)
+#endif
   { /* first odd number >= start > p and divisible by p
        = last odd number <= start + 2p - 2 and 0 (mod p)
        = p + last number <= start + p - 2 and 0 (mod 2p)
@@ -352,16 +425,26 @@ sieve_chunk(byteptr known_primes, ulong
     long off = cnt - ((start+(p>>1)) % p);
     while (off >= 0) { data[off] = 1; off -= p; }
   }
+#ifdef HAVE_rem_half
+//  printf("   \tp=%lu\n", p);   fflush(stdout);
+  for (; delta; delta = *++q, p += delta)
+  {
+    long off = cnt - rem_half((start+(p>>1)), p);
+    while (off >= 0) { data[off] = 1; off -= p; }
+  }
+//  printf("   \tp=%lu\n", p);   fflush(stdout);
+#endif
+ }
 }
 
 static void
 set_prodprimes(void)
 {
   pari_sp ltop = avma, av;
-  long m = expu(_maxprime) + 1 - 7;
-  GEN W, w, v = primes_interval_zv(3, _maxprime);
+  ulong b = 1UL << 8, mxp = minuu(2097169, _maxprime);
+  long m = expu(mxp) + 1 - 7; /* nextprime(2<20) */
+  GEN W, w, v = primes_interval_zv(3, mxp);
   long s, j, jold, lv = lg(v), u = 1;
-  ulong b = 1UL << 8;
 
   W = cgetg(m+1, t_VEC);
   for (jold = j = 1; j < lv; j++)
@@ -371,7 +454,7 @@ set_prodprimes(void)
       w = v+jold-1; w[0] = evaltyp(t_VECSMALL) | _evallg(lw);
       gel(W,u++) = zv_prod_Z(w); /* p_jold ... p_{j-1} */
       jold = j; b *= 2;
-      if (b > _maxprime) b = _maxprime; /* truncate last run */
+      if (b > mxp) b = mxp; /* truncate last run */
     }
   m = u - 1; setlg(W, u);
   for (j = 2; j <= m; j++) gel(W,j) = mulii(gel(W,j-1), gel(W,j));
@@ -386,69 +469,131 @@ set_prodprimes(void)
 /* assume maxnum <= 436273289 < 2^29 */
 static void
 initprimes0(ulong maxnum, long *lenp, ulong *lastp, byteptr p1)
-{
+{   /* Not thread save!  The caller should have exclusive control over p1 */
   pari_sp av = avma, bot = pari_mainstack->bot;
   long alloced, psize;
   byteptr q, end, p, end1, plast, curdiff;
-  ulong last, remains, curlow, rootnum, asize;
-  ulong prime_above;
-  byteptr p_prime_above;
+  ulong last, remains, curlow, maxnum0 = maxnum, asize, prefake = 0, postgap = 0;
+  ulong prime_above, lastBigGap = 0, stored_primes = 0, prestored = 0;
+  ulong prime_curr = 2, primen_curr = 0, dogap;
+  byteptr p_prime_above, posLastBigGap = 0;
+  unsigned char fake[4], *write_gap_gap = fake;
 
-  maxnum |= 1; /* make it odd. */
+  if (!odd(maxnum)) maxnum--; /* make it odd. */
   /* base case */
-  if (maxnum < 1ul<<17) { initprimes1(maxnum>>1, lenp, lastp, p1); return; }
-
-  /* Checked to be enough up to 40e6, attained at 155893 */
-  rootnum = usqrt(maxnum) | 1;
-  initprimes1(rootnum>>1, &psize, &last, p1);
+  if (maxnum >= SIEVE_ROUND1)  maxnum0 = usqrt(maxnum) | 1; /* Assume <=436273289 */
+  initprimes1(maxnum0>>1, lenp, lastp, p1);
+  while (stored_primes < *lenp - 1)
+  {
+    primes_store_off[stored_primes]   = p1 + stored_primes + 1; /* at end would point at the trailing 0 */
+    primes_store    [stored_primes]   = prime_curr;
+    primes_store_n  [stored_primes]   = ++primen_curr;
+    prime_curr += p1[++stored_primes];
+  }
+//  printf(" p=%lu, %lu, %lu, %lu, %lu, %lu\n", primes_store[0], primes_store[1], primes_store[2], primes_store[3], primes_store[4], primes_store[5]);
+//  printf(" n=%lu, %lu, %lu, %lu, %lu, %lu\n", primes_store_n[0], primes_store_n[1], primes_store_n[2], primes_store_n[3], primes_store_n[4], primes_store_n[5]);
+//  printf(" addr=%p, %ld, %ld, %ld, %ld, %ld, %ld\n", p1, primes_store_off[0] - p1, primes_store_off[1] - p1, primes_store_off[2] - p1, primes_store_off[3] - p1, primes_store_off[4] - p1, primes_store_off[5] - p1);
+  if (maxnum < 1ul<<17)
+  {
+    p1[(*lenp)++] = DELTA_GUARD;
+    primes_store_last = stored_primes-1;
+    return;
+  }
+  psize = *lenp;
+  last = *lastp;
+  /* Now siave between last+2 and maxnum */
   end1 = p1 + psize - 1;
+//  end1[1] = DELTA_GUARD;
   remains = (maxnum - last) >> 1; /* number of odd numbers to check */
 
   /* we access primes array of psize too; but we access it consecutively,
    * thus we do not include it in fixed_to_cache */
-  asize = good_arena_size((ulong)(rootnum * slow2_in_roots), remains+1, 0,
+  asize = good_arena_size((ulong)(maxnum0 * slow2_in_roots), remains+1, 0,
                           &cache_model) - 1;
+//  printf("...Arena size=%lu", asize);
+  
   /* enough room on the stack ? */
   alloced = (((byteptr)avma) <= ((byteptr)bot) + asize);
   if (alloced)
     p = (byteptr)pari_malloc(asize+1);
   else
     p = (byteptr)stack_malloc(asize+1);
-  end = p + asize; /* the 0 sentinel goes at end. */
-  curlow = last + 2; /* First candidate: know primes up to last (odd). */
+  end = p + asize;		     /* the 0 sentinel goes at end. */
+  curlow = last + 2; /* First candidate in arena, at p: know primes up to last (odd). */
   curdiff = end1;
 
   /* During each iteration p..end-1 represents a range of odd
      numbers.  plast is a pointer which represents the last prime seen,
      it may point before p..end-1. */
-  plast = p - 1;
-  p_prime_above = p1 + 2;
+  plast = p - 1;	    /* Number corresponding to p-1 is prime */
+  p_prime_above = p1 + 2;   /* Skip "diffs 2, 1" --> prime=3 */
   prime_above = 3;
   while (remains)
   { /* cycle over arenas; performance not crucial */
     unsigned char was_delta;
+    ulong delta;
     if (asize > remains) { asize = remains; end = p + asize; }
     /* Fake the upper limit appropriate for the given arena */
     while (prime_above*prime_above <= curlow + (asize << 1) && *p_prime_above)
-      prime_above += *p_prime_above++;
+      prime_above += *p_prime_above++;	/* By restricting maxnum to <436273009^2 we do not encounter biggaps. */
     was_delta = *p_prime_above;
-    *p_prime_above = 0; /* sentinel for sieve_chunk */
-    sieve_chunk(p1, curlow, p, asize);
+//    was_delta2 = p_prime_above[1];
+    *p_prime_above = 0; /* sentinel for sieve_chunk */ /* Not thread save! */
+//    p_prime_above[1] = DELTA_GUARD; /* sentinel for sieve_chunk */
+    sieve_chunk(p1, curlow, p, asize);	/* Uses 1 byte per an odd number to sieve */
     *p_prime_above = was_delta; /* restore */
+//    p_prime_above[1] = was_delta2; /* restore */
 
-    p[asize] = 0; /* sentinel */
+//    printf("New arena, p=%lu, n=%lu\n", prime_curr, primen_curr);
+    p[asize] = 0; /* byte at end; sentinel */
     for (q = p; ; plast = q++)
     { /* q runs over addresses corresponding to primes */
-      while (*q) q++; /* use sentinel at end */
+      while (*q) q++; /* use sentinel at end = p+asize */
       if (q >= end) break;
-      *curdiff++ = (unsigned char)(q-plast) << 1; /* < 255 for q < 436273291 */
+      delta = (ulong)(q-plast) << 1;	/* < 255 for q < 436273291 */
+      dogap = delta > 255 || (postgap && ++prefake >= (primen_curr > (23163293 + FAKE_SKIP_EVERY_SWCH) ? FAKE_SKIP_EVERY_AFTER : FAKE_SKIP_EVERY_BEFORE));
+      if ((postgap || dogap) ? dogap && (++prestored >= STORE_EVERY_BIGGAP || !postgap)
+	  : ++prestored >= (primen_curr >= STORE_EVERY_MIDPRIME_ST ? STORE_EVERY_MIDPRIME : STORE_EVERY_SMALLPRIME))
+      {	   /* On the first biggap, and (rarely) on biggaps, or on
+	    * primes pre-first-biggap */
+	primes_store_off[stored_primes]   = curdiff;
+	primes_store    [stored_primes]   = prime_curr; /* pre-delta */
+	primes_store_n  [stored_primes++] = primen_curr;
+	prestored = 0;
+      }
+      primen_curr++;
+      prime_curr += delta;
+      if (dogap)
+      {
+	setGapBIGGAP(write_gap_gap, curlow + ((q-p)<<1) - delta - lastBigGap, curdiff - posLastBigGap);
+	lastBigGap = curlow + ((q-p)<<1) - delta;
+	posLastBigGap = curdiff;
+        *curdiff++ = 0;
+        *curdiff++ = delta>>8;
+	write_gap_gap = curdiff;
+	curdiff += ToBIGGAP_SZ;
+	prefake = 0;
+	postgap++;
+      }
+      *curdiff++ = (delta & 0xff) | dogap; /* The last bit marked */
     }
-    plast -= asize;
+    plast -= asize;			/* points before the arena now */
     remains -= asize;
     curlow += (asize<<1);
-  }
+  } /* currdiff-1 is delta corresponding to prime_curr */
+  primes_store_off[stored_primes] = curdiff; /* Points at the trailing 0. May be a duplicate… */
+  primes_store  [stored_primes] =  prime_curr;
+  primes_store_n[stored_primes] = primen_curr;
+  primes_store_last = stored_primes;
+  if (primes_store_last > primes_store_mx) /* We use experiments/heuristics for size! */
+    pari_err(e_MISC,"Panic: primestore overflow, curr=%lu, mx=%lu", primes_store_last, primes_store_mx);
+//  printf(" p=%lu, %lu, %lu, %lu, %lu, %lu\n", primes_store[0], primes_store[1], primes_store[2], primes_store[3], primes_store[4], primes_store[5]);
+//  printf(" n=%lu, %lu, %lu, %lu, %lu, %lu\n", primes_store_n[0], primes_store_n[1], primes_store_n[2], primes_store_n[3], primes_store_n[4], primes_store_n[5]);
+//  printf(" addr=%p, %ld, %ld, %ld, %ld, %ld, %ld\n", p1, primes_store_off[0] - p1, primes_store_off[1] - p1, primes_store_off[2] - p1, primes_store_off[3] - p1, primes_store_off[4] - p1, primes_store_off[5] - p1);
   last = curlow - ((p - plast) << 1);
-  *curdiff++ = 0; /* sentinel */
+  setGapBIGGAP(write_gap_gap, last - lastBigGap, curdiff - posLastBigGap);
+  *curdiff++ = 0;		/* sentinel1 */
+  *curdiff++ = DELTA_GUARD;	/* sentinel2 */
   *lenp = curdiff - p1;
   *lastp = last;
   if (alloced) pari_free(p); else set_avma(av);
@@ -457,6 +602,8 @@ initprimes0(ulong maxnum, long *lenp, ul
 ulong
 maxprime(void) { return diffptr ? _maxprime : 0; }
 ulong
+maxprime_biggaps(void) { return diffptr ? _maxprime_biggaps : 0; }
+ulong
 maxprimelim(void) { return diffptr ? _maxprimelim : 0; }
 ulong
 maxprimeN(void) { return diffptr ? diffptrlen-1: 0; }
@@ -468,20 +615,83 @@ maxprime_check(ulong c) { if (_maxprime
 
 /* We ensure 65302 <= maxnum <= 436273289: the LHS ensures modular function
  * have enough fast primes to work, the RHS ensures that p_{n+1} - p_n < 255
+ *   (UPDATE: This is not needed anymore: the caller takes care of _maxprime.)
  * (N.B. RHS would be incorrect since initprimes0 would make it odd, thereby
  * increasing it by 1) */
 byteptr
 initprimes(ulong maxnum, long *lenp, ulong *lastp)
 {
   byteptr t;
+  double L, overhead_biggaps;
+  size_t l1, l2, biggaps_fake = 0, biggaps_store, cont;
   if (maxnum < 65537)
     maxnum = 65537;
-  else if (maxnum > 436273289)
-    maxnum = 436273289;
-  t = (byteptr)pari_malloc((size_t) (1.09 * maxnum/log((double)maxnum)) + 146);
+#ifdef LONG_IS_64BIT
+  else if (maxnum >= 436273009UL*436273009UL)	/* Avoid biggaps when sieving to construct a sieve; */
+    maxnum = 436273009UL*436273009UL - 2;
+#else
+  else if (maxnum > 4294901999UL)
+    maxnum = 4294901999UL;	/* Ensure that NEXT_PRIME_VIADIFF_BIGGAPS overshoots at end */
+#endif	/* !( defined LONG_IS_64BIT ) */
+  L = 1/log((double)maxnum);  /* Majorate Li(maxnum) which majorates */
+  l1 = 1.09 * maxnum*L + 145; /* primepi(maxnum) for reasonable log(maxnum) */
+  l2 = maxnum*L*(1+L*(1+2*L*(1+3*L*(1+9.6*L)))); /* Good at 5,290. */
+  /* exp(-256*L) is a naive “independent residues” estimate for gaps.  
+     The experiments (up to maxnum=1.5e14) suggest the measured
+     fraction is much smaller (as 0.01 ... 1/3): with x=maxnum/1e9:
+        exp(-Q(L) - C(x)/x^2 +- 0.75/x^(2/3))
+     with a quadratic Q (and 0.75 may be changed to 0.47 outside of the
+     x-interval [1.18,1.3]).  (No trace of the need for higher degree
+     terms with degree <= 20 is detected.)  Here C(x)/x^2 is a
+     correction for "a hill" at small x's (with C <= 1.8).
+        Moreover, Q(L)>256*L (although comes within 0.127 near
+     maxnum=1e54; but this is way above the experimental range!).  */
+  overhead_biggaps = exp(-256*L); 	   /* Proportion of big gaps */
+  ulong ll1 = minuu(l1,l2);	      /* Min.  Crossover at 3.8e5; error <0.01% at >=6.66e9. */
+  if (l2 > 23163298)			   /* Have big gaps.  May create fake big gaps */
+    biggaps_fake = (minuu(l2 - 23163298, FAKE_SKIP_EVERY_SWCH) + 1)/FAKE_SKIP_EVERY_BEFORE + 1;
+  if (l2 >= 23163298 + FAKE_SKIP_EVERY_SWCH)
+    biggaps_fake += (l2 - 23163298 - FAKE_SKIP_EVERY_SWCH + 1)/FAKE_SKIP_EVERY_AFTER + 1;
+  biggaps_store = ((maxuu(ll1+1,23163298)-23163298)*overhead_biggaps + 1 + biggaps_fake)/STORE_EVERY_BIGGAP + 2; /* >=2 */
+  cont = 1.09*2*usqrt(maxnum)*L + 145;	/* Estimate from above */
+  cont = maxuu(STORE_EVERY_MIDPRIME_ST, cont);
+  biggaps_store += minuu(ll1, cont)/STORE_EVERY_SMALLPRIME + 2; /* >=2 */
+  if (ll1 > cont)
+    biggaps_store += (minuu(ll1, 23163298) - minuu(23163298, cont))/STORE_EVERY_MIDPRIME + 1;
+
+//  printf("  l1=%lu l2=%lu ll1=%lu ovhd=%g (units) store=%lu\n", l1, l2, ll1, overhead_biggaps, biggaps_store);
+  overhead_biggaps *= (2+ToBIGGAP_SZ);	   /* In bytes */
+  l2 *= 1 + overhead_biggaps;		   /* No actual insertions in case of l1 */
+  l2 += biggaps_fake * (2+ToBIGGAP_SZ);
+  if (maxnum >= 436273289 || l1 > l2) l1 = l2; /* Crossover at 3.8e5; error <0.01% at >=6.66e9. */
+  l1 += 2 + ToBIGGAP_SZ;      /* Allow same code to run at end */
+//  printf("  l1=%lu ovhd=%g (bytes)\n", l1, overhead_biggaps);
+  l1 = (l1 + sizeof(ulong) - 1)/sizeof(ulong)*sizeof(ulong); /* Pad */
+  t = (byteptr)pari_malloc((size_t)l1 + 3*biggaps_store*sizeof(ulong));
+//  printf(" diffptr=%p to %p", t, t + (size_t)l1 + 3*biggaps_store*sizeof(ulong));
+  primes_store     = (ulong*)(t+l1);
+  primes_store_n   = primes_store + biggaps_store;
+  primes_store_off = (byteptr*)(primes_store_n + biggaps_store);
+  primes_store_mx    = biggaps_store - 1;
+  primes_store_last = 0;		/* Not initialized yet! */
+
   initprimes0(maxnum, lenp, lastp, t);
-  _maxprimelim = maxnum;
-  return (byteptr)pari_realloc(t, *lenp);
+  if (*lenp > l1) pari_err(e_MISC, "wrong estimate %lu of length %lu of primediff", l1, *lenp);
+  _maxprimelim_biggaps = _maxprimelim = maxnum;
+  if (maxnum >= 436273291)
+    _maxprimelim = 436273291 - 2;
+//  printf("Sieving done\n");  fflush(stdout);
+  if (*lenp < l1 - (l1>>7) - 2 - ToBIGGAP_SZ - sizeof(ulong))		/* Between 200,000 and 500,000 up to 0.3% */
+    fprintf(stderr, "Warning: Need to reallocate primediff from %ld to %ld (fence %ld))\n", 
+	    l1, *lenp, l1 - (l1>>7) - 2 - ToBIGGAP_SZ - sizeof(ulong)); /* Almost hit @69991 */
+  else return t;	  /* Avoiding realloc() may save 1/2 of memory footprint.) */
+  ulong n = (*lenp + sizeof(ulong) - 1)/sizeof(ulong)*sizeof(ulong);
+  memmove(t + n, primes_store, 3*biggaps_store*sizeof(ulong));
+  t = (byteptr)pari_realloc(t, n + 3*biggaps_store*sizeof(ulong)); /* Want to avoid this: may overuse memory */
+  primes_store     = (ulong*)(t+n);
+  primes_store_n   = primes_store + biggaps_store;
+  primes_store_off = (byteptr*)(primes_store_n + biggaps_store);
+  return t;
 }
 
 void
@@ -492,7 +702,7 @@ initprimetable(ulong maxnum)
   byteptr p = initprimes(maxnum, &len, &last), old = diffptr;
   diffptrlen = minss(diffptrlen, len);
   _maxprime  = minss(_maxprime,last); /*Protect against ^C*/
-  diffptr = p; diffptrlen = len; _maxprime = last;
+  diffptr = p; diffptrlen = len; _maxprime = minss(436273009,last); _maxprime_biggaps = last;
   set_prodprimes();
   if (old) free(old);
 }
@@ -547,7 +757,7 @@ optimize_chunk(ulong a, ulong b)
 {
   /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large
    * as to force recalculating too often). */
-  ulong chunk = 0x80000UL;
+  ulong chunk = (cache_model.bigarena ? cache_model.bigarena : 0x80000UL)<<4; /* bigarena is in bytes, we use bits, and only odds */
   ulong tmp = (b - a) / chunk + 1;
 
   if (tmp == 1)
@@ -572,6 +782,123 @@ sieve_init(forprime_t *T, ulong a, ulong
 
 enum {PRST_none, PRST_diffptr, PRST_sieve, PRST_unextprime, PRST_nextprime};
 
+/* Return 0 if !ge, and a<3; then ptr_* do not point to correct values. */
+ulong /* Assume 2 < a <= _maxprime_biggaps; and >=2 stored primes.  Not tested with ge>0 */
+biggap_prime_le_ge(ulong a, ulong ge, ulong *ptr_p, byteptr *ptr_d)
+{		  /* a>2.  Start with binary search over the stored table */
+  ulong *s = primes_store + 1, *e = s + primes_store_last, *mid;
+  while (s <= e) /* The "insertion point" is between s and e+1 */
+  {		 /* Return it in s ()may be max+1) */
+    mid = s + (e - s)/2;
+    if (a == *mid)
+    {
+     intable:
+      *ptr_p = a;
+      *ptr_d = primes_store_off[mid - primes_store];
+      return   primes_store_n  [mid - primes_store];
+    }
+    if (a > *mid)    s = mid + 1;
+    else	     e = mid - 1;
+  } /* Now s==max+1, or a < *s.  Moreover, a is not in the list */
+  if (s > primes_store + primes_store_last)
+  {
+    if (!ge && a <= _maxprimelim_biggaps)
+    {
+      mid = primes_store + primes_store_last;
+      a = *mid;
+      goto intable;
+    }
+    if (ge) pari_err(e_MISC, "the argument %lu is above the max stored prime %lu (%lu))", a, _maxprime_biggaps, primes_store[primes_store_last]);
+    pari_err(e_MISC, "the argument %lu is above the prime limit %lu", a, _maxprimelim_biggaps);
+  } /* Now a < *s */
+  if (s == primes_store+1)		/* a is on the left */
+  {
+    if (ge)
+    {
+      mid = primes_store;
+      a = *mid;
+      goto intable;
+    }
+    pari_err(e_MISC, "the argument %lu is below 3", a);
+  } /* Now s[-1] < a < s[0] */
+  ulong p = 436273009, n = 23163298, nn, pp, pp1, off; /* Need 1 iteration to become correct */
+  byteptr d = diffptr + offsetOfGapBIGGGAP, dd, dd1;
+  dd = primes_store_off[s - 1 - primes_store] + offsetOfGapBIGGGAP;
+  nn = primes_store_n  [s - 1 - primes_store];
+  if (*s > p)		     /* After the first gap; s - 1 is a big gap */
+  {			     /* Loop over big gaps >= s - 1 */
+    p = s[-1];
+    while (1)			/* Know a < _maxprime_biggaps */
+    {				/* Know a > pp */
+      pp  = p;
+      p  += getGapBIGGAP(dd);
+      off = offsetGapBIGGAP(dd);
+      if (p >= a) break;
+      dd += off;			/* Corresponds to pp */
+      nn += off - 2 - ToBIGGAP_SZ;	/* Corresponds to pp */
+  //    printf("->pp=%lu p=%lu poff=%lu off=%lu doff=%lu pn=%lu n=%lu\t%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x\n", pp, p, (ulong)(d-diffptr), (ulong)(d-diffptr)+off, off, nn, nn+off-2-ToBIGGAP_SZ, d[-2], d[-1], d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7]);
+    }
+    dd -= offsetOfGapBIGGGAP;
+    d = dd + off;
+    n = nn + off - 2 - ToBIGGAP_SZ;
+    if (a == p)
+    {
+      *ptr_p = a;
+      *ptr_d = d;
+      return  n;
+    } /* Now special-case the first diff (of the big gap) */
+    dd1 = dd;
+    pp1 = pp;
+    NEXT_PRIME_VIADIFF_BIGGAPS(pp,dd);
+    if (a <= pp)	     /* Need to treat the big gap specially */
+    {
+      if (ge || a == pp)
+      {
+	*ptr_p = pp;
+	*ptr_d = dd;
+	return  nn + 1;
+      }
+      *ptr_p = pp1;
+      *ptr_d = dd1;
+      return  nn;
+    }
+    nn++;
+  }
+  else					/* No gaps nearby */
+  {
+    pp = s[-1];
+    dd -= offsetOfGapBIGGGAP;
+    p = *s;
+    d = primes_store_off[s - primes_store];
+    n = primes_store_n  [s - primes_store];
+  } /* Now can scan froward from pp/dd/nn (or backwards) */
+  ulong md = pp + ((ulong)(p - pp))/2;  
+  if (a <= md)				/* Better to search forward */
+  {
+    d = dd;
+    p = pp;	      /* So NEXT_PRIME_VIADIFF gets a correct value */
+    do { NEXT_PRIME_VIADIFF(p,d); } while (p < a);
+    if (!ge && p!=a) PREC_PRIME_VIADIFF(p,d);
+  }
+  else					/* Better to search backwards */
+  {
+    do { PREC_PRIME_VIADIFF(p,d); } while (p > a);
+    if (ge && p!=a)  NEXT_PRIME_VIADIFF(p,d);
+  }
+  *ptr_p = p;
+  *ptr_d = d;
+  return nn + (d - dd);
+}
+
+ulong /* Assume 436273009 < a <= _maxprime_biggaps.  Not tested with ge>0 */
+biggap_primepi_le_ge(ulong a, ulong ge, ulong return_p)
+{
+  ulong n, p;
+  byteptr fakep;
+  n = biggap_prime_le_ge(a, ge, &p, &fakep);
+  return return_p ? p : n;
+}
+
 static void
 u_forprime_set_prime_table(forprime_t *T, ulong a)
 {
@@ -581,8 +908,9 @@ u_forprime_set_prime_table(forprime_t *T
     T->p = 0;
     T->d = diffptr;
   }
-  else
+  else if (a <= 3)//436273009)
     T->p = init_primepointer_lt(a, &T->d);
+  else     biggap_prime_le_ge(a - !!a, 0, &(T->p), &(T->d)); /* Check lt, not ge */
 }
 
 /* Set p so that p + q the smallest integer = c (mod q) and > original p.
@@ -600,6 +928,9 @@ arith_set(forprime_t *T)
   T->p = itou_or_0(d); set_avma(av);
 }
 
+/* FIXME: should take into account the arena size and b */
+#define SLOWDOWN_NEXTPRIME_BILLION	26 /* How many times slower is nextprime() in range of 10^9 */
+
 /* run through primes in arithmetic progression = c (mod q) */
 static int
 u_forprime_sieve_arith_init(forprime_t *T, struct pari_sieve *psieve,
@@ -615,7 +946,7 @@ u_forprime_sieve_arith_init(forprime_t *
     T->d = diffptr;
     return 0;
   }
-  maxp = maxprime();
+  maxp = maxprime_biggaps();
   if (q != 1)
   {
     c %= q;
@@ -645,22 +976,15 @@ u_forprime_sieve_arith_init(forprime_t *
   else
     u_forprime_set_prime_table(T, a);
 
-  maxp2 = (maxp & HIGHMASK)? 0 : maxp*maxp;
+  maxp2 = (maxp & HIGHMASK)? ULONG_MAX : maxp*maxp;
   /* FIXME: should sieve as well if q != 1, adapt sieve code */
-  if (q != 1 || (maxp2 && maxp2 <= a)
-             || T->b - maxuu(a,maxp) < maxp / expu(b))
+  if (q != 1 || (maxp2 <= a)
+             || T->b - maxuu(a,maxp) < 1000000000/SLOWDOWN_NEXTPRIME_BILLION)
   { if (T->strategy==PRST_none) T->strategy = PRST_unextprime; }
   else
   { /* worth sieving */
-#ifdef LONG_IS_64BIT
-    const ulong UPRIME_MAX = 18446744073709551557UL;
-#else
-    const ulong UPRIME_MAX = 4294967291UL;
-#endif
-    ulong sieveb;
-    if (b > UPRIME_MAX) b = UPRIME_MAX;
-    sieveb = b;
-    if (maxp2 && maxp2 < b) sieveb = maxp2;
+    ulong sieveb = b;
+    if (maxp2 < b) sieveb = maxp2;
     if (T->strategy==PRST_none) T->strategy = PRST_sieve;
     sieve_init(T, maxuu(maxp+2, a), sieveb);
   }
@@ -754,15 +1078,25 @@ static void
 sieve_block(ulong a, ulong b, ulong maxpos, unsigned char* sieve)
 {
   ulong p = 2, lim = usqrt(b), sz = (b-a) >> 1;
+#ifdef HAVE_rem_half
+  ulong swtch = a>>32; /* If p > swtch, then a/p < 2^32 */
+//  printf("  a=%lu\tb=%lu\tsw=%lu\n", a, b, swtch);
+#endif	/* defined HAVE_rem_half */
+ 
   byteptr d = diffptr+1;
   (void)memset(sieve, 0, maxpos+1);
   for (;;)
   { /* p is odd */
     ulong k, r;
-    NEXT_PRIME_VIADIFF(p, d); /* starts at p = 3 */
+    NEXT_PRIME_VIADIFF_BIGGAPS(p, d); /* starts at p = 3 */
     if (p > lim) break;
 
     /* solve a + 2k = 0 (mod p) */
+#ifdef HAVE_rem_half
+    if (p > swtch)
+      r = rem_half(a, p);
+    else
+#endif	/* defined HAVE_rem_half */ 
     r = a % p;
     if (r == 0)
       k = 0;
@@ -778,6 +1112,205 @@ sieve_block(ulong a, ulong b, ulong maxp
   }
 }
 
+int
+byteBitCNT(ulong b)
+{
+  b = (b & 0x55) + ((b & 0xAA)>>1);
+  b = (b & 0x33) + ((b & 0xCC)>>2);
+  return (b & 0x0f) + ((b & 0xf0)>>4);
+}
+
+ulong
+try_div__(GEN aa, ulong b)
+{
+  ulong tot = 0, i = 0;
+  for (; i<100000; i++)
+    tot += umodiu(aa, b+i);
+  return tot;
+}
+
+ulong
+try_bfffo(ulong in)
+{
+  unsigned char v = in;
+  byteptr p = &v;
+  ulong s = bsf__maybe((unsigned char)~*p), e = BITS_IN_LONG - 1 - bfffo((unsigned char)~*p);
+  return 1000000 + 100*s + e;
+}
+
+#define LOOP_RESDS(k,sz,SIEVE,p)    while ((k) <= (sz)) { (SIEVE)[(k)>>3] |= 1 << ((k)&7); (k) += (p); /* 2k += 2p */ }
+
+/* m = a + 2r is the smallest odd m >= a, p | m */
+/* position n (corresponds to a+2n) is sieve[n>>3], bit n&7 */
+#define MARK_RESDS(r,sz,SIEVE,p)	\
+    STMT_START { if (r != 0) { r = p - r; if (odd(r)) r += p; r >>= 1; } LOOP_RESDS(r,sz,sieve,p); } STMT_END
+
+static long
+chk_isprime(GEN N) { return signe(gisprime(N,0)); }
+	
+GEN
+sieve_block2(GEN aa, ulong b_a, ulong lm, long doRoughPrimes)     /* a+b_a is included, aa should be odd */
+{
+  pari_sp av = avma, av1 = av, bot = pari_mainstack->bot;
+  ulong a = *int_LSW(aa), shiftword = *int_MSW(aa), rc = 0, rc1 = 0;
+  ulong p = 3, lim, glim, sz = b_a >> 1, maxpos = (b_a>>4) + 1, sel = 0;
+  byteptr sieve, end, tofree = 0;
+  long (*checker)(GEN) = BPSW_psp;
+  
+  if (glength(aa) > 2) shiftword = ULONG_MAX;
+  else if (glength(aa) < 2) shiftword = 0;
+#ifdef HAVE_rem_half
+  ulong swtch = a>>32; /* If p > swtch, then a/p < 2^32 */
+//  printf("  a=%lu\tb=%lu\tsw=%lu\n", a, b, swtch);
+#endif	/* defined HAVE_rem_half */
+
+  if (lm > 1)
+  {
+    glim = lm;
+    sel = (doRoughPrimes >= 0);
+  }  
+  else if (shiftword) glim = itou(sqrtint(gadd(aa, utoi(b_a))));
+  else                glim = usqrt(a + b_a);
+  if (glim > _maxprimelim_biggaps)
+  {
+      if (lm != 1)  pari_err(e_MISC, "max calculated primes %lu too small (need %lu) for sieve_block2", _maxprimelim_biggaps, glim);
+      glim = _maxprimelim_biggaps;
+      sel = (doRoughPrimes >= 0);
+  }
+  if (doRoughPrimes > 0)  checker = chk_isprime;
+  if (((byteptr)avma) <= ((byteptr)bot) + maxpos + 102400)
+    tofree = sieve = (byteptr)pari_calloc(maxpos);
+  else
+  {
+    sieve = (byteptr)stack_calloc(maxpos);
+    av1 = avma;
+  }
+  if (!sieve) pari_err(e_MEM);
+  
+  byteptr d = diffptr+2;	   /* matches p=3 */
+//  (void)memset(sieve, 0, maxpos);
+  end = sieve + maxpos - 1;
+  lim = minuu(glim, 436273009);
+  for (;;)				/* Loop over biggaps */
+  {
+   restart:
+    if (p > glim) break;
+    if (p > 3)			    /* Process a biggap. */
+    {
+      UNDO_NEXT_PRIME_VIADIFF(p, d);	/* The preceding increment was fake */
+      while (!d[-2]) --d; /* We may need to trace back over a few "fake" 0s; the trailing byte is never 0 */
+      d++;
+      lim = p + getGapBIGGAP(d);
+      lim = minuu(lim, glim);
+      d -= 2;
+//      printf("p=%lu d=%lu -> %02x.%02x.%02x.%02x.%02x.%02x.%02x\n", p, (ulong)d, d[0], d[1], d[2], d[3], d[4], d[5], d[6]);
+      NEXT_PRIME_VIADIFF_BIGGAPS(p, d);
+    }
+
+    ulong tlim = minuu(shiftword, lim), r;
+//    printf("    tlim=%lu shiftword=%lu lim=%lu glim=%lu\n", tlim, shiftword, lim, glim);
+    /* Below, we repeat the cycle over p many times with slightly
+     * different code to find r.  This way, extra checks are removed
+     * from very tight cycles. */
+    if (p <= tlim)		      /* Cannot use division on CPU */
+    {
+      for (;;)			 /* Loop in the run between biggaps */
+      {				 /* p is odd */
+	r = umodiu(aa,p);	 /* Whole-blown division */
+	MARK_RESDS(r,sz,sieve,p); /* solve a + 2r = 0 (mod p) and sieve */
+	NEXT_PRIME_VIADIFF(p, d);
+	if (p > tlim) break;
+      }
+      if (p > lim) goto restart;
+    }	     /* Now, can use the division instructions on the CPU */
+//    printf("S  p=%lu d=%lu -> %02x.%02x.%02x.%02x.%02x.%02x.%02x\n", p, (ulong)d, d[0], d[1], d[2], d[3], d[4], d[5], d[6]);
+
+    if (shiftword || p <= swtch)     /* Need to use 128:64 division */
+    {
+      tlim = shiftword ? lim : minuu(swtch, lim);
+      for (;;)				/* Loop in the run between biggaps */
+      {					/* p is odd */
+	{
+	  LOCAL_HIREMAINDER;
+	  hiremainder = shiftword;
+	  divll(a,p);
+	  r = hiremainder;
+	}
+	MARK_RESDS(r,sz,sieve,p); /* solve a + 2r = 0 (mod p) and sieve */
+	NEXT_PRIME_VIADIFF(p, d);
+	if (p > tlim) break;
+      }
+      if (p > lim) goto restart;
+    }	     /* Now, can use division with 64bit divident */
+//    printf("L  p=%lu d=%lu -> %02x.%02x.%02x.%02x.%02x.%02x.%02x\n", p, (ulong)d, d[0], d[1], d[2], d[3], d[4], d[5], d[6]);
+
+    for (;;) /* Now can use 64:32 division.  Loop in the run between biggaps */
+    {					/* p is odd */
+#ifdef HAVE_rem_half
+      r = rem_half(a, p);
+#else	/* !defined HAVE_rem_half */
+#  ifdef PREFER_C_LANG_MOD
+      r = a % p;
+#  else	 /* !( defined PREFER_C_LANG_MOD ) */ 
+      {
+	LOCAL_HIREMAINDER;
+	hiremainder = 0;
+	divll(a,p);
+	r = hiremainder;
+      }
+#  endif /* !( defined PREFER_C_LANG_MOD ) */ 
+#endif	/* !defined HAVE_rem_half */
+      MARK_RESDS(r,sz,sieve,p); /* solve a + 2r = 0 (mod p) and sieve */
+      NEXT_PRIME_VIADIFF(p, d);
+      if (p > lim) break;
+    }
+//    printf("   p=%lu d=%lu -> %02x.%02x.%02x.%02x.%02x.%02x.%02x\n", p, (ulong)d, d[0], d[1], d[2], d[3], d[4], d[5], d[6]);
+  }
+#define S(k)  (sieve[k])
+//  printf("%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x\n", S(0), S(1), S(2), S(3), S(4), S(5), S(6), S(7), S(8), S(9), S(10), S(11), S(12), S(13), S(14), S(15));
+// #undef S
+  if (sel)				/* Need isprime() calls too */
+  {
+    ulong off=0, bit=0;
+    if ((sz+1) & 0x7)
+    {
+      *end |= 0xff << ((sz+1) & 0x7);	/* Do not check unwritten bits at end */
+      rc -= 8 - ((sz+1) & 0x7);
+    }
+//    printf("%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x*%lu\n", S(0), S(1), S(2), S(3), S(4), S(5), S(6), S(7), S(8), S(9), S(10), S(11), S(12), S(13), S(14), S(15), -rc);
+    while (sieve <= end)
+    {
+      if (*sieve == 0xff)
+	  rc += 8;
+      else
+      {
+	  ulong s = bsf__maybe((unsigned char)~*sieve), e = BITS_IN_LONG - 1 - bfffo((unsigned char)~*sieve);
+	  rc += 8 - (e - s + 1);
+	  for (bit = s; bit <= e; bit++)
+//	  for (bit = 0; bit < 8; bit++)
+	  {
+	    if (*sieve & (1 << bit))	/* is not primelimit-rough */
+	      rc++;
+	    else if (!checker(addiu(aa, off + (bit<<1))))
+	      rc1++;		   /* rough, but not a pseudo prime */
+	  }
+      }
+      if (! (off & 0xffff) )
+	set_avma(av1);
+      sieve++;
+      off += 16;
+    }
+  }
+  else	      
+    while (end >= sieve) /* printf("0x%02x ", (int)*end), */ rc += byteBitCNT(*end--);
+  if (tofree)
+    pari_free(tofree);
+  set_avma(av);
+  if (sel)
+    retmkvec2(utoi(sz + 1 - rc - rc1), utoi(sz + 1 - rc));
+  return utoi(sz + 1 - rc);
+}
+
 static void
 pari_sieve_init(struct pari_sieve *s, ulong a, ulong b)
 {
@@ -856,7 +1389,7 @@ u_forprime_next(forprime_t *T)
   {
     for(;;)
     {
-      if (!*(T->d))
+      if (!*(T->d) && (T->d)[1] == DELTA_GUARD)
       {
         T->strategy = T->isieve? PRST_sieve: PRST_unextprime;
         if (T->q != 1) { arith_set(T); if (!T->p) return 0; }
@@ -865,6 +1398,11 @@ u_forprime_next(forprime_t *T)
       }
       else
       {
+        if (!*(T->d))
+	{
+	  T->p += ((*++(T->d))<<8) - 1;	/* Last byte is incremented by 1 */
+	  T->d += ToBIGGAP_SZ+1; /* Replaces NEXT_PRIME_VIADIFF_BIGGAPS */
+        }
         NEXT_PRIME_VIADIFF(T->p, T->d);
         if (T->p > T->b) return 0;
         if (T->q == 1 || T->p % T->q == T->c) return T->p;
--- src/language/forprime.c.orig0	2023-11-14 07:25:17.000000000 -0800
+++ src/language/forprime.c	2024-09-30 16:23:36.912528591 -0700
@@ -22,7 +22,9 @@ Foundation, Inc., 51 Franklin Street, Fi
 /**********************************************************************/
 
 static ulong _maxprime = 0;
+static ulong _maxprime_biggaps = 0;
 static ulong _maxprimelim = 0;
+static ulong _maxprimelim_biggaps = 0;
 static ulong diffptrlen;
 static GEN _prodprimes,_prodprimes_addr;
 
@@ -174,9 +176,10 @@ typedef struct {
     ulong arena;
     double power;
     double cutoff;
+    ulong bigarena;
 } cache_model_t;
 
-static cache_model_t cache_model = { CACHE_ARENA, CACHE_ALPHA, CACHE_CUTOFF };
+static cache_model_t cache_model = { CACHE_ARENA, CACHE_ALPHA, CACHE_CUTOFF, 0 };
 
 /* Assume that some calculation requires a chunk of memory to be
    accessed often in more or less random fashion (as in sieving).
@@ -200,10 +203,18 @@ static ulong
 good_arena_size(ulong slow2_size, ulong total, ulong fixed_to_cache,
                 cache_model_t *cache_model)
 {
+  static char *s;  
   ulong asize, cache_arena = cache_model->arena;
   double Xmin, Xmax, A, B, C1, C2, D, V;
   double alpha = cache_model->power, cut_off = cache_model->cutoff;
 
+  if (!s)
+  {
+    s = getenv("PARI_ARENA");
+    if (s)  cache_model->arena = cache_arena = atol(s);
+    else    s = (char*)-1;
+  }
+  
   /* Estimated relative slowdown,
      with overhead = max((fixed_to_cache+arena)/cache_arena - 1, 0):
 
@@ -283,7 +294,8 @@ good_arena_size(ulong slow2_size, ulong
 /* Use as in
     install(set_optimize,lLDG)          \\ Through some M too?
     set_optimize(2,1) \\ disable dependence on limit
-    \\ 1: how much cache usable, 2: slowdown of setup, 3: alpha, 4: cutoff
+    \\ 1: how much cache usable, 2: slowdown of setup, 3: alpha, 4: cutoff,
+    \\ 5: cache size (typically whole L2 or L3) in bytes to use in forprime() 
     \\ 2,3,4 are in units of 0.001
 
     { time_primes_arena(ar,limit) =     \\ ar = arena size in K
@@ -313,6 +325,9 @@ set_optimize(long what, GEN g)
   case 4:
     ret = (long)(cache_model.cutoff * 1000);
     break;
+  case 5:
+    ret = (long)(cache_model.bigarena);
+    break;
   default:
     pari_err_BUG("set_optimize");
     break;
@@ -325,25 +340,83 @@ set_optimize(long what, GEN g)
     case 2: slow2_in_roots     = (double)val / 1000.; break;
     case 3: cache_model.power  = (double)val / 1000.; break;
     case 4: cache_model.cutoff = (double)val / 1000.; break;
+    case 5: cache_model.bigarena = val; break;
     }
   }
   return ret;
 }
 
+#define DELTA_GUARD	0xff
+#define ToBIGGAP_SZ	7 /* The largest gaps-between-gaps up to 16e12 are the first 4;
+			   * below 5e7 after 16.5e9. */
+#define offsetOfGapBIGGGAP	2
+#define setGapBIGGAP(p,n,nn)				\
+    (*(p)++ = (n)&0xff, *(p)++ = (((n)&0xff00)>>8), *(p)++ = (((n)&0xff0000)>>16), *(p)++ = (((n)&0xff000000)>>24), *(p)++ = (nn)&0xff, *(p)++ = (((nn)&0xff00)>>8), *(p)++ = (((nn)&0xff0000)>>16))
+#define getGapBIGGAP(p)		\
+    ((p)[0] + ((p)[1]<<8) + ((p)[2]<<16) + ((p)[3]<<24)) /* Speed irrelevant */
+#define offsetGapBIGGAP(p)	((p)[4] + ((p)[5]<<8) + ((p)[6]<<16))
+#define NEXT_PRIME_VIADIFF_BIGGAPS(p,d) STMT_START			\
+  { if (!*(d)) {(p) += ((*++(d))<<8)-1; (d)+=ToBIGGAP_SZ+1;} (p) += *(d)++; } STMT_END /* Overshoots if overflows */
+#define UNDO_NEXT_PRIME_VIADIFF(p,d)	(p) -= *--(d)
+#define SIEVE_ROUND1		(1ul<<17)
+#define FAKE_SKIP_EVERY_BEFORE	200 // 400000	/* 45000 seems to be reasonable, but slows down large values too much */
+#define FAKE_SKIP_EVERY_SWCH	(50000*FAKE_SKIP_EVERY_BEFORE)   /* In units of primes after the 1st biggap */
+#define FAKE_SKIP_EVERY_AFTER	FAKE_SKIP_EVERY_BEFORE
+#define STORE_EVERY_BIGGAP	5 /* Every N-th fake/biggap stored; decreases overhead to 2+ToBIGGAP_SZ+3*sizeof(ulong)/STORE_EVERY_BIGGAP */
+#define STORE_EVERY_SMALLPRIME	1  /* pi(436273289) = 23163293) */
+#define STORE_EVERY_MIDPRIME_ST	82025 /* primepi(1<<20); at least up to SIEVE_ROUND1 */
+#define STORE_EVERY_MIDPRIME	1000
+
+static ulong primes_store_mx, primes_store_last, *primes_store,
+    *primes_store_n;
+byteptr *primes_store_off;
+
+#if 1
+#define rem_half(a,b) \
+__extension__ ({ ulong __arg1 = (a); unsigned int __value, __hiremainder, __arg2 = (b); \
+   __asm__ ("divl %k4" \
+        : "=a" /* %eax */ (__value), "=d" /* %edx */ (__hiremainder) \
+        : "ka" /* %eax */ (__arg1), "kd" /* %edx */ (__arg1>>32), "kmr" (__arg2)); \
+   __hiremainder; \
+})				/* call only if the return must fit into 32 bits!!! */
+#define bsf64(x) \
+__extension__ ({ ulong __arg = (x); \
+   long trailing_one_position; \
+  __asm__ ("bsfq %1,%0" : "=r" (trailing_one_position) : "rm" (__arg) : "cc"); \
+  trailing_one_position; \
+})
+#  define HAVE_rem_half
+#  define bsf__maybe	bsf64
+#else	/* !( 1 ) */ 
+#  define bsf__maybe(x)	0
+#endif	/* !( 1 ) */ 
+
 /* s is odd; prime differences (starting from 5-3=2) start at known_primes[2],
   terminated by a 0 byte. Checks n odd numbers starting at 'start', setting
-  bytes starting at data to 0 (composite) or 1 (prime) */
+  bytes starting at data to 0 (composite) or 1 (prime)/
+  Assumes that no big gaps are in the used range addressed by known_primes. */
 static void
 sieve_chunk(byteptr known_primes, ulong s, byteptr data, ulong n)
-{
-  ulong p, cnt = n-1, start = s, delta = 1;
-  byteptr q;
+{	/* Good for small chunks: Writes 1 byte per an odd number to sieve, no reads */
+  ulong p = 3, cnt = n-1, start = s, delta = 1, swtch;
+  byteptr q = known_primes + 1;
 
   memset(data, 0, n);
   start >>= 1;  /* (start - 1)/2 */
   start += n; /* Corresponds to the end */
   /* data corresponds to start, q runs over primediffs */
-  for (q = known_primes + 1, p = 3; delta; delta = *++q, p += delta)
+ while (1)
+ {
+  if (!delta)	return;		/* TRUE unless the first round */
+#ifdef HAVE_rem_half
+//  printf("   \tst=%lu\tn=%lu\n", start, n);   fflush(stdout);
+  swtch = start>>32; /* If p > swtch, then start/p < 2^32 */
+//  printf("sw=%lu\tst=%lu\tn=%lu\n", swtch, start, n);   fflush(stdout);
+
+  for (; delta && p <= swtch; delta = *++q, p += delta)
+#else
+  for (; delta; delta = *++q, p += delta)
+#endif
   { /* first odd number >= start > p and divisible by p
        = last odd number <= start + 2p - 2 and 0 (mod p)
        = p + last number <= start + p - 2 and 0 (mod 2p)
@@ -352,16 +425,26 @@ sieve_chunk(byteptr known_primes, ulong
     long off = cnt - ((start+(p>>1)) % p);
     while (off >= 0) { data[off] = 1; off -= p; }
   }
+#ifdef HAVE_rem_half
+//  printf("   \tp=%lu\n", p);   fflush(stdout);
+  for (; delta; delta = *++q, p += delta)
+  {
+    long off = cnt - rem_half((start+(p>>1)), p);
+    while (off >= 0) { data[off] = 1; off -= p; }
+  }
+//  printf("   \tp=%lu\n", p);   fflush(stdout);
+#endif
+ }
 }
 
 static void
 set_prodprimes(void)
 {
   pari_sp ltop = avma, av;
-  long m = expu(_maxprime) + 1 - 7;
-  GEN W, w, v = primes_interval_zv(3, _maxprime);
+  ulong b = 1UL << 8, mxp = minuu(2097169, _maxprime);
+  long m = expu(mxp) + 1 - 7; /* nextprime(2<20) */
+  GEN W, w, v = primes_interval_zv(3, mxp);
   long s, j, jold, lv = lg(v), u = 1;
-  ulong b = 1UL << 8;
 
   W = cgetg(m+1, t_VEC);
   for (jold = j = 1; j < lv; j++)
@@ -371,7 +454,7 @@ set_prodprimes(void)
       w = v+jold-1; w[0] = evaltyp(t_VECSMALL) | _evallg(lw);
       gel(W,u++) = zv_prod_Z(w); /* p_jold ... p_{j-1} */
       jold = j; b *= 2;
-      if (b > _maxprime) b = _maxprime; /* truncate last run */
+      if (b > mxp) b = mxp; /* truncate last run */
     }
   m = u - 1; setlg(W, u);
   for (j = 2; j <= m; j++) gel(W,j) = mulii(gel(W,j-1), gel(W,j));
@@ -386,69 +469,131 @@ set_prodprimes(void)
 /* assume maxnum <= 436273289 < 2^29 */
 static void
 initprimes0(ulong maxnum, long *lenp, ulong *lastp, byteptr p1)
-{
+{   /* Not thread save!  The caller should have exclusive control over p1 */
   pari_sp av = avma, bot = pari_mainstack->bot;
   long alloced, psize;
   byteptr q, end, p, end1, plast, curdiff;
-  ulong last, remains, curlow, rootnum, asize;
-  ulong prime_above;
-  byteptr p_prime_above;
+  ulong last, remains, curlow, maxnum0 = maxnum, asize, prefake = 0, postgap = 0;
+  ulong prime_above, lastBigGap = 0, stored_primes = 0, prestored = 0;
+  ulong prime_curr = 2, primen_curr = 0, dogap;
+  byteptr p_prime_above, posLastBigGap = 0;
+  unsigned char fake[4], *write_gap_gap = fake;
 
-  maxnum |= 1; /* make it odd. */
+  if (!odd(maxnum)) maxnum--; /* make it odd. */
   /* base case */
-  if (maxnum < 1ul<<17) { initprimes1(maxnum>>1, lenp, lastp, p1); return; }
-
-  /* Checked to be enough up to 40e6, attained at 155893 */
-  rootnum = usqrt(maxnum) | 1;
-  initprimes1(rootnum>>1, &psize, &last, p1);
+  if (maxnum >= SIEVE_ROUND1)  maxnum0 = usqrt(maxnum) | 1; /* Assume <=436273289 */
+  initprimes1(maxnum0>>1, lenp, lastp, p1);
+  while (stored_primes < *lenp - 1)
+  {
+    primes_store_off[stored_primes]   = p1 + stored_primes + 1; /* at end would point at the trailing 0 */
+    primes_store    [stored_primes]   = prime_curr;
+    primes_store_n  [stored_primes]   = ++primen_curr;
+    prime_curr += p1[++stored_primes];
+  }
+//  printf(" p=%lu, %lu, %lu, %lu, %lu, %lu\n", primes_store[0], primes_store[1], primes_store[2], primes_store[3], primes_store[4], primes_store[5]);
+//  printf(" n=%lu, %lu, %lu, %lu, %lu, %lu\n", primes_store_n[0], primes_store_n[1], primes_store_n[2], primes_store_n[3], primes_store_n[4], primes_store_n[5]);
+//  printf(" addr=%p, %ld, %ld, %ld, %ld, %ld, %ld\n", p1, primes_store_off[0] - p1, primes_store_off[1] - p1, primes_store_off[2] - p1, primes_store_off[3] - p1, primes_store_off[4] - p1, primes_store_off[5] - p1);
+  if (maxnum < 1ul<<17)
+  {
+    p1[(*lenp)++] = DELTA_GUARD;
+    primes_store_last = stored_primes-1;
+    return;
+  }
+  psize = *lenp;
+  last = *lastp;
+  /* Now siave between last+2 and maxnum */
   end1 = p1 + psize - 1;
+//  end1[1] = DELTA_GUARD;
   remains = (maxnum - last) >> 1; /* number of odd numbers to check */
 
   /* we access primes array of psize too; but we access it consecutively,
    * thus we do not include it in fixed_to_cache */
-  asize = good_arena_size((ulong)(rootnum * slow2_in_roots), remains+1, 0,
+  asize = good_arena_size((ulong)(maxnum0 * slow2_in_roots), remains+1, 0,
                           &cache_model) - 1;
+//  printf("...Arena size=%lu", asize);
+  
   /* enough room on the stack ? */
   alloced = (((byteptr)avma) <= ((byteptr)bot) + asize);
   if (alloced)
     p = (byteptr)pari_malloc(asize+1);
   else
     p = (byteptr)stack_malloc(asize+1);
-  end = p + asize; /* the 0 sentinel goes at end. */
-  curlow = last + 2; /* First candidate: know primes up to last (odd). */
+  end = p + asize;		     /* the 0 sentinel goes at end. */
+  curlow = last + 2; /* First candidate in arena, at p: know primes up to last (odd). */
   curdiff = end1;
 
   /* During each iteration p..end-1 represents a range of odd
      numbers.  plast is a pointer which represents the last prime seen,
      it may point before p..end-1. */
-  plast = p - 1;
-  p_prime_above = p1 + 2;
+  plast = p - 1;	    /* Number corresponding to p-1 is prime */
+  p_prime_above = p1 + 2;   /* Skip "diffs 2, 1" --> prime=3 */
   prime_above = 3;
   while (remains)
   { /* cycle over arenas; performance not crucial */
     unsigned char was_delta;
+    ulong delta;
     if (asize > remains) { asize = remains; end = p + asize; }
     /* Fake the upper limit appropriate for the given arena */
     while (prime_above*prime_above <= curlow + (asize << 1) && *p_prime_above)
-      prime_above += *p_prime_above++;
+      prime_above += *p_prime_above++;	/* By restricting maxnum to <436273009^2 we do not encounter biggaps. */
     was_delta = *p_prime_above;
-    *p_prime_above = 0; /* sentinel for sieve_chunk */
-    sieve_chunk(p1, curlow, p, asize);
+//    was_delta2 = p_prime_above[1];
+    *p_prime_above = 0; /* sentinel for sieve_chunk */ /* Not thread save! */
+//    p_prime_above[1] = DELTA_GUARD; /* sentinel for sieve_chunk */
+    sieve_chunk(p1, curlow, p, asize);	/* Uses 1 byte per an odd number to sieve */
     *p_prime_above = was_delta; /* restore */
+//    p_prime_above[1] = was_delta2; /* restore */
 
-    p[asize] = 0; /* sentinel */
+//    printf("New arena, p=%lu, n=%lu\n", prime_curr, primen_curr);
+    p[asize] = 0; /* byte at end; sentinel */
     for (q = p; ; plast = q++)
     { /* q runs over addresses corresponding to primes */
-      while (*q) q++; /* use sentinel at end */
+      while (*q) q++; /* use sentinel at end = p+asize */
       if (q >= end) break;
-      *curdiff++ = (unsigned char)(q-plast) << 1; /* < 255 for q < 436273291 */
+      delta = (ulong)(q-plast) << 1;	/* < 255 for q < 436273291 */
+      dogap = delta > 255 || (postgap && ++prefake >= (primen_curr > (23163293 + FAKE_SKIP_EVERY_SWCH) ? FAKE_SKIP_EVERY_AFTER : FAKE_SKIP_EVERY_BEFORE));
+      if ((postgap || dogap) ? dogap && (++prestored >= STORE_EVERY_BIGGAP || !postgap)
+	  : ++prestored >= (primen_curr >= STORE_EVERY_MIDPRIME_ST ? STORE_EVERY_MIDPRIME : STORE_EVERY_SMALLPRIME))
+      {	   /* On the first biggap, and (rarely) on biggaps, or on
+	    * primes pre-first-biggap */
+	primes_store_off[stored_primes]   = curdiff;
+	primes_store    [stored_primes]   = prime_curr; /* pre-delta */
+	primes_store_n  [stored_primes++] = primen_curr;
+	prestored = 0;
+      }
+      primen_curr++;
+      prime_curr += delta;
+      if (dogap)
+      {
+	setGapBIGGAP(write_gap_gap, curlow + ((q-p)<<1) - delta - lastBigGap, curdiff - posLastBigGap);
+	lastBigGap = curlow + ((q-p)<<1) - delta;
+	posLastBigGap = curdiff;
+        *curdiff++ = 0;
+        *curdiff++ = delta>>8;
+	write_gap_gap = curdiff;
+	curdiff += ToBIGGAP_SZ;
+	prefake = 0;
+	postgap++;
+      }
+      *curdiff++ = (delta & 0xff) | dogap; /* The last bit marked */
     }
-    plast -= asize;
+    plast -= asize;			/* points before the arena now */
     remains -= asize;
     curlow += (asize<<1);
-  }
+  } /* currdiff-1 is delta corresponding to prime_curr */
+  primes_store_off[stored_primes] = curdiff; /* Points at the trailing 0. May be a duplicate… */
+  primes_store  [stored_primes] =  prime_curr;
+  primes_store_n[stored_primes] = primen_curr;
+  primes_store_last = stored_primes;
+  if (primes_store_last > primes_store_mx) /* We use experiments/heuristics for size! */
+    pari_err(e_MISC,"Panic: primestore overflow, curr=%lu, mx=%lu", primes_store_last, primes_store_mx);
+//  printf(" p=%lu, %lu, %lu, %lu, %lu, %lu\n", primes_store[0], primes_store[1], primes_store[2], primes_store[3], primes_store[4], primes_store[5]);
+//  printf(" n=%lu, %lu, %lu, %lu, %lu, %lu\n", primes_store_n[0], primes_store_n[1], primes_store_n[2], primes_store_n[3], primes_store_n[4], primes_store_n[5]);
+//  printf(" addr=%p, %ld, %ld, %ld, %ld, %ld, %ld\n", p1, primes_store_off[0] - p1, primes_store_off[1] - p1, primes_store_off[2] - p1, primes_store_off[3] - p1, primes_store_off[4] - p1, primes_store_off[5] - p1);
   last = curlow - ((p - plast) << 1);
-  *curdiff++ = 0; /* sentinel */
+  setGapBIGGAP(write_gap_gap, last - lastBigGap, curdiff - posLastBigGap);
+  *curdiff++ = 0;		/* sentinel1 */
+  *curdiff++ = DELTA_GUARD;	/* sentinel2 */
   *lenp = curdiff - p1;
   *lastp = last;
   if (alloced) pari_free(p); else set_avma(av);
@@ -457,6 +602,8 @@ initprimes0(ulong maxnum, long *lenp, ul
 ulong
 maxprime(void) { return diffptr ? _maxprime : 0; }
 ulong
+maxprime_biggaps(void) { return diffptr ? _maxprime_biggaps : 0; }
+ulong
 maxprimelim(void) { return diffptr ? _maxprimelim : 0; }
 ulong
 maxprimeN(void) { return diffptr ? diffptrlen-1: 0; }
@@ -468,20 +615,83 @@ maxprime_check(ulong c) { if (_maxprime
 
 /* We ensure 65302 <= maxnum <= 436273289: the LHS ensures modular function
  * have enough fast primes to work, the RHS ensures that p_{n+1} - p_n < 255
+ *   (UPDATE: This is not needed anymore: the caller takes care of _maxprime.)
  * (N.B. RHS would be incorrect since initprimes0 would make it odd, thereby
  * increasing it by 1) */
 byteptr
 initprimes(ulong maxnum, long *lenp, ulong *lastp)
 {
   byteptr t;
+  double L, overhead_biggaps;
+  size_t l1, l2, biggaps_fake = 0, biggaps_store, cont;
   if (maxnum < 65537)
     maxnum = 65537;
-  else if (maxnum > 436273289)
-    maxnum = 436273289;
-  t = (byteptr)pari_malloc((size_t) (1.09 * maxnum/log((double)maxnum)) + 146);
+#ifdef LONG_IS_64BIT
+  else if (maxnum >= 436273009UL*436273009UL)	/* Avoid biggaps when sieving to construct a sieve; */
+    maxnum = 436273009UL*436273009UL - 2;
+#else
+  else if (maxnum > 4294901999UL)
+    maxnum = 4294901999UL;	/* Ensure that NEXT_PRIME_VIADIFF_BIGGAPS overshoots at end */
+#endif	/* !( defined LONG_IS_64BIT ) */
+  L = 1/log((double)maxnum);  /* Majorate Li(maxnum) which majorates */
+  l1 = 1.09 * maxnum*L + 145; /* primepi(maxnum) for reasonable log(maxnum) */
+  l2 = maxnum*L*(1+L*(1+2*L*(1+3*L*(1+9.6*L)))); /* Good at 5,290. */
+  /* exp(-256*L) is a naive “independent residues” estimate for gaps.  
+     The experiments (up to maxnum=1.5e14) suggest the measured
+     fraction is much smaller (as 0.01 ... 1/3): with x=maxnum/1e9:
+        exp(-Q(L) - C(x)/x^2 +- 0.75/x^(2/3))
+     with a quadratic Q (and 0.75 may be changed to 0.47 outside of the
+     x-interval [1.18,1.3]).  (No trace of the need for higher degree
+     terms with degree <= 20 is detected.)  Here C(x)/x^2 is a
+     correction for "a hill" at small x's (with C <= 1.8).
+        Moreover, Q(L)>256*L (although comes within 0.127 near
+     maxnum=1e54; but this is way above the experimental range!).  */
+  overhead_biggaps = exp(-256*L); 	   /* Proportion of big gaps */
+  ulong ll1 = minuu(l1,l2);	      /* Min.  Crossover at 3.8e5; error <0.01% at >=6.66e9. */
+  if (l2 > 23163298)			   /* Have big gaps.  May create fake big gaps */
+    biggaps_fake = (minuu(l2 - 23163298, FAKE_SKIP_EVERY_SWCH) + 1)/FAKE_SKIP_EVERY_BEFORE + 1;
+  if (l2 >= 23163298 + FAKE_SKIP_EVERY_SWCH)
+    biggaps_fake += (l2 - 23163298 - FAKE_SKIP_EVERY_SWCH + 1)/FAKE_SKIP_EVERY_AFTER + 1;
+  biggaps_store = ((maxuu(ll1+1,23163298)-23163298)*overhead_biggaps + 1 + biggaps_fake)/STORE_EVERY_BIGGAP + 2; /* >=2 */
+  cont = 1.09*2*usqrt(maxnum)*L + 145;	/* Estimate from above */
+  cont = maxuu(STORE_EVERY_MIDPRIME_ST, cont);
+  biggaps_store += minuu(ll1, cont)/STORE_EVERY_SMALLPRIME + 2; /* >=2 */
+  if (ll1 > cont)
+    biggaps_store += (minuu(ll1, 23163298) - minuu(23163298, cont))/STORE_EVERY_MIDPRIME + 1;
+
+//  printf("  l1=%lu l2=%lu ll1=%lu ovhd=%g (units) store=%lu\n", l1, l2, ll1, overhead_biggaps, biggaps_store);
+  overhead_biggaps *= (2+ToBIGGAP_SZ);	   /* In bytes */
+  l2 *= 1 + overhead_biggaps;		   /* No actual insertions in case of l1 */
+  l2 += biggaps_fake * (2+ToBIGGAP_SZ);
+  if (maxnum >= 436273289 || l1 > l2) l1 = l2; /* Crossover at 3.8e5; error <0.01% at >=6.66e9. */
+  l1 += 2 + ToBIGGAP_SZ;      /* Allow same code to run at end */
+//  printf("  l1=%lu ovhd=%g (bytes)\n", l1, overhead_biggaps);
+  l1 = (l1 + sizeof(ulong) - 1)/sizeof(ulong)*sizeof(ulong); /* Pad */
+  t = (byteptr)pari_malloc((size_t)l1 + 3*biggaps_store*sizeof(ulong));
+//  printf(" diffptr=%p to %p", t, t + (size_t)l1 + 3*biggaps_store*sizeof(ulong));
+  primes_store     = (ulong*)(t+l1);
+  primes_store_n   = primes_store + biggaps_store;
+  primes_store_off = (byteptr*)(primes_store_n + biggaps_store);
+  primes_store_mx    = biggaps_store - 1;
+  primes_store_last = 0;		/* Not initialized yet! */
+
   initprimes0(maxnum, lenp, lastp, t);
-  _maxprimelim = maxnum;
-  return (byteptr)pari_realloc(t, *lenp);
+  if (*lenp > l1) pari_err(e_MISC, "wrong estimate %lu of length %lu of primediff", l1, *lenp);
+  _maxprimelim_biggaps = _maxprimelim = maxnum;
+  if (maxnum >= 436273291)
+    _maxprimelim = 436273291 - 2;
+//  printf("Sieving done\n");  fflush(stdout);
+  if (*lenp < l1 - (l1>>7) - 2 - ToBIGGAP_SZ - sizeof(ulong))		/* Between 200,000 and 500,000 up to 0.3% */
+    fprintf(stderr, "Warning: Need to reallocate primediff from %ld to %ld (fence %ld))\n", 
+	    l1, *lenp, l1 - (l1>>7) - 2 - ToBIGGAP_SZ - sizeof(ulong)); /* Almost hit @69991 */
+  else return t;	  /* Avoiding realloc() may save 1/2 of memory footprint.) */
+  ulong n = (*lenp + sizeof(ulong) - 1)/sizeof(ulong)*sizeof(ulong);
+  memmove(t + n, primes_store, 3*biggaps_store*sizeof(ulong));
+  t = (byteptr)pari_realloc(t, n + 3*biggaps_store*sizeof(ulong)); /* Want to avoid this: may overuse memory */
+  primes_store     = (ulong*)(t+n);
+  primes_store_n   = primes_store + biggaps_store;
+  primes_store_off = (byteptr*)(primes_store_n + biggaps_store);
+  return t;
 }
 
 void
@@ -492,7 +702,7 @@ initprimetable(ulong maxnum)
   byteptr p = initprimes(maxnum, &len, &last), old = diffptr;
   diffptrlen = minss(diffptrlen, len);
   _maxprime  = minss(_maxprime,last); /*Protect against ^C*/
-  diffptr = p; diffptrlen = len; _maxprime = last;
+  diffptr = p; diffptrlen = len; _maxprime = minss(436273009,last); _maxprime_biggaps = last;
   set_prodprimes();
   if (old) free(old);
 }
@@ -547,7 +757,7 @@ optimize_chunk(ulong a, ulong b)
 {
   /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large
    * as to force recalculating too often). */
-  ulong chunk = 0x80000UL;
+  ulong chunk = (cache_model.bigarena ? cache_model.bigarena : 0x80000UL)<<4; /* bigarena is in bytes, we use bits, and only odds */
   ulong tmp = (b - a) / chunk + 1;
 
   if (tmp == 1)
@@ -572,6 +782,123 @@ sieve_init(forprime_t *T, ulong a, ulong
 
 enum {PRST_none, PRST_diffptr, PRST_sieve, PRST_unextprime, PRST_nextprime};
 
+/* Return 0 if !ge, and a<3; then ptr_* do not point to correct values. */
+ulong /* Assume 2 < a <= _maxprime_biggaps; and >=2 stored primes.  Not tested with ge>0 */
+biggap_prime_le_ge(ulong a, ulong ge, ulong *ptr_p, byteptr *ptr_d)
+{		  /* a>2.  Start with binary search over the stored table */
+  ulong *s = primes_store + 1, *e = s + primes_store_last, *mid;
+  while (s <= e) /* The "insertion point" is between s and e+1 */
+  {		 /* Return it in s ()may be max+1) */
+    mid = s + (e - s)/2;
+    if (a == *mid)
+    {
+     intable:
+      *ptr_p = a;
+      *ptr_d = primes_store_off[mid - primes_store];
+      return   primes_store_n  [mid - primes_store];
+    }
+    if (a > *mid)    s = mid + 1;
+    else	     e = mid - 1;
+  } /* Now s==max+1, or a < *s.  Moreover, a is not in the list */
+  if (s > primes_store + primes_store_last)
+  {
+    if (!ge && a <= _maxprimelim_biggaps)
+    {
+      mid = primes_store + primes_store_last;
+      a = *mid;
+      goto intable;
+    }
+    if (ge) pari_err(e_MISC, "the argument %lu is above the max stored prime %lu (%lu))", a, _maxprime_biggaps, primes_store[primes_store_last]);
+    pari_err(e_MISC, "the argument %lu is above the prime limit %lu", a, _maxprimelim_biggaps);
+  } /* Now a < *s */
+  if (s == primes_store+1)		/* a is on the left */
+  {
+    if (ge)
+    {
+      mid = primes_store;
+      a = *mid;
+      goto intable;
+    }
+    pari_err(e_MISC, "the argument %lu is below 3", a);
+  } /* Now s[-1] < a < s[0] */
+  ulong p = 436273009, n = 23163298, nn, pp, pp1, off; /* Need 1 iteration to become correct */
+  byteptr d = diffptr + offsetOfGapBIGGGAP, dd, dd1;
+  dd = primes_store_off[s - 1 - primes_store] + offsetOfGapBIGGGAP;
+  nn = primes_store_n  [s - 1 - primes_store];
+  if (*s > p)		     /* After the first gap; s - 1 is a big gap */
+  {			     /* Loop over big gaps >= s - 1 */
+    p = s[-1];
+    while (1)			/* Know a < _maxprime_biggaps */
+    {				/* Know a > pp */
+      pp  = p;
+      p  += getGapBIGGAP(dd);
+      off = offsetGapBIGGAP(dd);
+      if (p >= a) break;
+      dd += off;			/* Corresponds to pp */
+      nn += off - 2 - ToBIGGAP_SZ;	/* Corresponds to pp */
+  //    printf("->pp=%lu p=%lu poff=%lu off=%lu doff=%lu pn=%lu n=%lu\t%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x\n", pp, p, (ulong)(d-diffptr), (ulong)(d-diffptr)+off, off, nn, nn+off-2-ToBIGGAP_SZ, d[-2], d[-1], d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7]);
+    }
+    dd -= offsetOfGapBIGGGAP;
+    d = dd + off;
+    n = nn + off - 2 - ToBIGGAP_SZ;
+    if (a == p)
+    {
+      *ptr_p = a;
+      *ptr_d = d;
+      return  n;
+    } /* Now special-case the first diff (of the big gap) */
+    dd1 = dd;
+    pp1 = pp;
+    NEXT_PRIME_VIADIFF_BIGGAPS(pp,dd);
+    if (a <= pp)	     /* Need to treat the big gap specially */
+    {
+      if (ge || a == pp)
+      {
+	*ptr_p = pp;
+	*ptr_d = dd;
+	return  nn + 1;
+      }
+      *ptr_p = pp1;
+      *ptr_d = dd1;
+      return  nn;
+    }
+    nn++;
+  }
+  else					/* No gaps nearby */
+  {
+    pp = s[-1];
+    dd -= offsetOfGapBIGGGAP;
+    p = *s;
+    d = primes_store_off[s - primes_store];
+    n = primes_store_n  [s - primes_store];
+  } /* Now can scan froward from pp/dd/nn (or backwards) */
+  ulong md = pp + ((ulong)(p - pp))/2;  
+  if (a <= md)				/* Better to search forward */
+  {
+    d = dd;
+    p = pp;	      /* So NEXT_PRIME_VIADIFF gets a correct value */
+    do { NEXT_PRIME_VIADIFF(p,d); } while (p < a);
+    if (!ge && p!=a) PREC_PRIME_VIADIFF(p,d);
+  }
+  else					/* Better to search backwards */
+  {
+    do { PREC_PRIME_VIADIFF(p,d); } while (p > a);
+    if (ge && p!=a)  NEXT_PRIME_VIADIFF(p,d);
+  }
+  *ptr_p = p;
+  *ptr_d = d;
+  return nn + (d - dd);
+}
+
+ulong /* Assume 436273009 < a <= _maxprime_biggaps.  Not tested with ge>0 */
+biggap_primepi_le_ge(ulong a, ulong ge, ulong return_p)
+{
+  ulong n, p;
+  byteptr fakep;
+  n = biggap_prime_le_ge(a, ge, &p, &fakep);
+  return return_p ? p : n;
+}
+
 static void
 u_forprime_set_prime_table(forprime_t *T, ulong a)
 {
@@ -581,8 +908,9 @@ u_forprime_set_prime_table(forprime_t *T
     T->p = 0;
     T->d = diffptr;
   }
-  else
+  else if (a <= 3)//436273009)
     T->p = init_primepointer_lt(a, &T->d);
+  else     biggap_prime_le_ge(a - !!a, 0, &(T->p), &(T->d)); /* Check lt, not ge */
 }
 
 /* Set p so that p + q the smallest integer = c (mod q) and > original p.
@@ -600,6 +928,9 @@ arith_set(forprime_t *T)
   T->p = itou_or_0(d); set_avma(av);
 }
 
+/* FIXME: should take into account the arena size and b */
+#define SLOWDOWN_NEXTPRIME_BILLION	26 /* How many times slower is nextprime() in range of 10^9 */
+
 /* run through primes in arithmetic progression = c (mod q) */
 static int
 u_forprime_sieve_arith_init(forprime_t *T, struct pari_sieve *psieve,
@@ -615,7 +946,7 @@ u_forprime_sieve_arith_init(forprime_t *
     T->d = diffptr;
     return 0;
   }
-  maxp = maxprime();
+  maxp = maxprime_biggaps();
   if (q != 1)
   {
     c %= q;
@@ -645,22 +976,15 @@ u_forprime_sieve_arith_init(forprime_t *
   else
     u_forprime_set_prime_table(T, a);
 
-  maxp2 = (maxp & HIGHMASK)? 0 : maxp*maxp;
+  maxp2 = (maxp & HIGHMASK)? ULONG_MAX : maxp*maxp;
   /* FIXME: should sieve as well if q != 1, adapt sieve code */
-  if (q != 1 || (maxp2 && maxp2 <= a)
-             || T->b - maxuu(a,maxp) < maxp / expu(b))
+  if (q != 1 || (maxp2 <= a)
+             || T->b - maxuu(a,maxp) < 1000000000/SLOWDOWN_NEXTPRIME_BILLION)
   { if (T->strategy==PRST_none) T->strategy = PRST_unextprime; }
   else
   { /* worth sieving */
-#ifdef LONG_IS_64BIT
-    const ulong UPRIME_MAX = 18446744073709551557UL;
-#else
-    const ulong UPRIME_MAX = 4294967291UL;
-#endif
-    ulong sieveb;
-    if (b > UPRIME_MAX) b = UPRIME_MAX;
-    sieveb = b;
-    if (maxp2 && maxp2 < b) sieveb = maxp2;
+    ulong sieveb = b;
+    if (maxp2 < b) sieveb = maxp2;
     if (T->strategy==PRST_none) T->strategy = PRST_sieve;
     sieve_init(T, maxuu(maxp+2, a), sieveb);
   }
@@ -754,15 +1078,25 @@ static void
 sieve_block(ulong a, ulong b, ulong maxpos, unsigned char* sieve)
 {
   ulong p = 2, lim = usqrt(b), sz = (b-a) >> 1;
+#ifdef HAVE_rem_half
+  ulong swtch = a>>32; /* If p > swtch, then a/p < 2^32 */
+//  printf("  a=%lu\tb=%lu\tsw=%lu\n", a, b, swtch);
+#endif	/* defined HAVE_rem_half */
+ 
   byteptr d = diffptr+1;
   (void)memset(sieve, 0, maxpos+1);
   for (;;)
   { /* p is odd */
     ulong k, r;
-    NEXT_PRIME_VIADIFF(p, d); /* starts at p = 3 */
+    NEXT_PRIME_VIADIFF_BIGGAPS(p, d); /* starts at p = 3 */
     if (p > lim) break;
 
     /* solve a + 2k = 0 (mod p) */
+#ifdef HAVE_rem_half
+    if (p > swtch)
+      r = rem_half(a, p);
+    else
+#endif	/* defined HAVE_rem_half */ 
     r = a % p;
     if (r == 0)
       k = 0;
@@ -778,6 +1112,205 @@ sieve_block(ulong a, ulong b, ulong maxp
   }
 }
 
+int
+byteBitCNT(ulong b)
+{
+  b = (b & 0x55) + ((b & 0xAA)>>1);
+  b = (b & 0x33) + ((b & 0xCC)>>2);
+  return (b & 0x0f) + ((b & 0xf0)>>4);
+}
+
+ulong
+try_div__(GEN aa, ulong b)
+{
+  ulong tot = 0, i = 0;
+  for (; i<100000; i++)
+    tot += umodiu(aa, b+i);
+  return tot;
+}
+
+ulong
+try_bfffo(ulong in)
+{
+  unsigned char v = in;
+  byteptr p = &v;
+  ulong s = bsf__maybe((unsigned char)~*p), e = BITS_IN_LONG - 1 - bfffo((unsigned char)~*p);
+  return 1000000 + 100*s + e;
+}
+
+#define LOOP_RESDS(k,sz,SIEVE,p)    while ((k) <= (sz)) { (SIEVE)[(k)>>3] |= 1 << ((k)&7); (k) += (p); /* 2k += 2p */ }
+
+/* m = a + 2r is the smallest odd m >= a, p | m */
+/* position n (corresponds to a+2n) is sieve[n>>3], bit n&7 */
+#define MARK_RESDS(r,sz,SIEVE,p)	\
+    STMT_START { if (r != 0) { r = p - r; if (odd(r)) r += p; r >>= 1; } LOOP_RESDS(r,sz,sieve,p); } STMT_END
+
+static long
+chk_isprime(GEN N) { return signe(gisprime(N,0)); }
+	
+GEN
+sieve_block2(GEN aa, ulong b_a, ulong lm, long doRoughPrimes)     /* a+b_a is included, aa should be odd */
+{
+  pari_sp av = avma, av1 = av, bot = pari_mainstack->bot;
+  ulong a = *int_LSW(aa), shiftword = *int_MSW(aa), rc = 0, rc1 = 0;
+  ulong p = 3, lim, glim, sz = b_a >> 1, maxpos = (b_a>>4) + 1, sel = 0;
+  byteptr sieve, end, tofree = 0;
+  long (*checker)(GEN) = BPSW_psp;
+  
+  if (glength(aa) > 2) shiftword = ULONG_MAX;
+  else if (glength(aa) < 2) shiftword = 0;
+#ifdef HAVE_rem_half
+  ulong swtch = a>>32; /* If p > swtch, then a/p < 2^32 */
+//  printf("  a=%lu\tb=%lu\tsw=%lu\n", a, b, swtch);
+#endif	/* defined HAVE_rem_half */
+
+  if (lm > 1)
+  {
+    glim = lm;
+    sel = (doRoughPrimes >= 0);
+  }  
+  else if (shiftword) glim = itou(sqrtint(gadd(aa, utoi(b_a))));
+  else                glim = usqrt(a + b_a);
+  if (glim > _maxprimelim_biggaps)
+  {
+      if (lm != 1)  pari_err(e_MISC, "max calculated primes %lu too small (need %lu) for sieve_block2", _maxprimelim_biggaps, glim);
+      glim = _maxprimelim_biggaps;
+      sel = (doRoughPrimes >= 0);
+  }
+  if (doRoughPrimes > 0)  checker = chk_isprime;
+  if (((byteptr)avma) <= ((byteptr)bot) + maxpos + 102400)
+    tofree = sieve = (byteptr)pari_calloc(maxpos);
+  else
+  {
+    sieve = (byteptr)stack_calloc(maxpos);
+    av1 = avma;
+  }
+  if (!sieve) pari_err(e_MEM);
+  
+  byteptr d = diffptr+2;	   /* matches p=3 */
+//  (void)memset(sieve, 0, maxpos);
+  end = sieve + maxpos - 1;
+  lim = minuu(glim, 436273009);
+  for (;;)				/* Loop over biggaps */
+  {
+   restart:
+    if (p > glim) break;
+    if (p > 3)			    /* Process a biggap. */
+    {
+      UNDO_NEXT_PRIME_VIADIFF(p, d);	/* The preceding increment was fake */
+      while (!d[-2]) --d; /* We may need to trace back over a few "fake" 0s; the trailing byte is never 0 */
+      d++;
+      lim = p + getGapBIGGAP(d);
+      lim = minuu(lim, glim);
+      d -= 2;
+//      printf("p=%lu d=%lu -> %02x.%02x.%02x.%02x.%02x.%02x.%02x\n", p, (ulong)d, d[0], d[1], d[2], d[3], d[4], d[5], d[6]);
+      NEXT_PRIME_VIADIFF_BIGGAPS(p, d);
+    }
+
+    ulong tlim = minuu(shiftword, lim), r;
+//    printf("    tlim=%lu shiftword=%lu lim=%lu glim=%lu\n", tlim, shiftword, lim, glim);
+    /* Below, we repeat the cycle over p many times with slightly
+     * different code to find r.  This way, extra checks are removed
+     * from very tight cycles. */
+    if (p <= tlim)		      /* Cannot use division on CPU */
+    {
+      for (;;)			 /* Loop in the run between biggaps */
+      {				 /* p is odd */
+	r = umodiu(aa,p);	 /* Whole-blown division */
+	MARK_RESDS(r,sz,sieve,p); /* solve a + 2r = 0 (mod p) and sieve */
+	NEXT_PRIME_VIADIFF(p, d);
+	if (p > tlim) break;
+      }
+      if (p > lim) goto restart;
+    }	     /* Now, can use the division instructions on the CPU */
+//    printf("S  p=%lu d=%lu -> %02x.%02x.%02x.%02x.%02x.%02x.%02x\n", p, (ulong)d, d[0], d[1], d[2], d[3], d[4], d[5], d[6]);
+
+    if (shiftword || p <= swtch)     /* Need to use 128:64 division */
+    {
+      tlim = shiftword ? lim : minuu(swtch, lim);
+      for (;;)				/* Loop in the run between biggaps */
+      {					/* p is odd */
+	{
+	  LOCAL_HIREMAINDER;
+	  hiremainder = shiftword;
+	  divll(a,p);
+	  r = hiremainder;
+	}
+	MARK_RESDS(r,sz,sieve,p); /* solve a + 2r = 0 (mod p) and sieve */
+	NEXT_PRIME_VIADIFF(p, d);
+	if (p > tlim) break;
+      }
+      if (p > lim) goto restart;
+    }	     /* Now, can use division with 64bit divident */
+//    printf("L  p=%lu d=%lu -> %02x.%02x.%02x.%02x.%02x.%02x.%02x\n", p, (ulong)d, d[0], d[1], d[2], d[3], d[4], d[5], d[6]);
+
+    for (;;) /* Now can use 64:32 division.  Loop in the run between biggaps */
+    {					/* p is odd */
+#ifdef HAVE_rem_half
+      r = rem_half(a, p);
+#else	/* !defined HAVE_rem_half */
+#  ifdef PREFER_C_LANG_MOD
+      r = a % p;
+#  else	 /* !( defined PREFER_C_LANG_MOD ) */ 
+      {
+	LOCAL_HIREMAINDER;
+	hiremainder = 0;
+	divll(a,p);
+	r = hiremainder;
+      }
+#  endif /* !( defined PREFER_C_LANG_MOD ) */ 
+#endif	/* !defined HAVE_rem_half */
+      MARK_RESDS(r,sz,sieve,p); /* solve a + 2r = 0 (mod p) and sieve */
+      NEXT_PRIME_VIADIFF(p, d);
+      if (p > lim) break;
+    }
+//    printf("   p=%lu d=%lu -> %02x.%02x.%02x.%02x.%02x.%02x.%02x\n", p, (ulong)d, d[0], d[1], d[2], d[3], d[4], d[5], d[6]);
+  }
+#define S(k)  (sieve[k])
+//  printf("%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x\n", S(0), S(1), S(2), S(3), S(4), S(5), S(6), S(7), S(8), S(9), S(10), S(11), S(12), S(13), S(14), S(15));
+// #undef S
+  if (sel)				/* Need isprime() calls too */
+  {
+    ulong off=0, bit=0;
+    if ((sz+1) & 0x7)
+    {
+      *end |= 0xff << ((sz+1) & 0x7);	/* Do not check unwritten bits at end */
+      rc -= 8 - ((sz+1) & 0x7);
+    }
+//    printf("%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x.%02x*%lu\n", S(0), S(1), S(2), S(3), S(4), S(5), S(6), S(7), S(8), S(9), S(10), S(11), S(12), S(13), S(14), S(15), -rc);
+    while (sieve <= end)
+    {
+      if (*sieve == 0xff)
+	  rc += 8;
+      else
+      {
+	  ulong s = bsf__maybe((unsigned char)~*sieve), e = BITS_IN_LONG - 1 - bfffo((unsigned char)~*sieve);
+	  rc += 8 - (e - s + 1);
+	  for (bit = s; bit <= e; bit++)
+//	  for (bit = 0; bit < 8; bit++)
+	  {
+	    if (*sieve & (1 << bit))	/* is not primelimit-rough */
+	      rc++;
+	    else if (!checker(addiu(aa, off + (bit<<1))))
+	      rc1++;		   /* rough, but not a pseudo prime */
+	  }
+      }
+      if (! (off & 0xffff) )
+	set_avma(av1);
+      sieve++;
+      off += 16;
+    }
+  }
+  else	      
+    while (end >= sieve) /* printf("0x%02x ", (int)*end), */ rc += byteBitCNT(*end--);
+  if (tofree)
+    pari_free(tofree);
+  set_avma(av);
+  if (sel)
+    retmkvec2(utoi(sz + 1 - rc - rc1), utoi(sz + 1 - rc));
+  return utoi(sz + 1 - rc);
+}
+
 static void
 pari_sieve_init(struct pari_sieve *s, ulong a, ulong b)
 {
@@ -856,7 +1389,7 @@ u_forprime_next(forprime_t *T)
   {
     for(;;)
     {
-      if (!*(T->d))
+      if (!*(T->d) && (T->d)[1] == DELTA_GUARD)
       {
         T->strategy = T->isieve? PRST_sieve: PRST_unextprime;
         if (T->q != 1) { arith_set(T); if (!T->p) return 0; }
@@ -865,6 +1398,11 @@ u_forprime_next(forprime_t *T)
       }
       else
       {
+        if (!*(T->d))
+	{
+	  T->p += ((*++(T->d))<<8) - 1;	/* Last byte is incremented by 1 */
+	  T->d += ToBIGGAP_SZ+1; /* Replaces NEXT_PRIME_VIADIFF_BIGGAPS */
+        }
         NEXT_PRIME_VIADIFF(T->p, T->d);
         if (T->p > T->b) return 0;
         if (T->q == 1 || T->p % T->q == T->c) return T->p;