Ilya Zakharevich on Fri, 12 Jan 2024 23:27:51 +0100 |
[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]
Big GOTCHAs WITH forprime() |
[I start with a longread supplying the context, then there is a short stopgap-suggestion at the end.] About 20–25 years ago forprime() • was optimized through the teeth (at least for the case when the limits are within the size of one word); • the optimizations were highly tunable for idiosyncrasies of particular processors; • the logic of tuneups was documented (at least in the source); and • it seems that several people did at least vgrep⸣ped through the code. Later the code was completely rewamped, and it was replaced/enhanced “by a much better code”. Today I did some timing on this “much better code”. Here are the results (with something like my(s=190*10^15); forprime(p=s,s+7*10^6,) ): s len= 6e5 6e6 6e7 6e8 ————————————————————————————————————————————— 15e16 | 0.999 25.366 17e16 | 1.000 26.958 267.979 190e15 | 0.125 0.984 28.361 282.534 191e15 | 0.125 0.999 9.860 98.734 It is obvious that no tuneup was performed on this code. I would not be surprised if it can be sped up an order of magnitude… (Moreover, it segfaults… — reported, see https://pari.math.u-bordeaux.fr/cgi-bin/bugreport.cgi?bug=2520 .) ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ language/forprime.c I browsed through the code. One critical parameter, the chunk size, is chosen by: /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large * as to force recalculating too often). */ ulong chunk = 0x80000UL; ulong tmp = (b - a) / chunk + 1; if (tmp == 1) chunk = b - a + 16; else chunk = (b - a) / tmp + 15; Hmm, no possibility of tuneup indeed?! ⁜⁜⁜⁜ Another critical parameter: when to switch between “sieving” and the “nextprime” method: maxp2 = (maxp & HIGHMASK)? 0 : maxp*maxp; /* FIXME: should sieve as well if q != 1, adapt sieve code */ if (q != 1 || (maxp2 && maxp2 <= a) || T->b - maxuu(a,maxp) < maxp / expu(b)) WHY disable sieving if maxp is wider than half-a-word?! WHY disable sieving if b-a < primelimit/⌊log₂(b)⌋ ?! ⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ 1/10000th of a workaround At some moment, the code which allowed to support arbitrarily large primelimit (up to a word size) was removed from PARI. (Compare with https://pari.math.u-bordeaux.fr/archives/pari-dev-2401/msg00008.html .) So now only the (unoptimized untunable buggy) code above can be used in the range about [436e6, 436e6²]. This removal has advantages: when I introduced the code in question, it could give a slowdown of 3–5% to forprime() — due to one extra (very rarely taken) conditional branch inside a hot loop. However: • With the progress in branch prediction in processors during the last 20 years, I would not be surprised that this slowdown is removed COMPLETELY nowadays. • As the timings above show, the replacement code can be pessimized UP TO 3 TIMES. On this background, even these 3–5% (on the architectures of last millennium?!) pale in comparison. So I think that a small part of the problems indicated at the start of this message may be alleviated by reinstalling these (few lines of) code allowing placing more forprime() invocations below the primelimit. Thanks, Ilya P.S. The next (obvious) useful steps seem to be • document the current “logic” of “optimizations” in the current code. • Make ALL the “arbitrary choices” tunable by users. • (Would be nice to) do massive experiments with timings to make the default tuneup reasonable. And MAYBE, when this is covered, more possibilities for optimizations would become visible…