| Karim Belabas on Sun, 14 Jul 2019 18:12:01 +0200 |
| Re: Fwd: Re: Using local variable for memoizing in recursion |
* Mike Day [2019-07-14 15:03]:
> I've just noticed an edge-condition failure in Karim's version (below):
> need to replace
> vecsum(t[a-1]); \\ at end of function by
> vecsum(t[kx]); \\ was problem when a=1 !
Good catch :-)
> I'd accepted Dmitry's assertion that execution of a particular line took
> half the execution time, so sought a way to avoid the calculation.
The assertion is plausible: the innermost loop accounts for essentially 100% of
the CPU time and contains only two lines
  ky=1+(k1-j+i)%(k1); /* to be optimized, takes half the CPU time */
  t[k2][j]=t[kx][j-1]+t[ky][j];
which look of comparable cost (see below: the first one performs more atomic
steps, but on tiny inputs, compared to the large addition and assignment in the
second line).
> I still wonder, though, how I can _examine_ a function's performance,
> line by line, in order to optimise by hand/eye/brain...
I don't see an easy way in GP (esp. under Windows). The following usually
works but is cumbersome: in the program, replace the instruction block
to be timed by a loop repeating that block enough times that an
individual timing becomes meaningful, and surround the loop by something like
  gettime();
  for(cnt = 1, MAXCNT, ...);
  T += gettime();
Make sure the loop operates on dummy variables (say, my_t) so as to have no
effect on the program flow. Then print T / MAXCNT and compare. One must also
check that an empty loop for(cnt = 1, MAXCNT, ) takes negligible time [which will
be the case here] or take that into account as well...
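As a minimal sketch of that idiom (MAXCNT and the block body are placeholders
to adapt to your own program, not code from Dmitry's or my version):

```gp
\\ Hypothetical sketch of the manual timing idiom described above.
MAXCNT = 10^5;
T = 0;
my_t = vector(100);                 \\ dummy variable: no effect on program flow
gettime();                          \\ reset GP's timer
for (cnt = 1, MAXCNT,
  my_t[1 + cnt % 100] = cnt^2 + 1   \\ <-- the block to be timed goes here
);
T += gettime();                     \\ elapsed milliseconds for MAXCNT runs
print("average per iteration: ", T / MAXCNT, " ms");
\\ sanity check: the empty loop itself should take negligible time
gettime(); for (cnt = 1, MAXCNT, ); print("empty loop: ", gettime(), " ms");
```

Recall that gettime() returns the time elapsed since its previous call, so the
two calls bracketing the loop measure exactly the repeated block.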
What I sometimes do is have a look at gprof [external profiling tool] and
decide from there. Just did it: in this case, it's hard to identify what takes
time from the C function names:
< gprof excerpt, ran on the original pnkv_3(10000,2000) >
index % time self children called name
[3] 92.0 3.05 3.64 20010027+1 closure_eval <cycle 2> [3]
0.94 0.53 60014000/60014000 gadd [4]
0.05 0.35 20008000/20008000 change_compo [7]
0.01 0.25 20008003/20008003 changelex [9]
0.07 0.17 20007999/20007999 gmod [10]
0.12 0.07 40016003/80028023 gunclone_deep [8]
0.19 0.00 120008001/120008001 check_array_index [12]
0.09 0.09 40011998/40011998 copylex [13]
0.14 0.00 80002000/80002000 closure_castgen [16]
0.10 0.00 40016010/40016010 stoi [22]
0.05 0.00 39996001/39996001 gsub [28]
(N.B. The timings from gprof are usually inaccurate but give a rough idea
of relative cost; the number of calls is more useful.)
A line such as
  ky=1+(k1-j+i)%(k1); /* to be optimized, takes half the CPU time */
translates to 2 gadd, 1 gsub, 1 gmod and 1 changelex. It is run ~ 2*10^7 times
(the number of 'gmod' calls in the gprof output, but also ~ n * k from direct
code inspection).
The other line in the innermost loop
  t[k2][j]=t[kx][j-1]+t[ky][j];
translates to 1 gadd, 1 gsub, 1 change_compo (the assignment) and 6
check_array_index (the [] constructs). Note that this latter gadd operates on
much larger inputs than the preceding ones [or than the gsub and gmod] from the
other line, but it's impossible to estimate that from gprof's output alone.
My version returns this:
< gprof excerpt, ran on my pnk_4(10000,2000) >
index % time self children called name
[3] 91.0 3.06 3.09 40018025+1 closure_eval <cycle 2> [3]
0.96 0.25 19988002/19990001 gadd [4]
0.02 0.31 20038000/20038000 changelex [6]
0.17 0.07 40055999/40074019 gunclone_deep [8]
0.19 0.00 79972003/79972003 closure_castgen [10]
0.04 0.13 19998000/19998000 geq [11]
0.14 0.00 40028027/40028027 pari_stack_alloc [13]
0.03 0.10 19998000/19998000 gsub1e [16]
0.11 0.00 79982004/79982004 check_array_index [18]
0.10 0.00 19988002/19988002 gsub [19]
0.08 0.00 39986017/39986017 stoi [22]
and 100% of the CPU time is now concentrated on the following three lines:
if (!(ky--), ky = k1);
if (j == 1, t[ky][1]
, t[kx][j-1] + t[ky][j]));
The reasons it's a little faster:
1) I removed the assignments to the t[k2] accumulator by using the
t[a] = vector() construct [which replaces t[a] in one atomic step
instead of changing the t[a][j] in place, which wouldn't work anyway since the
loop still needs the old values];
2) I removed many operations on small operands, letting the implicit loop
in vector() do some of them in a faster way (the + and the assignment).
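To illustrate points 1) and 2) on a toy recurrence (the names t_old, t_new
below are placeholders, not the actual pnk code): instead of filling a
preallocated row entry by entry with deep assignments, one builds the whole
row inside vector() and replaces it atomically.

```gp
\\ Toy illustration: each entry of the new row depends only on the old row,
\\ as in t[k2][j] = t[kx][j-1] + t[ky][j].
n = 2000;
t_old = vector(n, j, j);

\\ slower style: one change_compo and several check_array_index per entry
t_new = vector(n);
t_new[1] = t_old[1];
for (j = 2, n, t_new[j] = t_old[j-1] + t_old[j]);

\\ faster style: the implicit loop in vector() does the work, and the
\\ assignment replaces the whole row in one atomic step; t_old stays
\\ available while the new row is being built
t_new2 = vector(n, j, if (j == 1, t_old[1], t_old[j-1] + t_old[j]));
\\ both styles compute the same row: t_new == t_new2
```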
> And while we're at it, is Dmitry's use of a vector of vectors, t, to be
> preferred to my earlier use of a matrix, also t, with similar dimensions?
A vector of vectors will be a little faster, but it should not make a major
difference. What made a difference was replacing a for() loop by a vector()
construct in the innermost loop and removing the "deep" assignments
  t[k2][j] = ...
(we still have an equivalent of the global copy t[kx] = t[k2] in the t[a] =
vector(...)).
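For concreteness, a sketch of the two layouts the question refers to (the
dimensions k, n below are placeholders): matrix entries are addressed as
t[i,j], vector-of-vectors entries as t[i][j], and only the latter lets a whole
row be replaced in one atomic step, as in 1) above.

```gp
k = 5; n = 10;

\\ matrix layout: a single k x n object, entries addressed as tm[i,j]
tm = matrix(k, n, i, j, i + j);

\\ vector-of-vectors layout: k independent row vectors, entries tv[i][j]
tv = vector(k, i, vector(n, j, i + j));

\\ only the second layout allows an atomic whole-row replacement:
tv[3] = vector(n, j, 2*j);
\\ for the matrix one would loop instead: for (j = 1, n, tm[3,j] = 2*j);
```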
Cheers,
K.B.
--
Karim Belabas, IMB (UMR 5251) Tel: (+33) (0)5 40 00 26 17
Universite de Bordeaux Fax: (+33) (0)5 40 00 21 23
351, cours de la Liberation http://www.math.u-bordeaux.fr/~kbelabas/
F-33405 Talence (France) http://pari.math.u-bordeaux.fr/ [PARI/GP]