Here we have a simple routine, generating all the combinations of
four integer numbers ranging from 1 to 100.
Let's see how many levels of optimization we can apply to it. ::)
Straightforward CODE, we let the compiler do the optimization task:
'-----------------------------------------------------------------------------------------
' pgm: Nested_4.bas
'-----------------------------------------------------------------------------------------
' Nested Loops - How to optimize them.
' Generating all the combinations of 4 integers ranging from 1 to 100
' 05 may 2010 - frktons and the masmforum guys
'-----------------------------------------------------------------------------------------
#COMPILE EXE
#DIM ALL
#OPTIMIZE SPEED
' ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
'
DECLARE FUNCTION GetTickCount LIB "KERNEL32.DLL" ALIAS "GetTickCount" () AS DWORD
'
' ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
FUNCTION PBMAIN AS LONG
DIM x1 AS LONG, x2 AS LONG, x3 AS LONG, x4 AS LONG
DIM Counter AS LONG
LOCAL tc AS DWORD, cycles AS QUAD
cycles = 0
Counter = 0
TIX cycles
tc = GetTickCount
FOR x1 = 1 TO 97
FOR x2 = x1 + 1 TO 98
FOR x3 = x2 + 1 TO 99
FOR x4 = x3 + 1 TO 100
INCR Counter
NEXT x4
NEXT x3
NEXT x2
NEXT x1
tc = GetTickCount - tc
TIX END cycles
LOCATE 10, 20
PRINT "Total combinations ==> ";FORMAT$(Counter,"#,");
LOCATE 12, 20
PRINT "Elapsed time in MS ==> ";tc;
LOCATE 14,20
PRINT "Elapsed cycles ==> ";FORMAT$(cycles,"#,");
WAITKEY$
END FUNCTION
This routine performs in this way:
Total combinations ==> 3,921,225
Elapsed time in MS ==> 15
Elapsed cycles ==> 30,835,980
The first step of optimization is not going to use ASM, only
PB CODE. After we move a step further using ASM. :P
Please feel free to suggest any optimization you could imagine.
Frank
Are you planning to do anything in each iteration that might actually be fold-able?
You could unfold the inner loop with Counter = Counter + (99 - x3), and obviously keep unfolding from there.
Quote from: clive on May 03, 2010, 07:08:10 PM
Are you planning to do anything in each iteration that might actually be fold-able?
You could unfold the inner loop with Counter = Counter + (99 - x3), and obviously keep unfolding from there.
I'm not sure I understand what you mean. :red
I actually need all the four numbers [x1, x2, x3, x4] generated.
The unfolding you are suggesting is going to preserve the value of x4?
Like the previous program I need to generate all the 4 integers groups from:
[1, 2, 3, 4] to [97, 98, 99, 100] with no repetitions, and with no importance
for the order in which they appear, so [1, 2, 3, 4] is considered the same of
[4, 3, 2, 1] <=== we can also go backward instead of forward in the loops
if it helps for the performance task.
Regarding the future use, the groups could be used for many tasks, statistics, and
for goals I actually can't imagine, but they will surely be used. ::)
I'm going to surf through the K&R in order to understand the C routine you
prepared for the previous task :eek , I actually don't speak C :lol
Fortunately enough I've got a copy of the book. :P
Frank
I'm just trying to understand where you might be able to obtain some algorithmic optimization. If you have to visit each point, then the primary place you going to save some cycles is how you code the loops. Even a couple of cycles in each loop will save millions across the whole routine. Similarly, the more actual work you plan on doing in the inner loop, that's going to cost 4M cycles for every cycle that work consumes, so the more you can factor out of the loop(s) the better.
In PB there are some tricks to optimize the performance of a routine
that are linked to the compiler architecture.
For example I can get a performance boost of 3:1 just adding two instructions:
REGISTER Counter AS LONG
REGISTER x4 AS LONG
with this simple trick the performance changes in this fashion:
Total combinations ==> 3,921,225
Elapsed time in MS ==> 0
Elapsed cycles ==> 10,378,773
And the internal algorithm that the compiler produces, uses 2 registers for
the two numeric variables that are used a lot [Counter, x4].
This simple redefinition of two variables gives a boost of 3:1 to the performance. :wink
Before using ASM inside the PB program, we can use just a
few of these tweaks to see what happens.
As far as I've tried to use these tweaks by myself, not leaving
the compiler do the optimization by itself, I always got better
performances. :P
Frank
I'd be cautious of the cycles number. Is it being handled as a 64-bit number. I ask because the millisecond number went from 15 to 0. The other concern would be if you count cycles for too long, you'll be seeing cycles billed from other tasks/processes.
Also some C compilers would do the same optimization I suggested originally, instead of incrementing the counter.
Quote from: clive on May 03, 2010, 10:11:55 PM
I'd be cautious of the cycles number. Is it being handled as a 64-bit number. I ask because the millisecond number went from 15 to 0. The other concern would be if you count cycles for too long, you'll be seeing cycles billed from other tasks/processes.
Yes Clive, you are right, and in a multitasking OS it is almost inevitable.
Quote from: clive on May 03, 2010, 10:11:55 PM
Also some C compilers would do the same optimization I suggested originally, instead of incrementing the counter.
Well, what I need here is the four integers [x1, x2, x3, x4] and not the
counter in itself. I'm using it just to check that different versions of
the routine have the same result.
The task is to generate all the rows from
[1,2,3,4]
to
[97,98,99,100]
and we have counted 3,921,225 total combinations.
The challenge is to obtain that result [generating the combinations] in the shortest time we can
manage. :eek
i get about 5.25 mS on a 3 GHz Prescott
3921225 combinations
15996963 clock cycles
15819197 clock cycles
15802771 clock cycles
15769103 clock cycles
15820043 clock cycles
;----------------------------------------------------------
combo proc
push ebx
push ecx
push edx
push esi
xor ebx,ebx
mov eax,ebx
combo1: inc ebx
mov ecx,ebx
combo2: inc ecx
mov edx,ecx
combo3: inc edx
mov esi,edx
combo4: inc esi
inc eax
cmp esi,100
jb combo4
cmp edx,99
jb combo3
cmp ecx,98
jb combo2
cmp ebx,97
jb combo1
pop esi
pop edx
pop ecx
pop ebx
ret
combo endp
;----------------------------------------------------------
Thanks Dave,
This loop can help a lot :U
These kind of MASM routines in Console mode are what I
need most to learn from examples. :clap:
In my Core Duo it shows these results:
3,921,225 combinations
9,021,279 clock cycles
9,027,173 clock cycles
9,020,136 clock cycles
9,020,094 clock cycles
9,022,730 clock cycles
If I understand your routine you have performed 5 cycles
of the same generation loop to get a performance average.
Have you any idea about the performance I got in PB that
is almost the same?
Total combinations ==> 3,921,225
Elapsed time in MS ==> 16
Elapsed cycles ==> 10,457,106
Probably the cycle counter is not very precise and we have to
relay more on the ms counter? ::)
Please could you add an ms counter to your routine and the
display of it at the end of the loop?
Thanks again
Frank
i am not sure about the timing issue, Frank
i would try putting the asm code into PB, so that we can use the same timing mechanism
afterall, it is not really the absolute timing we are interested in - we just want to compare algorithms
QuoteIf I understand your routine you have performed 5 cycles of the same generation loop to get a performance average.
well - i see 4 loops :P
bascially, i just wanted to keep everything in register to make it faster - and keep it simple
in fact, there are a few dependancies in my code - it could probably be improved upon
EDIT - ohhhh
5 timing measurements
we don't look for an average, actually
we tend to believe the fastest time, unless there is some anomolous run
oh - also, we try to control the iteration count so each sample takes 0.5 to 1.0 second
if your test takes less than 0.5 seconds - the overhead shadows the measurement
for PB - you might want to make them longer - it may take some experimentation
Well I put the ASM routine inside the PB prog and
the results are:
Total combinations ==> 3,921,225
Elapsed time in MS ==> 0
Elapsed cycles ==> 9,079,092
Maybe the routine Hutch coded for PB. counting elapsed ms, in this
particular situation is not suitable. He should know if the elapsed time is too short
or there is some other problem. 9 Millions cycles should be more or
less 4 ms, but I am not sure. ::)
Well I added edi to count from 100 to 1 and had a
100 iterations of the whole generating process,
combo:
! push ebx
! push ecx
! push edx
! push esi
! push eax
! push edi
! mov edi,100
combox:
! xor ebx,ebx
! mov eax,ebx
combo1:
! inc ebx
! mov ecx,ebx
combo2:
! inc ecx
! mov edx,ecx
combo3:
! inc edx
! mov esi,edx
combo4:
! inc esi
! inc eax
! cmp esi,100
! jb combo4
! cmp edx,99
! jb combo3
! cmp ecx,98
! jb combo2
! cmp ebx,97
! jb combo1
! mov counter, eax
! dec edi
! jnz combox
! pop edi
! pop eax
! pop esi
! pop edx
! pop ecx
! pop ebx
and the results are:
Total iterations ==> 100
Total combinations ==> 3,921,225
Elapsed time in MS ==> 485
Elapsed cycles ==> 1,060,725,952
this gives a better idea of how long it takes in ms. :P
What about optimizing these loops? Any idea? ::)
Frank
well - the combo routine returns the combaination count in EAX
maybe you need to preserve EAX and store the value for the report
then for timing, just ignore it
code to report combination quantity
push eax
call combo
mov CombinationCount,eax
pop eax
code to measure timing
push eax
call combo
pop eax
I tried to preserve eax, as you can see in my previous post:
http://www.masm32.com/board/index.php?topic=13912.msg109968#msg109968
or I failed? ::)
oh - gotcha
although - no way to return a value in EAX that way :bg
i dunno how PB works, but C compilers use the EAX register as a return value from called procedures
the windows API functions typically return the result or status in EAX
in ASM procedures, we can use EAX, ECX, and/or EDX
EDIT - 4.85 mS sounds like a good number for your processor
Quote from: dedndave on May 04, 2010, 04:30:43 PM
oh - gotcha
although - no way to return a value in EAX that way :bg
i dunno how PB works, but C compilers use the EAX register as a return value from called procedures
in ASM, we can use EAX, ECX, and/or EDX
In the routine there is no CALLing, the ASM instructions are in
MAIN() alike all the others. ::)
attach your EXE file and i will see how it runs on mine
QuoteIn the routine there is no CALLing, the ASM instructions are in MAIN() alike all the others.
certainly, you have some way to call subroutines - GOSUB/RETURN maybe ?
Quote from: frktons
What about optimizing these loops? Any idea?
Pre-compute a table with [n1, n2, n3, n4], and index through it in a *single* loop. Should save a couple of million cycles in your empty loop, but will take a lot of memory (more depending on your encoding scheme)
But again, you are not doing anything inside your loop, so you just have a cycle burning timing loop. Whatever you do inside the loop is going to scale at 4M:1, so what you plan to put inside the loop is the main thing to focus on.
Quote from: dedndave on May 04, 2010, 04:42:31 PM
certainly, you have some way to call subroutines - GOSUB/RETURN maybe ?
Nothing at all, just one instruction after the other :P
Here you have the exe in a zipped form
Quote from: clive on May 04, 2010, 04:56:08 PM
Quote from: frktons
What about optimizing these loops? Any idea?
Pre-compute a table with [n1, n2, n3, n4], and index through it in a *single* loop. Should save a couple of million cycles in your empty loop, but will take a lot of memory (more depending on your encoding scheme)
But again, you are not doing anything inside your loop, so you just have a cycle burning timing loop. Whatever you do inside the loop is going to scale at 4M:1, so what you plan to put inside the loop is the main thing to focus on.
Well, I'm going, in the future ::) , to save the combinations in a file, so this loop
is just an exercise to see what kind of optimization is possible in a similar situation.
After that, I can read the whole file in an array and do what you suggest for
doing something with it, without the necessity to generate it again with nested loops.
ASM is an exercise for the time being, so let's learn something :U
here is what i get on the prescott
Total iterations ==> 100
Total combinations ==> 3,921,225
Elapsed time in MS ==> 813
Elapsed cycles ==> 2,438,347,215
8.13 mS is quite a bit higher than the 5.25 mS i calculated, but at least we have a benchmark
there has to be a way to execute routines from PB
if not, i would say it is totally handicapped
Quote from: dedndave on May 04, 2010, 01:02:03 AM
i get about 5.25 mS on a 3 GHz Prescott
3921225 combinations
15996963 clock cycles
15819197 clock cycles
15802771 clock cycles
15769103 clock cycles
15820043 clock cycles
If you add an ms counter to your routine I can see the difference in my pc
as well.
Quote from: dedndave on May 04, 2010, 05:57:12 PM
here is what i get on the prescott
Total iterations ==> 100
Total combinations ==> 3,921,225
Elapsed time in MS ==> 813
Elapsed cycles ==> 2,438,347,215
8.13 mS is quite a bit higher than the 5.25 mS i calculated, but at least we have a benchmark
there has to be a way to execute routines from PB
if not, i would say it is totally handicapped
Of course there are methods of CALLing whatever needed, I just
didn't need it. :P
It is just a wasting cycles routine, as Clive stated, it doesn't do anything else
so why should I use a Proc ? :lol
to calculate the time in seconds from clock cycles, just divide the clock cycles by processor frequency in Hz
my prescott runs at 3 Ghz, so 15769103/3000000000 ~ 0.00525 or 5.25 mS
it takes a bit of code to measure cpu frequency, although i have done it
it is easier to divide with a calculator, though :lol
At a procedure level PB comforms to the Intel ABI in terms of register usage and return values in EAX for integer values. Its only curse is it has a bit higher stack overhead so that it conforms to the specifications of BASIC as a language.