ASM for FUN - #7 SUB [Nested Loops - how to optimize them]

Started by frktons, May 03, 2010, 05:56:20 PM

Previous topic - Next topic

frktons

Here we have a simple routine, generating all the combinations of
four integer numbers ranging from 1 to 100.

Let's see how many levels of optimization we can apply to it.  ::)

Straightforward CODE, we let the compiler do the optimization task:


'-----------------------------------------------------------------------------------------
' pgm: Nested_4.bas
'-----------------------------------------------------------------------------------------
' Nested Loops - How to optimize them.
'                Generating all the combinations of 4 integers ranging from 1 to 100
' 05 may 2010  - frktons and the masmforum guys
'-----------------------------------------------------------------------------------------

#COMPILE EXE
#DIM ALL
#OPTIMIZE SPEED
' ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
'
DECLARE FUNCTION GetTickCount LIB "KERNEL32.DLL" ALIAS "GetTickCount" () AS DWORD
'
' ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

FUNCTION PBMAIN AS LONG

    DIM x1 AS LONG, x2 AS LONG, x3 AS LONG, x4 AS LONG
    DIM Counter AS LONG

    LOCAL tc AS DWORD, cycles AS QUAD
   
    cycles = 0
    Counter = 0
   
    TIX cycles

    tc = GetTickCount
   
    FOR x1 = 1 TO 97
        FOR x2 = x1 + 1 TO 98
            FOR x3 = x2 + 1 TO 99
                FOR x4 = x3 + 1 TO 100
                    INCR Counter
                NEXT x4
            NEXT x3
        NEXT x2
    NEXT x1

    tc = GetTickCount - tc

    TIX END cycles

    LOCATE 10, 20
    PRINT "Total combinations ==> ";FORMAT$(Counter,"#,");
    LOCATE 12, 20
    PRINT "Elapsed time in MS ==> ";tc;
    LOCATE 14,20
    PRINT "Elapsed cycles ==> ";FORMAT$(cycles,"#,");

    WAITKEY$

END FUNCTION



This routine performs in this way:


                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  15

                   Elapsed cycles ==> 30,835,980


The first step of optimization is not going to use ASM, only
PB CODE. After we move a step further using ASM.  :P

Please feel free to suggest any optimization you could imagine.

Frank                             
Mind is like a parachute. You know what to do in order to use it :-)

clive

Are you planning to do anything in each iteration that might actually be fold-able?

You could unfold the inner loop with Counter = Counter + (99 - x3), and obviously keep unfolding from there.
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on May 03, 2010, 07:08:10 PM
Are you planning to do anything in each iteration that might actually be fold-able?

You could unfold the inner loop with Counter = Counter + (99 - x3), and obviously keep unfolding from there.

I'm not sure I understand what you mean.  :red
I actually need all the four numbers [x1, x2, x3, x4] generated.
The unfolding you are suggesting is going to preserve the value of x4?
Like the previous program I need to generate all the 4 integers groups from:

[1, 2, 3, 4] to [97, 98, 99, 100] with no repetitions, and with no importance
for the order in which they appear, so [1, 2, 3, 4] is considered the same of
[4, 3, 2, 1] <=== we can also go backward instead of forward in the loops
if it helps for the performance task.

Regarding the future use, the groups could be used for many tasks, statistics, and
for goals I actually can't imagine, but they will surely be used.  ::)

I'm going to surf through the K&R in order to understand the C routine you
prepared for the previous task  :eek , I actually don't speak C  :lol
Fortunately enough I've got a copy of the book.  :P

Frank



Mind is like a parachute. You know what to do in order to use it :-)

clive

I'm just trying to understand where you might be able to obtain some algorithmic optimization. If you have to visit each point, then the primary place you going to save some cycles is how you code the loops. Even a couple of cycles in each loop will save millions across the whole routine. Similarly, the more actual work you plan on doing in the inner loop, that's going to cost 4M cycles for every cycle that work consumes, so the more you can factor out of the loop(s) the better.
It could be a random act of randomness. Those happen a lot as well.

frktons

In PB there are some tricks to optimize the performance of a routine
that are linked to the compiler architecture.

For example I can get a performance boost of 3:1 just adding two instructions:


    REGISTER Counter AS LONG
    REGISTER x4 AS LONG


with this simple trick the performance changes in this fashion:



                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  0

                   Elapsed cycles ==> 10,378,773


And the internal algorithm that the compiler produces, uses 2 registers for
the two numeric variables that are used a lot [Counter, x4].
This simple redefinition of two variables gives a boost of 3:1 to the performance.  :wink

Before using ASM inside the PB program, we can use just a
few of these tweaks to see what happens.
As far as I've tried to use these tweaks by myself, not leaving
the compiler do the optimization by itself, I always got better
performances.  :P

Frank


Mind is like a parachute. You know what to do in order to use it :-)

clive

I'd be cautious of the cycles number. Is it being handled as a 64-bit number. I ask because the millisecond number went from 15 to 0. The other concern would be if you count cycles for too long, you'll be seeing cycles billed from other tasks/processes.

Also some C compilers would do the same optimization I suggested originally, instead of incrementing the counter.
It could be a random act of randomness. Those happen a lot as well.

frktons

Quote from: clive on May 03, 2010, 10:11:55 PM
I'd be cautious of the cycles number. Is it being handled as a 64-bit number. I ask because the millisecond number went from 15 to 0. The other concern would be if you count cycles for too long, you'll be seeing cycles billed from other tasks/processes.

Yes Clive, you are right, and in a multitasking OS it is almost inevitable.

Quote from: clive on May 03, 2010, 10:11:55 PM

Also some C compilers would do the same optimization I suggested originally, instead of incrementing the counter.


Well, what I need here is the four integers [x1, x2, x3, x4] and not the
counter in itself. I'm using it just to check that different versions of
the routine have the same result.

The task is to generate all the rows from
[1,2,3,4]
to
[97,98,99,100]
and we have counted 3,921,225 total combinations.
The challenge is to obtain that result [generating the combinations] in the shortest time we can
manage.  :eek
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

i get about 5.25 mS on a 3 GHz Prescott
3921225  combinations
15996963 clock cycles
15819197 clock cycles
15802771 clock cycles
15769103 clock cycles
15820043 clock cycles

;----------------------------------------------------------

combo   proc

        push    ebx
        push    ecx
        push    edx
        push    esi
        xor     ebx,ebx
        mov     eax,ebx

combo1: inc     ebx
        mov     ecx,ebx

combo2: inc     ecx
        mov     edx,ecx

combo3: inc     edx
        mov     esi,edx

combo4: inc     esi
        inc     eax
        cmp     esi,100
        jb      combo4

        cmp     edx,99
        jb      combo3

        cmp     ecx,98
        jb      combo2

        cmp     ebx,97
        jb      combo1

        pop     esi
        pop     edx
        pop     ecx
        pop     ebx
        ret

combo   endp

;----------------------------------------------------------



frktons

Thanks Dave,

This loop can help a lot  :U
These kind of MASM routines in Console mode are what I
need most to learn from examples.  :clap:

In my Core Duo it shows these results:

3,921,225  combinations
9,021,279 clock cycles
9,027,173 clock cycles
9,020,136 clock cycles
9,020,094 clock cycles
9,022,730 clock cycles


If I understand your routine you have performed 5 cycles
of the same generation loop to get a performance average.

Have you any idea about the performance I got in PB that
is almost the same?


                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  16

                   Elapsed cycles ==> 10,457,106


Probably the cycle counter is not very precise and we have to
relay more on the ms counter?  ::)
Please could you add an ms counter to your routine and the
display of it at the end of the loop?

Thanks again

Frank
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

i am not sure about the timing issue, Frank
i would try putting the asm code into PB, so that we can use the same timing mechanism
afterall, it is not really the absolute timing we are interested in - we just want to compare algorithms

QuoteIf I understand your routine you have performed 5 cycles of the same generation loop to get a performance average.

well - i see 4 loops   :P
bascially, i just wanted to keep everything in register to make it faster - and keep it simple
in fact, there are a few dependancies in my code - it could probably be improved upon

EDIT - ohhhh
5 timing measurements
we don't look for an average, actually
we tend to believe the fastest time, unless there is some anomolous run
oh - also, we try to control the iteration count so each sample takes 0.5 to 1.0 second
if your test takes less than 0.5 seconds - the overhead shadows the measurement
for PB - you might want to make them longer - it may take some experimentation

frktons

Well I put the ASM routine inside the PB prog and
the results are:

                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  0

                   Elapsed cycles ==> 9,079,092


Maybe the routine Hutch coded for PB. counting  elapsed ms, in this
particular situation is not suitable. He should know if the elapsed time is too short
or there is some other problem. 9 Millions cycles should be more or
less 4 ms, but I am not sure.  ::)


Mind is like a parachute. You know what to do in order to use it :-)

frktons

Well I added edi to count from 100 to 1 and had a
100 iterations of the whole generating process,

combo:

!        push    ebx
!        push    ecx
!        push    edx
!        push    esi
!        push    eax
!        push    edi
!        mov     edi,100

combox:
!        xor     ebx,ebx
!        mov     eax,ebx

combo1:
!        inc     ebx
!        mov     ecx,ebx

combo2:
!        inc     ecx
!        mov     edx,ecx

combo3:
!        inc     edx
!        mov     esi,edx

combo4:
!        inc     esi
!        inc     eax
!        cmp     esi,100
!        jb      combo4

!        cmp     edx,99
!        jb      combo3

!        cmp     ecx,98
!        jb      combo2

!        cmp     ebx,97
!        jb      combo1

!        mov     counter, eax
!        dec     edi
!        jnz     combox

!        pop     edi
!        pop     eax
!        pop     esi
!        pop     edx
!        pop     ecx
!        pop     ebx         



and the results are:


                   Total iterations ==> 100

                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  485

                   Elapsed cycles ==> 1,060,725,952


this gives a better idea of how long it takes in ms.  :P

What about optimizing these loops? Any idea?  ::)

Frank
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

well - the combo routine returns the combaination count in EAX
maybe you need to preserve EAX and store the value for the report
then for timing, just ignore it

code to report combination quantity

        push    eax
        call    combo
        mov     CombinationCount,eax
        pop     eax


code to measure timing

        push    eax
        call    combo
        pop     eax

frktons

Mind is like a parachute. You know what to do in order to use it :-)

dedndave

oh - gotcha
although - no way to return a value in EAX that way   :bg
i dunno how PB works, but C compilers use the EAX register as a return value from called procedures
the windows API functions typically return the result or status in EAX
in ASM procedures, we can use EAX, ECX, and/or EDX

EDIT - 4.85 mS sounds like a good number for your processor