Print Page - ASM for FUN - #7 SUB [Nested Loops

Title: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 03, 2010, 05:56:20 PM

Here we have a simple routine, generating all the combinations of
four integer numbers ranging from 1 to 100.

Let's see how many levels of optimization we can apply to it. ::)

Straightforward CODE, we let the compiler do the optimization task:

Code Select


'-----------------------------------------------------------------------------------------
' pgm: Nested_4.bas
'-----------------------------------------------------------------------------------------
' Nested Loops - How to optimize them.
'                Generating all the combinations of 4 integers ranging from 1 to 100
' 05 may 2010  - frktons and the masmforum guys
'-----------------------------------------------------------------------------------------

#COMPILE EXE
#DIM ALL
#OPTIMIZE SPEED
' ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
'
DECLARE FUNCTION GetTickCount LIB "KERNEL32.DLL" ALIAS "GetTickCount" () AS DWORD
'
' ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

FUNCTION PBMAIN AS LONG

    DIM x1 AS LONG, x2 AS LONG, x3 AS LONG, x4 AS LONG
    DIM Counter AS LONG

    LOCAL tc AS DWORD, cycles AS QUAD
    
    cycles = 0
    Counter = 0
    
    TIX cycles

    tc = GetTickCount
    
    FOR x1 = 1 TO 97
        FOR x2 = x1 + 1 TO 98
            FOR x3 = x2 + 1 TO 99
                FOR x4 = x3 + 1 TO 100
                    INCR Counter
                NEXT x4
            NEXT x3
        NEXT x2
    NEXT x1

    tc = GetTickCount - tc

    TIX END cycles

    LOCATE 10, 20
    PRINT "Total combinations ==> ";FORMAT$(Counter,"#,");
    LOCATE 12, 20
    PRINT "Elapsed time in MS ==> ";tc;
    LOCATE 14,20
    PRINT "Elapsed cycles ==> ";FORMAT$(cycles,"#,");

    WAITKEY$

END FUNCTION

This routine performs in this way:

Code Select


                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  15

                   Elapsed cycles ==> 30,835,980

The first step of optimization is not going to use ASM, only
PB CODE. After we move a step further using ASM. :P

Please feel free to suggest any optimization you could imagine.

Frank

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: clive on May 03, 2010, 07:08:10 PM

Are you planning to do anything in each iteration that might actually be fold-able?

You could unfold the inner loop with Counter = Counter + (99 - x3), and obviously keep unfolding from there.

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 03, 2010, 07:45:06 PM

Quote from: clive on May 03, 2010, 07:08:10 PM
Are you planning to do anything in each iteration that might actually be fold-able?

You could unfold the inner loop with Counter = Counter + (99 - x3), and obviously keep unfolding from there.

I'm not sure I understand what you mean. :red
I actually need all the four numbers [x1, x2, x3, x4] generated.
The unfolding you are suggesting is going to preserve the value of x4?
Like the previous program I need to generate all the 4 integers groups from:

[1, 2, 3, 4] to [97, 98, 99, 100] with no repetitions, and with no importance
for the order in which they appear, so [1, 2, 3, 4] is considered the same of
[4, 3, 2, 1] <=== we can also go backward instead of forward in the loops
if it helps for the performance task.

Regarding the future use, the groups could be used for many tasks, statistics, and
for goals I actually can't imagine, but they will surely be used. ::)

I'm going to surf through the K&R in order to understand the C routine you
prepared for the previous task :eek , I actually don't speak C :lol
Fortunately enough I've got a copy of the book. :P

Frank

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: clive on May 03, 2010, 08:33:14 PM

I'm just trying to understand where you might be able to obtain some algorithmic optimization. If you have to visit each point, then the primary place you going to save some cycles is how you code the loops. Even a couple of cycles in each loop will save millions across the whole routine. Similarly, the more actual work you plan on doing in the inner loop, that's going to cost 4M cycles for every cycle that work consumes, so the more you can factor out of the loop(s) the better.

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 03, 2010, 09:02:30 PM

In PB there are some tricks to optimize the performance of a routine
that are linked to the compiler architecture.

For example I can get a performance boost of 3:1 just adding two instructions:

Code Select


    REGISTER Counter AS LONG
    REGISTER x4 AS LONG

with this simple trick the performance changes in this fashion:

Code Select



                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  0

                   Elapsed cycles ==> 10,378,773

And the internal algorithm that the compiler produces, uses 2 registers for
the two numeric variables that are used a lot [Counter, x4].
This simple redefinition of two variables gives a boost of 3:1 to the performance. :wink

Before using ASM inside the PB program, we can use just a
few of these tweaks to see what happens.
As far as I've tried to use these tweaks by myself, not leaving
the compiler do the optimization by itself, I always got better
performances. :P

Frank

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: clive on May 03, 2010, 10:11:55 PM

I'd be cautious of the cycles number. Is it being handled as a 64-bit number. I ask because the millisecond number went from 15 to 0. The other concern would be if you count cycles for too long, you'll be seeing cycles billed from other tasks/processes.

Also some C compilers would do the same optimization I suggested originally, instead of incrementing the counter.

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 03, 2010, 10:23:49 PM

Quote from: clive on May 03, 2010, 10:11:55 PM
I'd be cautious of the cycles number. Is it being handled as a 64-bit number. I ask because the millisecond number went from 15 to 0. The other concern would be if you count cycles for too long, you'll be seeing cycles billed from other tasks/processes.

Yes Clive, you are right, and in a multitasking OS it is almost inevitable.

Quote from: clive on May 03, 2010, 10:11:55 PM

Also some C compilers would do the same optimization I suggested originally, instead of incrementing the counter.

Well, what I need here is the four integers [x1, x2, x3, x4] and not the
counter in itself. I'm using it just to check that different versions of
the routine have the same result.

The task is to generate all the rows from
[1,2,3,4]
to
[97,98,99,100]
and we have counted 3,921,225 total combinations.
The challenge is to obtain that result [generating the combinations] in the shortest time we can
manage. :eek

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: dedndave on May 04, 2010, 01:02:03 AM

i get about 5.25 mS on a 3 GHz Prescott

Code Select

3921225  combinations
15996963 clock cycles
15819197 clock cycles
15802771 clock cycles
15769103 clock cycles
15820043 clock cycles

Code Select

;----------------------------------------------------------

combo   proc

        push    ebx
        push    ecx
        push    edx
        push    esi
        xor     ebx,ebx
        mov     eax,ebx

combo1: inc     ebx
        mov     ecx,ebx

combo2: inc     ecx
        mov     edx,ecx

combo3: inc     edx
        mov     esi,edx

combo4: inc     esi
        inc     eax
        cmp     esi,100
        jb      combo4

        cmp     edx,99
        jb      combo3

        cmp     ecx,98
        jb      combo2

        cmp     ebx,97
        jb      combo1

        pop     esi
        pop     edx
        pop     ecx
        pop     ebx
        ret

combo   endp

;----------------------------------------------------------

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 04, 2010, 09:44:14 AM

Thanks Dave,

This loop can help a lot :U
These kind of MASM routines in Console mode are what I
need most to learn from examples. :clap:

In my Core Duo it shows these results:

Code Select


3,921,225  combinations
9,021,279 clock cycles
9,027,173 clock cycles
9,020,136 clock cycles
9,020,094 clock cycles
9,022,730 clock cycles

If I understand your routine you have performed 5 cycles
of the same generation loop to get a performance average.

Have you any idea about the performance I got in PB that
is almost the same?

Code Select


                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  16

                   Elapsed cycles ==> 10,457,106

Probably the cycle counter is not very precise and we have to
relay more on the ms counter? ::)
Please could you add an ms counter to your routine and the
display of it at the end of the loop?

Thanks again

Frank

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: dedndave on May 04, 2010, 09:53:00 AM

i am not sure about the timing issue, Frank
i would try putting the asm code into PB, so that we can use the same timing mechanism
afterall, it is not really the absolute timing we are interested in - we just want to compare algorithms

QuoteIf I understand your routine you have performed 5 cycles of the same generation loop to get a performance average.

well - i see 4 loops :P
bascially, i just wanted to keep everything in register to make it faster - and keep it simple
in fact, there are a few dependancies in my code - it could probably be improved upon

EDIT - ohhhh
5 timing measurements
we don't look for an average, actually
we tend to believe the fastest time, unless there is some anomolous run
oh - also, we try to control the iteration count so each sample takes 0.5 to 1.0 second
if your test takes less than 0.5 seconds - the overhead shadows the measurement
for PB - you might want to make them longer - it may take some experimentation

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 04, 2010, 10:19:02 AM

Well I put the ASM routine inside the PB prog and
the results are:

Code Select


                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  0

                   Elapsed cycles ==> 9,079,092

Maybe the routine Hutch coded for PB. counting elapsed ms, in this
particular situation is not suitable. He should know if the elapsed time is too short
or there is some other problem. 9 Millions cycles should be more or
less 4 ms, but I am not sure. ::)

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 04, 2010, 10:32:06 AM

Well I added edi to count from 100 to 1 and had a
100 iterations of the whole generating process,

Code Select


combo:

!        push    ebx
!        push    ecx
!        push    edx
!        push    esi
!        push    eax
!        push    edi
!        mov     edi,100

combox:
!        xor     ebx,ebx
!        mov     eax,ebx

combo1:
!        inc     ebx
!        mov     ecx,ebx

combo2:
!        inc     ecx
!        mov     edx,ecx

combo3:
!        inc     edx
!        mov     esi,edx

combo4:
!        inc     esi
!        inc     eax
!        cmp     esi,100
!        jb      combo4

!        cmp     edx,99
!        jb      combo3

!        cmp     ecx,98
!        jb      combo2

!        cmp     ebx,97
!        jb      combo1

!        mov     counter, eax
!        dec     edi
!        jnz     combox

!        pop     edi
!        pop     eax
!        pop     esi
!        pop     edx
!        pop     ecx
!        pop     ebx

and the results are:

Code Select


                   Total iterations ==> 100

                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  485

                   Elapsed cycles ==> 1,060,725,952

this gives a better idea of how long it takes in ms. :P

What about optimizing these loops? Any idea? ::)

Frank

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: dedndave on May 04, 2010, 02:21:41 PM

well - the combo routine returns the combaination count in EAX
maybe you need to preserve EAX and store the value for the report
then for timing, just ignore it

code to report combination quantity

Code Select


        push    eax
        call    combo
        mov     CombinationCount,eax
        pop     eax

code to measure timing

Code Select


        push    eax
        call    combo
        pop     eax

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 04, 2010, 04:17:10 PM

I tried to preserve eax, as you can see in my previous post:

http://www.masm32.com/board/index.php?topic=13912.msg109968#msg109968

or I failed? ::)

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: dedndave on May 04, 2010, 04:30:43 PM

oh - gotcha
although - no way to return a value in EAX that way :bg
i dunno how PB works, but C compilers use the EAX register as a return value from called procedures
the windows API functions typically return the result or status in EAX
in ASM procedures, we can use EAX, ECX, and/or EDX

EDIT - 4.85 mS sounds like a good number for your processor

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 04, 2010, 04:33:32 PM

Quote from: dedndave on May 04, 2010, 04:30:43 PM
oh - gotcha
although - no way to return a value in EAX that way :bg
i dunno how PB works, but C compilers use the EAX register as a return value from called procedures
in ASM, we can use EAX, ECX, and/or EDX

In the routine there is no CALLing, the ASM instructions are in
MAIN() alike all the others. ::)

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: dedndave on May 04, 2010, 04:41:18 PM

attach your EXE file and i will see how it runs on mine

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: dedndave on May 04, 2010, 04:42:31 PM

QuoteIn the routine there is no CALLing, the ASM instructions are in MAIN() alike all the others.

certainly, you have some way to call subroutines - GOSUB/RETURN maybe ?

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: clive on May 04, 2010, 04:56:08 PM

Quote from: frktons
What about optimizing these loops? Any idea?

Pre-compute a table with [n1, n2, n3, n4], and index through it in a *single* loop. Should save a couple of million cycles in your empty loop, but will take a lot of memory (more depending on your encoding scheme)

But again, you are not doing anything inside your loop, so you just have a cycle burning timing loop. Whatever you do inside the loop is going to scale at 4M:1, so what you plan to put inside the loop is the main thing to focus on.

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 04, 2010, 05:42:32 PM

Quote from: dedndave on May 04, 2010, 04:42:31 PM
certainly, you have some way to call subroutines - GOSUB/RETURN maybe ?

Nothing at all, just one instruction after the other :P

Here you have the exe in a zipped form

Quote from: clive on May 04, 2010, 04:56:08 PM
Quote from: frktons
What about optimizing these loops? Any idea?

Pre-compute a table with [n1, n2, n3, n4], and index through it in a *single* loop. Should save a couple of million cycles in your empty loop, but will take a lot of memory (more depending on your encoding scheme)

But again, you are not doing anything inside your loop, so you just have a cycle burning timing loop. Whatever you do inside the loop is going to scale at 4M:1, so what you plan to put inside the loop is the main thing to focus on.

Well, I'm going, in the future ::) , to save the combinations in a file, so this loop
is just an exercise to see what kind of optimization is possible in a similar situation.
After that, I can read the whole file in an array and do what you suggest for
doing something with it, without the necessity to generate it again with nested loops.

ASM is an exercise for the time being, so let's learn something :U

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: dedndave on May 04, 2010, 05:57:12 PM

here is what i get on the prescott

Code Select

                   Total iterations ==> 100

                   Total combinations ==> 3,921,225

                   Elapsed time in MS ==>  813

                   Elapsed cycles ==> 2,438,347,215

8.13 mS is quite a bit higher than the 5.25 mS i calculated, but at least we have a benchmark

there has to be a way to execute routines from PB
if not, i would say it is totally handicapped

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: frktons on May 04, 2010, 09:17:36 PM

Quote from: dedndave on May 04, 2010, 01:02:03 AM
i get about 5.25 mS on a 3 GHz Prescott
Code Select Expand
3921225 combinations 15996963 clock cycles 15819197 clock cycles 15802771 clock cycles 15769103 clock cycles 15820043 clock cycles

If you add an ms counter to your routine I can see the difference in my pc
as well.

Quote from: dedndave on May 04, 2010, 05:57:12 PM
here is what i get on the prescott
Code Select Expand
Total iterations ==> 100 Total combinations ==> 3,921,225 Elapsed time in MS ==> 813 Elapsed cycles ==> 2,438,347,215
8.13 mS is quite a bit higher than the 5.25 mS i calculated, but at least we have a benchmark

there has to be a way to execute routines from PB
if not, i would say it is totally handicapped

Of course there are methods of CALLing whatever needed, I just
didn't need it. :P

It is just a wasting cycles routine, as Clive stated, it doesn't do anything else
so why should I use a Proc ? :lol

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: dedndave on May 04, 2010, 10:29:01 PM

to calculate the time in seconds from clock cycles, just divide the clock cycles by processor frequency in Hz
my prescott runs at 3 Ghz, so 15769103/3000000000 ~ 0.00525 or 5.25 mS
it takes a bit of code to measure cpu frequency, although i have done it
it is easier to divide with a calculator, though :lol

Title: Re: ASM for FUN - #7 SUB [Nested Loops - how to optimize them]
Post by: hutch-- on May 05, 2010, 06:03:04 AM

At a procedure level PB comforms to the Intel ABI in terms of register usage and return values in EAX for integer values. Its only curse is it has a bit higher stack overhead so that it conforms to the specifications of BASIC as a language.

The MASM Forum Archive 2004 to 2012

Project Support Forums => PowerBASIC Low Level Code => Topic started by: frktons on May 03, 2010, 05:56:20 PM