News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Best way to unroll loops

Started by ASMManiac, February 16, 2012, 05:58:09 AM

Previous topic - Next topic

ASMManiac

What is the most efficient way to unroll loops in assembly?

I have an inner loop that will be (usually) between 5 and 20 iterations.  This number will be constant for each call of the inner loop, but unknown until run-time.

If I knew the size of the loop before-hand I could do something like this:

doThing(4)
doThing(3)
doThing(2)
doThing(1)
doThing(0)


I know C++ allows use of switch-case statements to unroll the loop efficiently.
I am wondering
(1) is there a better way to do it in assembly
(2) if not, how would I build a "skip-table" like the C++ compiler does automatically
(3) Can I take advantage of the fact that the inner loop executes the same number of iterations each time?

dedndave

you can still unroll it
you pay a penalty for testing if you are done with each iteration
but, you can still get an advantage by "widening" the range of the index values

the inner loop that runs the same number of iterations can probably be unrolled, nicely   :P

dedndave

here is an example of an unrolled loop
it is loosely based on a design by Randy Hyde
however, i managed to squeeze a bit of speed out of it with some rather signifigant changes

one thing i watched for when i was working on it was branch distance
i tried to keep all the branches that are executed repeatedly in the SHORT range (+127 to -128 bytes)
if i unroll it one more time, some of the repeated branches become NEAR and speed is lost
so - keeping the code small for each iteration is important

notice the indexes that are added to EAX to unroll - this is what i am refering to as "widening"
([eax], [eax+4], [eax+8], etc)
these indexes happen in place of updating the pointer register

;*****************************************************************

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

        ALIGN   16

SzLen   PROC    lpszStr:LPSTR

;SzLen - Get Length of Zero-Terminated String
;DednDave - Dave Sheldon - 8-2010

        mov     edx,[esp+4]
        mov     ecx,7F7F7F7Fh
        test    dl,3
        push    ebx
        jz      SzLen2

        mov     ebx,[edx]
        xor     eax,eax
        or      bl,bl
        jz      SzLen0

        inc     eax
        or      bh,bh
        jz      SzLen0

        test    dl,2
        jnz     SzLen1

        inc     eax
        test    ebx,0FF0000h
        jnz     SzLen1

SzLen0: pop     ebx
        ret     4

SzLen1: or      dl,3
        inc     edx

SzLen2: push    esi
        push    edi
        mov     esi,01010101h
        mov     edi,80808080h

SzLen3: mov     eax,edx
        mov     ebx,[eax]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+4]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+8]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+12]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+16]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+20]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+24]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+28]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jnz     SzLen4

        mov     ebx,[eax+32]
        add     edx,4
        and     ebx,ecx
        sub     ebx,esi
        and     ebx,edi
        jz      SzLen3

SzLen4: mov     ebx,[edx-4]
        lea     eax,[edx-4]
        or      bl,bl
        jz      SzLen5

        inc     eax
        or      bh,bh
        jz      SzLen5

        inc     eax
        test    ebx,0FF0000h
        jz      SzLen5

        inc     eax
        test    ebx,0FF000000h
        jnz     SzLen3

SzLen5: pop     edi
        pop     esi
        sub     eax,[esp+8]
        pop     ebx
        ret     4

SzLen   ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

;*****************************************************************

hutch--

You will find it has much to do with the Thing you want to Do. Depending on the code you do occasionally get decent improvements with unrolling, generally in cases where the loop overhead is a factor but it is often the case that unrolling does between little and nothing and in some situations where you get no speed gain from unrolling, you also get a cache penalty because of the excessive size of the loop code.

Set up a timing method and try variations while being aware that often such variations vary from one machine to another.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

FORTRANS

Hi,

   Untested, but should be close to what would work.  Unless
of course it is completely wrong.

HTH,

Steve N.


.DATA
;    Make a jump table holding the addresses of labels that access
; an unrolled loop.
JumpTable       DD      Label0, Label1, Label2, Label3, Label4

.CODE
;    Then use the size to enter the unrolled loop at the right place.
; (Make sure you don't exceed the size of your loop!)

        MOV     ESI,LoopSize
        SHL     ESI,2   ; Change index to DWORD array address.
        JMP     [JumpTable+ESI]
Label4:
    doThing(4)
Label3:
    doThing(3)
Label2:
    doThing(2)
Label1:
    doThing(1)
Label0:
    doThing(0)

ASMManiac

@dedndave
Could you explain the SHORT and NEAR in a little bit more detail?
Is the op-code few bytes for jumping a short distance and longer for jumping a longer distance?  Is the speed difference noticable?
Also, how can I determine if the compiler is doing a SHORT or NEAR jump?  i.e. how can I determine the size of the assembled code in between my jump statements?

@FORTRANS
If I am using x64, do I need to change the JumpTable pointers to 64 bit, or are they relative addresses so that 32 bit will work just as well?

dedndave

well - short branches are 2 bytes in size and near branches are 5 bytes
but, it's not really the size of the branch that matters - it's the speed
i have noticed that short branches in loops are signifigantly faster, at least on a P4
also - near branches "prefer" 16-aligned target addresses - short branches don't seem to care

one way to see if near or short branches are being generated is to use the "short" specifier   :P
        jz short target
if the assembler is unable to reach the target with a short branch, it will spit out an error - lol

another trick is to temporarily disable instructions for newer processors
that usually doesn't work too well, because you want to use newer instructions elsewhere in the loop
but, on the 286 and older processors, all conditional branches had to be short
        .286
        jz      target
        .586

again, the assembler will spit out an error if target is too far away

the way i usually do it is to generate an assembler listing and examine the code
you can do this by adding the "/Fl" switch to the MASM command line
you can see if short or near branches are used, and how far off you are
this method also gives you the opportunity to review other generated code
sometimes, you see something in the generated code that doesn't seem obvious in the source


when you use the "Fl" switch, the assembler spits out all kinds of stuff
it will create a huge LST file
you can reduce the size by using this at the beginning of the source...
        .XCREF
        .NOLIST
        INCLUDE    \masm32\include\masm32rt.inc
        .LIST


now - let me point you at some good reading...
http://www.agner.org/optimize/

Tedd

The main thing to watch for when unrolling is that your loop still fits into cache, if not then you're going to lose the benefit of unrolling. So there's a limit to the number of times you should unroll a loop, which is dependent on the size of that loop (a shorter loops can be unrolled more times than a longer one.)

Indirectly, this is why 'short' jumps are better - they imply a smaller loop which is more likely to fit into cache. (They're also encoded as 2 bytes instead of 3, which arguably causes a small penalty, but this is only on the first loop iteration; a few extra bytes spent encoding jumps means a few less bytes for useful instructions while still fitting into cache.)

Also, then, the start of your loop should be aligned to cache-line size, so that you have the maximum remaining space to fit your loop into. And just to make things easy, cache-line size is entirely dependent on the cpu model!
No snowflake in an avalanche feels responsible.

FORTRANS

Quote from: ASMManiac on February 16, 2012, 04:48:00 PM

@FORTRANS
If I am using x64, do I need to change the JumpTable pointers to 64 bit, or are they relative addresses so that 32 bit will work just as well?

Hi,

   Well, you can ask in the 64-bit subforum.  I have not
looked at 64-bit code enough to say what is valid in long
mode.

Regards,

Steve N.

ASMManiac

I am having trouble creating the jump table while using labels created with a macro.
It says: Symbol not defined Loop1_0.
It seems like the assembler is trying to find Loop1_0 in the source but can't until it has evalueated the macro for loop.
I am using Jwasm.
Here is my code:

.data
JumpTable1       DQ      Loop1_0, Loop1_1, Loop1_2, Loop1_3

.code

minFilt1_16A PROC uses r10 r11 r12 r13 r14 rax
... other code...

        @@:
            FOR arg, <0,1,2,3,4>
                @CatStr(Loop1_, <&arg>, <:>)
                movdqu xmm1, [SRC + j - &arg * 16]
                pminub xmm0, xmm1
                movdqu [blk1_128 + j - &arg * 16], xmm0
            ENDM
... other code....
minFilt1_16A

qWord

declare the labels with two colons
QuoteLoop1_0::
FPU in a trice: SmplMath
It's that simple!