The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: loki_dre on April 20, 2008, 08:49:40 AM

Title: for loop with constants
Post by: loki_dre on April 20, 2008, 08:49:40 AM
is it possible to write code as a small for loop based on a constant that won't consume any registers or variables, but simply copy and paste the given instructions over X times?
eg.
for(i=0;i<3;i++)
mov [ebx],al
inc ebx

Which would translate to:
mov [ebx],al
inc ebx
mov [ebx],al
inc ebx
mov [ebx],al
inc ebx
Title: Re: for loop with constants
Post by: MichaelW on April 20, 2008, 09:16:58 AM
You can create an "unrolled" loop with the REPEAT directive.

    REPEAT 3
      mov [ebx],al
      inc ebx
    ENDM


00401000 8803                   mov     [ebx],al
00401002 43                     inc     ebx
00401003 8803                   mov     [ebx],al
00401005 43                     inc     ebx
00401006 8803                   mov     [ebx],al
00401008 43                     inc     ebx


And many variations are possible, for example:

    N=0
    REPEAT 3
      mov [ebx+N*4],eax
      N=N+1
    ENDM


00401009 8903                   mov     [ebx],eax
0040100B 894304                 mov     [ebx+4],eax
0040100E 894308                 mov     [ebx+8],eax
Title: Re: for loop with constants
Post by: loki_dre on April 21, 2008, 03:47:08 PM
are unrolled loops always faster? any recommendations on a limit?
Title: Re: for loop with constants
Post by: donkey on April 21, 2008, 03:55:32 PM
Quote from: loki_dre on April 21, 2008, 03:47:08 PM
are unrolled loops always faster? any recommendations on a limit?


Yes, they are always faster though in some cases the gain would be offset by other considerations for example with loops with a large number of iterations the extra coding would be cumbersome and if the branching is predictable within the scope of the branch prediction hardware the actual gains in speed might be offset by code size and instruction caching issues. For myself I will unroll loops but none bigger than around 16 iterations but that's a personal preference and has no basis in any set rules.
Title: Re: for loop with constants
Post by: loki_dre on April 21, 2008, 05:08:21 PM
any examples of a nested for loop??
Thanks in advance
Title: Re: for loop with constants
Post by: loki_dre on April 21, 2008, 05:09:14 PM
examples of a nested unrolled loop I mean
Title: Re: for loop with constants
Post by: hutch-- on April 23, 2008, 06:44:12 AM
loki,

There does not seem to be any set rule, even on the same processor. I would always advocate timing different unroll sizes as I have found optimal ranging from 0 to 16 depending on the code to be unrolled. Almost exclusively the limiting factor in algo speed is memory access speed and once you have hit that wall, nothing makes it go faster.

You can try reading in larger sizes, DWORD over BYTE etc ... but you make yourself more work handling uneven ends.
Title: Re: for loop with constants
Post by: MichaelW on April 23, 2008, 08:19:44 AM
This code displays the cycle counts for a memory fill loop at unroll counts of 1 to 256. The optimal unroll count will depend on the processor, I think primarily on the size and/or characteristics of the processor's L1 cache.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    mov esi, halloc(1024)

    invoke Sleep, 3000

    FOR UR,<1,2,4,8,16,32,64,128,256>

      counter_begin 1000, HIGH_PRIORITY_CLASS
          xor eax, eax
          xor edx, edx
          mov ecx, 1024/4/UR
        @@:
          N=0
          REPEAT UR
            mov [esi+edx+N], eax
            N=N+1
          ENDM
          add edx, 4*UR
          dec ecx
          jnz @B
      counter_end
      print ustr$(eax)," cycles, unrolled by "
      print ustr$(UR),13,10

    ENDM

    print chr$(13,10)
    counter_begin 1000, HIGH_PRIORITY_CLASS
        xor eax, eax
        xor edx, edx
        mov ecx, 1024/4-1
      @@:
        mov [esi+ecx*4], eax
        dec ecx
        jns @B
    counter_end
    print ustr$(eax)," cycles, optimized for no unroll",13,10,13,10

    hfree ebx

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Results on my P3:

774 cycles, unrolled by 1
392 cycles, unrolled by 2
352 cycles, unrolled by 4
444 cycles, unrolled by 8
668 cycles, unrolled by 16
580 cycles, unrolled by 32
571 cycles, unrolled by 64
577 cycles, unrolled by 128
580 cycles, unrolled by 256

518 cycles, optimized for no unroll


Edit: The unroll by one cycle count is somewhat misleading, because the loop includes an instruction that is not necessary for a loop that is not unrolled, so for reference I added a loop that eliminates this instruction.

And corrected two errors related to using ecx as a loop counter and an address component, instead of just as a loop counter.
Title: Re: for loop with constants
Post by: hutch-- on April 23, 2008, 08:58:32 AM
Here is the result on my old PIV.


548 cycles, unrolled by 1
541 cycles, unrolled by 2
519 cycles, unrolled by 4
519 cycles, unrolled by 8
520 cycles, unrolled by 16
2391 cycles, unrolled by 32
1395 cycles, unrolled by 64
1397 cycles, unrolled by 128
1396 cycles, unrolled by 256

527 cycles, optimized for no unroll

Press any key to exit...
Title: Re: for loop with constants
Post by: Tedd on April 23, 2008, 11:51:28 AM
In general, you can unroll a small loop a small number of times, for immediate benefit. As the loop gets larger, you can unroll it less without causing it to fill a line of cache - if it all fits in cache then it's faster. The same goes for nested loops - unroll the inner if it doesn't make it too big, but unrolling both is most likely to make it too big.
The processor does actually optimise some loops (mainly, it can skip the decoding step), so it's not always better to unroll (but for small loops it generally is.) And it all depends on the size of the processor's cache ::)