News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

for loop with constants

Started by loki_dre, April 20, 2008, 08:49:40 AM

Previous topic - Next topic

loki_dre

is it possible to write code as a small for loop based on a constant that won't consume any registers or variables, but simply copy and paste the given instructions over X times?
eg.
for(i=0;i<3;i++)
mov [ebx],al
inc ebx

Which would translate to:
mov [ebx],al
inc ebx
mov [ebx],al
inc ebx
mov [ebx],al
inc ebx

MichaelW

You can create an "unrolled" loop with the REPEAT directive.

    REPEAT 3
      mov [ebx],al
      inc ebx
    ENDM


00401000 8803                   mov     [ebx],al
00401002 43                     inc     ebx
00401003 8803                   mov     [ebx],al
00401005 43                     inc     ebx
00401006 8803                   mov     [ebx],al
00401008 43                     inc     ebx


And many variations are possible, for example:

    N=0
    REPEAT 3
      mov [ebx+N*4],eax
      N=N+1
    ENDM


00401009 8903                   mov     [ebx],eax
0040100B 894304                 mov     [ebx+4],eax
0040100E 894308                 mov     [ebx+8],eax
eschew obfuscation

loki_dre

are unrolled loops always faster? any recommendations on a limit?

donkey

Quote from: loki_dre on April 21, 2008, 03:47:08 PM
are unrolled loops always faster? any recommendations on a limit?


Yes, they are always faster though in some cases the gain would be offset by other considerations for example with loops with a large number of iterations the extra coding would be cumbersome and if the branching is predictable within the scope of the branch prediction hardware the actual gains in speed might be offset by code size and instruction caching issues. For myself I will unroll loops but none bigger than around 16 iterations but that's a personal preference and has no basis in any set rules.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

loki_dre

any examples of a nested for loop??
Thanks in advance

loki_dre

examples of a nested unrolled loop I mean

hutch--

loki,

There does not seem to be any set rule, even on the same processor. I would always advocate timing different unroll sizes as I have found optimal ranging from 0 to 16 depending on the code to be unrolled. Almost exclusively the limiting factor in algo speed is memory access speed and once you have hit that wall, nothing makes it go faster.

You can try reading in larger sizes, DWORD over BYTE etc ... but you make yourself more work handling uneven ends.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

#7
This code displays the cycle counts for a memory fill loop at unroll counts of 1 to 256. The optimal unroll count will depend on the processor, I think primarily on the size and/or characteristics of the processor's L1 cache.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    mov esi, halloc(1024)

    invoke Sleep, 3000

    FOR UR,<1,2,4,8,16,32,64,128,256>

      counter_begin 1000, HIGH_PRIORITY_CLASS
          xor eax, eax
          xor edx, edx
          mov ecx, 1024/4/UR
        @@:
          N=0
          REPEAT UR
            mov [esi+edx+N], eax
            N=N+1
          ENDM
          add edx, 4*UR
          dec ecx
          jnz @B
      counter_end
      print ustr$(eax)," cycles, unrolled by "
      print ustr$(UR),13,10

    ENDM

    print chr$(13,10)
    counter_begin 1000, HIGH_PRIORITY_CLASS
        xor eax, eax
        xor edx, edx
        mov ecx, 1024/4-1
      @@:
        mov [esi+ecx*4], eax
        dec ecx
        jns @B
    counter_end
    print ustr$(eax)," cycles, optimized for no unroll",13,10,13,10

    hfree ebx

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Results on my P3:

774 cycles, unrolled by 1
392 cycles, unrolled by 2
352 cycles, unrolled by 4
444 cycles, unrolled by 8
668 cycles, unrolled by 16
580 cycles, unrolled by 32
571 cycles, unrolled by 64
577 cycles, unrolled by 128
580 cycles, unrolled by 256

518 cycles, optimized for no unroll


Edit: The unroll by one cycle count is somewhat misleading, because the loop includes an instruction that is not necessary for a loop that is not unrolled, so for reference I added a loop that eliminates this instruction.

And corrected two errors related to using ecx as a loop counter and an address component, instead of just as a loop counter.
eschew obfuscation

hutch--

Here is the result on my old PIV.


548 cycles, unrolled by 1
541 cycles, unrolled by 2
519 cycles, unrolled by 4
519 cycles, unrolled by 8
520 cycles, unrolled by 16
2391 cycles, unrolled by 32
1395 cycles, unrolled by 64
1397 cycles, unrolled by 128
1396 cycles, unrolled by 256

527 cycles, optimized for no unroll

Press any key to exit...
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Tedd

In general, you can unroll a small loop a small number of times, for immediate benefit. As the loop gets larger, you can unroll it less without causing it to fill a line of cache - if it all fits in cache then it's faster. The same goes for nested loops - unroll the inner if it doesn't make it too big, but unrolling both is most likely to make it too big.
The processor does actually optimise some loops (mainly, it can skip the decoding step), so it's not always better to unroll (but for small loops it generally is.) And it all depends on the size of the processor's cache ::)
No snowflake in an avalanche feels responsible.