Is it possible to write code as a small for loop, based on a constant, that won't consume any registers or variables but simply copies and pastes the given instructions X times?
e.g.
for(i=0;i<3;i++)
mov [ebx],al
inc ebx
Which would translate to:
mov [ebx],al
inc ebx
mov [ebx],al
inc ebx
mov [ebx],al
inc ebx
You can create an "unrolled" loop with the REPEAT directive.
REPEAT 3
mov [ebx],al
inc ebx
ENDM
00401000 8803 mov [ebx],al
00401002 43 inc ebx
00401003 8803 mov [ebx],al
00401005 43 inc ebx
00401006 8803 mov [ebx],al
00401008 43 inc ebx
And many variations are possible, for example:
N=0
REPEAT 3
mov [ebx+N*4],eax
N=N+1
ENDM
00401009 8903 mov [ebx],eax
0040100B 894304 mov [ebx+4],eax
0040100E 894308 mov [ebx+8],eax
are unrolled loops always faster? any recommendations on a limit?
Quote from: loki_dre on April 21, 2008, 03:47:08 PM
are unrolled loops always faster? any recommendations on a limit?
Yes, they are always faster, though in some cases the gain is offset by other considerations. For loops with a large number of iterations the extra code would be cumbersome, and if the branching is predictable within the scope of the branch-prediction hardware, the actual speed gains can be offset by code size and instruction-caching issues. For myself I will unroll loops, but none bigger than around 16 iterations; that's a personal preference and has no basis in any set rules.
any examples of a nested for loop??
Thanks in advance
examples of a nested unrolled loop I mean
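Nobody posted one above, but REPEAT blocks nest directly, so a nested unrolled loop is just a REPEAT inside a REPEAT. Here is an untested sketch in the same style as the earlier examples; the matrix dimensions, the assembly-time counters R_IDX/C_IDX and the assumption that ebx points at a row-major dword matrix are all made up for the illustration:

; Sketch: zero a ROWS x COLS matrix of dwords at [ebx],
; stored row-major, with both loops fully unrolled.
ROWS = 3
COLS = 4
    xor eax, eax
R_IDX = 0
REPEAT ROWS
    C_IDX = 0
    REPEAT COLS
        mov [ebx + (R_IDX*COLS + C_IDX)*4], eax   ; one store per cell
        C_IDX = C_IDX + 1
    ENDM
    R_IDX = R_IDX + 1
ENDM

The assembler emits ROWS*COLS plain mov instructions with constant displacements, so no registers are consumed for loop counters; keep the dimensions small or the code size explodes.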
loki,
There does not seem to be any set rule, even on the same processor. I would always advocate timing different unroll sizes, as I have found the optimum ranging from 0 to 16 depending on the code being unrolled. Almost exclusively the limiting factor in algo speed is memory access speed, and once you have hit that wall, nothing makes it go faster.
You can try reading in larger sizes, DWORD over BYTE etc., but you make more work for yourself handling the uneven ends.
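To show what "handling uneven ends" means in practice, here is a sketch of a byte fill that does the bulk of the work a DWORD at a time and mops up the 0-3 leftover bytes individually. The register assignments (count in ecx, destination in edi, fill byte in al) and the labels are assumptions for the example, not from the thread:

; Sketch: fill ecx bytes at [edi] with the byte in al,
; writing DWORDs for the bulk and bytes for the tail.
    mov ah, al
    mov dx, ax
    shl eax, 16
    mov ax, dx           ; replicate the fill byte into all 4 bytes of eax
    mov edx, ecx
    shr ecx, 2           ; ecx = number of whole dwords
    jz tail
bulk:
    mov [edi], eax       ; 4 bytes per store
    add edi, 4
    dec ecx
    jnz bulk
tail:
    and edx, 3           ; edx = 0..3 leftover bytes
    jz done
bytes:
    mov [edi], al        ; finish one byte at a time
    inc edi
    dec edx
    jnz bytes
done:

The tail loop runs at most three times, so the cost of the extra code is fixed while the dword loop does a quarter of the iterations a byte loop would.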
This code displays the cycle counts for a memory fill loop at unroll counts of 1 to 256. The optimal unroll count will depend on the processor, I think primarily on the size and/or characteristics of the processor's L1 cache.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov esi, halloc(1024)
invoke Sleep, 3000
FOR UR,<1,2,4,8,16,32,64,128,256>
counter_begin 1000, HIGH_PRIORITY_CLASS
xor eax, eax
xor edx, edx
mov ecx, 1024/4/UR
@@:
N=0
REPEAT UR
mov [esi+edx+N*4], eax
N=N+1
ENDM
add edx, 4*UR
dec ecx
jnz @B
counter_end
print ustr$(eax)," cycles, unrolled by "
print ustr$(UR),13,10
ENDM
print chr$(13,10)
counter_begin 1000, HIGH_PRIORITY_CLASS
xor eax, eax
xor edx, edx
mov ecx, 1024/4-1
@@:
mov [esi+ecx*4], eax
dec ecx
jns @B
counter_end
print ustr$(eax)," cycles, optimized for no unroll",13,10,13,10
hfree esi
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Results on my P3:
774 cycles, unrolled by 1
392 cycles, unrolled by 2
352 cycles, unrolled by 4
444 cycles, unrolled by 8
668 cycles, unrolled by 16
580 cycles, unrolled by 32
571 cycles, unrolled by 64
577 cycles, unrolled by 128
580 cycles, unrolled by 256
518 cycles, optimized for no unroll
Edit: The unroll-by-one cycle count is somewhat misleading, because that loop includes an instruction that is not necessary in a loop that is not unrolled, so for reference I added a loop that eliminates it.
I also corrected two errors related to using ecx as both a loop counter and an address component, instead of just as a loop counter.
Here is the result on my old PIV.
548 cycles, unrolled by 1
541 cycles, unrolled by 2
519 cycles, unrolled by 4
519 cycles, unrolled by 8
520 cycles, unrolled by 16
2391 cycles, unrolled by 32
1395 cycles, unrolled by 64
1397 cycles, unrolled by 128
1396 cycles, unrolled by 256
527 cycles, optimized for no unroll
Press any key to exit...
In general, you can unroll a small loop a small number of times for an immediate benefit. As the loop body gets larger, you can unroll it fewer times without overflowing the instruction cache; if it all fits in cache, it's faster. The same goes for nested loops: unroll the inner loop if that doesn't make the code too big, but unrolling both is likely to make it too big.
The processor does actually optimise some loops (mainly, it can skip the decoding step), so it's not always better to unroll, though for small loops it generally is. And it all depends on the size of the processor's cache ::)