Pages: [1]
|
 |
|
Author
|
Topic: for loop with constants (Read 3497 times)
|
loki_dre
Guest
|
is it possible to write code as a small for loop based on a constant that won't consume any registers or variables, but simply copy and paste the given instructions over X times? eg. for(i=0;i<3;i++) mov [ebx],al inc ebx
Which would translate to: mov [ebx],al inc ebx mov [ebx],al inc ebx mov [ebx],al inc ebx
|
|
|
Logged
|
|
|
|
MichaelW
Global Moderator
Member
    
Gender: 
Posts: 5161
|
You can create an "unrolled" loop with the REPEAT directive. REPEAT 3 mov [ebx],al inc ebx ENDM
00401000 8803 mov [ebx],al 00401002 43 inc ebx 00401003 8803 mov [ebx],al 00401005 43 inc ebx 00401006 8803 mov [ebx],al 00401008 43 inc ebx
And many variations are possible, for example: N=0 REPEAT 3 mov [ebx+N*4],eax N=N+1 ENDM
00401009 8903 mov [ebx],eax 0040100B 894304 mov [ebx+4],eax 0040100E 894308 mov [ebx+8],eax
|
|
|
Logged
|
eschew obfuscation
|
|
|
loki_dre
Guest
|
are unrolled loops always faster? any recommendations on a limit?
|
|
|
Logged
|
|
|
|
donkey
|
are unrolled loops always faster? any recommendations on a limit?
Yes, they are always faster though in some cases the gain would be offset by other considerations for example with loops with a large number of iterations the extra coding would be cumbersome and if the branching is predictable within the scope of the branch prediction hardware the actual gains in speed might be offset by code size and instruction caching issues. For myself I will unroll loops but none bigger than around 16 iterations but that's a personal preference and has no basis in any set rules.
|
|
|
Logged
|
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender "It was just a dream, Bender. There's no such thing as two". -- Fry -- Futurama Donkey's Stable
|
|
|
loki_dre
Guest
|
any examples of a nested for loop?? Thanks in advance
|
|
|
Logged
|
|
|
|
loki_dre
Guest
|
examples of a nested unrolled loop I mean
|
|
|
Logged
|
|
|
|
hutch--
Administrator
Member
    
Posts: 12013
Mnemonic Driven API Grinder
|
loki,
There does not seem to be any set rule, even on the same processor. I would always advocate timing different unroll sizes as I have found optimal ranging from 0 to 16 depending on the code to be unrolled. Almost exclusively the limiting factor in algo speed is memory access speed and once you have hit that wall, nothing makes it go faster.
You can try reading in larger sizes, DWORD over BYTE etc ... but you make yourself more work handling uneven ends.
|
|
|
Logged
|
|
|
|
MichaelW
Global Moderator
Member
    
Gender: 
Posts: 5161
|
This code displays the cycle counts for a memory fill loop at unroll counts of 1 to 256. The optimal unroll count will depend on the processor, I think primarily on the size and/or characteristics of the processor’s L1 cache. ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« include \masm32\include\masm32rt.inc .686 include \masm32\macros\timers.asm ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« .data .code ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« start: ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« mov esi, halloc(1024)
invoke Sleep, 3000
FOR UR,<1,2,4,8,16,32,64,128,256>
counter_begin 1000, HIGH_PRIORITY_CLASS xor eax, eax xor edx, edx mov ecx, 1024/4/UR @@: N=0 REPEAT UR mov [esi+edx+N], eax N=N+1 ENDM add edx, 4*UR dec ecx jnz @B counter_end print ustr$(eax)," cycles, unrolled by " print ustr$(UR),13,10
ENDM
print chr$(13,10) counter_begin 1000, HIGH_PRIORITY_CLASS xor eax, eax xor edx, edx mov ecx, 1024/4-1 @@: mov [esi+ecx*4], eax dec ecx jns @B counter_end print ustr$(eax)," cycles, optimized for no unroll",13,10,13,10
hfree ebx
inkey "Press any key to exit..." exit ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« end start
Results on my P3: 774 cycles, unrolled by 1 392 cycles, unrolled by 2 352 cycles, unrolled by 4 444 cycles, unrolled by 8 668 cycles, unrolled by 16 580 cycles, unrolled by 32 571 cycles, unrolled by 64 577 cycles, unrolled by 128 580 cycles, unrolled by 256
518 cycles, optimized for no unroll
Edit: The unroll by one cycle count is somewhat misleading, because the loop includes an instruction that is not necessary for a loop that is not unrolled, so for reference I added a loop that eliminates this instruction. And corrected two errors related to using ecx as a loop counter and an address component, instead of just as a loop counter.
|
|
« Last Edit: April 23, 2008, 09:44:57 AM by MichaelW »
|
Logged
|
eschew obfuscation
|
|
|
hutch--
Administrator
Member
    
Posts: 12013
Mnemonic Driven API Grinder
|
Here is the result on my old PIV. 548 cycles, unrolled by 1 541 cycles, unrolled by 2 519 cycles, unrolled by 4 519 cycles, unrolled by 8 520 cycles, unrolled by 16 2391 cycles, unrolled by 32 1395 cycles, unrolled by 64 1397 cycles, unrolled by 128 1396 cycles, unrolled by 256
527 cycles, optimized for no unroll
Press any key to exit...
|
|
|
Logged
|
|
|
|
Tedd
Procrastinator Extraordinaire
Member
    
Posts: 2210
Reality Bytes
|
In general, you can unroll a small loop a small number of times, for immediate benefit. As the loop gets larger, you can unroll it less without causing it to fill a line of cache - if it all fits in cache then it's faster. The same goes for nested loops - unroll the inner if it doesn't make it too big, but unrolling both is most likely to make it too big. The processor does actually optimise some loops (mainly, it can skip the decoding step), so it's not always better to unroll (but for small loops it generally is.) And it all depends on the size of the processor's cache 
|
|
|
Logged
|
No snowflake in an avalanche feels responsible.
|
|
|
|
Pages: [1]
|
|
|
 |