Topic: Loop unrolling proper usage (Read 5348 times)
zemtex
I do loop unrolls from time to time, but I am looking for a deeper understanding of it. I am also interested in tricks related to register conservation. Examples are welcome, but I prefer theoretical explanations. Post a few simple examples and explain each line with a comment on why you do it that way. Before and after examples are preferred.
I have been puzzling with Lego bricks all my life. I know how to do this. When Peter, at age 6, is competing with me, I find it extremely necessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.
FORTRANS
Hi,
Well, I unroll loops when the loop overhead (counter update, compare, and branch) takes an appreciable amount of time compared to the code inside the loop, and the code overall is limited in performance. Off the top of my head, a line drawing routine and a graphics program took most of my efforts in this area.
The line drawing routine started as a generic Bresenham and ended up as a horrible mess of unrolled, tangled, specialized spaghetti code. I eventually just went with a simplified algorithm to get the performance I wanted.
The graphics program got every optimization I could think of. The loop unrolling, per se, was probably silly, as filling a screen with pixels takes quite a bit of time compared to the loop code. But it was more a part of restructuring the program than a way to improve looping performance. Unrolling by two allowed for a copy from/to buffer one and then a copy from/to buffer two, rather than one buffer with two copies to set things up for the next iteration. So about six copies per iteration went to five (or such).
Regards,
Steve N.
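To make the buffer point concrete, here is a minimal C sketch of the general idea, not Steve's actual program. It assumes a hypothetical step() that must read one buffer and write another; run_rolled, run_unrolled2, and N are made-up names for illustration. In the rolled form every pass ends with a copy back into the first buffer. Unrolled by two, the second pass simply writes back into the first buffer, so the copy disappears for every pair of passes.

```c
#include <stddef.h>
#include <string.h>

#define N 4  /* hypothetical buffer length, for illustration only */

/* Hypothetical out-of-place update: dst must not alias src. */
static void step(const int *src, int *dst)
{
    for (size_t i = 0; i < N; i++)
        dst[i] = src[i] + src[(i + 1) % N];
}

/* Before: each pass computes into scratch, then copies back to state. */
static void run_rolled(int *state, int *scratch, int steps)
{
    for (int s = 0; s < steps; s++) {
        step(state, scratch);
        memcpy(state, scratch, N * sizeof state[0]);  /* set up next pass */
    }
}

/* After: unrolled by two, the buffers swap roles and no copy is needed. */
static void run_unrolled2(int *state, int *scratch, int steps)
{
    for (int s = 0; s + 1 < steps; s += 2) {
        step(state, scratch);   /* pass 1: state   -> scratch */
        step(scratch, state);   /* pass 2: scratch -> state   */
    }
    if (steps % 2 != 0) {       /* odd count: one leftover pass, with copy */
        step(state, scratch);
        memcpy(state, scratch, N * sizeof state[0]);
    }
}
```

Both functions leave the result in state, so one can be swapped for the other. Whether the saved copy is measurable depends on how expensive step() is relative to the copy, which is exactly the kind of thing that has to be timed.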
hutch--
It matters more on some hardware than others, mainly older stuff. It worked OK in some algos on PIV hardware but has almost no effect on Core series and i7 series hardware.
The theory is simple enough: unroll an algo to reduce the loop overhead. But many factors work outside the theory. If the loop content is heavily memory dependent, unrolling it will not matter, as the time taken by the memory operand operations will be the main factor. In some situations where the loop code mainly works on data held directly in registers, there is some potential to get a timing reduction.
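A before and after sketch of that theory, written in C for portability rather than MASM. The function names sum_rolled and sum_unrolled4 are made up for illustration, and the unroll factor of four is an arbitrary choice. The rolled loop pays for a counter update, a compare, and a branch on every element; the unrolled loop pays for them once per four elements and needs a short tail loop for the leftovers.

```c
#include <stddef.h>

/* Before: one add per iteration, plus counter update, compare and branch. */
static long sum_rolled(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* After: unrolled by four, so the loop overhead is paid once per four adds. */
static long sum_unrolled4(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;
    size_t main_count = n - (n % 4);  /* largest multiple of 4 not above n */

    for (; i < main_count; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; i++)                /* tail: the 0 to 3 leftover elements */
        total += a[i];
    return total;
}
```

Note that this loop reads memory on every add, so by the argument above it may gain little or nothing. An optimizing C compiler may also unroll the rolled version on its own, so the two can end up as the same machine code.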
Finally, you set up a timing mechanism and see if unrolling an algo alters its timing. If it's faster then use it, if there is no difference then don't bother, but be aware that sometimes an unroll makes an algo slower.
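A minimal sketch of such a timing mechanism in C, using only the standard clock() call. Everything else here is an assumption for illustration: the names best_time and keep_unrolled, the two small kernels being compared, and the idea of requiring a margin before calling the unrolled version a win. Taking the best of several trials is one common way to damp noise, not the only one.

```c
#include <stddef.h>
#include <time.h>

typedef long (*kernel_fn)(const int *, size_t);

static volatile long sink;  /* stops the compiler discarding the results */

/* Candidate 1: plain loop. */
static long sum_rolled(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Candidate 2: the same loop unrolled by two, with a one-element tail. */
static long sum_unrolled2(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        total += a[i];
        total += a[i + 1];
    }
    if (i < n)
        total += a[i];
    return total;
}

/* Best (smallest) CPU time in seconds over `trials` runs of `reps` calls. */
static double best_time(kernel_fn fn, const int *a, size_t n,
                        int reps, int trials)
{
    double best = -1.0;
    for (int t = 0; t < trials; t++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            sink = fn(a, n);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Keep the unroll only if it beats the rolled time by more than `margin`
   (0.05 means 5%); a tie or a slowdown means don't bother. */
static int keep_unrolled(double rolled, double unrolled, double margin)
{
    return unrolled < rolled * (1.0 - margin);
}
```

Typical use would be to call best_time once for each candidate on the same data and pass the two results to keep_unrolled. The resolution of clock() is coarse, so reps has to be large enough that each trial runs for a measurable time.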
dedndave
What Hutch said. I might add: when you time code, modify it, then time it again, you are optimizing it for your platform. To truly know what is "good", you have to stick it in the Laboratory sub-forum and see how it performs on a range of processors and operating systems.
FORTRANS
Quote from dedndave: "see how it performs on a range of processors and operating systems"

Hi,
Just as a joke mind you, here are some results from that graphics program I mentioned. The P-III Win2k and AMD systems produced erratic results. The target was the 200LX, and some things that sped up the development machines slowed it down. Many things that sped up the others had little or no effect on the 200LX as well, but usually were left in.

200LX, 80186 16MHz, MS-DOS 5.0, ~2.2x
    +1.96816912E+000 Iterations per second. PSEUDO5T, mains, A.O.T. battery
    +4.35200000E+000 Iterations per second. PSEUDO7

Pentium 90, WD 90C33, DOS, ~5.0x
    +2.49228395E+001 Iterations per second. PSEUDO5T
    +1.24106907E+002 Iterations per second. PSEUDO7

Pentium III 800, Matrox G400, OS/2 VDM, ~17.2x
    +3.87951807E+001 Iterations per second. PSEUDO5T
    +6.68239356E+002 Iterations per second. PSEUDO7

Pentium III 800, Matrox G400, W2k
    +2.34104046E+001 Iterations per second. PSEUDO5T
    +3.79800853E+001 Iterations per second. PSEUDO5T
    +4.43276284E+002 Iterations per second. PSEUDO7
    +6.39156627E+002 Iterations per second. PSEUDO7

AMD 64 @ 2 GHz, WinXP, ~31.5x
    +4.21271764E+001 Iterations per second. PSEUDO5T
    +1.32561728E+003 Iterations per second. PSEUDO7

Cheers,
Steve N.
jj2007
It is very difficult to achieve reliable timings for the P4. Attached a set of macros based on MichaelW's Timer.asm designed for that purpose.

    .nolist
    include \masm32\include\masm32rt.inc
    .686
    .xmm
    ; ###### these macros improve drastically the consistency of timings on the P4 #######
    include \masm32\MasmBasic\Cyct_Macros.inc

    .code
    start:
        REPEAT 10
            cyct_begin
                invoke GetTickCount
            cyct_end <GetTickCount>
        ENDM
        exit
    end start