Title: Loop unrolling proper usage
Post by: zemtex on April 22, 2012, 03:06:19 PM

I do loop unrolls from time to time, but I am looking for some deeper understanding of it. I am also interested in tricks related to register conservation. Examples are welcome, but I prefer theoretical explanations.
Post a few simple examples, and explain each line with a comment on why you do it one way rather than another. Before-and-after examples are preferred.

Title: Re: Loop unrolling proper usage
Post by: shlomok on April 22, 2012, 03:17:57 PM

Hi,
There are some examples here: http://www.mark.masmcode.com/

Too advanced for me, but they might make sense for you.

Title: Re: Loop unrolling proper usage
Post by: FORTRANS on April 22, 2012, 03:39:01 PM

Hi,
Well, I unroll loops when the loop overhead takes an appreciable amount of time compared to the time taken by the code inside the loop, and the code overall is limited in performance. Off the top of my head, a line drawing routine and a graphics program took most of my efforts in this area.

The line drawing routine started as a generic Bresenham and ended up as a horrible mess of unrolled, tangled, specialized, spaghetti code. I eventually just went with a simplified algorithm to get the performance I wanted.

The graphics program got every optimization I could think of. The loop unrolling, per se, was probably silly, as filling a screen with pixels takes quite a bit of time compared to the loop code. But it was a part of restructuring the program more than for looping performance. Unrolling by two allowed for a copy from/to buffer one and then a copy from/to buffer two, rather than one buffer with two copies to set things up for the next iteration. So about six copies per iteration went to five (or such).

Regards,

Steve N.

Title: Re: Loop unrolling proper usage
Post by: hutch-- on April 22, 2012, 03:45:11 PM

It matters more on some hardware than others, mainly older stuff. It worked OK in some algos on PIV hardware but has almost no effect with the Core series and i7 series hardware.
The theory is simple enough: unroll an algo to reduce the loop overhead. But many factors work outside the theory. If the loop content is heavily memory dependent, then unrolling it will not matter, as the time taken by the memory operand operations will be the main factor. In some situations where the loop code works mainly on data held directly in registers, there is some potential to get a timing reduction.

Finally, you set up a timing mechanism and see if unrolling an algo alters its timing. If it's faster, then use it; if there is no difference, then don't bother. But be aware that sometimes an unroll makes an algo slower.

Title: Re: Loop unrolling proper usage
Post by: dedndave on April 22, 2012, 04:55:26 PM

what Hutch said :P
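A minimal before/after sketch of the theory described above, written in C rather than MASM so it can be compiled anywhere. The function names `sum_rolled` and `sum_unrolled4` are made up for illustration. Whether the unrolled version is actually faster on a given CPU is something only a timing run can tell you.

```c
#include <stddef.h>
#include <stdint.h>

/* Before: one add per iteration, so the counter update, compare and
   branch are paid once for every element. */
uint32_t sum_rolled(const uint32_t *a, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* After: unrolled by four. The loop overhead is paid once per four
   elements. Four separate accumulators keep the adds independent of
   each other, at the cost of three extra registers. */
uint32_t sum_unrolled4(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main body: runs while at least four elements remain. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    uint32_t s = s0 + s1 + s2 + s3;

    /* Tail: the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

The separate accumulators are the register trade-off: a single accumulator would use fewer registers but chain every add to the previous one. An optimizing compiler may unroll the first version on its own, so the difference is clearest with optimization turned down or in hand-written assembly.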
...i might add...
when you time code and modify it, then time it again, you are optimizing it for your platform
to truly know what is "good", you have to stick it in the laboratory sub-forum
see how it performs on a range of processors and operating systems

Title: Re: Loop unrolling proper usage
Post by: FORTRANS on April 22, 2012, 07:13:47 PM

Quote from: dedndave on April 22, 2012, 04:55:26 PM
see how it performs on a range of processors and operating systems

Hi,

Just as a joke mind you, here are some results from that graphics program I mentioned. The P-III Win2k and AMD systems produced erratic results. The target was the 200LX, and some things that sped up the development machines slowed it down. Many things that sped up the others had little or no effect on the 200LX as well, but usually were left in.

Code:
200LX, 80186 16MHz, MS-DOS 5.0, ~2.2x
+1.96816912E+000 Iterations per second. PSEUDO5T, mains, A.O.T. battery
+4.35200000E+000 Iterations per second. PSEUDO7

Pentium 90, WD 90C33, DOS, ~5.0x
+2.49228395E+001 Iterations per second. PSEUDO5T
+1.24106907E+002 Iterations per second. PSEUDO7

Pentium III 800, Matrox G400, OS/2 VDM, ~17.2
+3.87951807E+001 Iterations per second. PSEUDO5T
+6.68239356E+002 Iterations per second. PSEUDO7

Pentium III 800, Matrox G400, W2k
+2.34104046E+001 Iterations per second. PSEUDO5T
+3.79800853E+001 Iterations per second. PSEUDO5T
+4.43276284E+002 Iterations per second. PSEUDO7
+6.39156627E+002 Iterations per second. PSEUDO7

AMD 64 @ 2 GHz, WinXP, ~31.5x
+4.21271764E+001 Iterations per second. PSEUDO5T
+1.32561728E+003 Iterations per second. PSEUDO7

Cheers,

Steve N.

Title: Re: Loop unrolling proper usage
Post by: jj2007 on April 22, 2012, 08:21:09 PM

It is very difficult to achieve reliable timings for the P4. Attached is a set of macros based on MichaelW's Timer.asm designed for that purpose.
Code:
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
; ###### these macros improve drastically the consistency of timings on the P4 #######
include \masm32\MasmBasic\Cyct_Macros.inc

.code
start:
  REPEAT 10
    cyct_begin
      invoke GetTickCount
    cyct_end <GetTickCount>
  ENDM
  exit
end start
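The time-it-and-compare workflow recommended in this thread can also be sketched portably in C with the standard `clock()` function. This is not a substitute for the cycle-counting macros above, since `clock()` is far coarser. The helper `time_it` and the routines `sum_plain` and `sum_by2` are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Rolled reference version. */
uint32_t sum_plain(const uint32_t *a, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Candidate: unrolled by two, with a one-element tail. */
uint32_t sum_by2(const uint32_t *a, size_t n)
{
    uint32_t s = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        s += a[i] + a[i + 1];
    if (i < n)
        s += a[i];
    return s;
}

/* Call fn `reps` times and return the elapsed CPU time in seconds.
   The accumulated result goes through a volatile and is handed back
   via `out`, so the compiler cannot discard the work being timed. */
double time_it(sum_fn fn, const uint32_t *a, size_t n, int reps,
               uint32_t *out)
{
    volatile uint32_t sink = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        sink += fn(a, n);
    clock_t t1 = clock();
    *out = sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

A caller would run `time_it` on both routines with the same data and a repetition count large enough to give a measurable time, check that both return the same result, and then compare the two durations. As noted earlier in the thread, a win on one machine says nothing certain about another, so the comparison needs repeating on each target.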
The MASM Forum Archive 2004 to 2012