Loop unrolling proper usage

zemtex · April 22, 2012, 03:06:19 PM

I unroll loops from time to time, but I am looking for a deeper understanding of it. I am also interested in tricks related to register conservation. Examples are welcome, but I prefer theoretical explanations.
Please post a few simple examples and comment each line to explain why it is done that way. Before-and-after examples are preferred.

shlomok · April 22, 2012, 03:17:57 PM

Hi,
There are some examples here: http://www.mark.masmcode.com/

Too advanced for me, but they might make sense to you.

FORTRANS · April 22, 2012, 03:39:01 PM

Hi,

Well I unroll loops when the loop code is taking an appreciable
amount of time compared to the time taken by the code inside
the loop and the code overall is limited in performance. Off the
top of my head, a line drawing routine and a graphics program
took most of my efforts in this area.

The line drawing routine started as a generic Bresenham and
ended up as a horrible mess of unrolled, tangled, specialized,
spaghetti code. I eventually just went with a simplified algorithm
to get the performance I wanted.

The graphics program got every optimization I could think of.
The loop unrolling, per se, was probably silly as filling a screen with
pixels takes quite a bit of time compared to the loop code. But it
was a part of restructuring the program more than for looping
performance. Unrolling by two allowed for a copy from/to buffer
one and then a copy from/to buffer two rather than one buffer
with two copies to set things up for the next iteration. So about
six copies per iteration went to five (or such).
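
A minimal sketch in C of how unrolling by two can remove a set-up copy. This is not the graphics program described above: the `step` workload, the `CELLS` size and both function names are hypothetical, chosen only to show the ping-pong idea. The rolled version writes into a scratch buffer and copies it back every pass; the unrolled version alternates the roles of the two buffers, so the copy is only needed for a leftover odd pass.

```c
#include <string.h>

#define CELLS 8   /* hypothetical buffer size for the illustration */

/* Hypothetical per-pass work: each cell becomes itself plus its
   left neighbour, wrapping around at the start of the buffer. */
static void step(const int *src, int *dst)
{
    for (int i = 0; i < CELLS; i++)
        dst[i] = src[i] + src[(i + CELLS - 1) % CELLS];
}

/* Before: one pass per iteration, then a copy back into buf
   to set things up for the next iteration. */
void run_rolled(int *buf, int passes)
{
    int tmp[CELLS];
    for (int p = 0; p < passes; p++) {
        step(buf, tmp);
        memcpy(buf, tmp, sizeof tmp);
    }
}

/* After: two passes per iteration. The second pass reads what the
   first one wrote and lands back in buf, so no copy is needed. */
void run_unrolled2(int *buf, int passes)
{
    int other[CELLS];
    int p = 0;
    for (; p + 2 <= passes; p += 2) {
        step(buf, other);   /* buf   -> other */
        step(other, buf);   /* other -> buf   */
    }
    if (p < passes) {       /* odd pass count: one leftover pass */
        step(buf, other);
        memcpy(buf, other, sizeof other);
    }
}
```

Both routines should leave the same data in `buf` for any pass count. Whether the saved copy is measurable depends on how expensive the pass itself is.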

Regards,

Steve N.

hutch-- · April 22, 2012, 03:45:11 PM

It matters more on some hardware than others, mainly older stuff. It worked OK in some algos on PIV hardware but has almost no effect on the Core series and i7 series hardware.

The theory is simple enough: unroll an algo to reduce the loop overhead. But many factors work outside the theory. If the loop content is heavily memory dependent, unrolling it will not matter, as the time taken by the memory operand operations will be the main factor. In situations where the loop code works mainly on data held directly in registers, there is some potential to get a timing reduction.
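
A before-and-after sketch of that theory, written in C rather than MASM so it can be built anywhere. The function names are made up for the illustration. The rolled loop pays for one compare, one index update and one branch per element. The version unrolled by four pays for them once per four elements, at the cost of more code and, with four separate accumulators, more live registers. An optimizing compiler may already unroll the first loop on its own, so the difference is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Before: one add per iteration, plus the loop overhead
   (index update, compare, branch) for every element. */
uint32_t sum_rolled(const uint32_t *a, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* After: four adds per iteration, so the loop overhead is paid
   once per four elements. The four accumulators are independent,
   so no add has to wait for the previous one, but they occupy
   four registers instead of one. */
uint32_t sum_unrolled4(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* 0 to 3 leftover elements */
        s0 += a[i];

    /* unsigned arithmetic wraps, so the regrouped sum is identical */
    return s0 + s1 + s2 + s3;
}
```

The tail loop matters: without it the unrolled version is only correct when the count is a multiple of four.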

Finally you set up a timing mechanism and see if unrolling an algo alters its timing. If it's faster then use it, if there is no difference then don't bother, but be aware that sometimes an unroll makes an algo slower.
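
A rough sketch of such a timing mechanism in portable C, using only the standard `clock()` call. The names `best_seconds`, `work_fn` and `count_call` are hypothetical, and this is far coarser than a cycle-counting macro. It times `reps` calls several times over and keeps the fastest run, on the assumption that the slower runs were disturbed by other activity on the machine.

```c
#include <time.h>

/* Hypothetical signature for the routine being timed. */
typedef void (*work_fn)(void *ctx);

/* Calls fn(ctx) `reps` times, repeats that `runs` times, and
   returns the fastest run in seconds of processor time. */
double best_seconds(work_fn fn, void *ctx, int reps, int runs)
{
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        clock_t t0 = clock();
        for (int k = 0; k < reps; k++)
            fn(ctx);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || t < best)
            best = t;
    }
    return best;
}

/* Trivial stand-in workload: counts how often it was called.
   Replace it with a wrapper around the rolled or unrolled code. */
void count_call(void *ctx)
{
    *(unsigned long *)ctx += 1;
}
```

Timing the rolled and the unrolled routine on the same data with the same `reps` gives the two numbers to compare. `reps` has to be large enough that one run takes many clock ticks, otherwise both results come out as zero.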

dedndave · April 22, 2012, 04:55:26 PM

what Hutch said :P

...i might add...
when you time code and modify it, then time it again,
you are optimizing it for your platform
to truly know what is "good", you have to stick it in the laboratory sub-forum
see how it performs on a range of processors and operating systems

FORTRANS · April 22, 2012, 07:13:47 PM

Quote from: dedndave on April 22, 2012, 04:55:26 PM
see how it performs on a range of processors and operating systems

Hi,

Just as a joke mind you, here are some results from that
graphics program I mentioned. The P-III Win2k and AMD
systems produced erratic results.

The target was the 200LX, and some things that sped up
the development machines slowed it down. Many things that
sped up the others had little or no effect on the 200LX as well,
but usually were left in.


200LX, 80186 16MHz, MS-DOS 5.0, ~2.2x
+1.96816912E+000 Iterations per second.  PSEUDO5T, mains, A.O.T. battery
+4.35200000E+000 Iterations per second.  PSEUDO7

Pentium 90, WD 90C33, DOS, ~5.0x
+2.49228395E+001 Iterations per second.  PSEUDO5T
+1.24106907E+002 Iterations per second.  PSEUDO7

Pentium III 800, Matrox G400, OS/2 VDM, ~17.2x
+3.87951807E+001 Iterations per second.  PSEUDO5T
+6.68239356E+002 Iterations per second.  PSEUDO7

Pentium III 800, Matrox G400, W2k
+2.34104046E+001 Iterations per second.  PSEUDO5T
+3.79800853E+001 Iterations per second.  PSEUDO5T
+4.43276284E+002 Iterations per second.  PSEUDO7
+6.39156627E+002 Iterations per second.  PSEUDO7

AMD 64 @ 2 GHz, WinXP, ~31.5x
+4.21271764E+001 Iterations per second.  PSEUDO5T
+1.32561728E+003 Iterations per second.  PSEUDO7

Cheers,

Steve N.

jj2007 · April 22, 2012, 08:21:09 PM

It is very difficult to achieve reliable timings for the P4. Attached is a set of macros based on MichaelW's Timer.asm, designed for that purpose.

.nolist
include \masm32\include\masm32rt.inc

.686
.xmm

; ###### these macros improve drastically the consistency of timings on the P4 #######
include \masm32\MasmBasic\Cyct_Macros.inc

.code
start:
	REPEAT 10
	cyct_begin
		invoke GetTickCount
	cyct_end <GetTickCount>
	ENDM
	exit

end start
