Loop unrolling proper usage

Started by zemtex, April 22, 2012, 03:06:19 PM


zemtex

I do loop unrolls from time to time, but I am looking for some deeper understanding of it. I am also interested in tricks related to register conservation. Examples are welcome, but I prefer theoretical explanations.
Post a few examples, simple ones and explain each line with a comment why you do it like this and like that. Before and After examples are preferred.

shlomok

Hi,
There are some examples here: http://www.mark.masmcode.com/

Too advanced for me, but they might make sense to you.

FORTRANS

Hi,

   Well, I unroll loops when the loop overhead is taking an
appreciable amount of time compared to the time taken by the code
inside the loop, and the code overall is performance limited.  Off
the top of my head, a line drawing routine and a graphics program
took most of my efforts in this area.

   The line drawing routine started as a generic Bresenham and
ended up as a horrible mess of unrolled, tangled, specialized,
spaghetti code.  I eventually just went with a simplified algorithm
to get the performance I wanted.

   The graphics program got every optimization I could think of.
The loop unrolling, per se, was probably silly as filling a screen with
pixels takes quite a bit of time compared to the loop code.  But it
was a part of restructuring the program more than for looping
performance.  Unrolling by two allowed for a copy from/to buffer
one and then a copy from/to buffer two rather than one buffer
with two copies to set things up for the next iteration.  So about
six copies per iteration went to five (or such).
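
The general idea of unrolling by two to drop a copy can be sketched in C. This is only an illustration, not the actual graphics program: the `step` function, the buffer names and the averaging rule are all made up for the example. The "before" version computes into a scratch buffer and copies back every iteration; the "after" version alternates the roles of the two buffers inside one unrolled iteration, so the copy is only needed when the step count is odd.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical update rule: each cell becomes the average of its two
   neighbours (the edges reuse their own value as the missing neighbour). */
static void step(const unsigned char *src, unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned left  = src[i > 0 ? i - 1 : i];
        unsigned right = src[i + 1 < n ? i + 1 : i];
        dst[i] = (unsigned char)((left + right) / 2);
    }
}

/* Before: one step per iteration, then a copy to set up the next one. */
void run_with_copy(unsigned char *buf, unsigned char *scratch,
                   size_t n, unsigned steps)
{
    for (unsigned s = 0; s < steps; s++) {
        step(buf, scratch, n);
        memcpy(buf, scratch, n);
    }
}

/* After: unrolled by two. The buffers swap roles, so no copy is needed
   inside the main loop. */
void run_unrolled2(unsigned char *buf, unsigned char *scratch,
                   size_t n, unsigned steps)
{
    unsigned s = 0;
    for (; s + 2 <= steps; s += 2) {
        step(buf, scratch, n);      /* buffer one -> buffer two */
        step(scratch, buf, n);      /* buffer two -> buffer one */
    }
    if (s < steps) {                /* odd count: one step plus one copy */
        step(buf, scratch, n);
        memcpy(buf, scratch, n);
    }
}
```

Both versions leave the same result in `buf`. The saving here comes from the removed copy rather than from the reduced loop overhead.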

Regards,

Steve N.

hutch--

It matters more on some hardware than others, mainly older stuff. It worked OK in some algos on PIV hardware but has almost no effect on the Core series and i7 series hardware.

The theory is simple enough: unroll an algo to reduce the loop overhead. But many factors work outside the theory. If the loop content is heavily memory dependent, unrolling it will not matter, as the time taken by the memory operand operations will be the main factor. In some situations where the loop code works mainly on data held directly in registers, there is some potential to get a timing reduction.
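
As a before and after illustration of the theory, here is a minimal sketch in C rather than assembler, so the structure is easy to see. The function names are made up for the example, and a compiler may well unroll the first version on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Before: one add per iteration, plus one compare-and-branch per element. */
uint32_t sum_rolled(const uint32_t *a, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* After: unrolled by four. The loop test runs once per four elements,
   and four separate accumulators avoid one long dependency chain. */
uint32_t sum_unrolled4(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```

The trade-off for register conservation is visible here: the four accumulators cost three extra registers. On 32-bit x86, with its small set of general-purpose registers, a single accumulator with four adds per iteration still cuts the loop overhead, though it keeps the dependency chain.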

Finally, you set up a timing mechanism and see if unrolling an algo alters its timing. If it's faster then use it; if there is no difference then don't bother, but be aware that sometimes an unroll makes an algo slower.
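
A timing mechanism of that kind might look like the sketch below. It is a rough harness in portable C using the standard `clock()` function, not the timing macros normally used on this forum. The names `best_time` and `algo_under_test` are invented for the example. It runs the algo many times per measurement and keeps the best of several measurements, which smooths out some of the noise.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Placeholder for the algo being timed (hypothetical). */
uint32_t algo_under_test(const uint32_t *a, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Returns the smallest processor time, in seconds, seen over `runs`
   measurements of `reps` calls each. Returns -1.0 if runs <= 0. */
double best_time(sum_fn fn, const uint32_t *a, size_t n, int reps, int runs)
{
    volatile uint32_t sink = 0;   /* keeps the calls from being optimized away */
    double best = -1.0;

    for (int r = 0; r < runs; r++) {
        clock_t t0 = clock();
        for (int k = 0; k < reps; k++)
            sink += fn(a, n);
        clock_t t1 = clock();

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}
```

Call `best_time` once with the rolled version and once with the unrolled one, using the same data and a `reps` count large enough that each measurement lasts well above the resolution of `clock()`, then compare the two numbers.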

dedndave

what Hutch said   :P

...i might add...
when you time code and modify it, then time it again,
you are optimizing it for your platform
to truly know what is "good", you have to stick it in the laboratory sub-forum
see how it performs on a range of processors and operating systems

FORTRANS

Quote from: dedndave on April 22, 2012, 04:55:26 PM
see how it performs on a range of processors and operating systems

Hi,

   Just as a joke mind you, here are some results from that
graphics program I mentioned.  The P-III Win2k and AMD
systems produced erratic results.

   The target was the 200LX, and some things that sped up
the development machines slowed it down.  Many things that
sped up the others had little or no effect on the 200LX as well,
but usually were left in.


200LX, 80186 16MHz, MS-DOS 5.0, ~2.2x
+1.96816912E+000 Iterations per second.  PSEUDO5T, mains, A.O.T. battery
+4.35200000E+000 Iterations per second.  PSEUDO7

Pentium 90, WD 90C33, DOS, ~5.0x
+2.49228395E+001 Iterations per second.  PSEUDO5T
+1.24106907E+002 Iterations per second.  PSEUDO7

Pentium III 800, Matrox G400, OS/2 VDM, ~17.2x
+3.87951807E+001 Iterations per second.  PSEUDO5T
+6.68239356E+002 Iterations per second.  PSEUDO7

Pentium III 800, Matrox G400, W2k
+2.34104046E+001 Iterations per second.  PSEUDO5T
+3.79800853E+001 Iterations per second.  PSEUDO5T
+4.43276284E+002 Iterations per second.  PSEUDO7
+6.39156627E+002 Iterations per second.  PSEUDO7

AMD 64 @ 2 GHz, WinXP, ~31.5x
+4.21271764E+001 Iterations per second.  PSEUDO5T
+1.32561728E+003 Iterations per second.  PSEUDO7
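
The "~N.Nx" figures in the headings match PSEUDO7 divided by PSEUDO5T for each system. A small C sketch of that arithmetic, using only the numbers listed above (the W2k runs are left out since they were erratic); the struct and function names are made up for the example:

```c
#include <stdio.h>

struct result {
    const char *system;
    double pseudo5t;    /* iterations per second, PSEUDO5T */
    double pseudo7;     /* iterations per second, PSEUDO7  */
};

static const struct result results[] = {
    { "200LX, 80186 16MHz",          1.96816912,    4.352      },
    { "Pentium 90",                 24.9228395,   124.106907   },
    { "Pentium III 800, OS/2 VDM",  38.7951807,   668.239356   },
    { "AMD 64 @ 2 GHz",             42.1271764,  1325.61728    },
};

/* Speedup of PSEUDO7 over PSEUDO5T on one system. */
double speedup(const struct result *r)
{
    return r->pseudo7 / r->pseudo5t;
}

/* Prints one line per system with its speedup factor. */
void print_speedups(void)
{
    for (size_t i = 0; i < sizeof results / sizeof results[0]; i++)
        printf("%-28s ~%.1fx\n", results[i].system, speedup(&results[i]));
}
```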


Cheers,

Steve N.

jj2007

It is very difficult to achieve reliable timings on the P4. Attached is a set of macros, based on MichaelW's Timer.asm, designed for that purpose.

.nolist
include \masm32\include\masm32rt.inc

.686
.xmm

; ###### these macros drastically improve the consistency of timings on the P4 #######
include \masm32\MasmBasic\Cyct_Macros.inc

.code
start:
    REPEAT 10                       ; assemble the timed block ten times
        cyct_begin                  ; start of the timed section
        invoke GetTickCount         ; code under test
        cyct_end <GetTickCount>     ; end of the timed section
    ENDM
    exit

end start