why is "add" faster than "inc"

Started by thomas_remkus, April 28, 2006, 08:17:49 PM


EduardoS

Quote from: MichaelW on May 01, 2006, 03:13:19 PM
I don't understand why, but if I change the INC ECX after the test_word label back to ADD ECX, 1 the cycle count for a P3 drops from 1332 to about 1281.
Here, when I change one or two INCs to ADD the time doesn't change, but when I change all of them it drops. I tested different alignments and the clock count varied from 1200 to 1600, so I guess the difference in the third algorithm is due to alignment (another topic?).

EduardoS

Yes, it was alignment. Putting an xchg ax, ax after each inc/dec (so the code has the same size as the add/sub version):

1063 cycles, cmpmem
1063 cycles, cmpmem_incdec
824 cycles, szRev
821 cycles, szRev_incdec
1265 cycles, szWcnt
1265 cycles, szWcnt_incdec
Press any key to exit...

[attachment deleted by admin]
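Since the attachment is gone, here is a rough sketch of the padding trick (my reconstruction, not the original code): the short form of INC reg is 1 byte, ADD reg, 1 assembles to 3 bytes, and the 16-bit xchg ax, ax (66h 90h) supplies the missing 2 bytes, so the inc/dec routines keep the same code layout as the add/sub ones:

    ; illustration only -- the byte counts are the point, not the registers
    add ecx, 1          ; 3 bytes (83 C1 01)

    inc ecx             ; 1 byte  (41)
    xchg ax, ax         ; 2 bytes (66 90), executes as a NOP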

Mark Jones

Quote from: MichaelW on May 01, 2006, 01:41:00 PM
The attachment this time tests INC/DEC versions of the MASM32 cmpmem, szRev, and szWcnt procedures against the original ADD/SUB versions. I simply replaced each ADD reg, 1 with INC reg and each SUB reg, 1 with DEC reg.

Interesting, I tried to run it from WinRAR and it crashed (AMD XP 2500+). It runs fine if extracted manually and run. Only time this has ever happened that I can recall.


1083 cycles, cmpmem
1086 cycles, cmpmem_incdec
911 cycles, szRev
871 cycles, szRev_incdec
1467 cycles, szWcnt
1544 cycles, szWcnt_incdec
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

asmfan

Guys, if you need fair tests, use a non-multitasking OS (DOS, for example); otherwise any difference in results can be attributed to measurement error...
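Short of booting DOS, a common compromise under Windows (a generic sketch, not something from this thread; dwStartLo and dwStartHi are assumed DWORD variables) is to raise the priority of the timing thread and serialize RDTSC around the code under test:

    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS
    invoke GetCurrentThread
    invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

    xor eax, eax
    cpuid                       ; serialize before reading the TSC
    rdtsc
    mov dwStartLo, eax
    mov dwStartHi, edx

    ; ... code under test ...

    xor eax, eax
    cpuid                       ; serialize again
    rdtsc
    sub eax, dwStartLo          ; elapsed cycles, low 32 bits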
Russia is a weird place

msmith

BogdanOntanu said:

Quote
There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC ...

I made the following program to test this on my compiler:


; Dimension Variables
DIM I AS LONG
DIM StartTime AS LONG

OBMain.CREATE

END EVENT

Button1.COMMAND
StartTime=GETTICKCOUNT()
FOR I=1 to 2000000000
; Do nothing inside loop
NEXT I
Button1.TEXT=STR$((GETTICKCOUNT()-StartTime))
END EVENT



Here is the .asm code for inc:

; LN:11 FOR I=1 to 2000000000
mov eax,1
mov [I],eax
mov eax,2000000000
mov [_LopVec1],eax
_Lbl2:
mov eax,[I]
cmp eax,[_LopVec1]
jg _Lbl4
; LN:12 ; Do nothing inside loop
; LN:13 NEXT I
_Lbl3:
inc [I]
jmp _Lbl2
_Lbl4:


Here is the .asm code for add:

; LN:11 FOR I=1 to 2000000000
mov eax,1
mov [I],eax
mov eax,2000000000
mov [_LopVec1],eax
_Lbl2:
mov eax,[I]
cmp eax,[_LopVec1]
jg _Lbl4
; LN:12 ; Do nothing inside loop
; LN:13 NEXT I
_Lbl3:
mov eax,[I]
add eax,1
mov [I],eax
jmp _Lbl2
_Lbl4:


With an inc instruction in the "NEXT" increment code, the program runs in 5.2 seconds.

After modifying the compiler to generate a mov, add 1, mov sequence to increment the loop variable, the program was recompiled and ran in 4.2 seconds.

This is a 22% improvement!

Changing the compiler to do this took less than a minute of work. The compiler can output either type of code with equal ease. In fact, it wouldn't be that difficult to detect the type of machine and output the inc or add code on the fly.

EduardoS

The P4 breaks inc into two uops, so add is much faster on it; the same doesn't happen on the Athlon.
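One commonly cited reason (my illustration, not EduardoS's code): INC and DEC write every arithmetic flag except CF, so a later instruction that reads CF depends on two different flag producers, which the P4 handles poorly:

    add eax, 1          ; writes all flags, including CF
    adc edx, 0          ; CF comes straight from the ADD above

    inc eax             ; writes all flags EXCEPT CF
    adc edx, 0          ; CF still comes from some older instruction while
                        ; the other flags come from the INC -- the CPU has
                        ; to merge two flag sources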

msmith

EduardoS,

Thanks for the info.

The problem with sensing the processor type and generating the optimized code is that it is easy to do so at compile time, but much harder at run time.

If I sense the processor type at compile time and generate the best code, there is no telling what kind of machine the .exe will be run on.
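One way a compiler could square that circle (a hypothetical sketch, not msmith's compiler output; the label names, pLoopBody, and the bare family check are all assumptions) is to emit both loop forms and pick one at program start with CPUID:

    .data
    pLoopBody dd LoopBody_Add          ; default to the ADD/SUB form

    .code
    LoopBody_Add proc                  ; stub: ADD reg, 1 flavour of the loop
        ret
    LoopBody_Add endp

    LoopBody_Inc proc                  ; stub: INC/DEC flavour of the loop
        ret
    LoopBody_Inc endp

    ChooseLoopBody proc
        push ebx                       ; CPUID clobbers EBX, ECX, EDX
        mov eax, 1
        cpuid                          ; EAX = stepping/model/family
        shr eax, 8
        and eax, 0Fh                   ; isolate the family field
        cmp eax, 0Fh                   ; 0Fh = NetBurst (P4); crude, since
                                       ; AMD K8 also reports family 0Fh --
                                       ; a real check would read the vendor string too
        je  @F                         ; on a P4 keep the ADD/SUB version
        mov pLoopBody, offset LoopBody_Inc ; elsewhere INC/DEC costs nothing
    @@:
        pop ebx
        ret
    ChooseLoopBody endp

Call ChooseLoopBody once at startup, then call through pLoopBody (call dword ptr pLoopBody) wherever the loop is needed.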

The 22% speed improvement on each iteration of a FOR LOOP is significant, especially if there is not much inside the loop.

Mike

BogdanOntanu

The problem with such simple "do almost nothing" loops is that they never represent the real behaviour of a compiler in a big or medium application.
Basically, you will never have such empty loops in real-life applications and algorithms.

I have seen compilers "forget" about variable assignments under stress conditions, or use constructs like:

movzx edx, word ptr [esi]  ... correct because the pointer was typed "word"

and then:

movzx ecx, dx   --> funny, is it not? Apparently edx is also considered "word" :P

sometimes on the very next instruction...
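For what it's worth (my reading of the example, not Bogdan's wording), the second instruction is redundant because the first MOVZX already zero-extended the word into all of EDX:

    movzx edx, word ptr [esi]   ; EDX = zero-extended word from memory
    mov   ecx, edx              ; would be enough -- the upper 16 bits of EDX are already 0
    ; instead the compiler emitted:
    ; movzx ecx, dx             ; re-extends a value that is already zero-extended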

I am always amazed at the huge credit compilers get from such simple tests :D
when in fact they perform much worse on really complicated algorithms...


Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

msmith

Quote
The problem with such simple "do almost nothing" loops is that they never represent the real behaviour of a compiler in a big or medium application.
Basically, you will never have such empty loops in real-life applications and algorithms.

It is true that a contrived test never fully represents real behavior.

The behavior of the processor (cache loading, branch lookahead, etc.) will likely change, but not the behavior of the compiler.

No test in life is really representative. On your driving test, you have a policeman in your car to check you. Do you drive the same with a policeman in the car as when you don't? Every test in life is contrived, but the advantage of such tests is the possibility of testing just one thing in isolation. If the inc/add test had complex code in the loop, we would not know for sure why one example was faster than the other.

EduardoS

I tested szWcnt_incdec on an Athlon XP and tried some variations. I didn't change the code, just the code alignment; the best variation was a kind of "out of logic" alignment:

1451 cycles, szWcnt
1334 cycles, szWcnt_incdec <-- best variation


With some comments:

szWcnt_incdec proc src:DWORD,txt:DWORD

    push ebx
    push esi
    push edi

    mov edx, len([esp+16])      ; procedure call for src length
    sub edx, len([esp+20])      ; procedure call for 1st text length

    mov eax, -1
    mov esi, [esp+16]           ; source in ESI
    add edx, esi                ; add src to exit position
    xor ebx, ebx                ; clear EBX to prevent stall with BL
    add edx, 1                  ; correct to get last word
    mov edi, [esp+20]           ; text to count in EDI
    sub esi, 1
align 16
  pre:
    INC EAX     ; was faster with this instruction aligned to a 16-byte boundary
    xchg ax, ax
    nop
  wcst:         
    add esi, 1   ; For the best performance, the distance from the start of the inc above must be 4 bytes and
                    ; there must be two instructions (nops) between them (don't ask why)
    xchg ax, ax
    nop

    cmp edx, esi ; the distance from the start of the add must be 6 bytes (don't ask why) and
                      ; there must be two instructions (nops) between them (again, don't ask why)
    jle wcout                  ;
    mov bl, [esi]               ;
    cmp bl, [edi]               ; all of these instructions must be in sequence
    jne wcst                    ;

    nop                   ;
    xor ecx, ecx       ; the inc must come right after the xor, and there must be a nop before the xor and a nop after the inc
  test_word:           ; the total length of these 4 instructions must be 6 bytes
    INC ECX             ;
    xchg ax, ax        ;

    cmp BYTE PTR [edi+ecx], 0   ;
    je pre                      ;
    mov bl, [esi+ecx]           ;  Must be in sequence
    cmp bl, [edi+ecx]           ;
    jne wcst                    ;
     jmp test_word               ;
  wcout:

    pop edi
    pop esi
    pop ebx

    ret 8

szWcnt_incdec endp



I really don't understand why this alignment was the fastest on the Athlon XP 2000+.