The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: thomas_remkus on April 28, 2006, 08:17:49 PM

Title: why is "add" faster than "inc"
Post by: thomas_remkus on April 28, 2006, 08:17:49 PM
I'm in a tight loop, doing basically nothing but adding numbers. When I use "inc eax" it's much slower than "add eax, 1". Why is this? The difference is significant.
Title: Re: why is "add" faster than "inc"
Post by: arafel on April 28, 2006, 08:33:46 PM
It's faster only on PIV and above processors, because inc incurs a penalty there from a partial flags stall (inc updates the arithmetic flags except CF). On PIII and below, "add reg, 1" is much slower.

(at least for Intel cpus, don't know if there is difference for AMD)
Title: Re: why is "add" faster than "inc"
Post by: jdoe on April 28, 2006, 09:51:16 PM
Quote from: arafel on April 28, 2006, 08:33:46 PM
(at least for Intel cpus, don't know if there is difference for AMD)

add/sub are faster than inc/dec even on AMD processors.


Title: Re: why is "add" faster than "inc"
Post by: QvasiModo on April 28, 2006, 10:19:26 PM
It's because add changes all of the arithmetic registers, while inc changes only some of them - so the processor may have to wait before another arithmetic operation completes just to set the flags correctly, even when the calculations are completely unrelated.

For example, if I have this:

cmp eax,10h
add edx,1

the processor doesn't have to wait for the cmp instruction to complete to be able to execute the add instruction. But if I have this:

cmp eax,10h
inc edx

then the processor has to wait for cmp to know how the flags have to be set after executing inc.
Title: Re: why is "add" faster than "inc"
Post by: Ratch on April 29, 2006, 12:04:18 AM
 jdoe,

Quote
add/sub are faster than inc/dec even on AMD processors

     Both ADD and INC are DirectPath rather than VectorPath instructions according to the AMD Optimization Manual http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf .  Also, many optimization examples in the manual use INC under the right circumstances, i.e. when the register is not read or written immediately after being modified.  I can't find any statement in the manual saying that an ADD is preferable to an INC on AMD.

QvasiModo,

Quote
It's because add changes all of the arithmetic registers, while inc changes only some of them

     Both ADD and INC change only the register that they are coded to change.

Quote
For example, if I have this:

Code:
cmp eax,10h
add edx,1
the processor doesn't have to wait for the cmp instruction to complete to be able to execute the add instruction. But if I have this:

Code:
cmp eax,10h
inc edx
then the processor has to wait for cmp to know how the flags have to be set after executing inc

     Why?  In both cases the ADD in the first snippet and the INC in the second snippet are going to wipe away any flag settings of the CMP instruction.  Ratch
Title: Re: why is "add" faster than "inc"
Post by: tenkey on April 29, 2006, 12:39:39 AM
I think QvasiModo is referring to the differences in flag settings.

Because ADD and CMP change the same set of flags, and INC and CMP don't, there may be a stall for creating the correct flag setting in the latter case.

The difference is CF. In multiprecision arithmetic, you would use INC/DEC for counting and updating addresses. You would need to save and restore CF if there were no increment/decrement instructions that left CF alone.
Title: Re: why is "add" faster than "inc"
Post by: jdoe on April 29, 2006, 12:52:11 AM
Quote from: Ratch on April 29, 2006, 12:04:18 AM
jdoe,

Quote
add/sub are faster than inc/dec even on AMD processors

     Both ADD and INC are DirectPath rather than VectorPath instructions according to the AMD Optimization Manual http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf .  Also, many optimization examples in the manual use INC under the right circumstances, i.e. when the register is not read or written immediately after being modified.  I can't find any statement in the manual saying that an ADD is preferable to an INC on AMD.

I don't much care what was or wasn't written. From the tests I did, add/sub is at worst as fast as inc/dec, and often faster. That is on my AMD Athlon 1800+, though.

If you hear on the radio that the sky is green today, would you believe it without going outside to see for yourself?
Title: Re: why is "add" faster than "inc"
Post by: Ratch on April 29, 2006, 01:02:43 AM
jdoe,

Quote
If you hear on the radio that the sky is green today, would you believe it without going outside to see for yourself?

     From the radio?  No I certainly would not.  But if the one who made the sky said so, then I would believe it until I saw otherwise.  Check your timings again.  They can be tricky with insidious pitfalls.  Ratch
Title: Re: why is "add" faster than "inc"
Post by: Ratch on April 29, 2006, 01:10:44 AM
tenkey,

Quote
Because ADD and CMP change the same set of flags, and INC and CMP don't, there may be a stall for creating the correct flag setting in the latter case.

     Again I point out, the CMPs in his example code are effectively NOPs.  The flags the CMPs set or clear are wiped out by the following ADD and INC instructions.  Ratch
Title: Re: why is "add" faster than "inc"
Post by: hutch-- on April 29, 2006, 01:20:42 AM
In the words of Intel from PIV manual 4,

Quote
The inc and dec instructions should always be avoided. Using add and sub instructions instead of inc and dec instructions avoid data dependence and improve performance.

This probably has something to do with why ADD and SUB are faster on later Intel hardware.
Title: Re: why is "add" faster than "inc"
Post by: tenkey on April 29, 2006, 07:41:40 AM
Quote from: Ratch on April 29, 2006, 01:10:44 AM
tenkey,

Quote
Because ADD and CMP change the same set of flags, and INC and CMP don't, there may be a stall for creating the correct flag setting in the latter case.

     Again I point out, the CMPs in his example code are effectively NOPs.  The flags the CMPs set or clear are wiped out by the following ADD and INC instructions.  Ratch

Here is a demonstration of the INC instruction (DEC is similar). Predict what the following code will produce, then run it. Replace the INC with the equivalent ADD and see if there's a difference.

.386
.model flat, stdcall   ; memory model first, then the language type
option casemap :none   ; case sensitive

include c:\masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc

includelib c:\masm32\lib\kernel32.lib
includelib c:\masm32\lib\user32.lib

.data
caseclr db "CF is cleared by CMP, not set by INC.",0
case0   db "CF is not set by STC.",0
case1   db "CF is not cleared by CMP.",0
case2   db "CF is set by INC.",0
.code
_start:

stc        ; set CF
jnc carryclear_case0   ; error if CF is clear
mov ecx,-1
mov eax,7
cmp eax,3  ; 7 - 3 = 4, no carry (borrow)
jc carryset_case1   ; error if CF is set
; CF status is "clear"
inc ecx    ; FFFFFFFF + 1 = 0 w/carry - is CF set?
jc carryset_case2   ; find out!
carryclear:
invoke MessageBox,NULL,addr caseclr,addr caseclr,MB_OK
jmp quit
carryclear_case0:
invoke MessageBox,NULL,addr case0,addr case0,MB_OK
jmp quit
carryset_case1:
invoke MessageBox,NULL,addr case1,addr case1,MB_OK
jmp quit
carryset_case2:
invoke MessageBox,NULL,addr case2,addr case2,MB_OK
quit:
invoke ExitProcess,0

end _start
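The outcome of the demonstration above can be predicted with a small model of the flags. What follows is only a sketch in C, not part of the original program: the `Flags` struct and the `model_*` and `demo_final_cf` names are hypothetical, and only CF and ZF are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of two x86 flags; real hardware tracks more (SF, OF, AF, PF). */
typedef struct {
    bool cf;   /* carry flag */
    bool zf;   /* zero flag  */
} Flags;

/* cmp a, b: flags of a - b. CF = unsigned borrow, ZF = equal. */
void model_cmp(Flags *f, uint32_t a, uint32_t b)
{
    f->cf = a < b;
    f->zf = a == b;
}

/* add reg, 1: writes both ZF and CF. */
uint32_t model_add1(Flags *f, uint32_t reg)
{
    uint32_t r = reg + 1u;       /* unsigned wraparound is well defined in C */
    f->cf = (r == 0u);           /* FFFFFFFFh + 1 carries out */
    f->zf = (r == 0u);
    return r;
}

/* inc reg: writes ZF but leaves CF exactly as it found it. */
uint32_t model_inc(Flags *f, uint32_t reg)
{
    uint32_t r = reg + 1u;
    f->zf = (r == 0u);
    return r;
}

/* Replays stc / cmp 7,3 / (inc or add) with ecx = -1 and returns the final CF. */
bool demo_final_cf(bool use_inc)
{
    Flags f = { false, false };
    uint32_t ecx = 0xFFFFFFFFu;

    f.cf = true;                 /* stc */
    model_cmp(&f, 7u, 3u);       /* 7 - 3 = 4, no borrow */
    assert(!f.cf);               /* CF is clear at this point */

    ecx = use_inc ? model_inc(&f, ecx) : model_add1(&f, ecx);
    assert(ecx == 0u && f.zf);   /* both forms wrap to zero and set ZF */
    return f.cf;
}
```

In this model `demo_final_cf(true)` returns false (INC leaves CF clear, which corresponds to the "CF is cleared by CMP, not set by INC." message), while `demo_final_cf(false)` returns true because ADD sets CF on the wraparound.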
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on April 29, 2006, 02:08:26 PM
Quote from: jdoe on April 28, 2006, 09:51:16 PM
Quote from: arafel on April 28, 2006, 08:33:46 PM
(at least for Intel cpus, don't know if there is difference for AMD)

add/sub are faster than inc/dec even on AMD processors.




Maybe under certain conditions, but generally not:

Press any key to start...
add 1 : 1019 clocks
add 2 : 1020 clocks
add 3 : 1020 clocks
add 4 : 1363 clocks
inc 1 : 1020 clocks
inc 2 : 1020 clocks
inc 3 : 1021 clocks
inc 4 : 1361 clocks
add/cmp : 1019 clocks
inc/cmp : 1019 clocks
Press any key to exit...

[attachment deleted by admin]
Title: Re: why is "add" faster than "inc"
Post by: dsouza123 on April 29, 2006, 02:19:27 PM
Athlon 1.2 GHz @ 1190 MHz
Windows XP SP2  512MB

Press any key to start...
add 1 : 1026 clocks
add 2 : 1027 clocks
add 3 : 1351 clocks
add 4 : 1802 clocks
inc 1 : 1027 clocks
inc 2 : 1025 clocks
inc 3 : 1028 clocks
inc 4 : 1373 clocks
add/cmp : 1026 clocks
inc/cmp : 1027 clocks
Press any key to exit...
Title: Re: why is "add" faster than "inc"
Post by: Mark Jones on April 29, 2006, 05:11:18 PM
Quote from: thomas_remkus on April 28, 2006, 08:17:49 PM
When I use "inc eax" it's much slower than "add eax, 1". Why is this?

Code optimization is like working on a Sudoku puzzle or decrypting a cryptogram. It can be lots of fun and maddeningly annoying at the same time.

See Agner Fog's optimization guide: http://www.agner.org/assem/
Title: Re: why is "add" faster than "inc"
Post by: jdoe on April 29, 2006, 06:26:10 PM
Quote from: Ratch on April 29, 2006, 01:02:43 AM
But if the one who made the sky said so, then I would believe it until I saw otherwise.

You know how to answer.

OK, I'll give you a point: there is a small gain with INC/DEC on AMD in some circumstances. But I maintain that, in general, ADD/SUB is as fast or faster. In other words, when writing optimized code, trying both is a good idea.
Title: Re: why is "add" faster than "inc"
Post by: BogdanOntanu on April 29, 2006, 07:26:32 PM
Unfortunately JDoe is right...

This is a mistake made in the new CPUs (P4 and up).
One of the aberrations of human technology "evolution".

New, less experienced people have joined the development teams and they simply forgot about the importance of INC and DEC...

ADD and SUB are inherently more complex operations in hardware than INC/DEC, but the newcomers forgot about it...
They will rediscover it some day... if ever :D There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC ...

Such is life on this planet...

If somebody has a P2/P1/386, INC/DEC will be much faster than ADD/SUB

Title: Re: why is "add" faster than "inc"
Post by: Ratch on April 29, 2006, 08:42:41 PM
jdoe,

     ADD EAX,1 uses three times as many bytes as INC EAX.  Ratch
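The size claim can be checked against the 32-bit machine code. A minimal sketch in C, assuming the assembler picks the short `83 /0 ib` form for `add eax, 1` (the array and function names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit mode machine code for the two instructions. */
const uint8_t INC_EAX[]   = { 0x40 };              /* inc eax */
const uint8_t ADD_EAX_1[] = { 0x83, 0xC0, 0x01 };  /* add eax, 1 (opcode 83 /0, sign-extended 8-bit immediate) */

size_t inc_eax_size(void)   { return sizeof INC_EAX; }
size_t add_eax_1_size(void) { return sizeof ADD_EAX_1; }

/* One byte against three: the 3:1 ratio quoted in the post. */
void check_sizes(void)
{
    assert(inc_eax_size() == 1);
    assert(add_eax_1_size() == 3 * inc_eax_size());
}
```

The extra two bytes per instruction are what can shift the alignment of everything that follows in a loop.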
Title: Re: why is "add" faster than "inc"
Post by: jdoe on April 29, 2006, 09:44:57 PM
Quote from: Ratch on April 29, 2006, 08:42:41 PM
ADD EAX,1 uses three times as many bytes as INC EAX.  Ratch

I won't argue with that because it is an immutable truth. BTW, I never talked about optimizing for size, which has a different purpose than optimizing for speed.

You definitely want the last word on the subject. Keep searching...

Title: Re: why is "add" faster than "inc"
Post by: hutch-- on April 29, 2006, 09:58:46 PM
Bogdan is correct here: it is simply technology change based on how the hardware is constructed. INC and DEC performed well on most of the older hardware, but the PIV is internally different and Intel publishes that ADD and SUB are preferred instead. From what I can tell, later AMD hardware works much the same way. The upshot is that if you are still writing code dedicated to older hardware and the speed actually matters, use INC DEC, but if you are targeting modern hardware, use ADD SUB as the manufacturer suggests.
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on April 29, 2006, 10:44:32 PM
Quote from: EduardoS on April 29, 2006, 02:08:26 PM
Quote from: jdoe on April 28, 2006, 09:51:16 PM
Quote from: arafel on April 28, 2006, 08:33:46 PM
(at least for Intel cpus, don't know if there is difference for AMD)

add/sub are faster than inc/dec even on AMD processors.




Maybe under certain conditions, but generally not:

Press any key to start...
add 1 : 1019 clocks
add 2 : 1020 clocks
add 3 : 1020 clocks
add 4 : 1363 clocks
inc 1 : 1020 clocks
inc 2 : 1020 clocks
inc 3 : 1021 clocks
inc 4 : 1361 clocks
add/cmp : 1019 clocks
inc/cmp : 1019 clocks
Press any key to exit...


That was on an Athlon 64... I don't think AMD has anything newer...
AMD dropped some instructions in 64-bit mode: inc/dec lose their 1-byte form but still exist in the 2-byte form, so I guess they will support inc/dec for some time yet.

I'm curious to know how this code does on a P4.
Title: Re: why is "add" faster than "inc"
Post by: Mark Jones on April 29, 2006, 10:55:27 PM
Quote from: hutch-- on April 29, 2006, 09:58:46 PM
...but if you are targeting modern hardware, use ADD SUB as the manufacturer suggests.

I wonder then, why manufacturers simply have not aliased the two in modern processors?
Title: Re: why is "add" faster than "inc"
Post by: hutch-- on April 29, 2006, 11:09:32 PM
Mark,

From memory, it's purely a specification difference in which flags are set. What you suggest makes sense, as it would remove a redundant instruction.
Title: Re: why is "add" faster than "inc"
Post by: dsouza123 on April 30, 2006, 12:59:35 AM
Maybe Intel has corrected it (slow INC/DEC)
on the Merom/Conroe/Woodcrest (Core Microarchitecture) (mobile, desktop, server)
which trace their lineage
Pentium 3 -> Pentium M -> Yonah -> Core Microarchitecture

Woodcrest has a June launch, Conroe July, Merom August.
Only some engineering samples out in the wild, some benchmark results available.

Unfortunately haven't seen any Instruction spec sheets for them yet.
Title: Re: why is "add" faster than "inc"
Post by: NightWare on April 30, 2006, 01:11:20 AM
Quote from: jdoe on April 29, 2006, 09:44:57 PM
I won't argue with that because it is an immutable truth. BTW, I never talked about optimizing for size, which has a different purpose than optimizing for speed.

I think you're partially wrong... because the size of the code changes the size of the jump in the loop... it can make a big difference, sometimes...

But you're right when you say that both have to be tested... I always do that, and I've seen that inc/dec is generally a bit faster than add/sub (with a Celeron 700 MHz and a P4 2 GHz)... but of course, as I said previously, it depends on the entire code/proc...
Title: Re: why is "add" faster than "inc"
Post by: thomas_remkus on April 30, 2006, 02:29:49 AM
I ... uh, really ... did not expect such a large conversation. Clearly small things like this are really hot topics.

Here's what I got out of this: INC/DEC used to be faster, but ADD/SUB are faster now. That does not mean that they always will be, but it can depend on the chip. Also, ADD/SUB are larger so the jump might take a different method to get there so size is something to consider.

I have tested *my* code, as the cloud-maker, and found that in my case, with Visual Studio and inline __asm, ADD/SUB is faster in a debug build but INC/DEC is faster in a release build. For what reason, I have no idea.

This really tells me one more major thing ... I'll check my clouds very carefully!!
Title: Re: why is "add" faster than "inc"
Post by: Mincho Georgiev on April 30, 2006, 09:59:42 AM
Quote
There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC
Unfortunately, this sounds pretty logical. I can't figure a single reason for ADD/SUB to be faster than INC except a hardware design mistake or Bogdan's point of view... 
Title: Re: why is "add" faster than "inc"
Post by: arafel on April 30, 2006, 12:50:37 PM
Quote from: EduardoS on April 29, 2006, 10:44:32 PM
I'm curious to know how this code does on a P4.

EduardoS, such tests are far from realistic. Continuously repeating the inc/add instructions doesn't represent any real-life scenario, where other factors are almost always present.

Quote from: thomas_remkus on April 30, 2006, 02:29:49 AM
Also, ADD/SUB are larger so the jump might take a different method to get there so size is something to consider.

It's more a case of crossing cache line boundaries than the distance of the jumps. On some occasions add 1/sub 1, because of its size, will make the code cross a boundary and lead to a big slowdown. The solution then might be to replace it with inc/dec to reduce the code size.


...Anyway, as it has been mentioned here already, better would be just to try different approaches when optimizing and see which one gives better results.

P.S. In everyday coding, when not doing tight optimizations, I always use inc/dec because it requires less typing.
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on April 30, 2006, 02:32:47 PM
Quote from: arafel on April 30, 2006, 12:50:37 PM
EduardoS, such tests are far from realistic. Continuously repeating the inc/add instructions doesn't represent any real-life scenario, where other factors are almost always present.
I can't test every algorithm to see whether inc or add is faster.
That code is useful for seeing whether there is any difference in latency and throughput, and whether the flag dependency affects the result. On the Athlon it doesn't show any difference.
Title: Re: why is "add" faster than "inc"
Post by: arafel on April 30, 2006, 04:21:44 PM
Quote from: EduardoS on April 30, 2006, 02:32:47 PM
I can't test every algorithm to see whether inc or add is faster.
Therefore I stand by what I have said: better try different approaches when optimizing and see which one is better.

Quote from: EduardoS on April 30, 2006, 02:32:47 PM
That code is useful for seeing whether there is any difference in latency and throughput, and whether the flag dependency affects the result. On the Athlon it doesn't show any difference.

cmp   ebx, 5
inc   eax | add   eax, 1
cmp   ebx, 5
inc   eax | add   eax, 1
....


Doesn't exactly test the dependency you have mentioned.
In that case it makes no difference to each following cmp instruction whether add or inc was used, since cmp doesn't depend on how add and inc differ in modifying CF.

Adding some other instruction which depends on those things will solve this.


cmp   ebx, 5 ; CF and ZF are affected
inc   eax ; ZF is affected, CF is not
seta   dl ; seta reads CF and ZF, so it has to wait for both cmp (CF) and inc (ZF)

cmp   ebx, 5 ; CF and ZF are affected
add   eax, 1 ; CF and ZF are both overwritten
seta   dl ; seta depends only on add, independently of cmp's progress
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on April 30, 2006, 06:48:51 PM
Quote from: arafel on April 30, 2006, 04:21:44 PM
Doesn't exactly test the dependency you have mentioned.
In that case it makes no difference to each following cmp instruction whether add or inc was used, since cmp doesn't depend on how add and inc differ in modifying CF.
You are right here.

Quote
Adding some other instruction which depends on those things will solve this.


cmp   ebx, 5 ; CF and ZF are affected
inc   eax ; ZF is affected, CF is not
seta   dl ; seta reads CF and ZF, so it has to wait for both cmp (CF) and inc (ZF)

cmp   ebx, 5 ; CF and ZF are affected
add   eax, 1 ; CF and ZF are both overwritten
seta   dl ; seta depends only on add, independently of cmp's progress

Here we have a true dependency: the seta depends on the results of both cmp and inc. You can't replace the inc with "add eax, 1" because you would lose the carry status and get a different result. The question is about false dependencies, where inc and add reg, 1 give the same result, for example when the seta is replaced by setz.


    Repeat 1024
        cmp ebx, 5
        inc eax
        setz bl       
    endm



add/seta : 1020 clocks
inc/seta : 2044 clocks <--True dependency
add/setz : 1020 clocks
inc/setz : 1020 clocks
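Why the seta case is a true dependency shows up with concrete values. The following is a rough C model, not a measurement: only CF and ZF are modeled, the function names are hypothetical, and ebx = 3, eax = 0 are arbitrary example inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* seta: "above" means CF = 0 and ZF = 0. */
bool seta(bool cf, bool zf) { return !cf && !zf; }

/* cmp ebx, 5 / inc eax / seta dl */
bool seta_after_cmp_inc(uint32_t ebx, uint32_t eax)
{
    bool cf = ebx < 5u;                      /* from cmp; inc does not touch it */
    bool zf = (uint32_t)(eax + 1u) == 0u;    /* from inc */
    return seta(cf, zf);
}

/* cmp ebx, 5 / add eax, 1 / seta dl */
bool seta_after_cmp_add(uint32_t ebx, uint32_t eax)
{
    uint32_t r = eax + 1u;
    (void)ebx;                               /* cmp's flags are overwritten by add */
    return seta(r == 0u, r == 0u);           /* add writes both CF and ZF */
}

void check_seta_example(void)
{
    /* ebx = 3 is below 5, so cmp sets CF. */
    assert(seta_after_cmp_inc(3u, 0u) == false);  /* CF from cmp survives inc */
    assert(seta_after_cmp_add(3u, 0u) == true);   /* add cleared CF */
}
```

Because the two sequences can produce different values in dl, the inc version has to combine flags from two producers. With setz only ZF is read, and both inc and add write it, so the results match.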
Title: Re: why is "add" faster than "inc"
Post by: arafel on April 30, 2006, 08:18:59 PM
Yes, but the problem is that setz doesn't depend on CF. You can't use an instruction that isn't affected by CF; it defeats the whole purpose of what you were testing. This is why the outcomes are identical in the test results you posted.

Perhaps choosing seta wasn't such a good idea on my part, but it was merely an example of the partial flags stall I was talking about.

Will this prove my point better? :

  add     eax, 1       ; affects CF and other flags
  rcr     ecx, 1       ; affects CF
  add     esi, 1234567 ; doesn't need to wait for both preceding instructions to retire       
 

  inc     eax          ; affects some flags except CF
  rcr     ecx, 1       ; affects CF
  add     esi, 1234567 ; will wait for both preceding instructions to retire
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on April 30, 2006, 08:34:30 PM
Here both sequences need only 1 cycle;
the Athlon doesn't seem to be slower with inc.
Title: Re: why is "add" faster than "inc"
Post by: arafel on April 30, 2006, 09:03:12 PM
hmm. On PIII (repeated 1024 times of course):

add/rcr/add: 2516 clocks
inc/rcr/add : 7374 clocks

I'll test on a couple of PIVs tomorrow... Although IMO it shouldn't be any different on them, unless the PIV doesn't have a flags register stall penalty, which I doubt.
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on April 30, 2006, 09:33:48 PM
Quote from: arafel on April 30, 2006, 08:18:59 PM
  add     eax, 1       ; affects CF and other flags
  rcr     ecx, 1       ; affects CF
  add     esi, 1234567 ; doesn't need to wait for both preceding instructions to retire       
 

  inc     eax          ; affects some flags except CF
  rcr     ecx, 1       ; affects CF
  add     esi, 1234567 ; will wait for both preceding instructions to retire


A small note: the two sequences give different results, because the first has a true dependency:
  add     eax, 1       ; affects CF and other flags
  rcr     ecx, 1       ; affects CF ;;; rcr needs the carry to perform the shift and the first instruction changes the carry (true dependency)
  add     esi, 1234567 ; doesn't need to wait for both preceding instructions to retire       
 

An example where both inc and add reg,1 give the same result should be:

  add     eax, 1       ; affects CF and other flags
  bt     ecx, 1       ; affects CF
  add     esi, 1234567 ; doesn't need to wait for both preceding instructions to retire       
 
Which takes the same number of clock cycles to execute here.
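The difference between rcr and bt can be modeled the same way. This is a sketch in C with hypothetical helper names: rcr reads the incoming CF and rotates it into bit 31, while bt overwrites CF without reading it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CF left by "add eax, 1": set only when eax wraps from FFFFFFFFh to 0. */
bool cf_after_add1(uint32_t eax) { return (uint32_t)(eax + 1u) == 0u; }

/* rcr ecx, 1: the incoming CF rotates into bit 31 of the result. */
uint32_t rcr1(uint32_t ecx, bool cf_in)
{
    return (ecx >> 1) | ((uint32_t)cf_in << 31);
}

/* ecx after (inc | add) eax followed by rcr ecx, 1.
   cf_before is whatever CF held before the sequence started. */
uint32_t ecx_after_rcr(bool use_inc, bool cf_before, uint32_t eax, uint32_t ecx)
{
    bool cf = use_inc ? cf_before : cf_after_add1(eax);  /* inc keeps the old CF */
    return rcr1(ecx, cf);
}

/* CF after (inc | add) eax followed by bt ecx, 1: bt ignores the incoming CF. */
bool cf_after_bt(bool use_inc, bool cf_before, uint32_t eax, uint32_t ecx)
{
    (void)use_inc; (void)cf_before; (void)eax;           /* none of them matter */
    return ((ecx >> 1) & 1u) != 0u;                      /* CF = bit 1 of ecx */
}

void check_rcr_vs_bt(void)
{
    /* With CF set beforehand, inc/rcr and add/rcr leave different values in ecx. */
    assert(ecx_after_rcr(true,  true, 0u, 2u) == 0x80000001u);
    assert(ecx_after_rcr(false, true, 0u, 2u) == 0x00000001u);
    /* With bt the choice of inc or add cannot change the result. */
    assert(cf_after_bt(true, true, 0u, 2u) == cf_after_bt(false, true, 0u, 2u));
}
```

So inc/rcr and add/rcr are not interchangeable, whereas with bt any slowdown in the inc version would come purely from how the hardware tracks the flags.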
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on April 30, 2006, 09:58:36 PM
Just an interesting effect on P3, try both:


  add     eax, 1       ; affects CF and other flags
  bt     ecx, 1       ; affects CF
 


  inc     eax          ; affects some flags except CF
  rcr     ecx, 1       ; affects CF

Title: Re: why is "add" faster than "inc"
Post by: Mark Jones on April 30, 2006, 10:25:36 PM
This is partly how the Athlon XP 1800 claims to run at "2500" speed - it simply executes some codes faster than 1:1 clock speed.
Title: Re: why is "add" faster than "inc"
Post by: MichaelW on May 01, 2006, 04:34:49 AM
Instead of comparing code sequences that are not representative of real code, this compares two versions of the MASM32 cmpmem procedure:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include timers.asm

    cmpmem_incdec PROTO :DWORD,:DWORD,:DWORD

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      pMem dd 0
      fLen dd 0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke read_disk_file, chr$("\masm32\m32lib\cmpmem.asm"),
                          ADDR pMem, ADDR fLen
    print ustr$(fLen)," bytes",13,10

    invoke cmpmem, pMem, pMem, fLen
    print ustr$(eax),13,10
    invoke cmpmem_incdec, pMem, pMem, fLen
    print ustr$(eax),13,10

    LOOP_COUNT equ 1000000

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke cmpmem, pMem, pMem, fLen
    counter_end
    print ustr$(eax)," cycles, cmpmem",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke cmpmem_incdec, pMem, pMem, fLen
    counter_end
    print ustr$(eax)," cycles, cmpmem_incdec",13,10

    free pMem
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

cmpmem_incdec proc buf1:DWORD,buf2:DWORD,bcnt:DWORD

    push esi
    push edi

    mov edx, bcnt
    shr edx, 2                      ; div by 4

    mov esi, buf1
    mov edi, buf2
    xor ecx, ecx

  align 4
  @@:
    mov eax, [esi+ecx]              ; DWORD compare main file
    cmp eax, [edi+ecx]
    jne fail
    add ecx, 4
    ;sub edx, 1
    DEC EDX
    jnz @B

    mov edx, bcnt                   ; calculate any remainder
    and edx, 3
    jz match                        ; exit if its zero
    xor eax, eax                    ; clear EAX for partial writes

  @@:
    mov al, [esi+ecx]               ; BYTE compare tail
    cmp al, [edi+ecx]
    jne fail
    ;add ecx, 1
    INC ECX
    ;sub edx, 1
    DEC EDX
    jnz @B

    jmp match

  fail:
    xor eax, eax                    ; return zero if DIFFERENT
    jmp quit

  match:
    mov eax, 1                      ; return NON zero if SAME

  quit:
    pop edi
    pop esi

    ret

cmpmem_incdec endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start


Typical result on a P3:

1777 cycles, cmpmem
1081 cycles, cmpmem_incdec


Typical result on my old K5:

916 cycles, cmpmem
910 cycles, cmpmem_incdec
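For readers who don't follow the assembly, the algorithm being timed is roughly the following. This is a C sketch only, not the MASM32 source: `cmpmem_sketch` is a hypothetical name, and unlike the listing it uses a `while` loop, so a count below 4 skips the DWORD pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the first bcnt bytes match and 0 otherwise,
   the same convention as the assembly version. */
int cmpmem_sketch(const void *buf1, const void *buf2, size_t bcnt)
{
    const unsigned char *p = buf1;
    const unsigned char *q = buf2;
    size_t i = 0;                         /* plays the role of ecx */
    size_t dwords = bcnt / 4;             /* shr edx, 2 */

    assert(buf1 != NULL && buf2 != NULL);

    while (dwords != 0) {                 /* DWORD compare, main part */
        uint32_t a, b;
        memcpy(&a, p + i, sizeof a);      /* alignment-safe 4-byte loads */
        memcpy(&b, q + i, sizeof b);
        if (a != b)
            return 0;
        i += 4;                           /* add ecx, 4 */
        dwords -= 1;                      /* sub edx, 1  or  dec edx */
    }

    for (size_t tail = bcnt & 3; tail != 0; tail -= 1) {   /* BYTE compare, tail */
        if (p[i] != q[i])
            return 0;
        i += 1;                           /* add ecx, 1  or  inc ecx */
    }
    return 1;
}
```

A C compiler chooses between inc/dec and add/sub on its own, so this only illustrates where the counters being swapped sit in the loop.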



[attachment deleted by admin]
Title: Re: why is "add" faster than "inc"
Post by: hutch-- on May 01, 2006, 05:11:04 AM
Interestingly enough, I am getting very little difference on this PIV.


1396 bytes
1
1
1201 cycles cmpmem
1210 cycles cmpmem_incdec
Press any key to exit...
Title: Re: why is "add" faster than "inc"
Post by: PBrennick on May 01, 2006, 06:20:28 AM
The same can be said for my AMD.

Quote1396 bytes
1
1
1097 cycles
1077 cycles
Press any key to exit...
Title: Re: why is "add" faster than "inc"
Post by: jdoe on May 01, 2006, 06:33:37 AM
AMD Athlon 1800+


1396 bytes
1
1
1079 cycles
1081 cycles
Press any key to exit...



PBrennick, I'm curious to know your AMD model? I can't get a difference of that range on mine by using INC/DEC.
Title: Re: why is "add" faster than "inc"
Post by: MichaelW on May 01, 2006, 01:41:00 PM
The attachment this time tests INC/DEC versions of the MASM32 cmpmem, szRev, and szWcnt procedures against the original ADD/SUB versions. I simply replaced each ADD reg, 1 with INC reg and each SUB reg, 1 with DEC reg.

Typical results on a P3:

1776 cycles, cmpmem
1080 cycles, cmpmem_incdec
944 cycles, szRev
901 cycles, szRev_incdec
1584 cycles, szWcnt
1332 cycles, szWcnt_incdec



[attachment deleted by admin]
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on May 01, 2006, 02:20:27 PM
A64
Quote
1063 cycles, cmpmem
1063 cycles, cmpmem_incdec
824 cycles, szRev
822 cycles, szRev_incdec
1266 cycles, szWcnt
1432 cycles, szWcnt_incdec
Press any key to exit...

I can't find any dependencies in szWcnt, so why was it slower?
Title: Re: why is "add" faster than "inc"
Post by: MichaelW on May 01, 2006, 03:13:19 PM
I don't understand why, but if I change the INC ECX after the test_word label back to ADD ECX, 1 the cycle count for a P3 drops from 1332 to about 1281.


Title: Re: why is "add" faster than "inc"
Post by: dsouza123 on May 01, 2006, 04:11:04 PM
Athlon 1300 Mhz Windows XP SP2


1396 bytes
1
1
1078 cycles, cmpmem
1077 cycles, cmpmem_incdec



1076 cycles, cmpmem
1071 cycles, cmpmem_incdec
901 cycles, szRev
866 cycles, szRev_incdec
1454 cycles, szWcnt
1532 cycles, szWcnt_incdec
Title: Re: why is "add" faster than "inc"
Post by: jdoe on May 01, 2006, 04:24:04 PM
P4 1700MHz


1185 cycles, cmpmem
1212 cycles, cmpmem_incdec
1070 cycles, szRev
1019 cycles, szRev_incdec
1640 cycles, szWcnt
1759 cycles, szWcnt_incdec
Press any key to exit...
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on May 01, 2006, 05:10:03 PM
Quote from: MichaelW on May 01, 2006, 03:13:19 PM
I don't understand why, but if I change the INC ECX after the test_word label back to ADD ECX, 1 the cycle count for a P3 drops from 1332 to about 1281.
Here, when I change one or two incs to add the time doesn't change, but when I change all of them it drops. I tested different alignments and the clock count varies from 1200 to 1600, so I guess the difference in the third algorithm is due to alignment (another topic?).
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on May 01, 2006, 05:14:50 PM
Yes, it was alignment. With an xchg ax, ax after each inc/dec (to give it the same size as add/sub):

1063 cycles, cmpmem
1063 cycles, cmpmem_incdec
824 cycles, szRev
821 cycles, szRev_incdec
1265 cycles, szWcnt
1265 cycles, szWcnt_incdec
Press any key to exit...

[attachment deleted by admin]
Title: Re: why is "add" faster than "inc"
Post by: Mark Jones on May 01, 2006, 08:12:30 PM
Quote from: MichaelW on May 01, 2006, 01:41:00 PM
The attachment this time tests INC/DEC versions of the MASM32 cmpmem, szRev, and szWcnt procedures against the original ADD/SUB versions. I simply replaced each ADD reg, 1 with INC reg and each SUB reg, 1 with DEC reg.

Interesting: I tried to run it from WinRAR and it crashed (AMD XP 2500+). It runs fine if extracted manually and then run. This is the only time I recall this happening.


1083 cycles, cmpmem
1086 cycles, cmpmem_incdec
911 cycles, szRev
871 cycles, szRev_incdec
1467 cycles, szWcnt
1544 cycles, szWcnt_incdec
Title: Re: why is "add" faster than "inc"
Post by: asmfan on May 02, 2006, 07:51:33 PM
Guys, if you need fair tests, use a non-multitasking OS (DOS, for example); otherwise any difference in the results can be put down to measurement error...
Title: Re: why is "add" faster than "inc"
Post by: msmith on May 03, 2006, 02:33:49 AM
BogdanOntanu said:

Quote
There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC ...

I made the following program to test this on my compiler:


; Dimension Variables
DIM I AS LONG
DIM StartTime AS LONG

OBMain.CREATE

END EVENT

Button1.COMMAND
StartTime=GETTICKCOUNT()
FOR I=1 to 2000000000
; Do nothing inside loop
NEXT I
Button1.TEXT=STR$((GETTICKCOUNT()-StartTime))
END EVENT



Here is the .asm code for inc:

; LN:11 FOR I=1 to 2000000000
mov eax,1
mov [I],eax
mov eax,2000000000
mov [_LopVec1],eax
_Lbl2:
mov eax,[I]
cmp eax,[_LopVec1]
jg _Lbl4
; LN:12 ; Do nothing inside loop
; LN:13 NEXT I
_Lbl3:
inc [I]
jmp _Lbl2
_Lbl4:


Here is the .asm code for add:

; LN:11 FOR I=1 to 2000000000
mov eax,1
mov [I],eax
mov eax,2000000000
mov [_LopVec1],eax
_Lbl2:
mov eax,[I]
cmp eax,[_LopVec1]
jg _Lbl4
; LN:12 ; Do nothing inside loop
; LN:13 NEXT I
_Lbl3:
mov eax,[I]
add eax,1
mov [I],eax
jmp _Lbl2
_Lbl4:


With an inc instruction in the "Next" incrementer code, the program runs in 5.2 seconds.

After modifying the compiler to generate a mov, add 1, mov sequence to increment the loop variable, the program was recompiled and ran in 4.2 seconds.

This is a 22% improvement!

Changing the compiler to do this took less than a minute of work. The compiler can output either type of code with equal ease. In fact it wouldn't be that difficult to detect the type of machine and output the inc or add code on the fly.
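For the two timings quoted above (5.2 s and 4.2 s), the gain can be expressed either as run time saved or as speedup. A small C sketch of both calculations, with illustrative function names; the inputs are the rounded figures from the post, so the results only bracket the 22% that was presumably computed from the unrounded tick counts.

```c
#include <assert.h>

/* Percentage of run time saved: (old - new) / old * 100. */
double percent_time_saved(double old_s, double new_s)
{
    return (old_s - new_s) / old_s * 100.0;
}

/* Percentage speedup: (old / new - 1) * 100. */
double percent_speedup(double old_s, double new_s)
{
    return (old_s / new_s - 1.0) * 100.0;
}

void check_quoted_timings(void)
{
    double saved  = percent_time_saved(5.2, 4.2);  /* roughly 19.2 */
    double faster = percent_speedup(5.2, 4.2);     /* roughly 23.8 */
    assert(saved  > 19.2 && saved  < 19.3);
    assert(faster > 23.8 && faster < 23.9);
}
```

Either way the loop overhead drops by about a fifth on this machine.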
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on May 03, 2006, 10:42:45 AM
msmith, what's your processor?
Title: Re: why is "add" faster than "inc"
Post by: msmith on May 03, 2006, 06:43:29 PM
It's a PIV 2.8 GHz.
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on May 04, 2006, 12:58:03 AM
The P4 breaks inc into two uops, so add is much faster on it;
the same doesn't happen with the Athlon.
Title: Re: why is "add" faster than "inc"
Post by: msmith on May 04, 2006, 02:13:35 AM
EduardoS,

Thanks for the info.

The problem with sensing the processor type and generating the optimized code is that it is easy to do so at compile time, but much harder at run time.

If I sense the processor type at compile time and generate the best code, there is no telling what kind of machine the .exe will be run on.

The 22% speed improvement on each iteration of a FOR LOOP is significant, especially if there is not much inside the loop.

Mike
Title: Re: why is "add" faster than "inc"
Post by: BogdanOntanu on May 04, 2006, 06:36:32 AM
The problem with such simple "do almost nothing" loops is that they never represent the real behaviour of a compiler in a big/medium application.
Basically you will never have such empty loops in real-life applications and algorithms.

I have seen compilers "forget" about variables assignments under stress conditions or use constructs like:

movzx edx, word ptr[ esi]  ... correct because the pointer was typed "word"

and then:

movzx ecx,dx   --> funny is it not ? apparently edx is also considered "word" :P

sometimes on the very next instruction...

I am always amazed at the huge credit compilers get from performing such simple tests :D
when in fact they perform much worse in really complicated algorithms...


Title: Re: why is "add" faster than "inc"
Post by: msmith on May 04, 2006, 07:13:59 PM
Quote
The problem with such simple "do almost nothing" loops is that they never represent the real behaviour of a compiler in a big/medium application.
Basically you will never have such empty loops in real-life applications and algorithms.

It is true that a contrived test never fully represents real behavior.

The behavior of the processor (cache loading, branch lookahead, etc.) will likely change, but not the behavior of the compiler.

No test in life is really representative. On your driving test, you have a policeman in your car to check you. Do you drive the same with a policeman in the car as when you don't? Every test in life is contrived, but the advantage of such tests is the possibility of testing just one thing in isolation. If the inc/add test had complex code in the loop, we would not know for sure why one example was faster than the other.
Title: Re: why is "add" faster than "inc"
Post by: EduardoS on May 04, 2006, 10:45:09 PM
I tested szWcnt_incdec on an Athlon XP and tried some variations. I didn't change the code, only the code alignment. The best variation used a rather illogical alignment:

1451 cycles, szWcnt
1334 cycles, szWcnt_incdec <-- best variation


With some comments:

szWcnt_incdec proc src:DWORD,txt:DWORD

    push ebx
    push esi
    push edi

    mov edx, len([esp+16])      ; procedure call for src length
    sub edx, len([esp+20])      ; procedure call for 1st text length

    mov eax, -1
    mov esi, [esp+16]           ; source in ESI
    add edx, esi                ; add src to exit position
    xor ebx, ebx                ; clear EBX to prevent stall with BL
    add edx, 1                  ; correct to get last word
    mov edi, [esp+20]           ; text to count in EDI
    sub esi, 1
align 16
  pre:
    INC EAX     ; was faster with this instruction aligned 16 bytes
    xchg ax, ax
    nop
  wcst:         
    add esi, 1   ; for best performance this must start 4 bytes after the start of the inc above,
                    ; with two instructions (nops) between them (don't ask why)
    xchg ax, ax
    nop

    cmp edx, esi ; this must start 6 bytes after the start of the add (don't ask why),
                      ; with two instructions (nops) between them (again, don't ask why)
    jle wcout                  ;
    mov bl, [esi]               ;
    cmp bl, [edi]               ; all of these instructions must be in sequence
    jne wcst                    ;

    nop                   ;
    xor ecx, ecx       ; the inc must come right after the xor, with a nop before the xor and a nop after the inc
  test_word:           ; the total length of these 4 instructions must be 6 bytes
    INC ECX             ;
    xchg ax, ax        ;

    cmp BYTE PTR [edi+ecx], 0   ;
    je pre                      ;
    mov bl, [esi+ecx]           ;  Must be in sequence
    cmp bl, [edi+ecx]           ;
    jne wcst                    ;
     jmp test_word               ;
  wcout:

    pop edi
    pop esi
    pop ebx

    ret 8

szWcnt_incdec endp



I really don't understand why this alignment was the fastest on XP 2000+.