why is "add" faster than "inc"

Started by thomas_remkus, April 28, 2006, 08:17:49 PM


BogdanOntanu

Unfortunately JDoe is right...

This is a mistake made by the new CPUs (P4 and up).
One of the aberrations of human technology "evolution".

New, less experienced people have joined the development team and they simply forgot about the importance of INC and DEC...

ADD and SUB are inherently more complex operations in hardware than INC/DEC, but the newcomers forgot about it...
They will rediscover it some day... if ever :D There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC...

Such is life on this planet...

If somebody has a P2/P2/P1/386, INC/DEC will be much faster than ADD/SUB.

Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

Ratch

jdoe,

     ADD EAX,1 uses three times as many bytes as INC EAX.  Ratch
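
The size claim can be checked directly from the instruction encodings. The sketch below is only an illustration in Python and is not part of the thread: the table names and `size_ratio` are made up for this example, while the byte values are the standard x86 encodings an assembler normally picks for these two instructions. The 64-bit table is included because the one-byte INC opcodes (40h to 47h) are reused as REX prefixes in 64-bit mode.

```python
# Machine-code bytes for the two instructions (standard x86 encodings).
ENCODINGS_32BIT = {
    "inc eax":    bytes([0x40]),              # one-byte INC r32 form
    "add eax, 1": bytes([0x83, 0xC0, 0x01]),  # ADD r/m32, imm8 (sign-extended)
}

# In 64-bit mode 40h-4Fh are REX prefixes, so INC falls back to FF /0.
ENCODINGS_64BIT = {
    "inc eax":    bytes([0xFF, 0xC0]),        # two-byte INC r/m32 form
    "add eax, 1": bytes([0x83, 0xC0, 0x01]),  # unchanged
}


def size_ratio(table):
    """Return len(add) / len(inc) for one encoding table."""
    return len(table["add eax, 1"]) / len(table["inc eax"])


if __name__ == "__main__":
    for mode, table in (("32-bit", ENCODINGS_32BIT), ("64-bit", ENCODINGS_64BIT)):
        for mnemonic, code in table.items():
            hex_bytes = " ".join(f"{b:02X}" for b in code)
            print(f"{mode}: {mnemonic:<11} {hex_bytes}  ({len(code)} bytes)")
        print(f"{mode}: add/inc size ratio = {size_ratio(table)}")
```

In 32-bit code the ratio comes out as 3, matching the statement above; in 64-bit code it drops to 1.5.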

jdoe

Quote from: Ratch on April 29, 2006, 08:42:41 PM
ADD EAX,1 uses three times as many bytes as INC EAX.  Ratch

I won't argue with that because it is an immutable truth. BTW, I never talked about optimizing for size, which has a different purpose than speed.

You definitely want the last word on the subject. Keep searching...


hutch--

Bogdan is correct here, it is simply a technology change based on how the hardware is constructed. INC and DEC performed well on most of the older stuff, but the PIV is internally different and Intel publishes that it is preferred to use ADD SUB instead. From what I can tell, later AMD stuff works much the same way. Now the upshot is: if you are still writing code dedicated to older hardware and the speed actually matters, use INC DEC, but if you are targeting modern hardware, use ADD SUB as the manufacturer suggests.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

EduardoS

Quote from: EduardoS on April 29, 2006, 02:08:26 PM
Quote from: jdoe on April 28, 2006, 09:51:16 PM
Quote from: arafel on April 28, 2006, 08:33:46 PM
(at least for Intel cpus, don't know if there is difference for AMD)

add/sub are faster than inc/dec even on AMD processor.  :thumbu




Maybe under certain conditions, generally not:

Press any key to start...
add 1 : 1019 clocks
add 2 : 1020 clocks
add 3 : 1020 clocks
add 4 : 1363 clocks
inc 1 : 1020 clocks
inc 2 : 1020 clocks
inc 3 : 1021 clocks
inc 4 : 1361 clocks
add/cmp : 1019 clocks
inc/cmp : 1019 clocks
Press any key to exit...


It was on an Athlon 64... I think AMD doesn't have anything newer...
AMD killed some instructions in 64-bit mode; inc/dec lose the 1-byte form but still exist in the 2-byte form, so I guess they will support inc/dec for some time more.

I'm curious to know how this code runs on a P4.

Mark Jones

Quote from: hutch-- on April 29, 2006, 09:58:46 PM
...but if you are targeting modern hardware, use ADD SUB as the manufacturer suggests.

I wonder, then, why manufacturers have not simply aliased the two in modern processors?
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

hutch--

Mark,

From memory, it's purely a specification difference in which flags are set (INC and DEC leave the carry flag untouched, ADD and SUB write it). What you suggest makes sense, as it would remove a redundant instruction.

dsouza123

Maybe Intel has corrected it (slow INC/DEC)
on the Merom/Conroe/Woodcrest (Core Microarchitecture) (mobile, desktop, server),
which trace their lineage:
Pentium 3 -> Pentium M -> Yonah -> Core Microarchitecture

Woodcrest has a June launch, Conroe July, Merom August.
Only some engineering samples out in the wild, some benchmark results available.

Unfortunately haven't seen any Instruction spec sheets for them yet.

NightWare

Quote from: jdoe on April 29, 2006, 09:44:57 PM
I won't argue with that because it is an immutable truth. BTW, I never talked about optimizing for size, which has a different purpose than speed.

I think you're partially wrong... because the size of the code changes the size of the jump in the loop... it can make a big difference, sometimes...

But you're right when you say that both have to be tested... I always do that, and I've seen that inc/dec is generally a bit faster than add/sub (with a Celeron 700 MHz and a P4 2 GHz)... but of course, like I said previously, it depends on the entire code/proc...

thomas_remkus

I ... uh, really ... did not expect such a large conversation. Clearly small things like this are really hot topics.

Here's what I got out of this: INC/DEC used to be faster, but ADD/SUB are faster now. That does not mean that they always will be, but it can depend on the chip. Also, ADD/SUB are larger so the jump might take a different method to get there so size is something to consider.

I have tested *my* code (the cloud-maker) and found that, in my case, with Visual Studio and inline __asm, ADD/SUB is faster under debug but INC/DEC is faster under release. For what reasons, I have no idea.

This really tells me one more major thing ... I'll check my clouds very carefully!!

Mincho Georgiev

Quote
There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC
Unfortunately, this sounds pretty logical. I can't figure out a single reason for ADD/SUB to be faster than INC except a hardware design mistake or Bogdan's point of view...

arafel

Quote from: EduardoS on April 29, 2006, 10:44:32 PM
I'm curious to know how this code runs on a P4.

EduardoS, such tests are far from being realistic. Continuously repeating the inc/add instructions doesn't represent any real-life scenario, where other factors are almost always present.

Quote from: thomas_remkus on April 30, 2006, 02:29:49 AM
Also, ADD/SUB are larger so the jump might take a different method to get there so size is something to consider.

It's more a case of crossing cache boundaries than of the distance of the jmps. On some occasions add 1/sub 1, because of its size, will make you cross the boundary and lead to a big slowdown. The solution then might be to replace it with inc/dec to reduce the code size.
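
The boundary effect described here comes down to simple arithmetic on code offsets. The following Python sketch is only an illustration: the 64-byte line size, the starting offset 58 and the helper names are assumptions chosen for the example, not values measured on any CPU in this thread. The instruction lengths are the usual 32-bit encodings (cmp ebx, 5 = 3 bytes, inc eax = 1 byte, add eax, 1 = 3 bytes, short jnz = 2 bytes).

```python
LINE_SIZE = 64  # assumed line size in bytes; check the target CPU


def crosses_boundary(offset, length, line_size=LINE_SIZE):
    """True if the bytes [offset, offset + length) span two lines."""
    first_line = offset // line_size
    last_line = (offset + length - 1) // line_size
    return first_line != last_line


def loop_crosses(start, instruction_sizes, line_size=LINE_SIZE):
    """True if a loop body laid out from `start` spills into the next line."""
    return crosses_boundary(start, sum(instruction_sizes), line_size)


if __name__ == "__main__":
    start = 58                  # hypothetical offset of the loop's first byte
    with_inc = [3, 1, 2]        # cmp ebx, 5 / inc eax / jnz short
    with_add = [3, 3, 2]        # cmp ebx, 5 / add eax, 1 / jnz short
    print("inc version crosses a boundary:", loop_crosses(start, with_inc))
    print("add version crosses a boundary:", loop_crosses(start, with_add))
```

With these made-up numbers the inc version ends exactly at byte 63 and stays inside one line, while the add version runs two bytes into the next one. Whether that costs anything measurable depends on the CPU and the surrounding code, so it still has to be timed.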


...Anyway, as it has been mentioned here already, better would be just to try different approaches when optimizing and see which one gives better results.

P.S. In every-day coding when not doing tight optimizations I always use inc/dec, because it requires less typing  :green

EduardoS

Quote from: arafel on April 30, 2006, 12:50:37 PM
EduardoS, such tests are far from being realistic. Continuously repeating the inc/add instructions doesn't represent any real-life scenario, where other factors are almost always present.
I can't test every algo to see if inc or add is faster.
That code is useful to see if there is any difference in latency and throughput and if the flag dependency affects the result; on the Athlon it doesn't show any difference.

arafel

Quote from: EduardoS on April 30, 2006, 02:32:47 PM
I can't test every algo to see if inc or add is faster,
Therefore I stand by what I have said: better try different approaches when optimizing and see which one is better.

Quote from: EduardoS on April 30, 2006, 02:32:47 PM
That code is useful to see if there is any difference in latency and throughput and if the flag dependency affects the result; on the Athlon it doesn't show any difference.

cmp   ebx, 5
inc   eax | add   eax, 1
cmp   ebx, 5
inc   eax | add   eax, 1
....


Doesn't exactly test the dependency you have mentioned.
In such a case it won't make a difference to each following cmp instruction whether add or inc was used, since cmp doesn't depend on the difference in how add and inc modify CF.

Adding some other instruction which depends on those things will solve this.


cmp   ebx, 5   ; CF and ZF are affected
inc   eax      ; ZF is overwritten, CF is not affected
seta  dl       ; seta needs CF from cmp and ZF from inc, so it has to wait for both.

cmp   ebx, 5   ; CF and ZF are affected
add   eax, 1   ; CF and ZF are both overwritten
seta  dl       ; seta depends only on add and can execute right after it, independently of cmp's progress.

EduardoS

Quote from: arafel on April 30, 2006, 04:21:44 PM
Doesn't exactly test the dependency you have mentioned.
In such a case it won't make a difference to each following cmp instruction whether add or inc was used, since cmp doesn't depend on the difference in how add and inc modify CF.
You are right here.

Quote
Adding some other instruction which depends on those things will solve this.


cmp   ebx, 5   ; CF and ZF are affected
inc   eax      ; ZF is overwritten, CF is not affected
seta  dl       ; seta needs CF from cmp and ZF from inc, so it has to wait for both.

cmp   ebx, 5   ; CF and ZF are affected
add   eax, 1   ; CF and ZF are both overwritten
seta  dl       ; seta depends only on add and can execute right after it, independently of cmp's progress.

Here we have a true dependency: the seta depends on the results of both cmp and inc. You can't replace the inc with "add eax, 1" because you would lose the carry status and get a different result. The question is about the false dependency, where inc and add reg, 1 give the same result, for example when the seta is replaced by setz.


    Repeat 1024
        cmp ebx, 5
        inc eax
        setz bl       
    endm



add/seta : 1020 clocks
inc/seta : 2044 clocks <--True dependency
add/setz : 1020 clocks
inc/setz : 1020 clocks
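
The distinction between the true and the false dependency can be checked with a small model of the two flags involved. This is a minimal sketch in Python, not a timing test: it only models CF and ZF, all function names are invented for the illustration, and it says nothing about how many clocks any CPU needs.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags written by CMP a, b (computes a - b): returns (CF, ZF)."""
    a &= MASK32
    b &= MASK32
    return (a < b, a == b)  # CF = unsigned borrow, ZF = operands equal


def inc_flags(value, flags):
    """INC: writes ZF from the result, leaves the incoming CF untouched."""
    cf, _ = flags
    result = (value + 1) & MASK32
    return result, (cf, result == 0)


def add1_flags(value, flags):
    """ADD value, 1: writes both CF and ZF, incoming flags are ignored."""
    total = (value & MASK32) + 1
    result = total & MASK32
    return result, (total > MASK32, result == 0)


def seta(flags):
    """SETA: 1 when CF = 0 and ZF = 0."""
    cf, zf = flags
    return int(not cf and not zf)


def setz(flags):
    """SETZ: 1 when ZF = 1."""
    return int(flags[1])


def run(step, ebx, eax, setcc):
    """Model 'cmp ebx, 5' / step eax / setcc and return the byte setcc writes."""
    flags = cmp_flags(ebx, 5)
    _, flags = step(eax, flags)
    return setcc(flags)


if __name__ == "__main__":
    print("inc/seta:", run(inc_flags, 3, 0, seta))   # CF from cmp survives inc
    print("add/seta:", run(add1_flags, 3, 0, seta))  # CF rewritten by add
    print("inc/setz:", run(inc_flags, 3, 0, setz))
    print("add/setz:", run(add1_flags, 3, 0, setz))
```

With ebx = 3 the cmp sets CF, so the seta variants give different answers (0 after inc, 1 after add): inc really does carry information from cmp forward. The setz variants always agree, because ZF is rewritten by both inc and add.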