I'm in a tight loop, doing basically nothing but adding numbers. When I use "inc eax" it's much slower than "add eax, 1". Why is this? The difference is significant.
ADD is faster only on PIV and above processors, because there is a penalty from a partial flags stall when using the INC instruction. On PIII and below, "add reg, 1" is much slower.
(at least for Intel CPUs; I don't know if there is a difference for AMD)
Quote from: arafel on April 28, 2006, 08:33:46 PM
(at least for Intel CPUs; I don't know if there is a difference for AMD)
add/sub are faster than inc/dec even on AMD processors. :thumbu
It's because add changes all of the arithmetic registers, while inc changes only some of them - so the processor may have to wait before another arithmetic operation completes just to set the flags correctly, even when the calculations are completely unrelated.
For example, if I have this:
cmp eax,10h
add edx,1
the processor doesn't have to wait for the cmp instruction to complete to be able to execute the add instruction. But if I have this:
cmp eax,10h
inc edx
then the processor has to wait for cmp to know how the flags have to be set after executing inc.
jdoe,
Quote
add/sub are faster than inc/dec even on AMD processors
Both ADD and INC are DirectPath (not VectorPath) instructions according to the AMD Optimization Manual http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf . Also, many optimization examples in the manual use INC under the right circumstances, i.e. no reading or writing of the register immediately after modifying it. I can't find any statement in the manual saying that an ADD is preferable to an INC on the AMD.
QvasiModo,
Quote
It's because add changes all of the arithmetic registers, while inc changes only some of them
Both ADD and INC change only the register that they are coded to change.
Quote
For example, if I have this:
Code:
cmp eax,10h
add edx,1
the processor doesn't have to wait for the cmp instruction to complete to be able to execute the add instruction. But if I have this:
Code:
cmp eax,10h
inc edx
then the processor has to wait for cmp to know how the flags have to be set after executing inc
Why? In both cases the ADD in the first snippet and the INC in the second snippet are going to wipe away any flag settings of the CMP instruction. Ratch
I think QvasiModo is referring to the differences in flag settings.
Because ADD and CMP change the same set of flags, and INC and CMP don't, there may be a stall for creating the correct flag setting in the latter case.
The difference is CF. In multiprecision arithmetic, you would use INC/DEC for counting and updating addresses. You would need to save and restore CF if there were no increment/decrement instructions that left CF alone.
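That CF behavior can be checked with a small model of the flag semantics (a Python sketch, not a CPU emulator; the helper names `add32`/`inc32` are mine): ADD rewrites CF from the carry-out, while INC leaves it alone, which is exactly what a multiprecision loop relies on.

```python
# Minimal model of the x86 CF rules for ADD vs INC (illustrative sketch only).
MASK32 = 0xFFFFFFFF

def add32(value, imm, cf):
    """add reg, imm: result wraps to 32 bits; CF is rewritten from the carry-out."""
    total = value + imm
    return total & MASK32, 1 if total > MASK32 else 0

def inc32(value, cf):
    """inc reg: result wraps to 32 bits; CF keeps its previous value."""
    return (value + 1) & MASK32, cf

# A multiprecision loop can INC its counter without destroying the carry
# left over from the previous limb's addition:
cf = 1                        # carry produced by an earlier adc
counter, cf = inc32(5, cf)    # counter advances, CF survives
print(counter, cf)            # -> 6 1

# ADD would clobber that carry:
counter, cf = add32(5, 1, cf)
print(counter, cf)            # -> 6 0
```

Note that `inc32(0xFFFFFFFF, 0)` wraps to zero yet still reports CF clear, which is the surprising case Ratch's demo below exercises.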
Quote from: Ratch on April 29, 2006, 12:04:18 AM
jdoe,
Quote
add/sub are faster than inc/dec even on AMD processors
Both ADD and INC are DirectPath (not VectorPath) instructions according to the AMD Optimization Manual http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf . Also, many optimization examples in the manual use INC under the right circumstances, i.e. no reading or writing of the register immediately after modifying it. I can't find any statement in the manual saying that an ADD is preferable to an INC on the AMD.
I don't care about what was or wasn't written. From the test I did, using add/sub is in the worst case as fast as, or faster than, inc/dec. On my AMD Athlon 1800+, at least.
If you heard on the radio that the sky is green today, would you believe it without going outside to see for yourself?
jdoe,
Quote
If you heard on the radio that the sky is green today, would you believe it without going outside to see for yourself?
From the radio? No I certainly would not. But if the one who made the sky said so, then I would believe it until I saw otherwise. Check your timings again. They can be tricky with insidious pitfalls. Ratch
tenkey,
Quote
Because ADD and CMP change the same set of flags, and INC and CMP don't, there may be a stall for creating the correct flag setting in the latter case.
Again I point out, the CMPs in his example code are effectively NOPs. The flags the CMPs set or clear are wiped out by the following ADD and INC instructions. Ratch
In the words of Intel from PIV manual 4,
Quote
The inc and dec instructions should always be avoided. Using add and sub instructions instead of inc and dec instructions avoid data dependence and improve performance.
This probably has something to do with why ADD/SUB are faster on later Intel hardware. :bg
Quote from: Ratch on April 29, 2006, 01:10:44 AM
tenkey,
Quote
Because ADD and CMP change the same set of flags, and INC and CMP don't, there may be a stall for creating the correct flag setting in the latter case.
Again I point out, the CMPs in his example code are effectively NOPs. The flags the CMPs set or clear are wiped out by the following ADD and INC instructions. Ratch
Here is a demonstration of the INC instruction (DEC is similar). Predict what the following code will produce, then run it. Replace the INC with the equivalent ADD and see if there's a difference.
.386
.model stdcall, flat
option casemap :none            ; case sensitive

include c:\masm32\include\windows.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
includelib c:\masm32\lib\kernel32.lib
includelib c:\masm32\lib\user32.lib

.data
caseclr db "CF is cleared by CMP, not set by INC.",0
case0   db "CF is not set by STC.",0
case1   db "CF is not cleared by CMP.",0
case2   db "CF is set by INC.",0

.code
_start:
    stc                     ; set CF
    jnc carryclear_case0    ; error if CF is clear
    mov ecx, -1
    mov eax, 7
    cmp eax, 3              ; 7 - 3 = 4, no carry (borrow)
    jc carryset_case1       ; error if CF is set
    ; CF status is "clear"
    inc ecx                 ; FFFFFFFF + 1 = 0 w/carry - is CF set?
    jc carryset_case2       ; find out!
carryclear:
    invoke MessageBox, NULL, addr caseclr, addr caseclr, MB_OK
    jmp quit
carryclear_case0:
    invoke MessageBox, NULL, addr case0, addr case0, MB_OK
    jmp quit
carryset_case1:
    invoke MessageBox, NULL, addr case1, addr case1, MB_OK
    jmp quit
carryset_case2:
    invoke MessageBox, NULL, addr case2, addr case2, MB_OK
quit:
    invoke ExitProcess, 0
end _start
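The demo's control flow can be predicted with a hypothetical Python walk-through of the flag rules (a sketch, not an emulator; the message strings are the ones from the listing above): with INC, CF stays clear after the CMP, so the "carryclear" path is taken; with ADD ECX, 1 the wrap from 0FFFFFFFFh to 0 sets CF, taking the "case2" path.

```python
MASK32 = 0xFFFFFFFF

def run_demo(use_inc):
    """Walk the demo: stc; cmp eax,3 with eax=7; then inc/add on ecx=-1."""
    cf = 1                          # stc sets CF, so jnc falls through
    eax, ecx = 7, MASK32            # mov eax,7 / mov ecx,-1
    cf = 1 if eax < 3 else 0        # cmp eax,3: 7-3 needs no borrow, CF cleared
    total = ecx + 1                 # FFFFFFFF + 1 wraps to 0 with a carry-out
    ecx = total & MASK32
    if not use_inc:                 # add ecx,1 rewrites CF from the carry-out
        cf = 1 if total > MASK32 else 0
    # inc ecx leaves CF exactly as the cmp left it
    if cf:
        return "CF is set by INC."                      # carryset_case2 message
    return "CF is cleared by CMP, not set by INC."      # carryclear message

print(run_demo(True))    # INC version
print(run_demo(False))   # ADD version
```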
Quote from: jdoe on April 28, 2006, 09:51:16 PM
Quote from: arafel on April 28, 2006, 08:33:46 PM
(at least for Intel CPUs; I don't know if there is a difference for AMD)
add/sub are faster than inc/dec even on AMD processors. :thumbu
Maybe under certain conditions; generally not:
Press any key to start...
add 1 : 1019 clocks
add 2 : 1020 clocks
add 3 : 1020 clocks
add 4 : 1363 clocks
inc 1 : 1020 clocks
inc 2 : 1020 clocks
inc 3 : 1021 clocks
inc 4 : 1361 clocks
add/cmp : 1019 clocks
inc/cmp : 1019 clocks
Press any key to exit...
[attachment deleted by admin]
Athlon 1.2 Ghz @ 1190 Mhz
Windows XP SP2 512MB
Press any key to start...
add 1 : 1026 clocks
add 2 : 1027 clocks
add 3 : 1351 clocks
add 4 : 1802 clocks
inc 1 : 1027 clocks
inc 2 : 1025 clocks
inc 3 : 1028 clocks
inc 4 : 1373 clocks
add/cmp : 1026 clocks
inc/cmp : 1027 clocks
Press any key to exit...
Quote from: thomas_remkus on April 28, 2006, 08:17:49 PM
when I use "inc eax" it's much slower than "add eax, 1". why is this?
Code optimization is like working on a Sudoku puzzle or decrypting a cryptogram. It can be lots of fun, and also maddeningly annoying at the same time. :bg
See Agner Fog's optimization guide: http://www.agner.org/assem/
Quote from: Ratch on April 29, 2006, 01:02:43 AM
But if the one who made the sky said so, then I would believe it until I saw otherwise.
You know how to answer.
Ok, I'll give you one point about a little gain with INC/DEC on AMD in some circumstances, but I keep saying that generally, using ADD/SUB is
as fast or faster. In other words, when writing optimized code, trying both is a good idea.
Unfortunately JDoe is right...
This is a mistake made by the new CPUs (P4 and up),
one of the aberrations of human technology "evolution".
New, less experienced people have come to the development teams, and they simply forgot about the importance of INC and DEC...
ADD and SUB are inherently more complex operations in hardware than INC/DEC, but the newcomers forgot about it...
They will rediscover it some day... if ever :D There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC...
Such is life on this planet...
If somebody has a P3/P2/P1/386, INC/DEC will be much faster than ADD/SUB.
jdoe,
ADD EAX,1 uses three times as many bytes as INC EAX. Ratch
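Ratch's size claim can be checked against the IA-32 opcode map (a Python sketch; the byte sequences are hand-encoded here for illustration, not produced by an assembler, and assemblers normally pick the short imm8 form shown):

```python
# 32-bit encodings from the IA-32 opcode map (hand-encoded for illustration):
#   inc eax     -> 40        (one-byte short form)
#   add eax, 1  -> 83 C0 01  (ADD r/m32, imm8, with a ModRM byte)
ENCODINGS = {
    "inc eax":    bytes([0x40]),
    "add eax, 1": bytes([0x83, 0xC0, 0x01]),
}

for asm, code in ENCODINGS.items():
    print(f"{asm:12} {code.hex(' ')}  ({len(code)} byte(s))")

# The ADD form is indeed three times the size of the INC form.
ratio = len(ENCODINGS["add eax, 1"]) / len(ENCODINGS["inc eax"])
print(ratio)   # -> 3.0
```

This is why the later posts about jump distances and cache-line boundaries matter: swapping INC for ADD changes the code size, which can move everything after it.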
Quote from: Ratch on April 29, 2006, 08:42:41 PM
ADD EAX,1 uses three times as many bytes as INC EAX. Ratch
I won't argue with that because it is an immutable truth. BTW, I've never talked about optimizing for size, which has a different purpose than speed.
You definitely want the last word on the subject. Keep searching...
Bogdan is correct here; it is simply a technology change based on how the hardware is constructed. INC and DEC performed well on most of the older stuff, but the PIV is internally different, and Intel publishes that it is preferred to use ADD/SUB instead. From what I can tell, later AMD stuff works much the same way. The upshot is: if you are still writing code dedicated to older hardware and the speed actually matters, use INC/DEC, but if you are targeting modern hardware, use ADD/SUB as the manufacturer suggests.
Quote from: EduardoS on April 29, 2006, 02:08:26 PM
Quote from: jdoe on April 28, 2006, 09:51:16 PM
Quote from: arafel on April 28, 2006, 08:33:46 PM
(at least for Intel CPUs; I don't know if there is a difference for AMD)
add/sub are faster than inc/dec even on AMD processors. :thumbu
Maybe under certain conditions; generally not:
Press any key to start...
add 1 : 1019 clocks
add 2 : 1020 clocks
add 3 : 1020 clocks
add 4 : 1363 clocks
inc 1 : 1020 clocks
inc 2 : 1020 clocks
inc 3 : 1021 clocks
inc 4 : 1361 clocks
add/cmp : 1019 clocks
inc/cmp : 1019 clocks
Press any key to exit...
It was on an Athlon 64... I don't think AMD has anything newer...
AMD killed some instructions in 64-bit mode; inc/dec lose the one-byte form but still exist in the two-byte form, so I guess they will support inc/dec for some time more.
I'm curious to know how this code does on a P4.
Quote from: hutch-- on April 29, 2006, 09:58:46 PM
...but if you are targeting modern hardware, use ADD/SUB as the manufacturer suggests.
I wonder then, why manufacturers simply have not aliased the two in modern processors?
Mark,
From memory, it's purely a specification difference in which flags are set. What you suggest makes sense, as it would remove an instruction redundancy.
Maybe Intel has corrected it (the slow INC/DEC)
on the Merom/Conroe/Woodcrest (Core microarchitecture) (mobile, desktop, server),
which trace their lineage:
Pentium 3 -> Pentium M -> Yonah -> Core microarchitecture
Woodcrest has a June launch, Conroe July, Merom August.
Only some engineering samples are out in the wild, with some benchmark results available.
Unfortunately I haven't seen any instruction spec sheets for them yet.
Quote from: jdoe on April 29, 2006, 09:44:57 PM
I won't argue with that because it is an immutable truth. BTW, I've never talked about optimizing for size, which has a different purpose than speed.
I think you're partially wrong... because the size of the code changes the size of the jump in the loop... it could make a big difference, sometimes...
But you're right when you say that both have to be tested... I always do that, and I've seen that inc/dec is generally a bit faster than add/sub (with a Celeron 700 MHz and a P4 2 GHz)... but of course, like I said previously, it depends on the entire code/proc...
I ... uh, really ... did not expect such a large conversation. Clearly small things like this are really hot topics.
Here's what I got out of this: INC/DEC used to be faster, but ADD/SUB are faster now. That does not mean they always will be; it can depend on the chip. Also, ADD/SUB are larger, so the jump might take a different method to get there, so size is something to consider.
I have tested *my* code, as the cloud-maker, and have found that in my instance, with Visual Studio and inline __asm, ADD/SUB is faster under debug but INC/DEC is faster under release. For what reasons, I have no idea.
This really tells me one more major thing... I'll check my clouds very carefully!!
Quote
There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC
Unfortunately, this sounds pretty logical. I can't figure out a single reason for ADD/SUB to be faster than INC except a hardware design mistake, or Bogdan's point of view...
Quote from: EduardoS on April 29, 2006, 10:44:32 PM
I'm curious to know how this code does on a P4.
EduardoS, such tests are far from realistic. Continuously repeating the inc/add instructions doesn't represent any real-life scenario, where other factors are almost always present.
Quote from: thomas_remkus on April 30, 2006, 02:29:49 AM
Also, ADD/SUB are larger, so the jump might take a different method to get there, so size is something to consider.
It's more a case of crossing cache-line boundaries than the distance of the jmps. On some occasions add 1/sub 1, because of its size, will make you cross a boundary and lead to a big slowdown. The solution then might be replacing it with inc/dec to reduce the code size.
...Anyway, as has been mentioned here already, it's better just to try different approaches when optimizing and see which one gives better results.
P.S. In everyday coding, when not doing tight optimization, I always use inc/dec, because it requires less typing :green
Quote from: arafel on April 30, 2006, 12:50:37 PM
EduardoS, such tests are far from realistic. Continuously repeating the inc/add instructions doesn't represent any real-life scenario, where other factors are almost always present.
I can't test every algo to see whether inc or add is faster.
That code is useful to see if there is any difference in latency or throughput, and whether the flag dependency affects the result; on the Athlon it doesn't show any difference.
Quote from: EduardoS on April 30, 2006, 02:32:47 PM
I can't test every algo to see whether inc or add is faster.
Therefore I stand by what I said: better to try different approaches when optimizing and see which one is better.
Quote from: EduardoS on April 30, 2006, 02:32:47 PM
That code is useful to see if there is any difference in latency or throughput, and whether the flag dependency affects the result; on the Athlon it doesn't show any difference.
cmp ebx, 5
inc eax | add eax, 1
cmp ebx, 5
inc eax | add eax, 1
....
That doesn't exactly test the dependency you mentioned.
Each subsequent cmp instruction in such a case won't behave differently whether add or inc was used, since it doesn't depend on the difference in CF handling between add and inc.
Adding some other instruction which depends on those things will solve this.
cmp ebx, 5 ; ZF is affected
inc eax ; CF is not affected
seta dl ;;; seta will need to wait for both cmp and inc to retire to get the needed flag values.
cmp ebx, 5 ; ZF is affected
add eax, 1 ; CF and ZF are affected
seta dl ; seta will execute right away after add retiring, independently of cmp progress status.
Quote from: arafel on April 30, 2006, 04:21:44 PM
That doesn't exactly test the dependency you mentioned.
Each subsequent cmp instruction in such a case won't behave differently whether add or inc was used, since it doesn't depend on the difference in CF handling between add and inc.
You are right here.
Quote
Adding some other instruction which depends on those things will solve this.
cmp ebx, 5 ; ZF is affected
inc eax ; CF is not affected
seta dl ;;; seta will need to wait for both cmp and inc to retire to get the needed flag values.
cmp ebx, 5 ; ZF is affected
add eax, 1 ; CF and ZF are affected
seta dl ; seta will execute right away after add retiring, independently of cmp progress status.
Here we have a true dependency: the seta depends on the results of both cmp and inc. You can't replace the inc with "add eax, 1" because you would lose the carry status and get a different result. The question is about a false dependency, where inc and add reg, 1 give the same result; replacing the seta with setz, for example.
Repeat 1024
cmp ebx, 5
inc eax
setz bl
endm
add/seta : 1020 clocks
inc/seta : 2044 clocks <--True dependency
add/setz : 1020 clocks
inc/setz : 1020 clocks
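The true-versus-false dependency distinction above can be made mechanical with a toy scheduler model (a hypothetical Python sketch; the flag sets are the ones from the IA-32 instruction reference): an instruction must wait for the most recent writer of each flag it reads, so seta after cmp/inc has two producers, while setz after cmp/inc has only one.

```python
# Which flags each instruction writes (CMP and ADD write all six arithmetic
# flags; INC writes all of them except CF) and which flags each setcc reads.
WRITES = {
    "cmp": {"CF", "ZF", "SF", "OF", "AF", "PF"},
    "add": {"CF", "ZF", "SF", "OF", "AF", "PF"},
    "inc": {"ZF", "SF", "OF", "AF", "PF"},        # CF untouched
}
READS = {"seta": {"CF", "ZF"}, "setz": {"ZF"}}

def producers(sequence, consumer):
    """Return the instructions the consumer must wait on: for each flag it
    reads, the most recent earlier instruction that wrote that flag."""
    deps = set()
    for flag in READS[consumer]:
        for op in reversed(sequence):
            if flag in WRITES[op]:
                deps.add(op)
                break
    return deps

print(producers(["cmp", "inc"], "seta"))  # waits on both cmp and inc
print(producers(["cmp", "add"], "seta"))  # waits on add only
print(producers(["cmp", "inc"], "setz"))  # waits on inc only
```

This matches the timings above: only the inc/seta pairing drags in a second producer, which is why it alone doubled the clock count.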
Yes, but the problem is that setz doesn't depend on CF. You just can't use an instruction which isn't affected by CF! That defeats the whole purpose of what you were testing, which is why the outcome is identical in the test results you posted.
Perhaps choosing seta wasn't such a good idea on my part, but it was merely an example of the partial flags stall I was talking about.
Will this prove my point better? :
add eax, 1 ; affects CF and other flags
rcr ecx, 1 ; affects CF
add esi, 1234567 ; doesn't need to wait for both preceding instructions to retire
inc eax ; affects some flags except CF
rcr ecx, 1 ; affects CF
add esi, 1234567 ; will wait for both preceding instructions to retire
Here both code sequences need only 1 cycle;
the Athlon doesn't seem to be slower with inc.
hmm. On PIII (repeated 1024 times of course):
add/rcr/add: 2516 clocks
inc/rcr/add : 7374 clocks
I'll test on a couple of PIVs tomorrow... although IMO it shouldn't be any different on them, unless the PIV doesn't have a flags-register stall penalty, which I doubt.
Quote from: arafel on April 30, 2006, 08:18:59 PM
add eax, 1 ; affects CF and other flags
rcr ecx, 1 ; affects CF
add esi, 1234567 ; doesn't need to wait for both preceding instructions to retire
inc eax ; affects some flags except CF
rcr ecx, 1 ; affects CF
add esi, 1234567 ; will wait for both preceding instructions to retire
A small note: the two code sequences give different results; the first is a true dependency:
add eax, 1 ; affects CF and other flags
rcr ecx, 1 ; affects CF ;;; rcr needs the carry to perform the rotate, and the first instruction changes the carry (true dependency)
add esi, 1234567 ; doesn't need to wait for both preceding instructions to retire
An example where both inc and add reg, 1 give the same result would be:
add eax, 1 ; affects CF and other flags
bt ecx, 1 ; affects CF
add esi, 1234567 ; doesn't need to wait for both preceding instructions to retire
Which needs the same clock cycles to execute here.
Just an interesting effect on the P3; try both:
add eax, 1 ; affects CF and other flags
bt ecx, 1 ; affects CF
inc eax ; affects some flags except CF
rcr ecx, 1 ; affects CF
This is partly how the Athlon XP 1800 claims to run at "2500" speed - it simply executes some code faster than 1:1 clock speed.
Instead of comparing code sequences that are not representative of real code, this compares two versions of the MASM32 cmpmem procedure:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include timers.asm
cmpmem_incdec PROTO :DWORD,:DWORD,:DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
pMem dd 0
fLen dd 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke read_disk_file, chr$("\masm32\m32lib\cmpmem.asm"),
ADDR pMem, ADDR fLen
print ustr$(fLen)," bytes",13,10
invoke cmpmem, pMem, pMem, fLen
print ustr$(eax),13,10
invoke cmpmem_incdec, pMem, pMem, fLen
print ustr$(eax),13,10
LOOP_COUNT equ 1000000
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke cmpmem, pMem, pMem, fLen
counter_end
print ustr$(eax)," cycles, cmpmem",13,10
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke cmpmem_incdec, pMem, pMem, fLen
counter_end
print ustr$(eax)," cycles, cmpmem_incdec",13,10
free pMem
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
cmpmem_incdec proc buf1:DWORD,buf2:DWORD,bcnt:DWORD
push esi
push edi
mov edx, bcnt
shr edx, 2 ; div by 4
mov esi, buf1
mov edi, buf2
xor ecx, ecx
align 4
@@:
mov eax, [esi+ecx] ; DWORD compare main file
cmp eax, [edi+ecx]
jne fail
add ecx, 4
;sub edx, 1
DEC EDX
jnz @B
mov edx, bcnt ; calculate any remainder
and edx, 3
jz match ; exit if its zero
xor eax, eax ; clear EAX for partial writes
@@:
mov al, [esi+ecx] ; BYTE compare tail
cmp al, [edi+ecx]
jne fail
;add ecx, 1
INC ECX
;sub edx, 1
DEC EDX
jnz @B
jmp match
fail:
xor eax, eax ; return zero if DIFFERENT
jmp quit
match:
mov eax, 1 ; return NON zero if SAME
quit:
pop edi
pop esi
ret
cmpmem_incdec endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Typical result on a P3:
1777 cycles, cmpmem
1081 cycles, cmpmem_incdec
Typical result on my old K5:
916 cycles, cmpmem
910 cycles, cmpmem_incdec
[attachment deleted by admin]
Interestingly enough, I am getting very little difference on this PIV.
1396 bytes
1
1
1201 cycles cmpmem
1210 cycles cmpmem_incdec
Press any key to exit...
The same can be said for my AMD.
Quote
1396 bytes
1
1
1097 cycles
1077 cycles
Press any key to exit...
AMD Athlon 1800+
1396 bytes
1
1
1079 cycles
1081 cycles
Press any key to exit...
PBrennick, I'm curious to know your AMD model. I can't get a difference of that range on mine by using INC/DEC.
The attachment this time tests INC/DEC versions of the MASM32 cmpmem, szRev, and szWcnt procedures against the original ADD/SUB versions. I simply replaced each ADD reg, 1 with INC reg and each SUB reg, 1 with DEC reg.
Typical results on a P3:
1776 cycles, cmpmem
1080 cycles, cmpmem_incdec
944 cycles, szRev
901 cycles, szRev_incdec
1584 cycles, szWcnt
1332 cycles, szWcnt_incdec
[attachment deleted by admin]
A64
Quote
1063 cycles, cmpmem
1063 cycles, cmpmem_incdec
824 cycles, szRev
822 cycles, szRev_incdec
1266 cycles, szWcnt
1432 cycles, szWcnt_incdec
Press any key to exit...
I don't find dependencies in szWcnt; why was it slower?
I don't understand why, but if I change the INC ECX after the test_word label back to ADD ECX, 1 the cycle count for a P3 drops from 1332 to about 1281.
Athlon 1300 Mhz Windows XP SP2
1396 bytes
1
1
1078 cycles, cmpmem
1077 cycles, cmpmem_incdec
1076 cycles, cmpmem
1071 cycles, cmpmem_incdec
901 cycles, szRev
866 cycles, szRev_incdec
1454 cycles, szWcnt
1532 cycles, szWcnt_incdec
P4 1700MHz
1185 cycles, cmpmem
1212 cycles, cmpmem_incdec
1070 cycles, szRev
1019 cycles, szRev_incdec
1640 cycles, szWcnt
1759 cycles, szWcnt_incdec
Press any key to exit...
Quote from: MichaelW on May 01, 2006, 03:13:19 PM
I don't understand why, but if I change the INC ECX after the test_word label back to ADD ECX, 1 the cycle count for a P3 drops from 1332 to about 1281.
Here, when I change one or two incs to add, the time doesn't change, but when I change all of them it drops. I tested different alignments and the clock count varies from 1200 to 1600, so I guess the difference in the third algo is due to alignment (another topic?).
Yes, it was alignment. Putting an xchg ax, ax after each inc/dec (to match the size of add/sub):
1063 cycles, cmpmem
1063 cycles, cmpmem_incdec
824 cycles, szRev
821 cycles, szRev_incdec
1265 cycles, szWcnt
1265 cycles, szWcnt_incdec
Press any key to exit...
[attachment deleted by admin]
Quote from: MichaelW on May 01, 2006, 01:41:00 PM
The attachment this time tests INC/DEC versions of the MASM32 cmpmem, szRev, and szWcnt procedures against the original ADD/SUB versions. I simply replaced each ADD reg, 1 with INC reg and each SUB reg, 1 with DEC reg.
Interesting, I tried to run it from WinRAR and it crashed (AMD XP 2500+). It runs fine if extracted manually and then run. This is the only time this has ever happened that I recall.
1083 cycles, cmpmem
1086 cycles, cmpmem_incdec
911 cycles, szRev
871 cycles, szRev_incdec
1467 cycles, szWcnt
1544 cycles, szWcnt_incdec
Guys, if you need fair tests, use a non-multitasking OS (DOS, for example); otherwise any difference in results can be attributed to measurement error...
BogdanOntanu said:
Quote
There is a political reason also: the HLL languages do much better at using ADD/SUB than INC/DEC ...
I made the following program to test this on my compiler:
; Dimension Variables
DIM I AS LONG
DIM StartTime AS LONG
OBMain.CREATE
END EVENT
Button1.COMMAND
StartTime=GETTICKCOUNT()
FOR I=1 to 2000000000
; Do nothing inside loop
NEXT I
Button1.TEXT=STR$((GETTICKCOUNT()-StartTime))
END EVENT
Here is the .asm code for inc:
; LN:11 FOR I=1 to 2000000000
mov eax,1
mov [I],eax
mov eax,2000000000
mov [_LopVec1],eax
_Lbl2:
mov eax,[I]
cmp eax,[_LopVec1]
jg _Lbl4
; LN:12 ; Do nothing inside loop
; LN:13 NEXT I
_Lbl3:
inc [I]
jmp _Lbl2
_Lbl4:
Here is the .asm code for add:
; LN:11 FOR I=1 to 2000000000
mov eax,1
mov [I],eax
mov eax,2000000000
mov [_LopVec1],eax
_Lbl2:
mov eax,[I]
cmp eax,[_LopVec1]
jg _Lbl4
; LN:12 ; Do nothing inside loop
; LN:13 NEXT I
_Lbl3:
mov eax,[I]
add eax,1
mov [I],eax
jmp _Lbl2
_Lbl4:
With an inc instruction in the "Next" incrementer code, the pgm runs in 5.2 seconds.
After modifying the compiler to generate a mov, add 1, mov sequence to increment the loopvar, the pgm was recompiled and ran in 4.2 seconds.
This is a 22% improvement!
Changing the compiler to do this took less than a minute of work. The compiler can output either type of code with equal ease. In fact, it wouldn't be that difficult to detect the type of machine and output the inc or add code on the fly.
msmith, whats your processor?
It's a PIV 2.8GHZ.
The P4 breaks inc into two uops, so add is much faster on it;
the same doesn't happen on the Athlon.
EduardoS,
Thanks for the info.
The problem with sensing the processor type and generating the optimized code is that it is easy to do so at compile time, but much harder at run time.
If I sense the processor type at compile time and generate the best code, there is no telling what kind of machine the .exe will be run on.
The 22% speed improvement on each iteration of a FOR LOOP is significant, especially if there is not much inside the loop.
Mike
The problem with such simple "do almost nothing" loops is that they never represent the real behaviour of a compiler in a big/medium application.
Basically you will never have such empty loops in real-life applications and algorithms.
I have seen compilers "forget" about variable assignments under stress conditions, or use constructs like:
movzx edx, word ptr [esi] ... correct, because the pointer was typed "word"
and then:
movzx ecx, dx --> funny, is it not? Apparently edx is also considered "word" :P
sometimes on the very next instruction...
I am always amazed at the huge credit compilers get from performing such simple tests :D
when in fact they perform much worse on really complicated algorithms...
Quote
The problem with such simple "do almost nothing" loops is that they never represent the real behaviour of a compiler in a big/medium application.
Basically you will never have such empty loops in real-life applications and algorithms.
It is true that a contrived test never fully represents real behavior.
The behavior of the processor (cache loading, branch lookahead, etc.) will likely change, but not the behavior of the compiler.
No test in life is really representative. On your driving test, you have a policeman in your car to check you. Do you drive the same with a policeman in the car as when you don't? Every test in life is contrived, but the advantage of such tests is the possibility of testing just one thing in isolation. If the inc/add test had complex code in the loop, we would not know for sure why one example was faster than the other.
I tested szWcnt_incdec on an Athlon XP and tried some variations. I didn't change the code, just the code alignment; the best variation was a kind of "out of logic" alignment:
1451 cycles, szWcnt
1334 cycles, szWcnt_incdec <-- best variation
With some comments:
szWcnt_incdec proc src:DWORD,txt:DWORD
push ebx
push esi
push edi
mov edx, len([esp+16]) ; procedure call for src length
sub edx, len([esp+20]) ; procedure call for 1st text length
mov eax, -1
mov esi, [esp+16] ; source in ESI
add edx, esi ; add src to exit position
xor ebx, ebx ; clear EBX to prevent stall with BL
add edx, 1 ; correct to get last word
mov edi, [esp+20] ; text to count in EDI
sub esi, 1
align 16
pre:
INC EAX ; was faster with this instruction aligned to 16 bytes
xchg ax, ax
nop
wcst:
add esi, 1 ; for the best performance, the distance from the start of the inc above must be 4 bytes, and
; there must be two instructions (nops) between them (don't ask why)
xchg ax, ax
nop
cmp edx, esi ; the distance from the start of the add must be 6 bytes (don't ask why), and
; there must be two instructions (nops) between them (again, don't ask why)
jle wcout ;
mov bl, [esi] ;
cmp bl, [edi] ; all of these instructions must be in sequence
jne wcst ;
nop ;
xor ecx, ecx ; the inc must come right after the xor, and there must be a nop before the xor and one after the inc
test_word: ; the total length of these 4 instructions must be 6 bytes
INC ECX ;
xchg ax, ax ;
cmp BYTE PTR [edi+ecx], 0 ;
je pre ;
mov bl, [esi+ecx] ; must be in sequence
cmp bl, [edi+ecx] ;
jne wcst ;
jmp test_word ;
wcout:
pop edi
pop esi
pop ebx
ret 8
szWcnt_incdec endp
I really don't understand why this alignment was the fastest on XP 2000+.