why is "add" faster than "inc"

Started by thomas_remkus, April 28, 2006, 08:17:49 PM


arafel

Yes, but the problem is that setz doesn't depend on CF. You can't use an instruction whose result doesn't depend on CF for this test; it defeats the purpose of what you were testing. That is why the outcome is identical in the test results you posted.

Perhaps choosing seta wasn't such a good idea on my part, but it was merely an example of the partial register stall I was talking about.

Will this prove my point better? :

  add     eax, 1       ; affects CF and other flags
  rcr     ecx, 1       ; affects CF
  add     esi, 1234567 ; doesn't need to wait for both preceding instructions to retire       
 

  inc     eax          ; affects some flags except CF
  rcr     ecx, 1       ; affects CF
  add     esi, 1234567 ; will wait for both preceding instructions to retire

EduardoS

Here both sequences need only 1 cycle;
the Athlon doesn't seem to be slower with inc.

arafel

Hmm. On a PIII (each sequence repeated 1024 times, of course):

add/rcr/add: 2516 clocks
inc/rcr/add : 7374 clocks

I'll test on a couple of PIVs tomorrow... Although IMO it shouldn't be any different on them unless the PIV has no flags-register stall penalty, which I doubt.

EduardoS

Quote from: arafel on April 30, 2006, 08:18:59 PM
  add     eax, 1       ; affects CF and other flags
  rcr     ecx, 1       ; affects CF
  add     esi, 1234567 ; doesn't need to wait for both preceding instructions to retire       
 

  inc     eax          ; affects some flags except CF
  rcr     ecx, 1       ; affects CF
  add     esi, 1234567 ; will wait for both preceding instructions to retire


A small note: the two sequences give different results. The first has a true dependency:
  add     eax, 1       ; affects CF and other flags
  rcr     ecx, 1       ; affects CF ;;; rcr needs the carry to perform the shift and the first instruction changes the carry (true dependency)
  add     esi, 1234567 ; doesn't need to wait for both preceding instructions to retire       
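
The different results can be checked directly. What follows is only a minimal sketch, assuming the MASM32 SDK is installed under \masm32; the variable names r_add and r_inc are made up for this illustration. CF is set before each sequence, so the bit that rcr rotates in depends on whether the preceding instruction rewrote the carry:

```asm
    include \masm32\include\masm32rt.inc

    .data
      r_add dd 0                    ; hypothetical name: ecx after add/rcr
      r_inc dd 0                    ; hypothetical name: ecx after inc/rcr
    .code
start:
    xor eax, eax
    xor ecx, ecx
    stc                             ; CF = 1 going in
    add eax, 1                      ; 0 + 1 produces no carry, so CF becomes 0
    rcr ecx, 1                      ; rotates CF = 0 into bit 31
    mov r_add, ecx

    xor eax, eax
    xor ecx, ecx
    stc                             ; CF = 1 going in
    inc eax                         ; leaves CF alone, so CF is still 1
    rcr ecx, 1                      ; rotates CF = 1 into bit 31
    mov r_inc, ecx

    print ustr$(r_add)," after add/rcr",13,10
    print ustr$(r_inc)," after inc/rcr",13,10
    exit
end start
```

With CF set beforehand, the add/rcr line should print 0 and the inc/rcr line 2147483648 (80000000h), since only inc lets the old carry reach rcr.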
 

An example where both inc and add reg, 1 give the same result would be:

  add     eax, 1       ; affects CF and other flags
  bt      ecx, 1       ; affects CF
  add     esi, 1234567 ; doesn't need to wait for both preceding instructions to retire

Here it takes the same number of clock cycles as the inc version.

EduardoS

Just an interesting effect on the P3; try both:


  add     eax, 1       ; affects CF and other flags
  bt     ecx, 1       ; affects CF
 


  inc     eax          ; affects some flags except CF
  rcr     ecx, 1       ; affects CF


Mark Jones

This is partly how the Athlon XP 1800 claims to run at "2500" speed - it simply executes some code sequences faster than its clock speed alone would suggest.

MichaelW

Instead of comparing code sequences that are not representative of real code, this compares two versions of the MASM32 cmpmem procedure:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include timers.asm

    cmpmem_incdec PROTO :DWORD,:DWORD,:DWORD

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      pMem dd 0
      fLen dd 0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke read_disk_file, chr$("\masm32\m32lib\cmpmem.asm"),
                          ADDR pMem, ADDR fLen
    print ustr$(fLen)," bytes",13,10

    invoke cmpmem, pMem, pMem, fLen
    print ustr$(eax),13,10
    invoke cmpmem_incdec, pMem, pMem, fLen
    print ustr$(eax),13,10

    LOOP_COUNT equ 1000000

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke cmpmem, pMem, pMem, fLen
    counter_end
    print ustr$(eax)," cycles, cmpmem",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke cmpmem_incdec, pMem, pMem, fLen
    counter_end
    print ustr$(eax)," cycles, cmpmem_incdec",13,10

    free pMem
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

cmpmem_incdec proc buf1:DWORD,buf2:DWORD,bcnt:DWORD

    push esi
    push edi

    mov edx, bcnt
    shr edx, 2                      ; div by 4

    mov esi, buf1
    mov edi, buf2
    xor ecx, ecx

  align 4
  @@:
    mov eax, [esi+ecx]              ; DWORD compare main file
    cmp eax, [edi+ecx]
    jne fail
    add ecx, 4
    ;sub edx, 1
    DEC EDX
    jnz @B

    mov edx, bcnt                   ; calculate any remainder
    and edx, 3
    jz match                        ; exit if it's zero
    xor eax, eax                    ; clear EAX for partial writes

  @@:
    mov al, [esi+ecx]               ; BYTE compare tail
    cmp al, [edi+ecx]
    jne fail
    ;add ecx, 1
    INC ECX
    ;sub edx, 1
    DEC EDX
    jnz @B

    jmp match

  fail:
    xor eax, eax                    ; return zero if DIFFERENT
    jmp quit

  match:
    mov eax, 1                      ; return NON zero if SAME

  quit:
    pop edi
    pop esi

    ret

cmpmem_incdec endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start


Typical result on a P3:

1777 cycles, cmpmem
1081 cycles, cmpmem_incdec


Typical result on my old K5:

916 cycles, cmpmem
910 cycles, cmpmem_incdec



[attachment deleted by admin]

hutch--

Interestingly enough, I am getting very little difference on this PIV.


1396 bytes
1
1
1201 cycles cmpmem
1210 cycles cmpmem_incdec
Press any key to exit...

PBrennick

The same can be said for my AMD.

Quote
1396 bytes
1
1
1097 cycles
1077 cycles
Press any key to exit...

jdoe

AMD Athlon 1800+


1396 bytes
1
1
1079 cycles
1081 cycles
Press any key to exit...



PBrennick, I'm curious to know your AMD model. I can't get a difference of that size on mine by using INC/DEC.

MichaelW

The attachment this time tests INC/DEC versions of the MASM32 cmpmem, szRev, and szWcnt procedures against the original ADD/SUB versions. I simply replaced each ADD reg, 1 with INC reg and each SUB reg, 1 with DEC reg.
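
For anyone who wants to repeat this kind of substitution without keeping two copies of each procedure, one possible approach is conditional assembly. This is only a sketch and not how the attached test was built; USE_INCDEC, step_up and step_down are hypothetical names, and the MASM32 SDK is assumed to be installed under \masm32:

```asm
    include \masm32\include\masm32rt.inc

    USE_INCDEC equ 1                ; hypothetical switch: 1 = INC/DEC, 0 = ADD/SUB

    step_up MACRO reg               ; hypothetical helper, not part of MASM32
      IF USE_INCDEC
        inc reg
      ELSE
        add reg, 1
      ENDIF
    ENDM

    step_down MACRO reg             ; hypothetical helper, not part of MASM32
      IF USE_INCDEC
        dec reg
      ELSE
        sub reg, 1
      ENDIF
    ENDM

    .code
start:
    xor ecx, ecx                    ; ecx counts iterations
    mov edx, 10                     ; edx is the loop counter
  @@:
    step_up ecx
    step_down edx                   ; both DEC and SUB set ZF when edx reaches 0
    jnz @B

    mov eax, ecx
    print ustr$(eax),13,10
    exit
end start
```

Setting USE_INCDEC to 0 and reassembling gives the ADD/SUB build; either way the loop should print 10, so only the timing differs between the two builds.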

Typical results on a P3:

1776 cycles, cmpmem
1080 cycles, cmpmem_incdec
944 cycles, szRev
901 cycles, szRev_incdec
1584 cycles, szWcnt
1332 cycles, szWcnt_incdec



[attachment deleted by admin]

EduardoS

A64
Quote
1063 cycles, cmpmem
1063 cycles, cmpmem_incdec
824 cycles, szRev
822 cycles, szRev_incdec
1266 cycles, szWcnt
1432 cycles, szWcnt_incdec
Press any key to exit...

I don't see any dependencies in szWcnt; why was it slower?

MichaelW

I don't understand why, but if I change the INC ECX after the test_word label back to ADD ECX, 1, the cycle count on a P3 drops from 1332 to about 1281.



dsouza123

Athlon 1300 MHz Windows XP SP2


1396 bytes
1
1
1078 cycles, cmpmem
1077 cycles, cmpmem_incdec



1076 cycles, cmpmem
1071 cycles, cmpmem_incdec
901 cycles, szRev
866 cycles, szRev_incdec
1454 cycles, szWcnt
1532 cycles, szWcnt_incdec

jdoe

P4 1700MHz


1185 cycles, cmpmem
1212 cycles, cmpmem_incdec
1070 cycles, szRev
1019 cycles, szRev_incdec
1640 cycles, szWcnt
1759 cycles, szWcnt_incdec
Press any key to exit...