The MASM Forum Archive 2004 to 2012

Project Support Forums => PowerBASIC Low Level Code => Topic started by: frktons on April 18, 2010, 08:52:38 PM

Title: ASM for FUN - #1 SUB
Post by: frktons on April 18, 2010, 08:52:38 PM
Hi everybody.

In my spare time I'm trying to learn some PB/ASM  stuff.

Here I started:
an old routine written by J. Fielden with ConASCII to display a text-based screen on a windows console.
The fraction of code I'm trying to optimize is:

---------------------------------------------------------------------------------------------------------------------
    FOR x = 1 to 4000
      sBuf = sBuf + MID$(MainStr, x, 1)+ CHR$(32)
    NEXT x
---------------------------------------------------------------------------------------------------------------------


When I saw this LOOP I remembered what an old ASM programmer once told me:
"why to use this kind of resource consuming instructions when you can do it in a much better way?"

The MID$ funcion is used a lot in strings manipulation, but it is not the most efficient way, i guess.
So I tried to add some speed to it and I rewrote it in this shape:


---------------------------------------------------------------------------------------------------------------------
    sBuf = SPACE$(8000)

    y = STRPTR(MainStr)
    y1 = STRPTR(sBuf)

    FOR x = 1 TO 8000 STEP 2
      POKE y1, PEEK(y)
      y = y + 1
      y1 = y1 + 2
    NEXT x
---------------------------------------------------------------------------------------------------------------------


This version is much faster then the previous one, but I think it is still not the best one.

PEEKing and POKEing single bytes in a 32 bit CPU is not the most efficient way of doing things.

Is anyone [more experienced  in PB/ASM stuff] able to suggest me a better way of doing this LOOP?

Thanks for your time  :8)

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: hutch-- on April 19, 2010, 01:24:26 AM
Frank,

One thing that is inherantly slow in any dialect of basic is string concatenation as it does a re-allocate for each iteration. The trick is to allocate a buffer larger enough to hold the entire result then just write each byte to the buffer sequentially. Just note though that nothing makes a console display fast, about the best you can do is create the string in large enough pieces and point it at the console in larger blocks and the display will get faster.
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 02:17:11 AM
you can use display page switching to make it appear faster
get an undisplayed page set up with the changes, then - bang - switch display pages
then, update the previous page and use it the next time
Title: Re: ASM for FUN - #1 SUB
Post by: jj2007 on April 19, 2010, 07:35:00 AM
Quote from: hutch-- on April 19, 2010, 01:24:26 AM
One thing that is inherantly slow in any dialect of basic is string concatenation as it does a re-allocate for each iteration.

Hutch,

It's a known problem but not limited to Basic, see e.g. string concatenation extremely slow (http://bugs.freepascal.org/view.php?id=7137) for Pascal, or StringBuilder vs StringBuffer vs String.concat (http://kaioa.com/node/59) for Java. The latter says "Creating a String with a length of 65536 (character by character) already takes about 22 seconds on an AMD64 X2 4200+". The same in PureMasmTM (below) takes 0.7 seconds on a P4....:

include \masm32\MasmBasic\MasmBasic.inc

.code
start:
Let esi="x"
push Timer()
For_ n=0 To 65535
Let esi=esi+"x"
Next
void Timer()
pop edx
sub eax, edx
; result: 718 ms on a P4
Inkey Str$("Your puter has constructed a 65k string in %i milliseconds", eax)
Exit
end start


... but that is still a lot of time. How would you speed that up, in the context of Let esi=esi+c$, i.e. in a situation where you a) have no idea how long the final string will be, b) don't know in advance how many strings will be appended, c) don't know in advance how long the string to be appended will be?
Title: Re: ASM for FUN - #1 SUB
Post by: hutch-- on April 19, 2010, 09:28:50 AM
JJ,

Whilew this is not much use to Frank, this is the method i use in basic to circumvent the concatenation slowdown inherant in the design of the language. I take your point that any language that uses an allocated string will get proresssively slower as the string gets longer, fundamental pascal strings with the descriptor at the front in a block of allocated memory.

the algo below is very adjustable in terms of initial buffer size and realloc interval, speed increases with the reduction in re-allocations but at the expense of using larger blocks of memory initially. basically start with a reasonably big buffer and it keeps reallocating until the appends stop.


#IF 0  ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    ---------------------------------------

    buffer$ = ""
    ' buffer$ = nul$(1024*1024*128)

    addin$ = "12345678901234567890123456789012345678901234567890"+_
             "12345678901234567890123456789012345678901234567890" ' 100 characters

    cloc = 0    ' start at beginning of buffer
    lpcn = 0    ' zero the loop counter

    Do
      cloc = repappend(buffer$,addon$,cloc,1024*1024)
      ! add lpcn, 1
    Loop while lpcn < 1000000               ' 1 million appends at 100 bytes each

    buffer$ = left$(buffer$,cloc)           ' trim off extra memory

    ---------------------------------------

    NOTES

      The algorithm addresses a fundamental problem with sequential string appends in
      basic, the reallocation and copy of memory each time the memory is reallocated.

      This algorithm is at it fastest when the original buffer to receive the appended
      data is large enough to hold it all without any reallocation of memory. It becomes
      progressively slower as the buffer size is decreased as more reallocations must be
      done. This is adjustable with the algorithm in two (2) ways.

      1. The larger the initial buffer is, the less times the memory is reallocated
      2. The larger the extra buffer size is the less times the memory is reallocated

      When the append string operation is completed, trim off any excess memory using
      the last return value from the function using the LEFT$() standard basic string
      function.

#ENDIF ' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION repappend(src$,addit$,ByVal cloc as DWORD,ByVal iextra as DWORD) as DWORD

    #REGISTER NONE

  ' ------------------------------------------------------
  ' src$        the main buffer to append extra data to.
  ' addit$      the byte data to append to the main buffer
  ' cloc        current location pointer
  ' iextra      size of increase to buffer, min 1 meg
  ' ------------------------------------------------------

    LOCAL pstr as DWORD

    pstr = StrPtr(src$)
    ! mov esi, pstr

    pstr = StrPtr(addit$)
    ! mov edi, pstr

    ! mov ebx, [esi-4]          ' stored buffer length
    ! mov eax, cloc
    ! add eax, [edi-4]          ' write location counter AND addit$ length to EAX

    ! cmp ebx, eax              ' test if combined length greater than source buffer size
    ! jg nxt1

    ! cmp iextra, 1024*1024     ' if realloc, check buffer size
    ! jge nxt0
    ! mov iextra, 1024*1024     ' set a minimum extra buffer size

  nxt0:
      extra$ = nul$(iextra)     ' allocate a buffer of "iextra" size
      src$ = src$ + extra$      ' realloc source string size
      pstr = StrPtr(src$)
    ! mov esi, pstr

  nxt1:
    ! mov ecx, 1
    ! or eax, -1
    ! add esi, cloc             ' add current location pointer for next copy position

  ' -----------
  ' unroll by 2
  ' -----------
  lbl0:
    ! add eax, ecx
    ! movzx edx, BYTE PTR [edi+eax]
    ! mov [esi+eax], dl
    ! test edx, edx
    ! jz lbl1                   ' exit on written terminator

    ! add eax, ecx
    ! movzx ebx, BYTE PTR [edi+eax]
    ! mov [esi+eax], bl
    ! test ebx, ebx
    ! jnz lbl0                  ' exit on written terminator

  lbl1:
    ! add eax, cloc             ' location
    ! mov FUNCTION, eax         ' return updated offset pointer

END FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Title: Re: ASM for FUN - #1 SUB
Post by: jj2007 on April 19, 2010, 10:07:49 AM
Quote from: hutch-- on April 19, 2010, 09:28:50 AM
While this is not much use to Frank, this is the method i use in basic to circumvent the concatenation slowdown inherant in the design of the language. I take your point that any language that uses an allocated string will get proresssively slower as the string gets longer, fundamental pascal strings with the descriptor at the front in a block of allocated memory.

Hutch,
Thanks. You demonstrate once more that hand-made assembler beats High-Level Language ;-)

Here is another little exercise: Destination needs itself as source. It works, but of course it cannot be very speedy for high loop counts - either because of reallocs, or because you would need to create a copy of the source before finishing the job.

include \masm32\MasmBasic\MasmBasic.inc

.code
start:
push Timer()
Let esi=" "
For_ n=1 To 20000
Let esi=Str$(n)+" "+esi+"A"
Next
void Timer()
pop edx
sub eax, edx
Inkey Str$("%i ms for ", eax), Left$(esi, 30), " ... ", Right$(esi, 10)
; P4: 1031 ms for 20000 19999 19998 19997 19996  ... AAAAAAAAAA
Exit
end start


Just out of curiosity: How many ticks does that one take in PB?
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 01:55:24 PM
OK guys, nice suggestions. :U
I'd like to progress just one step further.
The speed of output in a console is a different problem that I think was resolved in quite a good
way by Fielden in his original routine:

-----------------------------------------------------------------------------------------------------------------------
SUB GetScreen
    LOCAL lpReadRegion AS SMALL_RECT
    REGISTER x AS LONG
    LOCAL SIZE AS DWORD

    sBuf = ""
    SIZE = MAKDWD(80, 25)
    FOR x = 1 TO 4000
      sBuf = sBuf + MID$(MainStr, x, 1)+ " "
    NEXT x
    lpReadRegion.xRight = 79
    lpReadRegion.xBottom = 24
    WriteConsoleOutPut GetStdHandle(%STD_OUTPUT_HANDLE), _
    BYVAL STRPTR(sBuf),BYVAL SIZE, BYVAL 0&, lpReadRegion
END SUB
-----------------------------------------------------------------------------------------------------------------------


as you can see, the SUB is calling an API [WriteConsoleOutPut] passing it the whole
screen to be printed, and this is problably the fastest way of doing it, except for the trick of
printing it on an allocated but invisible console and after switching to it, as dedndave suggested, but
it is just for the impression it gives, not for the real speed. And it is consuming some extra CPU cycles as well
instead of sparing them [that is the first objective I have in mind].

I have focused just in the section of CODE that prepares the area of 8000 bytes[the entire 25*80 screen] for the windows
console, starting with a different format of 4000 bytes that was used in DOS CONSOLE.:

---------------------------------------------------------------------------------------------------------------------
    FOR x = 1 to 4000
      sBuf = sBuf + MID$(MainStr, x, 1)+ CHR$(32)
    NEXT x
---------------------------------------------------------------------------------------------------------------------


This part of the SUB looks like it can be improved a lot.

What Hutch suggests doing is what I realized in PB with POINTERS and PEEK/POKE instructions:

1) I allocated the space for the final string [8000 bytes], fortunately I already know how big it'll be.
2) I MOVed single bytes into it with POKE reading them with PEEK from the original string [4000 bytes]
putting a blank after each byte, or better preserving it skipping every other byte:


---------------------------------------------------------------------------------------------------------------------
    sBuf = SPACE$(8000)

    y = STRPTR(MainStr)
    y1 = STRPTR(sBuf)

    FOR x = 1 TO 8000 STEP 2
      POKE y1, PEEK(y)
      y = y + 1
      y1 = y1 + 2
    NEXT x
---------------------------------------------------------------------------------------------------------------------


But I was thinking about something a little bit faster:

Do you think it'll be faster if I read [MOV] four bytes from the original string into a 32 bits register, and after I MOV
the four single 8 bits registers to the addresses pointed by the pointer of the target string?

In my opinion doing it this way I only use 5 MOVs instead of 8 MOVs, and one of them is faster because it involves a 32 bit MOVs.  Instead of 8 slower 8 bits MOVs I use 4 slower 8 bits MOVs and a faster 32 bits MOV.
But this is only a supposition I'm doing, because of my limited knowledge of ASM.
Can you  please focus your answers about this simple point?
I'll ask more questions as soon as I get this point a little bit clearer. :8)

The SUB that jj2007 is proposing is something alike what I'm looking for, but it is for a general  string
building, a little bit different than the problem I'm facing with.  :bg
The same for the routine hutch is pointing at.

Thanks anyway for your patience and suggestions.

Frank



Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 02:11:18 PM
hiya Frank
one thing i would mention is, the "DOS" console mode window is not perfect
it has bugs in it - some of them are hard to work around - as far as i know, all version of windows have the same bugs   :bg
the bugs can make it difficult to make a "professional" program for console mode windows
these bugs manifest themselves when the user starts playing with the mouse - most notably, the copy/paste functions
the OS tends to lose its screen position if the user decides to paste into the console window
there is no way to disable or trap these that i know of, even though the docs say you can
it can make a mess out of your display page and, if you try to set the cursor position for output, it will not be where you think it should be
if you use direct writes to the screen buffer, you at least get the correct screen position
however, that does not eliminate the text that the user may have pasted into the window

some time ago, i was playing with this, trying to figure out a fix - kind of dropped it, though
this post has an attachment that can be used to demonstrate what i am talking about...
http://www.masm32.com/board/index.php?topic=11927.msg90799#msg90799
run that program, then start playing with copy/paste and see what the results are   :dazzled:
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 02:36:44 PM
Quote from: dedndave on April 19, 2010, 02:11:18 PM
hiya Frank
one thing i would mention is, the "DOS" console mode window is not perfect
it has bugs in it - some of them are hard to work around - as far as i know, all version of windows have the same bugs   :bg
the bugs can make it difficult to make a "professional" program for console mode windows
these bugs manifest themselves when the user starts playing with the mouse - most notably, the copy/paste functions
the OS tends to lose its screen position if the user decides to paste into the console window
there is no way to disable or trap these that i know of, even though the docs say you can
it can make a mess out of your display page and, if you try to set the cursor position for output, it will not be where you think it should be
if you use direct writes to the screen buffer, you at least get the correct screen position
however, that does not eliminate the text that the user may have pasted into the window

some time ago, i was playing with this, trying to figure out a fix - kind of dropped it, though
this post has an attachment that can be used to demonstrate what i am talking about...
http://www.masm32.com/board/index.php?topic=11927.msg90799#msg90799
run that program, then start playing with copy/paste and see what the results are   :dazzled:

Well, I heard about this problem and fortunately I'm not going to use the mouse for the next 3-5 years  :bg
I've chosen the windows console for learning some PB/ASM just to skip [for now], the GUI weight  ::)
By the way your demo was retrieving the simple keys, not the combination with SHIFT/CTRL/ALT?
This was my impression giving it a try, and maybe it could be useful for something else as well.
I mean I'm just studing ASM, and that is ASM!
Thanks anyway and if you have any idea about my last post please let me know.

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: jj2007 on April 19, 2010, 02:42:33 PM
Quote from: frktons on April 19, 2010, 01:55:24 PM
But I was thinking about something a little bit faster:

Do you think it'll be faster if I read [MOV] four bytes from the original string into a 32 bits register, and after I MOV
the four single 8 bits registers to the addresses pointed by the pointer of the target string?


Try this one:
include \masm32\include\masm32rt.inc
; frktons 19.4.2010

.data
src db "123456789012345678901234567890"
tgt_desired db "1 2 3 4 5 6 7 8 9"
tgt db "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

.code
start: mov ebx, 5
mov esi, offset src
mov edi, offset tgt
.Repeat
mov eax, [esi]
mov [edi], al
mov [edi+2], ah
bswap eax
mov [edi+6], al
mov [edi+4], ah
add esi, 4
add edi, 8
dec ebx
.Until Zero?
print offset tgt
exit
end start
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 03:01:04 PM
Quote from: jj2007 on April 19, 2010, 02:42:33 PM
Quote from: frktons on April 19, 2010, 01:55:24 PM
But I was thinking about something a little bit faster:

Do you think it'll be faster if I read [MOV] four bytes from the original string into a 32 bits register, and after I MOV
the four single 8 bits registers to the addresses pointed by the pointer of the target string?


Try this one:
include \masm32\include\masm32rt.inc
; frktons 19.4.2010

.data
src db "123456789012345678901234567890"
tgt_desired db "1 2 3 4 5 6 7 8 9"
tgt db "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

.code
start: mov ebx, 5
mov esi, offset src
mov edi, offset tgt
.Repeat
mov eax, [esi]
mov [edi], al
mov [edi+2], ah
bswap eax
mov [edi+6], al
mov [edi+4], ah
add esi, 4
add edi, 8
dec ebx
.Until Zero?
print offset tgt
exit
end start


Hi JJ. I compiled and run it, but I didn't see any print to tell me what happened, just a blank line and
the console prompt in the next. Did I forget something?
This is what the assembling says:


---------------------------------------------------------------------------------------------------------------------------------------
Microsoft (R) Macro Assembler Version 6.14.8444
Copyright (C) Microsoft Corp 1981-1997.  All rights reserved.

Assembling: C:\masm32\examples\string19042010a.asm
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

Il volume nell'unità C è WIN7-64
Numero di serie del volume: 24E7-E258

Directory di C:\masm32\examples

19/04/2010  16:50               438 string19042010a.asm
19/04/2010  16:54             2.560 string19042010a.exe
19/04/2010  16:54             1.011 string19042010a.obj
               3 File          4.009 byte
               0 Directory  39.452.909.568 byte disponibili
Premere un tasto per continuare . . .
-----------------------------------------------------------------------------------------------------------------------------------------


And this is what happens when I run it:


-----------------------------------------------------------------------------------------------------------------------------------------
Microsoft Windows [Versione 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. Tutti i diritti riservati.

Directory di C:\masm32\examples

19/04/2010  16:50             2.560 string19042010a.exe
               1 File          2.560 byte
               0 Directory  39.452.934.144 byte disponibili

C:\masm32\examples>string19042010a

C:\masm32\examples>
--------------------------------------------------------------------------------------------------------------------------------------


So something should be wrong. The MASM version? The Linker? or what?

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 03:14:54 PM
i don't see a null string terminator in Jochen's code, but that shouldn't stop it from displaying something - lol
let me assemble it......
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 03:23:21 PM
ok - i replaced the tabs with spaces (my personal preference) and added a null terminator

C:\masm32\asm>frktons
1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6
C:\masm32\asm>               

see attached...

my guess is, we all have the masm32 package installed
it has includes, libs, and macros that you may not have in place
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 03:37:11 PM
Quote from: dedndave on April 19, 2010, 03:23:21 PM
ok - i replaced the tabs with spaces (my personal preference) and added a null terminator

C:\masm32\asm>frktons
1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6
C:\masm32\asm>               

see attached...

my guess is, we all have the masm32 package installed
it has includes, libs, and macros that you may not have in place

So, missing the null terminator in the "tgt" string prevented the displaying of the string?
As for the "tabs" when I copied the source from the FORUM I didn't see any of them anymore:

include \masm32\include\masm32rt.inc
; frktons 19.4.2010

.data
src db "123456789012345678901234567890"
tgt_desired db "1 2 3 4 5 6 7 8 9"
tgt db "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

.code
start: mov ebx, 5
mov esi, offset src
mov edi, offset tgt
.Repeat
mov eax, [esi]
mov [edi], al
mov [edi+2], ah
bswap eax
mov [edi+6], al
mov [edi+4], ah
add esi, 4
add edi, 8
dec ebx
.Until Zero?
print offset tgt
exit
end start


And the package I have installed I guess is fine, as I assembled your version and it works fine.  :8)

So now we have a good starting point thanks to JJ and Dave for the combined effort.
Would you mind if we use it to move one step further?

I'll tell you what I can understand and you could tell me how far I am from the "truth"  :dazzled:

Here we have the first chunk of CODE:

src db "123456789012345678901234567890"
tgt_desired db "1 2 3 4 5 6 7 8 9"
tgt db "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",0

As far as I know we are declaring the source and the target strings [that was easy, wasn't it?]  :bg

And the next chunk says:

.code
start: mov ebx, 5
mov esi, offset src
mov edi, offset tgt

that should be: we are preparing a loop from 5 to zero using the 32 bit register ebx, than
we move the starting points of the source and target strings to 32 bit registers esi and edi.
first question: esi and edi are just retaining the pointers or the first 4 bytes of our strings?
I mean the offset src and offset tgt are moving pointers to registers or dword data of the strings from that point?

Enough for this post. ::)

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 03:39:08 PM
that's because the editor expands the tabs to spaces
i have always hated tabbed text - lol - i like to see what is actually there

include \masm32\include\masm32rt.inc
; frktons 19.4.2010

.data
src db "123456789012345678901234567890"
tgt_desired db "1 2 3 4 5 6 7 8 9"
tgt db "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",0

.code
start:  mov ebx, 5
        mov esi, offset src
        mov edi, offset tgt
        .Repeat
                mov eax, [esi]
                mov [edi], al
                mov [edi+2], ah
                bswap eax
                mov [edi+6], al
                mov [edi+4], ah
                add esi, 4
                add edi, 8
                dec ebx
        .Until Zero?
        print offset tgt
exit
end start

let me try it without the null....   :bg

btw - i use a program called "Swiss Army Knife" to de-tabify the file
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 03:44:10 PM
i got the same result with no null terminator
Jochen is relying on the fact that the remainder of the segment is filled with 0's
in many cases, it will continue to display garbage until a 0 is found - lol
although, writing to the console has strange behaviour
for example, if you try to write more than about 56 Kb to the screen using WriteFile, nothing happens
i suspected that was what you might be seeing, with no terminator
Title: Re: ASM for FUN - #1 SUB
Post by: jj2007 on April 19, 2010, 04:02:18 PM
Quote from: dedndave on April 19, 2010, 03:44:10 PM
Jochen is relying on the fact that the remainder of the segment is filled with 0's

No, I just forgot the null terminator :red
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 04:05:08 PM
shhhh - i wasn't gonna say anything   :P
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 04:06:40 PM
Quote from: dedndave on April 19, 2010, 04:05:08 PM
shhhh - i wasn't gonna say anything   :P

OOPs, I was writing while you were answering, please see my previous post.  :bg

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 04:11:45 PM
you are right about ESI and EDI (source index and destination index - the E means extended to 32 bits)

the loop goes from 5 to 1, actually - when EBX reaches 0, the loop exits

if we take out the Repeat/Until syntax, the assembler probably generates something like this:

start:  mov ebx, 5
        mov esi, offset src
        mov edi, offset tgt

loop_start:
                mov eax, [esi]
                mov [edi], al
                mov [edi+2], ah
                bswap eax
                mov [edi+6], al
                mov [edi+4], ah
                add esi, 4
                add edi, 8
                dec ebx
                jnz loop_start

        print offset tgt
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 04:23:52 PM
Quote from: dedndave on April 19, 2010, 04:11:45 PM
you are right about ESI and EDI (source index and destination index - the E means extended to 32 bits)

the loop goes from 5 to 1, actually - when EBX reaches 0, the loop exits

if we take out the Repeat/Until syntax, the assembler probably generates something like this:

start:  mov ebx, 5
        mov esi, offset src
        mov edi, offset tgt

loop_start:
                mov eax, [esi]
                mov [edi], al
                mov [edi+2], ah
                bswap eax
                mov [edi+6], al
                mov [edi+4], ah
                add esi, 4
                add edi, 8
                dec ebx
                jnz loop_start

        print offset tgt


Well, from the output it looks like it is doing only four cycles of four bytes [16 total], and
the final "x" is not appearing in the target string, I'm trying to figure out why.

So ESI and EDI (source index and destination index) means we have just the pointers to the strings in the registers, haven't we?


Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 04:31:57 PM
OK. I expanded the target string to 40 bytes so the 5 cycles of 4 bytes could be
used, and it works fine. I figured out why it was displaying only 30 bytes, it was because
the target string was just 30 bytes.  :8)
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 04:34:44 PM
that loop is probably pretty fast, too   :U

bswap isn't super-fast, but probably better than the alternatives
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 04:42:24 PM
Very well, here we have the next chunk:

loop_start:
                mov eax, [esi]
                mov [edi], al
                mov [edi+2], ah


We MOV the four bytes of the source string [pointed by the register esi] to eax.
We MOV the low byte [al] of ax to the address pointed by edi, to the target string single byte
We MOV the high byte [ah] of ax to the address pointed by edi + 2, to the target string single byte

Did I guess right?
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 05:03:37 PM
you got it
he grabs 4 bytes with MOV EAX,[ESI]
then, places the lower 2 bytes (AL and AH)
after that, we can't access the upper word of EAX by individual bytes
so, he uses BSWAP to get the upper word of EAX into AX
BSWAP reverses the order of all 4 bytes, so AL and AH appear to be backwards, now
after he has placed all 4 bytes, he updates the index pointers and decrements the loop counter
i might have used ECX for the loop count, as it is the "traditional" count register and EBX wants to be preserved across system callback functions
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 05:07:52 PM
And the last chunk:

bswap eax
mov [edi+6], al
mov [edi+4], ah
add esi, 4
add edi, 8
dec ebx
.Until Zero?
print offset tgt
exit
end start


We bswap the 16 high bits of eax with the 16 low ones.
It looks like it produces the inversion of al with ah, so
we have to move al to [edi+6] and
ah to [edi+4].
After that, as we have moved all the four bytes read from the source string, we move the pointers
to the next four bytes of the source string and the next 8 bytes of the target one.
We decrement our counter by one [dec ebx] and if the counter is not zero we loop.
If it is zero we print the target string [print offset tgt].
We probably use offset tgt because print needs a pointer to the string to print, and of course
as we don't pass anything else print stops when it finds the null terminator.

Did I get it?
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 05:10:40 PM
i updated my previous post
we'll make an ASM programmer out of you, yet   :bg
after a while, you will forget PB and think it refers to something you spread on bread with jelly
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 05:18:16 PM
Quote from: dedndave on April 19, 2010, 05:10:40 PM
i updated my previous post
we'll make an ASM programmer out of you, yet   :bg
after a while, you will forget PB and think it refers to something you spread on bread with jelly

I'm honored.  :P
Well I think this first ASM lesson was a very nice one, thanks to you and JJ for proposing the just right stuff
to learn.
I'll move a little bit forward as I return back to the forum, I hope in few hours or so.

Thanks for your teachings.

P.S. we are not finished yet, we have one more step to achieve.  :bg
Title: Re: ASM for FUN - #1 SUB
Post by: jj2007 on April 19, 2010, 07:39:19 PM
Quote from: dedndave on April 19, 2010, 04:34:44 PM
bswap isn't super-fast, but probably better than the alternatives

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
102     cycles for bswap A
111     cycles for bswap B
111     cycles for shr eax, 8 A
123     cycles for shr eax, 8 B
114     cycles for lodsd+bswap


Try your luck :bg

Unfortunately, I cannot test pshufb on my SSE2 PC...
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 07:43:35 PM
And the last step

We have a functional assembly routine that does what we need, and it should be fast.
What is still missing is to customize it and to translate it in PB INLINE ASSEMBLY in order to
intermix with the existing code and see how it outperform the PEEK/POKE version I managed to CODE.

Well I have to confess that before leaving PB to his destiny [ :8) ], I'd like to use it for learning some ASM, and
I think it is a real good environment, as hutch stated elsewhere in this forum.

So here we have the SUB to customize and intermix with some ASM:

---------------------------------------------------------------------------------------------------------------------------------
SUB GetScreen

    LOCAL lpReadRegion AS SMALL_RECT
    LOCAL SIZE AS DWORD

    REGISTER x AS LONG
    DIM y AS STRING PTR, y1 AS STRING PTR
    sBuf = SPACE$(8000)
    SIZE = MAKDWD(80, 25)

    y = STRPTR(MainStr)
    y1 = STRPTR(sBuf)


    FOR x = 1 TO 8000 STEP 2
      POKE y1, PEEK(y)
      y = y + 1
      y1 = y1 + 2
    NEXT x
    lpReadRegion.xRight = 79
    lpReadRegion.xBottom = 24
    WriteConsoleOutPut GetStdHandle(%STD_OUTPUT_HANDLE), _
    BYVAL STRPTR(sBuf),BYVAL SIZE, BYVAL 0&, lpReadRegion
END SUB 
---------------------------------------------------------------------------------------------------------------------------------


Because I'm not at all well versed in INLINE ASM or any other ASM whatsoever, I need your help
to correct my funny translation of the routine we just discussed before.

This is a first attempt to translate it and intermix it with the BASIC CODE above:


---------------------------------------------------------------------------------------------------------------------------------
SUB GetScreen
   
    #REGISTER NONE : ' we use the registers in our own way               
    LOCAL lpReadRegion AS SMALL_RECT
    LOCAL SIZE AS DWORD

    SIZE = MAKDWD(80, 25)

'    REGISTER x AS LONG  => we don't need it anymore
'--------------------------------------------------------------------------------------------------------------------------------------------
' The following group of instructions could be inside the ASM code as well, but for now let them stay in BASIC side
'--------------------------------------------------------------------------------------------------------------------------------------------
    DIM y AS STRING PTR, y1 AS STRING PTR: we declare 2 pointers to the source and target strings
    sBuf = SPACE$(8000): ' we prepare the area of the target string, filling it with spaces. This instruction too
                                       ' could be improved with ASM, I guess


    y = STRPTR(MainStr): ' MainStr is a GLOBAL STRING and filled elsewhere with the content of a file
                                       ' and we put the starting address into the pointer y
    y1 = STRPTR(sBuf):   ' we put the starting address of the target string into the pointer y1

'-----------------------------------------------------------------------------------------------------------------------------------
' This is BASICally the routine we translate in ASM
'-----------------------------------------------------------------------------------------------------------------------------------
'    FOR x = 1 TO 8000 STEP 2
'      POKE y1, PEEK(y)
'      y = y + 1
'      y1 = y1 + 2
'    NEXT x
'-----------------------------------------------------------------------------------------------------------------------------------
' and this will be the ASM version for PB when somebody'll correct it
'-----------------------------------------------------------------------------------------------------------------------------------

; These PUSHing of the registers we use could be necessary, but at the moment I don't know
;  push ecx
;  push esi
;  push edi

!  MOV  ecx, 1000; we have 1000 cycles to do, and I use ECX instead of EBX as Dave suggested.
!  mov esi, y
!  mov edi, y1
!  loop_start:
!  mov eax, [esi]
!  mov [edi], al
!  mov [edi+2], ah
!  bswap eax
!  mov [edi+6], al
!  mov [edi+4], ah
!  add esi, 4
!  add edi, 8
!  dec ecx
!  jnz loop_start

; These POPPing of the registers we used could be necessary, but at the moment I don't know
;  pop edi
;  pop esi
;  pop ecx


    lpReadRegion.xRight = 79
    lpReadRegion.xBottom = 24
    WriteConsoleOutPut GetStdHandle(%STD_OUTPUT_HANDLE), _
    BYVAL STRPTR(sBuf),BYVAL SIZE, BYVAL 0&, lpReadRegion
END SUB 
---------------------------------------------------------------------------------------------------------------------------------


When somebody who have used PB INLINE ASSEMBLY has got the time to correct the CODE,
I'll be glad to test it and report the performance difference.  :8)

Thanks for your time.

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 07:48:11 PM
yah - that's what i was saying - the alternatives aren't any better
another approach might be to read words instead of dwords - that doesn't sound very promising   :P
although, with the code you are using, it may help to insure the source string is 4-aligned
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 07:50:08 PM
Frank - Hutch is probably the most familiar with PB in here
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 07:52:11 PM
Quote from: jj2007 on April 19, 2010, 07:39:19 PM
Quote from: dedndave on April 19, 2010, 04:34:44 PM
bswap isn't super-fast, but probably better than the alternatives

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
102     cycles for bswap A
111     cycles for bswap B
111     cycles for shr eax, 8 A
123     cycles for shr eax, 8 B
114     cycles for lodsd+bswap


Try your luck :bg

Unfortunately, I cannot test pshufb on my SSE2 PC...

In my PC the results are different:

Microsoft Windows [Versione 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. Tutti i diritti riservati.

C:\masm32\examples\strings>fillstringfrkttons
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
156     cycles for bswap A
119     cycles for bswap B
107     cycles for shr eax, 8 A
133     cycles for shr eax, 8 B
89      cycles for lodsd+bswap

1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x
1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x

C:\masm32\examples\strings>



It looks like this version is the fastest:

89      cycles for lodsd+bswap

I'll look at the code as I got time, but it will require another ASM lesson  :dazzled:

Well Dave, the string is 4 bytes aligned, so there should be no problem.
If Hutch will be so kind to join us in this path, the code warrior path, he will spare me some headache  :8)
Title: Re: ASM for FUN - #1 SUB
Post by: jj2007 on April 19, 2010, 07:56:39 PM
Quote from: dedndave on April 19, 2010, 07:48:11 PM
another approach might be to read words instead of dwords - that doesn't sound very promising   :P

Even unrolled, it is slower than bswap...

101     cycles for bswap A
110     cycles for bswap B
114     cycles for word read

mov ecx, sizeof src/4
mov esi, offset src
mov edi, offset tgt
.Repeat
movzx eax, word ptr [esi]
mov [edi], al
mov [edi+2], ah
movzx eax, word ptr [esi+2]
mov [edi+4], al
mov [edi+2+4], ah
add esi, 4
add edi, 8
dec ecx
.Until Zero?
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 19, 2010, 08:38:50 PM
In order to test the routine we discussed, we just need 3 things:

1) allocate a source string of 4000 bytes filled with "A" or any ASCII character
2) allocate a target string of 8000 bytes filled with "*" or something different than "A"
3) print the first 20 bytes of the target string after the loop just to be sure it worked.

The loop will start with 1000 going down to 1.

You surely know how to change the code we used before, so if you'd like to carry on the experiment, please
let me know how to do the 3 things above.

I can test the PB routine and the ASM one and see what the difference is, in MASM32 CODE.
When we get the help from hutch or somebody who knows PB, we can do a new test.

Just to have an idea of the PEEK/POKE gain versus the MID$, to convert 1000 strings of 4000 bytes I got:

MID$ version ========> 12,665,435,382 CPU cycles
PEEK/POKE version ===>       53,355,308 CPU cycles

the second version is about 240 times faster than the previous one. I think with MASM we can get to half the cycles. 

Cheers

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 19, 2010, 11:10:14 PM
we should use the same timing method as you used on the others for comparison
so, i did not put the timing code in that we normally use
let us know how it compares
Title: Re: ASM for FUN - #1 SUB
Post by: jj2007 on April 20, 2010, 12:08:03 AM
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
5596    cycles for bswap 4k
98      cycles for bswap A      23 bytes
105     cycles for bswap B      24 bytes
109     cycles for word read    27 bytes
105     cycles for shr eax, A   31 bytes
117     cycles for shr eax, B   37 bytes
108     cycles for lodsd+bsw A  20 bytes
117     cycles for lodsd+bsw B  24 bytes
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 20, 2010, 01:10:27 AM
prescottIntel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
8115    cycles for bswap 4k
177     cycles for bswap A      23 bytes
179     cycles for bswap B      24 bytes
177     cycles for word read    27 bytes
175     cycles for shr eax, A   31 bytes
205     cycles for shr eax, B   37 bytes
273     cycles for lodsd+bsw A  20 bytes
475     cycles for lodsd+bsw B  24 bytes


QuoteMID$ version ========> 12,665,435,382 CPU cycles
PEEK/POKE version ===>       53,355,308 CPU cycles

the second version is about 240 times faster than the previous one. I think with MASM we can get to half the cycles.

i think Frank is going to see why we like ASM   :bg
how about a 10,000 to 1 improvement
you can almost paint a house in 53,000,000 cycles
Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 20, 2010, 01:25:49 AM
i had just rebuilt my C drive and forgot to apply the throttle registry fix
here are the correct prescott numbers...Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
7381    cycles for bswap 4k
173     cycles for bswap A      23 bytes
175     cycles for bswap B      24 bytes
173     cycles for word read    27 bytes
171     cycles for shr eax, A   31 bytes
177     cycles for shr eax, B   37 bytes
270     cycles for lodsd+bsw A  20 bytes
466     cycles for lodsd+bsw B  24 bytes
Title: Re: ASM for FUN - #1 SUB
Post by: hutch-- on April 20, 2010, 02:57:30 AM
Frank,

This much you will find with the later versions of PB that have the "#align ##" directive, the code you write in assembler is as fast as the code you write in MASM. Where you will find differences is in the stack overhead as PB comforms to the basic specification of zeroing locals and setting strings to a null string. With a procedure of any size it does not matter but with very short procedures like you use in MASM with no stack frame, you have the choice of simply inlining the assembler code and having no stack overhead at all which is an optimal solution.

With a memory buffer for console output, you don't have to limit yourself to matching the screen buffer size, memory in Windows is not all that finely granulated so you could safely allocate 64k for a safe sized buffer with no loss which gives you a heap of headroom when experimenting.

PB is a good vehicle for gradually learning assembler programming as you can with a bit of caution mix HLL and ASM code. Just make sure you use the #REGISTER NONE directive which turns off register based optimisation otherwise you register usage will do things that you did not write.
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 20, 2010, 05:47:53 AM
The sun is rising, and I'm awakening  :eek

So here we have quite an impressive routine from JJ, it'll take a while to get all of it  ::)

By the way, I have used it and I get this output:

----------------------------------------------------------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
5521    cycles for bswap 4k
115     cycles for bswap A      23 bytes
110     cycles for bswap B      24 bytes
113     cycles for word read    27 bytes
108     cycles for shr eax, A   31 bytes
126     cycles for shr eax, B   37 bytes
111     cycles for lodsd+bsw A  20 bytes
127     cycles for lodsd+bsw B  24 bytes

1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x
1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x
----------------------------------------------------------------------------------------------------------------------------


At a first look the "5521    cycles for bswap 4k" line of output is referring to the conversion of a
4000 bytes string into a 8000 bytes one for 1 million times?  :dazzled:

Well I hope I'll have time enough during the day to have a closer look at JJ's routine to grasp something
more:
- the use of counters, timings, and so on.

Of course at the moment I don't know if the way PB counts cycles with TIX instruction is the same of what
this MASM counter does, but anyway it is a massive improvement, much more than expected, considering
I was using POINTERS and bytes moving with PEEK/POKE, quite similar to what we are doing in MASM
with this routine. So I expected the assembly generated from PB compiler would be closer to the hand-crafted routine,
but it looks like I was wrong  :P

Going to work

Have a nice day or evening or night, depending on where you live  :boohoo:

Frank




Title: Re: ASM for FUN - #1 SUB
Post by: dedndave on April 20, 2010, 05:53:00 AM
Jochen's code is hard to read for n00bs like me   :P
he uses a lot of if/while structures that make it harder for me to understand
i prefer straight ASM - and indentation also makes it hard
i am an old dog and it's hard to learn new trix
Title: Re: ASM for FUN - #1 SUB
Post by: jj2007 on April 20, 2010, 06:50:45 AM
Quote from: frktons on April 20, 2010, 05:47:53 AM
At a first look the "5521    cycles for bswap 4k" line of output is referring to the conversion of a 4000 bytes string into a 8000 bytes one for 1 million times?  :dazzled:

hi Frank,
The 5500 cycles are per single conversion of 4000 bytes. I have no idea what PB measures, it might be true cycles or QPC ticks.
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 20, 2010, 07:35:05 AM
Quote from: jj2007 on April 20, 2010, 06:50:45 AM
Quote from: frktons on April 20, 2010, 05:47:53 AM
At a first look the "5521    cycles for bswap 4k" line of output is referring to the conversion of a 4000 bytes string into a 8000 bytes one for 1 million times?  :dazzled:

hi Frank,
The 5500 cycles are per single conversion of 4000 bytes. I have no idea what PB measures, it might be true cycles or QPC ticks.


OOOHHH well. This is more reasonable. So for 1000 loop of 4000 bytes string conversion it'd take some
5,521,000 cycles that is a ten fold  improvement. Quite impressive the same, comparing to original MID$
version it is a 2,400 fold gain.
As soon as I'll be able to test it inside the PB SUB, we'll have the real numbers.

At the moment we have these temporary results:

MID$ version ========> 12,665,435,382 CPU cycles
PEEK/POKE version  ====>       53,355,308 CPU cycles
MASM version  =======>          5,521,000 CPU cycles

The timers.asm is already in the macros folder, no need to grab it.  :8)

Enjoy and thanks for your help

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: MichaelW on April 20, 2010, 08:27:34 AM
http://www.powerbasic.com/support/help/pbcc/tix_statement.htm
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 20, 2010, 08:39:29 AM
Quote from: dedndave on April 20, 2010, 05:53:00 AM
Jochen's code is hard to read for n00bs like me   :P
he uses a lot of if/while structures that make it harder for me to understand
i prefer straight ASM - and indentation also makes it hard
i am an old dog and it's hard to learn new trix

I'm used to both of them, so for me it's easy enough to let me pay attention to things that I don't know yet.  :lol

Quote from: MichaelW on April 20, 2010, 08:27:34 AM
http://www.powerbasic.com/support/help/pbcc/tix_statement.htm

Hi Michael, thanks for the link.  :U
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 29, 2010, 01:53:49 PM
A final touch to the PEEK/POKE version, again
choosing REGISTER variables in a more appropriate way,
and the performance got a 40% improvement.

Old results:

At the moment we have these temporary results:

MID$ version ========> 12,665,435,382 CPU cycles
PEEK/POKE version  ====>       53,355,308 CPU cycles
MASM version  =======>          5,521,000 CPU cycles


New ones:

Using REGISTER variables in a better way:

MID$ version ========> 12,665,435,382 CPU cycles
PEEK/POKE version  ====>       31,865,067 CPU cycles
MASM version  =======>          5,521,000 CPU cycles


This means that just a bit of ASM in your mind gives you many
opportunities to write better CODE.  :P

Frank
Title: Re: ASM for FUN - #1 SUB
Post by: hutch-- on April 29, 2010, 02:32:40 PM
Its usually a good idea to enable register variables if you are not specifically writing asm code as you pick up the performance gain at no cost by doing so. From memory the internal compiler design uses ESI and EDI to perform variable replacements and within boundaries you get good gains in performance by using the default settings.
Title: Re: ASM for FUN - #1 SUB
Post by: frktons on April 29, 2010, 11:56:27 PM
Quote from: hutch-- on April 29, 2010, 02:32:40 PM
Its usually a good idea to enable register variables if you are not specifically writing asm code as you pick up the performance gain at no cost by doing so. From memory the internal compiler design uses ESI and EDI to perform variable replacements and within boundaries you get good gains in performance by using the default settings.

Yes Hutch. Using REGISTERs is a good way of speeding up performance in pure PB
CODE, and, in my experience, it is better to choose the ones that are probably used
more intensively. I was lucky, somehow, picking up the right variables that I imagined
were used a lot, and in two of the SUBs I could get an improvement from 30 to 50%.