ASM for FUN - #1 SUB

Started by frktons, April 18, 2010, 08:52:38 PM

dedndave

i got the same result with no null terminator
Jochen is relying on the fact that the remainder of the segment is filled with 0's
in many cases, it will continue to display garbage until a 0 is found - lol
although, writing to the console has strange behaviour
for example, if you try to write more than about 56 Kb to the screen using WriteFile, nothing happens
i suspected that was what you might be seeing, with no terminator
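
The role of the terminator can be sketched in C. This is only a rough model, not the MASM32 print macro itself: `bounded_length` and its `cap` parameter are hypothetical names. A routine that trusts the terminator has no such cap and keeps reading until it happens to meet a 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count bytes up to the first 0, but never look
   past cap bytes. A print routine that relies on the terminator alone
   has no cap and would keep scanning whatever follows the string. */
size_t bounded_length(const char *buf, size_t cap)
{
    assert(buf != NULL);
    size_t n = 0;
    while (n < cap && buf[n] != '\0')
        n++;
    return n;
}
```

With a terminator inside the buffer the scan stops there; without one it stops only because of the cap.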

jj2007

Quote from: dedndave on April 19, 2010, 03:44:10 PM
Jochen is relying on the fact that the remainder of the segment is filled with 0's

No, I just forgot the null terminator :red

dedndave

shhhh - i wasn't gonna say anything   :P

frktons

Quote from: dedndave on April 19, 2010, 04:05:08 PM
shhhh - i wasn't gonna say anything   :P

Oops, I was writing while you were answering, please see my previous post.  :bg

Frank
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

you are right about ESI and EDI (source index and destination index - the E means extended to 32 bits)

the loop goes from 5 to 1, actually - when EBX reaches 0, the loop exits

if we take out the Repeat/Until syntax, the assembler probably generates something like this:

start:  mov ebx, 5
        mov esi, offset src
        mov edi, offset tgt

loop_start:
                mov eax, [esi]
                mov [edi], al
                mov [edi+2], ah
                bswap eax
                mov [edi+6], al
                mov [edi+4], ah
                add esi, 4
                add edi, 8
                dec ebx
                jnz loop_start

        print offset tgt
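
For readers more at home in a high-level language, the same loop can be modelled in C. This is a sketch only: `spread_bytes` is a made-up name, and the comments map each statement to the instruction it stands for. The do/while form mirrors dec/jnz, where the body runs once before the counter is tested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the loop: each pass takes four source bytes
   and writes them to every second byte of the target. */
void spread_bytes(const uint8_t *src, uint8_t *tgt, uint32_t passes)
{
    assert(passes > 0);      /* like the asm, the body runs before the test */
    uint32_t ebx = passes;   /* loop counter, counts down to 0 */
    do {
        tgt[0] = src[0];     /* mov [edi],   al */
        tgt[2] = src[1];     /* mov [edi+2], ah */
        tgt[4] = src[2];     /* after bswap: mov [edi+4], ah */
        tgt[6] = src[3];     /* after bswap: mov [edi+6], al */
        src += 4;            /* add esi, 4 */
        tgt += 8;            /* add edi, 8 */
    } while (--ebx != 0);    /* dec ebx / jnz loop_start */
}
```

The bytes at odd target offsets are never written, so they keep whatever the target held before.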

frktons

Quote from: dedndave on April 19, 2010, 04:11:45 PM
you are right about ESI and EDI (source index and destination index - the E means extended to 32 bits)

the loop goes from 5 to 1, actually - when EBX reaches 0, the loop exits

if we take out the Repeat/Until syntax, the assembler probably generates something like this:

start:  mov ebx, 5
        mov esi, offset src
        mov edi, offset tgt

loop_start:
                mov eax, [esi]
                mov [edi], al
                mov [edi+2], ah
                bswap eax
                mov [edi+6], al
                mov [edi+4], ah
                add esi, 4
                add edi, 8
                dec ebx
                jnz loop_start

        print offset tgt


Well, from the output it looks like it is doing only four cycles of four bytes [16 total], and
the final "x" is not appearing in the target string, I'm trying to figure out why.

So ESI and EDI (source index and destination index) just hold the pointers to the strings, don't they?



frktons

OK. I expanded the target string to 40 bytes so the 5 cycles of 4 bytes could be
used, and it works fine. I figured out why it was displaying only 30 bytes, it was because
the target string was just 30 bytes.  :8)
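
The arithmetic behind the 40 bytes can be written down as a small C sketch. `min_target_size` is a hypothetical name, and counting one byte for the null terminator inside the buffer is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Each pass writes at offsets 0, 2, 4 and 6 and then advances the target
   pointer by 8, so n passes touch offsets up to 8*(n-1)+6. One more byte
   is assumed for the null terminator that print needs. */
size_t min_target_size(size_t passes)
{
    assert(passes > 0);
    size_t last_written = 8 * (passes - 1) + 6;
    return last_written + 1   /* bytes 0..last_written */
                        + 1;  /* terminator (assumption) */
}
```

For 5 passes this gives 8 * 4 + 6 + 2 = 40 bytes.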

dedndave

that loop is probably pretty fast, too   :U

bswap isn't super-fast, but probably better than the alternatives

frktons

Very well, here we have the next chunk:

loop_start:
                mov eax, [esi]
                mov [edi], al
                mov [edi+2], ah


We MOV four bytes of the source string [pointed to by the register esi] into eax.
We MOV the low byte of ax [al] to the target string byte at the address in edi.
We MOV the high byte of ax [ah] to the target string byte at the address edi + 2.

Did I guess right?

dedndave

you got it
he grabs 4 bytes with MOV EAX,[ESI]
then, places the lower 2 bytes (AL and AH)
after that, we can't access the upper word of EAX by individual bytes
so, he uses BSWAP to get the upper word of EAX into AX
BSWAP reverses the order of all 4 bytes, so AL and AH appear to be backwards, now
after he has placed all 4 bytes, he updates the index pointers and decrements the loop counter
i might have used ECX for the loop count, as it is the "traditional" count register, and EBX has to be preserved in callback functions that windows calls (along with ESI, EDI and EBP)
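
The register view of this can be modelled in C. It is a sketch with illustrative names (`load_le32`, `bswap32`, `reg_al`, `reg_ah`), assuming a little-endian load as on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian load of four bytes, as MOV EAX,[ESI] does on x86. */
uint32_t load_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Reverse the order of all four bytes, as BSWAP does. */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* AL is bits 0..7 of EAX, AH is bits 8..15. */
uint8_t reg_al(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }
uint8_t reg_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); }
```

Loading "abcd" gives AL = 'a' and AH = 'b'. After the swap, AL = 'd' and AH = 'c', which is why AH goes to [edi+4] and AL to [edi+6].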

frktons

And the last chunk:

bswap eax
mov [edi+6], al
mov [edi+4], ah
add esi, 4
add edi, 8
dec ebx
.Until Zero?
print offset tgt
exit
end start


We bswap eax: this reverses the order of all four bytes, so the 16 high bits end up in ax,
but with their two bytes swapped. That is why
we have to move al to [edi+6] and
ah to [edi+4].
After that, as we have moved all the four bytes read from the source string, we move the pointers
to the next four bytes of the source string and the next 8 bytes of the target one.
We decrement our counter by one [dec ebx] and if the counter is not zero we loop.
If it is zero we print the target string [print offset tgt].
We probably use offset tgt because print needs a pointer to the string to print, and of course
as we don't pass anything else print stops when it finds the null terminator.

Did I get it?

dedndave

i updated my previous post
we'll make an ASM programmer out of you, yet   :bg
after a while, you will forget PB and think it refers to something you spread on bread with jelly

frktons

Quote from: dedndave on April 19, 2010, 05:10:40 PM
i updated my previous post
we'll make an ASM programmer out of you, yet   :bg
after a while, you will forget PB and think it refers to something you spread on bread with jelly

I'm honored.  :P
Well, I think this first ASM lesson was a very nice one, thanks to you and JJ for proposing just the right stuff
to learn.
I'll move a bit further when I get back to the forum, I hope in a few hours or so.

Thanks for your teachings.

P.S. we are not finished yet, we have one more step to achieve.  :bg

jj2007

Quote from: dedndave on April 19, 2010, 04:34:44 PM
bswap isn't super-fast, but probably better than the alternatives

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
102     cycles for bswap A
111     cycles for bswap B
111     cycles for shr eax, 8 A
123     cycles for shr eax, 8 B
114     cycles for lodsd+bswap


Try your luck :bg

Unfortunately, I cannot test pshufb on my SSE2 PC (pshufb needs SSSE3)...

frktons

And the last step

We have a functional assembly routine that does what we need, and it should be fast.
What is still missing is to customize it and translate it into PB INLINE ASSEMBLY, in order to
intermix it with the existing code and see how much it outperforms the PEEK/POKE version I managed to CODE.

Well, I have to confess that before leaving PB to its destiny [ :8) ], I'd like to use it for learning some ASM, and
I think it is a real good environment, as hutch stated elsewhere in this forum.

So here we have the SUB to customize and intermix with some ASM:

---------------------------------------------------------------------------------------------------------------------------------
SUB GetScreen

    LOCAL lpReadRegion AS SMALL_RECT
    LOCAL SIZE AS DWORD

    REGISTER x AS LONG
    DIM y AS STRING PTR, y1 AS STRING PTR
    sBuf = SPACE$(8000)
    SIZE = MAKDWD(80, 25)

    y = STRPTR(MainStr)
    y1 = STRPTR(sBuf)


    FOR x = 1 TO 8000 STEP 2
      POKE y1, PEEK(y)
      y = y + 1
      y1 = y1 + 2
    NEXT x
    lpReadRegion.xRight = 79
    lpReadRegion.xBottom = 24
    WriteConsoleOutPut GetStdHandle(%STD_OUTPUT_HANDLE), _
    BYVAL STRPTR(sBuf),BYVAL SIZE, BYVAL 0&, lpReadRegion
END SUB 
---------------------------------------------------------------------------------------------------------------------------------
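
As a cross-check of what the FOR loop does, here is a rough C model. `peek_poke_copy` and its `tgt_len` parameter are hypothetical. With `tgt_len` = 8000 it makes 4000 passes, the same count as FOR x = 1 TO 8000 STEP 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the PEEK/POKE loop: one source byte per pass,
   written to every second byte of the target. */
void peek_poke_copy(const uint8_t *src, uint8_t *tgt, size_t tgt_len)
{
    assert(src != NULL && tgt != NULL);
    for (size_t x = 0; x < tgt_len; x += 2) {
        *tgt = *src;   /* POKE y1, PEEK(y) */
        src += 1;      /* y  = y  + 1      */
        tgt += 2;      /* y1 = y1 + 2      */
    }
}
```

The skipped target bytes keep the spaces that SPACE$ put there.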


Because I'm not at all well versed in INLINE ASM or any other ASM whatsoever, I need your help
to correct my funny translation of the routine we just discussed before.

This is a first attempt to translate it and intermix it with the BASIC CODE above:


---------------------------------------------------------------------------------------------------------------------------------
SUB GetScreen
   
    #REGISTER NONE : ' we use the registers in our own way               
    LOCAL lpReadRegion AS SMALL_RECT
    LOCAL SIZE AS DWORD

    SIZE = MAKDWD(80, 25)

'    REGISTER x AS LONG  => we don't need it anymore
'--------------------------------------------------------------------------------------------------------------------------------------------
' The following group of instructions could be inside the ASM code as well, but for now let them stay in BASIC side
'--------------------------------------------------------------------------------------------------------------------------------------------
    DIM y AS STRING PTR, y1 AS STRING PTR: ' we declare 2 pointers to the source and target strings
    sBuf = SPACE$(8000): ' we prepare the area of the target string, filling it with spaces. This instruction too
                                       ' could be improved with ASM, I guess


    y = STRPTR(MainStr): ' MainStr is a GLOBAL STRING and filled elsewhere with the content of a file
                                       ' and we put the starting address into the pointer y
    y1 = STRPTR(sBuf):   ' we put the starting address of the target string into the pointer y1

'-----------------------------------------------------------------------------------------------------------------------------------
' This is BASICally the routine we translate in ASM
'-----------------------------------------------------------------------------------------------------------------------------------
'    FOR x = 1 TO 8000 STEP 2
'      POKE y1, PEEK(y)
'      y = y + 1
'      y1 = y1 + 2
'    NEXT x
'-----------------------------------------------------------------------------------------------------------------------------------
' and this will be the ASM version for PB once somebody corrects it
'-----------------------------------------------------------------------------------------------------------------------------------

' This PUSHing of the registers we use could be necessary, but at the moment I don't know
'  push ecx
'  push esi
'  push edi

!  MOV  ecx, 1000; we have 1000 cycles to do, and I use ECX instead of EBX as Dave suggested.
!  mov esi, y
!  mov edi, y1
!  loop_start:
!  mov eax, [esi]
!  mov [edi], al
!  mov [edi+2], ah
!  bswap eax
!  mov [edi+6], al
!  mov [edi+4], ah
!  add esi, 4
!  add edi, 8
!  dec ecx
!  jnz loop_start

' This POPPing of the registers we used could be necessary, but at the moment I don't know
'  pop edi
'  pop esi
'  pop ecx


    lpReadRegion.xRight = 79
    lpReadRegion.xBottom = 24
    WriteConsoleOutPut GetStdHandle(%STD_OUTPUT_HANDLE), _
    BYVAL STRPTR(sBuf),BYVAL SIZE, BYVAL 0&, lpReadRegion
END SUB 
---------------------------------------------------------------------------------------------------------------------------------


When somebody who has used PB INLINE ASSEMBLY has the time to correct the CODE,
I'll be glad to test it and report the performance difference.  :8)
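
Before timing anything, it may be worth checking that 1000 passes of four bytes fill the buffer exactly like 4000 single-byte steps. The following is a C model of that check, not PB code. All names are made up, and the 4000/8000 sizes are taken from the SUB above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SRC_LEN = 4000, TGT_LEN = 8000 };

/* Reference: one byte per step, like the PEEK/POKE loop. */
static void copy_bytewise(const uint8_t *src, uint8_t *tgt)
{
    for (size_t i = 0; i < SRC_LEN; i++)
        tgt[2 * i] = src[i];
}

/* Four bytes per pass, SRC_LEN / 4 = 1000 passes, like the ASM loop. */
static void copy_four_per_pass(const uint8_t *src, uint8_t *tgt)
{
    uint32_t ecx = SRC_LEN / 4;
    do {
        tgt[0] = src[0];
        tgt[2] = src[1];
        tgt[4] = src[2];
        tgt[6] = src[3];
        src += 4;
        tgt += 8;
    } while (--ecx != 0);
}

/* Returns 1 when both loops produce identical target buffers. */
int loops_agree(void)
{
    static uint8_t src[SRC_LEN], a[TGT_LEN], b[TGT_LEN];
    assert(SRC_LEN % 4 == 0 && TGT_LEN == 2 * SRC_LEN);
    for (size_t i = 0; i < SRC_LEN; i++)
        src[i] = (uint8_t)(i * 7 + 1);   /* arbitrary test pattern */
    memset(a, ' ', TGT_LEN);             /* like SPACE$(8000) */
    memset(b, ' ', TGT_LEN);
    copy_bytewise(src, a);
    copy_four_per_pass(src, b);
    return memcmp(a, b, TGT_LEN) == 0;
}
```

The last pass writes at offset 7992 + 6 = 7998, so the four-byte version stays inside the 8000-byte target.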

Thanks for your time.

Frank