ASM for FUN - #1 SUB

Started by frktons, April 18, 2010, 08:52:38 PM

dedndave

yah - that's what i was saying - the alternatives aren't any better
another approach might be to read words instead of dwords - that doesn't sound very promising   :P
although, with the code you are using, it may help to ensure the source string is 4-aligned

dedndave

Frank - Hutch is probably the most familiar with PB in here

frktons

Quote from: jj2007 on April 19, 2010, 07:39:19 PM
Quote from: dedndave on April 19, 2010, 04:34:44 PM
bswap isn't super-fast, but probably better than the alternatives

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
102     cycles for bswap A
111     cycles for bswap B
111     cycles for shr eax, 8 A
123     cycles for shr eax, 8 B
114     cycles for lodsd+bswap


Try your luck :bg

Unfortunately, I cannot test pshufb on my SSE2 PC...

On my PC the results are different:

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\masm32\examples\strings>fillstringfrkttons
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
156     cycles for bswap A
119     cycles for bswap B
107     cycles for shr eax, 8 A
133     cycles for shr eax, 8 B
89      cycles for lodsd+bswap

1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x
1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x

C:\masm32\examples\strings>



It looks like this version is the fastest:

89      cycles for lodsd+bswap

I'll look at the code when I get time, but it will require another ASM lesson  :dazzled:

Well Dave, the string is 4-byte aligned, so there should be no problem.
If Hutch would be so kind as to join us on this path, the code warrior path, he would spare me some headaches  :8)
Mind is like a parachute. You know what to do in order to use it :-)

jj2007

Quote from: dedndave on April 19, 2010, 07:48:11 PM
another approach might be to read words instead of dwords - that doesn't sound very promising   :P

Even unrolled, it is slower than bswap...

101     cycles for bswap A
110     cycles for bswap B
114     cycles for word read

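mov ecx, sizeof src/4          ; loop count: 4 source bytes per pass
mov esi, offset src            ; esi -> source string
mov edi, offset tgt            ; edi -> target buffer
.Repeat
movzx eax, word ptr [esi]      ; read source bytes 0 and 1
mov [edi], al                  ; byte 0 -> target offset 0
mov [edi+2], ah                ; byte 1 -> target offset 2
movzx eax, word ptr [esi+2]    ; read source bytes 2 and 3
mov [edi+4], al                ; byte 2 -> target offset 4
mov [edi+2+4], ah              ; byte 3 -> target offset 6
add esi, 4                     ; advance source by 4 bytes
add edi, 8                     ; advance target by 8 bytes
dec ecx
.Until Zero?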
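
For readers who don't follow the assembly, here is a rough Python model of what the word-read loop above does. It is only a sketch for illustration: the function name `word_read_interleave` is made up, the variables merely mirror the registers, and it says nothing about speed.

```python
import struct

def word_read_interleave(src: bytes, tgt: bytearray) -> None:
    """Model of the unrolled word-read loop: 4 source bytes -> 8 target bytes."""
    assert len(src) % 4 == 0 and len(tgt) >= 2 * len(src)
    esi, edi = 0, 0                        # source / target offsets
    for _ in range(len(src) // 4):         # ecx = sizeof src / 4
        for half in (0, 2):                # the two movzx word reads
            (ax,) = struct.unpack_from("<H", src, esi + half)
            tgt[edi + 2 * half] = ax & 0xFF      # al -> [edi] or [edi+4]
            tgt[edi + 2 * half + 2] = ax >> 8    # ah -> [edi+2] or [edi+6]
        esi += 4
        edi += 8

tgt = bytearray(b"x" * 16)
word_read_interleave(b"12345678", tgt)
print(tgt.decode("ascii"))   # 1x2x3x4x5x6x7x8x
```

Only the even offsets of the target are written, so the filler bytes at the odd offsets survive, which matches the 1x2x3x... pattern in the test output.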

frktons

To test the routine we discussed, we just need 3 things:

1) allocate a source string of 4000 bytes filled with "A" or any other ASCII character
2) allocate a target string of 8000 bytes filled with "*" or anything other than "A"
3) print the first 20 bytes of the target string after the loop, just to be sure it worked.

The loop counter will start at 1000 and go down to 1.
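
As a rough picture of that test setup, the three steps could be modelled like this in Python. This is only a sketch of the expected behaviour, not the MASM or PB code; the helper name `spread` is hypothetical.

```python
def spread(src: bytes, tgt: bytearray) -> None:
    """Copy each source byte to every other target byte (even offsets)."""
    tgt[0:2 * len(src):2] = src

src = b"A" * 4000                 # 1) source string, 4000 bytes of "A"
tgt = bytearray(b"*" * 8000)      # 2) target string, 8000 bytes of "*"

for _ in range(1000, 0, -1):      # loop counter from 1000 down to 1
    spread(src, tgt)

print(tgt[:20].decode("ascii"))   # 3) first 20 bytes: A*A*A*A*A*A*A*A*A*A*
```

If the routine works, the first 20 bytes of the target alternate between the source character and the filler.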

You surely know how to change the code we used before, so if you'd like to carry on the experiment, please
let me know how to do the 3 things above.

I can test the PB routine against the ASM one, written in MASM32 code, and see what the difference is.
When we get help from hutch or somebody else who knows PB, we can run a new test.

Just to give an idea of the PEEK/POKE gain over MID$: to convert 1000 strings of 4000 bytes I got:

MID$ version ========> 12,665,435,382 CPU cycles
PEEK/POKE version ===>     53,355,308 CPU cycles

the second version is about 240 times faster than the previous one. I think with MASM we can get to half the cycles. 

Cheers

Frank

dedndave

we should use the same timing method as you used on the others for comparison
so, i did not put the timing code in that we normally use
let us know how it compares

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
5596    cycles for bswap 4k
98      cycles for bswap A      23 bytes
105     cycles for bswap B      24 bytes
109     cycles for word read    27 bytes
105     cycles for shr eax, A   31 bytes
117     cycles for shr eax, B   37 bytes
108     cycles for lodsd+bsw A  20 bytes
117     cycles for lodsd+bsw B  24 bytes

dedndave

prescott
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
8115    cycles for bswap 4k
177     cycles for bswap A      23 bytes
179     cycles for bswap B      24 bytes
177     cycles for word read    27 bytes
175     cycles for shr eax, A   31 bytes
205     cycles for shr eax, B   37 bytes
273     cycles for lodsd+bsw A  20 bytes
475     cycles for lodsd+bsw B  24 bytes


Quote
MID$ version ========> 12,665,435,382 CPU cycles
PEEK/POKE version ===>     53,355,308 CPU cycles

the second version is about 240 times faster than the previous one. I think with MASM we can get to half the cycles.

i think Frank is going to see why we like ASM   :bg
how about a 10,000 to 1 improvement
you can almost paint a house in 53,000,000 cycles

dedndave

i had just rebuilt my C drive and forgot to apply the throttle registry fix
here are the correct prescott numbers...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
7381    cycles for bswap 4k
173     cycles for bswap A      23 bytes
175     cycles for bswap B      24 bytes
173     cycles for word read    27 bytes
171     cycles for shr eax, A   31 bytes
177     cycles for shr eax, B   37 bytes
270     cycles for lodsd+bsw A  20 bytes
466     cycles for lodsd+bsw B  24 bytes

hutch--

Frank,

This much you will find: with the later versions of PB that have the "#align ##" directive, the code you write in assembler in PB is as fast as the code you write in MASM. Where you will find differences is in the stack overhead, as PB conforms to the basic specification of zeroing locals and setting strings to a null string. With a procedure of any size it does not matter, but with very short procedures like the ones you use in MASM with no stack frame, you have the choice of simply inlining the assembler code and having no stack overhead at all, which is an optimal solution.

With a memory buffer for console output, you don't have to limit yourself to matching the screen buffer size. Memory in Windows is not all that finely granulated, so you could safely allocate 64k for a safe-sized buffer with no loss, which gives you a heap of headroom when experimenting.

PB is a good vehicle for gradually learning assembler programming, as you can, with a bit of caution, mix HLL and ASM code. Just make sure you use the #REGISTER NONE directive, which turns off register-based optimisation; otherwise your register usage will do things that you did not write.
Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

frktons

The sun is rising, and I'm awakening  :eek

So here we have quite an impressive routine from JJ, it'll take a while to get all of it  ::)

By the way, I have used it and I get this output:

----------------------------------------------------------------------------------------------------------------------------
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
5521    cycles for bswap 4k
115     cycles for bswap A      23 bytes
110     cycles for bswap B      24 bytes
113     cycles for word read    27 bytes
108     cycles for shr eax, A   31 bytes
126     cycles for shr eax, B   37 bytes
111     cycles for lodsd+bsw A  20 bytes
127     cycles for lodsd+bsw B  24 bytes

1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x
1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x1x2x3x4x5x6x7x8x9x0x
----------------------------------------------------------------------------------------------------------------------------


At first glance, does the "5521    cycles for bswap 4k" line of output refer to converting a
4000-byte string into an 8000-byte one 1 million times?  :dazzled:

Well I hope I'll have time enough during the day to have a closer look at JJ's routine to grasp something
more:
- the use of counters, timings, and so on.

Of course at the moment I don't know if the way PB counts cycles with the TIX instruction is the same as what
this MASM counter does, but anyway it is a massive improvement, much more than expected, considering
I was using POINTERS and moving bytes with PEEK/POKE, quite similar to what we are doing in MASM
with this routine. So I expected the assembly generated by the PB compiler would be closer to the hand-crafted routine,
but it looks like I was wrong  :P

Going to work

Have a nice day or evening or night, depending on where you live  :boohoo:

Frank


dedndave

Jochen's code is hard to read for n00bs like me   :P
he uses a lot of if/while structures that make it harder for me to understand
i prefer straight ASM - and indentation also makes it hard
i am an old dog and it's hard to learn new trix

jj2007

Quote from: frktons on April 20, 2010, 05:47:53 AM
At first glance, does the "5521    cycles for bswap 4k" line of output refer to converting a 4000-byte string into an 8000-byte one 1 million times?  :dazzled:

hi Frank,
The 5500 cycles are per single conversion of 4000 bytes. I have no idea what PB measures, it might be true cycles or QPC ticks.

frktons

Quote from: jj2007 on April 20, 2010, 06:50:45 AM
Quote from: frktons on April 20, 2010, 05:47:53 AM
At first glance, does the "5521    cycles for bswap 4k" line of output refer to converting a 4000-byte string into an 8000-byte one 1 million times?  :dazzled:

hi Frank,
The 5500 cycles are per single conversion of 4000 bytes. I have no idea what PB measures, it might be true cycles or QPC ticks.


OOOHHH well. This is more reasonable. So for 1000 loops of the 4000-byte string conversion it'd take some
5,521,000 cycles, which is a tenfold improvement. Quite impressive all the same: compared to the original MID$
version it is roughly a 2,300-fold gain.
As soon as I'm able to test it inside the PB SUB, we'll have the real numbers.

At the moment we have these temporary results:

MID$ version ========> 12,665,435,382 CPU cycles
PEEK/POKE version ===>     53,355,308 CPU cycles
MASM version ========>      5,521,000 CPU cycles
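
The ratios behind these figures can be checked with a few lines of Python. Keep in mind that the MASM number is an estimate (the 5521-cycle single conversion scaled by 1000), not a measured total, and that the PB and MASM timers may not count the same thing; the variable and function names are just for illustration.

```python
mid_cycles = 12_665_435_382        # MID$ version, 1000 conversions
peek_poke_cycles = 53_355_308      # PEEK/POKE version, 1000 conversions
masm_cycles = 5_521 * 1_000        # estimate: 5521 cycles per conversion x 1000

def speedup(slow: int, fast: int) -> float:
    """How many times fewer cycles the faster version needs."""
    return slow / fast

print(f"MID$ -> PEEK/POKE: {speedup(mid_cycles, peek_poke_cycles):.0f}x")   # 237x
print(f"PEEK/POKE -> MASM: {speedup(peek_poke_cycles, masm_cycles):.1f}x")  # 9.7x
print(f"MID$ -> MASM:      {speedup(mid_cycles, masm_cycles):.0f}x")        # 2294x
print(f"MASM cycles per source byte: {5_521 / 4_000:.2f}")                  # 1.38
```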

The timers.asm is already in the macros folder, no need to grab it.  :8)

Enjoy and thanks for your help

Frank