Initializing Memory with ones

Started by Empirelord, November 20, 2010, 11:42:05 PM

Empirelord

What is the fastest way to fill a large block of memory, about 1.6 GByte to be exact, completely with FF in hexadecimal?
My best idea was to write it all into memory via mov, but knowing little about RAM architecture I thought there must be a faster way.

dedndave

fill it with 0's, then invert   :lol
or, you could try this code
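        mov     edi,offset MemBuff      ; destination pointer for stosd
        mov     ecx,(sizeof MemBuff)/4  ; number of dwords to store
        mov     eax,0FFFFFFFFh          ; fill pattern
        rep     stosd                   ; store eax at [edi], ecx times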

it will be reasonably fast, as long as the base address of MemBuff is 4-aligned

notice that the buffer size should be divisible by 4, as well
if it isn't, add a few pad bytes to the end so that it is
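
If the buffer cannot be padded to a multiple of 4, the same idea can be extended with a byte-wise tail. This is a minimal sketch, not code posted in this thread: the routine name FillFF and its parameters are made up for illustration, and it assumes a 32-bit flat-model build.

```asm
        .686
        .MODEL FLAT, STDCALL
        .CODE

; Hypothetical routine (name and parameters are assumptions).
; Fills DataSize bytes at Buffer with 0FFh.
FillFF  PROC    USES edi Buffer:PTR BYTE, DataSize:DWORD

        cld                             ; make sure stos counts upward
        mov     edi, Buffer
        mov     eax, 0FFFFFFFFh         ; fill pattern (AL = 0FFh as well)
        mov     ecx, DataSize
        shr     ecx, 2                  ; number of whole dwords
        rep     stosd
        mov     ecx, DataSize
        and     ecx, 3                  ; 0 to 3 leftover bytes
        rep     stosb                   ; does nothing when ecx = 0
        ret

FillFF  ENDP

        END
```

The dword part runs first so that a 4-aligned Buffer stays aligned for the bulk of the work; the tail never costs more than three byte stores.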

hutch--

On memory of that size if it has to be done regularly you would be better to use SSE 128 bit fills. Think of instructions like MOVNTDQA if the memory is aligned correctly.

clive

        .686

        .MODEL FLAT,C

        .MMX
        .XMM

        .CODE

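FastFill PROC   DataSize:DWORD, Buffer:PTR BYTE

        push    esi
        mov     esi, Buffer
        mov     ecx, DataSize
        shr     ecx, 6                  ; number of 64-byte blocks; a remainder below 64 bytes is not filled
        jz      Done                    ; fewer than 64 bytes: skip the loop instead of wrapping ecx

        movups     xmm0,AllFF           ; xmm0 = 16 bytes of 0FFh

@@:
        movups     [esi +  0], xmm0     ; four unaligned 16-byte stores per pass
        movups     [esi + 16], xmm0
        movups     [esi + 32], xmm0
        movups     [esi + 48], xmm0
        add     esi, 64
        add     ecx, -1
        jnz     @B

Done:
        pop     esi
        ret

FastFill        ENDP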

        .DATA

AllFF   dd      -1,-1,-1,-1

        END

It could be a random act of randomness. Those happen a lot as well.

dedndave

the AllFF define should be 16 aligned ?

Gunther

Quote from: dedndave on November 21, 2010, 03:20:51 AM
the AllFF define should be 16 aligned ?

No, not in that case, because Clive is using MOVUPS (move unaligned packed single).

Gunther
Forgive your enemies, but never forget their names.

jj2007

Quote from: hutch-- on November 21, 2010, 02:49:12 AM
On memory of that size if it has to be done regularly you would be better to use SSE 128 bit fills. Think of instructions like MOVNTDQA if the memory is aligned correctly.

Hutch has the fastest solution. Align the memory first (but most probably it is already aligned), then use MOVNTDQA. You can unroll it a little bit to save some cycles.

The point about MOVNTDQA is that it does not write to the data cache.
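
The "align the memory first" step could look roughly like the sketch below. It is an illustration under assumptions, not code from this thread: AlignHead is a made-up helper name, and returning the remaining byte count in edx is simply a convention chosen for this example. Memory obtained from VirtualAlloc is already page-aligned, in which case the head is zero bytes long.

```asm
        .686
        .MODEL FLAT, STDCALL
        .CODE

; Hypothetical helper (name and return convention are assumptions).
; Fills with 0FFh up to the next 16-byte boundary.
; Returns: eax = first 16-aligned address (or the buffer end if it comes first)
;          edx = bytes still to be filled from eax onward
AlignHead PROC  USES edi Buffer:PTR BYTE, DataSize:DWORD

        cld
        mov     edi, Buffer
        mov     ecx, edi
        neg     ecx
        and     ecx, 15                 ; bytes needed to reach the boundary (0 to 15)
        mov     edx, DataSize
        cmp     ecx, edx
        jbe     @F
        mov     ecx, edx                ; buffer ends before the boundary
@@:
        sub     edx, ecx                ; bytes left after the head
        mov     al, 0FFh
        rep     stosb                   ; fill the unaligned head byte by byte
        mov     eax, edi                ; edi now points past the head
        ret

AlignHead ENDP

        END
```

After the call, an aligned 16-byte store loop can start at eax and run for edx bytes, with any remainder below one block handled separately.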

sinsi

Isn't MOVNTDQA sse4?

I thought for more than 256 meg 'rep stosd' was pretty speedy.
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on November 21, 2010, 09:31:37 AM
Isn't MOVNTDQA sse4?

I thought for more than 256 meg 'rep stosd' was pretty speedy.

Yes, correct - it's SSE4. But there is an 'ordinary' variant, movntdq. Note that in standard timing benchmarks it looks pretty bad because it writes without caching; you would have to change the testbed for Gigabyte size to see the difference:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1405    cycles for 100*movdqa
????   cycles for 100*movntdq


EDIT: There is something weird here. See attachment, third loop.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1554    cycles for 100*movdqa
1924    cycles for 100*movntdq
1571    cycles for 100*MOVNTPD


1549    cycles for 100*movdqa
24888   cycles for 100*movntdq ; without 'speedup'



More detail on performance here.

Antariy

Quote from: jj2007 on November 21, 2010, 10:13:56 AM
Yes, correct - it's SSE4. But there is an 'ordinary' variant, movntdq. Note that in standard timing benchmarks it looks pretty bad because it writes without caching; you would have to change the testbed for Gigabyte size to see the difference:

See "http://www.masm32.com/board/index.php?topic=14685.msg119904#msg119904" and read the whole thread.

For a buffer that is several times bigger than the L2 cache, MOVNTDQ would be the best choice.



Alex
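
For reference, MOVNTDQ is the SSE2 non-temporal store, while MOVNTDQA is an SSE4.1 streaming load, so only the former applies to a fill. Below is a minimal sketch of such a fill, assuming a 16-byte aligned buffer whose size is a multiple of 64. The routine name FillFF_NT and its parameters are made up for illustration, and no timings are claimed for it.

```asm
        .686
        .MODEL FLAT, STDCALL
        .XMM
        .CODE

; Hypothetical routine (name and parameters are assumptions).
; Assumes: Buffer is 16-byte aligned, DataSize is a multiple of 64, CPU has SSE2.
FillFF_NT PROC  Buffer:PTR BYTE, DataSize:DWORD

        mov     edx, Buffer
        mov     ecx, DataSize
        shr     ecx, 6                  ; number of 64-byte blocks
        jz      Done                    ; nothing to do for sizes below 64
        pcmpeqd xmm0, xmm0              ; every dword equals itself, so all bits are set

@@:
        movntdq [edx +  0], xmm0        ; non-temporal stores: written through
        movntdq [edx + 16], xmm0        ; write-combining buffers instead of
        movntdq [edx + 32], xmm0        ; being kept in the data cache
        movntdq [edx + 48], xmm0
        add     edx, 64
        dec     ecx
        jnz     @B

        sfence                          ; order the streaming stores before later stores
Done:
        ret

FillFF_NT ENDP

        END
```

Building the pattern with pcmpeqd avoids a memory constant altogether, so the question of aligning an AllFF-style variable does not come up. MOVNTDQ itself faults on an address that is not 16-byte aligned, which is why the alignment assumption matters here.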

Empirelord

Thanks for all the replies, great forum.
I'm going to figure out which solution is fastest in my case and still offers enough compatibility (not every PC has SSE4).

@hutch-- :
I was not quite sure where to post my question, so thanks for moving it to the right subforum.

dedndave

i still say my original idea sounds best   :P

Quote: fill it with 0's, then invert

Antariy

Quote from: dedndave on November 21, 2010, 09:06:13 PM
i still say my original idea sounds best   :P

Quote: fill it with 0's, then invert...

... and all with non-temporal writes for big buffers :P