ZeroMemory with SSE2

Started by woonsan, January 11, 2008, 07:55:11 PM

Mark_Larson

Quote from: asmfan on March 02, 2008, 02:44:56 PM
Have a look at - AMD_block_prefetch_paper.pdf (c)2001 Advanced Micro Devices Inc.
QuoteUsing Block Prefetch for Optimized Memory Performance by Advanced Micro Devices
Mike Wall
Member of Technical Staff
Developer Performance Team
And you'll find out what h/w prefetch I'm talking about.
The second interesting file is also here

I was aware.  Intel works the same way.  What I meant was that there is no instruction specifically designed to trigger hardware prefetching.  You read a cache line ahead of time to force the data into the cache.

EDIT:  On Intel processors I get better performance from prefetchnta than from a plain MOV.  But I have to use my program to find the optimum way to place it in the loop.

EDIT2:  I found my program; let me play with it a bit and see what it says.  I'll post it later.  It is written in C.
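Mark's tuning program isn't posted here, but the idea of placing prefetchnta ahead of the current read position can be sketched in C with the `_mm_prefetch` intrinsic. The function name and the `dist` parameter are illustrative, not his actual code:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */

/* Sum a buffer while issuing prefetchnta `dist` bytes ahead of the
 * current read.  `dist` is the knob a tuning program would search for;
 * a few cache lines (128-512 bytes) is a typical sweet spot.
 * Prefetching past the end of the buffer is harmless: a prefetch
 * instruction never faults. */
long sum_with_prefetchnta(const int *buf, size_t n, size_t dist)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % 16 == 0)  /* once per 64-byte line (16 ints) */
            _mm_prefetch((const char *)(buf + i) + dist, _MM_HINT_NTA);
        total += buf[i];
    }
    return total;
}
```

The point of searching for `dist` is that too small a distance means the data isn't in the cache yet when you reach it, and too large a distance means it may already have been evicted.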
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

NightWare

Quote from: Mark_Larson on March 02, 2008, 01:45:59 PM
Do you have a Core 2 Duo, NightWare?
yep, but it doesn't matter,  :lol  there is a constant in all the docs I've read, and I think we've completely forgotten the basics here...  :red with prefetch we read data (and then, optionally, alter it and write it back).
but here, for a zeromem or fillmem algo, there is no need to READ and alter... we just need to write a value we already have... so psd/prefetch/etc... are useless here... no ?  :lol

Mark_Larson

Quote from: NightWare on March 03, 2008, 03:15:15 AM
Quote from: Mark_Larson on March 02, 2008, 01:45:59 PM
Do you have a Core 2 Duo, NightWare?
yep, but it doesn't matter,  :lol  there is a constant in all the docs I've read, and I think we've completely forgotten the basics here...  :red with prefetch we read data (and then, optionally, alter it and write it back).
but here, for a zeromem or fillmem algo, there is no need to READ and alter... we just need to write a value we already have... so psd/prefetch/etc... are useless here... no ?  :lol

actually, if you want the algorithm to run as fast as possible, the data you want to write to memory needs to be in the cache.  And the way you ensure that is with h/w or s/w prefetch.


NightWare

Quote from: Mark_Larson on March 03, 2008, 04:56:47 PM
the data you want to write to memory needs to be in the cache
it's in the cache already, you put it there when you used xor reg,reg or mov reg,imm. so why do you want to copy the block you're going to clear into the cache ? that's useless work you're asking the cpu to do...  :wink

Mark_Larson

you need to re-look at the code. The code is actually pre-reading the NEXT cache line while you are processing the current cache line's data.  That way you are guaranteed that when you start processing the data, it is already in the cache.  TLB priming works the same way: you touch the next page before you actually need it.
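The TLB-priming part of this can be sketched in C. `zero_with_tlb_priming` and the page-touch scheme below are a hypothetical illustration of the technique being described, not the actual MASM routine:

```c
#include <stddef.h>
#include <string.h>

#define PAGE 4096

/* Zero a buffer page by page, "priming" the TLB by touching the first
 * byte of the NEXT page before writing the current one.  The touch
 * forces the page-table walk early, so the TLB entry is already
 * resolved by the time the stores reach that page. */
void zero_with_tlb_priming(unsigned char *buf, size_t n)
{
    volatile unsigned char sink = 0;
    for (size_t off = 0; off < n; off += PAGE) {
        if (off + PAGE < n)
            sink = buf[off + PAGE];      /* prime next page's TLB entry */
        size_t chunk = (n - off < PAGE) ? n - off : PAGE;
        memset(buf + off, 0, chunk);     /* work on the current page */
    }
    (void)sink;
}
```

The `volatile` read is there only so the compiler doesn't discard the priming load as dead code.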

NightWare

Quote from: Mark_Larson on March 04, 2008, 03:22:26 PM
You load the next Page, before you actually need it.
::) and is there a reason/case where you actually need it, for a zeromem or fillmem algo ?

Mark_Larson

Quote from: NightWare on March 06, 2008, 02:27:17 AM
Quote from: Mark_Larson on March 04, 2008, 03:22:26 PM
You load the next Page, before you actually need it.
::) and is there a reason/case where you actually need it, for a zeromem or fillmem algo ?

yes, so you don't get cache misses on the data.  By prefetching you guarantee the data is always in the cache.

asmfan

As I said somewhere, by dynamically determining the cache line size with cpuid we can make things faster.
I posted a cpuid program on the fasm forum... wait a minute, I'll find it.
Here it is
Using the manuals (Intel/AMD, whichever) we can easily find the cpuid input value that reports the cache line size (80000006h -> ecx [0:7]).
I believe that's the way to smart programs: run cpuid first, then deliver maximum performance according to the CPU's abilities.
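For the curious, the leaf asmfan names can be read from C with GCC/Clang's `__get_cpuid` helper (x86 only); the function name `cache_line_size` is illustrative:

```c
#include <cpuid.h>   /* GCC/Clang: __get_cpuid */

/* Read the L2 cache line size via cpuid leaf 80000006h, ecx bits [0:7].
 * The leaf is defined by AMD and implemented on most Intel CPUs too;
 * returns 0 if the extended leaf is unavailable. */
unsigned cache_line_size(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000006, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx & 0xFF;   /* line size in bytes */
}
```

A fill routine could then pick its unroll factor or prefetch stride from this value at startup instead of hard-coding 64.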
Russia is a weird place

daydreamer

Quote from: hutch-- on March 01, 2008, 10:41:34 PM
Something I should have mentioned with multicore processors: it's worth trying multiple threads to zero the memory in multiple blocks. On a single-processor machine this would be much slower, as thread overhead would kill it, but if the thread-overhead considerations can be overcome, you may get the advantages of parallelism on a multi-processor machine.
a synchronizing variable can cause cache pollution, which you don't want, since it stops the cache from working optimally for the fill. The Intel paper I read on this says clearly the variable must sit alone on a 128-byte-aligned cache line; mixed with other data it can cause cache pollution, and writing out 128 bytes just for that also cuts into the bandwidth needed for the fill. cache pollution means a loop of the cache doing writeback to memory and readback, and performance drops radically.
with non-synchronized threads doing the fills, I have no idea how much ineffective zeroing is caused when one thread gets too far away in memory addresses from the other.
and if you already hit the memory-speed bottleneck with one thread, what's the point?
but if you have an idea for how to fill memory with multiple threads that could work, you could theoretically get a 2x speedup on a dualcore until memory bandwidth bottlenecks it. let's code it
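One way to keep the threads out of each other's cache lines, as a rough sketch: give each worker one contiguous, cache-line-aligned slice. `fill_slice` is a hypothetical helper, and the slices are filled sequentially here; in real code each call would run on its own thread:

```c
#include <stddef.h>
#include <string.h>

#define LINE 64  /* assumed cache line size */

/* Fill worker tid's slice of a buffer split across nthreads workers.
 * Slice boundaries are rounded up to a cache line, so no two workers
 * ever touch the same line (avoiding the cache pollution daydreamer
 * warns about). */
void fill_slice(unsigned char *buf, size_t n, int tid, int nthreads)
{
    size_t per = ((n / nthreads) + LINE - 1) & ~(size_t)(LINE - 1);
    size_t start = (size_t)tid * per;
    if (start >= n)
        return;                       /* nothing left for this worker */
    size_t len = (start + per > n) ? n - start : per;
    memset(buf + start, 0, len);
}
```

Since each slice is written exactly once and the workers share nothing, no synchronizing variable is needed until the final join.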

daydreamer

Quote from: asmfan on March 06, 2008, 04:03:42 PM
As I said somewhere, by dynamically determining the cache line size with cpuid we can make things faster.
I posted a cpuid program on the fasm forum... wait a minute, I'll find it.
Here it is
Using the manuals (Intel/AMD, whichever) we can easily find the cpuid input value that reports the cache line size (80000006h -> ecx [0:7]).
I believe that's the way to smart programs: run cpuid first, then deliver maximum performance according to the CPU's abilities.
thanks, nice info. do you also know how to use cpuid to find out how much texture data I could keep in cache, i.e. how to read the cache's size?

asmfan

Quote from: daydreamer on March 07, 2008, 08:23:44 AM
thanks, nice info. do you also know how to use cpuid to find out how much texture data I could keep in cache, i.e. how to read the cache's size?
I think that question is GPU/video-card related rather than CPU, but if you do your own processing you can determine how much L1/L2/L3 cache is available and how many cores and threads per core are supported. Dividing the total L2 cache into chunks (texture sizes) gives you the total number of textures that fit in L2 cache. Passing those chunks to the appropriate number of threads will give a performance boost.
Take a look at these manuals:
[AMD CPUID Specification] 25481.pdf
[Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A - Instruction Set Reference, A-M] 253666.pdf
Some cpuid fields are available on both chips (Intel & AMD) but others are vendor-specific. For compatibility, compare the documentation from the different vendors.
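asmfan's chunk arithmetic is simple enough to sketch; the L2 size and texture dimensions used in the test below are example values only:

```c
/* How many textures of a given size fit in an L2 cache of l2_bytes?
 * tex_w/tex_h in texels, bpp in bits per texel. */
unsigned textures_in_l2(unsigned l2_bytes, unsigned tex_w,
                        unsigned tex_h, unsigned bpp)
{
    unsigned tex_bytes = tex_w * tex_h * (bpp / 8);
    return tex_bytes ? l2_bytes / tex_bytes : 0;
}
```

In the scheme asmfan describes, that count decides how many texture-sized chunks get handed out to the worker threads.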

johnsa

From my tests with memory filling (or zeroing) I've found the following to be pretty consistently true:

0 - 4096 bytes (Use RISC moves, 4x unrolled with a negative loop / index value)
4096 bytes - 32kb (Use SSE+TLB Priming+8x unroll)
32kb-3Mb (Use REP STOSD - this is probably dependent on L2 cache size?)
3Mb+ (Use an MMX or SSE non temporal store with movntdq or movntq)

I've tried it on 5 or 6 different processor types and those ranges seem pretty close on all of them.
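johnsa's thresholds can be written down as a size dispatcher. The enum and function names below are made up, and the branches would dispatch to the real MASM fill routines; the 3 MB cutoff likely tracks L2 size, so treat the numbers as starting points rather than constants:

```c
#include <stddef.h>

/* Which fill strategy to use for a block of n bytes, per johnsa's
 * measured ranges. */
typedef enum { FILL_RISC, FILL_SSE_TLB, FILL_REP_STOSD, FILL_NT } fill_kind;

fill_kind pick_fill(size_t n)
{
    if (n <= 4096)            return FILL_RISC;      /* 0 - 4 KB: unrolled moves */
    if (n <= 32 * 1024)       return FILL_SSE_TLB;   /* 4 KB - 32 KB: SSE + TLB priming */
    if (n <= 3 * 1024 * 1024) return FILL_REP_STOSD; /* 32 KB - 3 MB: rep stosd */
    return FILL_NT;                                  /* 3 MB+: non-temporal stores */
}
```

A generic mem-fill would call this once up front and jump to the matching routine, which is exactly the dispatch johnsa proposes two posts down.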

NightWare

johnsa,

after your post I tested an old rep stosd/stosb algo (old because it's slower in speed tests), and the result is just a bit under movntps (-4 fps; besides, here the job is entirely done). thanks for the info :U

johnsa

I'm going to perform the same tests with mem-copying and try to determine which algos work best for which data sizes.
It seems that the best solution for a generic mem-fill or mem-copy would be to determine the data size up front and use whichever of the 4 possible solutions fits best.

It's quite interesting: for example, if you were using the memory fill to clear your screen buffer, the resolution you are running at would determine which approach is better.
(1024*768*32-bit) = 3 MB, which is right on the boundary between REP STOSD and an MMX/SSE NT write. So 1280x1024 would be best filled using NT stores and 800x600 with rep stosd.

NightWare

hi,

i said somewhere "i'm going to test movnti"; i've been busy, but i've tested it now, and it's exactly the subject of this topic :
ALIGN 16
;
; Syntax :
; mov eax,BlockSize
; mov edx,FillValue
; mov edi,MemBlockPointer
; call Sse2_DwFillMem_NT
;
Sse2_DwFillMem_NT PROC
push ecx

and eax,11111111111111111111111111111100b ; to avoid gpf
; owords
Label1: mov ecx,eax
and ecx,11111111111111111111111111110000b
jz Label3
Label2: movnti DWORD PTR[edi],edx
movnti DWORD PTR[edi+4],edx
movnti DWORD PTR[edi+8],edx
movnti DWORD PTR[edi+12],edx
add edi,DWORD*4
sub ecx,DWORD*4
jnz Label2
; dwords
Label3: mov ecx,eax
and ecx,00000000000000000000000000001100b
jz Label5
add edi,ecx
neg ecx
Label4: movnti DWORD PTR [edi+ecx],edx
add ecx,DWORD
jnz Label4
; end
Label5: sub edi,eax ; restore edi

pop ecx
ret
Sse2_DwFillMem_NT ENDP


and i've obtained exactly the same result as with movntps/movntdq (the one i use in my app) :

ALIGN 16
;
; Syntax :
; mov eax,BlockSize
; mov edx,FillValue
; mov edi,MemBlockPointer
; call Sse2_DwFillMem_NT
;
Sse2_DwFillMem_NT PROC
push ecx

and eax,11111111111111111111111111111100b ; to avoid gpf
; value to simd register
movd XMM0,edx ; XMM0 = _,_,_,x
pshufd XMM0,XMM0,000h ; XMM0 = x,x,x,x
; owords x4
Label1: mov ecx,eax
and ecx,11111111111111111111111111000000b
jz Label3
Label2: movntdq OWORD PTR[edi],XMM0
movntdq OWORD PTR[edi+16],XMM0
movntdq OWORD PTR[edi+32],XMM0
movntdq OWORD PTR[edi+48],XMM0
add edi,OWORD*4
sub ecx,OWORD*4
jnz Label2
; owords
Label3: mov ecx,eax
and ecx,00000000000000000000000000110000b
jz Label5
add edi,ecx
neg ecx
Label4: movntdq OWORD PTR[edi+ecx],XMM0
add ecx,OWORD
jnz Label4
; dwords
Label5: mov ecx,eax
and ecx,00000000000000000000000000001100b
jz Label7
add edi,ecx
neg ecx
Label6: movnti DWORD PTR [edi+ecx],edx
add ecx,DWORD
jnz Label6
; end
Label7: sub edi,eax ; restore edi

pop ecx
ret
Sse2_DwFillMem_NT ENDP


now, it's Non-Temporal, so if we think in terms of speed, one of the previous algos certainly finishes the job before the other; i don't know which one yet. but movnti is really interesting even for those who don't want to use simd stuff. of course, you need to understand when to use it, but in some cases it can really speed up your code.

advantages :
the treatment is by dword rather than by qword/oword (simd register), so it's easier to use.
you can easily replace your mov instructions with it (when you need to store data in your existing algos).
you preserve a simd register (and i appreciate that...).
your memory block doesn't have to be 16-byte aligned.

limits :
you can't use it for word/byte stores

disadvantages :
even if you use it exclusively on 32-bit registers, it's an SSE2 instruction, so you will have to test the cpu before using it.
you also need to understand why and when to use lfence/sfence/mfence.
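for reference, movnti is reachable from C as the SSE2 intrinsic `_mm_stream_si32`. this sketch mirrors the dword fill above at a high level (minus the tail handling) and is not NightWare's routine:

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_stream_si32 (movnti), _mm_sfence */

/* Fill ndwords 32-bit slots with `value` using non-temporal stores
 * that bypass the cache, then fence so the stores are globally
 * visible before the buffer is used.  dst only needs the natural
 * 4-byte alignment of int, not 16-byte alignment. */
void dword_fill_nt(int *dst, size_t ndwords, int value)
{
    for (size_t i = 0; i < ndwords; i++)
        _mm_stream_si32(dst + i, value);  /* compiles to movnti */
    _mm_sfence();  /* order NT stores before subsequent accesses */
}
```

the sfence at the end is the "why and when" from the disadvantages list: NT stores are weakly ordered, so without it a consumer on another core could read stale data.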

conclusion : VOTE FOR MOVNTI ! i'm nightware and i approve this message...  :lol