
Copying to L1 cache minimizing conflicts

Started by ASMManiac, February 18, 2012, 08:13:33 PM


ASMManiac

I have a large (2D) matrix and I would like to copy sub-matrices to contiguous memory so they will all fit in L1 cache for processing.
I will be using movdqu from the SSE instruction set to move the data.

My question is this:
How can I copy the data to minimize cache conflicts?
If I am copying a 128x128 submatrix (1-byte data) I will copy each column (column-wise ordering) from the large array to my temporary contiguous array.
However, I don't want the data from the original submatrix polluting the cache.
The non-temporal move instructions are not desirable here, because I do want to bring each column of the input array into the cache once, to make copying the rest of the column faster; but after I have copied the entire column I would like to evict it entirely.
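
For reference, here is a minimal sketch of the copy I have in mind, assuming the large matrix is stored column-major with a hypothetical leading dimension SRC_STRIDE, so each 128-byte column of the submatrix is contiguous:

SRC_STRIDE equ 4096             ; hypothetical distance between columns
    ; copy a 128x128 byte submatrix to a contiguous, 16-aligned buffer
    ; esi = first source column, edi = destination buffer
    mov ecx, 128                ; 128 columns
next_col:
    mov edx, 8                  ; 128 bytes = 8 x 16-byte moves
next_16:
    movdqu xmm0, [esi]          ; source may be unaligned
    movdqa [edi], xmm0          ; destination buffer is 16-aligned
    add esi, 16
    add edi, 16
    dec edx
    jnz next_16
    add esi, SRC_STRIDE - 128   ; step to the start of the next column
    dec ecx
    jnz next_col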

Also, how do I discover the amount of L1 cache on my machine?

dedndave

Quote from: ASMManiac
Also, how do I discover the amount of L1 cache on my machine?

for older processors, CPUID, EAX = 2

for newer processors, you do a CPUID with EAX = 80000000h to determine the highest supported extended function (returned in EAX)
if that value is 80000005h or higher, you can use...
CPUID with EAX = 80000005h
the returned values are a bit more complex to interpret, as caches on newer CPUs vary a bit

Intel CPUID specification
http://www.intel.com/Assets/PDF/appnote/241618.pdf

AMD CPUID specification
http://support.amd.com/us/Embedded_TechDocs/25481.pdf
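
a minimal sketch of that sequence, assuming an AMD-style CPU where extended leaf 80000005h returns the L1 data cache size in KB in bits 31-24 of ECX (Intel reports its caches differently - see the PDFs above):

    mov eax, 80000000h
    cpuid                       ; EAX = highest extended function
    cmp eax, 80000005h
    jb  no_cache_info           ; leaf not supported on this CPU
    mov eax, 80000005h
    cpuid
    shr ecx, 24                 ; ECX[31:24] = L1 data cache size in KB
    mov L1DataKB, ecx           ; L1DataKB is a dword variable
no_cache_info: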

ASMManiac

Thanks, I also just found a program called CPU-Z, which tells me everything I need to know about my CPU:
http://www.cpuid.com/downloads/cpu-z/1.59-setup-en.exe

hutch--

The key to what you are trying to do is preloaded AND aligned memory, even if you need another thread to keep feeding it. Have a play with the various prefetch instruction variants to see if they make any difference, and experiment with both the cache-bypassing stores and the normal write-back ones to see whether cache pollution is the problem.

The idea with additional thread(s) is to isolate the initial load from the repeat fetches you do on the data you need to transfer, but note that you are performing 2 memory operations, and memory is slow compared to register operations. The normal logic is to reduce memory operations to get your speed up.
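
As a starting point, a minimal sketch of both store variants, assuming 16-aligned buffers and a length that is a multiple of 64; the 512-byte prefetch distance is only a guess to tune:

    ; inner copy loop - esi/edi 16-aligned, ecx = byte count / 64
copy_line:
    prefetchnta [esi+512]       ; hint: fetch ahead with minimal pollution
    movdqa xmm0, [esi]
    movdqa xmm1, [esi+16]
    movdqa xmm2, [esi+32]
    movdqa xmm3, [esi+48]
    movdqa [edi], xmm0          ; normal write-back stores...
    movdqa [edi+16], xmm1
    movdqa [edi+32], xmm2
    movdqa [edi+48], xmm3
    ; ...or swap in movntdq for cache-bypassing stores,
    ; followed by one sfence after the loop
    add esi, 64
    add edi, 64
    dec ecx
    jnz copy_line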

MichaelW

Given the length of a time slice, and assuming that the L1 cache is per core, I can't see how an additional thread could be used to manage anything related to the L1 cache.

hutch--

Additional threads are purely to isolate the main worker thread(s) from performing the initial load into memory: have a separate thread perform that task while the main worker thread(s) do the high speed stuff. It's probably worth the effort to,
(a) restrict the number of threads to the number of cores.
(b) pin each thread to a core (see the sketch below).
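
A minimal sketch of (b) using the Win32 API, where LoaderProc and hLoader are hypothetical names for the feeding thread:

    invoke CreateThread, NULL, 0, ADDR LoaderProc, NULL, 0, NULL
    mov hLoader, eax            ; hLoader is a dword handle variable
    invoke SetThreadAffinityMask, hLoader, 2    ; mask bit 1 = core 1 only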

dedndave

Quote from: ASMManiac on February 18, 2012, 11:26:17 PM
Thanks, I also just found a program called CPU-Z which tells me everything I need to know about my cpu
http://www.cpuid.com/downloads/cpu-z/1.59-setup-en.exe

that will tell you what you have
but you may want your code to run on various machines, too - and expect it to run efficiently   :P

ASMManiac

Does anyone have code showing where the prefetching or non-temporal read/writes actually benefit?
I've tried using them a lot in the past and never saw any speed-up. A lot of times they slowed my code down by ~10%.

A simple memcpy routine would be good enough to demonstrate this.

chrisw

Quote from: ASMManiac on February 19, 2012, 05:40:30 PM
Does anyone have code showing where the prefetching or non-temporal read/writes actually benefit?
I've tried using them a lot in the past and never saw any speed-up. A lot of times they slowed my code down by ~10%.

A simple memcpy routine would be good enough to demonstrate this.

You won't see any benefit from prefetching your memory in the case of a simple memcpy, because the access pattern is regular and the hardware prefetchers will read the memory ahead anyway. But with irregular memory access you may get some positive effect. I got an improvement of about 10% in a 2D boxcar filter by using manual prefetch instructions in one of the loops.
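
As an untested illustration, a minimal sketch of that kind of irregular-access prefetch, applied to the column-wise walk from the original question (ROW_STRIDE and the prefetch distance are assumptions to tune):

ROW_STRIDE equ 4160             ; hypothetical row pitch, padded past 4096
                                ; so column walks do not share cache sets
    ; esi = top of one byte column, edi = contiguous destination
    mov ecx, 128                ; rows in one column
next_row:
    prefetcht0 [esi+4*ROW_STRIDE]   ; fetch 4 rows ahead by hand
    mov al, [esi]               ; the actual 1-byte load
    mov [edi], al
    add esi, ROW_STRIDE
    inc edi
    dec ecx
    jnz next_row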