Timings and cache

Started by jj2007, March 12, 2009, 08:46:38 PM


jj2007

Some timings for the strlen algos in the szLen thread:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
-- test 0, misaligned 0, 95 bytes
  Masm32 lib szLen   126 cycles
  crt strlen         101 cycles
strlen32s            33 cycles
strlen64LingoB       28 cycles
_strlen (Agner Fog)  30 cycles

-- test 1, misaligned 1, 95 bytes
  Masm32 lib szLen   126 cycles
  crt strlen         111 cycles
strlen32s            33 cycles
strlen64LingoB       29 cycles
_strlen (Agner Fog)  34 cycles



Same but with a scheme that eliminates the influence of the cache:

-- test 0, misaligned 0, 95 bytes
  Masm32 lib szLen   *** 672 ms for 7078488 loops
  crt strlen         *** 609 ms for 7078488 loops
strlen32s            *** 328 ms for 7078488 loops
strlen64LingoB       *** 343 ms for 7078488 loops
_strlen (Agner Fog)  *** 344 ms for 7078488 loops

-- test 1, misaligned 1, 95 bytes
  Masm32 lib szLen   *** 672 ms for 7078488 loops
  crt strlen         *** 594 ms for 7078488 loops
strlen32s            *** 328 ms for 7078488 loops
strlen64LingoB       *** 328 ms for 7078488 loops
_strlen (Agner Fog)  *** 344 ms for 7078488 loops
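For scale, a back-of-envelope conversion (my arithmetic, not part of the test output): at 1.60 GHz, 328 ms over 7,078,488 calls works out to 0.328 * 1.6e9 / 7078488 ≈ 74 cycles per call, against 33 cycles in the cache-warm run above; szLen goes from 126 to about 152 cycles. So the forced memory access adds a rough constant of 25..40 cycles per call on this machine.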


These tests are still rudimentary, but I wonder how to interpret them:
- Are the timings that assume the strings are in the cache more relevant, because string data tends to be in the cache anyway?
- Or should we measure as if we had just read in a huge file, so that the data are not in the cache?
::)
In case you want to test yourself, here is the switch:
CacheFree = 0
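Since the attachment is no longer available, here is a hypothetical reconstruction of what such a switch can do - a minimal C sketch (not the original MASM testbed): with CacheFree set, sweep a buffer larger than L2 between calls, so the test string is evicted before each measurement.

/* Hypothetical reconstruction of a "CacheFree" scheme, not the code from
   the attachment: sweeping a large buffer between calls evicts the test
   string, so every timed call starts with cold data. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EVICT_BYTES (8 * 1024 * 1024)  /* comfortably larger than L2 here */
#define CacheFree 1                    /* 0 = warm-cache timing as before */

static volatile unsigned char sink;    /* keeps the sweep from being optimized away */

static void evict(const unsigned char *buf)
{
    for (size_t i = 0; i < EVICT_BYTES; i += 64)  /* one touch per cache line */
        sink = buf[i];
}

int main(void)
{
    unsigned char *buf = malloc(EVICT_BYTES);
    char str[96];
    volatile size_t len = 0;

    if (!buf) return 1;
    memset(buf, 1, EVICT_BYTES);
    memset(str, 'x', 95);
    str[95] = '\0';

    for (int i = 0; i < 100000; i++) {
        if (CacheFree)
            evict(buf);        /* str is now out of L1/L2 */
        len = strlen(str);     /* the call under test */
    }
    printf("len = %u\n", (unsigned)len);
    free(buf);
    return 0;
}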


[attachment deleted by admin]

NightWare

It's more complex than that...

For the L1 cache, there is a part for the code (continuous read/fill), a part for addresses (an area with conditional changes), and a part for the data (similar to the address part, but filled exclusively from L2).

The L2 cache is the staging area before data enters the L1 cache (the place where a copy is put during the first read).

But here you alter the normal L1/L2 behaviour of the CPU (for data only)... you force a memory read (approx. 100 cycles), so of course the difference between the algos is considerably reduced (because of that 100-cycle overhead)... it can give you some indications, yes, but that's all...

jj2007

Quote from: NightWare on March 12, 2009, 11:46:25 PM
It's more complex than that...

you force a memory read (approx. 100 cycles)

Thanks.

You are right. But the difference is still considerable - over a factor of 2 between good ol' len and the two fast algos. By the way, what do your "100 cycles" refer to? A block read? EDIT: Would that mean 100 cycles per cache line, with the line size as returned by GetLogicalProcessorInformation in the LineSize member of CACHE_DESCRIPTOR?

Obviously, it can't be per single byte, because then the difference would vanish completely (the fast algos need 0.17 cycles per byte).
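That LineSize query, by the way, is short enough to sketch in C (standard Win32; GetLogicalProcessorInformation needs a reasonably recent Windows):

/* Minimal C sketch: query cache sizes and line size via
   GetLogicalProcessorInformation / CACHE_DESCRIPTOR. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info;

    GetLogicalProcessorInformation(NULL, &len);   /* first call: get buffer size */
    info = malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len))
        return 1;

    for (DWORD i = 0; i < len / sizeof *info; i++)
        if (info[i].Relationship == RelationCache)
            printf("L%u cache: %u KB, line size %u bytes\n",
                   (unsigned)info[i].Cache.Level,
                   (unsigned)(info[i].Cache.Size / 1024),
                   (unsigned)info[i].Cache.LineSize);
    free(info);
    return 0;
}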

And then there is the question of practical relevance: can we assume that we mostly take the length of strings that have just been written to the cache? That is the case MichaelW's timing macros can deal with - a million repetitions means everything is in the cache...
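The amortizing pattern behind such repeat-loop timings, sketched in C (the idea only, not MichaelW's actual MASM macros):

/* Repeat-loop timing: after the first iteration the string, the code and
   the branch history are all warm, so the average measures the in-cache
   case only. With optimizations on, check the disassembly - a compiler
   may be clever about strlen. */
#include <stdio.h>
#include <string.h>
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc */
#else
#include <x86intrin.h>
#endif

#define LOOP_COUNT 1000000

int main(void)
{
    char str[96];
    volatile size_t len = 0;
    unsigned long long t0;

    memset(str, 'x', 95);
    str[95] = '\0';

    t0 = __rdtsc();
    for (int i = 0; i < LOOP_COUNT; i++)
        len = strlen(str);                 /* iterations 2..N are all-cache */
    printf("len = %u, ~%llu cycles per call (warm-cache figure)\n",
           (unsigned)len, (__rdtsc() - t0) / LOOP_COUNT);
    return 0;
}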

NightWare

Quote from: jj2007 on March 13, 2009, 04:40:57 AM
what do your "100 cycles" refer to? A block read?
Yes, it's the price of the first read, from memory; the second (and possibly third) read takes only ~10 cycles, from the L2 cache, and after that just 1 cycle, from the L1 cache... Concerning the results obtained, you must keep in mind that those algos (lingo, agner and jj) read the (aligned) memory with SIMD instructions; for the other algos that's certainly not the case  :P

Quote from: jj2007 on March 13, 2009, 04:40:57 AM
And then there is the question of practical relevance: can we assume that we mostly take the length of strings that have just been written to the cache? That is the case MichaelW's timing macros can deal with - a million repetitions means everything is in the cache...
Yes, everything is quickly in the cache... MichaelW's macros are excellent, but they can't give you the real cost of a single call... the repeat loop absorbs the first/second read of the data, and also the branch mispredictions on addresses, and the cost of reading the code itself...  :wink It's something you have to keep in mind when you code your algos... speed results are relative...

jj2007

Quote from: NightWare on March 13, 2009, 10:28:49 PM
Concerning the results obtained, you must keep in mind that those algos (lingo, agner and jj)
and NightWare

I implemented your algo yesterday; it's equivalent to the other two fast algos. And of course, there are only minor differences left - there are 28 matches for pcmpeqb in the code :bg
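For readers without the attachments, the shared core trick looks roughly like this with SSE2 intrinsics - a simplified sketch, not lingo's, Agner's or NightWare's actual code (the real routines unroll, and handle a misaligned head before entering the aligned loop):

/* Compare 16 bytes at a time against zero and scan the match mask.
   Assumes s is 16-byte aligned; aligned 16-byte loads never cross a
   page boundary, which is why the real algos align first. */
#include <stdio.h>
#include <string.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#ifdef _MSC_VER
#include <intrin.h>      /* _BitScanForward */
#endif

static size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);        /* movdqa */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero)); /* pcmpeqb + pmovmskb */
        if (mask) {                    /* a zero byte is in this 16-byte block */
#ifdef _MSC_VER
            unsigned long idx;
            _BitScanForward(&idx, (unsigned long)mask);            /* bsf */
#else
            unsigned idx = (unsigned)__builtin_ctz((unsigned)mask);
#endif
            return (size_t)(p - s) + idx;
        }
        p += 16;
    }
}

int main(void)
{
    static union { __m128i v[6]; char c[96]; } buf;  /* 16-byte aligned */
    strcpy(buf.c, "Hello, pcmpeqb!");
    printf("%u\n", (unsigned)strlen_sse2(buf.c));    /* prints 15 */
    return 0;
}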

BeeOnRope

I think there are usually three domains that are interesting here:

1: Large data sets
When the data to be operated on doesn't fit in L1 (or L2, or L3) at all. In this case it doesn't really matter whether the cache is hot or not, because you are going to be reading all or nearly all of the data from memory anyway. For these kinds of input sets, prefetching (hardware or software) is key to keeping the CPU saturated with data. Performance (per byte) will match that of the shorter input sets only if the memory bandwidth of your machine is adequate to stream data at the required rate (a pretty easy calculation - see the worked example after this list), and if the behavior of your code is such that prefetching correctly utilizes the available bandwidth.

2: Small data sets, not in cache

Data sets that could fit in the cache, but are not in the cache.  For example, taking the length of many small strings which are not in the cache.

3: Small data sets, in the cache

As above, but the data is assumed to be in the cache.
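To put numbers on the scenario-1 bandwidth check (my arithmetic, using figures from earlier in the thread): at 0.17 cycles per byte on the 1.60 GHz Celeron M above, the fast algos would consume data at about 1.6e9 / 0.17 ≈ 9.4 GB/s - more than the memory of a machine of that class can stream, so on large data sets they become memory-bound and the per-byte advantage shrinks accordingly.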

Unfortunately, there isn't necessarily a global answer as to which is the most useful - only the end user of your call (the application developer) will generally know the behavior of the app, and thus which scenario applies. It's pretty easy to catch scenario 1 in your timings, simply by making sure some input data size exceeds the largest cache level in your system. Distinguishing between 2 and 3 (in particular, benchmarking 2) is perhaps more difficult.
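One way to get at scenario 2 - a minimal C sketch, assuming clflush (SSE2) is available; the string's cache lines are flushed before every timed call, so each call starts cold. Exact overheads are CPU-dependent:

/* Benchmark a small string that is NOT in the cache: evict its lines
   with clflush before each call. The per-call figure includes rdtsc
   overhead; treat it as a relative number. */
#include <stdio.h>
#include <string.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

#define RUNS 10000

int main(void)
{
    static char str[96];
    unsigned long long total = 0;
    volatile size_t len = 0;

    memset(str, 'x', 95);
    str[95] = '\0';

    for (int i = 0; i < RUNS; i++) {
        for (int j = 0; j < 96; j += 64)    /* evict every line of str */
            _mm_clflush(str + j);
        _mm_clflush(str + 95);              /* last byte may be on a third line */
        _mm_mfence();
        unsigned long long t0 = __rdtsc();
        len = strlen(str);                  /* the call under test, cold */
        total += __rdtsc() - t0;
    }
    printf("len = %u, ~%llu cycles per cold call\n",
           (unsigned)len, total / RUNS);
    return 0;
}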

tetsu-jp

Small data set, large data set - this is a useful classification.

But in my opinion it also matters whether the access is linear or random.

You know, older CPUs did not have a large cache, so there was a more direct relationship.

The REP prefix helped a lot, because there was a (tiny) prefetch queue (I think 6 bytes), so it was not required to reload the instruction code!

I see there are various threads related to this issue, and it would be nice if someone could put everything on a webpage:

- results
- CPU tables
- thread links
- software downloads
- screenshots, eventually
- explanation pages

I can help with HTML pages (I can create them quickly, and I am used to it), can host software on SkyDrive, and pictures on Photobucket.

Also, anyone - free POPFLY webpages!

The background is that such benchmarks fascinated me (about 12 years ago).

So I am willing to help out, eventually even to create a fresh benchmark program myself.

But I am a newcomer, just looking at these threads; you have been around for years, maybe...

Let me know if you want me to put all the stuff on an HTML page. I have HTMLKit open all the time, currently debugging my own pages (cleaning up styles), so it's a matter of minutes.

BeeOnRope

Yes, I agree - random access will be very different from linear access, especially due to prefetching and TLB/page-table effects. Sometimes (usually?) the kind of access is fixed by the problem (e.g., in the strlen discussion we are implicitly talking about linear access).
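For illustration, a minimal C sketch of the difference (buffer size and the LCG constants are arbitrary choices of mine; on most machines the random walk is several times slower per element):

/* Linear vs. random traversal of the same out-of-cache buffer: the
   random order defeats the hardware prefetcher and churns the TLB. */
#include <stdio.h>
#include <stdlib.h>
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

#define N (16u * 1024 * 1024)   /* 64 MB of ints: far larger than any cache */

int main(void)
{
    int *data = malloc((size_t)N * sizeof *data);
    unsigned *order = malloc((size_t)N * sizeof *order);
    unsigned long long seed = 1, t0;
    volatile long long sum = 0;

    if (!data || !order) return 1;
    for (unsigned i = 0; i < N; i++) { data[i] = 1; order[i] = i; }

    /* Fisher-Yates shuffle with a 64-bit LCG (rand() is too small here) */
    for (unsigned i = N - 1; i > 0; i--) {
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
        unsigned j = (unsigned)((seed >> 33) % (i + 1));
        unsigned t = order[i]; order[i] = order[j]; order[j] = t;
    }

    t0 = __rdtsc();
    for (unsigned i = 0; i < N; i++) sum += data[i];          /* linear */
    unsigned long long linear = __rdtsc() - t0;

    t0 = __rdtsc();
    for (unsigned i = 0; i < N; i++) sum += data[order[i]];   /* random */
    unsigned long long random_walk = __rdtsc() - t0;

    printf("linear: %llu cycles, random: %llu cycles (sum %lld)\n",
           linear, random_walk, (long long)sum);
    free(order); free(data);
    return 0;
}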

hutch--

You can make it even messier: have a large enough data set that the data does not fit into the cache, then keep reading data that is not in the cache, and you will see the timings collapse. The problem is memory page thrashing: if you keep reading data that is far enough apart, you reload a different memory page each time.
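A minimal C sketch of that effect (sizes are arbitrary choices; one byte is read per 4 KB page, so nearly every access misses the TLB and lands on a different page):

/* Page thrashing in miniature: every access below touches a different
   4 KB page, churning the TLB and the memory controller's open pages. */
#include <stdio.h>
#include <stdlib.h>

#define PAGE  4096u
#define PAGES 65536u            /* 256 MB span, one byte read per page */

int main(void)
{
    unsigned char *buf = calloc(PAGES, PAGE);
    volatile unsigned sum = 0;

    if (!buf) return 1;          /* the allocation may fail on small machines */
    for (int pass = 0; pass < 8; pass++)
        for (unsigned i = 0; i < PAGES; i++)
            sum += buf[(size_t)i * PAGE];   /* new page on every read */
    printf("%u\n", sum);
    free(buf);
    return 0;
}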

Mark Jones

For AMD CPUs, check out AMD CodeAnalyst. Nothing else comes close to giving this amount of detail about code. (Except maybe Intel's VTune profiler.)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Mark Jones on April 10, 2009, 05:38:44 PM
For AMD CPUs, check out AMD CodeAnalyst. Nothing else comes close to giving this amount of detail about code. (Except maybe Intel's VTune profiler.)

Since I have Intel on board, I just checked VTune: 341 MB - no thanks, I'd rather do it by hand!

KeepingRealBusy

Mark,

Quote
For AMD CPUs, check out AMD CodeAnalyst.

Thanks for the info. I will see what this will do for me.

Dave.

Rockoon

A bit of a gravedig, but you shouldn't let the size of these tools dissuade you from getting and then using them.

At least as far as AMD's CodeAnalyst goes, it gives information you simply cannot get any other reasonable way. No chance.

It can even simulate another CPU entirely. For example, you have an AMD64 X2 'Toledo', one of their 90 nm chips, but want to know how things will go on an AMD64 X2 'Brisbane', one of their 65 nm chips.

This article by AMD is a good primer on what CodeAnalyst can do for you:

http://developer.amd.com/documentation/articles/Pages/111820052.aspx