Code location sensitivity of timings

jj2007 · June 11, 2009, 10:19:19 PM

Thanks, Mark. Pretty balanced over the MemCoC3 MemCoP4 MemCoC2 columns.
But I am surprised about the cycle counts of ramguru's i7. The Nehalem Wiki is full of overclocking talk, but they seem to bark at the wrong tree: That CPU runs the Masm32 library MemCopy in 240 cycles, instead of the ca. 1700 of Celeron and the over 4000 of the P4, for misaligned data ::)

BlackVortex · June 12, 2009, 12:13:35 AM

Code Select

Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

Algo         _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
               strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
---------------------------------------------------------------------------
2048, d0s0-0     1647      360      210      210      209      165      367
2048, d1s1-0     1649      396      259      266      258      219     1876
2048, d7s7-0     1631      402      265      261      261      218     1876
2048, d7s8-1     2159     1332      861      493      658      670     1439
2048, d7s9-2     2188     1338      862      493      658      666     1906
2048, d8s7+1     2151     1328      855      829      701      700     1393
2048, d8s8-0     1639      402      267      262      268      236      365
2048, d8s9-1     2205     1333      854      493      658      667     1290
2048, d9s7+2     2151     1329      849      828      700      701     1877
2048, d9s8+1     2143     1330      849      830      701      701     1471
2048, d9s9-0     1642      403      266      262      268      235     1884
2048, d15s15     1642      404      270      264      265      235     1893

dedndave · June 12, 2009, 01:02:16 AM

those core duo's are pretty nice, too :U

jj2007 · June 13, 2009, 10:09:19 PM

I collected some evidence that could solve the "location sensitivity" mystery - which seems much more a cacheline split penalty issue. Only Intel CPUs are affected, by the way. It also explains why ramguru's i7 is so blazingly fast...
The attachment contains timings at the bottom. It is RTF, written with RichMasm but WordPad opens it, too.

Cygnus:
Pentium has 3 times penalty for loading across QWORD boundaries.
Pentium has 3 times penalty for loading across cache line boundaries.
Pentium has 3 times penalty for loading across page boundaries.
Pentium has 6 times penalty for saving across QWORD boundaries.
Pentium has 6 times penalty for saving across cache line boundaries.
Pentium has 16 times penalty for saving across page boundaries.

Pentium III has 0.7 times!!! 'penalty' for loading across QWORD boundaries.
Pentium III has 5.5 times penalty for loading across cache line boundaries.
Pentium III has 47 times penalty for loading across page boundaries.
Pentium III has 0.7 times!!! 'penalty' for saving across QWORD boundaries.
Pentium III has 4.5 times penalty for saving across cache line boundaries.
Pentium III has 64 times penalty for saving across page boundaries.

Pentium IV has no penalty for loading across QWORD boundaries.
Pentium IV has 5.5 times penalty for loading across cache line boundaries.
Pentium IV has 20 times penalty for loading across page boundaries.
Pentium IV has no penalty for saving across QWORD boundaries.
Pentium IV has 22 times penalty for saving across cache line boundaries.
Pentium IV has 23 times penalty for saving across page boundaries.

Doom9:
The best case would be to compare only two aligned blocks, but we must shift either the source
or the destination block freely, so one of them should be unaligned in most cases. The trouble is,
that only MMX tolerates unaligned access most SSE instructions don't. To make matters worse,
any newer intel chip hat a noticeable penalty for a memory access across physical cache line borders
and an even worse penalty if that also happens to coincide with a memory page border. So there are
unfortunately only two possible options right now: use simple MMX and ignore the penalty (which is
essentially using the old isse functions) or use a work around like in x256sad and align the memory
blocks (this could be done once for each frame if all data is sized in accordingly (padding and the like)).
Compensating on both sides source and destination is only possible for MMX but the overhead should
be far greater than the gain (ok it would be possible but it is certainly slower).

h264
Core2 (Conroe) can load unaligned data just as quickly as aligned data...
unless the unaligned data spans the border between 2 cachelines, in which
case it's really slow. The exact numbers may differ, but all Intel cpus
have a large penalty for cacheline splits.
(8-byte alignment exactly half way between two cachelines is ok though.)
LDDQU was supposed to fix this, but it only works on Pentium 4.
So in the split case we load aligned data and explicitly perform the
alignment between registers. Like on archs that have only aligned loads,
except complicated by the fact that PALIGNR takes only an immediate, not
a variable alignment.

Phaeron - 12 03 08 - 00:49
On current Intel CPUs, an L1 cache line is fetched and pushed through a shifter, which accesses the
desired words. If the access crosses L1 cache lines, a data cache unit split event occurs (with
associated penalty), bytes are shifted out from the adjacent L1 cache line, and the results are
combined to produce the misaligned words.

x264
movaps/movups are no longer equivalent to their integer equivalents on the Nehalem, so that
substitution is removed. Nehalem has a much lower cacheline split penalty than previous Intel CPUs,
so cacheline workarounds are no longer necessary.
...
Rev. 696: avoid memory loads that span the border between two cachelines.
on core2 this makes x264_pixel_sad an average of 2x faster

Nehalem optimizations: the powerful new Core i7
The cacheline split problem is basically gone: the penalty is now a mere 2 clocks instead of 12 for a
cacheline-split load. This, combined with the SSE speed improvements, ... took 150 cycles on Penryn
without cacheline split optimization, 111 cycles with, and takes 62 cycles on the Nehalem
...
Intel has finally come through on their promise to make float-ops-on-SSE-registers-containing-integers
have a speed penalty. So, we removed a few %defines throughout the code that converted integer ops
into equivalent, but shorter, floating point instructions. Unfortunately, there seems to be no way to
completely avoid floatops on integer registers, as many of these operations have no integer equivalents.
A classic example is "movhps", which takes an 8-byte value from memory and puts it into the high section
of a 16-byte SSE register. In integer ops, one can only directly move into the low 8-byte section of the register.
Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, ...

[attachment deleted by admin]

ramguru · June 13, 2009, 11:05:25 PM

ha-ha file in last attachment has .asc extension .. took me some time to figure out that it should be .rtf :lol

NightWare · June 14, 2009, 12:47:56 AM

Quote from: jj2007 on June 13, 2009, 10:09:19 PM
I collected some evidence that could solve the "location sensitivity" mystery

really ?

>any newer intel chip hat a noticeable penalty for a memory access across physical cache line borders
yep, but it's well known and since a long time, search after "memory aliasing" in docs.

>all Intel cpus have a large penalty for cacheline splits.
same as before...

>If the access crosses L1 cache lines, a data cache unit split event occurs (with
associated penalty), bytes are shifted out from the adjacent L1 cache line, and the results are
combined to produce the misaligned words.
:bdg obviously those guys confuse cache lines and core memory cluster.

>movaps/movups are no longer equivalent to their integer equivalents on the Nehalem
? in virtue of what ? bits are bits !

>float-ops-on-SSE-registers-containing-integers have a speed penalty
same as before, data movments are the same ! and using others float operations has an interest ?

>A classic example is "movhps", which takes an 8-byte value from memory and puts it into the high section
of a 16-byte SSE register. In integer ops, one can only directly move into the low 8-byte section of the register.
Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, ...
pffft... the problem is the cost/manipulations to access to the high part of the register, it the same problem between unpcklps and unpckhps, unpcklbw and unpckhbw, punpckldq and punpckhdq, etc...

MichaelW · June 14, 2009, 01:29:30 AM

Perhaps a (partial) solution would be to use a larger alignment, specifically an alignment that matches the processor's cache line size or some integer multiple of it. And thinking further, you should be able to control the instruction sizes or add padding so that no instruction would split a cache line boundary.

dedndave · June 14, 2009, 01:44:24 AM

that sounds like a lot of work
perhaps for very intensive speed critical areas
but, i think at some point you have to play the law of averages
that is, in part, what intel uses as a guide-line for design (hopefully)

jj2007 · June 14, 2009, 07:23:04 AM

They are not talking about the instruction cache. The data are the problem.

dedndave · June 14, 2009, 11:17:02 AM

yes - i understand that
if you open a file buffer and read 64 kb of a file (or more), you are going to play the law of averages
otherwise, you are going to spend more cpu time thrashing things about than you save
if you are talking about declared data, most of that is used a few times during execution
and that of it that isn't may be adjusted
like i say - there is a point where it won't pay to optimize - let the law of averages take over
on certain portions of data, you will get lucky
on other portions, you will not
intel has designed the system so that you are lucky much more often than not
(of course, that's just a theory - lol)

Rockoon · July 27, 2009, 10:49:23 PM

Some thoughts (in regards to the original threadline of both variable timings, and consistent alignment issues)

---

On the issue of those spurious higher clock timings, even when you wait for the start of a timeslice.

Not all interrupts are timer-based. Your ethernet adapter may trigger an IRQ. Your video card may trigger an IRQ. Your disk drives may trigger an IRQ. Your USB ports may trigger an IRQ. Your mouse can trigger an IRQ.

All IRQ's are more important than your thread. When an IRQ happens, your thread is stopped abruptly so that the interrupt handler (drives, etc) may service it.

---

On the issue of caches.

There are plenty of important concepts that need to be considered. Perhaps the most important on Pentiums, but also the most frustrating, is Intel's "trace cache." Here you are dealing with some alignment issues that are completely hidden from you. Instructions within the trace cache have their own alignment which is independent of your binary. The stuff in the trace cache does not have a 1:1 correspondence with the x86 instruction set you are using. Its all in microcode, a language below x86.

Moving along.. physical memory is organized into cache lines, and each line can only be stored in specific cache sets. In a "4-way set associative" cache, a specific cache line only has 4 unique slots within the cache that it can be stored. The worst-case scenario is when you are constantly accessing 5 or more spots of memory that are all assigned the same set. 5 lines cannot fit in the 4 slots.

---

On the issue of code alignment in regards to cache.

Obsessive aligning can be a detriment. Putting code in the cache that will never get executed definately has its downsides. Those dummy instructions were moved along the bus from system ram into L3, then again into L2, and again into L1, only to have a 0% chance of being needed.

---

On the issue of code alignment in regards to branch prediction.

This is the second biggest issue, and also the second most frustrating. This is true for both AMD's and Intels, with the least effect (that I know of) on Core2's .. I am still unfamiliar with I7's

The forward branch prediction logic uses its own little bit of memory that can be considerd a cache. Here is the deal. Every memory address is assigned an index within this branch cache. When a conditional branch is taken, this little branch cache is updated (remembering what happened... YES it was, or NO it wasn't)

Lots of memory addresses share the same index, and this creates contention. Several different conditional branch instructions within your code can interfere with each other in regards to branch prediction. The Core2 luckily remembers the last several branches for a given index, and is able to handle simple patterns pretty well (YES / NO / YES / NO is terrible on AMD64's (always gets it wrong), but not a problem for the Core2)

---

Deal only with real world scenerios or it doesnt count.

jj2007 · July 27, 2009, 11:00:34 PM

Quote from: Rockoon on July 27, 2009, 10:49:23 PM
In a "4-way set associative" cache, a specific cache line only has 4 unique slots within the cache that it can be stored. The worst-case scenario is when you are constantly accessing 5 or more spots of memory that are all assigned the same set. 5 lines cannot fit in the 4 slots.

Stuff for thought. That could be the culprit indeed.

Good to see you back, Rockoon :thumbu

Rockoon · July 29, 2009, 11:55:52 PM

Quote from: jj2007 on July 27, 2009, 11:00:34 PM
Quote from: Rockoon on July 27, 2009, 10:49:23 PM
5 lines cannot fit in the 4 slots.
Stuff for thought. That could be the culprit indeed.

Good to see you back, Rockoon :thumbu

Thing brings to mind a simple test program that can be written. Allocate a nice big buffer of memory and then test various steppings through it. You could even write something like this in visual basic (as I have done previously) and get meaningfull results, since cache misses are extremely exspensive and thus easily measured.

News:

Code location sensitivity of timings

ramguru