I collected some evidence that could solve the "location sensitivity" mystery, which seems to be much more of a
cacheline split penalty issue. Only Intel CPUs are affected, by the way. It also explains why ramguru's i7 is so blazingly fast...
The attachment contains timings at the bottom. It is RTF, written with
RichMasm, but WordPad opens it, too.
Cygnus:
Pentium has a 3x penalty for loading across QWORD boundaries.
Pentium has a 3x penalty for loading across cache line boundaries.
Pentium has a 3x penalty for loading across page boundaries.
Pentium has a 6x penalty for saving across QWORD boundaries.
Pentium has a 6x penalty for saving across cache line boundaries.
Pentium has a 16x penalty for saving across page boundaries.
Pentium III has a 0.7x (!!!) ‘penalty’ for loading across QWORD boundaries.
Pentium III has a 5.5x penalty for loading across cache line boundaries.
Pentium III has a 47x penalty for loading across page boundaries.
Pentium III has a 0.7x (!!!) ‘penalty’ for saving across QWORD boundaries.
Pentium III has a 4.5x penalty for saving across cache line boundaries.
Pentium III has a 64x penalty for saving across page boundaries.
Pentium IV has no penalty for loading across QWORD boundaries.
Pentium IV has a 5.5x penalty for loading across cache line boundaries.
Pentium IV has a 20x penalty for loading across page boundaries.
Pentium IV has no penalty for saving across QWORD boundaries.
Pentium IV has a 22x penalty for saving across cache line boundaries.
Pentium IV has a 23x penalty for saving across page boundaries.
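For reference, numbers like these can be reproduced with a small C program (GCC/Clang on x86, using __rdtsc): place the same 8-byte load just before a QWORD, cacheline, or page boundary and time the loop at each spot. This is only a minimal sketch; the offsets, iteration count, and names are illustrative, not Cygnus's actual harness, and the misaligned pointer casts are intentional.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc */

static uint64_t time_loads(const volatile uint64_t *p, int iters)
{
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++)
        (void)*p;                        /* 8-byte load, possibly split */
    return __rdtsc() - t0;
}

int main(void)
{
    uint8_t *buf = aligned_alloc(4096, 2 * 4096);  /* page-aligned buffer */
    int iters = 1 << 20;
    if (!buf) return 1;

    /* same load, four placements: aligned, QWORD split, line split, page split */
    uint64_t a = time_loads((const volatile uint64_t *)(buf + 64),       iters);
    uint64_t q = time_loads((const volatile uint64_t *)(buf + 64 + 4),   iters);
    uint64_t l = time_loads((const volatile uint64_t *)(buf + 128 - 4),  iters);
    uint64_t g = time_loads((const volatile uint64_t *)(buf + 4096 - 4), iters);

    printf("aligned %llu  qword-split %llu  line-split %llu  page-split %llu\n",
           (unsigned long long)a, (unsigned long long)q,
           (unsigned long long)l, (unsigned long long)g);
    free(buf);
    return 0;
}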
Doom9:
The best case would be to compare only two aligned blocks, but we must be able to shift either the source
or the destination block freely, so one of them will be unaligned in most cases. The trouble is
that only MMX tolerates unaligned access; most SSE instructions don't. To make matters worse,
any newer Intel chip has a noticeable penalty for a memory access across physical cache line borders, and an even worse penalty if that also happens to coincide with a memory page border. So there are
unfortunately only two possible options right now: use simple MMX and ignore the penalty (which is
essentially what the old iSSE functions do), or use a workaround like in x256sad and align the memory
blocks (this could be done once for each frame if all data is sized accordingly, with padding and the like).
Compensating on both sides, source and destination, is only possible with MMX, but the overhead should
be far greater than the gain (OK, it would be possible, but it is certainly slower).
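The second option, in C, amounts to something like the sketch below: pad the stride up to a multiple of 16 and allocate on a 16-byte boundary once per frame, so every row starts SSE-aligned and movdqa/movaps can be used. The Frame struct and frame_alloc name are hypothetical, just to show the layout.

#include <stdlib.h>
#include <stdint.h>

typedef struct {
    uint8_t *data;
    int      width, height, stride;   /* stride >= width, multiple of 16 */
} Frame;

static int frame_alloc(Frame *f, int width, int height)
{
    f->width  = width;
    f->height = height;
    f->stride = (width + 15) & ~15;   /* round the stride up to 16 bytes */
    /* stride is a multiple of 16, so the total size is too (C11 rule) */
    f->data   = aligned_alloc(16, (size_t)f->stride * height);
    return f->data ? 0 : -1;
}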
h264:
Core2 (Conroe) can load unaligned data just as quickly as aligned data...
unless the unaligned data spans the border between 2 cachelines, in which
case it's really slow. The exact numbers may differ, but all Intel CPUs
have a large penalty for cacheline splits.
(8-byte alignment exactly halfway between two cachelines is OK, though.)
LDDQU was supposed to fix this, but it only works on Pentium 4.
So in the split case we load aligned data and explicitly perform the
alignment between registers. Like on archs that have only aligned loads,
except complicated by the fact that PALIGNR takes only an immediate, not
a variable alignment.
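With SSSE3 intrinsics the trick looks roughly like this (a sketch, not x264's actual asm; real code only takes the slow path when the load truly straddles a cacheline, whereas this version treats every 16-byte-unaligned address the same way). Because PALIGNR's byte count must be an immediate, each possible misalignment needs its own instruction, hence the switch:

#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 */

static __m128i load_split(const uint8_t *p)
{
    uintptr_t misalign = (uintptr_t)p & 15;
    const __m128i *base;
    __m128i lo, hi;

    if (misalign == 0)                            /* no split: one aligned load */
        return _mm_load_si128((const __m128i *)p);

    base = (const __m128i *)((uintptr_t)p & ~(uintptr_t)15);
    lo = _mm_load_si128(base);                    /* aligned load, lower chunk  */
    hi = _mm_load_si128(base + 1);                /* aligned load, upper chunk  */

    switch (misalign) {                           /* one immediate per case */
    case  1: return _mm_alignr_epi8(hi, lo,  1);
    case  2: return _mm_alignr_epi8(hi, lo,  2);
    case  3: return _mm_alignr_epi8(hi, lo,  3);
    case  4: return _mm_alignr_epi8(hi, lo,  4);
    case  5: return _mm_alignr_epi8(hi, lo,  5);
    case  6: return _mm_alignr_epi8(hi, lo,  6);
    case  7: return _mm_alignr_epi8(hi, lo,  7);
    case  8: return _mm_alignr_epi8(hi, lo,  8);
    case  9: return _mm_alignr_epi8(hi, lo,  9);
    case 10: return _mm_alignr_epi8(hi, lo, 10);
    case 11: return _mm_alignr_epi8(hi, lo, 11);
    case 12: return _mm_alignr_epi8(hi, lo, 12);
    case 13: return _mm_alignr_epi8(hi, lo, 13);
    case 14: return _mm_alignr_epi8(hi, lo, 14);
    default: return _mm_alignr_epi8(hi, lo, 15);
    }
}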
Phaeron - 12 03 08 - 00:49:
On current Intel CPUs, an L1 cache line is fetched and pushed through a shifter, which accesses the
desired words. If the access crosses L1 cache lines, a data cache unit split event occurs (with an
associated penalty), bytes are shifted out from the adjacent L1 cache line, and the results are
combined to produce the misaligned words.
x264:
movaps/movups are no longer equivalent to their integer counterparts on Nehalem, so that
substitution is removed. Nehalem has a much lower cacheline split penalty than previous Intel CPUs,
so cacheline workarounds are no longer necessary.
...
Rev. 696: avoid memory loads that span the border between two cachelines.
On Core2 this makes x264_pixel_sad an average of 2x faster.
Nehalem optimizations: the powerful new Core i7
The cacheline split problem is basically gone: the penalty is now a mere 2 clocks instead of 12 for a
cacheline-split load. This, combined with the SSE speed improvements, ... took 150 cycles on Penryn
without cacheline split optimization, 111 cycles with, and takes 62 cycles on the Nehalem
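In other words, on Nehalem the removed workaround collapses back to the plain unaligned form; a minimal sketch (the function name is illustrative):

#include <stdint.h>
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128 */

static __m128i load_any(const uint8_t *p)
{
    /* MOVDQU is correct for any alignment; on Nehalem it stays cheap even
       when the 16 bytes straddle two cachelines (~2 clocks vs. 12 before). */
    return _mm_loadu_si128((const __m128i *)p);
}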
...
Intel has finally come through on their promise to make
float-ops-on-SSE-registers-containing-integers
have a speed penalty. So we removed a few %defines throughout the code that converted integer ops
into equivalent, but shorter, floating-point instructions. Unfortunately, there seems to be no way to
completely avoid float ops on integer registers, as many of these operations have no integer equivalents.
A classic example is “movhps”, which takes an 8-byte value from memory and puts it into the high section
of a 16-byte SSE register. With integer ops, one can only directly move into the low 8-byte section of the register.
Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, ...
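The movhps case looks like this with intrinsics (a sketch; the function name and pointers are illustrative): the integer ISA only offers a direct load into the low half (MOVQ, i.e. _mm_loadl_epi64), so the float-domain MOVHPS has to be borrowed via casts, and on Nehalem that domain crossing is exactly where the new penalty bites.

#include <stdint.h>
#include <emmintrin.h>   /* SSE2; pulls in SSE's _mm_loadh_pi too */

static __m128i load_low_high(const uint8_t *lo8, const uint8_t *hi8)
{
    __m128i v = _mm_loadl_epi64((const __m128i *)lo8);  /* MOVQ: low 8 bytes */
    __m128  f = _mm_castsi128_ps(v);                    /* reinterpret, no-op */
    f = _mm_loadh_pi(f, (const __m64 *)hi8);            /* MOVHPS: high 8 bytes */
    return _mm_castps_si128(f);
}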
[attachment deleted by admin]