Code location sensitivity of timings

Started by jj2007, May 18, 2009, 12:30:16 PM


NightWare

Quote from: hutch-- on May 20, 2009, 03:47:15 AM
I commented on something I have found over time that is not mentioned in any of the optimisation documentation I have seen over the last 20 years: a NON-EXECUTED instruction sequence, inserted by the assembler for code alignment purposes, appears to affect the timing of the executed code that follows it.

you consider a cpu like a robot always doing the same thing (where alignment would fit well), but that's NOT the case. the cpu behaves differently depending on the instructions that were executed previously.

when YOU enter an instruction, you "control" the alignment for the cpu. BUT you must remember that your app/code IS NOT the first and only app executed. when you launch an app (your code), other instructions have been executed before it and have filled the code/trace cache, and those instructions will not be flushed just because you want them to or think they should be (you must use your own instructions often enough, over a period, to replace them; that is the principle of the cache).
then, to replace the instructions in the cache, they are refilled from MEMORY (and HERE the alignment is useful).

now we come to HOW the cpu decodes the instructions from the cache, and remember the cache is only a few kb (so no space to lose). the instructions (or parts of algos) follow other instructions (or parts of algos), to avoid "pollution".
but of course, to achieve that, the cpu uses internal locations to decode the instructions (we are NOT in memory-alignment logic anymore), but alignment INSIDE the code/trace cache, and that is something you CAN'T modify; it is decided by the cpu, NOT YOU.
and yes, since we are in ANOTHER alignment, there are speed differences directly linked to this (and the number of uops your cpu can deal with becomes essential).

jj, stop moving instructions up and down to obtain a speed-up; you should only do that to avoid stalls, nothing else. otherwise don't be surprised if the cache gives you a BIG slowdown when your aligned code suddenly becomes totally unaligned.

MichaelW

Quote from: jj2007 on May 19, 2009, 02:56:12 PM
But that still doesn't answer the question why the OS behaves like that more often if the same identical and identically aligned code sits a few bytes further down...

I'm having a problem understanding the "identically aligned code sits a few bytes further down" part. If you are using align 16, then any code that sits "a few bytes further down" is not identically aligned. If it is identically aligned, then it must be sitting some multiple of 16 bytes further down.
eschew obfuscation

dedndave

that seems to be the nature of what JJ is describing, Michael
if he uses align 16 - one result
move it 16 bytes down - different result
it seems to happen on some processors - not others
it is sometimes difficult to reproduce

hutch--

This is exactly why I posted the following example: you CAN CHANGE the effect of large slowdowns by avoiding potentially long (7-byte) instructions just prior to the code that is run. While NOPs did not work, a RETN works OK in some circumstances, and padding with INT 3 also seems to work in some instances.

I do not routinely use 16-byte alignment as it has tested slow on many occasions, whereas the ALIGN 4 directive rarely ever causes a problem.


    align 16

    mov esp, [esp+edx*8+128]    ; non executed instruction 7 bytes

    REPEAT 64
      int 3
    ENDM

    ; retn

    align 4


Without having a definitive answer, I suggest it has something to do with the long instruction prior to the aligned code causing problems in the code cache, but at least there is a way to change it with a bit of experimentation. What will vary from processor to processor is the block read size for the code cache, so fully avoiding the problem would mean object-module-level alignment matched to each processor's level 1 cache read size, which ends up impractical because of final code size.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

it would be nice to know which processors do what
for most code, it can be overlooked, perhaps
but if there is a real time-crunching routine, it would be possible to load the algo dynamically at runtime, according to the host cpu
i suppose you could test the alignment at run-time as well, rather than IDing the cpu
clearly an advanced technique - we better leave it to the experts rofl

hutch--

Dave,

What you have mentioned is reasonably simple to do: allocate executable memory large enough to hold the code plus the maximum alignment required, then copy the algo to the alignment you want at runtime. It's an old technique for bypassing API wrappers to reach lower-level functions in the API function set.

jj2007

Quote from: MichaelW on May 21, 2009, 01:21:09 AM

I'm having a problem understanding the "identically aligned code sits a few bytes further down" part. If you are using align 16, then any code that sits "a few bytes further down" is not identically aligned. If it is identically aligned, then it must be sitting some multiple of 16 bytes further down.


This is indeed what I meant - sorry for not being crystal clear. Simplified:
    if MakeSlow
      REPEAT 16
        nop
      ENDM
    endif
    align 16
MyTest proc
    ret
MyTest endp


And (@NightWare) it has nothing to do with shifting lines up or down to speed up the code. It is the identical algo that produces different results. Note that you can eliminate that effect by using the sort & eliminate outliers technique described further up. In real apps, however, the effect might be present, depending on things you cannot easily influence, i.e. where that algo happens to be inserted.

Now one thing I have not yet tested is whether the effect is produced by the code immediately before the algo, i.e. the code that, according to our most recent theory  :wink, might influence the code cache (branch prediction, whatever) behaviour...

This avenue looks promising. If Hutch finds that a retn immediately before the start of the algo speeds it up, that makes sense: even a not-so-intelligent CPU should know that there is nothing to optimise beyond a retn...

Other candidates for non-optimisable instructions are int3, cpuid, rdtsc...

dedndave

yes - cpuid seems like a good choice, as it forces serialization of instructions

jj2007

Good news and bad news.
First the good: I have been able to nail the speed difference down to one byte:

Quote
   if MakeSlow
      print chr$(10, "Code sizes:", 13, 10)
;      341 cycles for addx$
;      584 cycles for add4$, old syntax
;      324 cycles for add4$, short syntax
   else
      print chr$(10, 10, "Code sizes:", 13, 10)
;      341 cycles for addx$
;      531 cycles for add4$, old syntax
;      192 cycles for add4$, short syntax
   endif

Timings are rock solid.

Now the bad news:
SLOW version
0040193C               ?  5E                         pop esi
0040193D               ?  5F                         pop edi
0040193E               ?  C2 0800                    retn 8
00401941              .  8DA424 00000000             lea esp, [local.0]
00401948              . 8DA424 00000000             lea esp, [local.1]
0040194F              ?  90                          nop
00401950              >  CC                          int3 <<<<<<<<<<< algo start
00401951              ?  57                          push edi
00401952              ?  56                          push esi

*** print chr$(10, "Code sizes:", 13, 10)
00401805              .  40                          inc eax                                                                  ; Arg1 => ASCII 0A,"Code sizes"
00401806              ?  00E8                        add al, ch                                                               ;
00401808              ?  90                          nop                                                                      ;
00401809              ?  0100                        add dword ptr [eax], eax                                                 ;
0040180B              ?  00CC                        add ah, cl
0040180D              ?  68 59C04000                 push offset AddStrB.0040C059                                             ; ASCII 0A,"Code sizes"
00401812              ?  E8 89030000                 call 00401BA0                                                            ;
00401817              ?  68 68C04000                 push offset AddStrB.0040C068                                             ; ASCII "szCatStr2    = "
0040181C              ?  E8 7F030000                 call 00401BA0

FAST version
0040193C               ?  5E                         pop esi
0040193D               ?  5F                         pop edi
0040193E               ?  C2 0800                    retn 8
00401941              .  8DA424 00000000             lea esp, [local.0]
00401948              . 8DA424 00000000             lea esp, [local.1]
0040194F              ?  90                          nop
00401950              >  CC                          int3 <<<<<<<<<<< algo start
00401951              ?  57                          push edi
00401952              ?  56                          push esi
00401953              .  8B7C24 0C                   mov edi, dword ptr [arg1]

*** print chr$(10, 10, "Code sizes:", 13, 10)
00401805              .  40                          inc eax                                                                  ; Arg1 => ASCII LF,LF,"Code sizes:",CR,LF
00401806              ?  00E8                        add al, ch                                                               ;
00401808              ?  90                          nop                                                                      ;
00401809              ?  0100                        add dword ptr [eax], eax                                                 ;
0040180B              ?  00CC                        add ah, cl
0040180D              ?  68 59C04000                 push offset AddStrB.0040C059                                             ; ASCII LF,LF,"Code sizes:",CR,LF
00401812              ?  E8 89030000                 call 00401BA0                                                            ;
00401817              ?  68 69C04000                 push offset AddStrB.0040C069                                             ; ASCII "szCatStr2    = "
0040181C              ?  E8 7F030000                 call 00401BA0


The algo and its location are identical.

Needless to say, I fiddled a lot with the align 16 plus specific fill-byte (retn, cpuid, ...) options. They have no effect, at least on the Celeron M where I could test this.

[attachment deleted by admin]

redskull

Alright, I need to ask a n00b question - why should any timing code be trusted when it can't disable interrupts?  Every 15 ms (at least) the executing thread grinds to a halt to service the kernel, which a high-performance counter wouldn't compensate for.  Additional hardware events and DPCs would compound the problem even more, right?  I was under the tacit assumption that any code timing, unless done in real mode with interrupts disabled, was AT BEST a good estimation.  Can someone explain why this is not so?

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

dedndave

well - they have been playing with different methods for a while in here
QueryPerformanceCounter and RDTSC, as well as SetPriorityClass and SetProcessAffinityMask
they have an idea of what some simple strings of code should return - so they can test their methods
but, the popular approach seems to be to run the test code about 10,000 times and pick one
of the lower result values, assuming that nothing else was going on during that run
it seems to return fairly repeatable results on single-core CPUs
i guess you'd say we are still fine-tuning the measurement methods
to obtain more reliable results on a wider variety of machines
i am playing with some code along that line, myself

hutch--

It's something like being a lone voice crying in the wilderness, but until real-time testing becomes the norm on samples large enough to emulate real-time situations, the results will continue to be all over the place like a mad woman's sewerage.

Short looped test conditions do have their place but it tends to be in the area of instruction sequence testing, not in larger scale algorithm testing where cache effects, memory paging and bare data read/write times take effect. As hardware gets further away from the old pre-pipelined single instruction munchers, many of these assumptions are simply out of date. The only factor that remains constant is real time, the rest vary with hardware changes.

dedndave

lol "mad womans sewerage" - do i wanna know what that is ? - reminds me of the ex
i think there is still a chance to get some meaningful data on algos
so long as the data set you select emulates the real world
some of us are not all that interested in "how many ticks does it take"
i am more interested in obtaining some kind of usable comparisons
if things are set up right, and you take into account that different algos work differently on different data sets
you should be able to get enough information to help select the right algo for the job
this algo handles short strings better but is slow on long strings
that algo doesn't care how long the string is, it's not fast anywhere
these generalizations should help you decide what is best for a given application
if you can write the algo 3 ways - it is a matter of copy/paste in the finished program to help decide
i do think there is a lot of room for improvement in the test methods, however
it will never be perfect, of course - but it can be better than it is - we will get there


EDIT: i was playing with some code today and getting nice continuous time values on a dual-core, at least

jj2007

Quote from: redskull on May 22, 2009, 02:09:53 AM
Alright, I need to ask a n00B question - why should any timing code be trusted which can't disable the interrupts?  Every 15ms (at least) the executing thread grinds to a halt to service the kernel

The TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise). All readings are done in well under 15 ms (actually, this test with 1,000 readings takes 15 ms including all overheads). And the readings are very consistent - 967 out of 1,000 tests are within +-2 cycles of the average:

Quote
0:744 1:744 2:744 3:744 4:744 5:744 6:744 7:744 8:744 9:744 10:744
...
951:746 952:746 953:746 954:746 955:746 956:746 957:746 958:747 959:751 960:756
961:780 962:807 963:819 964:834 965:839 966:845 967:971 968:973 969:974 970:976
971:986 972:989 973:989 974:990 975:991 976:991 977:992 978:992 979:993 980:994
981:997 982:999 983:1000 984:1001 985:1002 986:1011 987:1016 988:1025 989:1025
990:1030 991:1033 992:1033 993:1035 994:1044 995:1045 996:1052 997:1056 998:1111
999:1136 1000:1209

average with cutoff     746     cutoff at n=967 for ct>=931 cycles
average w/o cutoff      754
difference raw-cutoff:  8

Here is the same for the fast version - identical code at identical position, see earlier post above:
Quote
0:259 1:259 2:259 3:259 4:259 5:259 6:259 7:259 8:259 9:259 10:259
...
951:261 952:261 953:261 954:261 955:261 956:261 957:261 958:261 959:261 960:261
961:261 962:261 963:261 964:261 965:261 966:261 967:261 968:261 969:261 970:261
971:261 972:261 973:261 974:261 975:261 976:261 977:261 978:261 979:261 980:261
981:262 982:262 983:262 984:262 985:262 986:267 987:323 988:374 989:422 990:488
991:488 992:491 993:492 994:500 995:506 996:510 997:511 998:523 999:530 1000:758

MyTest
average with cutoff     259     cutoff at n=988 for ct>=324 cycles
average w/o cutoff      262
difference raw-cutoff:  3

Note that these timings are for a P4, so the effect is not even CPU-specific. But test it yourself - I have attached two executables now.

[attachment deleted by admin]

MichaelW

Quote
The TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise).

Only the second set of macros (counter2.asm in counter2.zip) do this. These macros also capture the lowest cycle count that occurs in any loop.
