
Code location sensitivity of timings

Started by jj2007, May 18, 2009, 12:30:16 PM


FORTRANS

Quote from: dedndave on May 19, 2009, 11:58:25 AM
i would guess the OS is doing something else during the high count tests

Hello,

  I have a simple graphics program that prints out the iterations
per second. It works in an XP full screen command prompt, but
if run longer than a few seconds, it slows significantly (about half
speed).  I have seen another DOS graphics program slow drastically
after running a few seconds, though that one is not very repeatable.

  Maybe boot to safe mode, or DOS?

FWIW,

Steve N.

BlackVortex

What ?   :bg

It's not a DOS program !

dedndave

it could be a dos graphics program under XP
i have a similar one i wrote years ago for DOS
it uses the 320x200 256 color graphics mode
although, i have not seen this slow-down problem with it

i did notice that it seems disk-intensive
DOS-mode disk I/O running under Windows seems to be a poor mix


jj2007

I have often noticed that apps slow down while the fan starts spinning faster. Maybe I've misunderstood what SpeedStep (TM) really means - I thought apps should run faster when they ask for it :bg

In the meantime, I might have found an answer; the following shows how many readings in 1,000 are low, i.e. 17-18 cycles for 10 push/pop reg32 pairs:

LOOP_COUNT =   10000000
average with cutoff     18      cutoff at n=523 for ct higher than 23 cycles
average w/o cutoff      29
difference raw-cutoff:  11

LOOP_COUNT =   5000000
average with cutoff     17      cutoff at n=639 for ct higher than 23 cycles
average w/o cutoff      26
difference raw-cutoff:  9

LOOP_COUNT =   2000000
average with cutoff     18      cutoff at n=898 for ct higher than 23 cycles
average w/o cutoff      20
difference raw-cutoff:  2

LOOP_COUNT =   1000000
average with cutoff     17      cutoff at n=977 for ct higher than 23 cycles
average w/o cutoff      18
difference raw-cutoff:  1

LOOP_COUNT =   500000
average with cutoff     17      cutoff at n=970 for ct higher than 22 cycles
average w/o cutoff      17
difference raw-cutoff:  0

See the pattern? It seems that if LOOP_COUNT is too high, the OS is forced to do something else, eating up cycles for half of the readings. Probably the time slice is too short to perform the full test then?

Anyway, the cutoff method works fine and is easier to handle than JimG's copying scheme. But that still doesn't answer the question of why the OS behaves like that more often when the same, identically aligned code sits a few bytes further down...

dedndave

that is what i was saying in the other thread
i think the test "burst" needs to fall within some predefined window of cycle counts
too few cycles - unrepeatable results due to the test/overhead ratio
too many cycles - can't avoid the OS "hiccups"
i think that window is fairly large and should be easy to hit
once the window has been defined, testing any algo should first adjust its "burst" length to fit the window - then run the tests
i am working on a program that does it this way - if i can just get finished with the cpu identification nightmare - lol
i am also trying to accommodate a wide variety of machines with multi-cores
example: some off-the-wall guy has a machine with a prescott and 2 northwoods in it
we can acquire information for the prescott and northwood from a single run

MichaelW

I originally recommended a high loop count because it produced better run-to-run consistency. Then some time later I discovered that adding a 3-4 second delay after the program was launched and before entering the first test loop eliminated the need for the high loop count. Running on my P3, with the delay and a loop count of 1000, and testing code that executes somewhere around 20-300 cycles, the run-to-run variation is near zero.
eschew obfuscation

dedndave

good info to have Michael
i would not have thought to look for high start-up usage by the OS
probably building all the structures that you "might" refer to in function calls - lol - what a waste
it would make more sense to me if the OS set everything up - then ran the app - but what do i know
i wonder if this affects windows apps as well as console apps - probably does
also, i wonder if this is somehow related to Jochen's mystery cycle-count issue

Mark Jones

FWIW,
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3) (XP x64)
Testing timings with cutoff

0:79 1:79 2:79 3:79 4:79 5:79 6:79 7:79 8:79 9:79 10:79

...

701:79 702:79 703:79 704:79 705:79 ... 994:79 995:79 996:79 997:80 998:80 999:81 1000:109
MyTest: subcounter=     1000
average with cutoff     79      cutoff at n=1000 for ct higher than 99 cycles
average w/o cutoff      78
difference raw-cutoff:  -1
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

hutch--

Having played with enough algos over time, there does seem to be a pattern in the example I posted where setting the alignment to 16 suddenly makes the algo a lot slower, sometimes over 50% slower. It would seem that if the algorithm deals with byte data (i.e. processed one byte at a time) then it is more susceptible to alignment variations than if the algo deals with DWORD-sized data.

There is possibly another factor with the "ALIGN 16" directive: the padding can be up to 15 bytes long, usually done with two NOP-style instructions, and while they should never be executed, as they sit before the label in a procedure, I have a sneaking suspicion that very long instructions stored near the front of the code cache can also have unusual effects in terms of timing.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

well - i always try to use a simple instruction (or two) at a label
that gives the eip something to chew on while the queue fills up
like....

branch:
cld
mov eax,edx

(i.e. use instructions that access no memory)

i have always used a similar method all the way back to 8086
with that processor, the queue was 6 bytes
but, it was capable of pre-fetching and executing at the same time (second generation uP they called it i think)


NightWare

Quote from: hutch-- on May 20, 2009, 02:29:19 AM
I have a sneaking suspicion that very long instructions stored near the front of the code cache can also have unusual effects in terms of timing.
suspicion ? has anyone here read what i wrote ? i spoke of uops for a good reason... always remember that the alignment (done by us) is there to make instruction decoding easier for the cpu... BUT it's not always sufficient... take a piece of paper and work through the possibilities for a 6- or 4-uop cpu, for example with a 7-byte instruction... you will certainly understand... anyway, it's not something you can modify by coding, so why bother with it ?

hutch--

NightWare,

Tolerate us mere mortals and explain what you have mentioned. I commented on something I have found over time that is not mentioned in any of the optimisation documentation I have seen over the last 20 years: effectively a NON-EXECUTED instruction sequence, provided by the assembler for the purposes of code alignment, that appears to affect the timing of the executed code that follows it.

hutch--

Now, while waiting for such pearls of wisdom, here is some interim testing that seems to trigger the timing change with different code alignment.


    align 16

    mov esp, [esp+edx*8+128]    ; non executed instruction 7 bytes

    REPEAT 64
      int 3
    ENDM

    ; retn

    align 4


The align 16, the 7-byte instruction and a number of others produced the slowdown of over 50%. Large paddings of NOPs (DB 90h) did not trigger the change, but the use of a RETN in some instances removed the slowdown, as did padding with INT 3 instructions.

I don't have a handy explanation but the effect is reproducible.

sinsi

I was wondering...with our new you-beaut CPUs, with their variations in cache, and Windows with its zillion threads, could the cache get 'saturated'?
Garbage collection?
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on May 20, 2009, 06:59:52 AM
I was wondering...with our new you beaut cpu's and variations with cache and windows with its zillion threads, could the cache get 'saturated'?
Garbage collection?

As far as I've understood, you get a fresh cache every time the OS hands over a time slice - but maybe I am wrong.

Quote from: hutch-- on May 20, 2009, 06:34:32 AM
The align 16, the 7-byte instruction and a number of others produced the slowdown of over 50%. Large paddings of NOPs (DB 90h) did not trigger the change, but the use of a RETN in some instances removed the slowdown, as did padding with INT 3 instructions.

I don't have a handy explanation but the effect is reproducable.

Lingo has the habit to start procs with e.g.
align 16
db 0, 9, 3, 99
LingoCode proc ....

Maybe he has found the secret :wink

The only possibility I see is that the CPU erroneously tries to interpret non-executed code before the proc start, in order to "optimise" the real code - branch prediction etc... ::)

Anyway, for the analysis we should keep two issues separate:
1. The alignment that influences speed all the time, e.g. a beneficial nop before a speed-critical loop.
2. The observation that the same, identically aligned code runs half as fast if placed 16 bytes lower.

For point 2, see my code attached above. The algo runs x% of the time at full speed and 100-x% at half speed, most probably because the OS occasionally takes over to do other things. I have tested the phenomenon further, and my "wrappers" for Michael's macros can effectively avoid it. Even with a modest invoke Sleep, 20 and LOOP_COUNT = 10000, I get a rock-solid timing of 151 cycles for mov eax, len(offset Tst50) (a 50-byte string). The non-cutoff timings instead wag like a dog's tail, and the cutoff point varies, i.e. there is always a number of readings that are totally out of order.