The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on May 18, 2009, 12:30:16 PM

Title: Code location sensitivity of timings
Post by: jj2007 on May 18, 2009, 12:30:16 PM
This is a spin-off from the add$ thread (http://www.masm32.com/board/index.php?topic=11438.0), regarding a problem in the add$ macro and a faster technique for concatenating strings with add$. I stumbled once again over a code location problem:

809 cycles for addx$
848 cycles for add3$, old syntax
767 cycles for add4$, old syntax
258 cycles for add3$, short syntax
227 cycles for add4$, short syntax

Exactly the same algo, but moved a bit further down using the MakeSlow=1 switch in line 6:

774 cycles for addx$
826 cycles for add3$, old syntax
1214 cycles for add4$, old syntax
261 cycles for add3$, short syntax
763 cycles for add4$, short syntax

Both code locations are aligned with align 16; timings are for a Prescott P4. It seems worthwhile to look seriously into this issue. JimG's approach of moving the code to a constant location is fine for testing purposes (I had no time to test it on this one), but it raises the question of how to optimise real-life applications...

[attachment deleted by admin]
Title: Re: Code location sensitivity of timings
Post by: hutch-- on May 18, 2009, 01:14:26 PM
JJ,

The effect is not uncommon and I don't have a definitive answer. I have even played with padding the code on either side with NOPs and zeros to get 1024-byte alignment, and you still get the variation.

It is my guess that it is a combined code alignment/cache content pollution effect, and that it affects algos with high-intensity loop code more than algos that are logic-based.

Also try a different type of timing test, as this one may be affected by the effects mentioned above. Long linear tests are far less sensitive to code location variation than short looped code methods.

Isn't code optimisation PHUN ?
Title: Re: Code location sensitivity of timings
Post by: Jimg on May 18, 2009, 01:18:22 PM
Usually the AMD is very sensitive, but with your test I saw virtually no difference between the two.
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 18, 2009, 01:33:47 PM
Quote from: hutch-- on May 18, 2009, 01:14:26 PM
Isn't code optimisation PHUN ?

It rocks :cheekygreen:

Quote from: Jimg on May 18, 2009, 01:18:22 PM
Usually the AMD is very sensitive, but with your test I saw virtually no difference between the two.

Thanks. I am not surprised. A factor of 3 slower just because it's a different CPU ::)
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 18, 2009, 01:40:05 PM
clear the cache - then run the tests ?
this is on the celeron cpu ? - or prescott ?
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 18, 2009, 02:37:17 PM
Quote from: dedndave on May 18, 2009, 01:40:05 PM
clear the cache - then run the tests ?

The cache should be fine after the first handful of loops (how many are needed? one? any Intel or AMD insiders lurking??). Loop count is 200000, so if the first n loops take 200000 cycles each, the average will be n cycles higher. Not 500...
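
If one wants to rule warm-up out entirely, a single untimed call before the timing loop does it. A minimal sketch with MichaelW's counter macros (MyAlgo and SomeData are made-up names for illustration):

    invoke MyAlgo, addr SomeData        ; untimed warm-up call
    counter_begin 200000, HIGH_PRIORITY_CLASS
    invoke MyAlgo, addr SomeData
    counter_end                         ; eax = average cycles per loop
    print str$(eax), " cycles", 13, 10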

Quote
this is on the celeron cpu ? - or prescott ?

Prescott. I will check the Celeron tonight, but I doubt I will see the difference. It comes and goes with a few bytes more or less ::)

Edit: Celeron M timings - the effect is there but less strong:
MakeSlow = 0
343 cycles for addx$
627 cycles for add4$, old syntax
221 cycles for add4$, short syntax

MakeSlow = 1
342 cycles for addx$
650 cycles for add4$, old syntax
316 cycles for add4$, short syntax
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 18, 2009, 03:20:45 PM
maybe to write fast code, you have to set up some kind of
dynamic system that places and aligns routines at run-time  :dazzled:
Title: Re: Code location sensitivity of timings
Post by: hutch-- on May 19, 2009, 08:50:45 AM
JJ,

Here is an example of alignment affecting performance in an unusual way.

This algo clocks 406ms at an alignment of 0, 1, 2, 4 or 8, but slows to 672ms at an alignment of 16.

I was playing with this algo, dropped it from 467 to 406 with an unroll and a few changed instructions.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    szappendx PROTO :DWORD,:DWORD,:DWORD

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL hMem  :DWORD
    LOCAL cloc  :DWORD

    mov hMem, alloc(1024*1024*256)      ; 256 meg

    mov cloc, 0

    push esi
    mov esi, 20000000

    invoke GetTickCount
    push eax

  @@:
    mov cloc, rv(szappendx,hMem,"123456789012",cloc)
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax),13,10

    pop esi

    free hMem

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

; align 16

szappendx proc string:DWORD,buffer:DWORD,location:DWORD

  ; ------------------------------------------------------
  ; string      the main buffer to append extra data to.
  ; buffer      the byte data to append to the main buffer
  ; location    current location pointer
  ; ------------------------------------------------------

    push esi
    push edi
    mov edi, 1

    or eax, -1
    mov esi, [esp+12]       ; string
    mov ecx, [esp+16]       ; buffer
    add esi, [esp+20]       ; location ; add current location to the destination address

  @@:
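  ; 4x unrolled byte copy: each of the three REPEAT copies exits forward
  ; (jz @F) on the terminator byte; the fourth carries the backward branch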
  REPEAT 3
    add eax, edi
    movzx edx, BYTE PTR [ecx+eax]
    mov [esi+eax], dl
    add edx, edx
    jz @F                   ; exit on written terminator
  ENDM

    add eax, edi
    movzx edx, BYTE PTR [ecx+eax]
    mov [esi+eax], dl
    add edx, edx
    jnz @B                  ; loop back until terminator written

  @@:

    add eax, [esp+20]       ; location ; return updated offset pointer

    pop edi
    pop esi

    ret 12

szappendx endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Title: Re: Code location sensitivity of timings
Post by: MichaelW on May 19, 2009, 09:14:31 AM
Perhaps a solution would be to use larger alignments. The codealign macro below uses the only workable method I have found so far; it is somewhat clumsy and not very well tested.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

alignment MACRO pv
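    ;; returns in EAX, at run time, the largest power of 2 that divides
    ;; the address pv - i.e. the actual alignment of pv (0 if pv is 0)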
    xor eax, eax
    mov ecx, pv
    bsf ecx, ecx
    jz @F
    mov eax, 1
    shl eax, cl
  @@:
    EXITM <eax>
ENDM

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

;-----------------------------------------------------------------------
; This macro will work only for items in the code segment and only if
; there is a code_base label at the start of the code segment.
;
; For some reason, specifying alignments >= 4096 produces an alignment
; of 4096. Hopefully, there is little or no need for larger alignments.
;-----------------------------------------------------------------------

codealign MACRO alignment
    LOCAL location, aligned_location
    location = $ - code_base
    aligned_location = location + alignment AND NOT (alignment - 1)
    db (alignment + aligned_location - location) dup(90h)
ENDM

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
    code_base:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    ;nops 5

    N = 1
    REPEAT 17
        print ustr$(N),9
        codealign N
      @@:
        print ustr$(alignment(OFFSET @B)),13,10
        ;nops N+5
        N = N SHL 1
    ENDM

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 19, 2009, 10:46:43 AM
Quote from: hutch-- on May 19, 2009, 08:50:45 AM
JJ,

Here is an example of alignment affecting performance in an unusual way.
This algo clocks 406ms on an alignment of 0,1,2,4,8 but drops to 672ms on an alignment of 16.
I was playing with this algo, dropped it from 467 to 406 with an unroll and a few changed instructions.


That is indeed very odd...

Quote from: MichaelW on May 19, 2009, 09:14:31 AM
Perhaps a solution would be to use larger alignments. The codealign macro below uses the only workable method I have found so far, is somewhat clumsy, and not very well tested.

Will have to look at it when I have some spare time. Another way is what JimG did, i.e. copy the whole proc to be tested to a fixed aligned location and test it there.

In the meantime, I wrote a wrapper for your timing macros that performs a cutoff of the lowest and highest timings:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Testing timings with cutoff, subcounter=10000

0:151 1:151 2:151 3:151 4:151 5:151 6:151 7:151 8:151 9:151 10:151 11:151 12:151
13:151 14:151 15:151 16:151 17:151 18:151 19:151 20:151

...

801:152 802:152 803:152 804:152 805:152 806:152 807:152 808:152 809:152 810:152
811:152 812:152 813:152 814:152 815:152 816:152 817:152 818:152 819:153 820:153
821:154 822:154 823:154 824:154 825:155 826:155 827:155 828:155 829:155 830:155
831:155 832:155 833:156 834:156 835:156 836:157 837:157 838:157 839:157 840:157
841:157 842:157 843:157 844:157 845:157 846:157 847:157 848:157 849:157 850:157
851:157 852:157 853:157 854:158 855:158 856:158 857:158 858:158 859:159 860:159
861:159 862:159 863:159 864:159 865:160 866:160 867:160 868:160 869:160 870:161
871:162 872:163 873:163 874:163 875:164 876:164 877:165 878:165 879:170 880:170
881:172 882:173 883:175 884:176 885:178 886:179 887:180 888:187 889:188 890:188
891:188 892:188 893:188 894:188 895:188 896:188 897:188 898:188 899:188 900:188
901:189 902:193 903:202 904:203 905:205 906:226 907:231 908:232 909:243 910:244
911:248 912:252 913:260 914:263 915:263 916:263 917:263 918:264 919:264 920:264
921:264 922:265 923:265 924:265 925:265 926:265 927:265 928:265 929:266 930:266
931:266 932:266 933:267 934:267 935:267 936:267 937:268 938:268 939:268 940:268
941:268 942:268 943:268 944:268 945:268 946:269 947:269 948:269 949:269 950:269
951:269 952:270 953:270 954:270 955:270 956:270 957:270 958:270 959:270 960:270
961:270 962:270 963:270 964:270 965:270 966:270 967:270 968:270 969:270 970:270
971:270 972:271 973:271 974:271 975:271 976:272 977:272 978:272 979:272 980:272
981:273 982:273 983:273 984:273 985:273 986:273 987:274 988:275 989:280 990:280
991:281 992:281 993:282 994:282 995:288 996:288 997:297 998:305 999:312 1000:331

MyTest: subcounter=     10000
average with cutoff     151
average w/o cutoff      164


That might be a way to stabilise timings, but doesn't solve the location problem, of course.


[attachment deleted by admin]
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 19, 2009, 11:56:00 AM
Maybe I found something interesting. This version shows a sudden jump of the timings to a significantly higher value:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Testing timings with cutoff, subcounter=1000

0:151 1:151 2:152 3:152 4:152 5:152 6:152 7:152 8:152 9:152 10:152

...

401:152 402:152 403:152 404:152 405:168 406:211 407:255 408:255 409:256 410:256
411:256 412:257 413:257 414:258 415:258 416:258 417:258 418:259 419:259 420:259
421:259 422:259 423:260 424:260 425:260 426:260 427:261 428:261 429:261 430:261
431:262 432:262 433:262 434:262 435:262 436:262 437:262 438:262 439:263 440:263
441:263 442:263 443:263 444:263 445:263 446:263 447:263 448:264 449:264 450:264
451:264 452:264 453:264 454:264 455:264 456:264 457:264 458:264 459:264 460:264
461:264 462:264 463:265 464:265 465:265 466:265 467:265 468:265 469:265 470:265
471:265 472:265 473:265 474:265 475:265 476:265 477:265 478:265 479:265 480:265
481:265 482:265 483:265 484:265 485:265 486:265 487:265 488:265 489:265 490:265
491:265 492:265 493:266 494:266 495:266 496:266 497:266 498:266 499:266 500:266
501:266 502:266 503:266 504:266 505:266 506:266 507:266 508:266 509:266 510:266
511:266 512:266 513:266 514:266 515:266 516:266 517:266 518:266 519:266 520:266
521:266 522:266 523:266 524:266 525:266 526:266 527:266 528:266 529:266 530:266
531:266 532:266 533:266 534:266 535:266 536:266 537:266 538:266 539:266 540:266
541:266 542:266 543:266 544:266 545:266 546:266 547:266 548:266 549:266 550:267
551:267 552:267 553:267 554:267 555:267 556:267 557:267 558:267 559:267 560:267
561:267 562:267 563:267 564:267 565:267 566:267 567:267 568:267 569:267 570:267
571:267 572:267 573:267 574:267 575:267 576:267 577:267 578:267 579:267 580:267
581:267 582:267 583:267 584:267 585:267 586:267 587:267 588:267 589:267 590:267
591:267 592:267 593:267 594:267 595:267 596:267 597:267 598:267 599:267 600:267
601:267 602:267 603:267 604:267 605:267 606:267 607:267 608:267 609:267 610:267
611:267 612:267 613:267 614:267 615:267 616:267 617:267 618:267 619:267 620:267
621:267 622:267 623:267 624:267 625:267 626:267 627:267 628:267 629:267 630:267
631:267 632:267 633:267 634:267 635:267 636:267 637:267 638:267 639:267 640:267
641:268 642:268 643:268 644:268 645:268 646:268 647:268 648:268 649:268 650:268
651:268 652:268 653:268 654:268 655:268 656:268 657:268 658:268 659:268 660:268
661:268 662:268 663:268 664:268 665:268 666:268 667:268 668:268 669:268 670:268
671:268 672:268 673:268 674:268 675:268 676:268 677:268 678:268 679:268 680:268
681:268 682:268 683:268 684:268 685:268 686:268 687:268 688:268 689:268 690:268
691:268 692:268 693:268 694:268 695:268 696:268 697:268 698:268 699:268 700:268
701:268 702:268 703:268 704:268 705:268 706:268 707:268 708:268 709:268 710:268
711:268 712:268 713:268 714:268 715:268 716:268 717:268 718:268 719:268 720:268
721:268 722:268 723:268 724:268 725:268 726:268 727:268 728:268 729:268 730:268
731:268 732:269 733:269 734:269 735:269 736:269 737:269 738:269 739:269 740:269
741:269 742:269 743:269 744:269 745:269 746:269 747:269 748:269 749:269 750:269
751:269 752:269 753:269 754:269 755:269 756:269 757:269 758:269 759:269 760:269
761:269 762:269 763:269 764:269 765:269 766:269 767:269 768:269 769:269 770:269
771:269 772:269 773:269 774:269 775:269 776:269 777:269 778:269 779:269 780:269
781:269 782:269 783:269 784:269 785:269 786:269 787:269 788:269 789:269 790:269
791:269 792:269 793:269 794:269 795:269 796:269 797:269 798:269 799:269 800:269
801:269 802:269 803:269 804:269 805:269 806:269 807:269 808:269 809:269 810:269
811:269 812:269 813:269 814:269 815:269 816:269 817:269 818:269 819:269 820:269
821:269 822:269 823:269 824:269 825:269 826:269 827:270 828:270 829:270 830:270
831:270 832:270 833:270 834:270 835:270 836:270 837:270 838:270 839:270 840:270
841:270 842:270 843:270 844:270 845:270 846:270 847:270 848:270 849:270 850:270
851:270 852:270 853:270 854:270 855:270 856:270 857:270 858:270 859:270 860:270
861:270 862:270 863:270 864:270 865:270 866:270 867:270 868:270 869:270 870:270
871:270 872:270 873:270 874:270 875:270 876:270 877:270 878:270 879:270 880:270
881:270 882:270 883:270 884:270 885:270 886:270 887:270 888:270 889:270 890:270
891:270 892:270 893:270 894:270 895:270 896:270 897:270 898:270 899:270 900:270
901:270 902:270 903:270 904:270 905:270 906:270 907:270 908:270 909:270 910:270
911:270 912:270 913:270 914:271 915:271 916:271 917:271 918:271 919:271 920:271
921:271 922:271 923:271 924:271 925:271 926:271 927:271 928:271 929:271 930:271
931:271 932:271 933:271 934:271 935:271 936:271 937:271 938:271 939:271 940:271
941:271 942:271 943:272 944:272 945:272 946:272 947:272 948:272 949:272 950:272
951:272 952:272 953:272 954:272 955:272 956:272 957:273 958:273 959:273 960:273
961:273 962:273 963:274 964:274 965:274 966:274 967:275 968:275 969:275 970:275
971:275 972:275 973:275 974:276 975:279 976:281 977:286 978:287 979:304 980:318
981:319 982:334 983:374 984:378 985:396 986:404 987:412 988:421 989:422 990:427
991:435 992:441 993:445 994:467 995:487 996:493 997:496 998:497 999:502 1000:789

MyTest: subcounter=     1000
average with cutoff     152
average w/o cutoff      224
difference raw-cutoff:  72

The point of change is not fixed. With the same code I also get, e.g.:
581:152 582:152 583:152 584:152 585:152 586:152 587:169 588:199 589:247 590:252
591:256 592:257 593:257 594:257 595:257 596:258 597:258 598:259 599:260 600:261

But the pattern is clear: one part of the timings is fast, another a lot slower, and both consistently so. Code location obviously plays a role in this, but the code is not always slow. This makes the phenomenon different from what we experience with changing alignments.

EDIT: New attachment with automatic cutoff following a rule:
- take the 10th lowest value as a reference
- multiply it by 1.25
- cut off at the first count higher than this value
Usually, for the P4, the cutoff n is somewhere around 600...900 of 1000 readings, and the timings are very stable.
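
A minimal sketch of that rule in code, assuming the readings have already been sorted ascending (I use CombSortA for that); dwCycles and NREADS are made-up names for illustration:

CutAverage proc
    ; dwCycles = DWORD array of NREADS sorted cycle counts
    push ebx
    mov eax, dwCycles+10*4      ; 10th lowest value as reference
    mov edx, eax
    shr edx, 2
    add edx, eax                ; threshold = reference * 1.25
    xor ebx, ebx                ; running sum
    mov ecx, 10                 ; averaging starts at n=10
@@: mov eax, dwCycles[ecx*4]
    cmp eax, edx
    ja @F                       ; first count above threshold: cut off here
    add ebx, eax
    inc ecx
    cmp ecx, NREADS
    jb @B
@@: mov eax, ebx
    sub ecx, 10                 ; number of readings kept (assumed > 0)
    xor edx, edx
    div ecx
    pop ebx
    ret                         ; eax = average with cutoff
CutAverage endp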

[attachment deleted by admin]
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 19, 2009, 11:58:25 AM
i would guess the OS is doing something else during the high count tests
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 19, 2009, 12:38:51 PM
Quote from: dedndave on May 19, 2009, 11:58:25 AM
i would guess the OS is doing something else during the high count tests


Combined with flushing the cache...?
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 19, 2009, 12:43:32 PM
is that a rhetorical question Jochen, or am i supposed to make up an answer ?
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 19, 2009, 01:43:36 PM
Quote from: dedndave on May 19, 2009, 12:43:32 PM
is that a rhetorical question Jochen, or am i supposed to make up an answer ?

If you find an answer... but don't tell me it's 42 (http://theanswers42.blogspot.com/) :wink
Title: Re: Code location sensitivity of timings
Post by: FORTRANS on May 19, 2009, 01:44:39 PM
Quote from: dedndave on May 19, 2009, 11:58:25 AM
i would guess the OS is doing something else during the high count tests

Hello,

  I have a simple graphics program that prints out the iterations
per second. It works in an XP full screen command prompt, but
if run longer than a few seconds, it slows significantly (about half
speed).  I have seen another DOS graphics program slow drastically
after running a few seconds, though that one is not very repeatable.

  Maybe boot to safe mode, or DOS?

FWIW,

Steve N.
Title: Re: Code location sensitivity of timings
Post by: BlackVortex on May 19, 2009, 02:20:02 PM
What ?   :bg

It's not a DOS program !
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 19, 2009, 02:31:47 PM
it could be a dos graphics program under XP
i have a similar one i wrote years ago for DOS
it uses the 320x200 256 color graphics mode
although, i have not seen this slow-down problem with it

i did notice that it seems disk-intensive
DOS-mode disk I/O running under Windows seems to be a poor mix

Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 19, 2009, 02:56:12 PM
I have often noticed that apps slow down while the fan starts spinning faster. Maybe I've misunderstood what SpeedStep (TM) really means - I thought apps should run faster when they ask for it :bg

In the meantime, I might have found an answer; the following shows how many readings in 1,000 are low, i.e. 17-18 cycles for 10 push/pop reg32 pairs:

LOOP_COUNT =   10000000
average with cutoff     18      cutoff at n=523 for ct higher than 23 cycles
average w/o cutoff      29
difference raw-cutoff:  11

LOOP_COUNT =   5000000
average with cutoff     17      cutoff at n=639 for ct higher than 23 cycles
average w/o cutoff      26
difference raw-cutoff:  9

LOOP_COUNT =   2000000
average with cutoff     18      cutoff at n=898 for ct higher than 23 cycles
average w/o cutoff      20
difference raw-cutoff:  2

LOOP_COUNT =   1000000
average with cutoff     17      cutoff at n=977 for ct higher than 23 cycles
average w/o cutoff      18
difference raw-cutoff:  1

LOOP_COUNT =   500000
average with cutoff     17      cutoff at n=970 for ct higher than 22 cycles
average w/o cutoff      17
difference raw-cutoff:  0

See the pattern? It seems that if LOOP_COUNT is too high, the OS is forced to do something else, eating up cycles for half of the readings. Probably the time slice is too short to perform the full test then?

Anyway, the cutoff method works fine and is easier to handle than JimG's copying scheme. But that still doesn't answer the question of why the OS behaves like that more often when the same identical and identically aligned code sits a few bytes further down...
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 19, 2009, 03:26:13 PM
that is what i was saying in the other thread
i think the test "burst" needs to fall within some predefined window of cycle counts
too few cycles - unrepeatable results due to test/overhead ratio
too many cycles - can't avoid the OS "hiccups"
i think that window is fairly large and should be easy to hit
once the window has been defined, testing any algo should first adjust its "burst" length to fit the window - then run the tests
i am working on a program that does it this way - if i can just get finished with the cpu identification nightmare - lol
i am also trying to accommodate a wide variety of machines with multi-cores
example: some off-the-wall guy has a machine with a prescott and 2 northwoods in it
we can acquire information for the prescott and northwood from a single run
Title: Re: Code location sensitivity of timings
Post by: MichaelW on May 19, 2009, 04:25:38 PM
I originally recommended a high loop count because it produced better run-to-run consistency. Then some time later I discovered that adding a 3-4 second delay after the program was launched and before entering the first test loop eliminated the need for the high loop count. Running on my P3, with the delay and a loop count of 1000, and testing code that executes somewhere around 20-300 cycles, the run-to-run variation is near zero.
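
In code, that is nothing more than a delay before the first test loop; a sketch (the 3500 is illustrative, anything in the 3-4 second range):

    invoke Sleep, 3500      ; let post-launch OS activity settle
    counter_begin 1000, HIGH_PRIORITY_CLASS
    ; ... code under test ...
    counter_end             ; eax = cycle count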
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 19, 2009, 04:33:18 PM
good info to have Michael
i would not have thought to look for a high start-up usage by the OS
probably building all the structures that you "might" refer to in function calls - lol - what a waste
it would make more sense to me if the OS set everything up - then ran the app - but what do i know
i wonder if this affects windows apps as well as console apps - probably does
also, i wonder if this is somehow related to Jochen's mystery cycle-count issue
Title: Re: Code location sensitivity of timings
Post by: Mark Jones on May 20, 2009, 02:02:28 AM
FWIW,
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3) (XP x64)
Testing timings with cutoff

0:79 1:79 2:79 3:79 4:79 5:79 6:79 7:79 8:79 9:79 10:79

...

701:79 702:79 703:79 704:79 705:79 706:79 707:79 708:79 709:79 710:79
711:79 712:79 713:79 714:79 715:79 716:79 717:79 718:79 719:79 720:79
721:79 722:79 723:79 724:79 725:79 726:79 727:79 728:79 729:79 730:79
731:79 732:79 733:79 734:79 735:79 736:79 737:79 738:79 739:79 740:79
741:79 742:79 743:79 744:79 745:79 746:79 747:79 748:79 749:79 750:79
751:79 752:79 753:79 754:79 755:79 756:79 757:79 758:79 759:79 760:79
761:79 762:79 763:79 764:79 765:79 766:79 767:79 768:79 769:79 770:79
771:79 772:79 773:79 774:79 775:79 776:79 777:79 778:79 779:79 780:79
781:79 782:79 783:79 784:79 785:79 786:79 787:79 788:79 789:79 790:79
791:79 792:79 793:79 794:79 795:79 796:79 797:79 798:79 799:79 800:79
801:79 802:79 803:79 804:79 805:79 806:79 807:79 808:79 809:79 810:79
811:79 812:79 813:79 814:79 815:79 816:79 817:79 818:79 819:79 820:79
821:79 822:79 823:79 824:79 825:79 826:79 827:79 828:79 829:79 830:79
831:79 832:79 833:79 834:79 835:79 836:79 837:79 838:79 839:79 840:79
841:79 842:79 843:79 844:79 845:79 846:79 847:79 848:79 849:79 850:79
851:79 852:79 853:79 854:79 855:79 856:79 857:79 858:79 859:79 860:79
861:79 862:79 863:79 864:79 865:79 866:79 867:79 868:79 869:79 870:79
871:79 872:79 873:79 874:79 875:79 876:79 877:79 878:79 879:79 880:79
881:79 882:79 883:79 884:79 885:79 886:79 887:79 888:79 889:79 890:79
891:79 892:79 893:79 894:79 895:79 896:79 897:79 898:79 899:79 900:79
901:79 902:79 903:79 904:79 905:79 906:79 907:79 908:79 909:79 910:79
911:79 912:79 913:79 914:79 915:79 916:79 917:79 918:79 919:79 920:79
921:79 922:79 923:79 924:79 925:79 926:79 927:79 928:79 929:79 930:79
931:79 932:79 933:79 934:79 935:79 936:79 937:79 938:79 939:79 940:79
941:79 942:79 943:79 944:79 945:79 946:79 947:79 948:79 949:79 950:79
951:79 952:79 953:79 954:79 955:79 956:79 957:79 958:79 959:79 960:79
961:79 962:79 963:79 964:79 965:79 966:79 967:79 968:79 969:79 970:79
971:79 972:79 973:79 974:79 975:79 976:79 977:79 978:79 979:79 980:79
981:79 982:79 983:79 984:79 985:79 986:79 987:79 988:79 989:79 990:79
991:79 992:79 993:79 994:79 995:79 996:79 997:80 998:80 999:81 1000:109
MyTest: subcounter=     1000
average with cutoff     79      cutoff at n=1000 for ct higher than 99 cycles
average w/o cutoff      78
difference raw-cutoff:  -1
Title: Re: Code location sensitivity of timings
Post by: hutch-- on May 20, 2009, 02:29:19 AM
Having played with enough algos over time, there does seem to be a pattern in the example I posted where setting the alignment to 16 suddenly makes the algo a lot slower, sometimes over 50% slower. It would seem that if the algorithm is dealing with byte data (i.e. processed one byte at a time) then it is more susceptible to alignment variations than if the algo deals with DWORD-sized data.

There is possibly another factor with the "ALIGN 16" directive: the padding can be up to 15 bytes long, usually done with two NOP-style instructions. While these should never be executed, as they sit before the label in a procedure, I have a sneaking suspicion that very long instructions stored near the front of the code cache can also have unusual effects in terms of timing.
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 20, 2009, 02:44:09 AM
well - i always try to use a simple instruction (or two) at a label
that gives the eip something to chew on while the queue fills up
like....

branch:
cld
mov eax,edx

(i.e. use instructions that access no memory)

i have always used a similar method all the way back to 8086
with that processor, the queue was 6 bytes
but, it was capable of pre-fetching and executing at the same time (second generation uP they called it i think)

Title: Re: Code location sensitivity of timings
Post by: NightWare on May 20, 2009, 03:14:46 AM
Quote from: hutch-- on May 20, 2009, 02:29:19 AM
I have a sneaking suspicion that very long instructions stored near the front of the code cache can also have unusual effects in terms of timing.
suspicion ? has anyone here read what i wrote ? i spoke of uops for a good reason... always remember that alignment (made by us) is there to make instruction decoding easier for the cpu... BUT it's not always sufficient... take a piece of paper and work through the possibilities for a cpu that decodes 4 or 6 uops per cycle, with a 7-byte instruction for example: depending on what was decoded before it, the instruction can land anywhere relative to the decoder's fetch blocks, whatever your align directive says... you will certainly understand... anyway, it's not something you can modify by coding, so why bother with it ?
Title: Re: Code location sensitivity of timings
Post by: hutch-- on May 20, 2009, 03:47:15 AM
NightWare,

Tolerate us mere mortals and explain what you have mentioned. I commented on something I have found over time that is not mentioned in any of the optimisation documentation I have seen over the last 20 years: effectively, a NON-EXECUTED instruction sequence provided by the assembler for the purposes of code alignment that appears to affect the timing of the executed code that follows it.
Title: Re: Code location sensitivity of timings
Post by: hutch-- on May 20, 2009, 06:34:32 AM
Now, while waiting for such pearls of wisdom, here is some interim testing that seems to trigger the timing change with different code alignments.


    align 16

    mov esp, [esp+edx*8+128]    ; non executed instruction 7 bytes

    REPEAT 64
      int 3
    ENDM

    ; retn

    align 4


The align 16 plus the 7-byte instruction, and a number of others, produced the slowdown of over 50%. Large paddings of NOPs (DB 90h) made no difference, but a RETN in some instances removed the slowdown, as did padding with INT 3 instructions.

I don't have a handy explanation but the effect is reproducible.
Title: Re: Code location sensitivity of timings
Post by: sinsi on May 20, 2009, 06:59:52 AM
I was wondering... with our new you-beaut CPUs and their cache variations, and Windows with its zillion threads, could the cache get 'saturated'?
Garbage collection?
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 20, 2009, 09:10:36 AM
Quote from: sinsi on May 20, 2009, 06:59:52 AM
I was wondering...with our new you beaut cpu's and variations with cache and windows with its zillion threads, could the cache get 'saturated'?
Garbage collection?

As far as I've understood, you get a fresh cache every time the OS hands over a time slice - but maybe I am wrong.

Quote from: hutch-- on May 20, 2009, 06:34:32 AM
The align 16, 7 byte instruction and a number of others produced the slowdown by over 50%. Large paddings of nops (DB90) did not effect the change but the use of a RETN in some instanced removed the slowdown as did a padding in INT 3 instructions.

I don't have a handy explanation but the effect is reproducable.

Lingo has the habit of starting procs with e.g.
align 16
db 0, 9, 3, 99
LingoCode proc ....

Maybe he has found the secret :wink

The only possibility I see is that the CPU erroneously tries to interpret non-executed code before the proc start, in order to "optimise" the real code - branch prediction etc... ::)

Anyway, for analysis we should keep two issues separate:
1. The alignment that influences speed all the time, e.g. a beneficial nop before a speed-critical loop
2. The observation that the same identical and identically aligned code runs half as fast if placed 16 bytes lower.

For point 2, see my code attached above. The algo runs x% of the time at full speed and (100-x)% at half speed, most probably because the OS takes over occasionally to do other things. I have tested the phenomenon further, and my "wrappers" for Michael's macros can effectively avoid it. Even with a modest invoke Sleep, 20 (only 20 ms) and LOOP_COUNT = 10000, I get a rock-solid timing of 151 cycles for mov eax, len(offset Tst50) (a 50-byte string). The non-cutoff timings instead waddle like a dog's tail, and the cutoff point varies, i.e. there is always a number of readings that are totally out of order.
Title: Re: Code location sensitivity of timings
Post by: NightWare on May 20, 2009, 11:40:00 PM
Quote from: hutch-- on May 20, 2009, 03:47:15 AM
I commented on something I have found over time that is not mentioned in any of the optimisation documentation I have ever seen over the last 20 years, effectively a NON-EXECUTED instruction sequence provided by the assembler for the purposes of code alignment that appears to effect the timing of the executed code that follows it.

you treat the cpu like a robot that always does the same thing (where alignment would fit well), but that's NOT the case. the cpu behaves differently depending on the instructions that were executed previously.

when YOU write an instruction, you "control" the alignment for the cpu. BUT you must remember that your app/code IS NOT the first and only app executed. when you launch an app (your code), other instructions have been executed before it and have filled the code/trace cache, and those instructions will not be evicted just because you want it or think they should be (newer instructions must be used more often, over some delay, to replace them - that is the principle of a cache).
then, to replace the instructions in the cache, they are fetched from MEMORY (and HERE memory alignment is useful).

now we come to HOW the cpu decodes the instructions from the cache, and remember the cache is only a few kb (so no space to waste). instructions (or parts of algos) are packed after other instructions (or parts of algos) to avoid "pollution".
but of course, to achieve that, the cpu uses internal locations to decode the instructions (we are NOT in memory-alignment logic anymore) but alignment INSIDE the code/trace cache, and that is something you CAN'T modify - it's decided by the cpu, NOT YOU.
and yes, since we are dealing with ANOTHER alignment, there are speed differences directly linked to it (and the number of uops your cpu can handle becomes essential).

jj, stop moving instructions up/down to gain speed; you should only do that to avoid stalls, nothing else. otherwise don't be surprised if the cache gives you a BIG slowdown when your aligned code suddenly becomes totally unaligned.
Title: Re: Code location sensitivity of timings
Post by: MichaelW on May 21, 2009, 01:21:09 AM
Quote from: jj2007 on May 19, 2009, 02:56:12 PM
But that still doesn't answer the question why the OS behaves like that more often if the same identical and identically aligned code sits a few bytes further down...

I'm having a problem understanding the "identically aligned code sits a few bytes further down" part. If you are using align 16, then any code that sits "a few bytes further down" is not identically aligned. If it is identically aligned, then it must be sitting some multiple of 16 bytes further down.
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 21, 2009, 01:49:38 AM
that seems to be the nature of what JJ is describing, Michael
if he uses align 16 - one result
moves it 16 bytes down - different result
it seems to happen on some processors - not others
it is sometimes difficult to reproduce
Title: Re: Code location sensitivity of timings
Post by: hutch-- on May 21, 2009, 03:29:12 AM
The exact reason I posted the following example is that you CAN CHANGE the effect of large slowdowns by avoiding potentially long instructions (7 bytes) just prior to the code that is run. While NOPs did not work, a RETN works OK in some circumstances, and padding with INT 3 instructions also seems to work in some instances.

I do not routinely use 16-byte alignment, as it has tested slow on many occasions, whereas the ALIGN 4 directive rarely causes a problem.


    align 16

    mov esp, [esp+edx*8+128]    ; non executed instruction 7 bytes

    REPEAT 64
      int 3
    ENDM

    ; retn

    align 4


Without having a definitive answer, I suggest it has something to do with the long instruction prior to the aligned code causing problems in the code cache, but at least there is a way to change it with a bit of experimentation. What varies from processor to processor is the block read size for the code cache, so fully avoiding the problem would mean object-module-level alignment matched to each processor's level 1 cache read size, which ends up impractical because of final code size.
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 21, 2009, 03:43:05 AM
it would be nice to know which processors do what
for most code, it can be overlooked, perhaps
but if there is a real time-crunching routine, it would be possible to load the algo dynamically at runtime, according to the host cpu
i suppose you could test the alignment at run-time as well, rather than IDing the cpu
clearly an advanced technique - we better leave it to the experts rofl
Title: Re: Code location sensitivity of timings
Post by: hutch-- on May 21, 2009, 06:37:04 AM
Dave,

What you have mentioned is reasonably simple to do: allocate executable memory large enough to hold the code plus the maximum alignment required, then copy the algo to the alignment you want at runtime. It's an old technique, used to bypass API wrappers for other functions at a lower level in the API function set.
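
A minimal sketch of the idea - AlgoStart, AlgoEnd and pCode are made-up names, and the routine between the labels must be position-independent (no absolute references):

    ; pCode dd ?  in .data?
    invoke VirtualAlloc, NULL, AlgoEnd-AlgoStart+64, MEM_COMMIT, PAGE_EXECUTE_READWRITE
    add eax, 63
    and eax, -64                ; choose the runtime alignment you want here
    mov pCode, eax
    invoke MemCopy, OFFSET AlgoStart, pCode, AlgoEnd-AlgoStart
    call pCode                  ; run the aligned copy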
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 21, 2009, 06:54:35 AM
Quote from: MichaelW on May 21, 2009, 01:21:09 AM

I'm having a problem understanding the "identically aligned code sits a few bytes further down" part. If you are using align 16, then any code that sits "a few bytes further down" is not identically aligned. If it is identically aligned, then it must be sitting some multiple of 16 bytes further down.


This is indeed what I meant - sorry for not being crystal clear. Simplified:
if MakeSlow
REPEAT 16
nop
ENDM
endif
align 16
MyTest proc
  ret
MyTest endp


And (@Nightware) it has nothing to do with shifting lines up or down to speed up the code. It is the same identical algo that produces different results. Note that you can eliminate that effect by using the sort & eliminate outliers technique described further up. In real apps, however, the effect might be present, depending on things you cannot easily influence, i.e. where that algo happens to be inserted.

Now one thing I have not yet tested is whether the effect was produced by the code immediately before the algo, i.e. the one that according to our most recent theory  :wink might influence the code cache (branch prediction whatever) behaviour...

This avenue looks promising. If Hutch finds that a retn immediately before the start of the algo speeds it up, it makes sense: Even a not-so-intelligent CPU should know that there is nothing to optimise...

Other candidates for non-optimisable instructions are int3, cpuid, rdtsc...
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 21, 2009, 10:30:52 AM
yes - cpuid seems like a good choice, as it forces serialization of instructions
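
for reference, the classic serialized read looks like this (a sketch - cpuid clobbers eax, ebx, ecx and edx):

    xor eax, eax
    cpuid                   ; serializing fence: all prior instructions complete
    rdtsc                   ; edx:eax = time stamp counter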
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 21, 2009, 09:42:59 PM
Good and bad news.
First the good news: I am able to nail down the speed difference to one byte:

Quote
   if MakeSlow
      print chr$(10, "Code sizes:", 13, 10)
;      341 cycles for addx$
;      584 cycles for add4$, old syntax
;      324 cycles for add4$, short syntax
   else
      print chr$(10, 10, "Code sizes:", 13, 10)
;      341 cycles for addx$
;      531 cycles for add4$, old syntax
;      192 cycles for add4$, short syntax
   endif

Timings are rock solid.

Now the bad news:
SLOW version
0040193C               ?  5E                         pop esi
0040193D               ?  5F                         pop edi
0040193E               ?  C2 0800                    retn 8
00401941              .  8DA424 00000000             lea esp, [local.0]
00401948              . 8DA424 00000000             lea esp, [local.1]
0040194F              ?  90                          nop
00401950              >  CC                          int3 <<<<<<<<<<< algo start
00401951              ?  57                          push edi
00401952              ?  56                          push esi

*** print chr$(10, "Code sizes:", 13, 10)
00401805              .  40                          inc eax                                                                  ; Arg1 => ASCII 0A,"Code sizes"
00401806              ?  00E8                        add al, ch                                                               ;
00401808              ?  90                          nop                                                                      ;
00401809              ?  0100                        add dword ptr [eax], eax                                                 ;
0040180B              ?  00CC                        add ah, cl
0040180D              ?  68 59C04000                 push offset AddStrB.0040C059                                             ; ASCII 0A,"Code sizes"
00401812              ?  E8 89030000                 call 00401BA0                                                            ;
00401817              ?  68 68C04000                 push offset AddStrB.0040C068                                             ; ASCII "szCatStr2    = "
0040181C              ?  E8 7F030000                 call 00401BA0

FAST version
0040193C               ?  5E                         pop esi
0040193D               ?  5F                         pop edi
0040193E               ?  C2 0800                    retn 8
00401941              .  8DA424 00000000             lea esp, [local.0]
00401948              . 8DA424 00000000             lea esp, [local.1]
0040194F              ?  90                          nop
00401950              >  CC                          int3 <<<<<<<<<<< algo start
00401951              ?  57                          push edi
00401952              ?  56                          push esi
00401953              .  8B7C24 0C                   mov edi, dword ptr [arg1]

*** print chr$(10, 10, "Code sizes:", 13, 10)
00401805              .  40                          inc eax                                                                  ; Arg1 => ASCII LF,LF,"Code sizes:",CR,LF
00401806              ?  00E8                        add al, ch                                                               ;
00401808              ?  90                          nop                                                                      ;
00401809              ?  0100                        add dword ptr [eax], eax                                                 ;
0040180B              ?  00CC                        add ah, cl
0040180D              ?  68 59C04000                 push offset AddStrB.0040C059                                             ; ASCII LF,LF,"Code sizes:",CR,LF
00401812              ?  E8 89030000                 call 00401BA0                                                            ;
00401817              ?  68 69C04000                 push offset AddStrB.0040C069                                             ; ASCII "szCatStr2    = "
0040181C              ?  E8 7F030000                 call 00401BA0


The algo and its location are identical.

Needless to say, I fumbled a lot with the align 16 plus specific fill bytes (retn, cpuid, ...) options. They have no effect, at least on the Celeron M where I could test this.

[attachment deleted by admin]
Title: Re: Code location sensitivity of timings
Post by: redskull on May 22, 2009, 02:09:53 AM
Alright, I need to ask a n00B question - why should any timing code be trusted that can't disable interrupts?  Every 15ms (at least) the executing thread grinds to a halt to service the kernel, which the high-performance counter wouldn't compensate for.  Additional hardware events and DPCs would compound the problem even more, right?  I was under the tacit assumption that any code timing, unless done in real mode with interrupts disabled, was AT BEST a good estimation.  Can someone explain why this is not so?

-r
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 22, 2009, 02:52:22 AM
well - they have been playing with different methods for a while in here
QueryPerformanceCounter, RDTSC, as well as SetPriorityClass and SetProcessAffinityMask
they have an idea of what some simple strings of code should return - so they can test their methods
but, the popular approach seems to be to run the test code about 10,000 times and pick one
of the lower result values, assuming that nothing else was going on during that run
it seems to return fairly repeatable results on single-core CPU's
i guess you'd say we are still fine-tuning the measurement methods
to obtain more reliable results on a wider variety of machines
i am playing with some code along that line, myself
Title: Re: Code location sensitivity of timings
Post by: hutch-- on May 22, 2009, 03:52:08 AM
It's something like being a lone voice crying in the wilderness, but until real-time testing on samples large enough to emulate real-time situations becomes the norm, the results will continue to be all over the place like a mad woman's sewerage.

Short looped test conditions do have their place but it tends to be in the area of instruction sequence testing, not in larger scale algorithm testing where cache effects, memory paging and bare data read/write times take effect. As hardware gets further away from the old pre-pipelined single instruction munchers, many of these assumptions are simply out of date. The only factor that remains constant is real time, the rest vary with hardware changes.
Title: Re: Code location sensitivity of timings
Post by: dedndave on May 22, 2009, 04:06:02 AM
lol "mad womans sewerage" - do i wanna know what that is ? - reminds me of the ex
i think there is still a chance to get some meaningful data on algos
so long as the data set you select emulates the real world
some of us are not all that interested in "how many ticks does it take"
i am more interested in obtaining some kind of usable comparisons
if things are set up right, and you take into account that different algos work differently on different data sets
you should be able to get enough information to help select the right algo for the job
this algo handles short strings better but is slow on long strings
that algo doesn't care how long the string is, it's not fast anywhere
these generalizations should help you decide what is best for a given application
if you can write the algo 3 ways - it is a matter of copy/paste in the finished program to help decide
i do think there is a lot of room for improvement in the test methods, however
it will never be perfect, of course - but it can be better than it is - we will get there


EDIT: i was playing with some code today and getting nice continuous time values on a dual-core, at least
Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 22, 2009, 07:45:57 AM
Quote from: redskull on May 22, 2009, 02:09:53 AM
Alright, I need to ask a n00B question - why should any timing code be trusted which can't disable the interrupts?  Every 15ms (at least) the executing thread grinds to a halt to service the kernel

The TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise). All readings are done in well under 15ms (actually, this test with 1,000 readings takes 15 ms including all overheads). And readings are very consistent - 967 out of 1,000 tests are in a range of +- 2 cycles around the average:

Quote
0:744 1:744 2:744 3:744 4:744 5:744 6:744 7:744 8:744 9:744 10:744
...
951:746 952:746 953:746 954:746 955:746 956:746 957:746 958:747 959:751 960:756
961:780 962:807 963:819 964:834 965:839 966:845 967:971 968:973 969:974 970:976
971:986 972:989 973:989 974:990 975:991 976:991 977:992 978:992 979:993 980:994
981:997 982:999 983:1000 984:1001 985:1002 986:1011 987:1016 988:1025 989:1025
990:1030 991:1033 992:1033 993:1035 994:1044 995:1045 996:1052 997:1056 998:1111
999:1136 1000:1209

average with cutoff     746     cutoff at n=967 for ct>=931 cycles
average w/o cutoff      754
difference raw-cutoff:  8

Here is the same for the fast version - identical code at identical position, see earlier post above:
Quote
0:259 1:259 2:259 3:259 4:259 5:259 6:259 7:259 8:259 9:259 10:259
...
951:261 952:261 953:261 954:261 955:261 956:261 957:261 958:261 959:261 960:261
961:261 962:261 963:261 964:261 965:261 966:261 967:261 968:261 969:261 970:261
971:261 972:261 973:261 974:261 975:261 976:261 977:261 978:261 979:261 980:261
981:262 982:262 983:262 984:262 985:262 986:267 987:323 988:374 989:422 990:488
991:488 992:491 993:492 994:500 995:506 996:510 997:511 998:523 999:530 1000:758

MyTest
average with cutoff     259     cutoff at n=988 for ct>=324 cycles
average w/o cutoff      262
difference raw-cutoff:  3

Note these timings are for a P4, so it is not even CPU-specific. But test yourself - I have attached two executables now.

[attachment deleted by admin]
Title: Re: Code location sensitivity of timings
Post by: MichaelW on May 22, 2009, 09:12:12 AM
Quote
The TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise).

Only the second set of macros (counter2.asm in counter2.zip) do this. These macros also capture the lowest cycle count that occurs in any loop.

Title: Re: Code location sensitivity of timings
Post by: jj2007 on May 22, 2009, 10:09:37 AM
Quote from: MichaelW on May 22, 2009, 09:12:12 AM
Quote
The TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise).

Only the second set of macros (counter2.asm in counter2.zip) do this. These macros also capture the lowest cycle count that occurs in any loop.


Oops, I stand corrected - counter2.asm, lines 57 and 106:

invoke Sleep, 0    ;; Start a new time slice

If that is sufficient to guarantee a new slice, it can be done with a
Quote
cyctCutFast =   10
cyctSleepMs =    0

See in cyct_begin:
  ifdef cyctSleepMs
invoke Sleep, cyctSleepMs
  endif


And unfortunately, it has no influence on the observed speed difference, so that does not seem to be the culprit...

Re lowest cycle count, my wrapper uses CombSortA to eliminate the outliers. On some occasions, I saw negative cycle counts, too.
Title: Re: Code location sensitivity of timings
Post by: Mark Jones on May 22, 2009, 01:29:40 PM
AMD x64 4000+ / WinXP Pro x64

AddStrBSlow:
average with cutoff     215     cutoff at n=995 for ct>=269 cycles
average w/o cutoff      222
difference raw-cutoff:  7



AddStrBFast:
average with cutoff     230     cutoff at n=987 for ct>=288 cycles
average w/o cutoff      239
difference raw-cutoff:  9
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 09, 2009, 01:56:05 PM
Another cute example (d7s8=destination aligned 16+7, source 16+8):

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Algo              MemCo1    MemCo2    MemCo5   MemCopy _imp__str
Description    rep movsd    movdqa    movdqa    Masm32       CRT
                 dest-al  src-al A  src-al B   library    strcpy
Code size             63       201       270         ?         ?
----------------------------------------------------------------
2048, d0s0           729       603       602      1113      2389
2048, d7s8          1585      1220       831      1615      5430
2048, d7s9          1586      1226       831      4438      4948


Same code with MakeSlow   = 1:
Algo              MemCo1    MemCo2    MemCo5   MemCopy _imp__str
Description    rep movsd    movdqa    movdqa    Masm32       CRT
                 dest-al  src-al A  src-al B   library    strcpy
Code size             63       201       270         ?         ?
----------------------------------------------------------------
2048, d7s8          1562      7280      7257      1622      5423
2048, d7s9          1562      7284      7271      5237      4923


if MakeSlow ; used exactly once, before the table
print chr$(13, 13)
endif


Good to know that others have the same problem:
I'm a developer for the x264 video encoder. (http://software.intel.com/en-us/forums/watercooler-catchall/topic/56271/)  When Loren Merritt, another developer, was testing SAD assembly operation performance, he noticed that the number of clocks required varied wildly.  In particular, when the unaligned data to be loaded crossed a cache line boundary, which occurred 25% of the time for blocks of width 16 bytes, up to eight times as many clock cycles were required for the entire operation, depending on the processor.  This issue was found to exist not only for SSE2 load operations like MOVDQU, but also for MMX, and on processors going all the way back to the Pentium 3 (though obviously the cache line intervals varied).  No older processors were tested than the Pentium 3.  Page lines were also encountered every 64 cache lines, i.e. 1/256th of the time.  Data crossing a page line resulted in up to 100 times slower performance; thousands of clock cycles for a single 16x16 SAD operation that normally took 48.  However, when testing the same operations on AMD processors, we noticed there was no measurable penalty at all for unaligned loads across cache lines, and furthermore, the page line penalty was a mere 5-10%.
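
To picture the trap with 64-byte cache lines: a 16-byte load takes the hit only when it straddles a line boundary. A sketch (offsets made up, esi assumed 16-byte aligned):

    movdqu xmm0, [esi+16]   ; bytes 16..31, within one line: no penalty
    movdqu xmm1, [esi+57]   ; bytes 57..72, straddles the line boundary: slow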


[attachment deleted by admin]
Title: Re: Code location sensitivity of timings
Post by: hutch-- on June 09, 2009, 02:01:20 PM
JJ,

To force SleepEx to actually take a time slice, set it to over 20 ms and you are reasonably safe; set to 0, it can return immediately without reference to a time slice.

I usually use,


    invoke SleepEx,100,0


By doing it this way you get the fresh boost directly after the task switch, and it's very uniform from one instance of SleepEx() to another.
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 09, 2009, 02:23:10 PM
Quote from: hutch-- on June 09, 2009, 02:01:20 PM

    invoke SleepEx,100,0


By doing it this way you get the fresh boost directly after the task switch, and it's very uniform from one instance of SleepEx() to another.

Thanks, Hutch. I'll implement it next time, although it seems to have no influence on the timings. In the attachment above, I use the cutoff method, i.e. I take a thousand timings and use the average from n=10 to the first one that is 25% slower than cycles(10). The cutoff usually takes place at >990 for non-P4 and between 500...950 for P4. The resulting timings are very stable.

The problem here is not the measurement. That code is really slow, under specific conditions, and some call it "Cacheline splits aka Intel hell (http://x264dev.blogspot.com/2008/05/cacheline-splits-aka-intel-hell.html)"  :toothy

P.S.: The code posted above has a new set of macros, wrapping the well-known TIMER.ASM macros by MichaelW. The table above was produced with this code (apologies to the macro haters :wink):

.code
MyString db "A 16 bytes text.", 0

start:
invoke ShowCpu, 1 ; prints the brand string and returns SSE level in eax
mov ebx, ByteCount/16+1
.Repeat
invoke lstrcat, addr src16, offset MyString
dec ebx
.Until Zero?

cyctProcs equ MemCo1, MemCo2, MemCo5, MemCopy, crt_strcpy

cyctTable "2048, d0s0", offset dest16, offset src16, 2048
cyctTable "2048, d8s8", offset dest8, offset src8, 2048
cyctTable "2048, d7s7", offset dest7, offset src7, 2048
cyctTable "2048, d7s8", offset dest7, offset src8, 2048
cyctTable "2048, d7s9", offset dest7, offset src9, 2048
cyctTable "2048, d9s7", offset dest9, offset src7, 2048
cyctTable "2048, d9s8", offset dest9, offset src8, 2048
cyctTable "2048, d9s9", offset dest9, offset src9, 2048
inkey chr$(13, 10, 10, "--- ok ---", 13)
exit
Title: Re: Code location sensitivity of timings
Post by: UtillMasm on June 09, 2009, 02:27:39 PM
:U
@echo off
\masm32\bin\ml.exe /FoMemCopySSE2.obj /nologo MemCopySSE2.asm
pause
Assembling: MemCopySSE2.asm
MemCopySSE2.asm(3) : fatal error A1000: cannot open file : \masm32\Gfa2Masm\Gfa2
Masm.inc
Press any key to continue . . .
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 09, 2009, 03:53:58 PM
Quote from: UtillMasm on June 09, 2009, 02:27:39 PM
fatal error A1000: cannot open file : \masm32\Gfa2Masm\Gfa2Masm.inc

Oops... download new version above, or change lines 2+3 as follows:

include \masm32\include\masm32rt.inc
; include \masm32\Gfa2Masm\Gfa2Masm.inc   ; needed only for deb macro
Title: Re: Code location sensitivity of timings
Post by: dedndave on June 09, 2009, 04:00:09 PM
hey Jochen,
  I was looking at this post you referred to...
http://software.intel.com/en-us/forums/watercooler-catchall/topic/56271/
they are using "ffmpegs bench.h" to time algos
do you know where to find that?
i looked high and low for it
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 09, 2009, 08:23:52 PM
Quote from: dedndave on June 09, 2009, 04:00:09 PM
hey Jochen,
  I was looking at this post you referred to...
http://software.intel.com/en-us/forums/watercooler-catchall/topic/56271/
they are using "ffmpegs bench.h" to time algos
do you know where to find that?
i looked high and low for it

Apparently ffmpeg is a suite for Linux that also includes a -bench option...
Title: Re: Code location sensitivity of timings
Post by: dedndave on June 09, 2009, 11:06:56 PM
well - i found ffmpeg.org - they make an open source mpeg player/editor/whatever
i was interested in their timing code
the file i was trying to find was named "bench.h"
Title: Re: Code location sensitivity of timings
Post by: Mark Jones on June 10, 2009, 02:29:39 AM
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

Algo              MemCo1    MemCo2    MemCo5   MemCopy _imp__str
Description    rep movsd    movdqa    movdqa    Masm32       CRT
                 dest-al  src-al A  src-al B   library    strcpy
Code size             63       201       270         ?         ?
----------------------------------------------------------------
2048, d0s0           549       424       361       546      2094
2048, d8s8           585       465       403       555      2100
2048, d7s7           593       478       411      1390      2088
2048, d7s8           834       562       564      1055      2172
2048, d7s9           834       564       564      1187      2175
2048, d9s7           833       564       565      1561      2173
2048, d9s8           834       562       566      1216      2172
2048, d9s9           575       472       410      1399      2091
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 11, 2009, 08:05:17 PM
Got some weird timings again:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Algo         _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4   MemCo5   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
               strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
---------------------------------------------------------------------------
2048, d0s0-0     2591      566      363      363      373      363      560
2048, d1s1-0     2598      619      421      423      444      423     1726
2048, d7s7-0     2584      619      419      421      446      421     1735
2048, d7s8-1     2782     1525     1090      441      956     1033     1527
2048, d7s9-2     2788     1526     1090      441      957     1033     1755
2048, d8s7+1     2786     1309     1090      683      809      815     1467
2048, d8s8-0     2581      619      421      423      448      423      560
2048, d8s9-1     2789     1526     1083      441      956     1033     1465
2048, d9s7+2     2790     1309     1081      684      809      815     1802
2048, d9s8+1     2782     1309     1081      684      809      815     1511
2048, d9s9-0     2589      619      421      423      448      423     1737
2048, d15s15     2582      619      423      425      446      424     1723


MemCoP4 is optimised for the P4. MemCo2 should be insensitive to the cacheline split trap, as it does no unaligned memory accesses.
MemCoC3 is optimised for the Celeron, and pretty fast, but very sensitive to code location.
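The idea behind the lps+hps variant, as a minimal sketch (not the actual MemCoC3 code): two 8-byte loads can never straddle a cacheline as long as their addresses are 8-byte aligned, because 64 is a multiple of 8.

movlps xmm0, qword ptr [esi]    ; low 8 bytes
movhps xmm0, qword ptr [esi+8]  ; high 8 bytes - xmm0 now holds 16 bytes,
movdqa [edi], xmm0              ; loaded without any cacheline split
                                ; (dest assumed 16-byte aligned here)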

Grateful for tests on other CPUs, jj - [EDIT] with the new MemCopySSE2c.zip, see two posts further down.
Title: Re: Code location sensitivity of timings
Post by: ramguru on June 11, 2009, 08:22:40 PM

Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz (SSE4)

Algo         _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4   MemCo5   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
               strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
---------------------------------------------------------------------------
2048, d0s0-0     1985      191      204      205      203      175      196
2048, d1s1-0     1992      223      251      248      251      222      244
2048, d7s7-0     1982      227      253      250      248      223      243
2048, d7s8-1     1976      246      592      501      193      211      243
2048, d7s9-2     1986      247      594      501      194      212      243
2048, d8s7+1     1982      249      587      501      193      179      244
2048, d8s8-0     1976      229      255      253      251      226      243
2048, d8s9-1     1983      249      588      502      194      211      243
2048, d9s7+2     1982      248      583      501      193      179      243
2048, d9s8+1     1977      249      583      501      193      179      243
2048, d9s9-0     1984      229      255      253      251      226      244
2048, d15s15     1974      228      256      256      258      226      243


--- ok ---
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 11, 2009, 10:02:07 PM
Core 2 shows dramatic improvements...!

I have renamed MemCo5 to MemCoC2, for Core 2. And I moved the somewhat faster loop for cases d7s8-1 etc. from MemCoP4 to the new MemCoC2.


[attachment deleted by admin]
Title: Re: Code location sensitivity of timings
Post by: Mark Jones on June 11, 2009, 10:07:00 PM
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

Algo         _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
               strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
---------------------------------------------------------------------------
2048, d0s0-0     2079      551      359      424      424      359      547
2048, d1s1-0     2084      613      410      473      473      410     1060
2048, d7s7-0     2080      598      412      474      474      411     1059
2048, d7s8-1     2156      853     1016      564      567      569      802
2048, d7s9-2     2162      859     1016      564      567      568     1058
2048, d8s7+1     2172      849      868      564      565      566      804
2048, d8s8-0     2082      603      404      465      465      402      547
2048, d8s9-1     2177      848      995      564      565      568      803
2048, d9s7+2     2156      855      862      581      565      580     1060
2048, d9s8+1     2167      869      878      564      567      566      803
2048, d9s9-0     2090      595      412      472      472      409     1060
2048, d15s15     2064      592      410      470      486      408     1060
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 11, 2009, 10:19:19 PM
Thanks, Mark. Pretty balanced over the MemCoC3  MemCoP4  MemCoC2 columns.
But I am surprised about the cycle counts of ramguru's i7. The Nehalem Wiki (http://en.wikipedia.org/wiki/Intel_Core_3#Performance) is full of overclocking talk, but they seem to be barking up the wrong tree: that CPU runs the Masm32 library MemCopy in 240 cycles, compared to ca. 1700 on the Celeron and over 4000 on the P4, for misaligned data ::)
Title: Re: Code location sensitivity of timings
Post by: BlackVortex on June 12, 2009, 12:13:35 AM
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (SSE4)

Algo         _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
               strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
---------------------------------------------------------------------------
2048, d0s0-0     1647      360      210      210      209      165      367
2048, d1s1-0     1649      396      259      266      258      219     1876
2048, d7s7-0     1631      402      265      261      261      218     1876
2048, d7s8-1     2159     1332      861      493      658      670     1439
2048, d7s9-2     2188     1338      862      493      658      666     1906
2048, d8s7+1     2151     1328      855      829      701      700     1393
2048, d8s8-0     1639      402      267      262      268      236      365
2048, d8s9-1     2205     1333      854      493      658      667     1290
2048, d9s7+2     2151     1329      849      828      700      701     1877
2048, d9s8+1     2143     1330      849      830      701      701     1471
2048, d9s9-0     1642      403      266      262      268      235     1884
2048, d15s15     1642      404      270      264      265      235     1893
Title: Re: Code location sensitivity of timings
Post by: dedndave on June 12, 2009, 01:02:16 AM
those core duo's are pretty nice, too   :U
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 13, 2009, 10:09:19 PM
I collected some evidence that could solve the "location sensitivity" mystery - it seems to be much more of a cacheline split penalty (http://www.google.com/search?q=cacheline+split+penalty) issue. Only Intel CPUs are affected, by the way. It also explains why ramguru's i7 is so blazingly fast...
The attachment contains timings at the bottom. It is RTF, written with RichMasm (http://www.masm32.com/board/index.php?topic=9044.msg73814#msg73814) but WordPad opens it, too.

Cygnus: (http://www.cygnus-software.com/papers/infinitytestresults.html)
Pentium has 3 times penalty for loading across QWORD boundaries.
Pentium has 3 times penalty for loading across cache line boundaries.
Pentium has 3 times penalty for loading across page boundaries.
Pentium has 6 times penalty for saving across QWORD boundaries.
Pentium has 6 times penalty for saving across cache line boundaries.
Pentium has 16 times penalty for saving across page boundaries.

Pentium III has 0.7 times!!! 'penalty' for loading across QWORD boundaries.
Pentium III has 5.5 times penalty for loading across cache line boundaries.
Pentium III has 47 times penalty for loading across page boundaries.
Pentium III has 0.7 times!!! 'penalty' for saving across QWORD boundaries.
Pentium III has 4.5 times penalty for saving across cache line boundaries.
Pentium III has 64 times penalty for saving across page boundaries.

Pentium IV has no penalty for loading across QWORD boundaries.
Pentium IV has 5.5 times penalty for loading across cache line boundaries.
Pentium IV has 20 times penalty for loading across page boundaries.
Pentium IV has no penalty for saving across QWORD boundaries.
Pentium IV has 22 times penalty for saving across cache line boundaries.
Pentium IV has 23 times penalty for saving across page boundaries.

Doom9: (http://forum.doom9.org/archive/index.php/t-131033.html)
The best case would be to compare only two aligned blocks, but we must shift either the source
or the destination block freely, so one of them will be unaligned in most cases. The trouble is
that only MMX tolerates unaligned access; most SSE instructions don't. To make matters worse,
any newer Intel chip has a noticeable penalty for a memory access across physical cache line borders,
and an even worse penalty if that also happens to coincide with a memory page border. So there are
unfortunately only two possible options right now: use simple MMX and ignore the penalty (which is
essentially using the old iSSE functions), or use a workaround like in x256sad and align the memory
blocks (this could be done once for each frame if all data is sized accordingly (padding and the like)).
Compensating on both sides, source and destination, is only possible for MMX, but the overhead should
be far greater than the gain (OK, it would be possible, but it is certainly slower).

h264 (https://svn.blender.org/svnroot/bf-blender/trunk/blender/extern/x264/common/i386/pixel-sse2.asm)
Core2 (Conroe) can load unaligned data just as quickly as aligned data...
unless the unaligned data spans the border between 2 cachelines, in which
case it's really slow. The exact numbers may differ, but all Intel cpus
have a large penalty for cacheline splits.
(8-byte alignment exactly half way between two cachelines is ok though.)
LDDQU was supposed to fix this, but it only works on Pentium 4.
So in the split case we load aligned data and explicitly perform the
alignment between registers. Like on archs that have only aligned loads,
except complicated by the fact that PALIGNR takes only an immediate, not
a variable alignment.
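To illustrate the workaround described there - a minimal sketch, assuming SSSE3 (Core 2 or later) and a source misaligned by 4 bytes, with esi rounded down to the previous 16-byte boundary:

movdqa xmm0, [esi]      ; aligned load: bytes -4..11 of the wanted data
movdqa xmm1, [esi+16]   ; aligned load: bytes 12..27
palignr xmm1, xmm0, 4   ; shift the concatenated pair right by 4 bytes:
                        ; xmm1 now holds bytes 0..15, with no split load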

Phaeron - 12 03 08 - 00:49 (http://www.virtualdub.org/blog/pivot/entry.php?id=190)
On current Intel CPUs, an L1 cache line is fetched and pushed through a shifter, which accesses the
desired words. If the access crosses L1 cache lines, a data cache unit split event occurs (with
associated penalty), bytes are shifted out from the adjacent L1 cache line, and the results are
combined to produce the misaligned words.

x264 (http://www.digital-digest.com/software/x264_history.html)
movaps/movups are no longer equivalent to their integer equivalents on the Nehalem, so that
substitution is removed. Nehalem has a much lower cacheline split penalty than previous Intel CPUs,
so cacheline workarounds are no longer necessary.
...
Rev. 696: avoid memory loads that span the border between two cachelines.
on core2 this makes x264_pixel_sad an average of 2x faster

Nehalem optimizations: the powerful new Core i7 (http://x264dev.multimedia.cx/?cat=1)
The cacheline split problem is basically gone: the penalty is now a mere 2 clocks instead of 12 for a
cacheline-split load. This, combined with the SSE speed improvements, ... took 150 cycles on Penryn
without cacheline split optimization, 111 cycles with, and takes 62 cycles on the Nehalem
...
Intel has finally come through on their promise to make float-ops-on-SSE-registers-containing-integers
have a speed penalty. So, we removed a few %defines throughout the code that converted integer ops
into equivalent, but shorter, floating point instructions. Unfortunately, there seems to be no way to
completely avoid floatops on integer registers, as many of these operations have no integer equivalents.
A classic example is "movhps", which takes an 8-byte value from memory and puts it into the high section
of a 16-byte SSE register. In integer ops, one can only directly move into the low 8-byte section of the register.
Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, ...
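For illustration, the movhps case next to its integer-only emulation - a minimal sketch, register and address names arbitrary:

movq   xmm0, qword ptr [esi]    ; integer move: can only fill the LOW 8 bytes
movhps xmm0, qword ptr [esi+8]  ; float op: fills the HIGH 8 bytes directly

; the integer-only detour needs a second register and a shuffle:
movq       xmm0, qword ptr [esi]
movq       xmm1, qword ptr [esi+8]
punpcklqdq xmm0, xmm1           ; merge the two low qwords: same result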

[attachment deleted by admin]
Title: Re: Code location sensitivity of timings
Post by: ramguru on June 13, 2009, 11:05:25 PM
ha-ha, the file in the last attachment has an .asc extension .. took me some time to figure out that it should be .rtf  :lol
Title: Re: Code location sensitivity of timings
Post by: NightWare on June 14, 2009, 12:47:56 AM
Quote from: jj2007 on June 13, 2009, 10:09:19 PM
I collected some evidence that could solve the "location sensitivity" mystery
really ?

>any newer Intel chip has a noticeable penalty for a memory access across physical cache line borders
yep, but it's well known and since a long time, search after "memory aliasing" in docs.

>all Intel cpus have a large penalty for cacheline splits.
same as before...

>If the access crosses L1 cache lines, a data cache unit split event occurs (with
associated penalty), bytes are shifted out from the adjacent L1 cache line, and the results are
combined to produce the misaligned words.
:bdg obviously those guys confuse cache lines with core memory clusters.

>movaps/movups are no longer equivalent to their integer equivalents on the Nehalem
? by virtue of what ? bits are bits !

>float-ops-on-SSE-registers-containing-integers have a speed penalty
same as before, the data movements are the same ! and what would be the interest of using other float operations ?

>A classic example is "movhps", which takes an 8-byte value from memory and puts it into the high section
of a 16-byte SSE register. In integer ops, one can only directly move into the low 8-byte section of the register.
Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, ...
pffft... the problem is the cost of the manipulations needed to access the high part of the register; it's the same problem between unpcklps and unpckhps, unpcklbw and unpckhbw, punpckldq and punpckhdq, etc...
Title: Re: Code location sensitivity of timings
Post by: MichaelW on June 14, 2009, 01:29:30 AM
Perhaps a (partial) solution would be to use a larger alignment, specifically an alignment that matches the processor's cache line size or some integer multiple of it. And thinking further, you should be able to control the instruction sizes or add padding so that no instruction would split a cache line boundary.
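Something like this, as a minimal sketch - PAGE gives the segment 256-byte alignment, so ALIGN 64 inside it is legal and each routine can start on a fresh cache line (names are illustrative):

AlignedCode segment page public flat 'CODE'
align 64
MemCoTest proc
    ; speed-critical loop goes here, padded so that
    ; no instruction straddles a 64-byte boundary
    ret
MemCoTest endp
AlignedCode ends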

Title: Re: Code location sensitivity of timings
Post by: dedndave on June 14, 2009, 01:44:24 AM
that sounds like a lot of work
perhaps for very intensive speed critical areas
but, i think at some point you have to play the law of averages
that is, in part, what intel uses as a guideline for design (hopefully)
Title: Re: Code location sensitivity of timings
Post by: jj2007 on June 14, 2009, 07:23:04 AM
They are not talking about the instruction cache. The data are the problem.
Title: Re: Code location sensitivity of timings
Post by: dedndave on June 14, 2009, 11:17:02 AM
yes - i understand that
if you open a file buffer and read 64 kb of a file (or more), you are going to play the law of averages
otherwise, you are going to spend more cpu time thrashing things about than you save
if you are talking about declared data, most of that is used only a few times during execution
and the part that isn't can be adjusted
like i say - there is a point where it won't pay to optimize - let the law of averages take over
on certain portions of data, you will get lucky
on other portions, you will not
intel has designed the system so that you are lucky much more often than not
(of course, that's just a theory - lol)
Title: Re: Code location sensitivity of timings
Post by: Rockoon on July 27, 2009, 10:49:23 PM
Some thoughts (in regard to the original thread's two issues: variable timings and consistent alignment effects)

---

On the issue of those spurious higher clock timings, even when you wait for the start of a timeslice.

Not all interrupts are timer-based. Your ethernet adapter may trigger an IRQ. Your video card may trigger an IRQ. Your disk drives may trigger an IRQ. Your USB ports may trigger an IRQ. Your mouse can trigger an IRQ.

All IRQs are more important than your thread. When an IRQ happens, your thread is stopped abruptly so that the interrupt handler (drivers, etc.) may service it.

---

On the issue of caches.

There are plenty of important concepts that need to be considered. Perhaps the most important on the Pentium 4, but also the most frustrating, is Intel's "trace cache." Here you are dealing with some alignment issues that are completely hidden from you. Instructions within the trace cache have their own alignment, which is independent of your binary. The stuff in the trace cache does not have a 1:1 correspondence with the x86 instructions you wrote. It's all in microcode, a language below x86.

Moving along.. physical memory is organized into cache lines, and each line can only be stored in specific cache sets. In a "4-way set associative" cache, a given cache line has only 4 unique slots within the cache in which it can be stored. The worst-case scenario is when you are constantly accessing 5 or more spots of memory that are all assigned to the same set: 5 lines cannot fit in 4 slots, as in the sketch below.
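A minimal sketch of that worst case, assuming a 16 KB 4-way L1 data cache with 64-byte lines (64 sets, so addresses 4096 bytes apart share a set); BigBuf is a hypothetical buffer of at least 20 KB:

    mov esi, offset BigBuf  ; BigBuf db 5*4096 dup(?)
    mov ecx, 100000
@@: mov eax, [esi]          ; five lines, all mapping to the same set -
    mov eax, [esi+4096]     ; only four can stay resident, so every pass
    mov eax, [esi+8192]     ; evicts one of them
    mov eax, [esi+12288]
    mov eax, [esi+16384]
    dec ecx
    jnz @B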

---

On the issue of code alignment in regards to cache.

Obsessive aligning can be a detriment. Putting code in the cache that will never get executed definitely has its downsides. Those dummy instructions were moved along the bus from system RAM into L3, then again into L2, and again into L1, only to have a 0% chance of ever being needed.

---

On the issue of code alignment in regards to branch prediction.

This is the second biggest issue, and also the second most frustrating. It is true for both AMDs and Intels, with the least effect (that I know of) on Core2s... I am still unfamiliar with i7s.

The forward branch prediction logic uses its own little bit of memory that can be considered a cache. Here is the deal: every memory address is assigned an index within this branch cache. When a conditional branch is encountered, this little branch cache is updated (remembering what happened... YES it was taken, or NO it wasn't).

Lots of memory addresses share the same index, and this creates contention. Several different conditional branch instructions within your code can interfere with each other in regards to branch prediction. The Core2 luckily remembers the last several branches for a given index, and is able to handle simple patterns pretty well (YES/NO/YES/NO is terrible on AMD64s, which always get it wrong, but is not a problem for the Core2 - see the sketch below).
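A minimal sketch of that alternating pattern (illustrative only):

    xor eax, eax
    xor edx, edx
    mov ecx, 1000000
top_:
    xor edx, 1          ; edx alternates 0,1,0,1,...
    test edx, edx
    jz short skip_      ; taken every second pass: trivial for Core2's
    inc eax             ; pattern history, worst case for a predictor
skip_:                  ; without one
    dec ecx
    jnz top_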

---

Deal only with real world scenarios or it doesn't count.
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 27, 2009, 11:00:34 PM
Quote from: Rockoon on July 27, 2009, 10:49:23 PM
In a "4-way set associative" cache, a specific cache line only has 4 unique slots within the cache that it can be stored. The worst-case scenario is when you are constantly accessing 5 or more spots of memory that are all assigned the same set. 5 lines cannot fit in the 4 slots.
Food for thought. That could indeed be the culprit.

Good to see you back, Rockoon :thumbu
Title: Re: Code location sensitivity of timings
Post by: Rockoon on July 29, 2009, 11:55:52 PM
Quote from: jj2007 on July 27, 2009, 11:00:34 PM
Quote from: Rockoon on July 27, 2009, 10:49:23 PM
5 lines cannot fit in the 4 slots.
Food for thought. That could indeed be the culprit.

Good to see you back, Rockoon :thumbu

This brings to mind a simple test program that can be written: allocate a nice big buffer of memory and then test various steppings through it. You could even write something like this in Visual Basic (as I have done previously) and get meaningful results, since cache misses are extremely expensive and thus easily measured.
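A minimal MASM version of that test might look like this - BufSize and the stride are illustrative, and the raw rdtsc difference is good enough here because the misses dominate:

include \masm32\include\masm32rt.inc
.586                            ; for rdtsc
BufSize = 4*1024*1024
.data?
BigBuf db BufSize dup(?)
.code
start:
    mov esi, offset BigBuf
    mov edi, 64                 ; the stepping - vary this: 4, 64, 4096, ...
    rdtsc
    push eax                    ; keep the low dword of the start count
    mov ecx, BufSize
@@: mov al, [esi]               ; touch one byte per step
    add esi, edi
    sub ecx, edi
    ja @B
    rdtsc
    pop edx
    sub eax, edx                ; elapsed clocks (low dword is enough here)
    print str$(eax), " clocks", 13, 10
    exit
end start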