Code location sensitivity of timings

jj2007 · May 18, 2009, 12:30:16 PM

This is a spin-off from the add$ thread, regarding a problem in the add$ macro, and a faster technique of concatenating strings with add$. I stumbled again over a code location problem:

809 cycles for addx$
848 cycles for add3$, old syntax
767 cycles for add4$, old syntax
258 cycles for add3$, short syntax
227 cycles for add4$, short syntax

Exactly the same algo, but moved a bit further down using the MakeSlow=1 switch in line 6:

774 cycles for addx$
826 cycles for add3$, old syntax
1214 cycles for add4$, old syntax
261 cycles for add3$, short syntax
763 cycles for add4$, short syntax

Both code locations are align 16, timings are for Prescott P4. It seems it would be worthwhile to look seriously into this issue. JimG's approach to move the code to a constant location is fine for testing purposes (I had no time to test it on this one), but it raises the question of how to optimise real life applications...

[attachment deleted by admin]

hutch-- · May 18, 2009, 01:14:26 PM

JJ,

The effect is not uncommon and I don't have a definitive answer. I have even played with padding the code either side with NOPS and ZEROS to get 1024 byte alignment and you still get the variation.

It is my guess that it is a combined code alignment/cache content pollution effect and it effects algos with high intensity loop code more than it effects algos that are logic based.

Also try a different type of timing test as this may be effected by the effects mentioned above. Long linear tests are far less sensitive to code location variation than short looped code methods.

Isn't code optimisation PHUN ?

Jimg · May 18, 2009, 01:18:22 PM

Usually the Amd is very sensitive, but with your test, I saw virtually no difference between the two.

jj2007 · May 18, 2009, 01:33:47 PM

Quote from: hutch-- on May 18, 2009, 01:14:26 PM
Isn't code optimisation PHUN ?

It rocks :cheekygreen:

Quote from: Jimg on May 18, 2009, 01:18:22 PM
Usually the Amd is very sensitive, but with your test, I saw virtually no difference between the two.

Thanks. I am not surprised. A factor 3 slower just because it's a different CPU ::)

dedndave · May 18, 2009, 01:40:05 PM

clear the cache - then run the tests ?
this is on the celeron cpu ? - or prescott ?

jj2007 · May 18, 2009, 02:37:17 PM

Quote from: dedndave on May 18, 2009, 01:40:05 PM
clear the cache - then run the tests ?

The cache should be fine after the first handful of loops (how many are needed? one? any Intel or AMD insiders lurking??). Loop count is 200000, so if the first n loops take 200000 cycles each, the average will be n cycles higher. Not 500...

Quote
this is on the celeron cpu ? - or prescott ?

Prescott. I will check the Celeron tonight, but I doubt I will see the difference. It comes and goes with a few bytes more or less ::)

Edit: Celeron M timings - the effect is there but less strong:
MakeSlow = 0
343 cycles for addx$
627 cycles for add4$, old syntax
221 cycles for add4$, short syntax

MakeSlow = 1
342 cycles for addx$
650 cycles for add4$, old syntax
316 cycles for add4$, short syntax

dedndave · May 18, 2009, 03:20:45 PM

maybe to write fast code, you have to set up some kind of
dynamic system that places and aligns routines at run-time :dazzled:

hutch-- · May 19, 2009, 08:50:45 AM

JJ,

Here is an example of alignment affecting performance in an unusual way.

This algo clocks 406ms on an alignment of 0,1,2,4,8 but drops to 672ms on an alignment of 16.

I was playing with this algo, dropped it from 467 to 406 with an unroll and a few changed instructions.

Code Select


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
    include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    szappendx PROTO :DWORD,:DWORD,:DWORD

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL hMem  :DWORD
    LOCAL cloc  :DWORD

    mov hMem, alloc(1024*1024*256)      ; 256 meg

    mov cloc, 0

    push esi
    mov esi, 20000000

    invoke GetTickCount
    push eax

  @@:
    mov cloc, rv(szappendx,hMem,"123456789012",cloc)
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax),13,10

    pop esi

    free hMem

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 

; align 16

szappendx proc string:DWORD,buffer:DWORD,location:DWORD

  ; ------------------------------------------------------
  ; string      the main buffer to append extra data to.
  ; buffer      the byte data to append to the main buffer
  ; location    current location pointer
  ; ------------------------------------------------------

    push esi
    push edi
    mov edi, 1

    or eax, -1
    mov esi, [esp+12]       ; string
    mov ecx, [esp+16]       ; buffer
    add esi, [esp+20]       ; location ; add offset pointer to source address

  @@:
  REPEAT 3
    add eax, edi
    movzx edx, BYTE PTR [ecx+eax]
    mov [esi+eax], dl
    add edx, edx
    jz @F                   ; exit on written terminator
  ENDM

    add eax, edi
    movzx edx, BYTE PTR [ecx+eax]
    mov [esi+eax], dl
    add edx, edx
    jnz @B                  ; exit on written terminator

  @@:

    add eax, [esp+20]       ; location ; return updated offset pointer

    pop edi
    pop esi

    ret 12

szappendx endp

OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef 

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

MichaelW · May 19, 2009, 09:14:31 AM

Perhaps a solution would be to use larger alignments. The codealign macro below uses the only workable method I have found so far, is somewhat clumsy, and not very well tested.

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

alignment MACRO pv
    xor eax, eax
    mov ecx, pv
    bsf ecx, ecx
    jz @F
    mov eax, 1
    shl eax, cl
  @@:
    EXITM <eax>
ENDM

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

;-----------------------------------------------------------------------
; This macro will work only for items in the code segment and only if
; there is a code_base label at the start of the code segment.
;
; For some reason, specifying alignments >= 4096 produces an alignment
; of 4096. Hopefully, there is little or no need for larger alignments.
;-----------------------------------------------------------------------

codealign MACRO alignment
    LOCAL location, aligned_location
    location = $ - code_base
    aligned_location = location + alignment AND NOT (alignment - 1)
    db (alignment + aligned_location - location) dup(90h)
ENDM

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
    code_base:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    ;nops 5

    N = 1
    REPEAT 17
        print ustr$(N),9
        codealign N
      @@:
        print ustr$(alignment(OFFSET @B)),13,10
        ;nops N+5
        N = N SHL 1
    ENDM

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

jj2007 · May 19, 2009, 10:46:43 AM

Quote from: hutch-- on May 19, 2009, 08:50:45 AM
JJ,

Here is an example of alignment affecting performance in an unusual way.
This algo clocks 406ms on an alignment of 0,1,2,4,8 but drops to 672ms on an alignment of 16.
I was playing with this algo, dropped it from 467 to 406 with an unroll and a few changed instructions.

That is indeed very odd...

Quote from: MichaelW on May 19, 2009, 09:14:31 AM
Perhaps a solution would be to use larger alignments. The codealign macro below uses the only workable method I have found so far, is somewhat clumsy, and not very well tested.

Will have to look at it when I have some spare time. Another way is what JimG did, i.e. copy the whole proc to be tested to a fixed aligned location and test it there.

In the meantime, I wrote a wrapper to your timings macros performing a cutoff of the lowest and highest timings:

Code Select


Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Testing timings with cutoff, subcounter=10000

0:151 1:151 2:151 3:151 4:151 5:151 6:151 7:151 8:151 9:151 10:151 11:151 12:151
 13:151 14:151 15:151 16:151 17:151 18:151 19:151 20:151

...

801:152 802:152 803:152 804:152 805:152 806:152 807:152 808:152 809:152 810:152
811:152 812:152 813:152 814:152 815:152 816:152 817:152 818:152 819:153 820:153
821:154 822:154 823:154 824:154 825:155 826:155 827:155 828:155 829:155 830:155
831:155 832:155 833:156 834:156 835:156 836:157 837:157 838:157 839:157 840:157
841:157 842:157 843:157 844:157 845:157 846:157 847:157 848:157 849:157 850:157
851:157 852:157 853:157 854:158 855:158 856:158 857:158 858:158 859:159 860:159
861:159 862:159 863:159 864:159 865:160 866:160 867:160 868:160 869:160 870:161
871:162 872:163 873:163 874:163 875:164 876:164 877:165 878:165 879:170 880:170
881:172 882:173 883:175 884:176 885:178 886:179 887:180 888:187 889:188 890:188
891:188 892:188 893:188 894:188 895:188 896:188 897:188 898:188 899:188 900:188
901:189 902:193 903:202 904:203 905:205 906:226 907:231 908:232 909:243 910:244
911:248 912:252 913:260 914:263 915:263 916:263 917:263 918:264 919:264 920:264
921:264 922:265 923:265 924:265 925:265 926:265 927:265 928:265 929:266 930:266
931:266 932:266 933:267 934:267 935:267 936:267 937:268 938:268 939:268 940:268
941:268 942:268 943:268 944:268 945:268 946:269 947:269 948:269 949:269 950:269
951:269 952:270 953:270 954:270 955:270 956:270 957:270 958:270 959:270 960:270
961:270 962:270 963:270 964:270 965:270 966:270 967:270 968:270 969:270 970:270
971:270 972:271 973:271 974:271 975:271 976:272 977:272 978:272 979:272 980:272
981:273 982:273 983:273 984:273 985:273 986:273 987:274 988:275 989:280 990:280
991:281 992:281 993:282 994:282 995:288 996:288 997:297 998:305 999:312 1000:331

MyTest: subcounter=     10000
average with cutoff     151
average w/o cutoff      164

That might be a way to stabilise timings, but doesn't solve the location problem, of course.

[attachment deleted by admin]

jj2007 · May 19, 2009, 11:56:00 AM

Maybe I found something interesting. This version shows a sudden jump of the timings to a significantly higher value:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Testing timings with cutoff, subcounter=1000

0:151 1:151 2:152 3:152 4:152 5:152 6:152 7:152 8:152 9:152 10:152

...

401:152 402:152 403:152 404:152 405:168 406:211 407:255 408:255 409:256 410:256
411:256 412:257 413:257 414:258 415:258 416:258 417:258 418:259 419:259 420:259
421:259 422:259 423:260 424:260 425:260 426:260 427:261 428:261 429:261 430:261
431:262 432:262 433:262 434:262 435:262 436:262 437:262 438:262 439:263 440:263
441:263 442:263 443:263 444:263 445:263 446:263 447:263 448:264 449:264 450:264
451:264 452:264 453:264 454:264 455:264 456:264 457:264 458:264 459:264 460:264
461:264 462:264 463:265 464:265 465:265 466:265 467:265 468:265 469:265 470:265
471:265 472:265 473:265 474:265 475:265 476:265 477:265 478:265 479:265 480:265
481:265 482:265 483:265 484:265 485:265 486:265 487:265 488:265 489:265 490:265
491:265 492:265 493:266 494:266 495:266 496:266 497:266 498:266 499:266 500:266
501:266 502:266 503:266 504:266 505:266 506:266 507:266 508:266 509:266 510:266
511:266 512:266 513:266 514:266 515:266 516:266 517:266 518:266 519:266 520:266
521:266 522:266 523:266 524:266 525:266 526:266 527:266 528:266 529:266 530:266
531:266 532:266 533:266 534:266 535:266 536:266 537:266 538:266 539:266 540:266
541:266 542:266 543:266 544:266 545:266 546:266 547:266 548:266 549:266 550:267
551:267 552:267 553:267 554:267 555:267 556:267 557:267 558:267 559:267 560:267
561:267 562:267 563:267 564:267 565:267 566:267 567:267 568:267 569:267 570:267
571:267 572:267 573:267 574:267 575:267 576:267 577:267 578:267 579:267 580:267
581:267 582:267 583:267 584:267 585:267 586:267 587:267 588:267 589:267 590:267
591:267 592:267 593:267 594:267 595:267 596:267 597:267 598:267 599:267 600:267
601:267 602:267 603:267 604:267 605:267 606:267 607:267 608:267 609:267 610:267
611:267 612:267 613:267 614:267 615:267 616:267 617:267 618:267 619:267 620:267
621:267 622:267 623:267 624:267 625:267 626:267 627:267 628:267 629:267 630:267
631:267 632:267 633:267 634:267 635:267 636:267 637:267 638:267 639:267 640:267
641:268 642:268 643:268 644:268 645:268 646:268 647:268 648:268 649:268 650:268
651:268 652:268 653:268 654:268 655:268 656:268 657:268 658:268 659:268 660:268
661:268 662:268 663:268 664:268 665:268 666:268 667:268 668:268 669:268 670:268
671:268 672:268 673:268 674:268 675:268 676:268 677:268 678:268 679:268 680:268
681:268 682:268 683:268 684:268 685:268 686:268 687:268 688:268 689:268 690:268
691:268 692:268 693:268 694:268 695:268 696:268 697:268 698:268 699:268 700:268
701:268 702:268 703:268 704:268 705:268 706:268 707:268 708:268 709:268 710:268
711:268 712:268 713:268 714:268 715:268 716:268 717:268 718:268 719:268 720:268
721:268 722:268 723:268 724:268 725:268 726:268 727:268 728:268 729:268 730:268
731:268 732:269 733:269 734:269 735:269 736:269 737:269 738:269 739:269 740:269
741:269 742:269 743:269 744:269 745:269 746:269 747:269 748:269 749:269 750:269
751:269 752:269 753:269 754:269 755:269 756:269 757:269 758:269 759:269 760:269
761:269 762:269 763:269 764:269 765:269 766:269 767:269 768:269 769:269 770:269
771:269 772:269 773:269 774:269 775:269 776:269 777:269 778:269 779:269 780:269
781:269 782:269 783:269 784:269 785:269 786:269 787:269 788:269 789:269 790:269
791:269 792:269 793:269 794:269 795:269 796:269 797:269 798:269 799:269 800:269
801:269 802:269 803:269 804:269 805:269 806:269 807:269 808:269 809:269 810:269
811:269 812:269 813:269 814:269 815:269 816:269 817:269 818:269 819:269 820:269
821:269 822:269 823:269 824:269 825:269 826:269 827:270 828:270 829:270 830:270
831:270 832:270 833:270 834:270 835:270 836:270 837:270 838:270 839:270 840:270
841:270 842:270 843:270 844:270 845:270 846:270 847:270 848:270 849:270 850:270
851:270 852:270 853:270 854:270 855:270 856:270 857:270 858:270 859:270 860:270
861:270 862:270 863:270 864:270 865:270 866:270 867:270 868:270 869:270 870:270
871:270 872:270 873:270 874:270 875:270 876:270 877:270 878:270 879:270 880:270
881:270 882:270 883:270 884:270 885:270 886:270 887:270 888:270 889:270 890:270
891:270 892:270 893:270 894:270 895:270 896:270 897:270 898:270 899:270 900:270
901:270 902:270 903:270 904:270 905:270 906:270 907:270 908:270 909:270 910:270
911:270 912:270 913:270 914:271 915:271 916:271 917:271 918:271 919:271 920:271
921:271 922:271 923:271 924:271 925:271 926:271 927:271 928:271 929:271 930:271
931:271 932:271 933:271 934:271 935:271 936:271 937:271 938:271 939:271 940:271
941:271 942:271 943:272 944:272 945:272 946:272 947:272 948:272 949:272 950:272
951:272 952:272 953:272 954:272 955:272 956:272 957:273 958:273 959:273 960:273
961:273 962:273 963:274 964:274 965:274 966:274 967:275 968:275 969:275 970:275
971:275 972:275 973:275 974:276 975:279 976:281 977:286 978:287 979:304 980:318
981:319 982:334 983:374 984:378 985:396 986:404 987:412 988:421 989:422 990:427
991:435 992:441 993:445 994:467 995:487 996:493 997:496 998:497 999:502 1000:789

MyTest: subcounter= 1000
average with cutoff 152
average w/o cutoff 224
difference raw-cutoff: 72

The point of change is not fixed. With the same code, I get also e.g.
581:152 582:152 583:152 584:152 585:152 586:152 587:169 588:199 589:247 590:252
591:256 592:257 593:257 594:257 595:257 596:258 597:258 598:259 599:260 600:261

But the pattern is clear: One part of the timings are fast, another a lot slower, and both consistently so. The code location plays a role in this, obviously. But the code is not always slow. This makes the phenomenon different from what we experience with changing alignments.

EDIT: New attachment with automatic cutoff following a rule:
- take the 10th lowest value as a reference
- multiply with 1.25
- cutoff when the first count higher than this value is being found
Usually, for the P4 the cutoff n is somewhere around 600...900 of 1000 readings, and timings are very stable.

[attachment deleted by admin]

dedndave · May 19, 2009, 11:58:25 AM

i would guess the OS is doing something else during the high count tests

jj2007 · May 19, 2009, 12:38:51 PM

Quote from: dedndave on May 19, 2009, 11:58:25 AM
i would guess the OS is doing something else during the high count tests

Combined with flushing the cache...?

dedndave · May 19, 2009, 12:43:32 PM

is that a rhetorical question Jochen, or am i supposed to make up an answer ?

jj2007 · May 19, 2009, 01:43:36 PM

Quote from: dedndave on May 19, 2009, 12:43:32 PM
is that a rhetorical question Jochen, or am i supposed to make up an answer ?

If you find an answer... but don't tell me it's 42 :wink

News:

Code location sensitivity of timings