|
Pages: [1] 2 3 ... 5
|
 |
|
Author
|
Topic: Code location sensitivity of timings (Read 54382 times)
|
|
jj2007
|
This is a spin-off from the add$ thread, regarding a problem in the add$ macro, and a faster technique of concatenating strings with add$. I stumbled again over a code location problem: 809 cycles for addx$ 848 cycles for add3$, old syntax 767 cycles for add4$, old syntax 258 cycles for add3$, short syntax 227 cycles for add4$, short syntax Exactly the same algo, but moved a bit further down using the MakeSlow=1 switch in line 6: 774 cycles for addx$ 826 cycles for add3$, old syntax 1214 cycles for add4$, old syntax 261 cycles for add3$, short syntax 763 cycles for add4$, short syntax Both code locations are align 16, timings are for Prescott P4. It seems it would be worthwhile to look seriously into this issue. JimG's approach to move the code to a constant location is fine for testing purposes (I had no time to test it on this one), but it raises the question of how to optimise real life applications... [attachment deleted by admin]
|
|
|
|
|
Logged
|
|
|
|
hutch--
Administrator
Member
    
Posts: 12013
Mnemonic Driven API Grinder
|
JJ,
The effect is not uncommon and I don't have a definitive answer. I have even played with padding the code either side with NOPS and ZEROS to get 1024 byte alignment and you still get the variation.
It is my guess that it is a combined code alignment/cache content pollution effect and it effects algos with high intensity loop code more than it effects algos that are logic based.
Also try a different type of timing test as this may be effected by the effects mentioned above. Long linear tests are far less sensitive to code location variation than short looped code methods.
Isn't code optimisation PHUN ?
|
|
|
|
|
Logged
|
|
|
|
|
Jimg
|
Usually the Amd is very sensitive, but with your test, I saw virtually no difference between the two.
|
|
|
|
|
Logged
|
|
|
|
|
jj2007
|
Isn't code optimisation PHUN ?
It rocks  Usually the Amd is very sensitive, but with your test, I saw virtually no difference between the two.
Thanks. I am not surprised. A factor 3 slower just because it's a different CPU 
|
|
|
|
|
Logged
|
|
|
|
|
dedndave
|
clear the cache - then run the tests ? this is on the celeron cpu ? - or prescott ?
|
|
|
|
|
Logged
|
|
|
|
|
jj2007
|
clear the cache - then run the tests ?
The cache should be fine after the first handful of loops (how many are needed? one? any Intel or AMD insiders lurking??). Loop count is 200000, so if the first n loops take 200000 cycles each, the average will be n cycles higher. Not 500... this is on the celeron cpu ? - or prescott ?
Prescott. I will check the Celeron tonight, but I doubt I will see the difference. It comes and goes with a few bytes more or less Edit: Celeron M timings - the effect is there but less strong: MakeSlow = 0 343 cycles for addx$ 627 cycles for add4$, old syntax 221 cycles for add4$, short syntax MakeSlow = 1 342 cycles for addx$ 650 cycles for add4$, old syntax 316 cycles for add4$, short syntax
|
|
|
|
« Last Edit: May 18, 2009, 05:09:31 PM by jj2007 »
|
Logged
|
|
|
|
|
dedndave
|
maybe to write fast code, you have to set up some kind of dynamic system that places and aligns routines at run-time 
|
|
|
|
|
Logged
|
|
|
|
hutch--
Administrator
Member
    
Posts: 12013
Mnemonic Driven API Grinder
|
JJ, Here is an example of alignment affecting performance in an unusual way. This algo clocks 406ms on an alignment of 0,1,2,4,8 but drops to 672ms on an alignment of 16. I was playing with this algo, dropped it from 467 to 406 with an unroll and a few changed instructions. ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤ include \masm32\include\masm32rt.inc ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment * ----------------------------------------------------- Build this template with "CONSOLE ASSEMBLE AND LINK" ----------------------------------------------------- *
szappendx PROTO :DWORD,:DWORD,:DWORD
.data? value dd ?
.data item dd 0
.code
start: ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main inkey exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL hMem :DWORD LOCAL cloc :DWORD
mov hMem, alloc(1024*1024*256) ; 256 meg
mov cloc, 0
push esi mov esi, 20000000
invoke GetTickCount push eax
@@: mov cloc, rv(szappendx,hMem,"123456789012",cloc) sub esi, 1 jnz @B
invoke GetTickCount pop ecx sub eax, ecx
print str$(eax),13,10
pop esi
free hMem
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE OPTION EPILOGUE:NONE
; align 16
szappendx proc string:DWORD,buffer:DWORD,location:DWORD
; ------------------------------------------------------ ; string the main buffer to append extra data to. ; buffer the byte data to append to the main buffer ; location current location pointer ; ------------------------------------------------------
push esi push edi mov edi, 1
or eax, -1 mov esi, [esp+12] ; string mov ecx, [esp+16] ; buffer add esi, [esp+20] ; location ; add offset pointer to source address
@@: REPEAT 3 add eax, edi movzx edx, BYTE PTR [ecx+eax] mov [esi+eax], dl add edx, edx jz @F ; exit on written terminator ENDM
add eax, edi movzx edx, BYTE PTR [ecx+eax] mov [esi+eax], dl add edx, edx jnz @B ; exit on written terminator
@@:
add eax, [esp+20] ; location ; return updated offset pointer
pop edi pop esi
ret 12
szappendx endp
OPTION PROLOGUE:PrologueDef OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
|
|
|
|
|
Logged
|
|
|
|
MichaelW
Global Moderator
Member
    
Gender: 
Posts: 5161
|
Perhaps a solution would be to use larger alignments. The codealign macro below uses the only workable method I have found so far, is somewhat clumsy, and not very well tested. ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« include \masm32\include\masm32rt.inc ; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
alignment MACRO pv xor eax, eax mov ecx, pv bsf ecx, ecx jz @F mov eax, 1 shl eax, cl @@: EXITM <eax> ENDM
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
;----------------------------------------------------------------------- ; This macro will work only for items in the code segment and only if ; there is a code_base label at the start of the code segment. ; ; For some reason, specifying alignments >= 4096 produces an alignment ; of 4096. Hopefully, there is little or no need for larger alignments. ;-----------------------------------------------------------------------
codealign MACRO alignment LOCAL location, aligned_location location = $ - code_base aligned_location = location + alignment AND NOT (alignment - 1) db (alignment + aligned_location - location) dup(90h) ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« .data .code code_base: ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« start: ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« ;nops 5
N = 1 REPEAT 17 print ustr$(N),9 codealign N @@: print ustr$(alignment(OFFSET @B)),13,10 ;nops N+5 N = N SHL 1 ENDM
inkey "Press any key to exit..." exit ; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««« end start
|
|
|
|
|
Logged
|
eschew obfuscation
|
|
|
|
jj2007
|
JJ,
Here is an example of alignment affecting performance in an unusual way. This algo clocks 406ms on an alignment of 0,1,2,4,8 but drops to 672ms on an alignment of 16. I was playing with this algo, dropped it from 467 to 406 with an unroll and a few changed instructions.
That is indeed very odd... Perhaps a solution would be to use larger alignments. The codealign macro below uses the only workable method I have found so far, is somewhat clumsy, and not very well tested.
Will have to look at it when I have some spare time. Another way is what JimG did, i.e. copy the whole proc to be tested to a fixed aligned location and test it there. In the meantime, I wrote a wrapper to your timings macros performing a cutoff of the lowest and highest timings: Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3) Testing timings with cutoff, subcounter=10000
0:151 1:151 2:151 3:151 4:151 5:151 6:151 7:151 8:151 9:151 10:151 11:151 12:151 13:151 14:151 15:151 16:151 17:151 18:151 19:151 20:151
...
801:152 802:152 803:152 804:152 805:152 806:152 807:152 808:152 809:152 810:152 811:152 812:152 813:152 814:152 815:152 816:152 817:152 818:152 819:153 820:153 821:154 822:154 823:154 824:154 825:155 826:155 827:155 828:155 829:155 830:155 831:155 832:155 833:156 834:156 835:156 836:157 837:157 838:157 839:157 840:157 841:157 842:157 843:157 844:157 845:157 846:157 847:157 848:157 849:157 850:157 851:157 852:157 853:157 854:158 855:158 856:158 857:158 858:158 859:159 860:159 861:159 862:159 863:159 864:159 865:160 866:160 867:160 868:160 869:160 870:161 871:162 872:163 873:163 874:163 875:164 876:164 877:165 878:165 879:170 880:170 881:172 882:173 883:175 884:176 885:178 886:179 887:180 888:187 889:188 890:188 891:188 892:188 893:188 894:188 895:188 896:188 897:188 898:188 899:188 900:188 901:189 902:193 903:202 904:203 905:205 906:226 907:231 908:232 909:243 910:244 911:248 912:252 913:260 914:263 915:263 916:263 917:263 918:264 919:264 920:264 921:264 922:265 923:265 924:265 925:265 926:265 927:265 928:265 929:266 930:266 931:266 932:266 933:267 934:267 935:267 936:267 937:268 938:268 939:268 940:268 941:268 942:268 943:268 944:268 945:268 946:269 947:269 948:269 949:269 950:269 951:269 952:270 953:270 954:270 955:270 956:270 957:270 958:270 959:270 960:270 961:270 962:270 963:270 964:270 965:270 966:270 967:270 968:270 969:270 970:270 971:270 972:271 973:271 974:271 975:271 976:272 977:272 978:272 979:272 980:272 981:273 982:273 983:273 984:273 985:273 986:273 987:274 988:275 989:280 990:280 991:281 992:281 993:282 994:282 995:288 996:288 997:297 998:305 999:312 1000:331
MyTest: subcounter= 10000 average with cutoff 151 average w/o cutoff 164 That might be a way to stabilise timings, but doesn't solve the location problem, of course. [attachment deleted by admin]
|
|
|
|
|
Logged
|
|
|
|
|
jj2007
|
Maybe I found something interesting. This version shows a sudden jump of the timings to a significantly higher value:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3) Testing timings with cutoff, subcounter=1000
0:151 1:151 2:152 3:152 4:152 5:152 6:152 7:152 8:152 9:152 10:152
...
401:152 402:152 403:152 404:152 405:168 406:211 407:255 408:255 409:256 410:256 411:256 412:257 413:257 414:258 415:258 416:258 417:258 418:259 419:259 420:259 421:259 422:259 423:260 424:260 425:260 426:260 427:261 428:261 429:261 430:261 431:262 432:262 433:262 434:262 435:262 436:262 437:262 438:262 439:263 440:263 441:263 442:263 443:263 444:263 445:263 446:263 447:263 448:264 449:264 450:264 451:264 452:264 453:264 454:264 455:264 456:264 457:264 458:264 459:264 460:264 461:264 462:264 463:265 464:265 465:265 466:265 467:265 468:265 469:265 470:265 471:265 472:265 473:265 474:265 475:265 476:265 477:265 478:265 479:265 480:265 481:265 482:265 483:265 484:265 485:265 486:265 487:265 488:265 489:265 490:265 491:265 492:265 493:266 494:266 495:266 496:266 497:266 498:266 499:266 500:266 501:266 502:266 503:266 504:266 505:266 506:266 507:266 508:266 509:266 510:266 511:266 512:266 513:266 514:266 515:266 516:266 517:266 518:266 519:266 520:266 521:266 522:266 523:266 524:266 525:266 526:266 527:266 528:266 529:266 530:266 531:266 532:266 533:266 534:266 535:266 536:266 537:266 538:266 539:266 540:266 541:266 542:266 543:266 544:266 545:266 546:266 547:266 548:266 549:266 550:267 551:267 552:267 553:267 554:267 555:267 556:267 557:267 558:267 559:267 560:267 561:267 562:267 563:267 564:267 565:267 566:267 567:267 568:267 569:267 570:267 571:267 572:267 573:267 574:267 575:267 576:267 577:267 578:267 579:267 580:267 581:267 582:267 583:267 584:267 585:267 586:267 587:267 588:267 589:267 590:267 591:267 592:267 593:267 594:267 595:267 596:267 597:267 598:267 599:267 600:267 601:267 602:267 603:267 604:267 605:267 606:267 607:267 608:267 609:267 610:267 611:267 612:267 613:267 614:267 615:267 616:267 617:267 618:267 619:267 620:267 621:267 622:267 623:267 624:267 625:267 626:267 627:267 628:267 629:267 630:267 631:267 632:267 633:267 634:267 635:267 636:267 637:267 638:267 639:267 640:267 641:268 642:268 643:268 644:268 645:268 646:268 647:268 648:268 649:268 650:268 651:268 652:268 653:268 654:268 655:268 656:268 657:268 658:268 659:268 660:268 661:268 662:268 663:268 664:268 665:268 666:268 667:268 668:268 669:268 670:268 671:268 672:268 673:268 674:268 675:268 676:268 677:268 678:268 679:268 680:268 681:268 682:268 683:268 684:268 685:268 686:268 687:268 688:268 689:268 690:268 691:268 692:268 693:268 694:268 695:268 696:268 697:268 698:268 699:268 700:268 701:268 702:268 703:268 704:268 705:268 706:268 707:268 708:268 709:268 710:268 711:268 712:268 713:268 714:268 715:268 716:268 717:268 718:268 719:268 720:268 721:268 722:268 723:268 724:268 725:268 726:268 727:268 728:268 729:268 730:268 731:268 732:269 733:269 734:269 735:269 736:269 737:269 738:269 739:269 740:269 741:269 742:269 743:269 744:269 745:269 746:269 747:269 748:269 749:269 750:269 751:269 752:269 753:269 754:269 755:269 756:269 757:269 758:269 759:269 760:269 761:269 762:269 763:269 764:269 765:269 766:269 767:269 768:269 769:269 770:269 771:269 772:269 773:269 774:269 775:269 776:269 777:269 778:269 779:269 780:269 781:269 782:269 783:269 784:269 785:269 786:269 787:269 788:269 789:269 790:269 791:269 792:269 793:269 794:269 795:269 796:269 797:269 798:269 799:269 800:269 801:269 802:269 803:269 804:269 805:269 806:269 807:269 808:269 809:269 810:269 811:269 812:269 813:269 814:269 815:269 816:269 817:269 818:269 819:269 820:269 821:269 822:269 823:269 824:269 825:269 826:269 827:270 828:270 829:270 830:270 831:270 832:270 833:270 834:270 835:270 836:270 837:270 838:270 839:270 840:270 841:270 842:270 843:270 844:270 845:270 846:270 847:270 848:270 849:270 850:270 851:270 852:270 853:270 854:270 855:270 856:270 857:270 858:270 859:270 860:270 861:270 862:270 863:270 864:270 865:270 866:270 867:270 868:270 869:270 870:270 871:270 872:270 873:270 874:270 875:270 876:270 877:270 878:270 879:270 880:270 881:270 882:270 883:270 884:270 885:270 886:270 887:270 888:270 889:270 890:270 891:270 892:270 893:270 894:270 895:270 896:270 897:270 898:270 899:270 900:270 901:270 902:270 903:270 904:270 905:270 906:270 907:270 908:270 909:270 910:270 911:270 912:270 913:270 914:271 915:271 916:271 917:271 918:271 919:271 920:271 921:271 922:271 923:271 924:271 925:271 926:271 927:271 928:271 929:271 930:271 931:271 932:271 933:271 934:271 935:271 936:271 937:271 938:271 939:271 940:271 941:271 942:271 943:272 944:272 945:272 946:272 947:272 948:272 949:272 950:272 951:272 952:272 953:272 954:272 955:272 956:272 957:273 958:273 959:273 960:273 961:273 962:273 963:274 964:274 965:274 966:274 967:275 968:275 969:275 970:275 971:275 972:275 973:275 974:276 975:279 976:281 977:286 978:287 979:304 980:318 981:319 982:334 983:374 984:378 985:396 986:404 987:412 988:421 989:422 990:427 991:435 992:441 993:445 994:467 995:487 996:493 997:496 998:497 999:502 1000:789
MyTest: subcounter= 1000 average with cutoff 152 average w/o cutoff 224 difference raw-cutoff: 72
The point of change is not fixed. With the same code, I get also e.g. 581:152 582:152 583:152 584:152 585:152 586:152 587:169 588:199 589:247 590:252 591:256 592:257 593:257 594:257 595:257 596:258 597:258 598:259 599:260 600:261
But the pattern is clear: One part of the timings are fast, another a lot slower, and both consistently so. The code location plays a role in this, obviously. But the code is not always slow. This makes the phenomenon different from what we experience with changing alignments.
EDIT: New attachment with automatic cutoff following a rule: - take the 10th lowest value as a reference - multiply with 1.25 - cutoff when the first count higher than this value is being found Usually, for the P4 the cutoff n is somewhere around 600...900 of 1000 readings, and timings are very stable.
[attachment deleted by admin]
|
|
|
|
« Last Edit: May 19, 2009, 01:39:22 PM by jj2007 »
|
Logged
|
|
|
|
|
dedndave
|
i would guess the OS is doing something else during the high count tests
|
|
|
|
|
Logged
|
|
|
|
|
jj2007
|
i would guess the OS is doing something else during the high count tests
Combined with flushing the cache...?
|
|
|
|
|
Logged
|
|
|
|
|
dedndave
|
is that a rhetorical question Jochen, or am i supposed to make up an answer ?
|
|
|
|
|
Logged
|
|
|
|
|
jj2007
|
is that a rhetorical question Jochen, or am i supposed to make up an answer ?
If you find an answer... but don't tell me it's 42 
|
|
|
|
|
Logged
|
|
|
|
|
|
Pages: [1] 2 3 ... 5
|
|
|
 |