This is a spin-off from the add$ thread (http://www.masm32.com/board/index.php?topic=11438.0), regarding a problem in the add$ macro, and a faster technique of concatenating strings with add$. I stumbled again over a code location problem:
809 cycles for addx$
848 cycles for add3$, old syntax
767 cycles for add4$, old syntax
258 cycles for add3$, short syntax
227 cycles for add4$, short syntax
Exactly the same algo, but moved a bit further down using the MakeSlow=1 switch in line 6:
774 cycles for addx$
826 cycles for add3$, old syntax
1214 cycles for add4$, old syntax
261 cycles for add3$, short syntax
763 cycles for add4$, short syntax
Both code locations are align 16; timings are for a Prescott P4. It seems it would be worthwhile to look seriously into this issue. JimG's approach of moving the code to a constant location is fine for testing purposes (I had no time to test it on this one), but it raises the question of how to optimise real-life applications...
[attachment deleted by admin]
JJ,
The effect is not uncommon and I don't have a definitive answer. I have even played with padding the code on either side with NOPs and zeros to get 1024-byte alignment, and you still get the variation.
My guess is that it is a combined code alignment / cache content pollution effect, and that it affects algos with high-intensity loop code more than it affects algos that are logic based.
Also try a different type of timing test, as this one may be affected by the effects mentioned above. Long linear tests are far less sensitive to code location variation than short looped code methods.
Isn't code optimisation PHUN ?
Usually the AMD is very sensitive, but with your test I saw virtually no difference between the two.
Quote from: hutch-- on May 18, 2009, 01:14:26 PM
Isn't code optimisation PHUN ?
It rocks :cheekygreen:
Quote from: Jimg on May 18, 2009, 01:18:22 PM
Usually the AMD is very sensitive, but with your test I saw virtually no difference between the two.
Thanks. I am not surprised. A factor 3 slower just because it's a different CPU ::)
clear the cache - then run the tests ?
this is on the celeron cpu ? - or prescott ?
Quote from: dedndave on May 18, 2009, 01:40:05 PM
clear the cache - then run the tests ?
The cache should be fine after the first handful of loops (how many are needed? one? any Intel or AMD insiders lurking??). Loop count is 200000, so if the first n loops take 200000 cycles each, the average will be n cycles higher. Not 500...
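The amortisation argument above can be put in numbers. A minimal C sketch (function name and parameters are mine, purely for illustration): if the first n of the 200000 iterations each burn roughly 200000 extra cycles warming up the cache, the per-iteration average rises by only about n cycles.

```c
#include <assert.h>

/* If the first n of 'loops' iterations each cost roughly 'loops' extra
   cycles (cold cache), the average rises by only about n cycles. */
static unsigned avg_with_warmup(unsigned base, unsigned loops, unsigned n)
{
    unsigned long long total =
        (unsigned long long)(loops - n) * base +    /* warm iterations */
        (unsigned long long)n * (base + loops);     /* cold iterations */
    return (unsigned)(total / loops);
}
```

So three cold iterations at 200000 extra cycles each raise a 150-cycle average only to 153, nowhere near a 500-cycle difference.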
Quote
this is on the celeron cpu ? - or prescott ?
Prescott. I will check the Celeron tonight, but I doubt I will see the difference. It comes and goes with a few bytes more or less ::)
Edit: Celeron M timings - the effect is there but less strong:
MakeSlow = 0
343 cycles for addx$
627 cycles for add4$, old syntax
221 cycles for add4$, short syntax
MakeSlow = 1
342 cycles for addx$
650 cycles for add4$, old syntax
316 cycles for add4$, short syntax
maybe to write fast code, you have to set up some kind of
dynamic system that places and aligns routines at run-time :dazzled:
JJ,
Here is an example of alignment affecting performance in an unusual way.
This algo clocks 406ms on an alignment of 0,1,2,4,8 but drops to 672ms on an alignment of 16.
I was playing with this algo, dropped it from 467 to 406 with an unroll and a few changed instructions.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
szappendx PROTO :DWORD,:DWORD,:DWORD
.data?
value dd ?
.data
item dd 0
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL hMem :DWORD
LOCAL cloc :DWORD
mov hMem, alloc(1024*1024*256) ; 256 meg
mov cloc, 0
push esi
mov esi, 20000000
invoke GetTickCount
push eax
@@:
mov cloc, rv(szappendx,hMem,"123456789012",cloc)
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
print str$(eax),13,10
pop esi
free hMem
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
; align 16
szappendx proc string:DWORD,buffer:DWORD,location:DWORD
; ------------------------------------------------------
; string the main buffer to append extra data to.
; buffer the byte data to append to the main buffer
; location current location pointer
; ------------------------------------------------------
push esi
push edi
mov edi, 1
or eax, -1
mov esi, [esp+12] ; string
mov ecx, [esp+16] ; buffer
add esi, [esp+20] ; location ; add offset pointer to source address
@@:
REPEAT 3
add eax, edi
movzx edx, BYTE PTR [ecx+eax]
mov [esi+eax], dl
add edx, edx
jz @F ; exit on written terminator
ENDM
add eax, edi
movzx edx, BYTE PTR [ecx+eax]
mov [esi+eax], dl
add edx, edx
jnz @B ; loop back if not yet at the terminator
@@:
add eax, [esp+20] ; location ; return updated offset pointer
pop edi
pop esi
ret 12
szappendx endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
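For reference, the core idea of szappendx can be modelled in C (a sketch, not the hand-tuned unrolled version; the function name is mine): the caller keeps the running offset, so each append starts writing directly at the terminator instead of rescanning the growing destination string.

```c
#include <assert.h>
#include <string.h>

/* Append src to dest starting at offset loc; return the updated offset
   of the terminator, ready to be passed to the next call. */
static unsigned sz_append(char *dest, const char *src, unsigned loc)
{
    unsigned i = 0;
    while ((dest[loc + i] = src[i]) != '\0')  /* copy including the zero */
        i++;
    return loc + i;   /* offset of the written terminator */
}
```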
Perhaps a solution would be to use larger alignments. The codealign macro below uses the only workable method I have found so far, is somewhat clumsy, and not very well tested.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
alignment MACRO pv
xor eax, eax
mov ecx, pv
bsf ecx, ecx
jz @F
mov eax, 1
shl eax, cl
@@:
EXITM <eax>
ENDM
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
;-----------------------------------------------------------------------
; This macro will work only for items in the code segment and only if
; there is a code_base label at the start of the code segment.
;
; For some reason, specifying alignments >= 4096 produces an alignment
; of 4096. Hopefully, there is little or no need for larger alignments.
;-----------------------------------------------------------------------
codealign MACRO alignment
LOCAL location, aligned_location
location = $ - code_base
aligned_location = location + alignment AND NOT (alignment - 1)
db (alignment + aligned_location - location) dup(90h)
ENDM
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
.code
code_base:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
;nops 5
N = 1
REPEAT 17
print ustr$(N),9
codealign N
@@:
print ustr$(alignment(OFFSET @B)),13,10
;nops N+5
N = N SHL 1
ENDM
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
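As a cross-check, here is how the arithmetic behind the two macros above might look in C, assuming power-of-two alignments (function names are mine). alignment_of mirrors the alignment macro: BSF isolates the lowest set bit, so the result is the largest power of two dividing the address (0 for address 0, matching the jz path). pad_to_align gives the minimal padding to the next boundary; note the codealign macro as posted always emits at least 'alignment' bytes more than this minimum, which is harmless for testing but wastes space.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two dividing addr (0 for addr == 0). */
static uint32_t alignment_of(uint32_t addr)
{
    return addr & (0u - addr);   /* isolate the lowest set bit */
}

/* Minimal number of padding bytes to reach the next 'align' boundary;
   'align' must be a power of two. */
static uint32_t pad_to_align(uint32_t loc, uint32_t align)
{
    return (0u - loc) & (align - 1);
}
```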
Quote from: hutch-- on May 19, 2009, 08:50:45 AM
JJ,
Here is an example of alignment affecting performance in an unusual way.
This algo clocks 406ms on an alignment of 0,1,2,4,8 but drops to 672ms on an alignment of 16.
I was playing with this algo, dropped it from 467 to 406 with an unroll and a few changed instructions.
That is indeed very odd...
Quote from: MichaelW on May 19, 2009, 09:14:31 AM
Perhaps a solution would be to use larger alignments. The codealign macro below uses the only workable method I have found so far, is somewhat clumsy, and not very well tested.
Will have to look at it when I have some spare time. Another way is what JimG did, i.e. copy the whole proc to be tested to a fixed aligned location and test it there.
In the meantime, I wrote a wrapper for your timing macros that performs a cutoff of the lowest and highest timings:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Testing timings with cutoff, subcounter=10000
0:151 1:151 2:151 3:151 4:151 5:151 6:151 7:151 8:151 9:151 10:151 11:151 12:151
13:151 14:151 15:151 16:151 17:151 18:151 19:151 20:151
...
801:152 802:152 803:152 804:152 805:152 806:152 807:152 808:152 809:152 810:152
811:152 812:152 813:152 814:152 815:152 816:152 817:152 818:152 819:153 820:153
821:154 822:154 823:154 824:154 825:155 826:155 827:155 828:155 829:155 830:155
831:155 832:155 833:156 834:156 835:156 836:157 837:157 838:157 839:157 840:157
841:157 842:157 843:157 844:157 845:157 846:157 847:157 848:157 849:157 850:157
851:157 852:157 853:157 854:158 855:158 856:158 857:158 858:158 859:159 860:159
861:159 862:159 863:159 864:159 865:160 866:160 867:160 868:160 869:160 870:161
871:162 872:163 873:163 874:163 875:164 876:164 877:165 878:165 879:170 880:170
881:172 882:173 883:175 884:176 885:178 886:179 887:180 888:187 889:188 890:188
891:188 892:188 893:188 894:188 895:188 896:188 897:188 898:188 899:188 900:188
901:189 902:193 903:202 904:203 905:205 906:226 907:231 908:232 909:243 910:244
911:248 912:252 913:260 914:263 915:263 916:263 917:263 918:264 919:264 920:264
921:264 922:265 923:265 924:265 925:265 926:265 927:265 928:265 929:266 930:266
931:266 932:266 933:267 934:267 935:267 936:267 937:268 938:268 939:268 940:268
941:268 942:268 943:268 944:268 945:268 946:269 947:269 948:269 949:269 950:269
951:269 952:270 953:270 954:270 955:270 956:270 957:270 958:270 959:270 960:270
961:270 962:270 963:270 964:270 965:270 966:270 967:270 968:270 969:270 970:270
971:270 972:271 973:271 974:271 975:271 976:272 977:272 978:272 979:272 980:272
981:273 982:273 983:273 984:273 985:273 986:273 987:274 988:275 989:280 990:280
991:281 992:281 993:282 994:282 995:288 996:288 997:297 998:305 999:312 1000:331
MyTest: subcounter= 10000
average with cutoff 151
average w/o cutoff 164
That might be a way to stabilise timings, but doesn't solve the location problem, of course.
[attachment deleted by admin]
Maybe I found something interesting. This version shows a sudden jump of the timings to a significantly higher value:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Testing timings with cutoff, subcounter=1000
0:151 1:151 2:152 3:152 4:152 5:152 6:152 7:152 8:152 9:152 10:152
...
401:152 402:152 403:152 404:152 405:168 406:211 407:255 408:255 409:256 410:256
411:256 412:257 413:257 414:258 415:258 416:258 417:258 418:259 419:259 420:259
421:259 422:259 423:260 424:260 425:260 426:260 427:261 428:261 429:261 430:261
431:262 432:262 433:262 434:262 435:262 436:262 437:262 438:262 439:263 440:263
441:263 442:263 443:263 444:263 445:263 446:263 447:263 448:264 449:264 450:264
451:264 452:264 453:264 454:264 455:264 456:264 457:264 458:264 459:264 460:264
461:264 462:264 463:265 464:265 465:265 466:265 467:265 468:265 469:265 470:265
471:265 472:265 473:265 474:265 475:265 476:265 477:265 478:265 479:265 480:265
481:265 482:265 483:265 484:265 485:265 486:265 487:265 488:265 489:265 490:265
491:265 492:265 493:266 494:266 495:266 496:266 497:266 498:266 499:266 500:266
501:266 502:266 503:266 504:266 505:266 506:266 507:266 508:266 509:266 510:266
511:266 512:266 513:266 514:266 515:266 516:266 517:266 518:266 519:266 520:266
521:266 522:266 523:266 524:266 525:266 526:266 527:266 528:266 529:266 530:266
531:266 532:266 533:266 534:266 535:266 536:266 537:266 538:266 539:266 540:266
541:266 542:266 543:266 544:266 545:266 546:266 547:266 548:266 549:266 550:267
551:267 552:267 553:267 554:267 555:267 556:267 557:267 558:267 559:267 560:267
561:267 562:267 563:267 564:267 565:267 566:267 567:267 568:267 569:267 570:267
571:267 572:267 573:267 574:267 575:267 576:267 577:267 578:267 579:267 580:267
581:267 582:267 583:267 584:267 585:267 586:267 587:267 588:267 589:267 590:267
591:267 592:267 593:267 594:267 595:267 596:267 597:267 598:267 599:267 600:267
601:267 602:267 603:267 604:267 605:267 606:267 607:267 608:267 609:267 610:267
611:267 612:267 613:267 614:267 615:267 616:267 617:267 618:267 619:267 620:267
621:267 622:267 623:267 624:267 625:267 626:267 627:267 628:267 629:267 630:267
631:267 632:267 633:267 634:267 635:267 636:267 637:267 638:267 639:267 640:267
641:268 642:268 643:268 644:268 645:268 646:268 647:268 648:268 649:268 650:268
651:268 652:268 653:268 654:268 655:268 656:268 657:268 658:268 659:268 660:268
661:268 662:268 663:268 664:268 665:268 666:268 667:268 668:268 669:268 670:268
671:268 672:268 673:268 674:268 675:268 676:268 677:268 678:268 679:268 680:268
681:268 682:268 683:268 684:268 685:268 686:268 687:268 688:268 689:268 690:268
691:268 692:268 693:268 694:268 695:268 696:268 697:268 698:268 699:268 700:268
701:268 702:268 703:268 704:268 705:268 706:268 707:268 708:268 709:268 710:268
711:268 712:268 713:268 714:268 715:268 716:268 717:268 718:268 719:268 720:268
721:268 722:268 723:268 724:268 725:268 726:268 727:268 728:268 729:268 730:268
731:268 732:269 733:269 734:269 735:269 736:269 737:269 738:269 739:269 740:269
741:269 742:269 743:269 744:269 745:269 746:269 747:269 748:269 749:269 750:269
751:269 752:269 753:269 754:269 755:269 756:269 757:269 758:269 759:269 760:269
761:269 762:269 763:269 764:269 765:269 766:269 767:269 768:269 769:269 770:269
771:269 772:269 773:269 774:269 775:269 776:269 777:269 778:269 779:269 780:269
781:269 782:269 783:269 784:269 785:269 786:269 787:269 788:269 789:269 790:269
791:269 792:269 793:269 794:269 795:269 796:269 797:269 798:269 799:269 800:269
801:269 802:269 803:269 804:269 805:269 806:269 807:269 808:269 809:269 810:269
811:269 812:269 813:269 814:269 815:269 816:269 817:269 818:269 819:269 820:269
821:269 822:269 823:269 824:269 825:269 826:269 827:270 828:270 829:270 830:270
831:270 832:270 833:270 834:270 835:270 836:270 837:270 838:270 839:270 840:270
841:270 842:270 843:270 844:270 845:270 846:270 847:270 848:270 849:270 850:270
851:270 852:270 853:270 854:270 855:270 856:270 857:270 858:270 859:270 860:270
861:270 862:270 863:270 864:270 865:270 866:270 867:270 868:270 869:270 870:270
871:270 872:270 873:270 874:270 875:270 876:270 877:270 878:270 879:270 880:270
881:270 882:270 883:270 884:270 885:270 886:270 887:270 888:270 889:270 890:270
891:270 892:270 893:270 894:270 895:270 896:270 897:270 898:270 899:270 900:270
901:270 902:270 903:270 904:270 905:270 906:270 907:270 908:270 909:270 910:270
911:270 912:270 913:270 914:271 915:271 916:271 917:271 918:271 919:271 920:271
921:271 922:271 923:271 924:271 925:271 926:271 927:271 928:271 929:271 930:271
931:271 932:271 933:271 934:271 935:271 936:271 937:271 938:271 939:271 940:271
941:271 942:271 943:272 944:272 945:272 946:272 947:272 948:272 949:272 950:272
951:272 952:272 953:272 954:272 955:272 956:272 957:273 958:273 959:273 960:273
961:273 962:273 963:274 964:274 965:274 966:274 967:275 968:275 969:275 970:275
971:275 972:275 973:275 974:276 975:279 976:281 977:286 978:287 979:304 980:318
981:319 982:334 983:374 984:378 985:396 986:404 987:412 988:421 989:422 990:427
991:435 992:441 993:445 994:467 995:487 996:493 997:496 998:497 999:502 1000:789
MyTest: subcounter= 1000
average with cutoff 152
average w/o cutoff 224
difference raw-cutoff: 72
The point of change is not fixed. With the same code, I get also e.g.
581:152 582:152 583:152 584:152 585:152 586:152 587:169 588:199 589:247 590:252
591:256 592:257 593:257 594:257 595:257 596:258 597:258 598:259 599:260 600:261
But the pattern is clear: one part of the timings is fast, another a lot slower, and both consistently so. The code location obviously plays a role in this. But the code is not always slow, which makes the phenomenon different from what we experience with changing alignments.
EDIT: New attachment with automatic cutoff following a rule:
- take the 10th lowest value as a reference
- multiply with 1.25
- cut off when the first count higher than this value is found
Usually, for the P4 the cutoff n is somewhere around 600...900 of 1000 readings, and timings are very stable.
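The cutoff rule just described could be sketched in C like this (the original is a MASM macro wrapper; names here are mine): sort the readings, take the 10th-lowest as a reference, multiply by 1.25, and average only the values up to the first one exceeding that threshold.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_uint(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a;
    unsigned y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Average of the readings t[0..n-1] after discarding everything above
   1.25 * (10th-lowest reading).  Requires n >= 10; sorts t in place. */
static unsigned cutoff_average(unsigned *t, size_t n)
{
    qsort(t, n, sizeof *t, cmp_uint);
    unsigned limit = t[9] + t[9] / 4;   /* 10th lowest * 1.25 */
    unsigned long long sum = 0;
    size_t k = 0;
    while (k < n && t[k] <= limit)
        sum += t[k++];
    return (unsigned)(sum / k);
}
```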
[attachment deleted by admin]
i would guess the OS is doing something else during the high count tests
Quote from: dedndave on May 19, 2009, 11:58:25 AM
i would guess the OS is doing something else during the high count tests
Combined with flushing the cache...?
is that a rhetorical question Jochen, or am i supposed to make up an answer ?
Quote from: dedndave on May 19, 2009, 12:43:32 PM
is that a rhetorical question Jochen, or am i supposed to make up an answer ?
If you find an answer... but don't tell me it's 42 (http://theanswers42.blogspot.com/) :wink
Quote from: dedndave on May 19, 2009, 11:58:25 AM
i would guess the OS is doing something else during the high count tests
Hello,
I have a simple graphics program that prints out the iterations
per second. It works in an XP full screen command prompt, but
if run longer than a few seconds, it slows significantly (about half
speed). I have seen another DOS graphics program slow drastically
after running a few seconds, though that one is not very repeatable.
Maybe boot to safe mode, or DOS?
FWIW,
Steve N.
What ? :bg
It's not a DOS program !
it could be a dos graphics program under XP
i have a similar one i wrote years ago for DOS
it uses the 320x200 256 color graphics mode
although, i have not seen this slow-down problem with it
i did notice that it seems disk-intensive
DOS-mode disk I/O running under Windows seems to be a poor mix
I have often noticed that apps slow down while the fan starts spinning faster. Maybe I've misunderstood what SpeedStep (TM) really means - I thought apps should run faster when they ask for it :bg
In the meantime, I might have found an answer; the following shows how many readings in 1,000 are low, i.e. 17-18 cycles for 10 push/pop reg32 pairs:
LOOP_COUNT = 10000000
average with cutoff 18 cutoff at n=523 for ct higher than 23 cycles
average w/o cutoff 29
difference raw-cutoff: 11
LOOP_COUNT = 5000000
average with cutoff 17 cutoff at n=639 for ct higher than 23 cycles
average w/o cutoff 26
difference raw-cutoff: 9
LOOP_COUNT = 2000000
average with cutoff 18 cutoff at n=898 for ct higher than 23 cycles
average w/o cutoff 20
difference raw-cutoff: 2
LOOP_COUNT = 1000000
average with cutoff 17 cutoff at n=977 for ct higher than 23 cycles
average w/o cutoff 18
difference raw-cutoff: 1
LOOP_COUNT = 500000
average with cutoff 17 cutoff at n=970 for ct higher than 22 cycles
average w/o cutoff 17
difference raw-cutoff: 0
See the pattern? It seems that if LOOP_COUNT is too high, the OS is forced to do something else, eating up cycles for half of the readings. Probably the time slice is then too short to perform the full test?
Anyway, the cutoff method works fine, and is easier to handle than JimG's copying scheme. But that still doesn't answer the question why the OS behaves like that more often if the same identical and identically aligned code sits a few bytes further down...
that is what i was saying in the other thread
i think the test "burst" needs to fall within some predefined window of cycle counts
too few cycles - unrepeatable results due to test/overhead ratio
too many cycles - can't avoid the OS "hic-ups"
i think that window is fairly large and should be easy to hit
once the window has been defined, testing any algo should first adjust its "burst" length to fit the window - then run the tests
i am working on a program that does it this way - if i can just get finished with the cpu identification nightmare - lol
i am also trying to accommodate a wide variety of machines with multi-cores
example: some off-the-wall guy has a machine with a prescott and 2 northwoods in it
we can acquire information for the prescott and northwood from a single run
I originally recommended a high loop count because it produced better run-to-run consistency. Then some time later I discovered that adding a 3-4 second delay after the program was launched and before entering the first test loop eliminated the need for the high loop count. Running on my P3, with the delay and a loop count of 1000, and testing code that executes somewhere around 20-300 cycles, the run-to-run variation is near zero.
good info to have Michael
i would not have thought to look for a high start-up usage by the OS
probably building all the structures that you "might" refer to in function calls - lol - what a waste
it would make more sense to me if the OS set everything up - then ran the app - but what do i know
i wonder if this affects windows apps as well as console apps - probably does
also, i wonder if this is somehow related to Jochen's mystery cycle-count issue
FWIW,
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3) (XP x64)
Testing timings with cutoff
0:79 1:79 2:79 3:79 4:79 5:79 6:79 7:79 8:79 9:79 10:79
...
701:79 702:79 703:79 704:79 705:79 706:79 707:79 708:79 709:79 710:79
...
991:79 992:79 993:79 994:79 995:79 996:79 997:80 998:80 999:81 1000:109
MyTest: subcounter= 1000
average with cutoff 79 cutoff at n=1000 for ct higher than 99 cycles
average w/o cutoff 78
difference raw-cutoff: -1
Having played with enough algos over time, there does seem to be a pattern in the example I posted where setting the alignment to 16 suddenly makes the algo a lot slower, sometimes over 50% slower. It would seem that if the algorithm deals with BYTE data (i.e. processed one byte at a time), it is more susceptible to alignment variations than if it deals with DWORD-sized data.
There is possibly another factor with the "ALIGN 16" directive: it can be up to 15 bytes long, usually done with two NOP-style instructions, and while they should never be executed, as they sit before the label in a procedure, I have a sneaking suspicion that very long instructions stored near the front of the code cache can also have unusual effects in terms of timing.
well - i always try to use a simple instruction (or two) at a label
that gives the eip something to chew on while the queue fills up
like....
branch:
cld
mov eax,edx
(i.e. use instructions that access no memory)
i have always used a similar method all the way back to 8086
with that processor, the queue was 6 bytes
but, it was capable of pre-fetching and executing at the same time (second generation uP they called it i think)
Quote from: hutch-- on May 20, 2009, 02:29:19 AM
I have a sneaking suspicion that very long instructions stored near the front of the code cache can also have unusual effects in terms of timing.
suspicion? has anyone here read what i've written? i spoke of uops for a good reason... always remember that the alignment we apply is there to make instruction decoding easier for the cpu... BUT it's not always sufficient... take a piece of paper and work through the possibilities for a cpu that handles 6 or 4 uops, for example with a 7-byte instruction... you will certainly understand... anyway, it's not something you can modify by coding, so why bother with it?
NightWare,
Tolerate us mere mortals and explain what you have mentioned. I commented on something I have found over time that is not mentioned in any of the optimisation documentation I have seen over the last 20 years: effectively a NON-EXECUTED instruction sequence, provided by the assembler for the purposes of code alignment, that appears to affect the timing of the executed code that follows it.
Now, while waiting for such pearls of wisdom, here is some interim testing that seems to affect the timing change with different code alignments.
align 16
mov esp, [esp+edx*8+128] ; non executed instruction 7 bytes
REPEAT 64
int 3
ENDM
; retn
align 4
The align 16, the 7-byte instruction and a number of others produced the slowdown of over 50%. Large paddings of NOPs (DB 90h) did not remove the slowdown, but a RETN in some instances did, as did padding with INT 3 instructions.
I don't have a handy explanation, but the effect is reproducible.
I was wondering...with our new you beaut cpu's and variations with cache and windows with its zillion threads, could the cache get 'saturated'?
Garbage collection?
Quote from: sinsi on May 20, 2009, 06:59:52 AM
I was wondering...with our new you beaut cpu's and variations with cache and windows with its zillion threads, could the cache get 'saturated'?
Garbage collection?
As far as I've understood, you get a fresh cache every time the OS hands over a time slice - but maybe I am wrong.
Quote from: hutch-- on May 20, 2009, 06:34:32 AM
The align 16, the 7-byte instruction and a number of others produced the slowdown of over 50%. Large paddings of NOPs (DB 90h) did not remove the slowdown, but a RETN in some instances did, as did padding with INT 3 instructions.
I don't have a handy explanation, but the effect is reproducible.
Lingo has the habit to start procs with e.g.
align 16
db 0, 9, 3, 99
LingoCode proc ....
Maybe he has found the secret :wink
The only possibility I see is that the CPU erroneously tries to interpret non-executed code
before the proc start, in order to "optimise" the real code - branch prediction etc... ::)
Anyway, for analysis we should keep separate two issues:
1. The alignment that influences speed all the time, e.g. a beneficial
nop before a speed-critical loop
2. The observation that the same identical and identically aligned code runs half as fast if placed 16 bytes lower.
For point 2, see my code attached above. The algo runs x% at full speed, and 100-x% at half speed, most probably because the OS takes over occasionally to do other things. I have further tested the phenomenon, and my "wrappers" for Michael's macros can effectively avoid it. Even with a modest invoke Sleep, 20 only and LOOP_COUNT = 10000, I get a rock solid timing of 151 cycles for mov eax, len(offset Tst50) (a 50-byte string). The non-cutoff mode timings instead waddle like a dog's tail, and the cutoff point varies, i.e. there is always a number of readings that are totally out of order.
Quote from: hutch-- on May 20, 2009, 03:47:15 AM
I commented on something I have found over time that is not mentioned in any of the optimisation documentation I have seen over the last 20 years: effectively a NON-EXECUTED instruction sequence, provided by the assembler for the purposes of code alignment, that appears to affect the timing of the executed code that follows it.
you treat the cpu like a robot that always does the same thing (where alignment would fit well), but that's NOT the case. the cpu behaves differently depending on the instructions that were executed previously.
when YOU write an instruction, you "control" the alignment for the cpu. BUT remember that your app/code is NOT the first and only app executed. when you launch your app, other instructions have already run and have filled the code/trace cache, and those instructions will not be evicted just because you want them to (they are only replaced as you execute your own instructions more often over a period - that's the principle of a cache).
so, to replace the instructions in the cache, they are fetched from MEMORY (and HERE alignment is useful).
now we come to HOW the cpu decodes instructions from the cache, and remember the cache is only a few kb (so no space to lose): instructions (or parts of algos) are packed after other instructions (or parts of algos), to avoid "pollution".
but of course, to achieve that, the cpu uses its own internal locations when decoding the instructions (we are NOT in memory-alignment logic anymore) - alignment INSIDE the code/trace cache. and that is something you CAN'T modify; it's decided by the cpu, NOT YOU.
and yes, since this is ANOTHER alignment, there is a speed difference directly linked to it (and the number of uops your cpu can handle becomes essential).
jj, stop moving instructions up/down to gain speed; you should only do that to avoid stalls, nothing else. otherwise don't be surprised if the cache gives you a BIG slowdown when your aligned code suddenly becomes totally unaligned.
Quote from: jj2007 on May 19, 2009, 02:56:12 PM
But that still doesn't answer the question why the OS behaves like that more often if the same identical and identically aligned code sits a few bytes further down...
I'm having a problem understanding the "identically aligned code sits a few bytes further down" part. If you are using align 16, then any code that sits "a few bytes further down" is not identically aligned. If it is identically aligned, then it must be sitting some multiple of 16 bytes further down.
that seems to be the nature of what JJ is describing, Michael
if he uses align 16 - one result
moves it 16 bytes down - different result
it seems to happen on some processors - not others
it is sometimes difficult to reproduce
The exact reason I posted the following example is that you CAN change the effect of large slowdowns by avoiding potentially long instructions (7 bytes) just prior to the code that is run. While NOPs did not work, a RETN works OK in some circumstances, and padding with INT 3 also seems to work in some instances.
I do not routinely use 16-byte alignment, as it has tested slow on many occasions, whereas the ALIGN 4 directive rarely ever causes a problem.
align 16
mov esp, [esp+edx*8+128] ; non executed instruction 7 bytes
REPEAT 64
int 3
ENDM
; retn
align 4
Without having a definitive answer, I suggest it has something to do with the long instruction prior to the aligned code causing problems in the code cache, but at least there is a way to change it with a bit of experimentation. What will vary from processor to processor is the block read size of the code cache, so fully avoiding the problem would mean object-module-level alignment matched to each processor's level-1 cache read size, which ends up impractical because of final code size.
it would be nice to know which processors do what
for most code, it can be overlooked, perhaps
but if there is a real time-crunching routine, it would be possible to load the algo dynamically at runtime, according to the host cpu
i suppose you could test the alignment at run-time as well, rather than IDing the cpu
clearly an advanced technique - we better leave it to the experts rofl
Dave,
What you have mentioned is reasonably simple to do: allocate executable memory large enough to hold the code plus the maximum alignment required, then copy the algo to the alignment you want at runtime. It's an old technique, used to bypass API wrappers for other functions at a lower level in the API function set.
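The relocation technique hutch describes might be sketched like this (function name is mine): over-allocate by the alignment, bump the pointer up to the requested boundary, and copy the routine there. In a real Win32 program the buffer would come from VirtualAlloc with PAGE_EXECUTE_READWRITE so the copy is actually executable; plain malloc is used here only to demonstrate the pointer arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy 'len' bytes of code into 'raw' at the next 'align' boundary
   ('align' must be a power of two; 'raw' must have at least
   len + align - 1 bytes of room).  Returns the aligned destination. */
static void *copy_aligned(void *raw, const void *code, size_t len, size_t align)
{
    uintptr_t p = (uintptr_t)raw;
    p = (p + align - 1) & ~(uintptr_t)(align - 1);  /* round up */
    memcpy((void *)p, code, len);
    return (void *)p;
}
```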
Quote from: MichaelW on May 21, 2009, 01:21:09 AM
I'm having a problem understanding the "identically aligned code sits a few bytes further down" part. If you are using align 16, then any code that sits "a few bytes further down" is not identically aligned. If it is identically aligned, then it must be sitting some multiple of 16 bytes further down.
This is indeed what I meant - sorry for not being crystal clear. Simplified:
if MakeSlow
REPEAT 16
nop
ENDM
endif
align 16
MyTest proc
ret
MyTest endp
And (@Nightware) it has nothing to do with shifting lines up or down to speed up the code. It is the same identical algo that produces different results. Note that you
can eliminate that effect by using the sort & eliminate outliers technique described further up. In real apps, however, the effect might be present, depending on things you cannot easily influence, i.e. where that algo happens to be inserted.
Now one thing I have not yet tested is whether the effect was produced by the code immediately before the algo, i.e. the one that according to our most recent theory :wink might influence the code cache (branch prediction whatever) behaviour...
This avenue looks promising. If Hutch finds that a
retn immediately before the start of the algo speeds it up, it makes sense: Even a not-so-intelligent CPU should know that there is nothing to optimise...
Other candidates for non-optimisable instructions are int3, cpuid, rdtsc...
yes - cpuid seems like a good choice, as it forces serialization of instructions
Good and bad news.
First the good one: I am able to nail down the speed difference to
one byte:
Quote
if MakeSlow
print chr$(10, "Code sizes:", 13, 10)
; 341 cycles for addx$
; 584 cycles for add4$, old syntax
; 324 cycles for add4$, short syntax
else
print chr$(10, 10, "Code sizes:", 13, 10)
; 341 cycles for addx$
; 531 cycles for add4$, old syntax
; 192 cycles for add4$, short syntax
endif
Timings are rock solid.
Now the bad news:
SLOW version
0040193C ? 5E pop esi
0040193D ? 5F pop edi
0040193E ? C2 0800 retn 8
00401941 . 8DA424 00000000 lea esp, [local.0]
00401948 . 8DA424 00000000 lea esp, [local.1]
0040194F ? 90 nop
00401950 > CC int3 <<<<<<<<<<< algo start
00401951 ? 57 push edi
00401952 ? 56 push esi
*** print chr$(10, "Code sizes:", 13, 10)
00401805 . 40 inc eax ; Arg1 => ASCII 0A,"Code sizes"
00401806 ? 00E8 add al, ch ;
00401808 ? 90 nop ;
00401809 ? 0100 add dword ptr [eax], eax ;
0040180B ? 00CC add ah, cl
0040180D ? 68 59C04000 push offset AddStrB.0040C059 ; ASCII 0A,"Code sizes"
00401812 ? E8 89030000 call 00401BA0 ;
00401817 ? 68 68C04000 push offset AddStrB.0040C068 ; ASCII "szCatStr2 = "
0040181C ? E8 7F030000 call 00401BA0
FAST version
0040193C ? 5E pop esi
0040193D ? 5F pop edi
0040193E ? C2 0800 retn 8
00401941 . 8DA424 00000000 lea esp, [local.0]
00401948 . 8DA424 00000000 lea esp, [local.1]
0040194F ? 90 nop
00401950 > CC int3 <<<<<<<<<<< algo start
00401951 ? 57 push edi
00401952 ? 56 push esi
00401953 . 8B7C24 0C mov edi, dword ptr [arg1]
*** print chr$(10, 10, "Code sizes:", 13, 10)
00401805 . 40 inc eax ; Arg1 => ASCII LF,LF,"Code sizes:",CR,LF
00401806 ? 00E8 add al, ch ;
00401808 ? 90 nop ;
00401809 ? 0100 add dword ptr [eax], eax ;
0040180B ? 00CC add ah, cl
0040180D ? 68 59C04000 push offset AddStrB.0040C059 ; ASCII LF,LF,"Code sizes:",CR,LF
00401812 ? E8 89030000 call 00401BA0 ;
00401817 ? 68 69C04000 push offset AddStrB.0040C069 ; ASCII "szCatStr2 = "
0040181C ? E8 7F030000 call 00401BA0
The algo and its location are identical. Needless to say, I fumbled a lot with the align 16 plus specific fill bytes (retn, cpuid, ...) options. They have no effect, at least on the Celeron M where I could test this.
[attachment deleted by admin]
Alright, I need to ask a n00B question - why should any timing code be trusted if it can't disable interrupts? Every 15ms (at least) the executing thread grinds to a halt to service the kernel, which the high performance counter wouldn't compensate for. Additional hardware events and DPCs would compound the problem even more, right? I was under the tacit assumption that any code timing, unless done in real mode with interrupts disabled, was AT BEST a good estimation. Can someone explain why this is not so?
-r
well - they have been playing with different methods for a while in here
QueryPerfomanceCounter, RDTSC, as well as SetPriorityClass and SetProcessAffinity
they have an idea of what some simple strings of code should return - so they can test their methods
but, the popular approach seems to be to run the test code about 10,000 times and pick one
of the lower result values, assuming that nothing else was going on during that run
it seems to return fairly repeatable results on single-core CPU's
i guess you'd say we are still fine-tuning the measurement methods
to obtain more reliable results on a wider variety of machines
i am playing with some code along that line, myself
It's something like being a lone voice crying in the wilderness, but until real time testing becomes the norm on large enough samples to emulate real time situations, the results will continue to be all over the place like a mad woman's sewerage.
Short looped test conditions do have their place but it tends to be in the area of instruction sequence testing, not in larger scale algorithm testing where cache effects, memory paging and bare data read/write times take effect. As hardware gets further away from the old pre-pipelined single instruction munchers, many of these assumptions are simply out of date. The only factor that remains constant is real time, the rest vary with hardware changes.
lol "mad womans sewerage" - do i wanna know what that is ? - reminds me of the ex
i think there is still a chance to get some meaningful data on algos
so long as the data set you select emulates the real world
some of us are not all that interested in "how many ticks does it take"
i am more interested in obtaining some kind of usable comparisons
if things are set up right, and you take into account that different algos work differently on different data sets
you should be able to get enough information to help select the right algo for the job
this algo handles short strings better but is slow on long strings
that algo doesn't care how long the string is, it's not fast anywhere
these generalizations should help you decide what is best for a given application
if you can write the algo 3 ways - it is a matter of copy/paste in the finished program to help decide
i do think there is a lot of room for improvement in the test methods, however
it will never be perfect, of course - but it can be better than it is - we will get there
EDIT: i was playing with some code today and getting nice continuous time values on a dual-core, at least
Quote from: redskull on May 22, 2009, 02:09:53 AM
Alright, I need to ask a n00B question - why should any timing code be trusted which can't disable the interrupts? Every 15ms (at least) the executing thread grinds to a halt to service the kernel
The TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise). All readings are done in well under 15ms (actually, this test with 1,000 readings takes 15 ms including all overheads). And readings are very consistent - 967 out of 1,000 tests are in a range of +- 2 cycles around the average:
Quote0:744 1:744 2:744 3:744 4:744 5:744 6:744 7:744 8:744 9:744 10:744
...
951:746 952:746 953:746 954:746 955:746 956:746 957:746 958:747 959:751 960:756
961:780 962:807 963:819 964:834 965:839 966:845 967:971 968:973 969:974 970:976
971:986 972:989 973:989 974:990 975:991 976:991 977:992 978:992 979:993 980:994
981:997 982:999 983:1000 984:1001 985:1002 986:1011 987:1016 988:1025 989:1025
990:1030 991:1033 992:1033 993:1035 994:1044 995:1045 996:1052 997:1056 998:1111
999:1136 1000:1209
average with cutoff 746 cutoff at n=967 for ct>=931 cycles
average w/o cutoff 754
difference raw-cutoff: 8
Here is the same for the fast version - identical code at identical position, see earlier post above:
Quote0:259 1:259 2:259 3:259 4:259 5:259 6:259 7:259 8:259 9:259 10:259
...
951:261 952:261 953:261 954:261 955:261 956:261 957:261 958:261 959:261 960:261
961:261 962:261 963:261 964:261 965:261 966:261 967:261 968:261 969:261 970:261
971:261 972:261 973:261 974:261 975:261 976:261 977:261 978:261 979:261 980:261
981:262 982:262 983:262 984:262 985:262 986:267 987:323 988:374 989:422 990:488
991:488 992:491 993:492 994:500 995:506 996:510 997:511 998:523 999:530 1000:758
average with cutoff 259 cutoff at n=988 for ct>=324 cycles
average w/o cutoff 262
difference raw-cutoff: 3
Note these timings are for a P4, so it is not even CPU-specific. But test yourself - I have attached two executables now.
[attachment deleted by admin]
QuoteThe TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise).
Only the second set of macros (counter2.asm in counter2.zip) do this. These macros also capture the lowest cycle count that occurs in any loop.
Quote from: MichaelW on May 22, 2009, 09:12:12 AM
QuoteThe TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise).
Only the second set of macros (counter2.asm in counter2.zip) do this. These macros also capture the lowest cycle count that occurs in any loop.
Oops, I stand corrected - counter2.asm, lines 57 and 106:
invoke Sleep, 0 ;; Start a new time slice
If that is sufficient to guarantee a new slice, it can be done with a setting like this:
QuotecyctCutFast = 10
cyctSleepMs = 0
See in cyct_begin:
ifdef cyctSleepMs
invoke Sleep, cyctSleepMs
endif
And unfortunately, it has no influence on the observed speed difference, so that does not seem to be the culprit...
Re lowest cycle count, my wrapper uses CombSortA to eliminate the outliers. On some occasions, I saw negative cycle counts, too.
AMD x64 4000+ / WinXP Pro x64
AddStrBSlow:
average with cutoff 215 cutoff at n=995 for ct>=269 cycles
average w/o cutoff 222
difference raw-cutoff: 7
AddStrBFast:
average with cutoff 230 cutoff at n=987 for ct>=288 cycles
average w/o cutoff 239
difference raw-cutoff: 9
Another cute example (d7s8=destination aligned 16+7, source 16+8):
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Algo MemCo1 MemCo2 MemCo5 MemCopy _imp__str
Description rep movsd movdqa movdqa Masm32 CRT
dest-al src-al A src-al B library strcpy
Code size 63 201 270 ? ?
----------------------------------------------------------------
2048, d0s0 729 603 602 1113 2389
2048, d7s8 1585 1220 831 1615 5430
2048, d7s9 1586 1226 831 4438 4948
Same code with MakeSlow = 1:
Algo MemCo1 MemCo2 MemCo5 MemCopy _imp__str
Description rep movsd movdqa movdqa Masm32 CRT
dest-al src-al A src-al B library strcpy
Code size 63 201 270 ? ?
----------------------------------------------------------------
2048, d7s8 1562 7280 7257 1622 5423
2048, d7s9 1562 7284 7271 5237 4923
if MakeSlow ; used exactly once, before the table
print chr$(13, 13)
endif
Good to know that others have the same problem:
I'm a developer for the x264 video encoder. (http://software.intel.com/en-us/forums/watercooler-catchall/topic/56271/) When Loren Merritt, another developer, was testing SAD assembly operation performance, he noticed that the number of clocks required varied wildly. In particular, when the unaligned data to be loaded crossed a cache line boundary, which occurred 25% of the time for blocks of width 16 bytes, up to eight times as many clock cycles were required for the entire operation, depending on the processor. This issue was found to exist not only for SSE2 load operations like MOVDQU, but also for MMX, and on processors going all the way back to the Pentium 3 (though obviously the cache line intervals varied). No older processors were tested than the Pentium 3. Page lines were also encountered every 64 cache lines, i.e. 1/256th of the time. Data crossing a page line resulted in up to 100 times slower performance;
thousands of clock cycles for a single 16x16 SAD operation that normally took 48. However, when testing the same operations on AMD processors, we noticed there was no measurable penalty at all for unaligned loads across cache lines, and furthermore, the page line penalty was a mere 5-10%.
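Whether a given access straddles a cache line or a page is pure address arithmetic. A small C sketch, assuming the usual 64-byte lines and 4096-byte pages (both sizes are CPU-dependent and the granule should really be queried via CPUID):

```c
#include <stdint.h>
#include <stddef.h>

/* Returns nonzero when an access of `size` bytes starting at `addr`
 * straddles a boundary of `granule` bytes (64 for a typical cache
 * line, 4096 for a page). granule must be a power of two. */
int crosses_boundary(uintptr_t addr, size_t size, size_t granule)
{
    return (addr & ~(uintptr_t)(granule - 1)) !=
           ((addr + size - 1) & ~(uintptr_t)(granule - 1));
}
```

For a 16-byte load on 64-byte lines, 15 of the 64 possible byte offsets straddle a line, roughly the one-in-four rate the x264 developers cite for width-16 blocks.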
[attachment deleted by admin]
JJ,
To force SleepEx to actually take a time slice, set it to over 20 ms and you are reasonably safe; set to 0 it can return immediately without reference to a timeslice.
I usually use,
invoke SleepEx,100,0
By doing it this way you get the fresh boost directly after the task switch, and it's very uniform from one instance of SleepEx() to another.
Quote from: hutch-- on June 09, 2009, 02:01:20 PM
invoke SleepEx,100,0
By doing it this way you get the fresh boost directly after the task switch, and it's very uniform from one instance of SleepEx() to another.
Thanks, Hutch. I'll implement it next time, although it seems to have no influence on the timings. In the attachment above, I use the cutoff method, i.e. I take a thousand timings and use the average from n=10 to the first one that is 25% slower than cycles(10). The cutoff usually takes place at >990 for non-P4 and between 500...950 for P4. The resulting timings are very stable.
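For anyone who wants to replicate that cutoff, here is a sketch in C. Names are illustrative, and qsort stands in for the CombSortA the real wrapper uses; the rule is the one described above - sort, skip the first few readings, and average up to the first reading that is 25% slower than the first retained one.

```c
#include <stdlib.h>

/* qsort comparator for unsigned cycle counts. */
static int cmp_uint(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Sorts t[] in place, then averages t[skip..] up to (and excluding)
 * the first reading more than 25% above t[skip]. */
unsigned cutoff_average(unsigned *t, int n, int skip)
{
    qsort(t, n, sizeof *t, cmp_uint);
    unsigned base  = t[skip];            /* e.g. skip = 10 as in the post */
    unsigned limit = base + base / 4;    /* 25% slower => outlier */
    unsigned long long sum = 0;
    int count = 0;
    for (int i = skip; i < n && t[i] <= limit; i++) {
        sum += t[i];
        count++;
    }
    return (unsigned)(sum / count);      /* count >= 1: t[skip] <= limit */
}
```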
The problem here is not the measurement. That code is really slow, under specific conditions, and some call it "Cacheline splits aka Intel hell (http://x264dev.blogspot.com/2008/05/cacheline-splits-aka-intel-hell.html)" :toothy
P.S.: The code posted above has a new set of macros, wrapping the well-known TIMER.ASM macros by MichaelW. The table above was produced with this code (apologies to the macro haters :wink):
.code
MyString db "A 16 bytes text.", 0
start:
invoke ShowCpu, 1 ; prints the brand string and returns SSE level in eax
mov ebx, ByteCount/16+1
.Repeat
invoke lstrcat, addr src16, offset MyString
dec ebx
.Until Zero?
cyctProcs equ MemCo1, MemCo2, MemCo5, MemCopy, crt_strcpy
cyctTable "2048, d0s0", offset dest16, offset src16, 2048
cyctTable "2048, d8s8", offset dest8, offset src8, 2048
cyctTable "2048, d7s7", offset dest7, offset src7, 2048
cyctTable "2048, d7s8", offset dest7, offset src8, 2048
cyctTable "2048, d7s9", offset dest7, offset src9, 2048
cyctTable "2048, d9s7", offset dest9, offset src7, 2048
cyctTable "2048, d9s8", offset dest9, offset src8, 2048
cyctTable "2048, d9s9", offset dest9, offset src9, 2048
inkey chr$(13, 10, 10, "--- ok ---", 13)
exit
:U
@echo off
\masm32\bin\ml.exe /FoMemCopySSE2.obj /nologo MemCopySSE2.asm
pause
Assembling: MemCopySSE2.asm
MemCopySSE2.asm(3) : fatal error A1000: cannot open file : \masm32\Gfa2Masm\Gfa2Masm.inc
Press any key to continue . . .
Quote from: UtillMasm on June 09, 2009, 02:27:39 PM
fatal error A1000: cannot open file : \masm32\Gfa2Masm\Gfa2Masm.inc
Oops... download new version above, or change lines 2+3 as follows:
include \masm32\include\masm32rt.inc
; include \masm32\Gfa2Masm\Gfa2Masm.inc ; needed only for deb macro
hey Jochen,
I was looking at this post you refered to...
http://software.intel.com/en-us/forums/watercooler-catchall/topic/56271/
they are using "ffmpegs bench.h" to time algos
do you know where to find that?
i looked high and low for it
Quote from: dedndave on June 09, 2009, 04:00:09 PM
hey Jochen,
I was looking at this post you refered to...
http://software.intel.com/en-us/forums/watercooler-catchall/topic/56271/
they are using "ffmpegs bench.h" to time algos
do you know where to find that?
i looked high and low for it
Apparently ffmpeg is a suite for Linux that also includes a -bench option...
well - i found ffmpeg.org - they make an open source mpeg player/editor/whatever
i was interested in their timing code
the file i was trying to find was named "bench.h"
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Algo MemCo1 MemCo2 MemCo5 MemCopy _imp__str
Description rep movsd movdqa movdqa Masm32 CRT
dest-al src-al A src-al B library strcpy
Code size 63 201 270 ? ?
----------------------------------------------------------------
2048, d0s0 549 424 361 546 2094
2048, d8s8 585 465 403 555 2100
2048, d7s7 593 478 411 1390 2088
2048, d7s8 834 562 564 1055 2172
2048, d7s9 834 564 564 1187 2175
2048, d9s7 833 564 565 1561 2173
2048, d9s8 834 562 566 1216 2172
2048, d9s9 575 472 410 1399 2091
Got some weird timings again:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Algo _imp__st MemCo1 MemCo2 MemCoC3 MemCoP4 MemCo5 MemCoL
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32
strcpy dest-al psllq src+dest dest-al src-al library
Code size ? 70 291 222 200 269 33
---------------------------------------------------------------------------
2048, d0s0-0 2591 566 363 363 373 363 560
2048, d1s1-0 2598 619 421 423 444 423 1726
2048, d7s7-0 2584 619 419 421 446 421 1735
2048, d7s8-1 2782 1525 1090 441 956 1033 1527
2048, d7s9-2 2788 1526 1090 441 957 1033 1755
2048, d8s7+1 2786 1309 1090 683 809 815 1467
2048, d8s8-0 2581 619 421 423 448 423 560
2048, d8s9-1 2789 1526 1083 441 956 1033 1465
2048, d9s7+2 2790 1309 1081 684 809 815 1802
2048, d9s8+1 2782 1309 1081 684 809 815 1511
2048, d9s9-0 2589 619 421 423 448 423 1737
2048, d15s15 2582 619 423 425 446 424 1723
MemCoP4 is optimised for the P4. MemCo2 should be insensitive to the cacheline split trap, as it does no unaligned memory accesses.
MemCoC3 is optimised for Celeron, and pretty fast, but very sensitive to code location.
Grateful for tests on other CPUs, jj - [EDIT] with the new MemCopySSE2c.zip, see two posts further down.
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (SSE4)
Algo _imp__st MemCo1 MemCo2 MemCoC3 MemCoP4 MemCo5 MemCoL
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32
strcpy dest-al psllq src+dest dest-al src-al library
Code size ? 70 291 222 200 269 33
---------------------------------------------------------------------------
2048, d0s0-0 1985 191 204 205 203 175 196
2048, d1s1-0 1992 223 251 248 251 222 244
2048, d7s7-0 1982 227 253 250 248 223 243
2048, d7s8-1 1976 246 592 501 193 211 243
2048, d7s9-2 1986 247 594 501 194 212 243
2048, d8s7+1 1982 249 587 501 193 179 244
2048, d8s8-0 1976 229 255 253 251 226 243
2048, d8s9-1 1983 249 588 502 194 211 243
2048, d9s7+2 1982 248 583 501 193 179 243
2048, d9s8+1 1977 249 583 501 193 179 243
2048, d9s9-0 1984 229 255 253 251 226 244
2048, d15s15 1974 228 256 256 258 226 243
--- ok ---
Core 2 shows dramatic improvements...!
I have renamed MemCo5 to MemCoC2, for Core 2. And I moved the somewhat faster loop for cases d7s8-1 etc. from MemCoP4 to the new MemCoC2.
[attachment deleted by admin]
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Algo _imp__st MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32
strcpy dest-al psllq src+dest dest-al src-al library
Code size ? 70 291 222 200 269 33
---------------------------------------------------------------------------
2048, d0s0-0 2079 551 359 424 424 359 547
2048, d1s1-0 2084 613 410 473 473 410 1060
2048, d7s7-0 2080 598 412 474 474 411 1059
2048, d7s8-1 2156 853 1016 564 567 569 802
2048, d7s9-2 2162 859 1016 564 567 568 1058
2048, d8s7+1 2172 849 868 564 565 566 804
2048, d8s8-0 2082 603 404 465 465 402 547
2048, d8s9-1 2177 848 995 564 565 568 803
2048, d9s7+2 2156 855 862 581 565 580 1060
2048, d9s8+1 2167 869 878 564 567 566 803
2048, d9s9-0 2090 595 412 472 472 409 1060
2048, d15s15 2064 592 410 470 486 408 1060
Thanks, Mark. Pretty balanced over the MemCoC3 MemCoP4 MemCoC2 columns.
But I am surprised about the cycle counts of ramguru's i7. The Nehalem Wiki (http://en.wikipedia.org/wiki/Intel_Core_3#Performance) is full of overclocking talk, but they seem to be barking up the wrong tree: that CPU runs the Masm32 library MemCopy in 240 cycles, instead of the ca. 1700 of the Celeron and the over 4000 of the P4, for misaligned data ::)
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (SSE4)
Algo _imp__st MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32
strcpy dest-al psllq src+dest dest-al src-al library
Code size ? 70 291 222 200 269 33
---------------------------------------------------------------------------
2048, d0s0-0 1647 360 210 210 209 165 367
2048, d1s1-0 1649 396 259 266 258 219 1876
2048, d7s7-0 1631 402 265 261 261 218 1876
2048, d7s8-1 2159 1332 861 493 658 670 1439
2048, d7s9-2 2188 1338 862 493 658 666 1906
2048, d8s7+1 2151 1328 855 829 701 700 1393
2048, d8s8-0 1639 402 267 262 268 236 365
2048, d8s9-1 2205 1333 854 493 658 667 1290
2048, d9s7+2 2151 1329 849 828 700 701 1877
2048, d9s8+1 2143 1330 849 830 701 701 1471
2048, d9s9-0 1642 403 266 262 268 235 1884
2048, d15s15 1642 404 270 264 265 235 1893
those core duo's are pretty nice, too :U
I collected some evidence that could solve the "location sensitivity" mystery - which seems much more a cacheline split penalty (http://www.google.com/search?q=cacheline+split+penalty) issue. Only Intel CPUs are affected, by the way. It also explains why ramguru's i7 is so blazingly fast...
The attachment contains timings at the bottom. It is RTF, written with RichMasm (http://www.masm32.com/board/index.php?topic=9044.msg73814#msg73814) but WordPad opens it, too.
Cygnus: (http://www.cygnus-software.com/papers/infinitytestresults.html)
Pentium has 3 times penalty for loading across QWORD boundaries.
Pentium has 3 times penalty for loading across cache line boundaries.
Pentium has 3 times penalty for loading across page boundaries.
Pentium has 6 times penalty for saving across QWORD boundaries.
Pentium has 6 times penalty for saving across cache line boundaries.
Pentium has 16 times penalty for saving across page boundaries.
Pentium III has 0.7 times!!! 'penalty' for loading across QWORD boundaries.
Pentium III has 5.5 times penalty for loading across cache line boundaries.
Pentium III has 47 times penalty for loading across page boundaries.
Pentium III has 0.7 times!!! 'penalty' for saving across QWORD boundaries.
Pentium III has 4.5 times penalty for saving across cache line boundaries.
Pentium III has 64 times penalty for saving across page boundaries.
Pentium IV has no penalty for loading across QWORD boundaries.
Pentium IV has 5.5 times penalty for loading across cache line boundaries.
Pentium IV has 20 times penalty for loading across page boundaries.
Pentium IV has no penalty for saving across QWORD boundaries.
Pentium IV has 22 times penalty for saving across cache line boundaries.
Pentium IV has 23 times penalty for saving across page boundaries.
Doom9: (http://forum.doom9.org/archive/index.php/t-131033.html)
The best case would be to compare only two aligned blocks, but we must shift either the source
or the destination block freely, so one of them should be unaligned in most cases. The trouble is,
that only MMX tolerates unaligned access; most SSE instructions don't. To make matters worse,
any newer intel chip has a noticeable penalty for a memory access across physical cache line borders
and an even worse penalty if that also happens to coincide with a memory page border. So there are
unfortunately only two possible options right now: use simple MMX and ignore the penalty (which is
essentially using the old isse functions) or use a work around like in x256sad and align the memory
blocks (this could be done once for each frame if all data is sized in accordingly (padding and the like)).
Compensating on both sides source and destination is only possible for MMX but the overhead should
be far greater than the gain (ok it would be possible but it is certainly slower).
h264 (https://svn.blender.org/svnroot/bf-blender/trunk/blender/extern/x264/common/i386/pixel-sse2.asm)
Core2 (Conroe) can load unaligned data just as quickly as aligned data...
unless the unaligned data spans the border between 2 cachelines, in which
case it's really slow. The exact numbers may differ, but all Intel cpus
have a large penalty for cacheline splits.
(8-byte alignment exactly half way between two cachelines is ok though.)
LDDQU was supposed to fix this, but it only works on Pentium 4.
So in the split case we load aligned data and explicitly perform the
alignment between registers. Like on archs that have only aligned loads,
except complicated by the fact that PALIGNR takes only an immediate, not
a variable alignment.
Phaeron - 12 03 08 - 00:49 (http://www.virtualdub.org/blog/pivot/entry.php?id=190)
On current Intel CPUs, an L1 cache line is fetched and pushed through a shifter, which accesses the
desired words. If the access crosses L1 cache lines, a data cache unit split event occurs (with
associated penalty), bytes are shifted out from the adjacent L1 cache line, and the results are
combined to produce the misaligned words.
x264 (http://www.digital-digest.com/software/x264_history.html)
movaps/movups are no longer equivalent to their integer equivalents on the Nehalem, so that
substitution is removed. Nehalem has a much lower cacheline split penalty than previous Intel CPUs,
so cacheline workarounds are no longer necessary.
...
Rev. 696: avoid memory loads that span the border between two cachelines.
on core2 this makes x264_pixel_sad an average of 2x faster
Nehalem optimizations: the powerful new Core i7 (http://x264dev.multimedia.cx/?cat=1)
The cacheline split problem is basically gone: the penalty is now a mere 2 clocks instead of 12 for a
cacheline-split load. This, combined with the SSE speed improvements, ... took 150 cycles on Penryn
without cacheline split optimization, 111 cycles with, and takes 62 cycles on the Nehalem
...
Intel has finally come through on their promise to make float-ops-on-SSE-registers-containing-integers
have a speed penalty. So, we removed a few %defines throughout the code that converted integer ops
into equivalent, but shorter, floating point instructions. Unfortunately, there seems to be no way to
completely avoid floatops on integer registers, as many of these operations have no integer equivalents.
A classic example is "movhps", which takes an 8-byte value from memory and puts it into the high section
of a 16-byte SSE register. In integer ops, one can only directly move into the low 8-byte section of the register.
Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, ...
[attachment deleted by admin]
ha-ha file in last attachment has .asc extension .. took me some time to figure out that it should be .rtf :lol
Quote from: jj2007 on June 13, 2009, 10:09:19 PM
I collected some evidence that could solve the "location sensitivity" mystery
really ?
>any newer intel chip hat a noticeable penalty for a memory access across physical cache line borders
yep, but it's well known and has been for a long time; search for "memory aliasing" in the docs.
>all Intel cpus have a large penalty for cacheline splits.
same as before...
>If the access crosses L1 cache lines, a data cache unit split event occurs (with
associated penalty), bytes are shifted out from the adjacent L1 cache line, and the results are
combined to produce the misaligned words.
:bdg obviously those guys confuse cache lines and core memory cluster.
>movaps/movups are no longer equivalent to their integer equivalents on the Nehalem
? in virtue of what ? bits are bits !
>float-ops-on-SSE-registers-containing-integers have a speed penalty
same as before, data movements are the same! and what is the interest of using other float operations?
>A classic example is "movhps", which takes an 8-byte value from memory and puts it into the high section
of a 16-byte SSE register. In integer ops, one can only directly move into the low 8-byte section of the register.
Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, ...
pffft... the problem is the cost/manipulations to access the high part of the register; it's the same problem between unpcklps and unpckhps, unpcklbw and unpckhbw, punpckldq and punpckhdq, etc...
Perhaps a (partial) solution would be to use a larger alignment, specifically an alignment that matches the processor's cache line size or some integer multiple of it. And thinking further, you should be able to control the instruction sizes or add padding so that no instruction would split a cache line boundary.
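For the data side, that suggestion is easy to apply in a high-level sketch - assuming 64-byte lines, which is typical but should really be queried via CPUID. C11's aligned_alloc (or posix_memalign / _aligned_malloc on older toolchains) hands back storage whose alignment matches the line size, so a 16-byte load from an aligned offset can never straddle a line:

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate storage aligned to an assumed 64-byte cache line.
 * C11 aligned_alloc requires the size to be a multiple of the
 * alignment, so round it up first. */
void *alloc_line_aligned(size_t size)
{
    size_t line = 64;                            /* assumed line size */
    size_t rounded = (size + line - 1) & ~(line - 1);
    return aligned_alloc(line, rounded);
}
```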
that sounds like a lot of work
perhaps for very intensive speed critical areas
but, i think at some point you have to play the law of averages
that is, in part, what intel uses as a guide-line for design (hopefully)
They are not talking about the instruction cache. The data are the problem.
yes - i understand that
if you open a file buffer and read 64 kb of a file (or more), you are going to play the law of averages
otherwise, you are going to spend more cpu time thrashing things about than you save
if you are talking about declared data, most of that is used a few times during execution
and that of it that isn't may be adjusted
like i say - there is a point where it won't pay to optimize - let the law of averages take over
on certain portions of data, you will get lucky
on other portions, you will not
intel has designed the system so that you are lucky much more often than not
(of course, that's just a theory - lol)
Some thoughts (in regards to the original threadline of both variable timings, and consistent alignment issues)
---
On the issue of those spurious higher clock timings, even when you wait for the start of a timeslice.
Not all interrupts are timer-based. Your ethernet adapter may trigger an IRQ. Your video card may trigger an IRQ. Your disk drives may trigger an IRQ. Your USB ports may trigger an IRQ. Your mouse can trigger an IRQ.
All IRQ's are more important than your thread. When an IRQ happens, your thread is stopped abruptly so that the interrupt handler (drives, etc) may service it.
---
On the issue of caches.
There are plenty of important concepts that need to be considered. Perhaps the most important on Pentiums, but also the most frustrating, is Intel's "trace cache." Here you are dealing with some alignment issues that are completely hidden from you. Instructions within the trace cache have their own alignment which is independent of your binary. The stuff in the trace cache does not have a 1:1 correspondence with the x86 instruction set you are using. It's all in microcode, a language below x86.
Moving along.. physical memory is organized into cache lines, and each line can only be stored in specific cache sets. In a "4-way set associative" cache, a specific cache line only has 4 unique slots within the cache that it can be stored. The worst-case scenario is when you are constantly accessing 5 or more spots of memory that are all assigned the same set. 5 lines cannot fit in the 4 slots.
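The set arithmetic can be sketched for a hypothetical 32 KB, 4-way cache with 64-byte lines (the sizes are assumptions; real cache geometry varies per CPU and is reported by CPUID):

```c
#include <stdint.h>

/* Set index for an address: the bits just above the line offset.
 * For 32 KB / 4 ways / 64-byte lines: 32768 / (4 * 64) = 128 sets. */
unsigned cache_set(uintptr_t addr, unsigned line_size, unsigned num_sets)
{
    return (unsigned)((addr / line_size) % num_sets);
}
```

With these numbers, addresses 128 * 64 = 8192 bytes apart compete for the same four ways; looping over five such addresses is exactly the five-into-four eviction scenario described above.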
---
On the issue of code alignment in regards to cache.
Obsessive aligning can be a detriment. Putting code in the cache that will never get executed definitely has its downsides. Those dummy instructions were moved along the bus from system RAM into L3, then again into L2, and again into L1, only to have a 0% chance of being needed.
---
On the issue of code alignment in regards to branch prediction.
This is the second biggest issue, and also the second most frustrating. This is true for both AMD's and Intels, with the least effect (that I know of) on Core2's .. I am still unfamiliar with I7's
The forward branch prediction logic uses its own little bit of memory that can be considered a cache. Here is the deal: every memory address is assigned an index within this branch cache. When a conditional branch is taken, this little branch cache is updated (remembering what happened... YES it was, or NO it wasn't).
Lots of memory addresses share the same index, and this creates contention. Several different conditional branch instructions within your code can interfere with each other in regards to branch prediction. The Core2 luckily remembers the last several branches for a given index, and is able to handle simple patterns pretty well (YES / NO / YES / NO is terrible on AMD64's (always gets it wrong), but not a problem for the Core2)
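A toy model of that contention (the table size and indexing here are pure assumptions; real predictors hash more bits and keep per-entry history, but the aliasing mechanism is the same):

```c
#include <stdint.h>

/* Hypothetical 4096-entry branch-history table indexed by the low 12
 * address bits. Two conditional branches whose addresses agree in
 * those bits share one entry and overwrite each other's
 * taken/not-taken history. */
unsigned bht_index(uintptr_t branch_addr)
{
    return (unsigned)(branch_addr & 0xFFF);   /* 2^12 entries assumed */
}
```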
---
Deal only with real world scenarios or it doesn't count.
Quote from: Rockoon on July 27, 2009, 10:49:23 PM
In a "4-way set associative" cache, a specific cache line only has 4 unique slots within the cache that it can be stored. The worst-case scenario is when you are constantly accessing 5 or more spots of memory that are all assigned the same set. 5 lines cannot fit in the 4 slots.
Stuff for thought. That could be the culprit indeed.
Good to see you back, Rockoon :thumbu
Quote from: jj2007 on July 27, 2009, 11:00:34 PM
Quote from: Rockoon on July 27, 2009, 10:49:23 PM
5 lines cannot fit in the 4 slots.
Stuff for thought. That could be the culprit indeed.
Good to see you back, Rockoon :thumbu
This brings to mind a simple test program that can be written: allocate a nice big buffer of memory and then test various steppings through it. You could even write something like this in Visual Basic (as I have done previously) and get meaningful results, since cache misses are extremely expensive and thus easily measured.
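Something like the following could serve as the core of that test - a sketch only. Strides worth trying are powers of two around cache_size/ways (e.g. 8192 bytes for a hypothetical 32 KB, 4-way cache with 64-byte lines), where set-conflict misses should show up in the timing of this loop.

```c
#include <stddef.h>
#include <string.h>

/* Touch one byte per `stride` across `len` bytes of `buf`. Returns the
 * number of touches so a caller can verify the walk; the checksum (and
 * the volatile qualifier) keep the loop from being optimised away.
 * Wrap this in the best-of-N timer to compare strides. */
size_t stride_walk(volatile unsigned char *buf, size_t len, size_t stride,
                   unsigned *checksum)
{
    size_t touches = 0;
    unsigned sum = 0;
    for (size_t i = 0; i < len; i += stride) {
        sum += buf[i];
        touches++;
    }
    *checksum = sum;
    return touches;
}
```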