Clock-cycle count macros for Microsoft C

Started by MichaelW, January 11, 2010, 06:09:08 AM


MichaelW

The attachment contains my attempt at cycle-count macros for the Microsoft C compilers that do essentially what the second set of MASM macros (counter2.zip) does. I started out trying to port a set of macros that I created for FreeBASIC, where my counting loop was a FOR loop opened in the CTR_BEGIN macro and closed in the CTR_END macro. This worked well with the FreeBASIC compiler, but the Microsoft compiler would close the loop at the end of the CTR_BEGIN macro and not include the loop code from the CTR_END macro. To solve this I ended up using code labels and the dreaded goto. I could not find a way to make the preprocessor create unique labels for each macro invocation, so I added a parameter to each macro through which the programmer supplies a unique symbol that gets appended to the labels.

Another problem is that the compiler optimizations /O1 and /O2 break something in the macros. In the test app I protected the macros by switching the optimizations off. This affects the code between the macros, but does not prevent the compiler from optimizing code elsewhere in the source. Also note that the second set of MASM macros, and likely these macros, do not work well on the Pentium IV.
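
For readers without the attachment, here is a minimal sketch of the label-suffix idea described above. The names REPEAT_BEGIN, REPEAT_END and count_iterations are hypothetical and are not the macros in the attachment. It assumes a compiler that accepts C99-style declarations in the middle of a block, and a repeat count of at least 1. The caller-supplied symbol is pasted onto the variable and label names with the ## operator, so two invocations in the same function do not collide.

```c
/* Hypothetical macros for illustration only; not the CTR_BEGIN/CTR_END
   macros from the attachment. The id argument is pasted onto the
   counter and label names so each invocation gets its own symbols. */
#define REPEAT_BEGIN(id, count)          \
    int rep_left_##id = (count);         \
    rep_top_##id: ;

#define REPEAT_END(id)                   \
    if (--rep_left_##id > 0) goto rep_top_##id;

/* Runs two independent goto-based loops of n passes each (n >= 1)
   and returns the total number of times the loop bodies executed. */
static int count_iterations(int n)
{
    int hits = 0;

    REPEAT_BEGIN(a, n)
        hits++;
    REPEAT_END(a)

    REPEAT_BEGIN(b, n)
        hits++;
    REPEAT_END(b)

    return hits;
}
```

Using the same symbol twice in one function would produce a duplicate-label error, which is why the caller has to supply a different symbol for each invocation.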

Another test, this one to verify that compiler optimization of code outside the area of the macro calls works as expected.

#include <windows.h>
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include "counter_c.c"

//--------------------
// Build with /O2 /G6
//--------------------

//------------------------------------------------------------------
// The memcpy functions are from the source provided with the PSDK.
//------------------------------------------------------------------

void * __cdecl memcpy_opt (
        void * dst,
        const void * src,
        size_t count
        )
{
        void * ret = dst;
        while (count--) {
                *(char *)dst = *(char *)src;
                dst = (char *)dst + 1;
                src = (char *)src + 1;
        }
        return(ret);
}

#pragma optimize( "", off )

void * __cdecl memcpy_no_opt (
        void * dst,
        const void * src,
        size_t count
        )
{
        void * ret = dst;
        while (count--) {
                *(char *)dst = *(char *)src;
                dst = (char *)dst + 1;
                src = (char *)src + 1;
        }
        return(ret);
}

//------------------------------------------------------------
// Switching the optimizations off proved to be necessary for
// the /O1 and /O2 options to prevent the compiler from
// breaking something in the macro code.
//------------------------------------------------------------

int main(void)
{

    #define SZ 64

    char src[SZ];
    char dst[SZ];

    Sleep(4000);

    CTR_BEGIN( 1, 1000, HIGH_PRIORITY_CLASS );
        memcpy_opt( &dst, &src, SZ );
    CTR_END( 1 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 2, 1000, HIGH_PRIORITY_CLASS );
        memcpy_no_opt( &dst, &src, SZ );
    CTR_END( 2 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 3, 1000, HIGH_PRIORITY_CLASS );
        memcpy_opt( &dst, &src, SZ );
    CTR_END( 3 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 4, 1000, HIGH_PRIORITY_CLASS );
        memcpy_no_opt( &dst, &src, SZ );
    CTR_END( 4 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 5, 1000, HIGH_PRIORITY_CLASS );
        memcpy_opt( &dst, &src, SZ );
    CTR_END( 5 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 6, 1000, HIGH_PRIORITY_CLASS );
        memcpy_no_opt( &dst, &src, SZ );
    CTR_END( 6 )
    printf( "%d cycles\n", ctr_cycles );

    getch();
    return 0;
}

#pragma optimize( "", on )


eschew obfuscation

alp

Why are _ctr_overhead_ and ctr_cycles set to 2000000000?

MichaelW

Quote from: alp on September 28, 2010, 09:34:35 AM
Why are _ctr_overhead_ and ctr_cycles set to 2000000000?

I did this so the following statements can capture the lowest values that occur in any loop:

if( _ctr_tsc2_ - _ctr_tsc1_ < _ctr_overhead_ )
  _ctr_overhead_ = _ctr_tsc2_ - _ctr_tsc1_;

if( _ctr_tsc2_ - _ctr_tsc1_ < ctr_cycles )
  ctr_cycles = _ctr_tsc2_ - _ctr_tsc1_;


If the initial values are too low, the comparisons will always evaluate to false and the assignment statements will never be executed. The 2000000000 was just a convenient value that I knew would work.
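
The effect of the starting value can be sketched in isolation. The function names and sample numbers below are made up for illustration and are not part of the macros; only the 2000000000 starting value comes from the post above.

```c
#include <stddef.h>

#define START_HIGH 2000000000   /* the starting value used in the macros */

/* Tracks the lowest sample, starting from a value larger than any
   cycle count we expect to see, so the first real sample replaces it. */
static int lowest_sample(const int *samples, size_t n)
{
    int lowest = START_HIGH;
    size_t i;
    for (i = 0; i < n; i++) {
        if (samples[i] < lowest)
            lowest = samples[i];
    }
    return lowest;
}

/* Same loop with a starting value that is too low: no positive sample
   is ever smaller than 0, so the assignment never executes. */
static int lowest_sample_bad_start(const int *samples, size_t n)
{
    int lowest = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (samples[i] < lowest)
            lowest = samples[i];
    }
    return lowest;
}
```

With hypothetical samples of 412, 396 and 405, the first function returns 396, while the second returns its starting value of 0 unchanged.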

alp

Thanks. Is the loop in CTR_BEGIN also acting as a cache warm-up for RDTSC? I don't see any RDTSC in _ctr_warmup_();

MichaelW

Quote from: alp on September 29, 2010, 02:12:40 PM
Thanks. Is the loop in CTR_BEGIN also acting as a cache warm-up for RDTSC? I don't see any RDTSC in _ctr_warmup_();

Thank you for pointing this out. I left the RDTSC instructions out because I was working from memory, and my memory of this detail was bad. Now that I refer to the Intel appnote where I first saw the technique, the warm-up code looks like this:

// Make three warm-up passes through the timing routine to make
// sure that the CPUID and RDTSC instruction are ready
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
// Only the last value of subtime is kept
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions


But as I have stated elsewhere, I have no confidence in this code because the author neglected to control the CPUID function (the value in EAX when the instruction executes), and I know this to have a relatively large effect on the CPUID cycle count. I basically added the warm-up code on the assumption that it would not hurt, and might help.
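
As a rough sketch of what "controlling the CPUID function" could look like in C, the code below always requests CPUID function 0 before reading the time-stamp counter, so every pass executes the same CPUID variant. This is not the code in _ctr_warmup_() and not the Intel listing. The function names serialized_rdtsc and measure_overhead are hypothetical, and it assumes an x86 or x64 target with either the Microsoft intrinsics in intrin.h or the GCC/Clang headers cpuid.h and x86intrin.h.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* __cpuid, __rdtsc (Microsoft compilers) */
#else
#include <cpuid.h>       /* __cpuid macro (GCC/Clang) */
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */
#endif

/* Executes CPUID with a fixed function number (0) as a serializing
   instruction, then reads the time-stamp counter. */
static uint64_t serialized_rdtsc(void)
{
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 0);
#else
    unsigned int a, b, c, d;
    __cpuid(0, a, b, c, d);
    (void)a; (void)b; (void)c; (void)d;
#endif
    return __rdtsc();
}

/* Returns the lowest difference seen between two back-to-back
   serialized reads, as an estimate of the measurement overhead. */
static uint64_t measure_overhead(int passes)
{
    uint64_t best = UINT64_MAX;
    int i;
    for (i = 0; i < passes; i++) {
        uint64_t t1 = serialized_rdtsc();
        uint64_t t2 = serialized_rdtsc();
        if (t2 - t1 < best)
            best = t2 - t1;
    }
    return best;
}
```

The first few passes through measure_overhead also act as the warm-up, since only the lowest difference is kept.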

alp

Thanks again. I am following the same Intel doc for my own routines and was curious about your warm-up code.