inline MSVC++ and g++ asm, x86-64.

Started by Falene, May 02, 2012, 02:24:55 PM


Falene

Hi There :)

I believe I need to write this fast, as this will be my first post here, and I don't want the admins to think of me as a spambot :p
By no means am I an asm guru... I'm mostly your average C++ geek dreaming of accomplishing grand things with too little time (and too little knowledge).
I think I have a need for inline assembly, though... and I expect to find here a handful of people with fewer allergies to asm code than elsewhere.

What I'm trying to do is code a fixed-point maths implementation in C++, tailored to my needs: 64-bit signed integers with a 30-bit fractional part.
I also dream of meeting the following goals, which are relevant to the problem at hand:
a) I wish I could get my whole project working on both Windows and Linux, using MSVC++ and g++ respectively.
b) I wish I'd be able to optimize my fixed point routines on both systems when architecture is 64 bits. (and I don't really care about other platforms than AMD and Intel)
c) I'd like the implementation of basic mathematical operations to be inlineable in C++ <-- please emphasize this.
d) If possible, I'd like to factorize what is factorizable, but I have no trouble using a couple of #ifdef's...

I have a working C(++) implementation for a multiplication of two of these fixed point thingies. Basically using masks and shifts and several plain C operator *'s.
I have a working C(++) implementation for division using a poorly implemented multiplication with inverse.
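A minimal sketch of what such a mask-and-shift multiplication might look like, assuming a raw int64_t holding value * 2^30. This is not the poster's actual code; the names FX_FRAC_BITS, umul64x64 and fx_mul_portable are hypothetical. It truncates toward zero and does not detect overflow of the final result.

```cpp
#include <cstdint>

// Assumed format: raw = value * 2^30 stored in a signed 64-bit integer.
const int FX_FRAC_BITS = 30;

// Unsigned 64x64 -> 128 multiply built from four 32x32 -> 64 partial products.
inline void umul64x64(uint64_t a, uint64_t b, uint64_t& hi, uint64_t& lo) {
    const uint64_t MASK = 0xFFFFFFFFu;
    uint64_t a_lo = a & MASK, a_hi = a >> 32;
    uint64_t b_lo = b & MASK, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   // contributes to bits 0..63
    uint64_t p1 = a_lo * b_hi;   // contributes to bits 32..95
    uint64_t p2 = a_hi * b_lo;   // contributes to bits 32..95
    uint64_t p3 = a_hi * b_hi;   // contributes to bits 64..127

    // Middle column plus the carry out of p0; three 32-bit terms cannot overflow 64 bits.
    uint64_t mid = (p0 >> 32) + (p1 & MASK) + (p2 & MASK);

    lo = (mid << 32) | (p0 & MASK);
    hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

// Signed fixed-point multiply: work on magnitudes, then restore the sign.
inline int64_t fx_mul_portable(int64_t a, int64_t b) {
    bool negative = (a < 0) != (b < 0);
    uint64_t ua = a < 0 ? 0 - static_cast<uint64_t>(a) : static_cast<uint64_t>(a);
    uint64_t ub = b < 0 ? 0 - static_cast<uint64_t>(b) : static_cast<uint64_t>(b);

    uint64_t hi, lo;
    umul64x64(ua, ub, hi, lo);

    // Take bits 30..93 of the 128-bit product (a right shift by 30).
    uint64_t mag = (hi << (64 - FX_FRAC_BITS)) | (lo >> FX_FRAC_BITS);
    return negative ? static_cast<int64_t>(0 - mag) : static_cast<int64_t>(mag);
}
```

The four multiplies and the carry handling are exactly the work a single x86-64 mulq would replace.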

I believe using x86-64 mulq and being able to access the high part of the result directly would speed up my multiplication.
I believe making use of x86-64 divq twice, and being able to set "rdx" in the rdx:rax operand to the remainder from the previous divq, would make my division implementation a lot more accurate.

Problems :
-> I know that intrinsic functions exist, at least for MSVC++, which provide inlined versions of the mulq instruction, but no such thing exists for divq.
-> I've come to realize that MSVC++, at least, does not support inline asm for the x86-64 instruction set... -_- go figure.
-> MSVC++ and g++ have very different inlined asm-block syntax.

How would you gentlemen (and ladies, who knows) handle those problems? I'm on the verge of dropping those optimizations altogether... Yet primitive profiling of my naive C++ implementations shows that my application really could benefit from them.

Many thanks :)

dedndave

welcome to the forum   :U

it doesn't have to be inline, unless you want it to be
it is possible to create a library of math functions in ASM that are callable from C
that makes your ASM code look like ASM code and your C code look like C code   :P

poke around in the 64-bit sub-forum for related posts
start with one function - learn to write it, build it, add it to a LIB, and use it
then, use that as a template to create additional functions

Falene

Thanks for your answer, dedndave :)

Yeah, I suspect removing this inline requirement from my self-imposed specs would make things easier, but I was assuming a performance hit to add a function call to such a basic operation as multiplication.
Maybe I'm wrong in this assumption, though.

Oh, I can read your answer already... "profile your code, son".
*sigh*
I'm going, Pa ! Just a lil minute...

Edit : Oh, and, btw... even with extern asm code, I need to take care of function parameters, which for x64 would be handled differently in MSVC++ and g++, right ?

dedndave

Quote: Oh, I can read your answer already... "profile your code, son".

lol - i am the LAST guy you are going to hear that from   :bg

yes, there is a penalty - and yes, stack alignment is an issue with 64-bit calls
however - if you inline some code into a C function and call it - you get the same stuff

now, if you are talking about inlining the multiplication code where it's used and each time it's used - that may be advantageous
with your special fixed point format, your code could get large
it depends on how many times each function will get used, of course

i would first try to "bend" one of the floating point formats to fit my needs
i think double format has 53 bits of precision (52 stored fraction bits) and a range of something like 2x10^-308 to 2x10^+308
if that isn't enough range - i guess you could use 2 doubles to extend the range and have some conversion routines
this allows you to use the FPU or, better yet, SSE instructions
SSE is likely to provide you with the fastest code   :U

there is also the extended real format, which has 64 bits of precision and a range of roughly 3x10^-4932 to 1x10^+4932
the FPU supports this format - i don't think SSE does - haven't got that far, yet

i am not very knowledgeable with either 64-bit code or SSE   :(
but there are a few guys in here that post example code that may help you learn

if you write "discrete" code for your own format, it will not be as fast as one of the native formats
not by a long shot

perhaps you could write some conversion code (to and from your format) and use native formats for the calculations
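A small sketch of such conversion routines, assuming the format from the first post (raw = value * 2^30 in a signed 64-bit integer). The names are hypothetical. Note that a double only carries 53 bits of precision, so raw values wider than that will not survive a round trip exactly.

```cpp
#include <cmath>
#include <cstdint>

// Assumed format: raw = value * 2^30 stored in a signed 64-bit integer.
const int FX_FRAC_BITS = 30;

// Fixed point -> double: scale by 2^-30 (exact as long as raw fits in 53 bits).
inline double fx_to_double(int64_t raw) {
    return std::ldexp(static_cast<double>(raw), -FX_FRAC_BITS);
}

// Double -> fixed point: scale by 2^30 and round to the nearest raw value.
// No range check: the caller must make sure the result fits in 64 bits.
inline int64_t fx_from_double(double value) {
    return static_cast<int64_t>(std::llround(std::ldexp(value, FX_FRAC_BITS)));
}
```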

nixeagle

It sucks that MSVC does not support 64-bit inline assembler :(. That means you have to use the intrinsics for that.

I do agree that you should instruct the compiler to inline these functions. Now onto my solution ideas!

Since you are targeting two compilers that don't always play nice with each other, you are going to have to use the preprocessor. I would create a header file multiplication.h and do the whole implementation in this file. Start off with a declaration of your function:


#include <stdint.h>
typedef uint64_t u64;

inline u64 multiply(u64,u64);


From here, just write two implementations. On GCC you can use inline assembly and on MSVC you can use intrinsics.
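As a rough sketch of the two-implementation idea, here is a signed variant for the 30-bit fractional format from the first post. The name fx_mul and the constant FX_FRAC_BITS are made up for illustration. The MSVC branch uses the _mul128 and __shiftright128 intrinsics from <intrin.h>; the GCC branch swaps inline assembly for the compiler's __int128 type, which on x86-64 typically compiles down to a single 64x64 multiply. Overflow of the final result is not checked.

```cpp
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
  #include <intrin.h>
#endif

// Assumed format: raw = value * 2^30 stored in a signed 64-bit integer.
const int FX_FRAC_BITS = 30;

inline int64_t fx_mul(int64_t a, int64_t b) {
#if defined(_MSC_VER) && defined(_M_X64)
    // _mul128 returns the low 64 bits and stores the high 64 bits in hi.
    __int64 hi;
    __int64 lo = _mul128(a, b, &hi);
    // Shift the 128-bit hi:lo pair right by 30 and keep the low 64 bits.
    return static_cast<int64_t>(__shiftright128(static_cast<unsigned __int64>(lo),
                                                static_cast<unsigned __int64>(hi),
                                                FX_FRAC_BITS));
#elif defined(__SIZEOF_INT128__)
    // GCC/Clang on 64-bit targets: let the compiler form the 128-bit product.
    __int128 product = static_cast<__int128>(a) * b;
    // Right shift of a negative value is arithmetic on these compilers.
    return static_cast<int64_t>(product >> FX_FRAC_BITS);
#else
  #error "this sketch needs either MSVC x64 intrinsics or __int128"
#endif
}
```

Both branches round toward negative infinity, so they should agree bit for bit, which matters if reproducibility across platforms is a goal.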

P.S. with GCC at least, you can use compiler attributes to instruct it to always inline and to inline as if the function were a macro. I generally define a preprocessor define to handle this:

#ifdef __GNUC__
  #define GNUINLINE __attribute__((always_inline,gnu_inline)) inline
#else
  #define GNUINLINE inline
#endif


Of course, if you can find similar functionality for the Microsoft compiler, just replace the portion after #else with the correct code.
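For the Microsoft side, one candidate is the __forceinline keyword. A possible combined macro might look like the sketch below; FX_FORCE_INLINE and fx_add are hypothetical names, and this variant leaves out gnu_inline. _MSC_VER is tested first because some compilers define both macros.

```cpp
#if defined(_MSC_VER)
  // MSVC: request inlining regardless of the optimizer's own cost estimate.
  #define FX_FORCE_INLINE __forceinline
#elif defined(__GNUC__)
  // GCC/Clang: always_inline does the same job.
  #define FX_FORCE_INLINE inline __attribute__((always_inline))
#else
  // Unknown compiler: fall back to a plain inline hint.
  #define FX_FORCE_INLINE inline
#endif

// Fixed-point addition is a plain integer addition, so it makes a trivial demo.
FX_FORCE_INLINE long long fx_add(long long a, long long b) {
    return a + b;
}
```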

dedndave

hmmmm.....
maybe you can use data defines to force it to do 64-bit asm   :P
we sometimes use something like:
        db 83h,0F8h,7      ;cmp eax,7
to hard-code instructions into the code stream

it may not be pretty, but it works
and you can be assured it will generate the same code with different compilers

nixeagle

Quote from: dedndave on May 02, 2012, 06:28:57 PM
hmmmm.....
maybe you can use data defines to force it to do 64-bit asm   :P
we sometimes use something like:
        db 83h,0F8h,7      ;cmp eax,7
to hard-code instructions into the code stream

it may not be pretty, but it works
and you can be assured it will generate the same code with different compilers

From what I understand, you can't do that because there is no (working) asm keyword in MSVC when compiling in 64-bit mode. Without that we are unable to escape the C/C++ abstract machine.

jj2007

Quote from: dedndave on May 02, 2012, 03:24:20 PM
it is possible to create a library of math functions in ASM that are callable from C
that makes your ASM code look like ASM code and your C code look like C code   :P

Quote from: Falene on May 02, 2012, 03:49:55 PM
I was assuming a performance hit to add a function call to such a basic operation as multiplication

You would not call an asm routine that just multiplies two numbers a million times in an innermost loop.
However, as soon as you are able to write the .Repeat ... .Until or whatever loop code in assembler, the performance hit is gone.
Don't think in single instructions, think in innermost loops written in assembler and called from C.
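To illustrate the shape of that idea, here is a sketch of a routine that owns the whole inner loop, so the call overhead is paid once per array rather than once per multiplication. It is written in C++ rather than assembler, relies on the __int128 extension of GCC/Clang, and uses made-up names (fx_dot, FX_FRAC_BITS); an external asm version would expose the same kind of interface.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed format: raw = value * 2^30 stored in a signed 64-bit integer.
const int FX_FRAC_BITS = 30;

// Dot product of two fixed-point arrays. The products are accumulated at
// full 128-bit width and scaled back only once at the end, which also avoids
// losing low bits on every single multiplication.
// No overflow check: the caller must keep the sum within range.
int64_t fx_dot(const int64_t* a, const int64_t* b, std::size_t n) {
    __int128 acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<__int128>(a[i]) * b[i];
    }
    return static_cast<int64_t>(acc >> FX_FRAC_BITS);
}
```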

Falene

Wow, lots of replies :) Thanks a lot.

@dedndave : Fixed point add and sub are plain integer add and sub, so they stand their ground performance-wise vs the FPU stuff, even when profiling with uint64_t weirdies and consorts on a 32-bit arch... yet multiplication has room for improvement, and division is almost a no-go, atm. But amongst my reasons for choosing fixed point over floating point is the fact that accumulation is an op I'm planning to use far more often than muls in a great number of cases. I could go to great lengths about the other whys of this choice... maybe this is not the place, though... (?) but I could, if you were to insist on that ^^.

@nixeagle : Yup, I could do something like that for multiplication indeed. I was hoping I could also find a similar solution to the division problem.
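For the g++ half of the division problem, a sketch along these lines might work. It assumes GCC-style extended asm on x86-64 and falls back to __int128 arithmetic elsewhere; the names fx_div and FX_FRAC_BITS are hypothetical. The idivq instruction faults if the divisor is zero or if the quotient does not fit in 64 bits, so real code would need to check for that first. There is no MSVC branch here; an external asm routine would be one way to cover that side.

```cpp
#include <cstdint>

// Assumed format: raw = value * 2^30 stored in a signed 64-bit integer.
const int FX_FRAC_BITS = 30;

// Fixed-point division: (a * 2^30) / b, truncated toward zero.
inline int64_t fx_div(int64_t a, int64_t b) {
#if defined(__GNUC__) && defined(__x86_64__)
    // Split the 128-bit dividend a * 2^30 into the rdx:rax halves idivq expects.
    // The right shift of a negative value is arithmetic on GCC/Clang.
    int64_t hi = a >> (64 - FX_FRAC_BITS);
    int64_t lo = static_cast<int64_t>(static_cast<uint64_t>(a) << FX_FRAC_BITS);
    int64_t quotient, remainder;
    __asm__("idivq %[divisor]"
            : "=a"(quotient), "=d"(remainder)          // outputs: rax, rdx
            : "0"(lo), "1"(hi), [divisor] "rm"(b)      // inputs: rax, rdx, divisor
            : "cc");
    (void)remainder;  // available in rdx if a second refinement step is wanted
    return quotient;
#else
    // Portable fallback for other GCC/Clang 64-bit targets.
    __int128 dividend = static_cast<__int128>(a) * (static_cast<__int128>(1) << FX_FRAC_BITS);
    return static_cast<int64_t>(dividend / b);
#endif
}
```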

@dedndave² : That one seems particularly dirty ;) I like it, tbh :p And this is not the first time someone suggests this kind of hack as a workaround. Yet, a colleague of mine (this is not pro work, but I know some nice guys :p) raised a concern some days ago about this stuff (as I suspect the implicit next step in your post would be to cast a ptr to this data to a function pointer and call it) : Won't some OSes refuse such a run-time mix of data addresses with executable addresses, for fear that it looks a lot like viral code ? :p
Even if not, I have another concern, which is that this solution implies a function call, after all :(

@nixeagle² : wow, you might very well be right, in that case this is even worse than I thought at first.

@jj2007 : I think you hit the nail on the head with this solution. This means leaving most operations implemented in plain C, and only optimizing those in need (hear the sound of profiling, anyone... ? gah -_-) together with their surrounding code such as those tight innermost loops.
I was almost going to reply that perfect reproducibility was one of my goals, but this adds nothing to this discussion, as my concern is the reproducibility of a block of code vs the exact same block on another platform/run/mode... not the reproducibility of any result after using div() or mul()... so, that may very well be the perfect solution.

Pity me, though... profiling :(

Thanks everyone :)

dedndave

if you are set on fixed-point, i would think SSE would be the way to go
maybe Jochen or one of the other more advanced guys could get you started

oh - i just remembered
Raymond does some fixed-point stuff, too...
http://www.ray.masmcode.com/fixmath.html
at the bottom of the page is a download
maybe you could get some ideas