Is that worth a bug report to the compiler designers?

Started by Gunther, September 10, 2010, 12:57:03 AM

Previous topic - Next topic

Gunther

This thread isn't exclusive for 64 bit assembly language; it also deals with some 32 bit programming questions. The main focus isn't speed, but accuracy.

In the last days, I've bit tested, how the gcc treats floating point values in extended precision (REAL 10, in C/C++ long double). This data type is x86 specific, but in accord with the IEEE publications and standards. They are very useful in the field of numerical linear algebra, for example, to solve large, ill conditioned linear equation systems. There are not much compilers, which support this data type and handle it properly. I know only the PowerBASIC compilers (DOS and Win32 versions) and some older Borland products (TurboC, Turbo Pascal), to name a few.

I've attached the archive pi.zip. It contains 3 small applications for: Linux64, Linux32, and Win32. Full source is included.

The program doesn't do much. It assigns PI to a long double (REAL 10), double (REAL 8), and float (REAL 4) value in 2 different ways and prints out the results. First, the compiler has to do the job and second, it's done via FPU using the inline assembler. Don't be afraid, I've written it in Intel syntax, so it should be readable. For several reasons, there isn't a 64 bit Windows version. At the moment, I haven't Windows 7 running on my machine; on the other hand, the parameter passing mechanism to functions is under 64 bit Windows very different from 64 bit Linux. That makes, in general, programming more complicated. Moreover, if the inline source would be re-written for Win64 parameter passing, it would produce the same garbage by printing long doubles as under Win32 (see below). But the Win32 version will run flawless in the compatibility mode under Windows 7; that's tested.

Here are the test results:

Linux 64 bit, compiled with gcc 4.3.5:

Layout of data types:
----------------------
Long Double (REAL 10)   = 16 Bytes
Double (REAL 8)         = 8 Bytes
Float (REAL 4)          = 4 Bytes

Compiler Results:
-----------------
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897931160
PI as Double Value      = 3.1415926535897931
PI as Float Value       = 3.141593

FPU Results:
------------
FPU control word:       = 37F hex
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897932385
PI as Double Value      = 3.1415926535897931
PI as Float Value       = 3.141593


Linux 32bit, compiled with gcc 4.2.4

Layout of data types:
----------------------
Long Double (REAL 10)   = 12 Bytes
Double (REAL 8)         = 8 Bytes
Float (REAL 4)          = 4 Bytes

Compiler Results:
-----------------
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897931160
PI as Double Value      = 3.1415926535897931
PI as Float Value       = 3.141593

FPU Results:
------------
FPU control word:       = 37F hex
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897932385
PI as Double Value      = 3.1415926535897931
PI as Float Value       = 3.141593


The data layout of long doubles is different, as you can see in the screen outputs above. Under 64 bit such a value is stored in 16 bytes (10 used, 6 unused). That's in according with AMDs 64 bit ABI http://www.x86-64.org/documentation/abi-0.99.pdf. Under 32 bit, they are stored in 12 bytes (10 used, 2 unused). Furthermore, the printed FPU control word (37Fh) is the proof, that the FPU is fit to treat REAL 10 values properly. I won't explain here the meaning and setting of every bit; for further information, please read: http://www.website.masmforum.com/tutorials/fptute/index.html. Thank you Raymond for your excellent FPU tutorial.

On the other hand, the screen printings show obviously that the gcc supports the data type but handles it not adequate. It treats them as REAL 8 (double). I've inspected the assembler source of the main procedure and that showed: the 64 bit gcc uses for all floating point operations xmm registers. It is absurd.

Under Windows (both, 64 and 32 bit) the situation is different. Microsoft compilers don't support REAL 10 values. You can have double or float - and that's it. As shown above, the windows gcc supports long double but uses the MS libc and therefore printf can't properly print such values - please have a look at the junk below.

Win32, compiled with gcc 3.4.5 (MinGW)

Layout of data types:
----------------------
Long Double (REAL 10)   = 12 Bytes
Double (REAL 8)         = 8 Bytes
Float (REAL 4)          = 4 Bytes

Compiler Results:
-----------------
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = -88796093704928900000000000000000000000000000.0000000000000000000
PI as Double Value      = 3.1415926535897931
PI as Float Value       = 3.141593

FPU Results:
------------
FPU control word:       = 37F hex
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = -88796093704934495000000000000000000000000000.0000000000000000000
PI as Double Value      = 3.1415926535897931
PI as Float Value       = 3.141593


Summary:

I think that's a serious error in the compiler design of the gcc (probably all versions). I'll try to set up a bug report at their awkward bug tracking system and we'll see what happens.

If there is enough time in the next days, I'll write a similar test program for the 16 bit PowerBASIC compiler. I'm pretty sure, that there are no differences between compiler code and hand written FPU code. I'll post the test suite inside the 16 bit forum.

Gunther   
Forgive your enemies, but never forget their names.

GregL

Gunther,

If I remember correctly, MinGW uses the Microsoft C Run-Time Library, so I don't think it would be a bug since in the Microsoft library a long double is the exactly the same as a double and is documented as such.

Also you said that the REAL10 data type is x86 specific, but you can use REAL10 variables in Windows x64 assembly language code. The problem is, at this time, in x64, I'm not aware of any way to convert a REAL10 to a string to display them.


clive

Microsoft C 8.00c (16-bit) aka MSVC 1.0

C:\MASM\pi\DOS>pi

Layout of data types:
----------------------
Long Double (REAL 10)   = 10 Bytes
Double (REAL 8)         = 8 Bytes
Float (REAL 4)          = 4 Bytes

Compiler Results:
-----------------
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897931160
PI as Double Value      = 3.1415926535897930
PI as Float Value       = 3.141593

FPU Results:
------------
FPU control word:       = 1370 hex
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897932385
PI as Double Value      = 3.1415926535897930
PI as Float Value       = 3.141593


The real problem here is that it is parsing your "long double" into a "double", you need to put an L at the end so it handles it as a LONG. I'm pretty sure GCC needs you to do that too.

#define M_PI 3.14159265358979323846264338327L

C:\MASM\pi\DOS>pi

Layout of data types:
----------------------
Long Double (REAL 10)   = 10 Bytes
Double (REAL 8)         = 8 Bytes
Float (REAL 4)          = 4 Bytes

Compiler Results:
-----------------
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897932385
PI as Double Value      = 3.1415926535897930
PI as Float Value       = 3.141593

FPU Results:
------------
FPU control word:       = 1370 hex
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897932385
PI as Double Value      = 3.1415926535897930
PI as Float Value       = 3.141593
It could be a random act of randomness. Those happen a lot as well.

Gunther

Quote from: GregL, September 10, 2010, at 04:56:22 AMIf I remember correctly, MinGW uses the Microsoft C Run-Time Library, so I don't think it would be a bug since in the Microsoft library a long double is the exactly the same as a double and is documented as such.

Yes it uses the Microsoft library; that was my point by starting the thread. But the design bug is possibly not limited to MinGW.

Quote from: clive, September 10, 2010, at 06:19:23 AMThe real problem here is that it is parsing your "long double" into a "double", you need to put an L at the end so it handles it as a LONG. I'm pretty sure GCC needs you to do that too.

Good point. I'll try that.

Gunther
Forgive your enemies, but never forget their names.

Gunther

Clive,

that was it. I've fixed the sources and now the gcc treats long double values. I think it's not a compiler bug; all what we can say is: sometimes are compilers a bit brain damaged. It remains the awful printf behavior under Win32. I think that I can write a workaround for that. Thanks again, clive.

Gunther
Forgive your enemies, but never forget their names.

clive

Gunther,

No problem.

It is good to see that you trying get your EE students to look critically at computer based math and especially the pitfalls in using floating point (precision, non-exact representations, rounding, truncation, etc). In a world where people automate calculations there is often an acceptance that the computer/calculator produces the correct result. It is important for them to understand to computation backward and forward, know what the answer should be at least roughly, and be able to validate the solution with a couple of approaches.

It might be an interesting experiment to compare solutions solved on a PC (x87 floating point) vs a TI-89 (68000 BCD math) vs an 68020/68881 (I'd suggest IDE68K, but it really uses the x87, an actual 68881 or 68882 would be more instructive)

Also you might want to look at the optimization levels used by the compiler, and how compound formulas evaluate vs line-by-line formulas. Even with 64-bit doubles, if you can keep intermediate results within the FPU it will retain 80-bit precision until you write it to memory.

The real crux of the Microsoft Win32 64-bit "long double" stems from the cross-platform compatibility, and data interchangeability issues. The 68881 extended precision for example is not the same as that of the 80387. Personally I still think they could have supported it in some form, even if that meant there were a few caveats, and there were some import/export functions to permit a common binary format on disc. Then again the 80's era, pre IEEE-754, had a whole mess of floating point formats, with several contributed by Microsoft.

Still, I'm not sure the printing is as big an issue as the representation used during the computation. In terms of printing, casting a long double to a double will suffice if you only need 19-20 digits of precision.

Another interesting trick would be take a fractional mantissa (suitably aligned), and keep multiplying the fractional remainder by 10 until all the decimal digits have been expressed.

-Clive
It could be a random act of randomness. Those happen a lot as well.

Rockoon

GCC using XMM registers is a "feature" .. it was discovered that the compiler produced faster code using XMM doubles than it could using the x87 stack (even allowing unrounded 80-bit temporaries) and since XMM doubles are IEEE compliant, the issue becomes that of assumptions rather than technical accuracy.

The main reason that compilers are more efficient with XMM's rather than the x87 stack is that its non-trivial to produce optimized stack usage that also obeys the rounding integrity of the described source code.. the 11 bits of extra mantissa precision could easily not be enough to allow for the compiler to re-order the operations for efficient fpu stack usage, but in xmm registers the order described by the source is easily obeyed at no penalty.

Its trivial to describe a series of additions that can't be re-ordered even with the extra 11 bits of precision over doubles, but even 32-bit floats would have otherwise been enough for an exact answer when not re-ordered.

1.0E+35 + -1.0E+35 + 1.0

..but re-ordering is the only way to really get great performance out of the FPU's stack for highly complex expressions.. which means hand-optimization and knowledge of the actual rounding issues given the scale of the data used.

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Antariy

Gunther, MS's CRT library does not designed to handle REAL10 type.
MS's compilers forse any foating data type to be passed as REAL8 - even if this is float, double or long double.
So, MSVCRT is uncompatible with any C compilers, which can use REAL10.

When extremely optimizing is used with MS C compiler, this can produce very undefined results (and danger, if app is critical).

For example, I attach to post your compiled pi.c from your second archive.
Initialization routine is initialize the same location for long double and double, because compiler found, what this variables is used subsequently, and compiler put them in one location.
So, last initialization of long double re-initialize the double also, which is can trash stack and trash double values.

Note: MSVC is use other variation for short ints, I change code that:

short int init_values_32bit(long double *x,double *y,float *z)
{
  short int cw; //FPU control word
  __asm{
fstcw [ebp-4] //store current control word
mov eax,[ebp+8] //eax -> long double
mov ecx,[ebp+12] //ecx -> double
mov edx,[ebp+16] //edx -> float
fldpi //st0 = PI
fst dword ptr [edx] //store float
fst qword ptr [ecx] //store double
fstp tbyte ptr [eax] //store long double
    }
  return cw;
}



Other thing - if extremely optimization used, and inlineing is not off - then init_values_32bit crash or trash app - because it is inlineid, but compiler don't know what code referenced to ebp (and any other references).

So, using MS's C compilers with long double data type is not possible in normal way - this can gets any unpredictable results, crashes etc.

But using of printf is possible via casting long double to double in compilers which support native REAL10.



Alex

clive

Actually a more agnostic approach would be to use

short int init_values_32bit(long double *x,double *y,float *z)
{
  short int cw; //FPU control word
  __asm{
fstcw cw //store current control word
mov eax,x //eax -> long double
mov ecx,y //ecx -> double
mov edx,z //edx -> float
fldpi //st0 = PI
fst dword ptr [edx] //store float
fst qword ptr [ecx] //store double
fstp tbyte ptr [eax] //store long double
    }
  return cw;
}
It could be a random act of randomness. Those happen a lot as well.

Antariy

Quote from: clive on September 11, 2010, 10:13:15 PM
Actually a more agnostic approach would be to use

short int init_values_32bit(long double *x,double *y,float *z)
{
  short int cw; //FPU control word
  __asm{
fstcw cw //store current control word
mov eax,x //eax -> long double
mov ecx,y //ecx -> double
mov edx,z //edx -> float
fldpi //st0 = PI
fst dword ptr [edx] //store float
fst qword ptr [ecx] //store double
fstp tbyte ptr [eax] //store long double
    }
  return cw;
}


Yes, Clive.

But I have habit to write all "__asm" procedures without stack frame (i.e. - naked), and have no habit to rely to compiler's help, because this can produce funny results sometimes.



Alex

Antariy

This is my result, I forgot post it in initial post:


Layout of data types:
---------------------
Long Double (REAL 10)   = 8 Bytes
Double (REAL 8)         = 8 Bytes
Float (REAL 4)          = 4 Bytes

Compiler Results:
-----------------
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = 3.1415926535897931000
PI as Double Value      = 3.1415926535897931
PI as Float Value       = 3.141593

FPU Results:
------------
FPU control word:       = 27F hex
PI                      = 3.14159265358979323846264338327 ...
PI as Long Double Value = -88796093704934495000000000000000000000000000.00000000
00000000000
PI as Double Value      = -88796093704934495000000000000000000000000000.00000000
00000000
PI as Float Value       = 0.000000


I used insane optimization with inlineing turned-off.
So, compilers work is nice - this is seems from "FPU Results" :)
But, in case of optimization - compiler works right - it cannot know, what actions are made in initialization proc.



Alex

GregL

#11
Gunther,

Sorry, I completely misunderstood the point you were trying to make in your first post.


Clive,

Good catch on the #define M_PI 3.14159265358979323846264338327L.



Gunther

Quote from: GregL, September 11, 2010, 11:39:17 pmSorry, I completely misunderstood the point you were trying to make in your first post.

Never mind and no need for excuses. :wink

Quote from: GregL, September 11, 2010, 11:39:17 pmClive,

Good catch on the #define M_PI 3.14159265358979323846264338327L

No, an excellent catch!  :U

Quote from: Antariy, September 11, 2010, 10:44:45 pmWhen extremely optimizing is used with MS C compiler, this can produce very undefined results (and danger, if app is critical).

Yes, of course. Therefore, the gcc provides __asm__ __volatile__, which is also used by the Linux kernel hackers. That means for the compiler: Don't touch that piece of code, we know what we're doing and don't need your optimization effort.

Gunther
Forgive your enemies, but never forget their names.