"Optimizing" Compiler

Started by Neo, May 12, 2010, 12:50:57 AM

Neo

(I'm so furious right now that this might get moved to the Colosseum.  My apologies.)

I'm porting some assembly code to C++ with compiler intrinsics, and I just about puked when I saw the disassembly of the release build with full optimization and forced inlining in Visual Studio 2008.  It's got thousands of lines of:
movdqa      xmm0,xmmword ptr [rsp+900h]
pcmpgtd     xmm0,xmmword ptr [rsp+360h]
movdqa      xmmword ptr [rsp+1550h],xmm0
movdqa      xmm0,xmmword ptr [rsp+1550h]
movdqa      xmmword ptr [rsp+370h],xmm0
movdqa      xmm0,xmmword ptr [rsp+370h]
pand        xmm0,xmmword ptr [13FB42390h]
movdqa      xmmword ptr [rsp+1570h],xmm0
movdqa      xmm0,xmmword ptr [rsp+1570h]
movdqa      xmmword ptr [rsp+380h],xmm0
movaps      xmm0,xmmword ptr [rsp+380h]
movaps      xmmword ptr [rsp+950h],xmm0
cvtps2dq    xmm0,xmmword ptr [rsp+900h]
movdqa      xmmword ptr [rsp+1590h],xmm0
movdqa      xmm0,xmmword ptr [rsp+1590h]
movdqa      xmmword ptr [rsp+390h],xmm0
movdqa      xmm0,xmmword ptr [rsp+390h]
paddd       xmm0,xmmword ptr [13FB42390h]
movdqa      xmmword ptr [rsp+15B0h],xmm0
movdqa      xmm0,xmmword ptr [rsp+15B0h]
movdqa      xmmword ptr [rsp+3A0h],xmm0
movaps      xmm0,xmmword ptr [rsp+3A0h]
mulps       xmm0,xmmword ptr [13FB423C0h]
movaps      xmmword ptr [rsp+15D0h],xmm0
movaps      xmm0,xmmword ptr [rsp+15D0h]
movaps      xmmword ptr [rsp+8D0h],xmm0
rsqrtps     xmm0,xmmword ptr [rsp+8D0h]
movaps      xmmword ptr [rsp+15F0h],xmm0
movaps      xmm0,xmmword ptr [rsp+15F0h]
movaps      xmmword ptr [rsp+3B0h],xmm0
rsqrtps     xmm0,xmmword ptr [rsp+3B0h]
movaps      xmmword ptr [rsp+1610h],xmm0
movaps      xmm0,xmmword ptr [rsp+1610h]


It takes a minute to realize just how terrible this is.  It's not just doing 4 unnecessary moves to and from memory for every 1 real instruction.  It's doing the equivalent of "x= xmm0; xmm0 = x;" over and over, using a new variable for each one.  It was even worse before I forced inlining, since it had each one as a function call passing parameters on the stack.  Here's the original assembly for the above (with comments removed & names changed, since we haven't released the source, at least not yet).


pcmpgtd   xmm15,xmm0
pand      xmm15,xmm9
cvtps2dq  xmm1,xmm1
paddd     xmm1,xmm9
mulps     xmm1,xmmword ptr MyFactor
rsqrtps   xmm1,xmm1
rsqrtps   xmm1,xmm1


HOW can they call it an "optimizing" compiler?  Is there some "Don't Suck" compiler option I'm not aware of?  Do you guys have any other ideas on how to get C++-esque code that isn't this bad and works on 32- and 64-bit Windows, Linux, and Mac?

I'll be trying it with G++ tomorrow.  I don't expect it to do the same ridiculous thing, but I don't expect it to be great either, since previous tests I've done give decidedly mixed results for SSE support between VC++ and G++.  :(

hutch--

Neo,

Have a look at the output with the optimisation option turned off, just occasionally it looks better than the optimised output.

Other than that, a 64 bit MASM object module starts looking good.  :bg
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

BlackVortex

You can turn off optimizing for some parts of the code. Don't ask me how, you'll have to search for that, it should leave the inline code alone.

ecube

This is why it's disturbing when I hear even seasoned ASM developers make claims like "optimizing compilers" are somehow even remotely as good as well-written, hand-optimized ASM. And I've seen this claimed various times on this very forum!

Neo

The thing that gets me is that there's not even an excuse for it.  A compiler could very easily generate the original code when I give it the intrinsics explicitly like I am; it just doesn't.  Someone made a conscious decision not to implement the simplest, most obvious optimizations, instead restricting themselves to idiocy like only using 1 SSE register.  It's complete with things like this: when calculating the size of an array of 2*n doubles, it does n+n to save on multiplying by 2, but then it explicitly multiplies by 8 (i.e. sizeof(double)) using mul, plus a bunch of other instructions to check for overflow.  It's honestly like they wrote the entire compiler on paper and never checked their work.  ::)

For optimizations where the compiler really doesn't have sufficient info to easily assert something, like choosing to generate 4 times as many random numbers, I can certainly forgive a compiler for that, but this kind of thing, a compiler really should be able to either figure it out or not have the problem in the first place.

Can anyone recommend a journal or conference that'd still take papers on compiler optimization?  This stuff really needs to be brought to light, especially since compiler optimization has widely (for no good reason) been considered a solved problem since the 70's.

dedndave

the compiler just doesn't "think" as well as it should   :bg
but, if you time that code, i bet it is pretty fast, actually
that doesn't mean it isn't sloppy, nor does it mean it couldn't be better
one of the reasons i have always preferred ASM is that i have seen disassemblies from different compilers
what you are seeing is typical of older compilers, although they did not use SSE, of course

the compiler is only as good as the group of people who wrote it wanted it to be - lol
if they chose to spend more time on certain areas, those areas are improved
i suspect they focus on functions - especially those that use no API's
this is the code they can speed up in order to make their compiler appear to be better than the competition
if the code is not part of an isolated function (or perhaps in a loop), they may figure it is not going to be used over and over
if the function has API calls, they may figure it isn't speed-critical
they probably try to improve the code that will make the end result faster for time-consuming repetitive tasks
i have simplified this a bit, but you get the idea

compilers don't really "think" very well at all
if you write a function that simply sets a variable value to some constant, the compiler will write just that
in ASM, we might choose a macro for this
depending on the compiler, there may be additional overhead, as well

SetVarConst proc

        push    ebp
        mov     ebp,esp
        mov     eax,6
        mov     SomeVar,eax
        pop     ebp
        ret

SetVarConst endp

a macro like this might be more appropriate...

        mov dword ptr SomeVar,6

the group that wrote the compiler aren't likely to see how simple the code is or how it is going to be used

if you play card games at all, you may have noticed that they tend to stack the deck to make it more challenging
that is because it is really not that easy to "teach" a program to play a specific game well
it is much easier to write the code to stack the deck   :P
the programmer is likely to spend more time making the cards look nice - lol

i think we have all seen cases where a compiler actually generates some pretty fast code, too
sometimes, they can be a learning tool to writing better ASM
well - you found out where the programmers that wrote the compiler focused their attention

EDIT
here is a simple and fun little exercise   :P
this appears to be some type of "assignment" code (guessing)
try writing the same assignments inside a loop
for the iteration count, use a variable
assign that variable a value of 1 (single loop pass)
then, disassemble it and see if they optimized it differently because it's inside a loop   :U
you could even take the next step and put it in a function as well

this may change the way you write compiled code entirely - lol
maybe you could write a paper and get published !! (Forcing Compiler Optimization)
i will happily take 10% of any profits you might make   :bg

Neo

The code is already in a loop that has no API calls and does up to about 6 trillion iterations over the course of the application, so it needs performance (it already takes days to run on an i7), but it also needs to run on 6 OS's, meaning that maintaining 6 versions in assembly is a pain in the butt.

I'm definitely writing up a paper showing the compiler output for different things I give it, plus the performance results.  The 33 memory accesses in 33 instructions versus 1 memory access in 7 instructions probably makes a non-negligible difference in performance, but I'll know later today.  I have to write it such that I'm not presupposing anything, so it'll be named something along the lines of "An Evaluation of Compiler Optimization Quality".

...and Dave, I'll give you 500% of the $0 profit.  :bg

hutch--

Neo,

It appears to be a substantial drop in quality.  I have on and off played with the 32-bit output of VC, and often it's very good, nothing like the mess you have posted, but there I am mainly talking about high-level source code and the assembler output after the optimiser has finished.  It's probably the case that the later 64-bit compiler, especially with intrinsics, has yet to be as well developed as the 32-bit versions.  I have generally seen that very recent compiler design is going backwards from the techniques that were being used 6 or 7 years ago; before that, they tended to be out of date with changes in processor design.

I don't know if anyone has butchered it, but Wikipedia used to have a lot of very good technical data on compiler optimisation theory and practice.  I doubt that's the problem, though; it's probably the case that the intrinsics were tacked on as an afterthought to address the average level of assembler literacy in high-level programmers.

MichaelW

The problem seems too dumb to be intentional. Could it be something that was fixed in a SP or in a later version of the compiler?
eschew obfuscation

Neo

Quote from: MichaelW on May 12, 2010, 04:24:46 PM
The problem seems too dumb to be intentional. Could it be something that was fixed in a SP or in a later version of the compiler?
Well, it's VC++ 2008, and they've had "support" for SSE since VC++ 2003 (which outperformed G++ from 2006 in a few benchmark tests I did back then).  I can try it on VC++ 2010, but I don't think they've done a thing to actually improve the compiler since VC++ 2008.

dedndave

i think Hutch is on the right track
the 64-bit versions just haven't matured, yet
i bet their next release will have a lot of improvements
you could try testing the same type of code in a 32-bit compiler to see what the differences are

as for my idea - well - at least it was an idea - lol
dang - i was hoping you could get published   :'(

Neo

Sure enough, Hutch is probably right.  The 64-bit compiler's code is 10 times slower than the 32-bit compiler's code.  That said, I also discovered that the 64-bit compiler is for Visual Studio 2005, whereas the 32-bit one is for 2008, so I'm installing the 2008 64-bit compiler.  Apparently VS 2010 doesn't support 64-bit at all yet, so that one's not an option.

GregL

Quote from: Neo
Apparently VS 2010 doesn't support 64-bit at all yet, so that one's not an option.

It sure does support it; I've been using it.

Also, Visual Studio 2005 was from when they were just getting started with the x64 compiler, so I'm not surprised it's terrible.


hutch--

Neo,

If you don't mind bashing out the SSE code in bare mnemonics, the "Intel" option in GAS, last time I saw it, could come very close to direct MASM mnemonic notation, so you may be able to write some common code if it's only pointed at x86-64 targets.

From memory, with GAS in Intel mode you have to use the fully specified QWORD PTR style, but you could probably craft both with the same mnemonics and just use different headers and footers for the different platforms.

Neo

Quote from: hutch-- on May 13, 2010, 03:07:14 AM
If you don't mind bashing out the SSE code in bare mnemonics, the "Intel" option in GAS, last time I saw it, could come very close to direct MASM mnemonic notation, so you may be able to write some common code if it's only pointed at x86-64 targets.

From memory, with GAS in Intel mode you have to use the fully specified QWORD PTR style, but you could probably craft both with the same mnemonics and just use different headers and footers for the different platforms.
That's what we've got at the moment, but it still means 6 different versions for the 6 platforms, since GAS for Mac and GAS for Linux have mutually incompatible syntaxes.  Plus, 64-bit Mac uses RIP-relative addressing.  They also only partly support Intel syntax, so I have to go through and change every number, register dereference, and variable declaration, manually lay out the stack, and work around a serious bug with encoding conditional jumps... backward? or maybe it was forward.  One of the two is identified as an error in the Linux version of GAS, so I have to switch to AT&T syntax for 1 instruction, then switch back to Intel, etc. etc.

Anyway, the point is that it's a giant pain in the butt for me to update 6 versions of the code when we need it to do new stuff, and I'm already the only one who can modify the code.  :(  The hope is that with a simple C++ wrapper, it'll be almost like writing scalar code in C++... which it is, when the compiler's not being crazy.  :wink