Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM

Antariy

Gunther, this code checks whether the CPU truly supports HTT:

mov eax,1        ; CPUID leaf 1: feature information
cpuid
shr ebx,16       ; move EBX[23:16] down to the low byte
and ebx,255      ; EBX = number of logical cores reported for the chip


If EBX contains 1, the CPU does not support HTT. An HTT CPU has logically separate cores, so a true HTT CPU must report more than 1 core on its chip.
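
The same extraction can be sketched in C for anyone who wants to check the logic outside assembly. This is only an illustration: the name `logical_count_from_ebx` is made up here, and the function takes the EBX value as a plain argument instead of executing CPUID itself, so obtaining that value is left to whatever your compiler provides.

```c
#include <stdint.h>

/* Hypothetical helper: mirrors "shr ebx,16 / and ebx,255".
   ebx is the EBX value returned by CPUID with EAX=1. */
static uint32_t logical_count_from_ebx(uint32_t ebx)
{
    return (ebx >> 16) & 0xFFu;   /* keep only bits 23..16 */
}
```

With a made-up EBX value of 00020800h the helper returns 2, and with 00010800h it returns 1, which is the case the post treats as "no HTT".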



Alex

Antariy

Quote from: hutch-- on September 05, 2010, 04:12:26 AM
Alex,

The level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling where many of the later ones show little change. I have algos written for P4s that are slower relative to a short algo on a later processor. Intel recommend unrolling up to the limit of the code cache but I have found in practice that you try different amounts and don't go more than where the speed picks up.

Yes, hutch.
Most code depends heavily on the hardware it runs on.
CPUs differ in their internal design, even when they are of the same type and microarchitecture.
It is no wonder that AMD and Intel give very different results in tests: AMD tends to build wide designs that use simple but parallel circuits (so it seems from all the tests), while Intel builds deep designs that rely on strong prediction and very deep pipelining. But deep pipelining is bad for code that needs a lot of register renaming, and when a prediction fails, a CPU with a deep pipeline suffers bigger stalls.
This is only my opinion, of course, but many things corroborate it: code that uses multiple registers in parallel runs very well on AMD, and AMD CPUs give off a lot of heat, which suggests really large circuits capable of truly parallel execution, and so on.




Alex

Gunther

Quote from: Antariy on September 06, 2010, 10:49:52 PM
If EBX contains 1, the CPU does not support HTT. An HTT CPU has logically separate cores, so a true HTT CPU must report more than 1 core on its chip.

Alex,

thank you for the hint. I'll inspect your CPU detection method as soon as possible. Maybe I can adopt a few ideas (with credit, of course) and make my procedure more robust and reliable.

Gunther
Forgive your enemies, but never forget their names.

Antariy

Gunther, your procedure is good; I am only suggesting a more accurate test for HTT.
You can see my thread "http://www.masm32.com/board/index.php?topic=14754.0" (CPU identification), where I posted a small app (cores.zip) that checks whether the CPU really supports HTT. It seems to work correctly.



Alex

dedndave

it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPU's, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all

Rockoon

AMD is rolling out its own version of HT in early 2011 with its first major architectural change since the first Athlons, the new 32nm Bulldozer architecture.

http://techreport.com/articles.x/19514

Expect Intel to try something similar when they move down to 22nm tech (it's too late for their 32nm tech: they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead).

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Gunther

Quote from: Rockoon on September 07, 2010, 05:55:47 AM
they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead

Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?

Gunther

Antariy

Quote from: dedndave on September 07, 2010, 02:32:03 AM
it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPU's, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all

Dave, I have read many times that the reported HTT support can be a lie. My BIOS has HTT support, but it does not show the option, because the BIOS knows that the CPU does not support HTT, even though the CPU says otherwise.

Here is EDX result after CPUID EAX=1 for my CPU: BFEBFBFFh.
Binary form:

10111111111010111111101111111111
   ^ this bit says what CPU have HTT, but this is NOT true.


The method I gave reports the number of logical/physical cores. If the CPUID HTT bit says the CPU supports HTT but the core count is 1, that is really funny :) So this CPU does not have HTT.
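
The combined check described above can be sketched in C as follows. It is a hedged illustration, not the code from cores.zip: the helper name `htt_really_present` and the macro name are invented here, the register values are passed in as arguments instead of being read with CPUID, and the only facts relied on are that the HTT flag is bit 28 of EDX and that the count sits in EBX[23:16].

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 28 of EDX (CPUID, EAX=1) is the HTT feature flag. */
#define CPUID_EDX_HTT (1u << 28)

/* Hypothetical helper: accept HTT only when the flag is set
   AND the count in EBX[23:16] is greater than 1. */
static bool htt_really_present(uint32_t edx, uint32_t ebx)
{
    bool     flag  = (edx & CPUID_EDX_HTT) != 0;
    uint32_t count = (ebx >> 16) & 0xFFu;
    return flag && count > 1;
}
```

With EDX = BFEBFBFFh from this post the flag is set, so the answer depends entirely on the count: an assumed EBX of 00010000h (count 1) gives false, while 00020000h (count 2) would give true.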

In my view, HTT is largely commercial advertising: "Hey, our CPU has 2 cores!"... But they are logical (i.e. virtual) and share the execution units of one physical core.
About 4 years ago I saw a true 2-core Prescott for LGA 775. It consumed 120 watts, and it ran really HOT...
Some may say that 2 VIRTUAL CPUs are nice, but that is funny :)


EDITED: I suggest treating the HTT bit as "This CPU architecture can support HTT...", and, when there is only 1 logical/physical core, "... but we economized on its implementation" :)



Alex

Rockoon

Quote from: Gunther on September 07, 2010, 10:32:23 PM
Quote from: Rockoon on September 07, 2010, 05:55:47 AM
they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead

Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?

Gunther


Sandy Bridge arrives.. maybe in December or January.

Bulldozer arrives a couple months after that.

It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Gunther

Quote from: Rockoon
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?

Gunther

Rockoon

Quote from: Gunther on September 08, 2010, 01:12:20 AM
Quote from: Rockoon
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?

Gunther

AVX, AES, SSE4.1 and SSE4.2

Not sure about SSSE3

Even though it will support AVX, it won't be using a 256-bit FPU.

Gunther

Quote from: Rockoon on September 08, 2010, 02:47:03 AM
Not sure about SSSE3

I hope that's not the point, because SSE3 is available nowadays on AMD chips.

Quote from: Rockoon on September 08, 2010, 02:47:03 AM
Even though it will support AVX, it won't be using a 256-bit FPU.

Does that work? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.

Gunther

Rockoon

Quote from: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon on September 08, 2010, 02:47:03 AM
Not sure about SSSE3

I hope that's not the point, because SSE3 is available nowadays on AMD chips.

SSSE3 is not the same as SSE3.

Quote from: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon on September 08, 2010, 02:47:03 AM
Even though it will support AVX, it won't be using a 256-bit FPU.

Does that work? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.


Yes, it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPUs.
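
To picture the splitting, here is a small C model of a 256-bit packed-float add carried out as two 128-bit halves. It only illustrates the idea in the post; the function names are invented, and it says nothing about how any particular CPU actually schedules its internal operations.

```c
#include <stddef.h>

/* One "128-bit" step: add four packed floats. */
static void add4(const float *a, const float *b, float *out)
{
    for (size_t i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

/* An 8-float ("256-bit") add done as two 4-float halves. */
static void add8_as_two_halves(const float *a, const float *b, float *out)
{
    add4(a,     b,     out);      /* low 128 bits  */
    add4(a + 4, b + 4, out + 4);  /* high 128 bits */
}
```

The result is identical to a single 8-wide add; only the number of passes through the narrower unit changes, which is where the throughput difference would come from.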

Gunther

Quote from: Rockoon on September 08, 2010, 04:10:59 AM
SSSE3 is not the same as SSE3.

Yes, I overlooked the 3rd S.

Quote from: Rockoon on September 08, 2010, 04:10:59 AM
Yes, it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPUs.

In that sense, the speed advantage for floating point operations is on Intel's side.

Gunther


Antariy

Quote from: jj2007 on September 03, 2010, 12:28:53 AM

OK. Here is DotPro18 with code sizes added.
78       bytes for DotXMM1Acc4E
278      bytes for DotXMM1Acc4EJ1
266      bytes for DotXMM1Acc4EJ2
60       bytes for AxDotXMM1_fastcall
120      bytes for DotXMM2Acc16ELingo
183      bytes for DotXMM2Acc32ELingo
129      bytes for DotXMM2Acc16EPaul


Alex, have you tried unrolling a little bit?

No, I had not tried it at that time. But after a long delay I found some time to do it.
The code size is still the smallest (~106 bytes) among the fast versions, and the speed is satisfactory.
It is simple unrolling with interleaved use of the execution units.
Probably Paul's code is the best on much more modern CPUs, because it accesses contiguous memory locations, but on old CPUs issuing many identical instructions (i.e. using the same execution units) gains nothing useful.
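
For readers without the archive, the general idea of unrolling a dot product with independent accumulators can be sketched in scalar C. This is not the AxDotXMM1 code, which is SSE assembly and is not reproduced in this post; the function name and the unroll factor of 4 are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch: dot product unrolled by 4.
   Four independent accumulators let successive multiplies and adds
   overlap instead of waiting on a single running sum. */
static float dot_unrolled4(const float *x, const float *y, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)            /* leftover elements */
        acc0 += x[i] * y[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

How far to unroll is exactly what the timings are for: as hutch noted, it pays to try different amounts and stop where the speed no longer improves.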

I also changed the calling convention (it is stdcall now).

Anybody who reads this post, please test it (archive attached).

These are my timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
3107    cycles for DotXMM1Acc4E
2864    cycles for DotXMM1Acc4EJ1
2804    cycles for DotXMM1Acc4EJ2
1860    cycles for AxDotXMM1
2246    cycles for DotXMM2Acc16ELingo
1879    cycles for DotXMM2Acc32ELingo
1907    cycles for DotXMM2Acc16EPaul

2965    cycles for DotXMM1Acc4E
2822    cycles for DotXMM1Acc4EJ1
2827    cycles for DotXMM1Acc4EJ2
1818    cycles for AxDotXMM1
2220    cycles for DotXMM2Acc16ELingo
1919    cycles for DotXMM2Acc32ELingo
1818    cycles for DotXMM2Acc16EPaul

2936    cycles for DotXMM1Acc4E
2920    cycles for DotXMM1Acc4EJ1
2818    cycles for DotXMM1Acc4EJ2
1852    cycles for AxDotXMM1
2256    cycles for DotXMM2Acc16ELingo
1898    cycles for DotXMM2Acc32ELingo
1832    cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---




Alex