Suggestions and improvements for SSE2 code are welcome

Antariy · September 06, 2010, 09:49:52 PM

Gunther, here is code to check whether a CPU truly supports HTT:


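mov eax,1       ; CPUID leaf 1: processor info and feature bits
cpuid
shr ebx,16      ; move EBX[23:16] down into BL
and ebx,255     ; EBX = logical processor count per package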

If EBX contains 1, the CPU doesn't support HTT. An HTT CPU has logically separate cores, so a true HTT CPU must report more than 1 core on its chip.
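For readers who prefer C, the same bit extraction can be sketched as below. This is only an illustration, not Alex's code: the function name is hypothetical, and the caller is assumed to pass in whatever value CPUID leaf 1 returned in EBX. Intel documents this field as meaningful only when the HTT flag (EDX bit 28) is set.

```c
#include <stdint.h>

/* Mirrors "shr ebx,16 / and ebx,255": returns CPUID.1:EBX[23:16],
   the logical processor count per package.
   Hypothetical helper name; ebx is assumed to come from CPUID leaf 1. */
static unsigned logical_count_from_ebx(uint32_t ebx)
{
    return (ebx >> 16) & 0xFFu;
}
```

A result of 1 corresponds to the "no real HTT" case described above.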

Alex

Antariy · September 06, 2010, 09:58:25 PM

Quote from: hutch-- on September 05, 2010, 04:12:26 AM
Alex,

The level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling where many of the later ones show little change. I have algos written for P4s that are slower relative to a short algo on a later processor. Intel recommend unrolling up to the limit of the code cache but I have found in practice that you try different amounts and don't go more than where the speed picks up.

Yes, hutch.
Most code depends heavily on the hardware it runs on.
All CPUs have different internal designs, even when they are of the same type and microarchitecture.
It is no wonder that AMD and Intel give very different results in tests: AMD always makes wide designs, which use simple but parallel circuits (so it seems from all the tests), while Intel makes deep designs, which rely on strong prediction and very deep pipelining. But deep pipelining is bad for code that causes a lot of register renaming, and if a prediction fails, a CPU with a deep pipeline suffers bigger stalls.
These are only my thoughts, of course, but many things support them: code that uses multiple registers in parallel runs very well on AMD, and AMD CPUs give off a lot of heat, which may point to really large circuits that can execute truly in parallel, etc.

Alex

Gunther · September 06, 2010, 11:31:11 PM

Quote from: Antariy September 06, 2010, 10:49:52 pm
If EBX contains 1, the CPU doesn't support HTT. An HTT CPU has logically separate cores, so a true HTT CPU must report more than 1 core on its chip.

Alex,

thank you for the hint. I'll inspect your CPU detection method as soon as possible. Maybe I can adopt a few ideas (with credit, of course) and make my procedure more robust and reliable.

Gunther

Antariy · September 06, 2010, 11:37:56 PM

Gunther, your procedure is good; I only have a suggestion about accurate testing for HTT.
You can see my thread "http://www.masm32.com/board/index.php?topic=14754.0" (CPU identification), where I posted a small app (cores.zip) that checks whether the CPU really supports HTT. It seems to work correctly.

Alex

dedndave · September 07, 2010, 02:32:03 AM

it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPU's, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all

Rockoon · September 07, 2010, 04:55:47 AM

AMD is rolling out its own version of HT in early 2011 with its first major architectural change since the first Athlons, the new 32nm Bulldozer architecture.

http://techreport.com/articles.x/19514

Expect Intel to try something similar when they move down to 22nm tech (it's too late for their 32nm tech; they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead).

Gunther · September 07, 2010, 10:32:23 PM

Quote from: Rockoon September 07, 2010, at 05:55:47 AM
they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead

Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?

Gunther

Antariy · September 07, 2010, 10:44:28 PM

Quote from: dedndave on September 07, 2010, 02:32:03 AM
it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPU's, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all

Dave, I have read many times that the reported HTT support can be a lie. My BIOS has HTT support, but it doesn't show this option, because the BIOS knows that the CPU doesn't support HTT, even though the CPU says otherwise.

Here is the EDX result after CPUID with EAX=1 for my CPU: BFEBFBFFh.
Binary form:

10111111111010111111101111111111
   ^ this bit (bit 28) says the CPU has HTT, but that is NOT true.

The method I gave reports the logical/physical core count. If the CPUID HTT bit says the CPU supports HTT, but the core count is 1, that is really funny :) So this CPU doesn't have HTT.
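The combined check described here (HTT flag set, but only trusted when more than one logical core is reported) might look like this in C. It is a sketch under assumptions: the helper names are hypothetical, and the register values are expected to come from CPUID with EAX=1.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.1:EDX bit 28 is the HTT feature flag. */
static bool htt_flag_set(uint32_t edx)
{
    return ((edx >> 28) & 1u) != 0;
}

/* Hypothetical helper: trust the HTT flag only if the logical
   processor count in EBX[23:16] is greater than 1. */
static bool htt_really_usable(uint32_t edx, uint32_t ebx)
{
    unsigned logical = (ebx >> 16) & 0xFFu;
    return htt_flag_set(edx) && logical > 1;
}
```

With EDX = BFEBFBFFh the flag test succeeds, so the final answer depends entirely on the count in EBX.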

In my opinion, HTT is largely a marketing device: "Hey, our CPU has 2 cores!"... But they are logical (i.e. virtual) and share the execution units of one physical core.
About 4 years ago I saw a true 2-core Prescott for LGA 775. It ate 120 watts, and it ran really HOT...
Anybody can say that 2 VIRTUAL CPUs are nice, but this is funny :)

EDITED: I suggest treating the HTT bit as "This CPU architecture can support HTT...", and, given 1 logical/physical core, "... but we economized on its implementation" :)

Alex

Rockoon · September 07, 2010, 11:22:38 PM

Quote from: Gunther on September 07, 2010, 10:32:23 PM
Quote from: Rockoon September 07, 2010, at 05:55:47 AM
they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead

Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?

Gunther

Sandy Bridge arrives.. maybe in December or January.

Bulldozer arrives a couple months after that.

It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins the per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Gunther · September 08, 2010, 01:12:20 AM

Quote from: Rockoon
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins the per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process

Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?

Gunther

Rockoon · September 08, 2010, 01:47:03 AM

Quote from: Gunther on September 08, 2010, 01:12:20 AM
Quote from: Rockoon
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins the per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process

Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?

Gunther

AVX, AES, SSE4.1 and SSE4.2

Not sure about SSSE3

Even though it will support AVX, it won't be using a 256-bit FPU.

Gunther · September 08, 2010, 01:56:37 AM

Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Not sure about SSSE3

I hope that's not the point, because SSE3 is available nowadays on AMD chips.

Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Even though it will support AVX, it won't be using a 256-bit FPU.

Does that work? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.

Gunther

Rockoon · September 08, 2010, 03:10:59 AM

Quote from: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Not sure about SSSE3

I hope that's not the point, because SSE3 is available nowadays on AMD chips.

SSSE3 is not the same as SSE3.

Quote from: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Even though it will support AVX, it won't be using a 256-bit FPU.

Does that work? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.

Yes, it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPUs.
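As a purely conceptual model of that splitting (plain scalar C, not a description of any real decoder or of Bulldozer's actual hardware), a 256-bit packed add over eight floats can be expressed as two 128-bit adds over four floats each. Both function names are hypothetical.

```c
#include <stddef.h>

/* One 128-bit operation: add four packed single-precision floats. */
static void add_128(const float *a, const float *b, float *out)
{
    for (size_t i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

/* A 256-bit add (eight floats) issued as two 128-bit operations. */
static void add_256_as_two_128(const float a[8], const float b[8],
                               float out[8])
{
    add_128(a, b, out);             /* low half:  elements 0..3 */
    add_128(a + 4, b + 4, out + 4); /* high half: elements 4..7 */
}
```

The result is identical to a single 8-wide add; only the number of operations issued changes, which is where the throughput difference would come from.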

Gunther · September 08, 2010, 11:15:26 AM

Quote from: Rockoon, September 08, 2010, at 04:10:59 AM
SSSE3 is not the same as SSE3.

Yes, I've overlooked the 3rd S.

Quote from: Rockoon, September 08, 2010, at 04:10:59 AM
Yes, it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPUs.

In that sense, the speed advantage for floating point operations is on Intel's side.

Gunther

Antariy · September 28, 2010, 11:24:55 PM

Quote from: jj2007 on September 03, 2010, 12:28:53 AM

OK. Here is DotPro18 with code sizes added.
78 bytes for DotXMM1Acc4E
278 bytes for DotXMM1Acc4EJ1
266 bytes for DotXMM1Acc4EJ2
60 bytes for AxDotXMM1_fastcall
120 bytes for DotXMM2Acc16ELingo
183 bytes for DotXMM2Acc32ELingo
129 bytes for DotXMM2Acc16EPaul

Alex, have you tried unrolling a little bit?

No, I hadn't tried it at that time. But after a long delay I found some time to do it.
The code size is still the smallest of the fast versions (~106 bytes), and the speed is satisfactory.
It is simple unrolling with interleaving of the execution units used.
Probably, on much more modern CPUs, Paul's code is best because it accesses contiguous memory locations, but on old CPUs using many identical instructions (i.e. the same execution units) doesn't gain anything useful.
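The general idea of unrolling a dot product can be sketched in scalar C as follows. This is not the AxDotXMM1 routine from the attachment (that one is SSE2 assembly and interleaves different instruction types); it only illustrates the related technique of unrolling by four with independent accumulators, so that successive multiply-adds do not all wait on one running sum. The function name is hypothetical.

```c
#include <stddef.h>

/* Dot product unrolled by 4 with four independent partial sums. */
static float dot_unrolled4(const float *x, const float *y, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent dependency chains per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);

    /* Tail: handle the 0..3 elements left over. */
    for (; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

Because the partial sums are combined in a different order than a simple loop would use, floating point results can differ slightly from a non-unrolled version on general inputs.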

I also changed the calling convention (stdcall now).

Please test this, anybody who reads this post (archive attached).

These are my timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
3107    cycles for DotXMM1Acc4E
2864    cycles for DotXMM1Acc4EJ1
2804    cycles for DotXMM1Acc4EJ2
1860    cycles for AxDotXMM1
2246    cycles for DotXMM2Acc16ELingo
1879    cycles for DotXMM2Acc32ELingo
1907    cycles for DotXMM2Acc16EPaul

2965    cycles for DotXMM1Acc4E
2822    cycles for DotXMM1Acc4EJ1
2827    cycles for DotXMM1Acc4EJ2
1818    cycles for AxDotXMM1
2220    cycles for DotXMM2Acc16ELingo
1919    cycles for DotXMM2Acc32ELingo
1818    cycles for DotXMM2Acc16EPaul

2936    cycles for DotXMM1Acc4E
2920    cycles for DotXMM1Acc4EJ1
2818    cycles for DotXMM1Acc4EJ2
1852    cycles for AxDotXMM1
2256    cycles for DotXMM2Acc16ELingo
1898    cycles for DotXMM2Acc32ELingo
1832    cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---

Alex
