The MASM Forum Archive 2004 to 2012
Author Topic: Suggestions and improvements for SSE2 code are welcome  (Read 71956 times)
Antariy
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #105 on: September 06, 2010, 09:49:52 PM »

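Gunther, this code checks whether the CPU really supports HTT:
Code:
mov eax,1        ; CPUID leaf 1: processor info and feature bits
cpuid
shr ebx,16       ; move EBX bits 23..16 down to bits 7..0
and ebx,255      ; EBX = logical processor count reported for the package

If EBX contains 1, the CPU does not support HTT. An HTT CPU exposes logically separate cores, so a true HTT CPU must report more than one core on its chip.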
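
For readers who call CPUID from C, the same two steps (shift right by 16, mask with 255) can be sketched as a pure function over an EBX value that has already been read. This is only an illustration, not Alex's code: the name `logical_count_from_ebx` is made up here, and how EBX is obtained (compiler intrinsic or inline assembly) is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical helper: takes the EBX value returned by CPUID with EAX=1
   and returns bits 23..16, the same field the shr/and pair isolates. */
static uint32_t logical_count_from_ebx(uint32_t ebx)
{
    return (ebx >> 16) & 0xFFu;
}
```

With a made-up sample value of 00020800h the function returns 2; with 00010800h it returns 1.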



Alex

Antariy
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #106 on: September 06, 2010, 09:58:25 PM »

Quote from: hutch
Alex,

The level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling, whereas many of the later ones show little change. I have algos written for P4s that are slower relative to a short algo on a later processor. Intel recommends unrolling up to the limit of the code cache, but I have found in practice that it is better to try different amounts and not unroll further than the point where the speed picks up.

Yes, hutch.
Most code depends heavily on the hardware it runs on.
Every CPU has a different internal design, even within the same type and microarchitecture.
So it is no wonder that AMD and Intel give very different results in tests: AMD always makes wide designs, which use simple but parallel circuits (that is how it looks from all the tests), while Intel makes deep designs, which rely on strong prediction and very deep pipelining. But deep pipelining is bad for code that needs a lot of register renaming, and when a prediction fails, a CPU with a deep pipeline stalls for longer.
This is only my opinion, of course, but many things support it: code that uses multiple registers in parallel runs very well on AMD, and AMD CPUs give off a lot of heat, which may point to really large circuits capable of truly parallel execution, and so on.




Alex

Gunther
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #107 on: September 06, 2010, 11:31:11 PM »

Quote from: Antariy  September 06, 2010, 10:49:52 pm
If EBX contains 1, the CPU does not support HTT. An HTT CPU exposes logically separate cores, so a true HTT CPU must report more than one core on its chip.

Alex,

thank you for the hint. I'll look at your CPU detection method as soon as possible. Maybe I can adopt a few ideas (with credit, of course) and make my procedure more robust and reliable.

Gunther

Forgive your enemies, but never forget their names.

Antariy
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #108 on: September 06, 2010, 11:37:56 PM »

Gunther, your procedure is good; I am only suggesting a more accurate test for HTT.
See my thread "http://www.masm32.com/board/index.php?topic=14754.0" (CPU identification), where I posted a small app (cores.zip) that checks whether the CPU really supports HTT. It seems to work correctly.



Alex

dedndave
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #109 on: September 07, 2010, 02:32:03 AM »

it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPUs, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all

Rockoon
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #110 on: September 07, 2010, 04:55:47 AM »

AMD is rolling out its own version of HT in early 2011 with its first major architectural change since the first Athlons, the new 32nm Bulldozer architecture.

http://techreport.com/articles.x/19514

Expect Intel to try something similar when they move down to 22nm tech (it's too late for their 32nm tech: they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead).

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Gunther
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #111 on: September 07, 2010, 10:32:23 PM »

Quote from: Rockoon September 07, 2010, at 05:55:47 AM
they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead

Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?

Gunther

Antariy
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #112 on: September 07, 2010, 10:44:28 PM »

Quote from: dedndave
it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPUs, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all

Dave, I have read many times that the reported HTT support can be a lie. My BIOS supports HTT but does not show the option, because the BIOS knows that the CPU does not support HTT, even though the CPU says otherwise.

Here is the EDX result after CPUID with EAX=1 on my CPU: BFEBFBFFh.
In binary:
Code:
10111111111010111111101111111111
   ^ this bit (bit 28) says that the CPU has HTT, but this is NOT true.

The method I gave reports the logical/physical core count. If the CPUID HTT bit says that the CPU supports HTT but the core count is 1, that is really funny :) So this CPU does not have HTT.
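
The check described here combines two CPUID leaf 1 fields: the HTT flag in EDX (bit 28) and the count in EBX bits 23..16. A minimal C sketch over register values that have already been read might look like this. The names `htt_flag`, `logical_count` and `htt_really_present` are illustrative, and the rule encoded is the one suggested in this post, not an official detection algorithm.

```c
#include <stdbool.h>
#include <stdint.h>

/* EDX bit 28 after CPUID with EAX=1 is the HTT feature flag. */
static bool htt_flag(uint32_t edx)
{
    return ((edx >> 28) & 1u) != 0;
}

/* EBX bits 23..16 after CPUID with EAX=1: reported logical processor count. */
static uint32_t logical_count(uint32_t ebx)
{
    return (ebx >> 16) & 0xFFu;
}

/* Rule from the post: trust the HTT flag only if more than one
   logical processor is reported. */
static bool htt_really_present(uint32_t edx, uint32_t ebx)
{
    return htt_flag(edx) && logical_count(ebx) > 1u;
}
```

With the EDX value quoted in this post, `htt_flag(0xBFEBFBFF)` is true; paired with an EBX whose count field is 1 (for instance the made-up value 0x00010800), `htt_really_present` returns false.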

In my opinion, HTT is largely commercial advertising: "Hey, our CPU has 2 cores!"... But they are logical (i.e. virtual) cores and share the execution units of one physical core.
About 4 years ago I saw a true 2-core Prescott for LGA 775. It ate 120 watts of power and was really HOT...
Anybody can say that 2 VIRTUAL CPUs are nice, but it is funny :)


EDITED: I suggest treating the HTT bit as "This CPU architecture can support HTT...", and, given 1 logical/physical core, "... but we economized on its implementation" :)



Alex

Rockoon
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #113 on: September 07, 2010, 11:22:38 PM »

Quote from: Gunther
Quote from: Rockoon September 07, 2010, at 05:55:47 AM
they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead

Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?

Gunther


Sandy Bridge arrives... maybe in December or January.

Bulldozer arrives a couple of months after that.

It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Gunther
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #114 on: September 08, 2010, 01:12:20 AM »

Quote from: Rockoon
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?

Gunther

Rockoon
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #115 on: September 08, 2010, 01:47:03 AM »

Quote from: Gunther
Quote from: Rockoon
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?

Gunther

AVX, AES, SSE4.1 and SSE4.2

Not sure about SSSE3

Even though it will support AVX, it won't be using a 256-bit FPU.

Gunther
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #116 on: September 08, 2010, 01:56:37 AM »

Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Not sure about SSSE3

I hope that's not the point, because SSE3 is available nowadays on AMD chips.

Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Even though it will support AVX, it won't be using a 256-bit FPU.

Does that work? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.

Gunther

Rockoon
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #117 on: September 08, 2010, 03:10:59 AM »

Quote from: Gunther
Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Not sure about SSSE3

I hope that's not the point, because SSE3 is available nowadays on AMD chips.

SSSE3 is not the same as SSE3.

Quote from: Gunther
Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Even though it will support AVX, it won't be using a 256-bit FPU.

Does that work? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.


Yes, it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPUs.
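
To picture the splitting, here is a plain C model (not real hardware behaviour and not AVX intrinsics): an 8-float "256-bit" add is carried out as two 4-float "128-bit" adds, one per half. The type `vec256` and both function names are invented for this illustration.

```c
#include <stddef.h>

/* Illustrative model of a 256-bit register: eight 32-bit floats. */
typedef struct { float lane[8]; } vec256;

/* One "128-bit" operation: add four consecutive floats. */
static void add128(const float *a, const float *b, float *out)
{
    for (size_t i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

/* A "256-bit" add issued as two 128-bit operations: low half, then high half. */
static vec256 add256_split(vec256 a, vec256 b)
{
    vec256 r;
    add128(&a.lane[0], &b.lane[0], &r.lane[0]);  /* lanes 0..3 */
    add128(&a.lane[4], &b.lane[4], &r.lane[4]);  /* lanes 4..7 */
    return r;
}
```

The caller sees one 8-lane result either way; only the number of internal steps differs, which is why such a design can support the instruction set without a full-width unit.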

Gunther
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #118 on: September 08, 2010, 11:15:26 AM »

Quote from: Rockoon, September 08, 2010, at 04:10:59 AM
SSSE3 is not the same as SSE3.

Yes, I've overlooked the 3rd S.

Quote from: Rockoon, September 08, 2010, at 04:10:59 AM
Yes, it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPUs.

In that sense, the speed advantage for floating point operations is on Intel's side.

Gunther

Antariy
Re: Suggestions and improvements for SSE2 code are welcome
« Reply #119 on: September 28, 2010, 11:24:55 PM »


OK. Here is DotPro18 with code sizes added.
Code:
78       bytes for DotXMM1Acc4E
278      bytes for DotXMM1Acc4EJ1
266      bytes for DotXMM1Acc4EJ2
60       bytes for AxDotXMM1_fastcall
120      bytes for DotXMM2Acc16ELingo
183      bytes for DotXMM2Acc32ELingo
129      bytes for DotXMM2Acc16EPaul

Alex, have you tried unrolling a little bit?

No, I had not tried it at that time. But after a long delay I found some time to do it.
The code size is still the smallest (~106 bytes) among the fast versions, and the speed is satisfactory.
It is simple unrolling with interleaving of the execution units used.
Probably on much more modern CPUs Paul's code is the best, because it accesses contiguous memory locations, but on old CPUs using many identical instructions (i.e. the same execution units) gives nothing useful.
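
As a rough illustration of unrolling with independent accumulators, here is a scalar C sketch. It is not the AxDotXMM1 routine from the attached archive (that one is SSE2 assembly); the function name `dot_unrolled4` and the choice of four accumulators are assumptions made for the example.

```c
#include <stddef.h>

/* Dot product unrolled by four. Each accumulator forms its own dependency
   chain, so consecutive multiply-adds do not have to wait on each other. */
static float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)   /* leftover elements when n is not a multiple of 4 */
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ in the last bits from a straight loop.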

I also changed the calling convention (stdcall now).

Anybody who reads this post, please test it (archive attached).

These are my timings:
Code:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
3107    cycles for DotXMM1Acc4E
2864    cycles for DotXMM1Acc4EJ1
2804    cycles for DotXMM1Acc4EJ2
1860    cycles for AxDotXMM1
2246    cycles for DotXMM2Acc16ELingo
1879    cycles for DotXMM2Acc32ELingo
1907    cycles for DotXMM2Acc16EPaul

2965    cycles for DotXMM1Acc4E
2822    cycles for DotXMM1Acc4EJ1
2827    cycles for DotXMM1Acc4EJ2
1818    cycles for AxDotXMM1
2220    cycles for DotXMM2Acc16ELingo
1919    cycles for DotXMM2Acc32ELingo
1818    cycles for DotXMM2Acc16EPaul

2936    cycles for DotXMM1Acc4E
2920    cycles for DotXMM1Acc4EJ1
2818    cycles for DotXMM1Acc4EJ2
1852    cycles for AxDotXMM1
2256    cycles for DotXMM2Acc16ELingo
1898    cycles for DotXMM2Acc32ELingo
1832    cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---



Alex

* 2DotProduct.zip (33.64 KB - downloaded 441 times.)