my official o/t thread P: ???

Started by daydreamer, September 26, 2008, 08:50:24 AM

daydreamer

http://www.flipcode.com/archives/Raytracing_Topics_Techniques-Part_6_Textures_Cameras_and_Speed.shtml
this is really a nightmare from an assembler programmer's perspective: two arccos per pixel plus a sine etc.
there isn't even an arccos instruction; you calculate it with the help of arctan plus more stuff, which makes it even slower than arctan, which is already really slow
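
For reference, a minimal sketch of building arccos from arctan, which is roughly what x87 code has to do around FPATAN. The function name `acos_via_atan` is made up for illustration, and the identity assumes the input lies in [-1, 1].

```cpp
#include <cmath>

// acos(x) = atan2(sqrt(1 - x*x), x) for x in [-1, 1].
// The square root plus the arctangent is why arccos costs more than arctan alone.
inline float acos_via_atan(float x)
{
    // (1 - x) * (1 + x) equals 1 - x*x but loses less precision near |x| = 1.
    float s = std::sqrt((1.0f - x) * (1.0f + x));
    return std::atan2(s, x);
}
```

A call such as `acos_via_atan(0.5f)` should agree with `std::acos(0.5f)` to within float rounding.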
first, this texel fetch:

Color Texture::GetTexel( float a_U, float a_V )
{
// fetch a bilinearly filtered texel
float fu = (a_U + 1000.5f) * m_Width;
float fv = (a_V + 1000.0f) * m_Height; // V is scaled by the texture height
int u1 = ((int)fu) % m_Width;
int v1 = ((int)fv) % m_Height;
int u2 = (u1 + 1) % m_Width;
int v2 = (v1 + 1) % m_Height;
// calculate fractional parts of u and v
float fracu = fu - floorf( fu );
float fracv = fv - floorf( fv );
// calculate weight factors
float w1 = (1 - fracu) * (1 - fracv);
float w2 = fracu * (1 - fracv);
float w3 = (1 - fracu) * fracv;
float w4 = fracu *  fracv;
// fetch four texels
Color c1 = m_Bitmap[u1 + v1 * m_Width];
Color c2 = m_Bitmap[u2 + v1 * m_Width];
Color c3 = m_Bitmap[u1 + v2 * m_Width];
Color c4 = m_Bitmap[u2 + v2 * m_Width];
// scale and sum the four colors
return c1 * w1 + c2 * w2 + c3 * w3 + c4 * w4;
}

inline it, of course
this is where SSE PS instructions can really shine if the texels are close by: use as many general registers as possible for indirection, so you can unroll the process into several MOVUPS/MULPS for bilinear texture filtering
you know all about the u2+v1*m_width optimizations for power-of-two sized textures; well, this can be done in parallel with Donkey's trick, using SSE2's PSUB xmm, constants or PADD
but best would be to have the right algo to shoot the rays in the right order, so it follows the texture's scanlines as much as possible and the data cache is much more effective
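
As a scalar illustration of the power-of-two optimization mentioned above (not of Donkey's trick itself, which isn't spelled out in this thread): when width and height are powers of two, the `%` wraps become AND masks and the multiply becomes a shift. The helper name and the log2 parameters are assumptions for the sketch.

```cpp
#include <cstdint>

// Texel index for a texture whose width and height are both powers of two.
// widthLog2 = log2(width), heightLog2 = log2(height).
inline std::uint32_t texel_index_pow2(std::uint32_t u, std::uint32_t v,
                                      unsigned widthLog2, unsigned heightLog2)
{
    const std::uint32_t maskU = (1u << widthLog2) - 1u;   // replaces u % width
    const std::uint32_t maskV = (1u << heightLog2) - 1u;  // replaces v % height
    return (u & maskU) + ((v & maskV) << widthLog2);      // replaces v * width
}
```

For a 256x128 texture, `texel_index_pow2(257, 130, 8, 7)` gives the same index as `(257 % 256) + (130 % 128) * 256`.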

you could have an algo that checks for the two symmetry lines in a sphere, and that way you only need 1/4 of the costly per-pixel lookup calculations


Mark_Larson

#1
  thanks for contributing!

  Intel released a library of trig functions for SSE.  I'll post the link later.  I can't remember off the top of my head.  But I used it a few times, and it flew.  I am pretty sure they use Taylor to approximate.  The other cool thing is you can compute a single trig value or 4 floating point trig values in parallel, or 2 double precision floating point numbers in parallel. The timing for doing multiple trig functions is the same as doing a single trig function.

does anyone remember the name of the library or where to find it?  If not I'll search for it later.

EDIT: also movups is a really slow instruction, don't use it.  It takes 10 cycles on a P4. ewww.  Just make sure the data is 16 byte aligned, that is what I do.
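
In C++ terms, keeping the data 16-byte aligned could look like the sketch below. The type `Texel4` and the helper names are hypothetical; `_mm_load_ps` is the intrinsic for an aligned load (MOVAPS) and requires a 16-byte aligned address.

```cpp
#include <xmmintrin.h>
#include <cstdint>

// alignas(16) guarantees every Texel4 starts on a 16-byte boundary.
struct alignas(16) Texel4 { float c[4]; };

// True if the pointer is a multiple of 16, i.e. safe for an aligned load.
inline bool is_aligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Aligned load; would fault on a misaligned address, unlike _mm_loadu_ps.
inline __m128 load_texel(const Texel4& t)
{
    return _mm_load_ps(t.c);
}
```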

EDIT2: It is called the Approximate Math Library.  I had an extremely hard time finding it.  So I am guessing it won't be up there too much longer.  So I recommend you get it now.  I tried posting it to the forum but it was too big.

http://www.intel.com/design/pentiumiii/devtools/AMaths.zip
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

daydreamer

Quote from: Mark_Larson on September 26, 2008, 12:43:27 PM
  thanks for contributing!
  Intel released a library of trig functions for SSE. 
EDIT: also movups is a really slow instruction, don't use it.  It takes 10 cycles on a P4. ewww.  Just make sure the data is 16 byte aligned, that is what I do.
you're welcome
yes, you could make your own macros as well
this is a special case where you can choose whether to have 16 bytes/pixel and movaps. That means unnecessary extra data everywhere you do not use those 4 bytes for something else, which affects memory reads and fills the cache with unnecessary bytes wherever you do not use the alpha channel. It can make the difference between a texture plus your other data fitting in cache or not, and that has a big impact on performance
versus me choosing movups, which can be used on 12-byte pixels or on custom data like greyscale for light, if you choose to make use of that




Mark_Larson

#3
 it is actually easy to do movaps and 12-byte RGB.  You just have to do 3 special cases.

Code:


movaps       xmm0,[src]       ;read R1, G1, B1, R2
movaps       xmm1,[src+16]    ;read G2, B2, R3, G3
movaps       xmm2,[src+32]    ;read B3, R4, G4, B4
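
The same idea as an intrinsics sketch, assuming each pixel is three floats (12 bytes), so four pixels fill exactly three 16-byte vectors. The helper `scale_rgb4` is hypothetical, and both pointers are assumed to be 16-byte aligned.

```cpp
#include <xmmintrin.h>

// Scales four packed RGB float pixels (48 bytes) by one weight using
// three aligned loads and stores, matching the three movaps above.
inline void scale_rgb4(const float* src, float* dst, float weight)
{
    const __m128 w = _mm_set1_ps(weight);
    const __m128 a = _mm_load_ps(src);      // R1 G1 B1 R2
    const __m128 b = _mm_load_ps(src + 4);  // G2 B2 R3 G3
    const __m128 c = _mm_load_ps(src + 8);  // B3 R4 G4 B4
    _mm_store_ps(dst,     _mm_mul_ps(a, w));
    _mm_store_ps(dst + 4, _mm_mul_ps(b, w));
    _mm_store_ps(dst + 8, _mm_mul_ps(c, w));
}
```

A uniform weight needs no special cases; per-channel weights would need three differently arranged constant vectors, one per load.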


EDIT:  Those 3 instructions all semi-execute in parallel on a Core 2 Duo in 4 processor cycles.  How, you ask?  Easy.

The movaps on a Core 2 Duo WITH MEMORY has a 2 cycle latency but a reciprocal throughput of 1.  That means you can put 3 back to back and, if there aren't any dependencies, it will all run in 4 cycles.  The movups has a worse time pairing because of its 2 cycle reciprocal throughput, and its latency is 2-4 cycles.  So let's assume 3.  That means 3 movups back to back with memory would take 7 cycles.  So it is almost twice as fast to use MOVAPS (4 cycles versus 7).

let's look at it graphically.

here is the movups.  Each one takes 3 cycles to run.  You cannot start another movups until 2 cycles have elapsed.  The 1, 2, and 3 are the first, second and third movups respectively.  Each line represents 1 cycle.  If you count the number of lines you will find that it is 7, which corresponds to 7 cycles.
1
1
1 2
  2
  2 3
    3
    3

and movaps with memory.  Keep in mind that 3 MOVAPS back to back without memory all execute in 1 processor cycle.  Again each line represents one cycle.  So that takes 4 cycles.

1
1 2
  2 3
    3
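
The two diagrams follow the same simple rule, sketched here as a small function. It is a simplified model that assumes identical, independent instructions and ignores everything else in the pipeline; the function name is made up.

```cpp
// n independent, identical instructions issued back to back finish after
// latency + (n - 1) * reciprocalThroughput cycles.
constexpr int back_to_back_cycles(int n, int latency, int reciprocalThroughput)
{
    return n > 0 ? latency + (n - 1) * reciprocalThroughput : 0;
}
```

With the figures quoted above, `back_to_back_cycles(3, 2, 1)` gives 4 for movaps and `back_to_back_cycles(3, 3, 2)` gives 7 for movups.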

daydreamer

I think I'm gonna code that kind of solution and see how it goes

Mark_Larson

we're 10 posts away from passing the power basic forum!  everyone just post random stuff! ;)


BlackVortex

Quote from: Mark_Larson on September 27, 2008, 07:53:58 PM
we're 10 posts away from passing the power basic forum!  everyone just post random stuff! ;)


+1 post for raytracing !!!

Mark_Larson

Quote from: BlackVortex on September 28, 2008, 04:56:28 PM
Quote from: Mark_Larson on September 27, 2008, 07:53:58 PM
we're 10 posts away from passing the power basic forum!  everyone just post random stuff! ;)



+1 post for raytracing !!!

thanks! +2

hutch--

 :bg

Damn, I will have to whip up some PowerBASIC code.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Mark_Larson

Quote from: hutch-- on September 28, 2008, 10:13:00 PM
:bg

Damn, I will have to whip up some PowerBASIC code.

Rofl.  Your post makes it 

Mark_Larson

Quote from: hutch-- on September 28, 2008, 10:13:00 PM
:bg

Damn, I will have to whip up some PowerBASIC code.

ROFL.  Your post makes it 56, and mine makes it 57! :)

BlackVortex

Lol, yoda thinks the weed is strong in this thread.

daydreamer

LOL how ironic you previously made 6 edits in this forum to save posts and ...

Mark_Larson

yea!!!  we just PASSED powerbasic! thanks guys :)