my official o/t thread P: ???

daydreamer · September 26, 2008, 08:50:24 AM

http://www.flipcode.com/archives/Raytracing_Topics_Techniques-Part_6_Textures_Cameras_and_Speed.shtml
From an assembly programmer's perspective this is really a nightmare: two arccos per pixel, plus a sine, etc.
There isn't even an arccos instruction; you have to calculate it with the help of arctan plus more work, which makes it even slower than arctan, and arctan is already really slow.
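
For reference, a minimal sketch of the arccos-through-arctan identity, acos(x) = atan2(sqrt(1 - x*x), x) for x in [-1, 1]. The helper name AcosViaAtan2 is made up for illustration; it only shows the extra square root and multiply that sit on top of the arctangent, not how any particular library does it.

```cpp
#include <cmath>

// Hypothetical helper (not from the flipcode article): arccos built from a
// square root plus a two-argument arctangent.
// Valid for x in [-1, 1]: acos(x) = atan2(sqrt(1 - x*x), x).
inline float AcosViaAtan2(float x)
{
    // Clamp the radicand so rounding near |x| = 1 cannot go negative.
    float s = std::sqrt(std::fmax(0.0f, 1.0f - x * x));
    return std::atan2(s, x);
}
```
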
First, this texel fetch:

Color Texture::GetTexel( float a_U, float a_V )
{
	// fetch a bilinearly filtered texel
	float fu = (a_U + 1000.5f) * m_Width;
	float fv = (a_V + 1000.0f) * m_Height;
	int u1 = ((int)fu) % m_Width;
	int v1 = ((int)fv) % m_Height;
	int u2 = (u1 + 1) % m_Width;
	int v2 = (v1 + 1) % m_Height;
	// calculate fractional parts of u and v
	float fracu = fu - floorf( fu );
	float fracv = fv - floorf( fv );
	// calculate weight factors
	float w1 = (1 - fracu) * (1 - fracv);
	float w2 = fracu * (1 - fracv);
	float w3 = (1 - fracu) * fracv;
	float w4 = fracu *  fracv;
	// fetch four texels
	Color c1 = m_Bitmap[u1 + v1 * m_Width];
	Color c2 = m_Bitmap[u2 + v1 * m_Width];
	Color c3 = m_Bitmap[u1 + v2 * m_Width];
	Color c4 = m_Bitmap[u2 + v2 * m_Width];
	// scale and sum the four colors
	return c1 * w1 + c2 * w2 + c3 * w3 + c4 * w4;
}

Inline it, of course.
This is where the SSE packed-single (PS) instructions can really shine if the texels are close together: use as many general-purpose registers as possible as pointers, so you can unroll the process into several MOVUPS/MULPS for bilinear texture filtering.
You know all about the u2+v1*m_Width optimizations for power-of-two texture sizes; well, this can be done in parallel with Donkey's trick, using SSE2's PSUB or PADD on XMM registers with constants.
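
As a reminder of the scalar version of that power-of-two optimization, here is a small sketch, assuming both texture dimensions are powers of two. Pow2Texture and TexelIndex are hypothetical names, not part of the article's Texture class; the `%` becomes an AND with a mask and the multiply by the width becomes a shift.

```cpp
#include <cstdint>

// Hypothetical layout for illustration only.
// Assumption: width == 1 << widthLog2 and height == 1 << heightLog2.
struct Pow2Texture
{
    std::uint32_t widthLog2;
    std::uint32_t heightLog2;
};

// Wraps u and v into the texture and returns the linear texel index,
// equivalent to (u % width) + (v % height) * width.
inline std::uint32_t TexelIndex(const Pow2Texture& t, std::uint32_t u, std::uint32_t v)
{
    const std::uint32_t uMask = (1u << t.widthLog2) - 1u;
    const std::uint32_t vMask = (1u << t.heightLog2) - 1u;
    return (u & uMask) + ((v & vMask) << t.widthLog2);
}
```
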
But best would be to have the right algorithm to shoot the rays in the right order, so that it follows the texture's scanlines as much as possible and the data cache is much more effective.

You could have an algorithm that checks for the two symmetry lines in a sphere, and that way only needs 1/4 of the costly per-pixel lookup calculations.

Mark_Larson · September 26, 2008, 12:43:27 PM

thanks for contributing!

Intel released a library of trig functions for SSE. I'll post the link later. I can't remember off the top of my head. But I used it a few times, and it flew. I am pretty sure they use Taylor to approximate. The other cool thing is you can compute a single trig value or 4 floating point trig values in parallel, or 2 double precision floating point numbers in parallel. The timing for doing multiple trig functions is the same as doing a single trig function.

Does anyone remember the name of the library or where to find it? If not I'll search for it later.

EDIT: also movups is a really slow instruction, don't use it. It takes 10 cycles on a P4. ewww. Just make sure the data is 16 byte aligned, that is what I do.

EDIT2: It is called the Approximate Math Library. I had an extremely hard time finding it. So I am guessing it won't be up there too much longer. So I recommend you get it now. I tried posting it to the forum but it was too big.

http://www.intel.com/design/pentiumiii/devtools/AMaths.zip

daydreamer · September 26, 2008, 11:08:53 PM

Quote from: Mark_Larson on September 26, 2008, 12:43:27 PM
thanks for contributing!
Intel released a library of trig functions for SSE.
EDIT: also movups is a really slow instruction, don't use it. It takes 10 cycles on a P4. ewww. Just make sure the data is 16 byte aligned, that is what I do.

You're welcome.
Yes, you could make your own macros as well.
This is a special case where you can choose. One option is 16 bytes/pixel and movaps, which means unnecessary data everywhere you don't use those extra 4 bytes for something else. That affects memory reads and fills the cache with unnecessary bytes wherever you don't make use of the alpha channel, and whether or not a texture plus your other data fits in cache can have a big impact on performance.
The other option, the one I chose, is movups, which can be used both on 12-byte pixels and on custom data like greyscale for light, if you choose to make use of that.
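
To put rough numbers on the footprint argument, a small sketch assuming float channels and a hypothetical 256x256 texture (the struct names and the texture size are just for illustration):

```cpp
#include <cstddef>

// Illustrative pixel layouts, assuming 4-byte float channels.
struct PixelRGB  { float r, g, b; };     // 12 bytes, no padding channel
struct PixelRGBA { float r, g, b, a; };  // 16 bytes, alpha often unused

static_assert(sizeof(PixelRGB) == 12, "expected tightly packed RGB");
static_assert(sizeof(PixelRGBA) == 16, "expected 16-byte RGBA");

// Total memory for a w x h texture with the given pixel size.
constexpr std::size_t TextureBytes(std::size_t w, std::size_t h, std::size_t bytesPerPixel)
{
    return w * h * bytesPerPixel;
}
```

For 256x256 that is 786432 bytes (768 KB) for the 12-byte layout versus 1048576 bytes (1 MB) for the 16-byte layout, so a quarter of the padded texture is data that may never be used.
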

Mark_Larson · September 26, 2008, 11:49:12 PM

it is actually easy to do movaps and 12-byte RGB. You just have to do 3 special cases.

Code:

movaps       xmm0,[src]       ;read R1, G1, B1, R2
movaps       xmm1,[src+16]    ;read G2, B2, R3, G3
movaps       xmm2,[src+32]    ;read B3, R4, G4, B4
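
The same three aligned loads can be sketched with C++ intrinsics. This is only an illustration under stated assumptions: ScaleFourRgbPixels is a made-up name, the buffer must be 16-byte aligned, and it holds four packed float RGB pixels (12 floats). Scaling by a weight, as in bilinear filtering, works on any lane, so the mixed R/G/B order inside each register does not matter here.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: scales four packed 12-byte RGB pixels by one weight
// using three aligned 16-byte loads and stores.
// Assumption: `rgb` is 16-byte aligned and laid out as
// R1 G1 B1 R2 | G2 B2 R3 G3 | B3 R4 G4 B4.
inline void ScaleFourRgbPixels(float* rgb, float weight)
{
    const __m128 w = _mm_set1_ps(weight);
    __m128 a = _mm_load_ps(rgb);      // aligned load: R1 G1 B1 R2
    __m128 b = _mm_load_ps(rgb + 4);  // aligned load: G2 B2 R3 G3
    __m128 c = _mm_load_ps(rgb + 8);  // aligned load: B3 R4 G4 B4
    _mm_store_ps(rgb,     _mm_mul_ps(a, w));
    _mm_store_ps(rgb + 4, _mm_mul_ps(b, w));
    _mm_store_ps(rgb + 8, _mm_mul_ps(c, w));
}
```
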

EDIT: The three movaps loads all semi-execute in parallel on a Core 2 Duo in 4 processor cycles. How, you ask? Easy.

The movaps on a Core 2 Duo WITH MEMORY has a 2-cycle latency but a reciprocal throughput of 1. That means you can put 3 back to back and, if there aren't any dependencies, it will all run in 4 cycles. The movups has a worse time pairing because of its 2-cycle reciprocal throughput, and its latency is 2-4 cycles. So let's assume 3. That means 3 movups back to back with memory would take 7 cycles. So it is almost twice as fast to use MOVAPS.

Let's look at it graphically.

Here is the movups. Each one takes 3 cycles to run. You cannot start another movups until 2 cycles have elapsed. The 1, 2, and 3 are the first, second and third movups respectively. Each line represents 1 cycle. If you count the number of lines you will find that it is 7, which corresponds to 7 cycles.
1
1
1 2
2
2 3
3
3

and movaps with memory. Keep in mind that 3 MOVAPS back to back without memory all execute in 1 processor cycle. Again each line represents one cycle. So that takes 4 cycles.

1
1 2
2 3
3
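
Both diagrams follow the same simple rule, sketched here as a small function (BackToBackCycles is a made-up name, and the model assumes fully independent instructions and nothing else competing for the load port): the last instruction starts after (n - 1) times the reciprocal throughput, then needs one full latency to finish.

```cpp
// Simplified model: cycles needed for n independent, back-to-back
// instructions with the given latency and reciprocal throughput.
constexpr int BackToBackCycles(int n, int latency, int recipThroughput)
{
    return n > 0 ? (n - 1) * recipThroughput + latency : 0;
}
```

With the numbers from the post, BackToBackCycles(3, 2, 1) gives 4 for movaps and BackToBackCycles(3, 3, 2) gives 7 for movups.
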

daydreamer · September 27, 2008, 07:43:25 PM

I think I'm gonna code that kind of solution and see how it goes.

Mark_Larson · September 27, 2008, 07:53:58 PM

We're 10 posts away from passing the PowerBASIC forum! Everyone just post random stuff! ;)

BlackVortex · September 28, 2008, 04:56:28 PM

Quote from: Mark_Larson on September 27, 2008, 07:53:58 PM
We're 10 posts away from passing the PowerBASIC forum! Everyone just post random stuff! ;)

+1 post for raytracing !!!

Mark_Larson · September 28, 2008, 06:08:30 PM

Quote from: BlackVortex on September 28, 2008, 04:56:28 PM
Quote from: Mark_Larson on September 27, 2008, 07:53:58 PM
We're 10 posts away from passing the PowerBASIC forum! Everyone just post random stuff! ;)

+1 post for raytracing !!!

thanks! +2

hutch-- · September 28, 2008, 10:13:00 PM

:bg

Damn, I will have to whip up some PowerBASIC code.

Mark_Larson · September 29, 2008, 12:15:12 AM

Quote from: hutch-- on September 28, 2008, 10:13:00 PM
:bg

Damn, I will have to whip up some PowerBASIC code.

Rofl. Your post makes it

Mark_Larson · September 29, 2008, 12:17:15 AM

Quote from: hutch-- on September 28, 2008, 10:13:00 PM
:bg

Damn, I will have to whip up some PowerBASIC code.

ROFL. Your post makes it 56, and mine makes it 57! :)

BlackVortex · September 29, 2008, 03:42:42 AM

Lol, yoda thinks the weed is strong in this thread.

daydreamer · September 29, 2008, 08:44:59 AM

LOL how ironic you previously made 6 edits in this forum to save posts and ...

Mark_Larson · September 29, 2008, 11:42:11 AM

yea!!! we just PASSED powerbasic! thanks guys :)
