optimizing raytracing code

Started by Mark_Larson, September 26, 2008, 09:05:00 PM


Mark_Larson

Quote from: dr_eck
I know this is a MASM forum, but it would be helpful for those of us more interested in ray tracing than ASM to see code in either inlined assembler or intrinsics.  To be very specific, here are three functions:

inline real Len2() const {return x*x+y*y+z*z;}
inline real Len() const {return sqrt(x*x+y*y+z*z);}
inline Vec3& operator!() {real L = this->Len(); if(L>0) *this/=L; return *this;} // normalize this

What would the normalization operator!()  look like using a reciprocal square root intrinsic?  Is the L=0 case properly handled?

I appreciate your efforts!

how is your C?  I will eventually convert the C code to ASM, but not right away.  Here is my C code for your functions.  I am also including the definitions of the different data types.

typedef struct {
float x,y,z;
// float w; //we will add 'w' later when we start doing vector SIMD
}Vector;

typedef struct {
Vector Center;
float Radius;
// float one_over_radius;
}Sphere;

#define MAX_SPHERES 1

Sphere Spheres[MAX_SPHERES];


typedef struct {
Vector Origin;
Vector Direction;
}Ray;


//forcing it inline makes it a macro.
FINLINE void Set_Vector(Vector *P, float x, float y, float z) {
P->x = x;
P->y = y;
P->z = z;
}

//forcing it inline makes it a macro.
FINLINE void Sub_Vector(Vector *P1, Vector P2) {
P1->x -= P2.x;
P1->y -= P2.y;
P1->z -= P2.z;
}

//forcing it inline makes it a macro.
FINLINE void Add_Vector(Vector *P1, Vector P2) {
P1->x += P2.x;
P1->y += P2.y;
P1->z += P2.z;
}

// float Length() { return (float)sqrt( x * x + y * y + z * z ); }
//forcing it inline makes it a macro.
FINLINE float Length_Vector(Vector P) {
return sqrtf( P.x * P.x + P.y * P.y + P.z * P.z );
}

REPLACE ME WITH A REVERSE SQUARE ROOT INTRINSIC, AND GET RID OF THE DIVIDE

// void Normalize() { float l = 1.0f / Length(); x *= l; y *= l; z *= l; }


//precalc this code.
//  Normalize( Direction - Origin )
//forcing it inline makes it a macro.
FINLINE void Normalize_Vector(Vector *P) {
const float l = 1.0f / Length_Vector(*P); P->x *= l; P->y *= l; P->z *= l;
}

// float SqrLength() { return x * x + y * y + z * z; }
//forcing it inline makes it a macro.
FINLINE float SqrLength_Vector(Vector P) {
return P.x * P.x + P.y * P.y + P.z * P.z;
}

// float Dot( vector3 a_V ) { return x * a_V.x + y * a_V.y + z * a_V.z; }
//forcing it inline makes it a macro.
FINLINE float Dot_Vector(Vector P1, Vector P2) {
return P1.x * P2.x + P1.y * P2.y + P1.z * P2.z;
}
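For the reverse square root part (the REPLACE ME note above), here is a sketch of what it could look like with the SSE rsqrt intrinsic.  The function name is mine, and this is untested sketch code, not something from this thread: _mm_rsqrt_ss is only good to about 12 bits of precision, so add a Newton-Raphson step if you need more.  Note the len2 > 0 test, which also answers dr_eck's L = 0 question by leaving a zero vector untouched instead of producing NaNs.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ss */

typedef struct { float x, y, z; } Vector;  /* same layout as above */

/* Approximate-normalize: _mm_rsqrt_ss computes 1/sqrt(x) directly,
   so the divide disappears along with the sqrt. */
void Normalize_Vector_rsqrt(Vector *P) {
    float len2 = P->x * P->x + P->y * P->y + P->z * P->z;
    if (len2 > 0.0f) {          /* handles the L = 0 case */
        float rlen;
        _mm_store_ss(&rlen, _mm_rsqrt_ss(_mm_set_ss(len2)));
        P->x *= rlen;
        P->y *= rlen;
        P->z *= rlen;
    }
}
```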


  You will notice in the above code that I never return a vector; that would be really slow, since you would be copying 12 bytes back per call.  Instead I modify one of the vectors passed in, which is a lot faster, and you don't have to return a single thing.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

dr_eck

@Mark_Larson: I'm not so sure about returning a vector being slow.  My understanding is that C++ always returns a pointer to a class object rather than the object, so the return would only be 4 bytes.  

Beyond that, if the result of a function is used immediately, particularly if the function is inlined, wouldn't the return value be kept in registers?  This would, of course, depend on the compiler, but I expect compilers to produce efficient code (with the possible exception of Microsoft's).  

My expectation is that the only thing that would make an operation slow is if it involves writing to (or reading from) DRAM.  In the case of writing, doesn't the value first get written to cache and then transferred to DRAM while the processor is off doing other things?  If this is true, then the only cardinal sins are branches and cache misses.  Is this correct?

Mark_Larson

Quote from: dr_eck on October 07, 2008, 03:07:54 PM
@Mark_Larson: I'm not so sure about returning a vector being slow.  My understanding is that C++ always returns a pointer to a class object rather than the object, so the return would only be 4 bytes.  

ah, slight misunderstanding.  I meant that the calling code always has to push the THIS pointer when calling, which actually causes one memory read and one memory write (you have to read the value from memory, save it in a register, and then push it on the stack).  In C you never have to do that, and memory operations are expensive.

EDIT:  so if in C you wanted to modify 3 of the parameters, you would only have to send back 12 bytes.  In C++ you would have to use 16 bytes, for the THIS pointer.  Make more sense?

Quote from: dr_eck on October 07, 2008, 03:07:54 PM
Beyond that, if the result of a function is used immediately, particularly if the function in inlined, wouldn't the return value be kept in registers?  This would, of course, depend on the compiler, but I expect compilers to produce efficient code (with the possible exception of Microsoft's).  

under MSVC, which you use, you need to use __forceinline.  I never trust compilers; I always tell them what to do, so I can guarantee certain results.

hehee, I disagree with the comment about compilers.  But I'm biased  :bg  If you want to send me some code I can compile it under ICC, and then show you how I can speed it up using asm.  It would need to be a part of the code that I could actually compile all by itself and run. 

Quote from: dr_eck on October 07, 2008, 03:07:54 PM
My expectation is that the only thing that would make an operation slow is if it involves writing to (or reading from) DRAM.  In the case of writing, doesn't the value first get written to cache and then transferred to DRAM while the processor is off doing other things?  If this is true, then the only cardinal sins are branches and cache misses.  Is this correct?

no.  There is actually a lot of stuff happening behind the scenes, with a lot of hardware doing work.  Short list:

Quote
Interrupts triggering and causing the processor to leave your code, probably going to and from memory as well as a hardware device
Memory cache misses
TLB - translation lookaside buffer.  The TLB is a cache that improves the performance of translating a virtual memory address to a physical memory address by providing fast access to page table entries

so the TLB is a cache, and TLB misses are just as bad as normal processor cache misses

L1 and L2 cache and memory.  L1 cache is a lot faster than L2 cache, and L2 cache is a lot faster than memory.  So ideally you want to make sure your data stays in the L1 cache.  But what if your tables are bigger than 8k?  AH, easy.  You do cache blocking.  You break your code up to do multiple things on 8k of data at a time.  Doing it this way is really fast for memory accesses: you handle 8k at a time, and do multiple tasks on that 8k.

do_task1_on_same_8k
do_task2_on_same_8k
do_task3_on_same_8k
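In C, the cache blocking loop above could be sketched like this.  The task functions and the process_blocked name are stand-ins of my own; the point is the loop structure, which runs all three tasks on one 8k block while it is still hot in L1 instead of making three full passes over the whole array.

```c
#include <stddef.h>

#define BLOCK_BYTES  8192                       /* the 8k from above */
#define BLOCK_FLOATS (BLOCK_BYTES / sizeof(float))

/* Stand-in tasks; in a raytracer these would be real per-element work. */
static void task1(float *p, size_t n) { size_t i; for (i = 0; i < n; i++) p[i] += 1.0f; }
static void task2(float *p, size_t n) { size_t i; for (i = 0; i < n; i++) p[i] *= 2.0f; }
static void task3(float *p, size_t n) { size_t i; for (i = 0; i < n; i++) p[i] -= 0.5f; }

/* Cache blocking: all three tasks touch one 8k block before the loop
   moves on, so each block makes only one trip in from memory. */
void process_blocked(float *data, size_t count) {
    size_t i;
    for (i = 0; i < count; i += BLOCK_FLOATS) {
        size_t n = (count - i < BLOCK_FLOATS) ? (count - i) : BLOCK_FLOATS;
        task1(data + i, n);   /* do_task1_on_same_8k */
        task2(data + i, n);   /* do_task2_on_same_8k */
        task3(data + i, n);   /* do_task3_on_same_8k */
    }
}
```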


other tasks getting time to run, ewwwwwww, say it isn't so.  They might pollute are cache!, so how would we minimize that?  boost are tasks priority so it gets the lion's share of the processor.   :bg

you can also avoid cache misses by doing a prefetch instruction.  It will get the data into cache before you need to use it.  There is an intrinsic for this under MSVC.
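The MSVC intrinsic for that is _mm_prefetch.  A sketch of how it could be used (the function and the prefetch distance of 64 floats are my own guesses; the distance has to be tuned per machine):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, exposed by MSVC, GCC, and ICC */

/* Software prefetch: while working on element i, ask the cache to
   start loading data some distance ahead, so it is already in cache
   by the time the loop reaches it. */
float sum_with_prefetch(const float *data, size_t count) {
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < count; i++) {
        if (i + 64 < count)
            _mm_prefetch((const char *)&data[i + 64], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```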

you can get rid of CONDITIONAL branches.  There are special instructions that do stuff based on the condition flags.  Look at CMOVE in the Intel manual.  It only does the move if the zero flag is set.  You can't control this in MSVC.  All of the other stuff I listed above can be done using C.  The one exception is getting rid of branches.
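You can still get branchless code out of C, though: a compiler will usually turn a simple ternary into a CMOV on its own, and mask arithmetic removes the branch no matter what the compiler does.  A sketch (my own example, not from this thread):

```c
/* The ternary form is the pattern compilers most often compile to
   CMP + CMOV instead of a conditional jump. */
int max_ternary(int a, int b) {
    return (a > b) ? a : b;
}

/* Guaranteed branchless: (a > b) is 0 or 1, so -(a > b) is all-zeros
   or all-ones, and the AND/OR then selects a or b with no jump. */
int max_masked(int a, int b) {
    int mask = -(a > b);
    return (a & mask) | (b & ~mask);
}
```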

however there is one thing you can do in C that is branch related.  The default branch prediction logic is that backward CONDITIONAL branches (loops in C) are predicted taken and forward CONDITIONAL branches are predicted not taken.  So how does that work out?  Well, in your standard C loop, every single iteration will be correctly predicted by the branch prediction logic except the last one, since that one falls through.



you also need to look at vectorizing that code of tbp's you have.  I'll do that in the other thread.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm