News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Bounding Box Intersection

Started by dr_eck, September 18, 2008, 02:57:32 PM

Previous topic - Next topic

dr_eck

I've seen a couple of options for bounding box intersection code.  One is very long and relies on "early outs" for speed.  Here's an example:
real aabb::Intersect(const Ray &r) const
{
real t1, t2, tmp;
real tfar = ray_inf;
real tnear = -ray_inf;
// check X slab
if (r.d.x == 0) {
if (r.o.x > p2.x || r.o.x < p1.x)
return ray_inf; // ray is parallel to the planes & outside slab
} else {
tmp = 1.0 / r.d.x;
t1 = (p1.x - r.o.x) * tmp;
t2 = (p2.x - r.o.x) * tmp;
if (t1 > t2)
Swap(t1, t2);
if (t1 > tnear) tnear = t1;
if (t2 < tfar) tfar = t2;
if (tnear > tfar || tfar < 0.0)
return ray_inf; // ray missed box or box is behind ray
}
// check Y slab
if (r.d.y == 0) {
if (r.o.y > p2.y || r.o.y < p1.y)
return ray_inf; // ray is parallel to the planes & outside slab
} else {
tmp = 1.0 / r.d.y;
t1 = (p1.y - r.o.y) * tmp;
t2 = (p2.y - r.o.y) * tmp;
if (t1 > t2)
Swap(t1, t2);
if (t1 > tnear) tnear = t1;
if (t2 < tfar) tfar = t2;
if (tnear > tfar || tfar < 0)
return ray_inf; // ray missed box or box is behind ray
}
// check Z slab
if (r.d.z == 0) {
if (r.o.z > p2.z || r.o.z < p1.z)
return ray_inf; // ray is parallel to the planes & outside slab
} else {
tmp = 1.0 / r.d.z;
t1 = (p1.z - r.o.z) * tmp;
t2 = (p2.z - r.o.z) * tmp;
if (t1 > t2)
Swap(t1, t2);
if (t1 > tnear) tnear = t1;
if (t2 < tfar) tfar = t2;
if (tnear > tfar || tfar < 0)
return ray_inf; // ray missed box or box is behind ray
}
if (tnear > 0)
return tnear;
else
return tfar;
}


Another approach is to forget about the "early outs" and avoid all of the slow conditionals, allowing the optimizer to do its best.  Here's code from Thierry Berger-Perrin:
bool aabb::Intersects(const Ray &r) const
{ // allow IEEE754 to handle case where ray is parallel to slab
   const Pnt3 l1((p1 - r.o) / r.d);
   const Pnt3 l2((p2 - r.o) / r.d);
   real tmax = __min(pmax(l1, l2).hmin(), r.tmax);
   real tmin = __max(pmin(l1, l2).hmax(), r.tmin);
   return (tmax >= tmin) && (tmax >= 0.f);
}

Note that it does not return the intersection distance, but the distance is rarely used anyway.  Note that pmin() returns the closer of the two points and hmin (horizontal minimum) returns the smallest of the point's coordinates.  r.tmax and r.tmin are the far and near endpoints of the ray.

The problem I have is that bounding box intersection is chewing up 2/3 of my cycles and the two functions are about the same speed.  Any suggestions as to how to speed things up?

Mark_Larson

Quote
I've seen a couple of options for bounding box intersection code.  One is very long and relies on "early outs" for speed.  Here's an example:

Another approach is to forget about the "early outs" and avoid all of the slow conditionals, allowing the optimizer to do its best.  Here's code from Thierry Berger-Perrin:

bool aabb::Intersects(const Ray &r) const
{ // allow IEEE754 to handle case where ray is parallel to slab
   const Pnt3 l1((p1 - r.o) / r.d);
   const Pnt3 l2((p2 - r.o) / r.d);
   real tmax = __min(pmax(l1, l2).hmin(), r.tmax);
   real tmin = __max(pmin(l1, l2).hmax(), r.tmin);
   return (tmax >= tmin) && (tmax >= 0.f);
}

Note that it does not return the intersection distance, but the distance is rarely used anyway.  Note that pmin() returns the closer of the two points and hmin (horizontal minimum) returns the smallest of the point's coordinates.  r.tmax and r.tmin are the far and near endpoints of the ray.

The problem I have is that bounding box intersection is chewing up 2/3 of my cycles and the two functions are about the same speed.  Any suggestions as to how to speed things up?



Welcome to the board Steve  :bg

have you timed either code?  If so which is fastest?  tbp's computes the slabs in parallel.  You do yours one at a time.  So I am guessing his is faster.  So let's look at getting rid of the DIV, and see what happens

can you post your Ray class, so people on the board who are new to raytracing will understand what r.d.x is?  thanks:)

We cast a ray for each x,y value on the screen.  If you precompute all the 1.0f/r.d.x for the entire x,y screen, that will give you a big speed up.

the same trick works with sqrt.

does that make sense?
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

more detailed explanation:

p1 = point 1
p2 = point 2

The 'Ray' has 2 values in it, an 'origin' and a 'direction'.  The origin is usually the camera, and is usually fixed from scene to scene.  The direction changes for every pixel.  The x,y value of the current screen coordinate makes up the 'direction' and a fixed 'z' value that doesn't change from scene to scene.  So here is a example
definition of Ray.

typedef Pnt3

typedef struct sRay {
  Pnt3 Origin;
  Pnt3 Direction;
}Ray;





bool aabb::Intersects(const Ray &r) const
{ // allow IEEE754 to handle case where ray is parallel to slab

The following line gets done in parallel using SSE2

l1 = xmm register
p1 = Pnt3 = x,y,z values, 3 real4 values
r.o = Pnt3 = x,y,z values, 3 real4 values

   const Pnt3 l1((p1 - r.o) / r.d);
   const Pnt3 l2((p2 - r.o) / r.d);
   real tmax = __min(pmax(l1, l2).hmin(), r.tmax);
   real tmin = __max(pmin(l1, l2).hmax(), r.tmin);
   return (tmax >= tmin) && (tmax >= 0.f);
}


I'll convert this to ASM for the rest of the board tomorrow
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson


I am going to put off converting it to ASM.  It is a bit advanced for newbies.  So I am going to instead write a tutorial, which I am working on right now.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

dr_eck

Mark,
  I have timed my code both ways and for some scenes, tbp's method is faster while for others, the sequential slab test is.  I suspect that the reason lies in whether the "early outs" are effective in a particular scene.  The other salient fact is that I'm using MS VC9, which is reported by some to generated inferior code.  It would not surprise me if gcc and/or Intel's compiler generate faster code for tbp's method.

Here's my ray class:
class Ray
{
public:
Pnt3 o; // origin of the ray
Vec3 d; // direction of the ray
real tmin, tmax;// min & max distance along the ray to consider
Spectrum spec; // the spectral content of the ray
Spectrum trans; // the cumulative transmission along the ray
void* mat; // a pointer to the material this ray is in (0 for air)

public:
Ray(void) : tmin(ray_epsilon), tmax(ray_inf), trans(1.,1.,1.), mat(0) {}
Ray(const Pnt3& oo, const Vec3& dd, const real tn, const real tx,
const Spectrum& s, const Spectrum& t=Spectrum(1.0,1.0,1.0))
: o(oo), d(dd), tmin(tn), tmax(tx), spec(s), trans(t), mat(0) {}
~Ray(void) {}
inline Pnt3 operator()(const real t) const {return o+t*d;}
};


And, no, I can't predivide 1.0/ray.d.x for the entire scene because it changes every time the ray hits something.  I also do some photon mapping, in which rays are randomly generated from each source in the scene and then traced.

I should also have stated the definition of pmin() more precisely.  Given two points in 3D, it returns one point containing the minimum value for each of the 3 coordinates.

Thanks for your suggestions.

Mark_Larson

#5
Quote from: dr_eck on September 22, 2008, 09:47:19 PM
Mark,
  I have timed my code both ways and for some scenes, tbp's method is faster while for others, the sequential slab test is.  I suspect that the reason lies in whether the "early outs" are effective in a particular scene.  The other salient fact is that I'm using MS VC9, which is reported by some to generated inferior code.  It would not surprise me if gcc and/or Intel's compiler generate faster code for tbp's method.

ICC has a few problems with optimizing for the core 2 duo, but I'd recommend you trying it, since MSVC won't have any processor specific optimizations.

I don't use MSVC to compile my code.  I just use it for the IDE.  I use a batch file, that way I have better control of what flags are being used to compile the code.

I strongly recommend that the only optimization flags that you use for MSVC are these


/GL /fp:fast /Ox

I can speed up your conditional code.  I thought tbp's would be a LOT faster, that is interesting.

I am in the middle of writing a tutorial on a simple raytracer for the rest of the board so they can start catching up.  I also asked our math guru, Raymond, to lend a hand.

EDIT: msvc is good at optimizing for older processors, just not core 2 duo.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson


hey big guy, since you are using msvc, it doesn't do a good job of using SSE unless you use it explicitly.  You should be able speed up your code by explicitly using scalar SSE intrinsics.  Second look at doing a Taylor series for the division and see if you can do less precision and see if you get a speed up.  I can't speed up tbp's code any in C++, but I can definitely speed yours up in C++.

I'll look at speeding it up after I finish the tutorial.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

#7
 I just remembered, there is a reciprocal instruction.  RCPSS is the scalar version.  It also runs in 3 processor cycles on core 2 duo just like reciprocal square root.  It computes 1/x, which is what you are doing.  Try this and tell me what kind of speedup you are getting?  Here is the intrinsic. 

scalar
RCPSS __m128 _mm_rcp_ss(__m128 a)

Vector
RCCPS __m128 _mm_rcp_ps(__m128 a)


TBP's code
bool aabb::Intersects(const Ray &r) const
{ // allow IEEE754 to handle case where ray is parallel to slab
CHANGED ME TO reciprocal 1/x and multiply
   const Pnt3 l1((p1 - r.o) / r.d);
CHANGED ME TO reciprocal 1/x and multiply
   const Pnt3 l2((p2 - r.o) / r.d);
   real tmax = __min(pmax(l1, l2).hmin(), r.tmax);
   real tmin = __max(pmin(l1, l2).hmax(), r.tmin);
   return (tmax >= tmin) && (tmax >= 0.f);
}
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm