Register
Site Login
Site Search
Forums
Advertisement
Welcome to PocketMatrix. PocketMatrix is dedicated to providing the best online community for mobile device developers and enthusiests. What's new?

Optimizing Division...


Optimizing Division...

Postby drgoldie » Aug 30, 2006 @ 3:57pm

hi all,

I need to divide two numbers by the same value (all fixed-point). It is obvious that doing the inversion first and then applying it twice is faster than two division. So this is my current code (BITS=8)
Code: Select all



I64 rev = ((I64)1<<(BITS+32))/h;
= (I32)((hx*rev)>>32);
= (I32)((hy*rev)>>32);
3 lines; 0 keywds; 4 nums; 34 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East  

Yet, i'd need to speed this up more since this routine is called quite often in my code.

Does anybody know a fast (accuracy is important) inversion routine (1/value)?

thx,
Daniel
drgoldie
pm Member
 
Posts: 330
Joined: Jan 10, 2003 @ 10:46am
Location: Vienna


Postby refractor » Aug 30, 2006 @ 5:32pm

What range does h have? If it's small then a LUT would be suitable. If not, we've covered division/reciprocal routines in ARM code <a href="http://www.pocketmatrix.com/forums/viewtopic.php?t=11940&start=0">here </a>before
User avatar
refractor
pm Insider
 
Posts: 2304
Joined: Feb 5, 2002 @ 1:12pm
Location: Luxembourg


Postby drgoldie » Aug 30, 2006 @ 5:34pm

h is usually something like 300-800, but it is difficult to say for sure...

Daniel
drgoldie
pm Member
 
Posts: 330
Joined: Jan 10, 2003 @ 10:46am
Location: Vienna


Re: Optimizing Division...

Postby hm » Sep 6, 2006 @ 6:55am

drgoldie wrote:hi all,

I need to divide two numbers by the same value (all fixed-point). It is obvious that doing the inversion first and then applying it twice is faster than two division. So this is my current code (BITS=8)
Code: Select all



I64 rev = ((I64)1<<(BITS+32))/h;
= (I32)((hx*rev)>>32);
= (I32)((hy*rev)>>32);
3 lines; 0 keywds; 4 nums; 34 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East  

Yet, i'd need to speed this up more since this routine is called quite often in my code.

Does anybody know a fast (accuracy is important) inversion routine (1/value)?

thx,
Daniel

Daniel,

have you checked the code in the Vincent lib? The inversion routine can easily be adjusted to different fixed point positions...

- HM
User avatar
hm
pm Member
 
Posts: 201
Joined: Dec 28, 2003 @ 8:47pm
Location: Seattle, WA


Postby drgoldie » Sep 6, 2006 @ 7:34am

i was trying code very similar to that in Vincent (more or less identical) and it was a lot slower than the compiler's division routine (and less accurate).

i had to do 4 iterations to get enough accuracy but even with 2 iterations it was much, much slower...

i didn't do any benchmarks but the difference in my application was so big that i just switched back.

Daniel
drgoldie
pm Member
 
Posts: 330
Joined: Jan 10, 2003 @ 10:46am
Location: Vienna


Postby drgoldie » Sep 6, 2006 @ 8:18am

Ok, i couldn't let that post stand as it is, so here's some benchmarks i did:

Performing 3 million divides with this routine (nCount=300, doing a sum to prevent compiler from optimizing the division away):

Code: Select all









10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
int benchDiv(int nCount)
{
    int j=nCount+1,
        sum=0;

    while(--j)
    {
        int i = 10001;
        while(--i)
        {
#ifdef _DO_ORIGINAL_
            sum += (1<<16)/i;
#else
            sum += EGL_Inverse(i);
#endif
        }
    }

    return sum;
}
20 lines; 10 keywds; 5 nums; 37 ops; 0 strs; 0 coms    Syntactic Coloring v0.4 - Dan East  


Results from benching on an HTC Tornado Smartphone (200 MHz ARM CPU, VS2005 with /O2) are:

- compiler divides: 1020ms
- EGL_Inverse with 2 iterations: 1450ms
- EGL_Inverse with 4 iterations: 1930ms

Even the version with 4 iterations does not always give complete acurate results.

Am I doing something wrong or is the compiler's code just better (which i have to admit is what i expect from a good compiler)?

Daniel
drgoldie
pm Member
 
Posts: 330
Joined: Jan 10, 2003 @ 10:46am
Location: Vienna


Postby drgoldie » Sep 6, 2006 @ 12:39pm

BTW: i also benched the gppInvLP_16_32s and gppInvHP_16_32s functions from the Intel GPP (on an XScale device though...) and they are faster then the C code (EGL_Inverse) but still slower then the compiler's divide routine.

For me it seems as if the time for custom divide functions is over, isn't it?

Daniel
drgoldie
pm Member
 
Posts: 330
Joined: Jan 10, 2003 @ 10:46am
Location: Vienna


Postby mm40 » Sep 6, 2006 @ 6:46pm

drgoldie wrote:h is usually something like 300-800, but it is difficult to say for sure...

Daniel



If you have that small of a range then try a 512 entry LUT and use it if your value is in that range. Set the range of your LUT to your most commonly used values.
User avatar
mm40
pm Member
 
Posts: 135
Joined: Feb 21, 2003 @ 9:11pm


Postby drgoldie » Sep 6, 2006 @ 7:41pm

the range isn't small since accuracy is important and those values are fractional rather than integer.

i'll have to check the actual accuracy requirements. maybe it is enough to make a table from 300-800 with 0.1 stepsize (would be 100kb) and do a normal division for all other cases. maybe it is also worth interpolating between LUT values to gain more accuracy.

i'll try that and report when i have results...

Daniel
drgoldie
pm Member
 
Posts: 330
Joined: Jan 10, 2003 @ 10:46am
Location: Vienna


Postby Presto » Sep 6, 2006 @ 10:02pm

drgoldie wrote:the range isn't small since accuracy is important and those values are fractional rather than integer.

You said it's usually 300-800, which is only about 500 different possibilities, which is a small range. You could pre-calculate 500 floats or doubles and use multiplication instead of division, which would still be faster. Then you could do something like:
Code: Select all









10 
//Pre-calculate values
float hlookup[512];
for(int i=0;i<512;i++)
  hlookup[i] = 1.0/(300+i);

//Use the lookup if "h" is in the normal range (300-811)
//Otherwise, fall back to using division
I64 rev = ((< 300) || (> 811)) ? ((I64)1<<(BITS+32))/: ((I64)1<<(BITS+32))*hlookup[(h-300)];
= (I32)((hx*rev)>>32);
= (I32)((hy*rev)>>32);
10 lines; 3 keywds; 15 nums; 81 ops; 0 strs; 3 coms    Syntactic Coloring v0.4 - Dan East  
Using fixed-point, there should be a more processor-friendly way to handle this, but I don't have much experience with fixed-point. I've glanced at it in the Allegro source though, which is rather nifty.

Or did you mean that "h" is already a floating-point variable? If that's the case, ignore everything I just typed.
User avatar
Presto
pm Insider
 
Posts: 763
Joined: Jan 20, 2003 @ 5:51am
Location: Kalesian Archipelago


Postby drgoldie » Sep 7, 2006 @ 10:12am

yes, _h is already fixed-point.
and the data range is actually from 100-2500.

so i created a table from 0 to 2500 with a step size of 1/16 (4 fractional bits source accuracy).

i tweaked the numeric range of the stored data to optimally fit into 16 bits. this results in 80KB memory consumption for the table. if this turns out to be a problem in the future i can reduce the size at the upper end since accuracy is quite reduced there anyways. in the lower part of the table accuracy is very good...

the result is pretty impressive. although a had only a single (fixed-point) division in my code, the performance went up from 40 to 68 image analysis per second (computer vision algorithm) on the HTC tornado (and from 200 to 235 on the X50v at 728MHz). this also means that the part of the code that was the major bottleneck until now is no longer a major one and it makes sense for me now to look more into other code sections too...

so thanks to everybody who contributed to this improvement (which opens oportunities for even more improvements...)!

Daniel
drgoldie
pm Member
 
Posts: 330
Joined: Jan 10, 2003 @ 10:46am
Location: Vienna


Postby hm » Sep 11, 2006 @ 4:06am

drgoldie wrote:For me it seems as if the time for custom divide functions is over, isn't it?

The code in Vincent is not particularly "custom"; it's a 16-entry table lookup plus 2-4 iterations of RN. I understand it is code that traces back to the standard math library on SGIs. Any chance to re-engineer the source for those faster routines and make them available in a similar form? That's would really come in handy for specific versions (like the 8.24 -> 24.8 division I need at some point in my code).

- hm
User avatar
hm
pm Member
 
Posts: 201
Joined: Dec 28, 2003 @ 8:47pm
Location: Seattle, WA


Return to Windows Mobile


Sort


Forum Description

A discussion forum for mobile device developers on the Windows Mobile platform. Any platform specific topics are welcome.

Moderators:

Dan East, sponge, Digby, David Horn, Kevin Gelso, RICoder

Forum permissions

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

cron