
Posted: Oct 30, 2006 @ 7:27pm by Digby
You might want to investigate reading a DWORD instead of a WORD each time through the loop. On ARM you pay the same cost for a 16-bit op as you do for a 32-bit one, so a 32-bit read gets you two 16-bit values for the price of one.
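To make the DWORD idea concrete, here is a minimal sketch. It assumes 16-bit RGB565 pixels and uses a made-up per-pixel operation (halving the brightness) as a stand-in, since the real math is in the code you posted. DimPixel, DimRow16 and DimRow32 are names I invented for illustration. The loads go through memcpy so the code does not depend on the row being 32-bit aligned; a decent compiler should turn that into a plain load where it can, but check the assembly.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical per-pixel operation (an assumption, not the original code):
// halve the brightness of one RGB565 pixel. The mask clears the bit that
// each channel would otherwise leak into its lower neighbour.
inline std::uint16_t DimPixel(std::uint16_t p) {
    return static_cast<std::uint16_t>((p >> 1) & 0x7BEF);
}

// Baseline: one 16-bit read and one 16-bit write per pixel.
void DimRow16(std::uint16_t* row, std::size_t width) {
    for (std::size_t x = 0; x < width; ++x) {
        row[x] = DimPixel(row[x]);
    }
}

// Same result, but two pixels per 32-bit read and write.
void DimRow32(std::uint16_t* row, std::size_t width) {
    std::size_t x = 0;
    for (; x + 2 <= width; x += 2) {
        std::uint32_t two;
        std::memcpy(&two, row + x, sizeof two);   // load two pixels at once
        two = (two >> 1) & 0x7BEF7BEFu;           // same shift/mask on both halves
        std::memcpy(row + x, &two, sizeof two);   // store two pixels at once
    }
    if (x < width) {
        row[x] = DimPixel(row[x]);                // odd pixel left at the end
    }
}
```

The trick only works this neatly when the per-pixel math can be applied to both halves of the DWORD without carries crossing between them, so whether it fits your routine depends on what those shifts and adds actually do.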
I'd also look at the generated assembly and make sure that the compiler has inlined GetWidth and GetHeight and isn't doing something dumb like reloading red/green/blue on every iteration, which it shouldn't need to do since they are typed as const. Or you could just read the width once into x and use while(x--) instead of the for loop.
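Something like this is what I mean by reading the width once. It is only a sketch: the Surface struct is hypothetical (I only know your class has GetWidth and GetHeight), it assumes the rows are packed back to back with no padding, and the per-pixel math is again a stand-in.

```cpp
#include <cstdint>

// Hypothetical surface type; only GetWidth/GetHeight come from the thread.
struct Surface {
    std::uint16_t* pixels;  // assumed: rows stored contiguously, no padding
    int width;
    int height;
    int GetWidth() const { return width; }
    int GetHeight() const { return height; }
};

void DimSurface(Surface& s) {
    std::uint16_t* p = s.pixels;
    const int width = s.GetWidth();   // read once, outside both loops
    int y = s.GetHeight();            // assumed non-negative
    while (y--) {
        int x = width;
        while (x--) {
            // Stand-in per-pixel operation (an assumption): halve RGB565 brightness.
            *p = static_cast<std::uint16_t>((*p >> 1) & 0x7BEF);
            ++p;
        }
    }
}
```

With the bounds in locals, the compiler has no reason to call back into the accessors inside the loop even if it failed to inline them, and the count-down loops test against zero.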
When I develop routines like this that don't do much with the per-pixel data, I try to quickly determine how fast the routine would run with a "best case" implementation. I do this by commenting out sections of the code, on the reasoning that even if I did a really, really great job of optimizing, it would never get any faster than not executing the code at all. If commenting out the code doesn't make an appreciable difference in the overall performance of the *app*, then I spend my time elsewhere. You have to be careful doing this, as the compiler will sometimes turn your entire function into a NOP, so you need to look at the generated assembly language to confirm that the routine is still doing the portion of the work that you expect.
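A rough sketch of that experiment, with everything here being my own placeholder: FullPass stands in for the real routine, CopyOnlyPass is the "math commented out" version, and TimeMs is a throwaway timer. The volatile pointer is one way to stop the compiler from deleting the copy-only loop, but it can also block other optimizations, so it does not replace looking at the assembly.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the real routine (the per-pixel math is an assumption).
void FullPass(std::uint16_t* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = static_cast<std::uint16_t>((p[i] >> 1) & 0x7BEF);
    }
}

// "Best case": the math is gone, only the read and write remain.
// volatile keeps the compiler from removing the loop entirely.
void CopyOnlyPass(std::uint16_t* p, std::size_t n) {
    volatile std::uint16_t* vp = p;
    for (std::size_t i = 0; i < n; ++i) {
        vp[i] = vp[i];
    }
}

// Throwaway timer: runs a pass `reps` times and returns elapsed milliseconds.
template <typename Pass>
double TimeMs(Pass pass, std::uint16_t* p, std::size_t n, int reps) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        pass(p, n);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

If TimeMs(FullPass, ...) and TimeMs(CopyOnlyPass, ...) come out close on the device, the shifts and adds are not where the time is going. Measure the whole app as well, since that is the number that matters.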
With the code that you've posted below, you'll probably find that reading and writing the pixel data takes more time than executing those few bit shifts and additions.