
assembly bitmap scaler

PostPosted: Aug 12, 2005 @ 6:29pm
by mm40

PostPosted: Aug 12, 2005 @ 7:10pm
by refractor

PostPosted: Aug 12, 2005 @ 8:02pm
by mm40

PostPosted: Aug 18, 2005 @ 5:49am
by mm40
Sorry for the delay. I have created a testbed and a scaler function written in C. The testbed uses PocketFrog, which you can download here:

http://pocketfrog.droneship.com/download.html

Use EVC 3.0:
1) download and unzip PocketFrog
2) open the PocketFrog workspace
3) replace the example blit.cpp file with the one attached
4) compile the PocketFrog lib
5) compile Blit and run it

You could convert the whole blitStretch function and its sub-functions, but I think that might be a waste of time, as it is only doing some simple clipping and probably won't yield much of an increase in performance.

The workhorse function is blitStretchClippedFixed, and it is the one that should be tuned in ASM if possible.

I have some benchmark results in the file. Let me know if you need me to explain anything else or need some code changes or comments. I think the blitStretchClippedFixed function is about as optimized as it can be in C.
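
For readers without the attachment, here is a minimal sketch of the kind of inner loop a fixed-point nearest-neighbour scaler usually has. It is not the attached blitStretchClippedFixed; the name stretch_row_fixed, the 16-bit pixel format and the 16.16 fixed-point layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not the function from the attached blit.cpp):
   stretch one row of 16-bit pixels using 16.16 fixed-point stepping. */
static void stretch_row_fixed(const uint16_t *src, int src_w,
                              uint16_t *dst, int dst_w)
{
    /* src_w < 32768 keeps (src_w << 16) inside 32 bits. */
    assert(src_w > 0 && dst_w > 0 && src_w < 32768);

    /* Source distance covered by one destination pixel, in 16.16. */
    uint32_t step = ((uint32_t)src_w << 16) / (uint32_t)dst_w;
    uint32_t pos = 0;

    for (int x = 0; x < dst_w; ++x) {
        dst[x] = src[pos >> 16];   /* integer part selects the source pixel */
        pos += step;               /* one add per pixel, no divide or float */
    }
}
```

The loop does one load, one store and one add per pixel, which is why a loop of this shape tends to be limited by memory traffic rather than by arithmetic.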

PostPosted: Aug 18, 2005 @ 6:42am
by Andy

PostPosted: Aug 18, 2005 @ 1:35pm
by Dan East

PostPosted: Aug 18, 2005 @ 3:48pm
by mm40

PostPosted: Aug 18, 2005 @ 8:28pm
by refractor
It's looking fairly bus-bound so I'm not expecting a huge gain in performance from an ARM version, but I'll do it tomorrow anyway. :mrgreen:

You should definitely sort out the blitStretch function though -- ideally you should do all of that in fixed point (the float divisions will cost a lot).
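
As a rough illustration of what doing the setup in fixed point could look like, the sketch below computes the per-pixel step and the clipped starting position for one axis using only integer operations. The AxisSetup struct, the setup_axis_fixed name and the 16.16 format are assumptions, not code from the thread's blitStretch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-axis setup result, both fields in 16.16 fixed point. */
typedef struct {
    uint32_t step;   /* source distance per destination pixel */
    uint32_t start;  /* source position of the first visible pixel */
} AxisSetup;

/* clipped = number of destination pixels cut off before the visible area. */
static AxisSetup setup_axis_fixed(int src_len, int dst_len, int clipped)
{
    assert(src_len > 0 && dst_len > 0 && src_len < 32768);
    assert(clipped >= 0 && clipped < dst_len);

    AxisSetup a;
    /* One integer divide per axis replaces the float division. */
    a.step = ((uint32_t)src_len << 16) / (uint32_t)dst_len;
    /* Skip the clipped pixels with a multiply; cannot overflow because
       step * clipped < step * dst_len <= src_len << 16. */
    a.start = a.step * (uint32_t)clipped;
    return a;
}
```

With this shape the only division left is one integer divide per axis per blit, and the clipped offset is a single multiply.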

PostPosted: Aug 19, 2005 @ 1:20pm
by refractor
Right.

Attached is a blit.exe with blitStretchClippedFixed coded in ARM (the original function, not Andy's). I'll post the source for the ARM function in a few hours after I've tidied it up a bit.

I haven't had time to properly speed-test it yet (I just wrote it and made sure it ran). As I said before, it looks pretty bus-bound and I wouldn't be surprised if there was little to no gain (other than giving me something interesting to do during my lunch hour).

I could probably kill another cycle or two out of the setup, but the loops are looking pretty tight (tighter than the compiler's version at least). I could also modify the ARM based on Andy's suggestion and see if that helped at all.

PostPosted: Aug 19, 2005 @ 4:24pm
by mm40
I get the same results as with Andy's function, but this is on a very fast device (x50v). On a device with a slow CPU I think this may make a big difference; I will test that specifically in the future.

PostPosted: Aug 19, 2005 @ 6:16pm
by refractor

PostPosted: Aug 21, 2005 @ 2:52am
by mm40

PostPosted: Aug 23, 2005 @ 10:30pm
by joshbu [MSFT]

PostPosted: Aug 25, 2005 @ 12:55am
by mm40
Hi joshbu, thanks for your suggestions.

I tried a 32-bit implementation and it provided no speed improvement at all. On top of that, you have the mess of making sure everything is aligned and cleaning up if it isn't, so it didn't seem to be worth it.
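
To show the alignment bookkeeping being described, here is a minimal sketch of a row stretch that stores two 16-bit pixels per 32-bit write. It is not the implementation that was benchmarked; stretch_row_fixed32, the 16.16 step parameter and the little-endian packing order are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit variant: step is the 16.16 source increment per
   destination pixel. Assumes a little-endian target, so the first pixel
   goes in the low half of the packed word. */
static void stretch_row_fixed32(const uint16_t *src, uint16_t *dst,
                                int dst_w, uint32_t step)
{
    uint32_t pos = 0;
    int x = 0;
    assert(dst_w >= 0);

    /* Head: a single 16-bit store if dst is not 4-byte aligned. */
    if (dst_w > 0 && ((uintptr_t)dst & 3u) != 0) {
        dst[x++] = src[pos >> 16];
        pos += step;
    }
    /* Body: two pixels per 32-bit store. */
    for (; x + 1 < dst_w; x += 2) {
        uint32_t p0 = src[pos >> 16]; pos += step;
        uint32_t p1 = src[pos >> 16]; pos += step;
        uint32_t packed = p0 | (p1 << 16);
        /* memcpy keeps the store free of aliasing problems; compilers
           normally turn it into one word store. */
        memcpy(&dst[x], &packed, sizeof packed);
    }
    /* Tail: one leftover pixel when the remaining count is odd. */
    if (x < dst_w)
        dst[x] = src[pos >> 16];
}
```

The source reads are still one 16-bit load per pixel, so only the store side is halved, which may be part of why the gain can be small.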

To ensure scan lines are cache aligned, wouldn't that be specific to each type of processor or device? Or do they all have the same cache sizes? I'm not sure this would matter, since my data is all in a tightly packed array, so it should already be missing the cache as little as possible.
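
For reference, one common way to cache-align scan lines is to round the row pitch up to a multiple of the cache line size, with the line size passed in so nothing is hard-coded per device. The sketch below assumes that approach; aligned_pitch and its parameters are hypothetical names, and the line size must be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes per scan line, rounded up to line_size.
   line_size is the cache line size in bytes and must be a power of two. */
static size_t aligned_pitch(size_t width_px, size_t bytes_per_px,
                            size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    size_t row_bytes = width_px * bytes_per_px;
    /* Round up to the next multiple of line_size. */
    return (row_bytes + line_size - 1) & ~(line_size - 1);
}
```

A tightly packed array and an aligned one give the same pitch whenever the row size is already a multiple of the line size (for example 240 pixels at 2 bytes with 32-byte lines), so padding only changes anything for other widths.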

Adding special cases for certain scale factors would speed it up a lot, but it would hit those specific sizes so rarely that it's not worth doing (of course this depends a lot on your app).
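
A minimal sketch of what such special-casing might look like for a single row: exact 1:1 and exact 2x take dedicated paths, and everything else falls through to generic 16.16 fixed-point stepping. The stretch_row name and the choice of which ratios to special-case are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dispatcher for one row of 16-bit pixels. */
static void stretch_row(const uint16_t *src, int src_w,
                        uint16_t *dst, int dst_w)
{
    assert(src_w > 0 && dst_w > 0 && src_w < 32768);

    if (dst_w == src_w) {                 /* 1:1, plain copy */
        memcpy(dst, src, (size_t)src_w * sizeof *src);
        return;
    }
    if (dst_w == 2 * src_w) {             /* exact 2x, pixel doubling */
        for (int i = 0; i < src_w; ++i) {
            dst[2 * i]     = src[i];
            dst[2 * i + 1] = src[i];
        }
        return;
    }
    /* General case: 16.16 fixed-point nearest-neighbour stepping. */
    uint32_t step = ((uint32_t)src_w << 16) / (uint32_t)dst_w;
    uint32_t pos = 0;
    for (int x = 0; x < dst_w; ++x) {
        dst[x] = src[pos >> 16];
        pos += step;
    }
}
```

Whether the extra branches pay off depends on how often the app actually hits those ratios, which is the trade-off described above.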

PostPosted: Aug 25, 2005 @ 6:42am
by refractor