Sorry for the delay. I have created a testbed and a C scaler function. The testbed uses PocketFrog, which you can download here:
http://pocketfrog.droneship.com/download.html
Use EVC 3.0 (eMbedded Visual C++ 3.0) to build it:
1) Download and unzip PocketFrog
2) Open the PocketFrog workspace
3) Replace the example blit.cpp file with the one attached
4) Compile the PocketFrog lib
5) Compile Blit and run it
You could convert the whole blitStretch function and its sub-function, but I think that might be a waste of time, as it only does some simple clipping and probably won't yield much of an increase in performance.
The workhorse function is blitStretchClippedFixed, and that is the one which should be tuned in ASM if possible.
I have some benchmark results in the file. Let me know if you need me to explain anything else or need some code changes or comments. I think the blitStretchClippedFixed function is about as optimized as it can be in C.
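For anyone reading without the attachment, here is a rough sketch of the kind of inner loop a fixed-point stretch blit usually has. This is not the code from blit.cpp: the name stretch_blit_fixed, its parameters, the 16-bit pixel format, and the 16.16 fixed-point layout are all assumptions made for illustration, and it does no clipping at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nearest-neighbour stretch blit using 16.16 fixed point.
 * Not the blitStretchClippedFixed from the attached file; names and
 * parameters are made up for this sketch.
 * Pitches are given in pixels, not bytes. Assumes src_w and src_h are
 * below 65536 so the << 16 does not overflow. No clipping is done. */
void stretch_blit_fixed(uint16_t *dst, int dst_w, int dst_h, int dst_pitch,
                        const uint16_t *src, int src_w, int src_h, int src_pitch)
{
    if (dst_w <= 0 || dst_h <= 0 || src_w <= 0 || src_h <= 0)
        return;

    /* Source step per destination pixel, as 16.16 fixed point. */
    uint32_t step_x = ((uint32_t)src_w << 16) / (uint32_t)dst_w;
    uint32_t step_y = ((uint32_t)src_h << 16) / (uint32_t)dst_h;

    uint32_t sy = 0; /* source y position, 16.16 */
    for (int y = 0; y < dst_h; ++y) {
        const uint16_t *src_row = src + (size_t)(sy >> 16) * (size_t)src_pitch;
        uint16_t *dst_row = dst + (size_t)y * (size_t)dst_pitch;

        uint32_t sx = 0; /* source x position, 16.16 */
        for (int x = 0; x < dst_w; ++x) {
            /* The integer part of sx selects the source pixel. */
            dst_row[x] = src_row[sx >> 16];
            sx += step_x;
        }
        sy += step_y;
    }
}
```

The inner x loop is just a load, a store, a shift and an add per pixel, which is why that part is the natural candidate for hand-tuned ASM, while the clipping around it only runs once per call.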