
I have now compared the C source against the generated ASM in multiple cases (using both EVC and VSNET) and have come to the conclusion that the compiler is pretty smart.
Both of my suggestions for improvement above were totally irrelevant. The compiler will always output optimized code in these cases.
Also, I actually managed to get templates to work in my own implementation (despite all the different options in every blit function). Passing a struct containing operator() as a parameter was a really smart move on your part. If I hadn't seen your struct-based solution, I would never have finished my own this fast. Right now I have three operators: SourceColor (source bitmap/color fill), PixelVerifications (check mask or always copy) and PixelOperations (copy/alpha/alphafast). I also made a bunch of others, such as coordinate transforms (fixed point etc.). Basically, I was able to cut all the code back to a single loop per operation, which expands at compile time to up to 12 different combinations. Judging from the ASM output, the compiler generates essentially the same code for the expanded templates as it does for loops written out by hand.
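To give a rough idea of the pattern, here is a minimal sketch of one templated loop driven by three functor structs. This is not my actual code: the struct names, the 16-bit RGB565 pixel format, the color-key check and the 50% blend used for the "fast" alpha are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical source policies: where each source pixel comes from.
struct SourceBitmap {
    const std::uint16_t* pixels;
    std::uint16_t operator()(std::size_t i) const { return pixels[i]; }
};
struct SourceColorFill {
    std::uint16_t color;
    std::uint16_t operator()(std::size_t) const { return color; }
};

// Hypothetical verification policies: should this pixel be written at all?
struct VerifyAlways {
    bool operator()(std::uint16_t) const { return true; }
};
struct VerifyColorKey {
    std::uint16_t key;  // assumed: pixels equal to the key are skipped
    bool operator()(std::uint16_t src) const { return src != key; }
};

// Hypothetical pixel operations: how source and destination are combined.
struct OpCopy {
    std::uint16_t operator()(std::uint16_t src, std::uint16_t) const { return src; }
};
struct OpAlphaFast {
    // Assumed 50% blend for RGB565: clear the lowest bit of each channel,
    // halve both pixels, then add them.
    std::uint16_t operator()(std::uint16_t src, std::uint16_t dst) const {
        return static_cast<std::uint16_t>(((src & 0xF7DE) >> 1) + ((dst & 0xF7DE) >> 1));
    }
};

// The single loop. Each combination of policies is a separate instantiation,
// so the compiler can inline the operator() calls into the loop body.
template <typename Source, typename Verify, typename Op>
void BlitRow(std::uint16_t* dst, std::size_t count, Source source, Verify verify, Op op) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint16_t s = source(i);
        if (verify(s)) {
            dst[i] = op(s, dst[i]);
        }
    }
}
```

A call such as `BlitRow(dst, width, SourceBitmap{src}, VerifyColorKey{0xF81F}, OpCopy{})` would then pick one combination at compile time, so there is no per-pixel branching on the blit options.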
I'll post my template findings in a new post when I have cleaned up the design a bit. I will also look into double pixel processing (something which is currently not done and should improve performance considerably in all operations).
By the way, check out this screenshot. With the new optimized "templated" code I managed to squeeze over 140 alpha-blended sprites into my demo (170 sprites without opacity).

<img src="http://www.gapidraw.com/demo_alpha.jpg">
/Johan