 Alpha Blending

About LUTs: if you have an array like int a[16][256] and you want item [10][1], the compiler does not compute *(a + 1 + 10 * 256), but rather *(a + 1 + (10 << 8)) (viewing a as a flat int pointer). It's just a matter of picking your array sizes smartly: make the row length a power of two. If you are uncertain about these compiler optimizations, simply always build 1D arrays and do the shift yourself.

Final note, about LUTs and the cache: I found that on the PC an integer multiply is almost always faster than a lookup in a huge table (where 'huge' is 32K or above). On the PocketPC, I have no idea how this relates.

Greets,
- Jacco.
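To make the "build a 1D array and do the shift yourself" advice concrete, here is a minimal sketch. The table contents, its 32 x 256 size, and the names lut, lut_init and lut_get are assumptions made up for illustration; nothing like them appears in the thread.

```c
#include <stdint.h>

/* Hypothetical table: 32 rows of 256 entries, stored as one flat array.
   The row length is a power of two, so the row offset is a plain shift. */
#define LUT_ROW_SHIFT 8                      /* 256 entries per row */
#define LUT_ROWS      32

static uint8_t lut[LUT_ROWS << LUT_ROW_SHIFT];

/* Fill entry (row, col) with (col * row) / 32, i.e. a scaled intensity. */
static void lut_init(void)
{
    for (unsigned row = 0; row < LUT_ROWS; ++row)
        for (unsigned col = 0; col < 256; ++col)
            lut[(row << LUT_ROW_SHIFT) + col] = (uint8_t)((col * row) >> 5);
}

/* Same element a 2D table would return for [row][col],
   with the index arithmetic written out by hand. */
static inline uint8_t lut_get(unsigned row, unsigned col)
{
    return lut[(row << LUT_ROW_SHIFT) + col];
}
```

At 8 KB this table is well below the 32K mark mentioned above; whether a lookup beats the multiply on a given device is something to measure rather than assume.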

I don't know why, and I left it because I was too busy with other things. It does give a good idea of the complexity involved in packing / unpacking though.

    // Generate gaps
    LONG p = color & (REDMASK|BLUEMASK);
    p = (p << 10)|((color & GREENMASK) >> 6);
    // Multiply
    p *= factor;
    // Pack to 565
    LONG result = (p >> 15) & (REDMASK|BLUEMASK);
    result |= (p & (63 << 4)) << 2;

So if anyone can see the problem here... I would like to hear it.

By the way, the gapped version contains the components in the order GRB, since that was faster, and it doesn't matter for the multiply.

- Jacco.
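One possible reading of the problem, offered as a sketch rather than a definitive answer: assuming the usual RGB565 masks and a factor in 0..31 (the post shows neither), the unpack leaves only the top 5 bits of green in bits 0..4. After the multiply, the scaled green therefore sits in bits 5..9, while the pack step picks up bits 4..9 and shifts them to 6..11, which doubles green and spills into the lowest red bit. The function name scale565_gapped and the mask values below are my assumptions.

```c
#include <stdint.h>

/* Assumed RGB565 masks; the post does not show its own definitions. */
#define REDMASK   0xF800u
#define GREENMASK 0x07E0u
#define BLUEMASK  0x001Fu

/* Scale a 565 pixel by factor/32 (factor assumed to be 0..31).
   Layout after unpacking, as in the post: red at bits 21..25,
   blue at 10..14, green (top 5 bits only) at 0..4. */
static uint16_t scale565_gapped(uint16_t color, uint32_t factor)
{
    /* Generate gaps */
    uint32_t p = (uint32_t)(color & (REDMASK | BLUEMASK)) << 10;
    p |= (uint32_t)(color & GREENMASK) >> 6;

    /* Multiply: each 5-bit field grows to at most 10 bits */
    p *= factor;

    /* Pack to 565: red and blue as in the post */
    uint32_t result = (p >> 15) & (REDMASK | BLUEMASK);
    /* Green: take bits 5..9 of the product and move them to bits 6..10 */
    result |= (p & (31u << 5)) << 1;
    return (uint16_t)result;
}
```

With this layout green keeps only 5 bits of precision, so the lowest green bit of the result is always zero.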
That's one of his schemes. I just modified his C code so that it would handle the unpack/multiply/pack properly.

Proper alpha blending requires an addition as well, and that should be done after the components are unpacked (post-multiply).

If you look at my post where I described the alpha blending equation with the shortcut:

    dC = sC + ((dC * A) >> 8)

his code can get you everything on the right-hand side of the '+' in that equation (the alpha is restricted to 5 bits though, so the shift changes to >> 5). The wRGB parameter is the color of the destination pixel before the blend, and nFactor is the inverse alpha.

The more I've thought about this, the more I think you should try to eliminate as much of this pack/unpack overhead as possible. Depending on your content, you can get big wins by RLE encoding runs of transparent or opaque pixels. At a minimum, you should investigate adding a test in the inner loop: if the pixel is transparent, do nothing; if it is opaque, just copy the source pixel to the destination; if it is translucent, do the blend. I'm really reluctant to add code to my inner loops, but in this case I believe it's warranted.
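The inner-loop test described above could look roughly like this. It is a sketch under several assumptions that are not from the thread: a separate 8-bit alpha value per pixel, source pixels already premultiplied by their alpha (which is what the shortcut equation implies), and the made-up names blend565 and blend_span. It works per component for clarity instead of using the packed multiply.

```c
#include <stddef.h>
#include <stdint.h>

/* dC = sC + ((dC * A) >> 8) per component, with A the inverse alpha.
   Assumes s is premultiplied, so the sums cannot overflow their fields. */
static uint16_t blend565(uint16_t s, uint16_t d, unsigned inv_alpha)
{
    unsigned r = (s >> 11)         + (((d >> 11)         * inv_alpha) >> 8);
    unsigned g = ((s >> 5) & 0x3F) + ((((d >> 5) & 0x3F) * inv_alpha) >> 8);
    unsigned b = (s & 0x1F)        + (((d & 0x1F)        * inv_alpha) >> 8);
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Blend n source pixels onto dst, skipping the blend where possible. */
static void blend_span(uint16_t *dst, const uint16_t *src,
                       const uint8_t *alpha, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint8_t a = alpha[i];
        if (a == 0)                        /* transparent: leave dst alone */
            continue;
        if (a == 255) {                    /* opaque: plain copy */
            dst[i] = src[i];
            continue;
        }
        dst[i] = blend565(src[i], dst[i], 255u - a);   /* translucent */
    }
}
```

Besides saving work, the early-out for transparent pixels avoids the slight darkening that (dC * 255) >> 8 would otherwise apply to untouched destination pixels.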
Thanks a lot for that. Added it to the Nutcracker right away; the new code is more than twice as fast as my old per-component approach.

Here's the final code:

    inline unsigned short scalecolor( unsigned short c, int mul )
    {
      // Spread the fields: red to bits 22..26, blue to 11..15, green to 0..5
      unsigned int p = ((((unsigned int)c&0xF81F)<<11)|((c&0x7E0)>>5))*(unsigned int)mul;
      // Keep the top bits of each product and repack to 565
      return (unsigned short)(((p>>16)&0xF81F)|(p&0x7E0));
    }

This code scales one color to a fraction (mul/32) of the original color. 'mul' must be a value from 0..31. I have removed all references to #defines that I use, so this should compile without problems for everyone.

Bonus code: fast additive blending with saturation. This code can be used to add two pixels and have the result saturate at white instead of wrapping around. Very useful for lighting effects.

    #define SHFTMASK ((15<<11)+(31<<5)+15)
    inline unsigned short addblend( unsigned short c1, unsigned short c2 )
    {
      // Halve both colors, dropping each component's lowest bit, then add
      unsigned short c = (unsigned short)(((c1>>1)&SHFTMASK)+((c2>>1)&SHFTMASK));
      if (c&0x8410) // carry out of red, green or blue
      {
        if (c&0x8000) c=(c&0x7FF)+0x7800; // saturate red
        if (c&0x400) c=(c&0xF81F)+0x3E0;  // saturate green
        if (c&0x10) c=(c&0xFFE0)+0xF;     // saturate blue
      }
      return (unsigned short)(c<<1);
    }

Cool thing about this code is that it has only one 'if' to determine whether there has been an overflow in any of the components. If so, it fixes the components one by one. Bad thing is that you lose one bit of precision per component.

- Jacco.
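A quick way to convince yourself that the single-multiply scale does the same thing as scaling each component separately is to compare the two over every input. The sketch below assumes the packed layout red << 11, green >> 5 before the multiply; scalecolor_ref and scalecolor_matches_ref are hypothetical helper names, not part of the posted code.

```c
#include <stdint.h>

/* Packed scale: all three components go through one multiply.
   mul is assumed to be in 0..31. */
static uint16_t scalecolor(uint16_t c, unsigned mul)
{
    uint32_t p = (((uint32_t)(c & 0xF81F) << 11) |
                  ((uint32_t)(c & 0x07E0) >> 5)) * mul;
    return (uint16_t)(((p >> 16) & 0xF81F) | (p & 0x07E0));
}

/* Reference: scale each component separately by mul/32. */
static uint16_t scalecolor_ref(uint16_t c, unsigned mul)
{
    unsigned r = ((c >> 11) * mul) >> 5;
    unsigned g = (((c >> 5) & 0x3F) * mul) >> 5;
    unsigned b = ((c & 0x1F) * mul) >> 5;
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Returns 1 if both versions agree for every pixel value and every mul. */
static int scalecolor_matches_ref(void)
{
    for (uint32_t c = 0; c <= 0xFFFF; ++c)
        for (unsigned mul = 0; mul < 32; ++mul)
            if (scalecolor((uint16_t)c, mul) != scalecolor_ref((uint16_t)c, mul))
                return 0;
    return 1;
}
```

The two agree because the three products never grow past their gaps: green needs at most 11 bits, red and blue at most 10 each, so nothing carries from one field into the next.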