I'm unsure about your all-in-one approach, as it implies a fixed opacity.
My code for blending two pixels is based on 32 levels of mix and 565 RGB, which is very common.
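unsigned int mask=0x07e0f81f; // This is my favorite number: green in the high half, red and blue in the low half
unsigned int c1=*frontbuffer++; // you need 32 bits; it's a conversion anyway (buffers should be unsigned short*)
unsigned int c2=*backbuffer++; // the conversion is implicit, so no cast is needed
unsigned int opacity=13; // whatever 0-32
c1|=c1<<16; // copy the pixel into the high half
c1 =c1&mask; // keep green high, red and blue low, each with spare bits above it
c2|=c2<<16;
c2 =c2&mask;
c1 =c1*opacity;
c1+=c2*(32-opacity); // each field now holds 32 times the mixed value
c1=(c1>>5)&mask; // divide by 32 and drop the fraction bits
c1|=c1>>16; // fold green back into the low half
*anotherbuffer++=c1; // Don't know if they are both created each frame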
// Can't just keep adding to one or it'll white out
Do the above 320*240 times in a loop, once per pixel.
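Wrapped up as a loop, it might look something like this. It's only a sketch, assuming the buffers hold unsigned 16-bit 565 pixels; the names blend565 and blend_frame are made up for illustration and don't come from any library.

```c
#include <stdint.h>
#include <stddef.h>

#define MASK565 0x07e0f81fu /* green in the high half, red and blue in the low half */

/* Blend one 565 pixel: opacity 0 gives all back, 32 gives all front.
   The caller is assumed to keep opacity within 0..32. */
static uint16_t blend565(uint16_t front, uint16_t back, uint32_t opacity)
{
    uint32_t c1 = front;
    uint32_t c2 = back;
    c1 = (c1 | (c1 << 16)) & MASK565;         /* spread the three fields apart */
    c2 = (c2 | (c2 << 16)) & MASK565;
    c1 = c1 * opacity + c2 * (32u - opacity); /* all three channels in one go */
    c1 = (c1 >> 5) & MASK565;                 /* divide by 32, drop fraction bits */
    return (uint16_t)(c1 | (c1 >> 16));       /* fold green back down */
}

/* Blend a whole frame into a third buffer; count would be 320*240 here. */
static void blend_frame(const uint16_t *front, const uint16_t *back,
                        uint16_t *out, size_t count, uint32_t opacity)
{
    for (size_t i = 0; i < count; i++)
        out[i] = blend565(front[i], back[i], opacity);
}
```

Writing to a third buffer keeps the front and back buffers untouched, so the same pair can be blended again next frame.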
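I'm sure there are signed/unsigned hiccups: the pixels have to be read as unsigned 16-bit values, and the maths is safest in unsigned 32 bits because green reaches bit 31 during the multiply. But it's the rough version.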
The first optimisation would be to try to write in words, so do pairs of the above and store them to sequential memory. And with so much effort going in, do some tricks with the opacity level: make it thinner at the edges for a bevelled-glass look, so it earns that many CPU cycles. Don't allow the opacity to go below 0 or above 32.
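A rough idea of how the edge trick and the range check could go. This is just a sketch under my own assumptions: clamp_opacity and bevel_opacity are hypothetical helpers, and the linear ramp over the outer few pixels is only one way to thin the edges.

```c
#include <stdint.h>

/* Clamp any int into the 0..32 range the blend expects. */
static uint32_t clamp_opacity(int opacity)
{
    if (opacity < 0)  return 0;
    if (opacity > 32) return 32;
    return (uint32_t)opacity;
}

/* Hypothetical bevel: full 'base' opacity in the middle of a w*h pane,
   ramping down linearly over the outer 'bevel' pixels. */
static uint32_t bevel_opacity(int x, int y, int w, int h, int base, int bevel)
{
    int d = x;                         /* distance to the nearest edge */
    if (y < d)         d = y;
    if (w - 1 - x < d) d = w - 1 - x;
    if (h - 1 - y < d) d = h - 1 - y;
    if (bevel > 0 && d < bevel)
        base = base * (d + 1) / (bevel + 1); /* thinner towards the edge */
    return clamp_opacity(base);
}
```

The result would be used as the per-pixel opacity in the blend instead of a fixed value. Since it depends only on position, it could be computed once into a table rather than every frame.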