fastest sprite routine

Posted:
Jun 12, 2003 @ 10:57am
by StephC
Hello,
I've spent some time writing a fast sprite routine in ARM assembly, with RLE encoding and alpha blending.
Using the 32x32 animated sprite found in the GapiDraw demos and in warmi's, it seems that I get more than 1000 sprites displayed at 30 FPS, with the same results on my iPAQ 3600 and my Zaurus SL-5500.
I also have a sprite routine optimised for XScale (using the PLD instruction to preload cache lines), but no XScale device to test it on for now...
My benchmark consists only of:
- blit a background image on the whole screen (memcpy)
- blit all the sprites
- update their coordinates
- update the screen (a rotated blit on my iPAQ)
I don't know whether this benchmark is acceptable for comparison with other demos.
If it is, what I want to know is: is there a faster sprite routine out there?
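For anyone who wants to compare numbers, the four benchmark steps listed above could be sketched in C roughly as follows. This is only an illustration: the 240x320 screen size, the `Sprite` struct, and the function names are assumptions, `blit_sprite` is a placeholder that fills an opaque square instead of doing the RLE and alpha work, and the final `memcpy` stands in for the rotated blit.

```c
#include <stdint.h>
#include <string.h>

enum { SCREEN_W = 240, SCREEN_H = 320, SPRITE_SIZE = 32 }; /* assumed sizes */

typedef struct { int x, y, dx, dy; } Sprite; /* hypothetical sprite state */

/* Placeholder blit: fills an opaque 32x32 square at the sprite position.
   The real routine would decode RLE data and blend instead. */
static void blit_sprite(uint16_t *fb, const Sprite *s, uint16_t color) {
    for (int row = 0; row < SPRITE_SIZE; row++)
        for (int col = 0; col < SPRITE_SIZE; col++)
            fb[(s->y + row) * SCREEN_W + (s->x + col)] = color;
}

/* Move one sprite and bounce it off the screen edges (small speeds assumed). */
static void update_sprite(Sprite *s) {
    s->x += s->dx;
    s->y += s->dy;
    if (s->x < 0 || s->x > SCREEN_W - SPRITE_SIZE) { s->dx = -s->dx; s->x += 2 * s->dx; }
    if (s->y < 0 || s->y > SCREEN_H - SPRITE_SIZE) { s->dy = -s->dy; s->y += 2 * s->dy; }
}

/* One benchmark frame: background copy, sprite blits, coordinate update, present. */
static void render_frame(uint16_t *screen, uint16_t *back, const uint16_t *background,
                         Sprite *sprites, int count) {
    const size_t bytes = (size_t)SCREEN_W * SCREEN_H * sizeof(uint16_t);
    memcpy(back, background, bytes);                 /* 1. blit the background */
    for (int i = 0; i < count; i++)                  /* 2. blit all the sprites */
        blit_sprite(back, &sprites[i], 0xFFFF);
    for (int i = 0; i < count; i++)                  /* 3. update their coordinates */
        update_sprite(&sprites[i]);
    memcpy(screen, back, bytes);                     /* 4. update the screen */
}
```

Timing a fixed number of `render_frame` calls and dividing gives the frame rate; results are only comparable between demos if every step above is included in the timed loop.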
--
StephC / int13 production

Posted:
Jun 12, 2003 @ 3:37pm
by warmi
Ha, 1000 of them?
I think I got up to 700 color-keyed sprites (the ZSurface demo is capped at 500) on my Z, so getting an additional 300 with RLE would make perfect sense.
But if you were able to get 1000+ alpha-blended sprites at 30 FPS then, hell, I don't know what kind of secrets you found about the ARM CPU, but that would be a huge number (even considering RLE: after all, you are paying a penalty for the blending process itself, and the 32x32 sprites used in the GapiDraw demos don't have that many "empty" pixels anyway).
BTW, what kind of alpha blending is that: 50/50 or arbitrary?

Posted:
Jun 12, 2003 @ 7:37pm
by StephC
I'm using arbitrary alpha (0-32), and I think the speed is mainly due to my RLE encoding: source data is 32-bit aligned whenever possible, and I can blit as many as 16 pixels at a time if there are no empty pixels.
As for the benchmark, I remember that you got different results on the iPAQ and the Zaurus; I think that's due to a flaw in your benchmark on the Zaurus...

Posted:
Jun 12, 2003 @ 8:18pm
by StephC
The worst case for RLE encoding is a grid of alternating opaque and transparent pixels, which is highly improbable.
Most sprites have empty pixels at their borders, and that is the case where RLE encoding is fast.
My RLE encoding translates the image data into a stream of 32-bit aligned segments of the form:
<pixel|skip|runcode><data>
where skip is the number of transparent pixels to skip, runcode is the number of pixels to blit,
pixel is the first pixel in the case of an odd segment,
and data holds the pixels themselves.
The runcode field also contains special markers for end of line, end of sprite and alpha level.
I even think that this approach is as fast as compiled sprites on this architecture.
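To make the segment format above more concrete, here is a minimal C sketch of one possible layout. It is not the actual encoding: the bit positions (pixel in the top 16 bits, skip in bits 8-15, runcode in bits 0-7), the `RUN_END_OF_LINE` value, the magenta colour key, and the function names are all assumptions, and the end-of-sprite and alpha markers are left out. It handles a single row and copies pixels opaquely.

```c
#include <stddef.h>
#include <stdint.h>

#define TRANSPARENT     0xF81Fu /* hypothetical colour key (magenta) */
#define RUN_END_OF_LINE 0xFFu   /* hypothetical end-of-line marker */
#define MAX_COUNT       254u    /* largest skip or run kept clear of the marker */

static uint32_t make_header(uint16_t pixel, unsigned skip, unsigned run) {
    return ((uint32_t)pixel << 16) | ((uint32_t)skip << 8) | run;
}

/* Encode one row of RGB565 pixels into 32-bit words.
   out must hold at least width + 1 words. Returns the words written. */
static size_t rle_encode_row(const uint16_t *src, size_t width, uint32_t *out) {
    size_t n = 0, x = 0;
    while (x < width) {
        unsigned skip = 0, run = 0;
        while (x < width && src[x] == TRANSPARENT && skip < MAX_COUNT) { x++; skip++; }
        const uint16_t *p = src + x;            /* first opaque pixel of the run */
        while (x < width && src[x] != TRANSPARENT && run < MAX_COUNT) { x++; run++; }
        if (run == 0 && x == width) break;      /* trailing transparency: store nothing */
        uint16_t first = 0;
        if (run & 1) first = *p++;              /* odd run: first pixel rides in the header */
        out[n++] = make_header(first, skip, run);
        for (unsigned i = 0; i < run / 2; i++, p += 2)
            out[n++] = (uint32_t)p[0] | ((uint32_t)p[1] << 16); /* two pixels per word */
    }
    out[n++] = make_header(0, 0, RUN_END_OF_LINE);
    return n;
}

/* Draw one encoded row onto dst (opaque copy, no blending).
   Returns the number of words consumed, including the end-of-line word. */
static size_t rle_draw_row(const uint32_t *in, uint16_t *dst) {
    size_t n = 0;
    for (;;) {
        uint32_t h = in[n++];
        unsigned run = h & 0xFF;
        if (run == RUN_END_OF_LINE) return n;
        dst += (h >> 8) & 0xFF;                 /* jump over transparent pixels */
        if (run & 1) *dst++ = (uint16_t)(h >> 16);
        for (unsigned i = 0; i < run / 2; i++) {
            uint32_t pair = in[n++];
            *dst++ = (uint16_t)pair;
            *dst++ = (uint16_t)(pair >> 16);
        }
    }
}
```

Carrying the odd pixel in the header keeps every data word a full pair of pixels, so the reader never needs a halfword load from the source stream. The destination can still be misaligned, which this sketch ignores by writing one pixel at a time.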

Posted:
Jun 12, 2003 @ 8:24pm
by StephC
I have 3 multiplies per pixel for alpha blending; maybe this code is not optimal...
[code]
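;// r6 = pixel 1 (p1 in the comments below; the result equals p1 when alpha = 0)
;// r4 = pixel 2 (p2 in the comments below; the result equals p2 when alpha = 32)
;// r5 = alpha (0-32)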
and r7, r6, #0xF800 ;// p1M = p1 & REDMASK
and r8, r4, #0xF800 ;// p2M = p2 & REDMASK
sub r8, r8, r7 ;// p2M - p1M
mul r9, r8, r5 ;// * alpha
and r11, r6, #0x07E0 ;// p1M = p1 & GREENMASK - precalc to avoid pipeline stall
and r8, r4, #0x07E0 ;// p2M = p2 & GREENMASK - precalc to avoid pipeline stall
add r7, r7, r9, lsr #5 ;// >>5 + p1M
and r10, r7, #0xF800 ;// & REDMASK r10 = temp result
sub r8, r8, r11 ;// p2M - p1M
mul r9, r8, r5 ;// * alpha
and r8, r4, #0x001F ;// p2M = p2 & BLUEMASK - precalc to avoid pipeline stall
and r7, r6, #0x001F ;// p1M = p1 & BLUEMASK - precalc to avoid pipeline stall
add r11, r11, r9, lsr #5 ;// >>5 + p1M
and r11, r11, #0x07E0 ;// & GREENMASK
orr r10, r10, r11 ;// | temp result
sub r8, r8, r7 ;// p2M - p1M
mul r9, r8, r5 ;// * alpha
add r7, r7, r9, lsr #5 ;// >>5 + p1M
and r7, r7, #0x001F ;// & BLUEMASK
orr r10, r10, r7 ;// | temp result
;// r10 = blended pixel
[/code]
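For checking the assembly against known values, the same per-channel arithmetic can be written in C. This is a reference sketch rather than a replacement: the names `blend_channel` and `blend565` are made up here, and it assumes RGB565 pixels with alpha in the range 0-32, as in the listing.

```c
#include <stdint.h>

/* Blend one RGB565 channel in place: c1 + (((c2 - c1) * alpha) >> 5), kept inside mask. */
static uint32_t blend_channel(uint32_t p1, uint32_t p2, uint32_t alpha, uint32_t mask) {
    uint32_t c1 = p1 & mask;
    uint32_t c2 = p2 & mask;
    /* Unsigned wrap-around mirrors the 32-bit registers and the logical shift:
       a negative difference only disturbs the high bits, which the mask discards. */
    return (c1 + (((c2 - c1) * alpha) >> 5)) & mask;
}

/* Blend two RGB565 pixels: alpha = 0 returns p1, alpha = 32 returns p2. */
static uint16_t blend565(uint16_t p1, uint16_t p2, uint32_t alpha) {
    return (uint16_t)(blend_channel(p1, p2, alpha, 0xF800)    /* red */
                    | blend_channel(p1, p2, alpha, 0x07E0)    /* green */
                    | blend_channel(p1, p2, alpha, 0x001F));  /* blue */
}
```

For example, `blend565(0x0000, 0xFFFF, 16)` and `blend565(0xFFFF, 0x0000, 16)` both give `0x7BEF`, which is a quick way to confirm that the logical shift still works when the channel difference is negative.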