gapidraw doesn't take a performance hit because all of its source images are pre-rotated, so blitting them to the screen buffer is a straight copy rather than jumping around in memory. this method has one serious downside: you have to code all of your routines 4 different ways, once per orientation.
a much better way to do it, IMO, is to do it as you are and only rotate on the final blit. BUT a huge performance increase can be gained by doing a few things:
1) read sequentially from the source image and write non-sequentially to the display. this is much faster because reads are cached but writes are not; using this method you can obtain near pre-rotated speeds.
2) use forward caching by preloading lines into the cache. there are several other posts about this, although i never obtained any noticeable results with it.
3) use memcpy when possible; it will be optimized on some devices.
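to make points 1 and 3 concrete, here is a rough sketch in C. it assumes 16-bit pixels, pitches measured in pixels rather than bytes, and a 90 degree clockwise rotation; the function names blit_rotate90_cw and blit_copy are made up for this example and are not from gapidraw or any other library. treat it as a starting point and measure on your own device.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rotate 90 degrees clockwise while blitting.
 * src is src_w x src_h pixels; dst must hold src_h x src_w pixels.
 * Pitches are in pixels (an assumption of this sketch).
 * The source is read sequentially, row by row; the destination is
 * written one column at a time, stepping by dst_pitch per pixel. */
void blit_rotate90_cw(const uint16_t *src, int src_w, int src_h, int src_pitch,
                      uint16_t *dst, int dst_pitch)
{
    for (int y = 0; y < src_h; ++y) {
        const uint16_t *s = src + (size_t)y * src_pitch;
        /* source row y lands in destination column (src_h - 1 - y) */
        uint16_t *d = dst + (src_h - 1 - y);
        for (int x = 0; x < src_w; ++x) {
            *d = *s++;       /* sequential read, strided write */
            d += dst_pitch;
        }
    }
}

/* Unrotated case: each row is contiguous, so memcpy can do the work. */
void blit_copy(const uint16_t *src, int w, int h, int src_pitch,
               uint16_t *dst, int dst_pitch)
{
    for (int y = 0; y < h; ++y) {
        memcpy(dst + (size_t)y * dst_pitch,
               src + (size_t)y * src_pitch,
               (size_t)w * sizeof(uint16_t));
    }
}
```

the rotated version keeps the inner loop walking forward through the source and takes the strided access on the write side, which is the ordering suggested in point 1. the other three orientations only change the starting destination offset and the sign or size of the step.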
on a side note, when something is running at 190fps on a mobile phone there is little need to make it faster before your game is done. i highly doubt it will be a bottleneck compared to the other things you will be doing in your game.