The game uses DirectDraw with the HI_RES_AWARE flag set, so it runs at 640x480.
After further research, it seems the video memory is indeed doing some kind of caching! A test that sets every pixel in video memory to a fixed value (so that no read cache is involved) runs twice as fast when the writes are sequential, as in a memset.
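For anyone who wants to reproduce that kind of write test, here is a minimal sketch in C. It is not the code used for the measurement above: the function names, the 16-bit pixel format and the pitch given in pixels are all assumptions. Both functions fill the same rectangle; only the order of the writes differs, so timing one against the other on a locked surface should show whether sequential writes are favoured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: 16-bit pixels and a pitch measured in pixels
   (not bytes) are assumptions made for this sketch. */

/* Row-major fill: every write goes to the address right after the
   previous one, like memset. */
void fill_sequential(uint16_t *buf, size_t pitch, size_t w, size_t h,
                     uint16_t value)
{
    assert(buf != NULL);
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x)
            buf[y * pitch + x] = value;
}

/* Column-major fill: consecutive writes are a whole row apart, so
   they rarely land in the same cache line. */
void fill_strided(uint16_t *buf, size_t pitch, size_t w, size_t h,
                  uint16_t value)
{
    assert(buf != NULL);
    for (size_t x = 0; x < w; ++x)
        for (size_t y = 0; y < h; ++y)
            buf[y * pitch + x] = value;
}
```

Both functions leave the buffer in the same final state, so any difference in run time comes from the access pattern alone.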
So the solution seems to be tile blitting: rotate-copy blocks of 16x16 pixels, one at a time, so that read/write cache misses are minimized. And it works very well: the frame rate jumped from 11.7 to about 20 FPS, which is much more acceptable.
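The tile blitting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the game's actual code: it assumes 16-bit pixels, a 90-degree clockwise rotation, and pitches given in pixels rather than bytes. The name rotate90_tiled and the TILE constant are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE 16  /* tile edge in pixels, matching the 16x16 blocks above */

/* Rotate a w x h image 90 degrees clockwise into dst, which must hold
   an h x w image. Assumptions: 16-bit pixels, pitches in pixels. */
void rotate90_tiled(const uint16_t *src, size_t src_pitch,
                    uint16_t *dst, size_t dst_pitch,
                    size_t w, size_t h)
{
    assert(src != NULL && dst != NULL);
    for (size_t ty = 0; ty < h; ty += TILE) {
        size_t y_end = (ty + TILE < h) ? ty + TILE : h;
        for (size_t tx = 0; tx < w; tx += TILE) {
            size_t x_end = (tx + TILE < w) ? tx + TILE : w;
            /* Within one tile, reads and writes both stay inside a
               small window of at most 16 rows. */
            for (size_t y = ty; y < y_end; ++y) {
                const uint16_t *s = src + y * src_pitch;
                for (size_t x = tx; x < x_end; ++x) {
                    /* Source (x, y) lands at column h-1-y, row x. */
                    dst[x * dst_pitch + (h - 1 - y)] = s[x];
                }
            }
        }
    }
}
```

A plain row-by-row rotation writes each destination pixel a full row away from the previous one. Working tile by tile keeps both the source and the destination accesses within a few nearby rows, which is presumably why the cache misses drop.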
It is weird that video memory is being cached: maybe it's not the real video memory and DirectDraw is doing some magic mapping(?) there...
The only further improvement I can think of is reading 2 pixels (32 bits) at a time; maybe that's a little faster...
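A rough sketch of what reading two pixels per 32-bit load could look like, untested for speed. It assumes 16-bit pixels and a little-endian CPU, so the first pixel sits in the low half of the word. The function name and the dst_stride parameter are hypothetical. It copies one source row into a destination column, as a rotate-copy would, and uses memcpy for the load so alignment is not an issue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n pixels from a source row into a destination column, loading
   two 16-bit pixels per 32-bit read. dst_stride is the distance in
   pixels between consecutive destination writes (hypothetical name).
   Assumes a little-endian CPU. */
void copy_row_to_column_pairs(const uint16_t *src_row, size_t n,
                              uint16_t *dst, size_t dst_stride)
{
    assert(src_row != NULL && dst != NULL);
    size_t x = 0;
    for (; x + 1 < n; x += 2) {
        uint32_t pair;
        /* Compilers normally turn this memcpy into a single load. */
        memcpy(&pair, src_row + x, sizeof pair);
        dst[x * dst_stride] = (uint16_t)(pair & 0xFFFFu);
        dst[(x + 1) * dst_stride] = (uint16_t)(pair >> 16);
    }
    if (x < n) {
        /* Odd pixel count: copy the last pixel on its own. */
        dst[x * dst_stride] = src_row[x];
    }
}
```

The writes still go out one pixel at a time because neighbouring source pixels end up in different destination rows, so any gain would come from halving the number of reads only.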
Thanks and Best Regards,
Jorge Diogo
http://Ludimate.com