Rotated Blits

PostPosted: Jan 12, 2002 @ 9:39pm
by Digby
<< Note: Even though we started this discussion in Phantom's forum under the "Easy CE 2.0" topic, I decided to start a new topic on this issue because it makes it easier for people to find during a search. It's also more of a general programming topic rather than something specific to Jacco's work, or Easy CE 2.0.>>

I've been playing around with blitting code lately in an attempt to determine the penalty for performing a rotated blit per frame. I've got some interesting numbers to share with you guys.

I'm able to do a 240x320 -> 320x240 rotated blit in 5.57 ms. That's close to 180 fps. This is 20 or so lines of C code that I could probably rewrite in ARM assembly to get closer to 200 fps.

Rewriting the routine so that it writes to sequential addresses to avoid tearing results in a severe perf penalty since the cache is getting thrashed. This dropped the throughput to approximately 22 ms per frame (45 fps).

The secret here is that the display memory is not cached. Therefore you don't pay any penalty for writing to non-contiguous addresses to display RAM. The back buffer however IS cached since it's in system memory. You certainly want to read sequentially from your back buffer to take advantage of the cache.
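For anyone who wants to try this, here is a minimal sketch of the access pattern described above. It is not Digby's actual routine: the function name, the 16-bit pixel format, the clockwise direction, and a pitch measured in pixels are all assumptions made for illustration. The point is only that the source is read strictly sequentially (cache friendly), while the destination is written one column at a time, jumping a full row between writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rotate a src_w x src_h image of 16-bit pixels
 * 90 degrees clockwise into a destination that is src_h pixels wide and
 * src_w pixels tall. dst_pitch is the destination row stride in PIXELS
 * (an assumption; real display surfaces usually report it in bytes). */
void rotated_blit_16(const uint16_t *src, int src_w, int src_h,
                     uint16_t *dst, int dst_pitch)
{
    assert(src != NULL && dst != NULL);
    assert(dst_pitch >= src_h);

    for (int y = 0; y < src_h; ++y) {
        /* Source row y lands in destination column (src_h - 1 - y). */
        uint16_t *d = dst + (src_h - 1 - y);

        for (int x = 0; x < src_w; ++x) {
            *d = *src++;     /* sequential read from the back buffer   */
            d += dst_pitch;  /* strided write, one row down per pixel  */
        }
    }
}
```

For the 240x320 -> 320x240 case this would be called as rotated_blit_16(back, 240, 320, display, 320), assuming the display rows are packed with no padding.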

One other interesting thing: I tried these routines writing both to display memory as well as a system memory surface. If you're reading and writing to contiguous addresses, it's really fast. I was seeing 1.77 ms to memcpy from a 320x240 back buffer to another 320x240 offscreen buffer. That same operation when the target was display memory was 4.5 ms. That's the write cache in action I suppose. However, if I try the rotated blit code mentioned above to blit to a system memory surface instead of display memory, the elapsed time jumps to 17.57 ms! This is because the puny cache on the StrongARM is getting thrashed as the routine jumps across cache lines in order to write the data.

Just for the record I also tested GDI's BitBlt from a 240x320 DIB section selected into an off-screen memory DC blitted to the screen DC. It was about 36 ms (27 fps).

Interesting article

PostPosted: Jan 12, 2002 @ 10:43pm
by Johan

PostPosted: Jan 12, 2002 @ 11:46pm
by Digby

PostPosted: Jan 13, 2002 @ 1:18am
by MirekCz

Hello + Blit performances

PostPosted: Jan 13, 2002 @ 5:48am
by Kzinti
Hello everyone, I just found this wonderful site and I am impressed to see that there are many knowledgeable programmers around here. You might know me as being the author of the "GAPI emulator".

These days, I'm working on my PocketPC as I figured out that while I may never write a good game for the PC (lack of time / resources), it might be possible to do so on my PocketPC. The discussions you have here about "rotating" blits are really interesting. I've come to similar conclusions about the cache, but I have a few other points of information I wish to share with you guys.

As Digby already said, the video RAM is not cached by the memory controller or CPU, and the reason for this is simple: the video RAM is not normal RAM, it is internal to the display chip. This memory is mapped in the address space of applications so that you can access it (memory mapped I/O).

I am not sure that writing to uncached memory is slower than writing to cached memory. The thing is that when you write to cached memory, you are invalidating cache lines that are in use for reading, thereby killing the reading performance, not the writing. Nevertheless, the result is the same: you get faster blits writing directly to the display than to a memory buffer.

Another thing that I haven't seen addressed yet is the use of DMA transfers by certain devices. You all know that the iPAQ devices allow direct access to the display memory (no caching), so in this case the fastest possible way to "flip" is to blit directly to the display. On some other devices like my Casio E-125, the GAPI buffer is transferred to the display using some kind of DMA. What this means is that the actual blitting is done in the background while the processor is free to do other things. That's also the reason that the Casio E-125 gets decent performance with a slower processor and with an intermediate buffer. (Note that most games use an intermediate buffer on iPAQ anyway, while this is not needed on the E-125.)

Direct access is also possible on the Casio E-125, but from the tests I have conducted, the frame rate drops. There are at least two reasons for this: 1) the DMA transfer is not used, 2) the MIPS is a 32/64-bit processor internally, but the external bus is only 16 bits wide.

There is also another benefit to DMA transfers: they are most likely synchronized with the "refresh" of the display so as to reduce tearing effects.

I've conducted a test using a "graphic-bandwidth" limited program:

Direct access --> 24 fps
DMA --> 28 fps

Understand that the test is really poorly written and doesn't do anything else but GXBeginDraw()-Blits-GXEndDraw(). In an actual game, you would do many things after the GXEndDraw() and before the next GXBeginDraw(). Even with all that, you can see a considerable gain in performance.

Another important pointer is that if you want to get the most out of the DMA transfer, you have to minimize the time spent between GXBeginDraw() and GXEndDraw(). As soon as GXEndDraw() is called, the DMA transfer starts, so this is where your game logic has to take place.

Another thing you can see is that updating only part of the screen (1/2 or 1/4) doesn't affect the performance much. I think the reason is that, as with most DMA controllers, doing the actual setup takes more time than the transfer.

PostPosted: Jan 13, 2002 @ 5:52am
by Kzinti
Oops, forgot to include the performance when updating part of the screen, here goes:

Direct access --> 24 fps
DMA --> 28 fps
DMA + Update 1/2 screen --> 29 fps
DMA + Update 1/4 screen --> 30 fps

PostPosted: Jan 15, 2002 @ 3:56am
by simonjacobs
Digby:

Hi. Inspired by your post I have also managed to get a rotated blit working in 5.57 ms!

I wrote mine in assembler, if you get a big speed increase from redoing yours in assembler please let us know.

PostPosted: Jan 15, 2002 @ 4:44am
by Digby
I managed to get the C code to run at 5.3ms by unrolling the inner loop a bit. I looked at the code the compiler was generating and nothing stood out as looking overly heinous so I didn't bother jumping into an assembly language coding fury and instead enjoyed my weekend. <g>
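As an illustration of what "unrolling the inner loop a bit" can look like, here is a hedged sketch. It is not Digby's code: the function name, the unroll factor of 4, the 16-bit pixels, the clockwise direction, and the pitch in pixels are assumptions. It requires the source width to be a multiple of 4, which holds for 240.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 90-degree clockwise rotated blit with the inner
 * loop unrolled four times. dst_pitch is the destination row stride in
 * PIXELS (an assumption made for this sketch). */
void rotated_blit_16_unrolled(const uint16_t *src, int src_w, int src_h,
                              uint16_t *dst, int dst_pitch)
{
    assert(src != NULL && dst != NULL);
    assert(src_w % 4 == 0);      /* no remainder loop in this sketch */
    assert(dst_pitch >= src_h);

    for (int y = 0; y < src_h; ++y) {
        /* Source row y lands in destination column (src_h - 1 - y). */
        uint16_t *d = dst + (src_h - 1 - y);

        for (int x = 0; x < src_w; x += 4) {
            /* Four sequential reads, four strided writes per iteration. */
            d[0]             = src[0];
            d[dst_pitch]     = src[1];
            d[2 * dst_pitch] = src[2];
            d[3 * dst_pitch] = src[3];
            src += 4;
            d += 4 * dst_pitch;
        }
    }
}
```

Whether this helps on a given device depends on what the compiler already does with the simple loop, so it is worth timing both versions rather than assuming.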

Since the 1:1 memcpy with an aligned back buffer is around 4.5ms, it's going to be hard to beat that if you're having to rotate the pixels during the blt. That being said I'm pretty much done with this stuff. I wasn't going to do the rotation in the "flip" anyway. My driver does the swizzle during lock/unlock so that everything is aligned to the native orientation. I like to deal with hard numbers in making decisions though rather than just going with "oh, it's a LOT faster". I hope my perf numbers of the various blits have been useful for some of you as well.

I did look at the tearing issue with doing a rotation and it's pretty hideous. Of course it all depends on what you're changing per frame. If you don't change much, it might not be too bad. You end up with a diagonal tear instead of the less intrusive vertical tear. It would really be nice to be able to know when the LCD controller is beginning its scan. That would probably go a long way towards minimizing this tearing if you could sync up with the controller's display memory access cycle.

Hard numbers

PostPosted: Jan 15, 2002 @ 5:02am
by Kzinti
Digby: I understand that you'd rather work with hard numbers, but when you deal with DMA, it is not as easy as using a performance counter around the GXBeginDraw() / GXEndDraw() pair.

If one were to do that, one would measure the time it takes to initiate the DMA transfer, not the time taken to do the blit. You could of course measure the maximum theoretical frame rate by continuously doing GXBeginDraw()->Blit()->GXEndDraw(), but this is almost meaningless when using DMA since the real cost is the time it takes to initiate the DMA transfer. The actual blit will be done in the background while the CPU continues executing your game code.

PostPosted: Jan 15, 2002 @ 5:07am
by simonjacobs
Plan9:

That is not true on the iPaq at least. We are trying to measure the time to write to the screen memory, which is done with the processor, not DMA.

PostPosted: Jan 15, 2002 @ 5:22am
by Digby
My comment about hard numbers is that unless you've measured what you are trying to improve, or compare with, you're just making blind assumptions.

The numbers I quoted only refer to the iPaq 3670. I have no idea of what it takes to rotate or blit on any other device. What I'm trying to do is encourage people to post hard numbers (as you've done) instead of subjective data such as "it won't be any faster" or "it will be a lot faster." Make sense?

I agree with you that on devices with hardware assisted blits, the actual blit time doesn't matter if it's happening asynchronously. However in the case of the iPaq the numbers that I've posted are quite meaningful since everything is performed by the host CPU.
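For anyone wanting to post comparable numbers for a CPU-driven blit, a minimal timing sketch follows. Everything in it is an assumption for illustration: the names are made up, and standard C clock() is used as a stand-in for whatever high-resolution timer the device offers (clock() reports CPU time, and its resolution may be coarse). Averaging over many frames keeps the timer resolution from dominating the result.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one call performs one frame's worth of work. */
typedef void (*frame_fn)(void *ctx);

/* Run fn 'frames' times and return the average cost per call in
 * milliseconds, as measured by clock(). */
double average_ms_per_frame(frame_fn fn, void *ctx, int frames)
{
    assert(fn != NULL && frames > 0);

    clock_t start = clock();
    for (int i = 0; i < frames; ++i)
        fn(ctx);
    clock_t end = clock();

    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC / frames;
}

/* Convert a per-frame time in milliseconds to frames per second. */
double ms_to_fps(double ms)
{
    assert(ms > 0.0);
    return 1000.0 / ms;
}

/* Example workload for trying the harness: just counts its calls. */
void count_frame(void *ctx)
{
    ++*(int *)ctx;
}
```

With the figures from this thread, ms_to_fps(5.57) comes out at roughly 179.5 and ms_to_fps(22.0) at roughly 45.5. As noted above, this kind of loop only means something when the CPU does the whole blit; with an asynchronous DMA transfer it would measure little more than the cost of starting the transfer.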

PostPosted: Jan 15, 2002 @ 9:16am
by Phantom
Digby,

It's clear, and useful :). It's also good to know that a rotate is only slightly slower than a simple blit (excuse the subjective wording ;) ), at least in my book. 4.5 / 5.3 wouldn't influence my decision very much.

- Jacco.

PostPosted: Jan 15, 2002 @ 10:47am
by Johan
<Edited>
I had some ideas on altering the tearing effect. After thinking a bit more on it I came to the conclusion that it would probably provide an even more disturbing effect than the original one. :)
My idea was basically to read the source image non-sequentially, providing an effect similar to that with interlaced video (sacrificing some speed). But after I did some tests in Photoshop I found out it would probably look even more strange than the original effect... :)

The tearing effect is probably the most serious issue with using a buffer in a format not compatible with the display surface... While it could be eliminated with Jacco's previous suggestion of using an extra buffer (memory->rotate->buffer->copy->screen), the time in ms to rotate (memory->rotate->memory) is too high to make it feasible. :/

Syncing to the LCD update cycle would surely be a nice solution for many. This was actually what I thought Compaq had done for the 3800 series (when everyone complained they could not get more than 30 FPS from it). However, judging by the number of negative replies from developers wanting more than 50 FPS in their games, forcing them to sync to the display update cycle would probably cause a similar reaction.

Best regards,
Johan

...

PostPosted: Jan 15, 2002 @ 7:42pm
by Kzinti

PostPosted: Jan 15, 2002 @ 11:12pm
by Digby