I'm just skimming through my newly arrived copy of Intel's "Programming with Intel Wireless MMX Technology". Nice reading, by the way...
Anyway, there is a chapter on "Accelerating Graphics Applications", which has a section on triangle rasterization for 3D applications. It includes several paragraphs on the (well-known) problem of memory latency during rasterization, which can be mitigated by issuing prefetch instructions (PLD) that bring data from main memory into the cache before it is needed later on. And that's about where the discussion ends... well, they finish by saying: "However, the biggest challenge is [the] computation of the address to be pre-fetched."
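To make this concrete, here is a minimal sketch of what I imagine such prefetching could look like in an affine textured span loop. This is my own guess, not code from the book. The function name `fill_span`, the 16.16 fixed-point layout, the square power-of-two wrapping texture, and the lookahead distance of 8 pixels are all assumptions for illustration. I'm also assuming GCC or Clang, where `__builtin_prefetch` should compile to PLD on ARM targets that support it.

```c
#include <stdint.h>

/* Assumption: GCC/Clang builtin; on ARMv5TE/XScale it is expected to emit PLD.
 * On other compilers the prefetch simply compiles to nothing. */
#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH(p) ((void)0)
#endif

/* Hypothetical lookahead in pixels; a guess that would need tuning. */
enum { PREFETCH_DISTANCE = 8 };

/* Hypothetical span filler.
 * u, v, du, dv are 16.16 fixed point; unsigned so that stepping wraps
 * instead of overflowing. The texture is (1 << tex_shift) texels wide and
 * high, and coordinates wrap around its edges. */
void fill_span(uint16_t *dst, int count,
               const uint16_t *tex, unsigned tex_shift,
               uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    const uint32_t mask = (1u << tex_shift) - 1u;

    /* A second pair of coordinates runs PREFETCH_DISTANCE pixels ahead. */
    uint32_t pu = u + du * PREFETCH_DISTANCE;
    uint32_t pv = v + dv * PREFETCH_DISTANCE;

    for (int i = 0; i < count; ++i) {
        /* Address of the texel that will be needed a few pixels from now.
         * Because of the wrap mask it always lies inside the texture. */
        uint32_t ax = (pu >> 16) & mask;
        uint32_t ay = (pv >> 16) & mask;
        PREFETCH(&tex[(ay << tex_shift) | ax]);

        /* Fetch the texel for the current pixel. */
        uint32_t tx = (u >> 16) & mask;
        uint32_t ty = (v >> 16) & mask;
        dst[i] = tex[(ty << tex_shift) | tx];

        u += du;  v += dv;
        pu += du; pv += dv;
    }
}
```

Issuing a prefetch for every pixel like this is probably wasteful, since many neighbouring texels share a cache line; I would guess a real engine issues one only when the lookahead address crosses into a new line, or once per span or scanline instead.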
So, my question: Has anybody tried this type of prefetching in an ARM/XScale-based 3D engine? How much do you gain? Where in the rasterizer would you typically compute and issue the prefetch? How do you determine what to prefetch? Is there a tradeoff?
Thanks,
HM