This has been discussed on <a href="http://slashdot.org/articles/02/06/22/185245.shtml?tid=100">Slashdot</a>, too. The article by "jeffmock" covers it all, really.
Essentially, Intel didn't bother taking existing code into account when they designed the XScale. The evidence:
a mispredicted branch can cost *5* extra cycles compared to an SA (see the conditional-execution sketch after this list)
loads now seem to have a 3-cycle latency, even from the cache (yuck!)
using the result of the previous operation as a shifted operand incurs an extra cycle penalty (the scheduling sketch below shows what both of these cost)
somebody said that multiple loads and stores are slower on the XScale too, but I can't work out why at the moment, 'cos from the datasheet they should be the same as on the SA. It could be because the StrongARM has an automatic read buffer and the XScale doesn't - you have to ask it to preload a cache line (the PLD sketch below shows how).
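To give a feel for what the branch penalty means in practice: on ARM you can often replace a short forward branch with conditionally-executed instructions, which can't mispredict at all. A minimal sketch (GNU-style ARM assembly; the cycle comment is just my reading of the numbers above, not measured):

    @ r0 = max(r0, r1)
    @ branchy version - a mispredict at the bge costs ~5 extra cycles on XScale:
    cmp     r0, r1
    bge     1f
    mov     r0, r1
1:

    @ branch-free version using ARM conditional execution - nothing to mispredict:
    cmp     r0, r1
    movlt   r0, r1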
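And here's roughly what the 3-cycle load latency and the shifter penalty mean for scheduling - again a sketch, with the stall counts assumed from the figures above rather than measured:

    @ naively scheduled - stalls on XScale:
    ldr     r0, [r4]            @ r0 isn't ready for ~3 cycles
    add     r1, r1, r0          @ stalls here waiting for the load
    add     r2, r2, r1, lsl #2  @ extra cycle: shifted use of the previous result

    @ same dependent chain with independent work interleaved to hide the stalls:
    ldr     r0, [r4]
    ldr     r3, [r5]            @ an unrelated load fills a slot
    add     r6, r6, r7          @ unrelated ALU work while r0 is in flight
    add     r1, r1, r0
    mov     r8, r8, lsr #1      @ anything here breaks the back-to-back shifted use
    add     r2, r2, r1, lsl #2

A binary scheduled for the StrongARM has none of this padding in the right places, which is presumably a fair chunk of why old code crawls.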
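As for the preloading I mentioned: the XScale (being ARMv5TE) has a PLD instruction that hints the cache to start fetching a line early, where the StrongARM's read buffer did it for you. A sketch of a copy loop using it - the register roles (r0 = dst, r1 = src, r2 = byte count, a multiple of 16) and the 32-byte preload distance are my assumptions and would need tuning:

copy_loop:
    pld     [r1, #32]           @ hint: start fetching a line ahead of the loads
    ldmia   r1!, {r4-r7}        @ load 16 bytes
    stmia   r0!, {r4-r7}        @ store 16 bytes
    subs    r2, r2, #16
    bgt     copy_loop

(PLD doesn't exist on the StrongARM, which is ARMv4, so code tuned this way isn't backwards-portable.)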
So, in general, code written/compiled for the StrongARM is going to run like shite on an XScale. Which is great. Trust Intel to "screw up" something that used to be nice. Of course, the XScale is nice in different ways... but the StrongARM is/was a lovely little beasty and the XScale is kind of kludgey from what I can see. (But I'm not a processor designer, and I presume that Intel had reasons for "botching"/"removing" some of the strengths of the StrongARM.)
(On a slightly unrelated note, I see that the XScale does implement the Thumb instruction set in at least some variants - I see GBA emulators on the horizon.)
