
fused:
It doesn't have /large/ issues with pipeline stalls, but your code there does seem to use the same target and destination registers all over the place.
In general:
don't use the result of an instruction in the next instruction (or maybe the next two - look at the StrongARM datasheet, the result cycle counts are in there), otherwise you're stalling the pipeline all over the place. For the same reason, don't use a target register as an address register (it might be alright, actually, but it's still not nice):
LDR R0,[R0]
DO use conditional execution - "skipped" instructions will cost you only one cycle, IIRC, no matter what they are. Having said that, branches on the StrongARM are pretty fast.
ROB:
ARMv4s have a 32x32-bit multiply /instruction/ that gives you the full 64-bit result: UMULL. If only you could embed assembler in eVC++, eh.
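For what it's worth, here's a rough C sketch of what UMULL computes, in case it helps to see it without inline assembler. The function names (umull32, umull_hi, umull_lo) are made up for illustration. Many ARM compilers will turn the widening multiply below into a single UMULL, but whether eVC++'s compiler does is an assumption on my part, so check the disassembly.

```c
#include <stdint.h>

/* Full 32x32 -> 64-bit unsigned product, i.e. the value UMULL produces.
   Casting one operand to 64 bits before multiplying keeps the high half;
   a plain 32-bit multiply would silently throw it away. */
uint64_t umull32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* UMULL writes its result into two registers (RdHi and RdLo).
   These hypothetical helpers pull out each half. */
uint32_t umull_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(umull32(a, b) >> 32);
}

uint32_t umull_lo(uint32_t a, uint32_t b)
{
    return (uint32_t)umull32(a, b);
}
```

If the compiler doesn't cooperate, the usual fallback is to put the routine in a separate assembler source file and link it in.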