by Tala » Dec 12, 2004 @ 1:44pm
I think your routine is already optimal, and Kak's suggestions should not help.
Let's look at what the compiler is doing with the inner loop:
|$L28333|
ldr r0, [r2], #4
// the compiler fills the load delay slot(s) by moving the x -= 4; up to here
sub r4, r4, #4
cmp r4, #0
ldr r1, [r2], #4
// again the compiler fills the delay slot after the load, this time with the &, which makes the & basically free
// also note that the constant is already in r5
and r0, r0, r5
orr r1, r0, r1, lsl #16
str r1, [r3], #4
bhi |$L28333|
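For reference, here is roughly what that loop looks like in C. This is only my reconstruction from the listing, not the original source: the function name pack_pairs, the parameter names, and passing the mask as an argument are all assumptions (the listing only shows that some constant sits in r5), and I am assuming x is a positive multiple of 4 on entry.

```c
#include <stdint.h>

/* Hypothetical reconstruction of the inner loop from the listing.
 * src ~ r2, dst ~ r3, x ~ r4, mask ~ the constant held in r5.
 * Assumes x is a positive multiple of 4 on entry. */
static void pack_pairs(uint32_t *dst, const uint32_t *src,
                       uint32_t x, uint32_t mask)
{
    do {
        uint32_t a = *src++;              /* ldr r0, [r2], #4          */
        uint32_t b = *src++;              /* ldr r1, [r2], #4          */
        x -= 4;                           /* sub r4, r4, #4            */
        *dst++ = (a & mask) | (b << 16);  /* and / orr ..lsl #16 / str */
    } while (x > 0);                      /* cmp r4, #0 / bhi          */
}
```

For example, with a mask of 0x0000FFFF (again an assumption), the input words 0x1234ABCD and 0x00005678 are packed into 0x5678ABCD. Each iteration is two loads, one mask, one shift-and-or, and one store, which is exactly what the compiler emitted.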
As far as the cache is concerned, it won't help you here, since you read each word only once and write each result only once. Caching only helps if data read in one iteration is used again in other iterations, which is not the case here.
In other words, compilers are pretty clever these days. Even with hand-written asm you won't get any faster.
Tala