Hello.
Just thought I'd clear some things up (mainly regarding the cache hit issue I mentioned).
Firstly, in the majority of cases, the speed loss of a regular linked list *really* isn't going to matter. If you're writing a text editor or whatever, a few cache misses aren't going to matter a great deal... if, however, you're writing a game or something that *needs* speed, then it's more of a problem.
Regarding Pam's question, Digby's reply is correct. A list of pointers into the data doesn't really help all that much - you're just increasing the chances that the next pointer to the next data is in the cache... but the data you *need* from the object may not be - there's no point looking up a pointer if you always have to use some of the object to figure out if it's the one you want.
To organise a linked list well for a given processor, you want to fit all of the data used in your traversal comparison within one cache line of that processor, and align each object to a cache-line boundary.
The StrongARM, for example, has a 32-byte cache line.
If I request a single byte of data, 32 bytes will be fetched and loaded into the cache (if it's in a cacheable area - let's pretend it is).
If my object is, say, 64 bytes long, structured like this:
long l_my_id
char c_filler[56]
pointer p_next
And my (nasty pseudo) code goes:
obj = head_object
// we want to find object 63
while obj.l_my_id <> 63 and obj.p_next != null
    obj = obj.p_next
loop
First the processor will load the first 32 bytes of the object (to read l_my_id)... then it'll load the next 32 bytes to read p_next.
If I reorganise my data *slightly*:
long l_my_id
pointer p_next
char c_filler[56]
... then I'm only ever going to bother loading *one* cache line per object for the traversal. This means I load 1/2 as much data, which increases the chances of things hitting the cache.
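To make that concrete, here's a small C sketch of the reorganised node, using the field names from the pseudocode above (the 48-byte filler is just a stand-in for whatever payload you actually carry). With l_my_id and p_next packed at the front, the traversal only ever touches the first cache line of each node:

```c
#include <stddef.h>

/* Reorganised node: the "hot" traversal fields come first, the
   cold filler after. On a 32-byte-line machine the id and next
   pointer land in the same line, so a miss during traversal only
   ever costs one line fill per node. */
struct node {
    long         l_my_id;      /* compared during traversal        */
    struct node *p_next;       /* followed during traversal        */
    char         c_filler[48]; /* cold payload, read only on a hit */
};

/* Walk the list looking for a given id; returns NULL if absent. */
struct node *find(struct node *head, long id)
{
    struct node *obj = head;
    while (obj != NULL && obj->l_my_id != id)
        obj = obj->p_next;
    return obj;
}
```

On a typical 64-bit compiler both hot fields sit within the first 16 bytes, comfortably inside a single 32-byte line.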
You could also store the "traversal" data (the pointer and the items used in the comparison) in one area along with a pointer to the data you rarely use.
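One way to sketch that split (the names here are made up for illustration): keep only the comparison fields and the link in the node itself, and park everything else behind a pointer:

```c
#include <stddef.h>

/* Rarely-used payload, allocated separately from the list nodes. */
struct payload {
    char name[64];
    /* ...whatever else you only need after finding a match... */
};

/* Hot traversal record: small enough to fit in one cache line. */
struct hot_link {
    long             l_my_id; /* used in the comparison         */
    struct hot_link *p_next;  /* used to walk the list          */
    struct payload  *p_data;  /* dereferenced only on a match   */
};
```

At 24 bytes on a typical 64-bit compiler, the hot record fits in a single 32-byte line, and a traversal never drags the payload through the cache at all.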
Sometimes optimisation is more about clever data placement than about making the code itself faster.
MirekCz is correct too, if you're allocating data using malloc for each small structure, you're going to suffer from memory fragmentation, and nasty speed issues. If you allocate a block of memory (say 64 at a time), then you can just use that as a set of empty links. When you want to add an object, you simply pick the next new object in your array. When they're all used, allocate another block. That means you only hit the malloc() function once every 64 objects.
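A minimal C sketch of that block scheme (the names and the block size of 64 are just for illustration): malloc() one slab of 64 links, chain them into a free list, and hand them out one at a time:

```c
#include <stdlib.h>

#define POOL_BLOCK 64

struct pool_node {
    long              l_my_id;
    struct pool_node *p_next; /* doubles as the free-list link */
};

static struct pool_node *free_list = NULL;

/* Carve a freshly malloc'd slab into a chain of free nodes. */
static int pool_grow(void)
{
    struct pool_node *block = malloc(POOL_BLOCK * sizeof *block);
    if (block == NULL)
        return -1;
    for (int i = 0; i < POOL_BLOCK - 1; i++)
        block[i].p_next = &block[i + 1];
    block[POOL_BLOCK - 1].p_next = free_list;
    free_list = block;
    return 0;
}

/* Pop a node off the free list; malloc() is hit once per 64 nodes. */
struct pool_node *pool_alloc(void)
{
    if (free_list == NULL && pool_grow() != 0)
        return NULL;
    struct pool_node *n = free_list;
    free_list = n->p_next;
    return n;
}

/* Push a node back for reuse instead of calling free(). */
void pool_free(struct pool_node *n)
{
    n->p_next = free_list;
    free_list = n;
}
```

Freed links go back on the list for reuse (the slabs themselves are never returned to the system in this sketch), and as a bonus, links from the same slab are contiguous in memory, which also helps the cache during traversal.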
Cheers,
Refractor.