by Digby » Mar 8, 2002 @ 6:15am
I wrote one for my iPaq a while back. If you compile with the /Gh switch, the compiler inserts a call to _penter() at the beginning of every function in your source file. I wrote an implementation of _penter() that munges the stack so that when the instrumented function returns, it returns into my _pexit() routine instead of going straight back to its caller. Since I now have one hook that's called at the beginning of each function and another at the end, I can use QueryPerformanceCounter to read the system counter and write the values to a buffer, along with the function's start address. When the buffer is full, or when the app exits, I write the buffer to a file. From there, an app running on my PC looks at the times and function addresses and builds a call graph. The ImageHlp APIs can get you the symbol name associated with an address.
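A rough sketch of just the buffering side, in C. This is not the original profiler's code: the record layout, the prof_* names and the tiny buffer size are all made up for illustration, and clock() stands in for QueryPerformanceCounter so the sketch builds anywhere. The _penter()/_pexit() hooks themselves (and the stack munging) are compiler- and CPU-specific and are not shown; the assumption is that they would call prof_record() with the function's start address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical record: one per function entry or exit. */
typedef struct {
    uintptr_t addr;     /* start address of the instrumented function */
    uint64_t  ticks;    /* raw counter value when the event happened */
    uint8_t   is_exit;  /* 0 = entry hook, 1 = exit hook */
} ProfEvent;

enum { PROF_CAPACITY = 4 };  /* tiny on purpose so a flush is easy to observe */

ProfEvent g_events[PROF_CAPACITY];
size_t    g_count;    /* records currently sitting in the buffer */
size_t    g_flushed;  /* records handed to prof_flush() so far */
FILE     *g_out;      /* NULL means flushed records are simply discarded */

/* Stand-in for QueryPerformanceCounter (assumption, see lead-in). */
uint64_t prof_ticks(void) {
    return (uint64_t)clock();
}

/* Open the output file; returns 1 on success, 0 on failure. */
int prof_open(const char *path) {
    g_out = fopen(path, "wb");
    return g_out != NULL;
}

/* Write out whatever is buffered and reset the buffer. */
void prof_flush(void) {
    if (g_out != NULL && g_count > 0) {
        fwrite(g_events, sizeof g_events[0], g_count, g_out);
    }
    g_flushed += g_count;
    g_count = 0;
}

/* Called from the entry and exit hooks. */
void prof_record(uintptr_t addr, int is_exit) {
    assert(g_count < PROF_CAPACITY);  /* flushing on full keeps this true */
    ProfEvent *e = &g_events[g_count++];
    e->addr    = addr;
    e->ticks   = prof_ticks();
    e->is_exit = (uint8_t)(is_exit != 0);
    if (g_count == PROF_CAPACITY) {
        prof_flush();
    }
}

/* Call at app exit: flush the tail of the buffer and close the file. */
void prof_close(void) {
    prof_flush();
    if (g_out != NULL) {
        fclose(g_out);
        g_out = NULL;
    }
}
```

The host-side tool would then read the fixed-size records back, pair each entry with its matching exit, and resolve addr to a symbol name.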
I added additional functions so that you can put tags around code sections, and each tagged section shows up on the call graph as a measured "region". There are also tags for displaying individual values in the call graph (like current free memory).
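To give an idea of what such tags could look like from the caller's side, here is a minimal sketch in C. The function names (prof_region_begin, prof_region_end, prof_value), the event layout and the drop-when-full behaviour are assumptions for illustration, not the original API, and clock() again stands in for QueryPerformanceCounter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef enum { TAG_REGION_BEGIN, TAG_REGION_END, TAG_VALUE } TagKind;

/* Hypothetical tag record; name is the label shown on the call graph. */
typedef struct {
    TagKind     kind;
    const char *name;
    uint64_t    ticks;  /* raw counter value when the tag was hit */
    long        value;  /* only meaningful for TAG_VALUE */
} TagEvent;

enum { TAG_CAPACITY = 64 };

TagEvent tag_log[TAG_CAPACITY];
size_t   tag_count;

/* Stand-in timer (assumption, see lead-in). */
uint64_t tag_ticks(void) {
    return (uint64_t)clock();
}

/* Append one tag; this sketch drops tags once the log is full,
   where a real profiler would flush to a file instead. */
void tag_emit(TagKind kind, const char *name, long value) {
    assert(name != NULL);
    if (tag_count == TAG_CAPACITY) {
        return;
    }
    TagEvent *e = &tag_log[tag_count++];
    e->kind  = kind;
    e->name  = name;
    e->ticks = tag_ticks();
    e->value = value;
}

/* Mark the start and end of a measured region. */
void prof_region_begin(const char *name) { tag_emit(TAG_REGION_BEGIN, name, 0); }
void prof_region_end(const char *name)   { tag_emit(TAG_REGION_END, name, 0); }

/* Record a single value, e.g. current free memory. */
void prof_value(const char *name, long value) { tag_emit(TAG_VALUE, name, value); }
```

Usage would be a prof_region_begin("render") / prof_region_end("render") pair around the code of interest, with prof_value("free_mem", n) wherever a value is worth seeing on the graph.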
The calls to _penter() generated by the compiler are very unintrusive as far as the compiler's optimizer is concerned. The optimizer (at least on the ARM) will generate the same code for your function because _penter() takes no arguments and doesn't return a value. Sure, your code will have to deal with the overhead of whatever happens inside _penter() (and _pexit()), but the optimizer has access to the same number of registers with and without the instrumentation. Other methods of profiling (like the tags stuff I added) will affect the optimizer, so the code you measure won't be the same as the code generated without the instrumentation. Just something to consider if you decide to roll your own profiler.
A profiler such as the one I wrote works pretty well if you aren't familiar with the code you're trying to measure (like when you're porting someone else's game). However, if it's your own code, you should already have a pretty good idea of where to profile, and that's why I haven't bothered to tidy up my profiler. These days, I just throw in a couple of calls to QueryPerformanceCounter() around the routines I want to optimize and output the elapsed time to the debugger.
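A minimal sketch of that quick-and-dirty approach, in C. The timer_* helper names are made up for illustration. On Windows it uses QueryPerformanceCounter, QueryPerformanceFrequency and OutputDebugStringA (assuming a desktop build; a Unicode-only target would need OutputDebugStringW). Elsewhere it falls back to C11 timespec_get and stderr purely so the sketch compiles, which is a wall-clock stand-in rather than a true performance counter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

/* Raw counter value, in ticks. */
int64_t timer_now(void) {
#ifdef _WIN32
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return t.QuadPart;
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);  /* fallback only: nanoseconds of wall-clock time */
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
#endif
}

/* Ticks per second for the counter above. */
int64_t timer_frequency(void) {
#ifdef _WIN32
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);
    return f.QuadPart;
#else
    return 1000000000LL;
#endif
}

/* Convert a pair of raw counter readings to milliseconds. */
double timer_elapsed_ms(int64_t start, int64_t end) {
    int64_t freq = timer_frequency();
    assert(freq > 0);
    return (double)(end - start) * 1000.0 / (double)freq;
}

/* Print "label: N ms" to the debugger (or stderr in the fallback build). */
void timer_report(const char *label, int64_t start, int64_t end) {
    char line[128];
    snprintf(line, sizeof line, "%s: %.3f ms\n", label,
             timer_elapsed_ms(start, end));
#ifdef _WIN32
    OutputDebugStringA(line);
#else
    fputs(line, stderr);
#endif
}
```

Typical use is to grab timer_now() before and after the routine you care about and pass both readings to timer_report() along with a label.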