Benchmarking the LPC1343 in code using DWT
In-code performance testing allows you to dynamically measure the number of cycles a function takes.
The ARM Cortex-M3 Technical Reference Manual describes an optional block called the Data Watchpoint and Trace Unit (DWT). If you've ever wondered how you could determine exactly how many cycles a specific function or piece of code was using, this is what you need.
DWT Support in the LPC1343 Code Base: The DWT registers and macro in 'core/cpu/cpu.h' were only added in v0.86 of the LPC1343 Code Base. Make sure that your copies of cpu.h and lpc134x.h come from that version or later, or add the registers manually yourself.
Using the DWT Cycle Counter
The DWT includes a cycle counter that counts the core clock cycles that have elapsed. By resetting the counter register just before a function, executing the function, and then reading the counter register again, you can determine exactly how many cycles the function or code took. Simply divide the number of cycles by the CPU frequency in Hz (e.g. 72,000,000 for 72MHz) to get the elapsed time in seconds.
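The cycles-to-time conversion described above can be sketched as a small helper. This is only an illustration: cycles_to_ns is a hypothetical name that is not part of the LPC1343 Code Base, and it returns nanoseconds rather than seconds so that the arithmetic can stay in integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the LPC1343 Code Base): converts a
   cycle count into nanoseconds for a core clock given in Hz. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz != 0); /* a zero clock frequency would divide by zero */

    /* Multiply first, in 64 bits, so that no precision is lost and the
       intermediate product cannot overflow for any 32-bit cycle count. */
    return ((uint64_t)cycles * 1000000000ULL) / cpu_hz;
}
```

For example, cycles_to_ns(72, 72000000) evaluates to 1000, meaning that 72 cycles at 72MHz take 1 microsecond. The result is truncated, so a single cycle at 72MHz is reported as 13 ns rather than 13.9 ns.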
To make this easier, a simple macro has been added to 'core/cpu/cpu.h' that enables the cycle counter and resets it to 0:
#define CPU_RESET_CYCLECOUNTER do { SCB_DEMCR = SCB_DEMCR | 0x01000000; /* set TRCENA (bit 24) to enable the DWT */ \
DWT_CYCCNT = 0; /* reset the cycle counter */ \
DWT_CTRL = DWT_CTRL | 1; /* set CYCCNTENA (bit 0) to start counting */ } while(0)
To use the macro and determine how many cycles a specific function or piece of code takes, you simply need to reset the counter, run your code, and then read the DWT_CYCCNT register as follows:
CPU_RESET_CYCLECOUNTER;
/* ... code to measure goes here ... */
int cycles = DWT_CYCCNT;
One problem that you'll notice, though, is that if you attempt to measure a very small operation like a 'nop' in assembly (1 cycle long), you'll get an unexpected value. Depending on a number of factors, measuring a single 'nop' will likely report somewhere between 5 and 10 cycles, even though the instruction itself only takes one cycle. The reason for this is that by the time you go back and read DWT_CYCCNT, the clock has already moved on a few ticks. Thankfully, we can easily compensate for this and get reasonably accurate results even on extremely small operations.
Calculating the Cycle Counter Offset
You can calculate the 'offset' for DWT_CYCCNT readings using a single 'nop' operation as follows:
int offset, cycles;
CPU_RESET_CYCLECOUNTER;
__asm volatile("nop");
cycles = DWT_CYCCNT;
offset = cycles - 1;
// Display the results
printf("1x nop = %d Cycles\r\n", cycles);
printf("DWT Cycle Counter offset set to %d\r\n", offset);
At this point, you simply need to deduct 'offset' from any future readings. For example, if we execute 10 'nop's, we should get a value of 10 cycles from DWT_CYCCNT after compensating with the offset we calculated above:
CPU_RESET_CYCLECOUNTER;
__asm volatile("nop");
__asm volatile("nop");
__asm volatile("nop");
__asm volatile("nop");
__asm volatile("nop");
__asm volatile("nop");
__asm volatile("nop");
__asm volatile("nop");
__asm volatile("nop");
__asm volatile("nop");
cycles = DWT_CYCCNT - offset;
// Display the results
// This should be equal to 10
printf("10x nop = %d Cycles\r\n", cycles);
This should display 10 cycles for the 10 'nop' operations. (Note: This may still be off by 1 cycle, but using this offset method will still be more accurate for extremely brief operations than the first example we showed.)
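The offset arithmetic above can also be wrapped in two small helpers. This is only a sketch under stated assumptions: calc_offset and compensate_cycles are hypothetical names that are not part of the LPC1343 Code Base, and the readings are passed in as plain values so that the logic can be checked with any host compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: derives the read overhead from a calibration
   reading of code whose true cost is known (1 cycle for a single nop). */
uint32_t calc_offset(uint32_t measured, uint32_t known_cycles)
{
    assert(measured >= known_cycles); /* a reading can't be below the true cost */
    return measured - known_cycles;
}

/* Hypothetical helper: removes the overhead from a raw reading. The result
   is clamped at zero so that a reading smaller than the offset cannot wrap
   around to a huge unsigned value. */
uint32_t compensate_cycles(uint32_t raw, uint32_t offset)
{
    return (raw > offset) ? (raw - offset) : 0u;
}
```

On the target you would pass in the register value directly, for example compensate_cycles(DWT_CYCCNT, offset). As an illustration with made-up readings: if a single 'nop' reads back as 7 cycles, calc_offset(7, 1) gives an offset of 6, and a raw reading of 16 for ten 'nop's is then compensated to 10.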
Putting It to Use
In-code benchmarking and performance testing can be extremely useful in a number of circumstances, particularly where critical timing is required or when you need to determine exactly how long each iteration of a repetitive task will take, including any background interrupts that may delay the task (however briefly). Looking at the compiled assembly for a project, you can determine on paper how many cycles a function will take, but if you also have a systick timer running in the background, USB, a watchdog timer, etc., all of these will also consume cycles that won't be visible in the assembly code. By measuring the clock cycles with the DWT cycle counter and averaging the results over a number of readings, you can get a far more accurate idea of what kind of performance you can really expect from your system.
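Averaging a set of readings, as described above, could look something like the following. This is a minimal sketch: average_cycles and max_cycles are hypothetical helpers that are not part of the LPC1343 Code Base, and they assume you have already collected offset-compensated DWT_CYCCNT readings into an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: integer mean of n cycle readings. A 64-bit
   accumulator is used so that the sum cannot overflow. */
uint32_t average_cycles(const uint32_t *samples, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    assert(samples != NULL && n > 0);
    for (i = 0; i < n; i++) {
        sum += samples[i];
    }
    return (uint32_t)(sum / n);
}

/* Hypothetical helper: largest reading in the set, i.e. the worst case
   seen while interrupts were delaying the measured code. */
uint32_t max_cycles(const uint32_t *samples, size_t n)
{
    uint32_t max;
    size_t i;

    assert(samples != NULL && n > 0);
    max = samples[0];
    for (i = 1; i < n; i++) {
        if (samples[i] > max) {
            max = samples[i];
        }
    }
    return max;
}
```

Looking at the maximum alongside the average is a design choice rather than a requirement: the average shows typical performance, while the maximum hints at how much an interrupt can stretch a single iteration when timing is critical.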