Optimizing multiply-accumulate (RMPA) with GCC

I figured some of the contest entries might be doing some DSP with GNURX, so I put together some sample code that shows how a portable C implementation of a multiply-accumulate (MAC) loop compares to the RX version that uses the RMPA instruction:

rx-elf-gcc -g -O3 -MMD   -c -o main.o main.c
rx-elf-gcc -g -O3 -MMD   -c -o mac-c.o mac-c.c
rx-elf-gcc -g -O3 -MMD   -c -o data.o data.c
rx-elf-gcc main.o mac-c.o data.o -o mac-c.x -msim
rx-elf-run mac-c.x
4480 clocks (28.0 per datum)
29920 result
rx-elf-gcc -g -O3 -MMD   -c -o rmpa.o rmpa.c
rx-elf-gcc main.o rmpa.o data.o -o rmpa.x -msim
rx-elf-run rmpa.x
800 clocks (5.0 per datum)
29920 result

Note that my examples use the GNU RX simulator's built-in TPU0 simulation. The RX/62N doesn't have a TPU (it uses MTU instead), but you can adapt them :-)

I put a zip file on my web server with all the sample code.

As you can see from the simulated sample output above, using the RMPA opcode instead of portable C results in a MUCH faster loop: only 5 clock cycles per MAC instead of 28, which is 5.6x faster (4480 clocks vs. 800 for the same data and the same result).
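
For anyone who doesn't want to grab the zip, here is a rough sketch of what the portable C side of a comparison like this could look like. This is not the actual mac-c.c; the function name mac_c, the 16-bit sample type, and the 64-bit accumulator are all assumptions picked for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable MAC loop (not the original mac-c.c):
 * returns the sum of a[i] * b[i] over n samples. */
int64_t mac_c(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;  /* wide accumulator, assumed here to avoid overflow */

    for (size_t i = 0; i < n; i++) {
        /* Widen both operands to 32 bits before multiplying so the
         * product of two 16-bit samples always fits. */
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

Called on two equal-length sample arrays, this does one multiply and one add per datum, which is the work that RMPA does in a single instruction on the RX.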