I started my adventure with the RZ/A1H. I bought the starter kit and installed e2studio 6.2 with KPIT 16.01. Once I was familiar enough with the examples, I decided to check the CPU performance, and as the simplest test I ran a loop with a million NOP instructions. The code was executed with interrupts disabled and before the FreeRTOS kernel starts, just after the init functions are called, so that nothing could interrupt the test loop. Before and after the loop I toggled one of the LEDs to measure the execution time with an oscilloscope. I was surprised that a 400 MHz CPU needs ~170 ms to execute the loop; I expected a bit more than 2.5 ms.
Later I discovered that with optimization -O3 the loop executes in less than 3 ms, so I thought that everything was fine, but in fact it wasn't. With -O3 the loop counter is kept in CPU registers and the execution time is as expected, but when I change to -O2 or disable optimization completely (-O0), the loop counter is implemented as a variable in SRAM. So I think the problem is somehow related to SRAM access. I went through all the examples looking for any difference in the CPU initialization code, but they are all the same.
Of course I also tried measuring the execution time of more complicated code, but the result was always the same: execution is terribly slow, roughly 60 times slower than expected. For example, a FreeRTOS context switch, fewer than 100 asm instructions, takes ~22 us regardless of optimization. The code is executed from SRAM, and it makes no difference whether it is loaded via J-Link or the bootloader, or whether it is a hardware debug or release build.
Has anybody had similar problems? Execution works as expected, I mean the program works as designed, but it is 60 or more times slower than expected.
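The numbers in the post above can be sanity-checked with a little arithmetic. The following is only a sketch: the helper names are made up for illustration, and the "ideal" figure assumes exactly one instruction retired per clock cycle, which a real Cortex-A9 will not match exactly.

```c
#include <stdint.h>

/* Ideal run time in microseconds for `instructions` instructions at
 * `cpu_hz`, ASSUMING one instruction per cycle (illustrative model only). */
static double ideal_time_us(uint64_t instructions, double cpu_hz)
{
    return (double)instructions / cpu_hz * 1e6;
}

/* Average number of CPU cycles spent per loop iteration, derived from a
 * measured run time in microseconds. */
static double cycles_per_iteration(double measured_us, double cpu_hz,
                                   uint64_t iterations)
{
    return measured_us * 1e-6 * cpu_hz / (double)iterations;
}

/* How many times slower the measurement is than the ideal figure. */
static double slowdown(double measured_us, double ideal_us)
{
    return measured_us / ideal_us;
}
```

With the values from the post, `ideal_time_us(1000000, 400e6)` comes out at about 2500 us (2.5 ms), and `cycles_per_iteration(170000.0, 400e6, 1000000)` at about 68 cycles per iteration, which matches the "roughly 60 times slower" observation. The same calculation for the context switch (22 us at 400 MHz) gives about 8800 cycles for fewer than 100 instructions.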
You may have already done this, but did you check the clock frequency settings through the frequency control registers? However, please wait until the RZ experts respond to your post. Thank you.
JB
RenesasRulz Forum Moderator
In reply to michael kosinski:
> I also tried the L1 cache with and without the L2 cache, but always with the MMU disabled; enabling it could only complicate the issue further, I think.
If you are trying to measure performance, you should at least read and properly understand the manual of the CPU core (ARM Cortex-A9). On the Cortex-A9, the L1 cache is disabled if the MMU is disabled.
Do not forget to enable branch prediction as well as L1 instruction/data cache.
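As a rough illustration of the check suggested above, the sketch below tests a raw SCTLR value for the MMU, L1 data cache, branch prediction and L1 instruction cache enable bits. The bit positions follow the ARMv7-A SCTLR layout; the function names are hypothetical, and how the value is obtained on the target is left to a comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* SCTLR enable bits (ARMv7-A layout). */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_Z (1u << 11)  /* branch prediction enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

#define SCTLR_PERF_BITS (SCTLR_M | SCTLR_C | SCTLR_Z | SCTLR_I)

/* On the target (GCC, privileged mode) the raw value could be read with:
 *   uint32_t v;
 *   __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
 * Here the value is passed in, so the helpers also run on a host PC. */

/* True if MMU, both L1 caches and branch prediction are all enabled. */
static bool sctlr_perf_bits_enabled(uint32_t sctlr)
{
    return (sctlr & SCTLR_PERF_BITS) == SCTLR_PERF_BITS;
}

/* Mask of the performance-related enable bits that are still clear. */
static uint32_t sctlr_missing_bits(uint32_t sctlr)
{
    return SCTLR_PERF_BITS & ~sctlr;
}
```

For example, passing a value such as 0x00C50078 (used here purely as an illustration) gives a missing-bits mask of 0x1805, meaning M, C, Z and I are all clear, while 0x00C5187D passes the check.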
In reply to Pecteilis:
> I checked and MMU is managed by the startup files and is enabled.
So, I would like to confirm:
Since you enabled the MMU and level-1 instruction/data caches,
- you have set up the SCTLR of CP15 correctly
- you have created the proper address translation table and set its start address to the TTBR0/TTBR1 registers of CP15
- you have set the memory attributes of internal RAM as:
  - cacheability: outer and inner write-back, no write-allocate
  - memory type: normal memory
  - shareability: not-shareable
In other words,
· What was the value of SCTLR (System Control Register)? What are the values of M, C, Z, I, TRE and AFE bits of the SCTLR register?
· What was the value of the TTBR0 and TTBR1 registers? What were the values of the translation table entries corresponding to the internal RAM area (area of physical address 0x20000000 to 0x209FFFFF) in the primary translation table pointed to by TTBR0? What are the values of C, B, Domain[3:0], AP[2:0], TEX[2:0], S, nG bits in the translation table entry corresponding to the internal RAM area?
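To answer the questions above, it can help to decode the table entries by hand. The sketch below assumes the ARMv7-A short-descriptor format with 1 MB section entries, TTBCR.N = 0 (so TTBR0 holds a 16 KB aligned table base) and SCTLR.TRE = 0 and AFE = 0; the startup code of a given project may use a different layout. All type and function names are hypothetical.

```c
#include <stdint.h>

/* Fields of a first-level "section" descriptor (short-descriptor format). */
typedef struct {
    unsigned is_section;  /* bits[1:0] == 0b10 and bit 18 == 0 */
    unsigned b, c, xn;    /* bits 2, 3, 4 */
    unsigned domain;      /* bits [8:5] */
    unsigned ap;          /* AP[2:0] = bit 15 : bits [11:10] */
    unsigned tex;         /* bits [14:12] */
    unsigned s, ng;       /* bits 16, 17 */
    uint32_t base;        /* section base address, bits [31:20] */
} section_desc;

/* Index of the first-level entry that maps virtual address `va`. */
static uint32_t l1_index(uint32_t va) { return va >> 20; }

/* Address of that entry, ASSUMING TTBCR.N == 0. */
static uint32_t l1_entry_addr(uint32_t ttbr0, uint32_t va)
{
    return (ttbr0 & 0xFFFFC000u) + l1_index(va) * 4u;
}

static section_desc decode_section(uint32_t d)
{
    section_desc s;
    s.is_section = ((d & 0x3u) == 0x2u) && (((d >> 18) & 1u) == 0u);
    s.b      = (d >> 2) & 1u;
    s.c      = (d >> 3) & 1u;
    s.xn     = (d >> 4) & 1u;
    s.domain = (d >> 5) & 0xFu;
    s.ap     = (((d >> 15) & 1u) << 2) | ((d >> 10) & 0x3u);
    s.tex    = (d >> 12) & 0x7u;
    s.s      = (d >> 16) & 1u;
    s.ng     = (d >> 17) & 1u;
    s.base   = d & 0xFFF00000u;
    return s;
}

/* With TRE == 0: TEX=000, C=1, B=1 encodes normal memory, outer and inner
 * write-back, no write-allocate; S == 0 means not-shareable. */
static int is_wb_no_wa_nonshared(const section_desc *s)
{
    return s->is_section && s->tex == 0u && s->c == 1u && s->b == 1u
           && s->s == 0u;
}
```

Under these assumptions the internal RAM range 0x20000000 to 0x209FFFFF is covered by entries 0x200 to 0x209 of the first-level table, and a descriptor such as 0x20000C0E (an illustrative value, not taken from any Renesas sample) decodes to TEX=000, C=1, B=1, S=0, AP=011, which satisfies `is_wb_no_wa_nonshared`. A value such as 0x20000C02 (C=0, B=0) would indicate uncached, strongly-ordered memory.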
Also, from your first post, you seem to assume that you get the performance with a cache hit ratio of 100%, so when you measure performance, you have some way to load all instructions and data of your program into the level-1 caches.
What kind of method did you use?
# for now, we will postpone confirming whether branch prediction is enabled...