RZA1H extremely slow code execution

Hi,

I have started my adventure with the RZ/A1H. I bought the starter kit and installed e2studio 6.2 with KPIT 16.01. Once I was familiar enough with the examples, I decided to check the CPU performance, and as the simplest test I ran a loop of one million NOP instructions. The code ran with interrupts disabled and before the FreeRTOS kernel starts, right after the init functions are called, so that nothing could disturb the test loop. Before and after the loop I toggled one of the LEDs to measure the execution time with an oscilloscope. I was surprised that a 400 MHz CPU needs ~170 ms to execute the loop; I expected a bit more than 2.5 ms.

Later I discovered that with -O3 optimization the loop executes in less than 3 ms, so I thought everything was fine, but in fact it wasn't. With -O3 the CPU keeps the loop counter in its registers, and the execution time is as expected. But when I change it to even -O2, or disable optimization completely with -O0, then I see that the loop counter is kept as a variable in SRAM. So I think the problem is somehow related to SRAM access.

I went through all the examples looking for any difference in the CPU initialization code, but they are all the same. Of course I also measured the execution time of more complicated code, but the results were always the same: execution is terribly slow, roughly 60 times slower than expected. For example, a context switch in FreeRTOS, fewer than 100 asm instructions, takes ~22 us regardless of optimization. The code is executed from SRAM, and whether it is loaded by J-Link or by the bootloader, as a hardware debug or a release build, changes nothing.

Has anybody had similar problems? Execution works as expected, I mean the program works as designed, but 60 or more times slower than expected.

  • Hi Hilson,

    You may have already done this, but did you check the clock frequency settings through the frequency control registers? However, please wait until the RZ experts respond to your post. Thank you.

    JB
    RenesasRulz Forum Moderator

    https://renesasrulz.com/
    https://academy.renesas.com/
    https://en-us.knowledgebase.renesas.com/

  • In reply to JB:

    Hi JB,

    Thank you for your quick reply.

    This was the first thing I suspected, but the startup files set the registers correctly to achieve 400 MHz. By commenting out these few lines I left the CPU at its default 100 MHz, but that didn't resolve the problem, and neither did other combinations. Finally I restored the original code and used that frequency to clock the SPI module. Running a simple transmission and watching the CLK line on the oscilloscope proved that there really must be 400 MHz inside.

    I hope that some remedy exists.
  • In reply to hilson:

    Hi Hilson,
    The compiler optimization -O3 would map the counter variable to a register. This can be observed in the disassembly window. Have you counted the number of assembler instructions executed per loop iteration? This should include the loop maintenance and the NOPs.
    In addition, are the L1 cache and the MMU enabled?
  • In reply to michael kosinski:

    Hi Michael,
    First of all, thank you for your advice.
    I looked at the disassembly. With -O3 the loop counter was decremented once for every 8 NOP instructions; with -O0 or other levels it was decremented for every instruction, so that loop had 6 times more asm instructions than the optimized one. But I understand this and it is not the issue.
    I also tried the L1 cache with and without the L2 cache, but always with the MMU disabled, since I think enabling it can only complicate the issue further.
  • In reply to hilson:

    Hi Hilson,

    > Also I tried the L1 with or without the L2 cache, but always with disabled MMU, that can only complicates the issue more, I think.

    If you are trying to measure performance, you should at least read and properly understand the manual of the CPU core (ARM Cortex-A9).
    For Cortex-A9, the L1 data cache is effectively disabled if the MMU is disabled.

    Do not forget to enable branch prediction as well as L1 instruction/data cache.

    BR,

  • In reply to Pecteilis:

    Hi Pecteilis,
    Thank you very much.
    At the beginning I took all the initialization code from the examples and compared them with each other. I thought they were all OK and set up the CPU properly, but I will look at them once more; maybe there is something about the MMU that I overlooked before.
  • In reply to hilson:

    Hi,
    I checked and MMU is managed by the startup files and is enabled.
  • In reply to hilson:

    Hi Hilson,

    > I checked and MMU is managed by the startup files and is enabled.

    So, I would like to confirm:

    Since you enabled the MMU and level-1 instruction/data caches,
    - you have set up the SCTLR of CP15 correctly
    - you have created a proper address translation table and written its start address to the TTBR0/TTBR1 registers of CP15
    - you have set the memory attributes of internal RAM as:
        - cacheability: outer and inner write-back, no write-allocate
        - memory type: normal memory
        - shareability: not-shareable

    In other words,

    · What was the value of SCTLR (System Control Register)?
        What are the values of M, C, Z, I, TRE and AFE bits of the SCTLR register?

    · What was the value of the TTBR0 and TTBR1 registers?
        What were the values of the translation table entries corresponding to the internal RAM area
        (area of physical address 0x20000000 to 0x209FFFFF) in the primary translation table
        pointed to by TTBR0?
        What are the values of C, B, Domain[3:0], AP[2:0], TEX[2:0], S, nG bits in the translation table entry
        corresponding to the internal RAM area?

    Also, from your first post, you seem to assume that you get the performance of a 100% cache hit ratio,
    so when you measure performance, you must have some way to load all instructions and data of your program
    into the level-1 caches.

    What kind of method did you use?

    # for now, we will postpone confirming whether branch prediction is enabled...

    BR,

  • In reply to Pecteilis:

    Hi Pecteilis,
    I went through startup code to answer your questions

    1) SCTLR is cleared at the beginning, i.e. bits I, C, M and V, and later A
    2) In further steps, bits M, I, D, Z and C of SCTLR are set

    Recently I have been wondering whether the A9 can be used like a standard microcontroller with a huge SRAM, but without L1, TTB and so on. My code is executed directly from SRAM and the data is also in SRAM, so basically what is the point of caching code and data that are in the same memory?

    Regarding TTBR0, I found this snippet of code with the comments:

    init_TTB:
    /*******************************************************************************
    * Cortex-A9 MMU Configuration *
    * Set translation table base *
    *******************************************************************************/
    /* Cortex-A9 supports two translation tables */
    /* Configure translation table base (TTB) control register cp15,c2 */
    /* to a value of all zeros, indicates we are using TTB register 0. */
    MOV r0,#0x0
    MCR p15, 0, r0, c2, c0, 2 /* TTBCR */

    /* write the address of our page table base to TTB register 0 */
    /* start of table from .ld file */
    LDR r0,=ttb_mmu1_base

    /* RGN=b01 (outer cacheable write-back cached, write allocate) */
    MOV r1, #0x08

    /* S=0 (translation table walk to non-shared memory) */
    /* IRGN=b01 (inner cacheability for the translation table walk is
    Write-back Write-allocate) */
    ORR r1,r1,#0x40
    ORR r0,r0,r1

    /* TTBR0 */
    MCR p15, 0, r0, c2, c0, 0

    /*******************************************************************************
    * PAGE TABLE generation
    * Generate the page tables
    * Build a flat translation table for the whole address space.
    * ie: Create 4096 1MB sections from 0x000xxxxx to 0xFFFxxxxx
    * |31        20|19 18|17|16|15 |14 12|11 10|9|8    5|4 |3 2|1 0|
    * |base address| 0  0|nG| S|AP2| TEX | AP  |P|Domain|XN|C B|1 0|
    *
    * Bits[31:20] - Top 12 bits of VA is pointer into table
    * nG[17]=0 - Non global, enables matching against ASID in the TLB when set.
    * S[16]=0 - Indicates normal memory is shared when set.
    * AP2[15]=0
    * TEX[14:12]=000
    * AP[11:10]=11 - Configure for full read/write access in all modes
    * IMPP[9]=0 - Ignored
    * Domain[8:5]=1111 - Set all pages to use domain 15
    * XN[4]=0 - Execute never disabled
    * CB[3:2]= 00 - Set attributes to Strongly-ordered memory.
    * (except for the descriptor where code segment is based,
    * see below)
    * Bits[1:0]=10 - Indicate entry is a 1MB section
    *******************************************************************************/