ux_network_driver/NetX crash because of GCC optimization

Hi,

I ran into an issue when I tried to create my own project based on the example given in this forum thread:

https://renesasrulz.com/synergy/f/synergy---forum/14365/usbx-rndis-initialization

The code is here:

https://renesasrulz.com/cfs-file/__key/communityserver-discussions-components-files/206/SSP_5F00_1_5F00_6_5F00_0_5F00_S7G2_5F00_SK_5F00_RNDIS_5F00_Device.zip

The issue is that soon after start the MCU ends up in exception handler loop:

void Default_Handler (void)
{
    /** A error has occurred. The user will need to investigate the cause. Common problems are stack corruption
     *  or use of an invalid pointer.
     */
    BSP_CFG_HANDLE_UNRECOVERABLE_ERROR(0);
}

The exception seems to be raised in the internal IP instance thread on return from the _ux_network_driver_entry() function.

The function seems to return after processing the NX_LINK_DEFERRED_PROCESSING command.

At this moment the main thread ("New Thread" in the example) is suspended in call: g_status = nx_ip_status_check(&g_ip0,  NX_IP_LINK_ENABLED, &actual_status, TX_WAIT_FOREVER);

I found that this happens only if the GCC optimization level is -O0, -O2 or -O3.

If the optimization level is -Og, -O1 or -Os, there is no exception and both the original example and my project work OK.

Increasing the IP instance thread stack size does not solve the issue.

To me it looks like the function return address on the stack is being corrupted by a wrong access to a variable located on the stack. The optimization level can affect the layout of the stack frames, so at some levels the corruption may hit some other data instead of the return address, and in that case it may go unnoticed.

The original project is set to -Og (working), while e2studio uses -O2 by default, which produces non-working code.

It can also be just a compiler bug.

I hope the Express Logic folks take a look at this and eventually fix the issue.

I am using e2 v7.3.0 and SSP1.6.

As a next step I am going to experiment with the IAR compiler.

While I can make my project work by selecting a suitable optimization level, I worry about the software reliability and possible side effects of the suspected bug.

BR,

Mikhail

  • Hello, mevst.

    > It can also be just a compiler bug.

    I don't think it's a bug; it just needs more consideration.

    When the optimization level is raised, the --gc-sections option is added, which removes code and data that are not statically referenced by other code in order to shrink the image.
    I suppose some sections were removed by that.
    Please compare the map files produced at each -O level to check which sections have been removed.
    You may need to add the KEEP keyword in the linker script to prevent the sections you really need from being deleted.

    Functions or symbols that are referenced only via a table or pointer, and not statically linked, could be discarded too.
    Declaring them with __attribute__((used)) would help in such cases.
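
For readers unfamiliar with the attribute Okra mentions, here is a minimal sketch in GCC syntax. The handler, table and dispatcher names are hypothetical and are not taken from the SSP or the example project. As far as I know, the attribute only stops the compiler from dropping a symbol that looks unreferenced; a section removed by linker garbage collection would still need KEEP in the linker script, as Okra says.

```c
#include <stddef.h>

/* Hypothetical handler type, for illustration only. */
typedef int (*handler_fn)(int);

/* 'used' asks GCC to emit this static function even if it cannot see
   a direct reference to it in this translation unit. */
__attribute__((used)) static int sketch_event_handler(int event)
{
    return event + 1;
}

/* The table is marked the same way. In this small example the dispatcher
   below keeps both symbols alive anyway; the attribute matters when the
   only reference comes from assembly or a linker-placed table. */
__attribute__((used)) static const handler_fn sketch_handler_table[] = {
    sketch_event_handler
};

/* Call a handler through the table; returns -1 for an invalid index. */
int sketch_dispatch(size_t index, int event)
{
    size_t count = sizeof sketch_handler_table / sizeof sketch_handler_table[0];
    if (index >= count) {
        return -1;
    }
    return sketch_handler_table[index](event);
}
```

Calling sketch_dispatch(0, 41) goes through the table to sketch_event_handler.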

  • In reply to Okra:

    Thanks, Okra, for taking a look at this.

    I could try to study which sections get removed as a result of optimization.
    However, the code does not work if there is no optimization at all (-O0)!
    And it works only with certain optimization levels (e.g. -Og).

    I think that the EL code does something wrong when accessing local variables on the stack and corrupts the function return address.

    BR,
    Mikhail
  • In reply to mevst:

    The issue is not caused by a compiler fault.

     

    In the NetX IP helper thread, a local structure is allocated on the stack and used when making calls to the underlying network driver:

     

    NX_IP_DRIVER driver_request;

     

    Because the example project has two network interfaces (Ethernet and USB RNDIS), there are deferred driver events from the Ethernet driver, so the driver deferred event handling occurs in the IP helper thread.

    When this deferred driver event handling calls the USB network driver (index = 1), with certain compiler optimisations some fields of the driver_request structure hold uninitialised values that cause problems (the element that causes the issue is nx_ip_driver_return_ptr, which holds 0xefefefef).

    Then, in the USB network driver, since it doesn't handle deferred processing, the code that is executed tries to write the value 0 to the address 0xefefefef (the value held in nx_ip_driver->nx_ip_driver_return_ptr). This causes an imprecise bus fault, so the default handler is called.

    If the driver deferred event handling first cleared the NX_IP_DRIVER driver_request structure to 0:

    memset(&driver_request, 0, sizeof(NX_IP_DRIVER));

    then the problem would not occur.

    The other way around this issue (probably, I have not actually tested this) is to only use a single network interface, the USB network driver. Then there would not be deferred driver events from the Ethernet driver.
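
The caller-side fix Jeremy describes can be sketched outside NetX as follows. This is only an illustration under stated assumptions: mock_driver_request is a simplified stand-in for NX_IP_DRIVER (the real structure has more members), and the command value and function names are hypothetical. The 0xEF fill imitates stale stack content; ThreadX, as far as I know, pre-fills thread stacks with that byte pattern, which would explain the 0xefefefef value.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in for NX_IP_DRIVER. */
typedef struct {
    unsigned int   command;
    unsigned int   status;
    unsigned long *return_ptr;   /* models nx_ip_driver_return_ptr */
} mock_driver_request;

/* Hypothetical command code; not the real NetX value. */
enum { MOCK_CMD_DEFERRED_PROCESSING = 1 };

/* Imitate a request that lives in stack memory nobody has cleared:
   every byte, including the pointer, holds the 0xEF pattern. */
void mock_poison_request(mock_driver_request *req)
{
    memset(req, 0xEF, sizeof *req);
}

/* Caller-side fix: zero the whole structure, then set only the
   members that the command actually uses. */
void mock_prepare_deferred_request(mock_driver_request *req)
{
    memset(req, 0, sizeof *req);
    req->command = MOCK_CMD_DEFERRED_PROCESSING;
}
```

After mock_prepare_deferred_request() the pointer member no longer holds the stale pattern, so a driver that checks it before writing has something meaningful to check.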

  • In reply to Jeremy:

    Jeremy, thanks for the deep study and solutions proposal!

    So it looks like a classic bug of relying on the content of uninitialized variables.

    A few comments about the possible solutions:

    "If the driver deferred event handling first cleared the structure NX_IP_DRIVER driver_request; to 0"

    Is it so that the deferred processing requests are created by NetX code? Then only EL is able to apply this solution. Did I understand it incorrectly?

    "is to only use a single network interface"

    I already tried this and so far it works. I mean there is no exception at startup with any optimization level. However, the product that I am developing should support both Ethernet and RNDIS. Hopefully we can avoid activating both interfaces at the same time. Still, it would be better to remove this limitation.

    BR,

    Mikhail

  • In reply to mevst:

    >Is it so that the deferred processing requests are created by NetX code? Then only EL is able to apply this solution. Did I understand it incorrectly?

    Yes, this is in protected SSP code, so EL will have to fix this.
  • In reply to Jeremy:

    Jeremy and Mikhail

    This situation is handled differently in NetX Duo, where the variable is initialized, and in NetX, where it is not.

    However, it is not exactly a bug in NetX. The NetX User Guide defines which members are used for the NX_LINK_DEFERRED_PROCESSING command, and the driver must not access nx_ip_driver_return_ptr when handling it. So it is really a bug in the USBX driver. However, it is a good suggestion that NetX initialize this variable first. NetX will be fixed accordingly in our next release, and I will let Renesas know about the need to fix the USB driver accordingly.

    Janet
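
The driver-side rule Janet describes can be sketched like this. It is not the USBX source: sketch_request, the command codes and the status codes are all hypothetical stand-ins, chosen only to show a driver entry function that writes through the return pointer for a command that carries one and leaves the pointer alone for deferred processing.

```c
#include <stddef.h>

/* Hypothetical stand-in for the driver request structure. */
typedef struct {
    unsigned int   command;
    unsigned int   status;
    unsigned long *return_ptr;
} sketch_request;

/* Hypothetical command and status codes; not the real NetX values. */
enum { SKETCH_CMD_GET_STATUS = 1, SKETCH_CMD_DEFERRED_PROCESSING = 2 };
enum { SKETCH_STATUS_SUCCESS = 0, SKETCH_STATUS_UNHANDLED = 1 };

void sketch_driver_entry(sketch_request *req)
{
    switch (req->command) {
    case SKETCH_CMD_GET_STATUS:
        /* This command is assumed to carry a return pointer,
           but it is still checked before being dereferenced. */
        if (req->return_ptr != NULL) {
            *req->return_ptr = 1UL;          /* report "link up" */
            req->status = SKETCH_STATUS_SUCCESS;
        } else {
            req->status = SKETCH_STATUS_UNHANDLED;
        }
        break;

    case SKETCH_CMD_DEFERRED_PROCESSING:
        /* return_ptr is not defined for this command, so it is
           never read or written here. */
        req->status = SKETCH_STATUS_SUCCESS;
        break;

    default:
        req->status = SKETCH_STATUS_UNHANDLED;
        break;
    }
}
```

With this shape, the content of return_ptr is irrelevant for the deferred processing command, whether or not the caller cleared the structure first.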