HTTP Server responsiveness

I'm serving an HTML page consisting of static resources compiled into my application (no FileX) and some callbacks that JavaScript on the page calls to fill in dynamic values.

I noticed that the server can become unresponsive for some time, usually 1-2 minutes, when under high load, e.g. when I repeatedly press F5. During that time the client keeps retransmitting packets that the server does not acknowledge (monitored with Wireshark).

The HTTP server then reports some connection failures in its nx_http_server_connection_failures property. After waiting for said time the server is responsive again.

The strange thing is that I have only observed this behaviour on three notebooks running Windows 10/Firefox. On a Linux PC with Firefox I cannot reproduce it and can keep hitting F5 without disturbing the communication. Two of the notebooks are from our company and one is from a university, so it seems unlikely that this is caused by specific settings on these devices. On the Linux PC the outdated connections get cancelled by an RST sent to the server, which I cannot observe on the Windows devices under high load.

Is there anything I can do about this, such as decreasing the chance of it happening or shortening the recovery time? I have already tried changing some of the web server settings (timeouts, maximum connections in queue, ...) without much success.

Below are the web server statistics after some testing on both devices:

{
    "server": {
        "connections pending": 0,
        "allocation errors": 0,
        "connection failures": 333,
        "connection successes": 2671,
        "get requests": 2661,
        "Invalid HTTP headers": 6,
        "total bytes received": 0,
        "total bytes sent": 0,
        "unknown requests": 0
    },
    "ip": {
        "invalid packets": 0,
        "invalid transmit packets": 0,
        "packets forwarded": 0,
        "packets reassembled": 0,
        "raw packet suspended count": 0,
        "raw received packet count": 0,
        "reassembly failures": 0,
        "receive checksum errors": 0,
        "receive packets dropped": 131,
        "send packets dropped": 60,
        "successful fragment requests": 0,
        "TCP active connections": 6,
        "TCP bytes received": 1571584,
        "TCP bytes sent": 8172739,
        "TCP checksum errors": 0,
        "TCP connections": 3043,
        "TCP connections dropped": 11,
        "TCP created sockets count": 1,
        "TCP disconnections": 2983,
        "TCP created sockets count": 1,
        "TCP invalid packets": 0,
        "TCP packets received": 3246,
        "TCP packets sent": 6834,
        "TCP passive connections": 3048,
        "TCP receive packets dropped": 58,
        "TCP received packets count": 0,
        "TCP resets received": 497,
        "TCP resets sent": 9,
        "TCP retransmit packets": 30
    }
}

HTTP Server configuration:

IP configuration:

Shared packet pool between IP, HTTP Server and NetX BSD Support configuration:
  • In reply to ChrisS:

    By the way, I'm also seeing this issue in my bootloader, which serves only a very simple single-request page for uploading new firmware. There I need to refresh a few more times in a row to trigger it, so it seems to be somewhat related to web page complexity or the number of requests made.
  • In reply to ChrisS:

    Hello,

    attached are four Wireshark dumps, two with Firefox, two with Chrome.

    bootloader_success_chrome.pcapng and bootloader_success_firefox.pcapng show the loading of the bootloader page once which works as expected.

    In bootloader_refresh_problems_firefox.pcapng I repeatedly refresh the page by holding F5 until I get packet retransmissions. After waiting some time the web server works again as expected. In bootloader_success_chrome.pcapng I tried to do the same, but was not able to provoke this behaviour. However, I noticed that with Chrome I need to press the key again for each refresh, while in Firefox I can hold it down to refresh continuously, so maybe the refresh frequency was too low here. I also see this behaviour with the more complex web page in my firmware, so I don't really think that this is the reason.

    traces.zip

  • In reply to ChrisS:

    This sounds very much like packet pool starvation. If so, he has checked everything but the 'right' places. He needs to check the packet pool statistics (nx_packet_pool_info_get) and adjust his TCP settings (reduce the transmit queue length). Ideally he should have a separate packet pool for his HTTP server, used only to transmit packets. That way, if that pool is depleted, NetX can in theory still receive packets. If it cannot even receive packets, such as the ACKs from the browser that would free up packets sitting on the transmit queue waiting to be ACKed or retransmitted, then the system gets stuck.

    Warren Miller and I just wrote an App Note for packet pool (and TCP) management. I will look for the link. If packet pool starvation is the problem, there are several useful suggestions for diagnosing it and mitigating it.

    I think the reason Linux does not jam up the server is that sending RSTs kills the connection, which frees the packets trapped on the transmit queue.
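
The starvation argument can be illustrated with a toy model. This is a sketch in Python, not NetX code: the `PacketPool` class and `segments_acked` function are hypothetical, and the model assumes that every sent segment holds a packet until its ACK is processed and that receiving an ACK briefly needs one free packet for the incoming frame.

```python
class PacketPool:
    """Fixed-size packet pool: allocation fails once every packet is in use."""

    def __init__(self, size):
        self.size = size
        self.free = size

    def allocate(self):
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1


def segments_acked(tx_pool, rx_pool, segments_to_send):
    """Count segments acknowledged before the transfer finishes or stalls.

    A sent segment holds a tx_pool packet until its ACK is processed.
    Receiving an ACK briefly needs one rx_pool packet for the incoming frame.
    """
    sent = acked = 0
    while acked < segments_to_send:
        progressed = False
        # Queue as many segments as the transmit pool allows.
        while sent < segments_to_send and tx_pool.allocate():
            sent += 1
            progressed = True
        # Try to receive one ACK for an outstanding segment.
        if sent > acked and rx_pool.allocate():
            rx_pool.release()  # receive buffer returned after processing
            tx_pool.release()  # acknowledged segment leaves the transmit queue
            acked += 1
            progressed = True
        if not progressed:
            break  # nothing can be sent and no ACK can be received: stuck
    return acked


shared = PacketPool(4)
print(segments_acked(shared, shared, 10))                # one shared pool
print(segments_acked(PacketPool(4), PacketPool(1), 10))  # separate pools
```

With one shared pool of 4 packets the transfer stalls with 0 segments acknowledged, because the transmit queue holds every packet and no ACK can be received. With a separate receive pool, even one of a single packet, all 10 segments are acknowledged.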

    Thanks,
    Janet

  • In reply to JanetC:

    Chris, never mind, I just downloaded the trace file.
  • In reply to JanetC:

    Here is the Knowledge Base article referenced:
    en-support.renesas.com/.../18188850

    Warren
  • In reply to WarrenM:

    Thanks Warren. Chris, please send this link to your customer. I think it will be very helpful to them. If not, I would want to look at his project to see what might be causing the problem.

    Janet
  • In reply to JanetC:

    I now use a separate pool for the HTTP server again.
    My observations so far:
    When I limit "Maximum number of connections in queue" to a value below both "Maximum number of queued transmit packets (units)" and the HTTP server packet pool size, the server no longer becomes completely unresponsive after I hold down F5, but its performance degrades. In particular, I tried "Maximum number of connections in queue" = 2 with both "Maximum number of queued transmit packets (units)" and the HTTP server packet pool size set to 20. When I do the opposite, the communication breaks down for obvious reasons. I have tried various combinations of these values between 5 and 20 so far, and only the attempt with "Maximum number of connections in queue" = 2 completely prevented the problem, although I was able to reduce the time during which the server wouldn't respond. I guess I'll have to try some more to find an optimum here.

    I initially thought that this only happens on Firefox because, unlike Chrome, it doesn't appear to respect the caching header I transmit ("Cache-Control: public, max-age=31536000") when pressing F5. Firefox only uses its cache when I click in the URL bar and press Return. Using the cache results in fewer packets being transmitted, which prevents this issue.

    However, even when I disable the cache in Chrome I cannot recreate this behaviour, so it must relate to how the browsers handle their connections when refreshing.

  • In reply to ChrisS:

    Unfortunately this issue still isn't solved with Firefox. I moved the packet pools to an external SDRAM to be able to use very large pools (200 packets in the IP pool, 100 packets in the HTTP pool), but it does not help. The pools are not getting depleted; the HTTP one only uses 1 or 2 packets in a normal use case.

    In a typical case I see that the SYN and SYN+ACK are sent but the ACK from the browser does not appear in Wireshark, so the device keeps retransmitting the SYN+ACK to the PC up to "Maximum number of retries per packet" times. In other cases I see retransmissions of the SYN from the browser. After waiting for some time the device usually becomes responsive again, but I've also seen cases where it didn't recover.

    For the packet pools I monitor the following usage:

       "min ip packet buffer available": 149,
       "max ip packet buffer available": 192,
       "min http packet buffer available": 99,
       "max http packet buffer available": 100,

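The claim that the pools are not depleted follows from simple arithmetic on the reported figures: peak usage is the pool size minus the minimum number of packets ever available. A minimal sketch in Python, using the pool sizes (200 and 100) and minimum-available counts (149 and 99) from this post; the helper name `peak_in_use` is made up for illustration.

```python
def peak_in_use(pool_size, min_available):
    """Largest number of packets simultaneously taken from the pool."""
    return pool_size - min_available


# (pool size, minimum packets available) as reported in the post
pools = {"ip": (200, 149), "http": (100, 99)}

for name, (size, min_avail) in pools.items():
    peak = peak_in_use(size, min_avail)
    print(f"{name}: peak {peak} of {size} packets ({100 * peak / size:.1f}%)")
```

This gives a peak of 51 of 200 packets for the IP pool and 1 of 100 for the HTTP pool, so neither pool comes close to running empty.
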
    Right now I'm trying the following settings in HTTP server:

    As you can see I tried very large values of max connections and max number of queued transmit packets to support a larger number of connections. The browser makes up to 16 connections for all resources of the web page, not all at the same time though.

    The problem seems to occur randomly but frequently. Sometimes Firefox works and I can refresh without problems, but at other times it fails on the first attempt. Refreshing three times in quick succession also reproduces the problem reliably.

    Is it possible that the PC or the device gets confused because the device switched IP addresses earlier (because of DHCP and possibly a fallback to a static IP)? I don't really think so, as the packets go to the correct IPs, but who knows...

    Any other ideas?

  • In reply to ChrisS:

    Chris,

    Looking at bootloader_refresh_problems_firefox.pcapng, it looks like the HTTP server is doing what it should do. It resends the SYN ACK twice, roughly 4 seconds apart (not sure why it isn't 2 seconds apart), and then moves on to the next connection request in its queue.

    The problem with this is that the Server can start falling behind on SYN requests. By the time it does the retries 5 seconds apart, the Client has dropped the connection request. Each time it goes through the slow retry process, the HTTP server falls further behind on the connection requests in its queue.

    This problem starts around frame 673. The HTTP server responds within 2 msec to the SYN request on port 64455, but the HTTP Client seems too busy sending out 12 new connection requests over the next 500 msec to respond. In frame 686, the HTTP server retransmits to the 64455 connection request, but by then the Client has long since dropped it. When it then moves on to the next connection request, the Client has already dropped that one on its end as well. Hence the HTTP server falls further and further behind.

    Then, after something like 60 seconds has elapsed, the HTTP Client sends another connection request on port 50245 in frame 765. This time it retransmits long enough (7 seconds) for the HTTP server to get to that connection request, and the Client and Server are finally back on track.

    In the Chrome trace, the Client does not send a flurry of connection requests without checking for responses like Firefox does, hence the HTTP server is able to keep up.

    Can you retry the Firefox test with the timeout between retries set to 1 (or as low as possible) and the number of retries set to 1 or 2? I realize this might degrade performance in the normal case, but a 60 second stall is as degraded as you can get already. If my take on this is right, this should alleviate the lengthy stalls, at least in part.
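
The falling-behind effect and the suggested change can be sketched with a rough model. This is Python, not NetX code, and it rests on several assumptions: the server handles one connection request at a time, a request the client has already abandoned costs (retries + 1) timeouts before the server moves on, and a live handshake completes instantly. The function name, arrival times and client patience values are made up for illustration and are not taken from the trace.

```python
def drain_queue(requests, retries, retry_timeout):
    """Serve connection requests strictly one at a time, in arrival order.

    requests: list of (arrival_time_s, client_patience_s) tuples.
    Returns (completed_handshakes, time_when_server_is_free_s).
    """
    now = 0.0
    completed = 0
    for arrival, patience in sorted(requests):
        now = max(now, arrival)  # server may be idle until the SYN arrives
        if now - arrival < patience:
            completed += 1  # client still waiting: handshake succeeds at once
        else:
            # Client already gave up: the initial SYN-ACK and every retry
            # each wait one full timeout before the server moves on.
            now += (retries + 1) * retry_timeout
    return completed, now


# Hypothetical burst: one request the client drops immediately, then 12 more
# within the next 0.5 s, each abandoned by the client after 1 s. A later
# request arrives at t = 30 s and is retransmitted by the client for 7 s.
burst = [(0.0, 0.0)] + [(0.04 * i, 1.0) for i in range(1, 13)]
requests = burst + [(30.0, 7.0)]

print(drain_queue(requests, retries=2, retry_timeout=4.0))  # slow retries
print(drain_queue(requests, retries=1, retry_timeout=1.0))  # fast retries
```

With these made-up numbers the slow settings give (0, 168.0): the server is still working through dead requests when the late one arrives, so that one is lost as well. The fast settings give (1, 30.0): the backlog is cleared after 26 s and the late request is served as soon as it arrives.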

    Janet