I noticed that the server can become unresponsive for some time, usually 1-2 minutes, when under high load, e.g. when I repeatedly press F5 in the browser. The client keeps retransmitting packets for a while, but the server does not acknowledge them (monitored with Wireshark).
The HTTP server then reports some connection failures in its nx_http_server_connection_failures property. After waiting for said time the server is responsive again.
The strange thing is that I have only observed this behaviour on three notebooks running Windows 10/Firefox. On a Linux PC/Firefox I cannot reproduce it and can keep hitting F5 without disturbing the communication. Two notebooks are from our company and one is from a university, so it seems unlikely that this is caused by specific settings on these devices. On the Linux PC the outdated connections get cancelled by an RST sent to the server, which I cannot observe on the Windows devices under high load.
Is there anything I can do about this? Maybe decrease the chance of this happening, or decrease the recovery time? I already tried to play with some of the webserver settings, e.g. timeouts, max connections in queue, ... without too much success.
Below are the webserver statistics after some testing on both devices:
"connections pending": 0,
"allocation errors": 0,
"connection failures": 333,
"connection successes": 2671,
"get requests": 2661,
"Invalid HTTP headers": 6,
"total bytes received": 0,
"total bytes sent": 0,
"unknown requests": 0
"invalid packets": 0,
"invalid transmit packets": 0,
"packets forwarded": 0,
"packets reassembled": 0,
"raw packet suspended count": 0,
"raw received packet count": 0,
"reassembly failures": 0,
"receive checksum errors": 0,
"receive packets dropped": 131,
"send packets dropped": 60,
"successful fragment requests": 0,
"TCP active connections": 6,
"TCP bytes received": 1571584,
"TCP bytes sent": 8172739,
"TCP checksum errors": 0,
"TCP connections": 3043,
"TCP connections dropped": 11,
"TCP created sockets count": 1,
"TCP disconnections": 2983,
"TCP created sockets count": 1,
"TCP invalid packets": 0,
"TCP packets received": 3246,
"TCP packets sent": 6834,
"TCP passive connections": 3048,
"TCP receive packets dropped": 58,
"TCP received packets count": 0,
"TCP resets received": 497,
"TCP resets sent": 9,
"TCP retransmit packets": 30
}
HTTP Server configuration:
IP configuration:
Shared packet pool between IP, HTTP Server and NetX BSD Support configuration:
In reply to ChrisS:
Attached are four Wireshark dumps, two with Firefox and two with Chrome.
bootloader_success_chrome.pcapng and bootloader_success_firefox.pcapng each show a single load of the bootloader page, which works as expected.
In bootloader_refresh_problems_firefox.pcapng I repeatedly refresh the page by holding F5 until I get packet retransmissions. After waiting some time the webserver works again as expected. In bootloader_success_chrome.pcapng I tried to do the same, but was not able to provoke this behaviour. However, I noticed that with Chrome I need to press the key again for each refresh, while in Firefox I can hold it down to refresh continuously, so maybe the frequency was too low here. I also see this behaviour with the more complex web page in my firmware, so I don't really think that this is the reason.
This sounds very much like packet pool starvation. If so, he has checked everything but the 'right' places. He needs to check the packet pool statistics (nx_packet_pool_info_get) and adjust his TCP settings (reduce the transmit queue length).
Ideally he should have a separate packet pool for his HTTP server (only used to transmit packets). That way, if that packet pool is depleted, NetX can still in theory receive packets. If it cannot even receive packets, such as ACKs from the browser which would free up packets sitting on the transmit queue waiting to be ACKed or retransmitted, then the system gets stuck.
Warren Miller and I just wrote an App Note for packet pool (and TCP) management. I will look for the link. If packet pool starvation is the problem, it has several useful suggestions for diagnosing and mitigating it.
I think the reason Linux does not jam up the server is that sending RSTs kills the connection, which frees the packets trapped on the transmit queue.
Thanks, Janet
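The starvation mechanism described above can be illustrated with a toy model in C. This is only a sketch: it does not use the NetX API, and `toy_pool`, `toy_alloc`, `toy_release` and `toy_receive_ack` are hypothetical names made up for the illustration. It assumes that every queued transmit packet holds one buffer until it is ACKed, and that receiving an ACK needs one free buffer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy fixed-size packet pool; NOT the NetX NX_PACKET_POOL type. */
typedef struct {
    size_t total;       /* number of packets in the pool */
    size_t free_count;  /* packets currently available */
} toy_pool;

/* Take one packet from the pool; returns false if the pool is empty. */
static bool toy_alloc(toy_pool *p)
{
    if (p->free_count == 0) {
        return false;
    }
    p->free_count--;
    return true;
}

/* Return one packet to the pool. */
static void toy_release(toy_pool *p)
{
    if (p->free_count < p->total) {
        p->free_count++;
    }
}

/*
 * Queue up to `tx_queued` unacknowledged transmit packets from tx_pool,
 * then try to receive one ACK using a buffer from rx_pool.
 * If the ACK can be received, one queued transmit packet is freed.
 * Returns true if the ACK could be received, false if the system is stuck.
 */
static bool toy_receive_ack(toy_pool *tx_pool, toy_pool *rx_pool, size_t tx_queued)
{
    size_t held = 0;

    for (size_t i = 0; i < tx_queued; i++) {
        if (toy_alloc(tx_pool)) {
            held++;
        }
    }

    if (!toy_alloc(rx_pool)) {
        return false;           /* no buffer left for the incoming ACK */
    }
    toy_release(rx_pool);       /* ACK processed, receive buffer returned */

    if (held > 0) {
        toy_release(tx_pool);   /* the acknowledged packet leaves the queue */
    }
    return true;
}
```

If the same `toy_pool` is passed as both `tx_pool` and `rx_pool` and the transmit queue is as long as the pool, the ACK can no longer be received. With two separate pools of the same size the ACK gets through and one transmit packet is freed.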
In reply to JanetC:
In reply to WarrenM:
I now use a separate pool for the HTTP server again. My observations so far:
When I limit "Maximum number of connections in queue" to values below "Maximum number of queued transmit packets (units)" and the HTTP server packet pool size, the problem where the server becomes completely unresponsive for some time after holding down F5 can be fixed, but the performance of the server degrades. In particular, I tried "Maximum number of connections in queue" = 2 with both "Maximum number of queued transmit packets (units)" and the HTTP server packet pool size = 20. When I do the opposite, the communication breaks down for obvious reasons. I've tried various combinations of these values between 5 and 20 so far. Only the attempt with "Maximum number of connections in queue" = 2 completely prevented the problem from happening, although with the other combinations I was able to reduce the time during which the server wouldn't respond. I guess I'll have to try some more to find an optimum here.
I initially thought that the reason why this only happens on Firefox might be that it doesn't appear to respect the caching header I transmit ("Cache-Control: public, max-age=31536000") when pressing F5, while Chrome does. Firefox only uses its cache when I click in the URL bar and press return. Using the cache results in fewer packets being transmitted, which prevents this issue.
However, even when I disable the cache in Chrome I cannot recreate this behaviour, so it must relate to how the browsers handle their connections when refreshing.
Unfortunately this issue still isn't solved with Firefox. I actually moved the packet pools to an external SDRAM to be able to use very large pools (200 packets IP pool, 100 packets HTTP pool), but this does not fix the problem. The pools are not getting depleted; the HTTP one only uses 1 or 2 packets in a normal use case.
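A statement like "the pools are not getting depleted" can be backed by low/high watermarks of the free packet count. Below is a minimal sketch in C of such a tracker. The names `pool_watermarks`, `watermarks_init` and `watermarks_update` are hypothetical. In real firmware the `available_now` value would presumably come from the free packet count reported by `nx_packet_pool_info_get`, sampled periodically; that wiring is an assumption and is not shown here.

```c
#include <stdint.h>

/* Lowest and highest number of free packets seen since init. */
typedef struct {
    uint32_t min_available;
    uint32_t max_available;
} pool_watermarks;

/* Reset the tracker so that the first sample sets both watermarks. */
static void watermarks_init(pool_watermarks *w)
{
    w->min_available = UINT32_MAX;
    w->max_available = 0;
}

/* Feed one sample of the currently free packet count. */
static void watermarks_update(pool_watermarks *w, uint32_t available_now)
{
    if (available_now < w->min_available) {
        w->min_available = available_now;
    }
    if (available_now > w->max_available) {
        w->max_available = available_now;
    }
}
```

A minimum that stays well above zero under load suggests the pool itself is not the bottleneck. Note that sampling can miss short dips between two samples, so the pool's own empty-request counter is the more reliable indicator of real depletion.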
In a typical case I see that SYN and SYN+ACK are sent, but the ACK from the browser does not show up in Wireshark, so the device keeps retransmitting SYN+ACK to the PC "Maximum number of retries per packet" times. In other cases I see retransmissions of SYN from the browser. After waiting for some time the device usually becomes responsive again, but I've also seen cases where it didn't recover.
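To get a feeling for how long one such half-open connection can keep the device busy, the total retry time can be estimated from the retransmit settings. The sketch below is in C and rests on assumptions: it models a generic exponential backoff in which the first timeout is `initial_ticks` and every further retry multiplies the timeout by 2^`shift`. Whether the stack uses exactly this rule, and which values apply, must be checked against the actual TCP socket configuration. `total_retry_ticks` is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Total time in timer ticks until a packet is given up on, assuming:
 *   - the first timeout is initial_ticks,
 *   - each retry multiplies the timeout by 2^shift (shift = 0: constant),
 *   - the packet is retransmitted max_retries times before giving up.
 * The caller must keep shift * max_retries below 32 to avoid overflow.
 */
static uint32_t total_retry_ticks(uint32_t initial_ticks,
                                  uint32_t shift,
                                  uint32_t max_retries)
{
    uint32_t total = 0;

    for (uint32_t attempt = 0; attempt <= max_retries; attempt++) {
        total += initial_ticks << (shift * attempt);
    }
    return total;
}
```

With hypothetical values of a 100-tick initial timeout, no backoff and 10 retries, this gives 1100 ticks. With a shift of 1 and only 3 retries it already gives 1500 ticks. Lowering the retry count or the initial timeout shortens the time a dead connection occupies a slot, at the cost of giving up earlier on slow clients.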
For the packet pools I monitor the following usage:
"min ip packet buffer available": 149,
"max ip packet buffer available": 192,
"min http packet buffer available": 99,
"max http packet buffer available": 100,
Right now I'm trying the following settings in HTTP server:
As you can see I tried very large values of max connections and max number of queued transmit packets to support a larger number of connections. The browser makes up to 16 connections for all resources of the web page, not all at the same time though.
The problem seems to occur randomly but frequently. Sometimes Firefox works and I can refresh without problems, but at other times it fails on the first attempt. Refreshing three times in quick succession also triggers the problem reproducibly.
Is it possible that the PC/device gets confused because the device switched IPs earlier (because of DHCP and possibly a fallback to a static IP)? I don't really think so, as the packets go to the correct IPs, but who knows...
Any other ideas?