Traditionally, programmers have assumed that a TCP packet will either reach the remote peer error-free or the TCP socket will be detected as failed. This has proven to be a disastrous assumption in the world of cellular networks.
Cellular networks seem to suffer a kind of burst-error mode where whole groups of TCP packets get lost or delayed while other groups make it through. This appears to confuse the TCP state machines within the OS, which are optimized for the rarer, single-packet losses typical of Ethernet. We have Ethereal traces where one can see the application send a TCP packet, the OS retry once, and a collection of old, stale TCP acknowledgements return from the remote - then nothing. Eleven hours later there have been no more TCP retries, no TCP keepalives, no response from the remote, and no error from the TCP stack to abort the application's blocked call. The host application is still hung, waiting for either a response or a socket failure, neither of which ever comes.
So is this a bug in the OS? Does it matter? It is your application and "our" customer who pay the price. For example, Digi had to go through our RealPort driver and literally add an OS timer to abort every TCP socket call that did not return within 60 seconds. Yes, this sounds like a royal pain, but it was the only way to avoid this failure, which otherwise occurred every few weeks when running across cellular IP.
Recommendation: applications must NEVER block on a socket waiting for a response or a socket failure. Applications must always use an OS or external timer to abort socket functions that take longer than one minute. Sadly, running in non-blocking mode is NOT enough, since at times it is the API call itself that fails to return regardless of the blocking setting. So even API calls with explicit timeouts are not safe.
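To make the recommendation concrete, here is a minimal sketch in Python of the kind of external watchdog described above. The function name recv_with_watchdog, the 60-second value, and the example host are illustrative assumptions, not Digi's actual RealPort code. The idea is that a separate timer forcibly shuts the socket down if a blocking call has not returned in time, rather than trusting the socket's own timeout handling.

```python
import socket
import threading

ABORT_TIMEOUT = 60  # seconds -- illustrative value matching the 60-second rule above

def recv_with_watchdog(sock, nbytes, timeout=ABORT_TIMEOUT):
    """Receive from a socket, but let an external timer abort the call.

    A sketch only: socket.settimeout() is not trusted here, so a separate
    threading.Timer forcibly shuts the socket down if the blocking call has
    not returned in time, which unblocks the pending recv() on most platforms.
    """
    timed_out = threading.Event()

    def abort():
        timed_out.set()
        try:
            sock.shutdown(socket.SHUT_RDWR)   # wakes up the blocked recv()
        except OSError:
            pass                              # socket may already be closed or failed

    watchdog = threading.Timer(timeout, abort)
    watchdog.start()
    try:
        data = sock.recv(nbytes)   # may otherwise block indefinitely, as described above
    except OSError:
        data = b""                 # some platforms raise instead of returning EOF
    finally:
        watchdog.cancel()          # stop the timer if recv() returned in time

    if timed_out.is_set():
        raise TimeoutError("socket call aborted by external watchdog")
    return data


if __name__ == "__main__":
    # Hypothetical usage against a public host, purely for illustration.
    s = socket.create_connection(("example.com", 80))
    s.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    print(recv_with_watchdog(s, 4096))
    s.close()
```

The design point is that the watchdog lives outside the socket API entirely, so it still fires even when the socket call itself wedges and never returns.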