Monday, October 23, 2006

Cellular-IP Friendly Apps - Socket Open

Most applications attempt to open a TCP socket using the OS/Windows default timeout. This results in an unpredictable timeout. I looked through Microsoft's VS.NET documentation looking for the "How long?" answer ... and never found an answer. I suspect it depends on your version and service-pack levels. I did a web search to discover the truth and found people claiming Windows timed out in 2 seconds, 5 seconds, 10 seconds, 20 seconds, 20-30 seconds, and even one claiming 5 minutes. Sadly, most of these people were looking for a way to force Windows to use a connection timeout of 1 second or less - which will prevent their applications from working on normal wide-area networks.

Such short connection attempts are not suitable for cellular network where the first response packet from an idle remote tends to complete in 3-4 seconds during average conditions. Therefore even a 5 second timeout is too close to the norm to be suitable.

Recommendation: all applications must use an explicit, predictable timeout during a TCP socket open request. This value can be user-settable higher or lower, but for cellular should default to 20 seconds and be settable to at least 60 seconds for satellite.

Impact: On Ethernet this should have limited direct consequences since the timeout only has affect if the remote is not available. If having your application wait 20 seconds for an inaccessable remote is a problem, then enable a user setting to select either "local-area-network" or "wide-area-network" mode and adjust the default connection timeout as appropriate.

In a best case scenario, failing to wait long enough to open a TCP socket when the network is sluggish could prevent connecting for many minutes as sockets succeed to open, but the OS aborts the open before the successful response can come back from the remote. Keep in mind that over cellular the end user is paying for at least 120 bytes of data for every open attempt, and that TCP retransmissions likely make this 160 or 200 bytes.

In a worst case scenario, this aborting of opened sockets on a remote with limited resources risks tying up all resources with past failed opens. Remember, just because your OS timed out the open does not mean the remote device didn't allocate the connection resource and send a successful response. The lack of resources blocks new attempts by the application to reconnect until TCP keepalive or some other mechanism detects the broken sockets and frees up the resources.

Be warned that under cellular - as if in defiance of traditional faith in the reliability of the TCP state machine - TCP sockets break in rare occasions in ways that common OS will fail to detect! During cellular network hiccups, I have seen machines "hang" for 11 hours waiting for a TCP Acknowledgement that never comes! This is with TCP Keepalive enabled for 5 minutes even.

Some Visual Studio discussion: Just out of curiosity, I did some snooping around inside the Visual Studio .NET documentation. I didn't find a good answer, so cannot explain how to solve this problem.

Here is example VB.NET code to opean a TCP socket:
  • Dim tcpClient As New TcpClient
  • Dim ipAddress As IPAddress = dns.GetHostEntry("www.digi.com").AddressList(0)
  • TcpClient.Connect(ipAddress, 11003)
Notice we cannot ask the OS to wait "longer" or "shorter". The documentation says "The Connect method will block until it either connects or fails." A connection timeout results in a SocketException failure being thrown. The TcpClient class has ReceiveTimeout and SendTimeout properties, but these only relate to reads and writes on the connection and have no impact on the initial connection open. Suggestions to use the System.Net.Sockets.Socket class instead aren't helpful since this class also doesn't offer a simple connection timeout mechanism.

The only wait to define a predictable TCP socket connection timeout appears to be use an asynchronous design with BeginConnect and some form of external timer to call EndConnect at the desired timeout.

To rephrase myself, I am not saying the default Windows connection timeout is incorrect - I am saying evidence is that you cannot predict what timeout your customer will see if you don't explicitly define one. So while your application running on your computer may default to a nice 20 seconds timeout, what happens if your customer runs the same application on an older computer and sees a 3 second or 5 second timeout? The answer is they won't be able to reliably connect to cellular or satellite remote IPs, and either won't buy your product again or will call Tech Support.

No comments: