Friday, February 09, 2007

Cellular IP-Friendly Apps - Response Delays

Back to my series of entries on creating graceful IP apps.

Many newly written Ethernet-enabled applications incorrectly equate "Ethernet = Fast". They overlook that Ethernet is often just a path into other, slower IP-based networks. Worse, some well-meaning programmers set the default response timeout to 250 milliseconds and limit the user configuration to a maximum of 5 seconds. I'd say that so far about 20% of the applications I've had to help customers with limit Ethernet timeouts to 5 seconds or less.

But cellular networks have a high end-to-end latency - especially if the line has been idle for a few minutes. Normal slave response times will be near 2 seconds, with round-trip delays of 10 to 12 seconds being common each day (see my entry on real world Modbus numbers). Interestingly enough, every cellular "expert" I talk to keeps correcting me that cellular latencies are in the 50 to 100msec range and getting better every new "gen". Well, I guess my Saturn ION can do 400 miles-per-hour also ... if you drop it out of an airplane! Regardless of what these "experts" are smoking, my simple tests show otherwise where it really counts ... in actual real world tests run over the Internet to cellular-based IP devices.

Recommendation: IP applications should default to a 3 second response timeout. Applications must allow users to configure this timeout to be lower (perhaps to 250msec) and also higher to at least 60 seconds.
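
A minimal sketch of this recommendation in Python. The names (resolve_timeout and the three constants) are hypothetical and only for illustration. The 60 second ceiling is the minimum the recommendation asks for, so an application could reasonably allow more.

```python
# Hypothetical names; the values come from the recommendation above.
DEFAULT_TIMEOUT_S = 3.0   # default response timeout
MIN_TIMEOUT_S = 0.25      # lowest value a user may configure (250 msec)
MAX_TIMEOUT_S = 60.0      # assumed ceiling; "at least 60 seconds" permits more


def resolve_timeout(user_value=None):
    """Return the response timeout in seconds.

    Falls back to the 3 second default when the user has not configured
    anything, and rejects values outside the allowed range instead of
    silently changing what the user asked for.
    """
    if user_value is None:
        return DEFAULT_TIMEOUT_S
    seconds = float(user_value)
    if not (MIN_TIMEOUT_S <= seconds <= MAX_TIMEOUT_S):
        raise ValueError(
            f"timeout must be between {MIN_TIMEOUT_S} and {MAX_TIMEOUT_S} seconds"
        )
    return seconds
```

A cellular user would call resolve_timeout(15), while a user on a fast local Ethernet could still ask for resolve_timeout(0.25).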

Impact: On Ethernet this should have no direct consequences, since the timeout only has an effect if the remote is no longer available - in which case the remote is going 'offline' anyway. The minority of users who really want a 250 millisecond timeout can set it manually, while cellular users who want a more reasonable timeout of 15 seconds can set that as well.

For cellular networks, the real problem with a premature timeout is that the customer has already paid for the request and very likely will also pay for the response - even if the response comes after the application gave up on it and did a request retry. Assuming the user is polling the remote at a moderate pace to control costs, there is no harm in waiting longer for the response to maximize the value of the traffic paid for.

Another simple example is an application that sends a request, then times out twice and retries twice. How will the application react when it receives three responses at the same time? Remember, the first two requests probably were not lost; they still likely reached the remote device and created responses. Their responses may have just been delayed longer than expected. Since serial Modbus doesn't include enough information in a response to match it up to a request, this can cause serious misoperation of the system. Protocols that include a sequence number should handle this more gracefully, but it will still be a waste of money.

We have also seen protocols which treat unexpected responses as a reason to abort and reset the communication channel, which further adds to cost. For example, we had one super headache with a big-name seller of "energy curtailment" systems. The end user insisted a 5 second timeout was the maximum they could tolerate (i.e. wishful thinking - set a 5 second timeout regardless of reality). So let's just see what happens when we hit one of the rare but expected latencies over 10 seconds.
  1. SCADA software sends out request sequence 74
  2. 5 seconds later, SCADA times out 74 and sends out 75
  3. 5 seconds later, SCADA times out 75 and sends out 76
  4. 1 second later - since TCP/IP is reliable - all three responses return
  5. SCADA is expecting response 76, but sees 74 ... Oh, big problem ... need to reset comm subsystem
  6. SCADA sends reset to remote RTU, expects response 1 but ... da da ... sees response 75 since they never flushed the old info and TCP/IP is reliable.
  7. SCADA sends a 2nd reset to remote RTU, expects response 1 but sees response 76 since they never flushed the old info and TCP/IP is reliable
  8. At this point, I hope you see that there are still 2 responses to the comms reset in the receive queue!

Anyway, whenever this reset "temper-tantrum" occurred it would take 10 to 15 minutes to get the connection back up. Of course one problem was the stupid customer unwilling to set the correct timeout, but the SCADA software was defective since it wasn't smart enough to just discard old responses with timed out sequence numbers. In the above example, life would have been fine and dandy had the SCADA system just discarded responses 74 and 75 since it expected 76.
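
The fix described above can be sketched in a few lines of Python. This is not the vendor's code; the SequenceTracker class and its method names are hypothetical, and the sketch assumes only one request is outstanding at a time and that each response carries the sequence number of its request.

```python
class SequenceTracker:
    """Remembers which sequence numbers timed out so that their late
    responses can be dropped quietly instead of triggering a reset."""

    def __init__(self):
        self.expected = None     # sequence number we are waiting on
        self.timed_out = set()   # sequence numbers we already gave up on

    def send(self, seq):
        # Sending a new request while one is still pending means the
        # pending one has timed out (one outstanding request at a time).
        if self.expected is not None:
            self.timed_out.add(self.expected)
        self.expected = seq

    def receive(self, seq):
        if seq == self.expected:
            self.expected = None
            return "accept"
        if seq in self.timed_out:
            self.timed_out.discard(seq)
            return "discard"     # late but harmless: just drop it
        return "unexpected"      # only this case deserves real concern


# Replay the scenario from the list above: 74 and 75 time out,
# then all three responses arrive together.
tracker = SequenceTracker()
for seq in (74, 75, 76):
    tracker.send(seq)
results = [tracker.receive(seq) for seq in (74, 75, 76)]
# results == ["discard", "discard", "accept"]
```

Keeping the timed-out numbers in a set means a late response costs nothing more than the bytes already paid for, and no comms reset is needed.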
