Friday, January 05, 2007

Real Numbers - Modbus/RTU over cellular

I'm working on a fuller set of numbers, but here is a real-world example of Modbus/RTU poll-response times over cellular.

I am running a Modbus/TCP poller (ModScan32), which polls a Digi PortServer TS4H over Ethernet. The TS4H in turn uses Modbus/RTU-in-TCP/IP via the Digi corporate backbone to poll a Digi Connect WAN on Cingular GSM, which has a Twido PLC on its serial port.

Why use the TS4H here? Why not go directly from a Windows computer to cellular? Well, three reasons:
1) The Digi TCP/IP stack is much more graceful over cellular than Windows' TCP/IP stack - the Windows stack retries too aggressively and wastes bandwidth that a little patience would save. With Windows you'll commonly see 2 to 5 percent retransmissions which - given that cellular is VERY reliable - do nothing but create duplicate TCP acknowledgments you must pay for. This is actually a weakness in the design of TCP/IP, though proponents claim the problem is already SOLVED by TCP/IP as-is: TCP allows hosts to auto-adjust their timing behavior to match real-world performance. Unfortunately, the standard TCP algorithms keep the timing too close to the "average" behavior, which wastes CASH over cellular links with high and variable latency.
2) I am testing a slave timeout algorithm in the Modbus Bridge code for the Digi One IAP and TSx family related to "stale" responses arriving after the slave timeout. This is a common weakness in Modbus/RTU hosts, which assume either a timely response or NO response.
3) The Digi Modbus Bridge keeps nice timing statistics such as min/max/average round-trip delay, and most Windows tools do NOT.
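
To make the timing point in reason 1 a bit more concrete, here is a rough Python sketch of the textbook retransmission-timeout estimator described in RFC 6298 (smoothed RTT plus four times the RTT variation, with a 1 second floor). It is only an illustration: real stacks, including the Windows and Digi stacks discussed here, differ in clamps, timer granularity and back-off details, and the function name and sample values are made up for the example.

```python
def rto_series(samples_ms, alpha=1/8, beta=1/4, k=4, min_rto_ms=1000.0):
    """Return the RTO (msec) after each RTT sample, RFC 6298 style.

    alpha, beta, k and the 1 second floor are the values suggested by
    RFC 6298; actual TCP stacks may use different settings.
    """
    srtt = None      # smoothed round-trip time
    rttvar = None    # smoothed round-trip time variation
    rtos = []
    for r in samples_ms:
        if srtt is None:
            # First measurement seeds both estimators.
            srtt = float(r)
            rttvar = r / 2.0
        else:
            # RTTVAR is updated with the old SRTT, then SRTT is updated.
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
        rtos.append(max(min_rto_ms, srtt + k * rttvar))
    return rtos


if __name__ == "__main__":
    # Hypothetical RTT samples (msec): a steady link with one slow reply.
    samples = [1500, 1500, 1500, 1500, 9000, 1500]
    for rtt, rto in zip(samples, rto_series(samples)):
        print(f"rtt={rtt:5d} msec -> rto={rto:8.1f} msec")
```

On a steady run of similar samples the computed timeout drifts down toward the smoothed RTT, so a single slow reply can exceed it and trigger a retransmission even though nothing was lost.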

So for example, my Master polls once every 30 seconds. This means the GSM modem in the Digi Connect WAN maintains a constant data slot allocation with the cell tower. After 370 polls, 352 have had a round-trip of 2500 msec or less and 18 polls (about 5 percent) have had a round-trip above 2500 msec (i.e., I have seen 18 timed-out requests with a "stale response" arriving AFTER the timeout period - behavior I am investigating).
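
As a rough illustration of the "stale response" problem, here is a minimal Python sketch of a poller that tracks whether a request is still outstanding and flags any response that arrives after the timeout. This is NOT the algorithm in the Digi Modbus Bridge; the class and method names are hypothetical, and it assumes (as in my test) that polls are spaced far enough apart that a late response arrives before the next request goes out.

```python
class PollTracker:
    """Classify Modbus/RTU responses as on-time or stale (hypothetical sketch)."""

    def __init__(self, timeout_ms=2500):
        self.timeout_ms = timeout_ms
        self.sent_at_ms = None     # time of the outstanding request, if any
        self.timeout_count = 0
        self.stale_count = 0

    def request_sent(self, now_ms):
        self.sent_at_ms = now_ms

    def tick(self, now_ms):
        """Expire the outstanding request once the timeout has passed."""
        if self.sent_at_ms is not None and now_ms - self.sent_at_ms > self.timeout_ms:
            self.sent_at_ms = None
            self.timeout_count += 1
            return True
        return False

    def response_received(self, now_ms):
        """Return 'ok' for a timely response, 'stale' for a late one."""
        self.tick(now_ms)
        if self.sent_at_ms is None:
            # Modbus/RTU frames carry no sequence number, so a response with
            # no request outstanding can only be discarded and counted.
            self.stale_count += 1
            return "stale"
        self.sent_at_ms = None
        return "ok"


if __name__ == "__main__":
    tracker = PollTracker(timeout_ms=2500)
    tracker.request_sent(0)
    print(tracker.response_received(467))        # fast reply
    tracker.request_sent(30000)
    print(tracker.response_received(39142))      # reply 9142 msec later
```

The important design point is that the late frame is counted and dropped rather than handed to the next request, which is the mistake a host makes when it assumes "timely response or NO response".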

The Digi PortServer TS4H telnet trace includes this info:
01:38:15 IA INFO: mbrtu:s02 complete rsp min:467 avg:1565 max:9142 msec

This means the fastest poll took only 467 msec, the slowest took 9142 msec (over 9 seconds!) and the moving average round-trip time is about 1.5 seconds. So the minimum round-trip was less than one-third of the average, while the maximum round-trip was nearly six times the average. Since every poll is exactly the same, one would NEVER see such variance on a direct RS-232 or RS-485 serial line. I'd be surprised if a PLC had even a +/- 10% variance in response times. This is one of the problems with using off-the-shelf tools with cellular - the vendors have just NOT designed the tools for such variation in response performance.
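
If you want to pull these numbers out of the trace automatically, something like the following Python sketch would do. The regular expression is inferred from the single trace line shown above, so treat the format as an assumption rather than a documented Digi log layout; the function name is made up for the example.

```python
import re

# Pattern inferred from one sample line; other firmware may format it differently.
RSP_STATS = re.compile(r"min:(\d+)\s+avg:(\d+)\s+max:(\d+)\s+msec")


def parse_rsp_stats(line):
    """Extract min/avg/max round-trip times (msec) and their ratios to the average."""
    match = RSP_STATS.search(line)
    if match is None:
        return None
    lo, avg, hi = (int(g) for g in match.groups())
    return {
        "min": lo,
        "avg": avg,
        "max": hi,
        "min_over_avg": lo / avg,
        "max_over_avg": hi / avg,
    }


if __name__ == "__main__":
    trace = "01:38:15 IA INFO: mbrtu:s02 complete rsp min:467 avg:1565 max:9142 msec"
    stats = parse_rsp_stats(trace)
    print(stats)
    print(f"min is {stats['min_over_avg']:.2f}x avg, max is {stats['max_over_avg']:.2f}x avg")
```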

As a side note, the poll interval being shorter than 40-50 seconds is of interest here because the cell tower (with 2.5G GSM/CDMA) will take away the data slots from the modem if it has been idle too long - the time varies but is often in the 40-50 second range. When this happens, the modem is still connected to the tower but requires some control traffic to be reassigned bandwidth before it can move data. So using a poll interval longer than this idle time would shift the average round-trip time up. Once cellular systems make the move to 3G, this "idle period" will drop to only a few seconds, meaning telemetry systems may perform much WORSE on the new, faster networks.
