Wednesday, June 02, 2010

Increasing cellular robustness for Modbus

A customer of ours is creating a custom Python program in one of Digi's cellular gateways. The gateway uses Modbus/RTU to poll a flow-computer, then based upon a time cycle or abnormal events, the gateway uses HTTPS and XML to push data up to a central server.

They have noticed that at times the gateway has trouble connecting to the server for hours at a time, so wondered what the best solutions are to increase success. Their code buffers the data, so no data is lost.

The customer had asked about using a Windows PC to PING the remotes, but I don't think that will help.

Mobile-originated (meaning the gateway calling out to cellular) will be more robust than mobile-terminated (meaning Windows calling remote). Therefore the best solution will be for the cellular gateway to work a bit harder to connect out.

First, a few technical details:
  • Most IP-based cellular devices maintain a full-time IP connection. However, the cell tower 'parks' the device when it is not talking, which means the device loses any radio channels and can only talk after requesting and WAITING for reallocation of bandwidth. And no, there is no way to compel your cellular provider to preallocate full-time a scarce radio resource to a device that is not paying $4000 per year for full-time data movement.
  • This means there is always a delay when a device starts to send data after being idle. I have witnessed this delay being up to 18 seconds long, but it could be longer. Not all TCP/IP clients expect to see such long delays in the connect/socket-open phase.
So assuming the customer's problem is unexpected long delays causing the gateway to give up too soon, the best suggestion is to prime the pump early. Lets say your gateway sends data once per hour, so when your Python task wakes up to send data, then do NOT try to open a TCP socket right away. Instead, first send a garbage UDP/IP packet to your host, then sleep another 15 seconds. It does not matter if the UDP packet fails. You do not need to enable UDP on your host, nor watch for a good or error response. The only goal is to force the cell tower to start allocating radio bandwidth 15 seconds before your task actually tries to create your TCP/IP socket.

This pre-allocation may solve your problem when the problem is caused by slower allocation of radio bandwidth. Since most cellular providers now round up data traffic to the 1K charger per hour, sending an extra 30 bytes per transaction should not increase costs.

However, the problem may NOT be radio bandwidth, but authorization through a chain of roaming service providers. This is especially true for low-cost GPRS/GSM providers. Your data is passing through 2, 3 or more companies, and although they may all have agreed to allow the IP session (the PPP session) to open, one of them might drop the ball and block your data access in a way which the PPP session doesn't detect. The only way to recover is to completely flush and rebuild the PPP session - which in plan English means renegotiate with the base carrier to be assigned your IP address. It simulates a reboot of the product.

Fortunately, Digi includes a low-level feature called SureLink, which can use a PING packet to confirm end-to-end connectivity. You can (as example) configure the Digi to PING an internet IP once per hour (or 'X' hours), then physically reboot the cellular module if access fails. This nicely forces a complete renegotiation of the PPP session, which costs you nothing.

Just be aware that use of DNS can be costly and error-prone. A single DNS lookup can cost over 1K of data plan. For SureLink, you literally want to use a static IP as the target to minimize the overhead costs and false-negatives where things other than PPP health block success.