Wednesday, December 06, 2006

Cellular IP-Friendly Apps - It Costs to Talk

Summary: All communication must be "under control". All data sent into the cellular system costs money; even if the remote cellular device is powered off, the customer still pays for data sent to it.

As a follow-on to the discussion of Retrying TCP Socket Opens, applications must allow the user to both understand and limit all aspects of protocol usage and retry. Users must be able to limit and predict a reasonable worst-case traffic cost. For example, some protocols include large blocks of initial connection negotiation, which means talking once per minute over a continuously open socket can cost much less than talking once per 10 minutes over a socket opened just for one transaction. I have seen applications that let users set a maximum desired retry count - then fail to honor that setting and retry anyway under certain fault conditions.
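
As a concrete illustration, here is a minimal sketch (in Python, with hypothetical names like MAX_RETRIES and open_with_budget) of what honoring a user's retry cap should look like - the cap is enforced even in fault conditions, not treated as a suggestion:

```python
import socket
import time

# Hypothetical user-configured settings - every retry costs real airtime money.
MAX_RETRIES = 3        # absolute ceiling, never exceeded even on odd faults
RETRY_DELAY_SEC = 30   # fixed delay between attempts

def open_with_budget(host, port):
    """Attempt a TCP connect at most MAX_RETRIES + 1 times, then fail loudly."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return socket.create_connection((host, port), timeout=10)
        except OSError:
            if attempt == MAX_RETRIES:
                raise  # surface the failure; do NOT quietly keep retrying
            time.sleep(RETRY_DELAY_SEC)
```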

Recommendation: application writers must step back, examine every place within their application that creates traffic, and confirm users have the ability to limit the traffic created.

Example and Numbers: now most of you will be saying "Yah, duh - so obvious, why is this even mentioned?". Well, I'll give you an all-too-typical example of how this affects real customers. A customer (call him Joe) running a pilot on cellular data access calls to complain that his costs are higher than expected. He says he's just polling 3 Modbus registers every 5 minutes. Being no dummy, Joe has already calculated that each request should be 12 bytes of data (one Modbus/TCP function 3 read) and each response should be 17 bytes of data (one Modbus/TCP response with 4 registers, since he is reading 4x00003, 4x00004 and 4x00006, so one assumes 4x00005 comes along for the ride). One poll every 5 minutes works out to 8640 polls per 30 days, so he had hoped to see only about a quarter-megabyte of traffic a month. Yet Joe was seeing data bills for 6 to 10MB of traffic a month. This means his $20 per month 5MB plan was costing him closer to $60 per month with data overages.
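
Joe's arithmetic itself was sound, as far as it went. Reproduced as a few lines of Python, counting only the raw Modbus/TCP bytes:

```python
# Joe's back-of-envelope math: raw Modbus/TCP application bytes only.
REQUEST_BYTES = 12     # MBAP header (7) + function 3 read request (5)
RESPONSE_BYTES = 17    # MBAP header (7) + function/byte-count (2) + 4 registers (8)
POLLS_PER_MONTH = (60 // 5) * 24 * 30   # one poll per 5 minutes = 8640

print(POLLS_PER_MONTH * (REQUEST_BYTES + RESPONSE_BYTES))
# -> 250560 bytes: Joe's hoped-for quarter-megabyte per month
```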

First, Joe overlooked the fact that he has to pay not only for his Modbus data, but also for the TCP and IP overhead used to move it. Standard Windows-generated TCP headers are 20 bytes, and so are the IP headers. Linux tends to default to TCP timestamps and thus creates 28-byte TCP headers. So each request is NOT 12 bytes but 52-60 bytes ... plus the TCP Acknowledgement frame adds another 40-48 bytes. Yes, YOU pay for the TCP Acknowledgements as well! With headers and the TCP Acknowledgement, his responses will be 97-113 bytes, not 17. So right off the bat, I can see that he has been under-estimating his monthly traffic. Since he is using Windows, he should be seeing at least 1.6MB of traffic a month - never 0.25MB.
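
Redo that math with the IP headers, TCP headers, and TCP Acknowledgements included (Windows-style 20-byte TCP headers assumed) and the 1.6MB floor appears:

```python
# The same 8640 polls per month, costed with the TCP/IP overhead Joe forgot.
IP_HDR, TCP_HDR, ACK_FRAME = 20, 20, 40   # Windows defaults; Linux TCP can be 28

request = 12 + IP_HDR + TCP_HDR + ACK_FRAME    # 92 bytes including its ACK
response = 17 + IP_HDR + TCP_HDR + ACK_FRAME   # 97 bytes including its ACK

print(8640 * (request + response))
# -> 1632960 bytes, roughly 1.6MB per month - before the surprises below
```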

So I visit Joe and do a network trace of his OPC server traffic. We see that OPC is issuing 3 Modbus polls every 5 minutes - not 1. Hmmm, of course Joe's first reaction is "Heck no - I'm not polling 3 - just 1", but the proof is there as colored pixels on my notebook display. We decode the polls (a decoding sketch follows the list below) and see the OPC server is polling 3 blocks of 32 registers each. After decoding the Modbus/TCP bytes we learn the exact registers being polled, and Joe eventually discovers why these are being polled:

  • One block of 32 registers is fetching his 3 desired values of 4x00003, 4x00004 and 4x00006. Reading the fine print in the OPC manual, we see that the OPC server decided this was a "scattered poll" of 2 separate memory areas, so it bumped the size up to 32 registers. So just for this one poll, his monthly budget is up to 2.8MB instead of 0.25MB.
  • A second block was caused by Joe programming an HMI display to pop up if a certain alarm condition were true in the field. This was a demo he'd done to impress a customer, but Joe hadn't thought to disable it, nor had he realized the exact "cost" of such a feature. So the OPC server needs a single register from somewhere else in the PLC memory to satisfy the HMI's alarm/event function. We don't know why this is polled as 32 registers instead of 1 - it is not a "scattered poll" as defined by the OPC vendor's documentation. Perhaps his HMI or OPC server software has a bug in it. Since this is Modbus/TCP (not serial), it is unlikely anyone else has noticed or cared that the application is moving 62 bytes of extra data in every poll. After all, Ethernet is fast and costs nothing to use. It is possible the programmers at the OPC vendor just decided there was no reason to ever poll fewer than 32 registers when using fast, free "Ethernet".
  • The final block was being caused by Joe's boss leaving open an HMI display in another room that wasn't supposed to be left open - human error (or is it?). Joe learned instantly how important it was to properly configure the HMI's display time-out settings - either closing the window or just stopping the supporting data polls. He had done that for the normal "user displays", but had been lazy and not put such settings into the various diagnostic displays users weren't expected to use!
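
For the curious, decoding a captured Modbus/TCP function 3 request takes only a few lines of Python. The 12-byte frame below is an illustrative example of a 32-register read, not Joe's actual capture:

```python
import struct

# MBAP header (transaction, protocol, length, unit) + PDU (function, start, count)
frame = bytes.fromhex("000100000006010300020020")   # example captured request

tid, pid, length, unit, func, start, count = struct.unpack(">HHHBBHH", frame)
print("function %d: read %d registers starting at 4x%05d"
      % (func, count, start + 1))
# -> function 3: read 32 registers starting at 4x00003
```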

So now his 6 to 10MB of traffic a month begins to make sense. Each distinct poll is creating nearly 3MB of traffic per month, and his traffic is influenced by which HMI displays users open. Multiply 3MB by 3 polls and you get roughly 9MB per month.

What has been learned here?

  • With the overhead of TCP/IP, Joe learned that he has to pay for over 4 times more traffic than his raw Modbus byte calculations had led him to believe.
  • Joe learned that he should look at using UDP/IP instead of TCP/IP for his Modbus/TCP, since this would cut 40-60% off his bill instantly (see the sketch after this list). Modbus doesn't really require the TCP Acknowledgement, and my own tests of UDP/IP over cellular show it to be about 99.99% reliable - or put another way, I only see about 1 packet lost per 10,000 sent.
  • Joe learned how to review his OPC server's data statistics page. His OPC server had been (indirectly) giving him the answer as to why his data usage was so high. While his OPC server never totaled up the data bytes to include TCP/IP overhead, it was able to show him the 36 polls per hour he was moving instead of his expected 12 (one per 5 minutes).
  • Joe learned that perhaps he needs to look for a new OPC supplier, since his present vendor just doesn't seem to see the big picture of IP-enabled protocols: Ethernet is not the only medium carrying TCP/IP. Increasingly, people expect TCP/IP to move through diverse media which are not always "fast and free" like Ethernet. Joe's present OPC supplier didn't give him the ability to reduce the poll block size below 32 registers when the OPC system thought "Ethernet" was being used.
  • Joe learned he had to be more aggressive in his HMI display design. He couldn't assume users would only look at certain displays and never leave displays open unexpectedly. Joe needed to actively set every possible display to automatically close or stop generating new polls. In fact, after review he discovered that most of his displays had no need for "real-time" update, and he could just set them to display the data once as read, without any refresh. Users always had the option to manually redisplay the page.
  • Joe learned that maybe reading data directly from the RTU program was not such a wise idea. His RTU had the ability to copy and repack data into special polling areas to eliminate "scattered polls". In fact, in the above example we traced at Joe's site, all of the data in those 3 polls could easily have fit within a single 13-register block. So Joe is reviewing his RTU program design to repack ALL data of interest - even data supporting rare HMI displays - into a dedicated memory area. While Joe had previously hoped to avoid this work, he now sees the potential dollar savings, or cost penalty, his company could face if he avoided it.
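
To sketch the UDP idea from the list above: the same Modbus framing can be sent as a single datagram, dropping the TCP handshake and the billed Acknowledgements entirely. The device address below is a placeholder, and your RTU or gateway must actually support Modbus over UDP:

```python
import socket

poll = bytes.fromhex("000100000006010300020004")   # read 4 registers at 4x00003

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(5.0)            # with ~1 loss per 10,000, a timeout just re-polls
sock.sendto(poll, ("192.0.2.10", 502))   # placeholder IP; 502 is the Modbus port
reply, _ = sock.recvfrom(256)   # 17-byte response, with no TCP ACKs billed
print(reply.hex())
sock.close()
```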

So, in summary, I have to say: your data polling needs to be UNDER CONTROL, as in actively controlled. You need both the tools and the investment of effort to define, as exactly as possible, each and every data poll.
